Lecture Notes 1/22

Chapter 1

Syntax - rules for what a syntactically correct program looks like (the structure is correct with respect to the language)

Semantics - meaning of the program

In terms of a compiler, the compiler outputs the target code. That target code can then be executed on the target architecture.

Frontend - syntax checking, semantic checking

Backend - create target program (code generation) and optimization

Frontend -> i (intermediate code or parse tree) -> Backend

Lexical Analysis - chop inputs into tokens

Semantic Analysis - type checking, declaration checking

Intermediate Code - intermediate code generation

Chapter 3

Lexical Analysis/Scanner - A stream of characters is converted into a stream of tokens.

Steps for Lexical Analysis:

  1. Find all terminals in the grammar (these become the token types; each lexeme is categorized into its respective token type)
  2. Write scanner

Chomsky Hierarchy:

  1. Regular (regular expressions) - recognized by an NFA/DFA
  2. Context-Free Grammar - recognized by a PDA
  3. Context-Sensitive - recognized by a linear bounded automaton
  4. Unrestricted - recognized by a Turing Machine

Both regular grammars and context-free grammars have:

  1. a single nonterminal on the left of each production
  2. any combination of terminals and nonterminals on the right (regular grammars additionally restrict the right side to at most one nonterminal, at the end).

Examples:

s -> x
x -> axb | d    // not regular, since it generates a^n d b^n

s -> x
x -> ax | b     // regular, since it generates a*b

s -> x
x -> ay | $\epsilon$
y -> bx         // regular: s -> x -> ay -> abx -> abay -> ... generates (ab)*
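As a quick illustration of the difference, here is a minimal Python sketch (my own example, not from the lecture): a*b and (ab)* can be recognized with plain regular expressions, while a^n d b^n needs an explicit counting check that no regular expression can perform.

```python
import re

# Regular languages: a*b and (ab)* can be matched with plain regexes.
A_STAR_B = re.compile(r"^a*b$")
AB_STAR  = re.compile(r"^(ab)*$")

def is_an_d_bn(s: str) -> bool:
    """a^n d b^n is NOT regular: we must count the a's and b's."""
    if "d" not in s:
        return False
    left, _, right = s.partition("d")
    return set(left) <= {"a"} and set(right) <= {"b"} and len(left) == len(right)

print(A_STAR_B.match("aaab") is not None)   # True
print(AB_STAR.match("ababab") is not None)  # True
print(is_an_d_bn("aaadbbb"))                # True
print(is_an_d_bn("aadbbb"))                 # False (counts differ)
```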

If we’re writing the scanner, we must consider the token types to decide whether an NFA/DFA or a PDA is needed; token patterns are typically regular, so an NFA/DFA is enough.

';', ',', '+', '<<', 'for', and 'while' are each their own regular expressions.

Chapter 1 Reading

Syntax - Set of rules that describe how symbols can be combined to form a correctly structured program.

NOTE: In most cases, syntactic rules for a programming language are described using a context-free grammar written in Backus-Naur Form (BNF).

Semantics - Meaning of a syntactically correct program.

NOTE: Denotational and operational semantics are the most used formal semantics for compiler and interpreter writing.

Compiling and Interpreting

Two ways to execute programs:

  1. Compilation - Source program is translated to a target program, which can then be executed directly by an OS (running on hardware).
  2. Interpretation - Source program is interpreted by another program that is run on the hardware.

NOTE: Java is a hybrid language: it can be both compiled and interpreted. javac (the compiler) produces a Java class file (bytecode), which is then interpreted by the Java Virtual Machine (JVM).

Compiler is divided into two parts:

  1. frontend - performs syntactic and semantic checks.
  2. backend - If the program is both syntactically and semantically correct, the frontend produces some representation i of the source program. i is then used as input to the backend, which performs a number of tasks related to the target architecture of the machine on which the program will run. The backend also produces a target program t, which can be executed on the target architecture.

NOTE: For backend, the ‘tasks’ could either be for optimization or analyses of uninitialized variables and code that can never be executed (dead code).

Frontend -> i -> Backend

Stages of a Compiler

  1. Source Program (initial)
  2. Lexical Analysis - Scans the source code one character at a time (reads from a character stream) and produces tokens.
  3. Syntax Analysis - Determines the structure of the program from the tokens; outputs a parse/syntax tree.
  4. Semantic Analysis - Type checking, declaration checking, and other checks that the program follows the rules of the language.
  5. Intermediate Code - Generates code that sits between the source code and the target code.
  6. Optimizer - Produces a new (improved) version of the program in the intermediate language.
  7. Code Generator - Takes the intermediate code and generates code for the target machine.
  8. Target Program (final)
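To make the data flow between the stages above concrete, here is a minimal, purely illustrative Python sketch. All function names are hypothetical placeholders, not from the notes; each phase simply consumes the previous phase's output.

```python
# Hypothetical placeholder pipeline: each phase consumes the previous phase's output.
def scan(source: str) -> list:          # characters -> tokens
    return source.split()               # (real scanners use regular expressions/DFAs)

def parse(tokens: list):                # tokens -> parse tree
    return ("program", tokens)

def check_semantics(tree):              # parse tree -> checked tree (types, declarations)
    return tree

def gen_intermediate(tree) -> list:     # tree -> intermediate code
    return [("push", tok) for tok in tree[1]]

def optimize(ir: list) -> list:         # intermediate code -> improved intermediate code
    return ir

def gen_target(ir: list) -> list:       # intermediate code -> target code
    return [f"{op} {arg}" for op, arg in ir]

target = gen_target(optimize(gen_intermediate(check_semantics(parse(scan("a b c"))))))
print(target)   # ['push a', 'push b', 'push c']
```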

Single vs. Multi-pass Compilers

The number of parse tree traversals (passes) is proportional to the compiler's runtime.

NOTE: Fewer passes through the parse tree = a compiler that is harder to read and maintain.

For ease of development, it is wise for a compiler to be organized into many smaller stages (micro-passes): this makes it easy to make changes and to add or remove specific tasks. It also improves readability and writability, and makes the compiler less buggy.

Frontend

The compiler front end has a number of stages.

NOTE: The frontend is responsible for making an ‘intermediate representation’ of the program if it is BOTH semantically and syntactically correct.

Parts of the frontend:

  1. Scanner/Tokenizer/Lexer - Inputs the source program one character at a time from a string/stream of characters (basically, the compiler reads the file first).

NOTE: A token is a substring of one or more characters. It represents a terminal in the grammar of the language (Ex: keywords, literals [0 and 3.14], separators [; and ,], operators [+, -, *, /], identifiers [names of methods/classes], etc.)

Keywords: int, float, double, etc.

Identifiers: main, foo, bar

Separators/Delimiters: ‘(’, ‘)’, ‘;’

Operators: ‘+’, ‘-’, ‘*’, ‘/’

Literals (fixed values): ‘42’, ‘3.14’, ‘true’, ‘false’, ‘null’, ‘[1, 2, 3]’, ‘Hello world’

After a substring (the lexeme) is matched by the scanner, it is given a token type that labels its meaning.
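Here is a minimal scanner sketch in Python (my own illustration; the token set and regex patterns are simplified assumptions) showing how a stream of characters becomes a stream of (token type, lexeme) pairs:

```python
import re

# Hypothetical token specification: each token type is its own regular expression.
# Order matters: keywords must be tried before the general identifier pattern.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|for|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LITERAL",    r"\d+(?:\.\d+)?"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SEPARATOR",  r"[();,]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source: str):
    """Convert a stream of characters into a stream of (token type, lexeme) pairs."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":        # a real scanner would also report unmatched characters
            yield (m.lastgroup, m.group())

print(list(scan("int foo = 42 + 3.14;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'foo'), ('OPERATOR', '='), ('LITERAL', '42'),
#  ('OPERATOR', '+'), ('LITERAL', '3.14'), ('SEPARATOR', ';')]
```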

  2. Parsing and Syntax Analysis - The parser makes sure that the source program is syntactically correct.

NOTE: A syntactically correct program is defined as a program that is in the language that the compiler accepts (basically, it is a program that can be generated by the context-free grammar that specifies that language).

If the parser can create a parse tree for the program (that is, the program is accepted by the parser), then the program is deemed syntactically correct.

Tree walkers - traverse the parse tree and perform various checks to see if the source program is semantically correct.

The parser takes the tokens as input and outputs a syntax tree.
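To illustrate how a parser turns tokens into a tree, here is a minimal recursive-descent sketch in Python for the toy grammar x -> ax | b used earlier (my own example; the lecture does not prescribe a particular parsing technique):

```python
# Recursive-descent parser for the toy grammar:  s -> x,  x -> a x | b
# Accepts token lists like ['a', 'a', 'b'] and builds a nested parse tree.

def parse_x(tokens, pos=0):
    """Return (subtree, next position) or raise SyntaxError."""
    if pos < len(tokens) and tokens[pos] == "b":
        return ("x", ["b"]), pos + 1
    if pos < len(tokens) and tokens[pos] == "a":
        subtree, nxt = parse_x(tokens, pos + 1)
        return ("x", ["a", subtree]), nxt
    raise SyntaxError(f"unexpected input at position {pos}")

def parse(tokens):
    tree, nxt = parse_x(tokens)
    if nxt != len(tokens):
        raise SyntaxError("trailing input after a valid x")
    return ("s", [tree])

print(parse(list("aab")))
# ('s', [('x', ['a', ('x', ['a', ('x', ['b'])])])])
```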

  3. Semantic Checking - Ensures that the translation of the source program to the intermediate/target code can be performed (basically, if the program is not semantically correct, then code generation cannot be performed) so that the semantics of the language are preserved.

For semantic checks, multiple passes over the parse tree are performed (these can be organized as micro-passes).

Micro-passes - multiple passes that have individual tasks (some can be large tasks, while others can be small) to ensure the source program is semantically correct.

Ex: a large task can be type checking; a small task can be declaration checking.

  4. Name Resolution and Checking - Ensures that variables and other named entities are declared/defined before they are used. This makes sure that the entities are referenced properly (not out of scope).

Forward referencing - names can be used before they have been declared.

Ex: prototype declarations in C++. A prototype informs the compiler that a proper implementation will follow later; prototype declarations are used to avoid multiple passes through the parse tree in the name resolution stage.

First pass: collect the declarations. Second pass: resolve the uses of names.

Name resolution can be performed at compile time for statically scoped languages.
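Here is a minimal two-pass name-resolution sketch in Python (my own illustration; the statement format and messages are hypothetical): the first pass collects declarations into a symbol table, and the second pass checks every use against it.

```python
# Toy "program": a list of (kind, name) statements.
program = [
    ("decl", "x"),
    ("use",  "x"),
    ("use",  "y"),   # never declared before use -> should be reported
    ("decl", "y"),   # declared after use: only fine if forward references are allowed
]

def resolve(statements, allow_forward_refs=False):
    symbol_table = set()

    # Pass 1: collect every declaration (needed to support forward references).
    for kind, name in statements:
        if kind == "decl":
            symbol_table.add(name)

    # Pass 2: check every use against the collected declarations.
    errors = []
    seen = set()
    for kind, name in statements:
        if kind == "decl":
            seen.add(name)
        elif kind == "use":
            declared = symbol_table if allow_forward_refs else seen
            if name not in declared:
                errors.append(f"'{name}' used before declaration")
    return errors

print(resolve(program))                           # ["'y' used before declaration"]
print(resolve(program, allow_forward_refs=True))  # []
```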

  5. Type Checking - Type checking at compile time is only possible for statically typed languages.

NOTE: If a language is statically typed, then type checking is possible at compile time.

Why type checking is important:

  • Ensures that an assignment is valid with respect to type so that no information is lost (ex: assigning the value 3.14 to a float is valid; however, assigning it to an int is not, since the .14 would be lost in the assignment).
  • Resolves method invocations (executing/calling a function) and checks the argument types (ex: passing a variable of type int where a parameter of type str is expected should be rejected). See the sketch after this list.
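Here is a minimal sketch of these two checks in Python (my own illustration; the type names and rules are simplified assumptions): assignments may only widen (int to float), never narrow, and call arguments must be assignable to the declared parameter types.

```python
# Very small type system: assignment may widen int -> float, but never narrow.
WIDENING_OK = {("int", "int"), ("float", "float"), ("int", "float")}  # (value type, target type)

def check_assignment(value_type: str, target_type: str) -> None:
    if (value_type, target_type) not in WIDENING_OK:
        raise TypeError(f"cannot assign {value_type} to {target_type}")

def check_call(param_types: list, arg_types: list) -> None:
    if len(param_types) != len(arg_types):
        raise TypeError("wrong number of arguments")
    for p, a in zip(param_types, arg_types):
        check_assignment(a, p)   # each argument must be assignable to its parameter

check_assignment("int", "float")      # ok: widening
try:
    check_assignment("float", "int")  # error: assigning 3.14 to an int loses the .14
except TypeError as e:
    print(e)

try:
    check_call(["str"], ["int"])      # error: int passed where str is expected
except TypeError as e:
    print(e)
```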

Middle End

The middle end operates on the internal representation.

It consists of an optimizer that is independent of the architecture (the middle end does not know anything about the target architecture, so the optimizations performed are generic).
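A classic machine-independent optimization of this kind is constant folding. Below is a minimal sketch in Python over a tiny expression-tree format (my own assumption, not from the notes):

```python
# Constant folding: evaluate subexpressions whose operands are already constants.
# Expressions are tuples ('+', left, right) / ('*', left, right), numbers, or variable names.

def fold(expr):
    if not isinstance(expr, tuple):
        return expr                       # a constant or a variable name: nothing to do
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left + right if op == "+" else left * right
    return (op, left, right)

# x * (2 + 3)  ->  x * 5, independent of any target architecture
print(fold(("*", "x", ("+", 2, 3))))   # ('*', 'x', 5)
```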

Backend

Consists of two parts:

  1. Code Generation - Generates the final (or near final) code (the target program). Depending on the architecture, a number of tasks must be performed before the code is generated (see the sketch after this list).
  2. Target Code Optimization - Can be performed at the level of the syntax tree and at the level of the intermediate language (if the architecture is known beforehand).
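To illustrate code generation, here is a minimal sketch in Python that emits instructions for a hypothetical stack machine from the same tiny expression-tree format used above (my own example; the instruction names are made up, not a real target architecture):

```python
# Generate code for a hypothetical stack machine from an expression tree.
# Trees are tuples ('+', left, right) / ('*', left, right), numbers, or variable names.

def gen(expr, code=None):
    if code is None:
        code = []
    if isinstance(expr, (int, float)):
        code.append(f"PUSH {expr}")      # literal value
    elif isinstance(expr, str):
        code.append(f"LOAD {expr}")      # variable
    else:
        op, left, right = expr
        gen(left, code)                  # evaluate operands first (postorder walk)
        gen(right, code)
        code.append("ADD" if op == "+" else "MUL")
    return code

# x * (2 + 3)
for instr in gen(("*", "x", ("+", 2, 3))):
    print(instr)
# LOAD x / PUSH 2 / PUSH 3 / ADD / MUL
```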

Evaluation and Interpreters

Compiler - translates the source code to a different program with the same semantics in a target language, where it can be executed on a hardware architecture.

Interpreter - provides a virtual architecture and executes the source code directly.

The compiler will check if the source program is both syntactically and semantically correct before producing the target program.

The interpreter performs checks while the program is being evaluated. If the code is syntactically incorrect, it cannot be evaluated by the interpreter; if it is semantically incorrect, the error is only discovered when the offending code is actually evaluated (code that never runs is never reported).

NOTE: One reason to favor an interpreter over a compiler is portability: an interpreted language can be evaluated on any machine that has an interpreter.
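For contrast with the code generator above, here is a minimal tree-walking interpreter sketch in Python (my own illustration): instead of producing target code, it evaluates the expression tree directly.

```python
# Tree-walking interpreter: evaluates the expression tree directly,
# instead of translating it to target code first.

def evaluate(expr, env):
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]                 # name lookup happens at run time
    op, left, right = expr
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == "+" else l * r

# x * (2 + 3) with x = 4
print(evaluate(("*", "x", ("+", 2, 3)), {"x": 4}))   # 20
```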

Summary

Syntax is a set of rules that form a correctly structured program.

Semantics is the meaning of a syntactically correct program.

A compiler inputs the source program, then produces machine/target code.

An interpreter inputs the source program, then interprets (evaluates) it directly.

A token is a label that assigns a lexeme its meaning (think of verbs, nouns, adjectives, etc. in an English sentence).

A lexeme consists of one or more characters and represents an individual word (think of car as a noun, run as a verb, nice as an adjective, etc.).

The front end consists of three parts: Lexical analysis, Syntax analysis, and Semantic analysis.

The back end consists of the code generator and target code optimization.

Phases to a Compiler:

  1. Lexical Analysis (Scanner) - Converts a stream of characters into tokens.
  2. Syntax Analysis (Parser) - Determines the structure of the program (validates that it conforms to the grammar). Outputs a parse/syntax tree.
  3. Semantic Analysis - Makes sure that the program follows the rules of the programming language so that it can execute correctly.
  4. Intermediate Code - Code that is between the source and object code. It is machine independent, so it is not tied to any one machine.
  5. Optimizer - Produces a new version of the program in the intermediate language.
  6. Code Generator - Takes the intermediate code and generates code for the target machine.