Additional guide here.
FIRST Sets *
Terminal - lowercase letters; the symbols of the language.
Non-terminal - uppercase letters; each one represents a language.
- If $X \rightarrow \epsilon$, then add $\epsilon$ to $FIRST(X)$.
- For any terminal $a$, $FIRST(a) = \{a\}$.
- For a production rule $X \rightarrow Y_1Y_2Y_3$:
    - If $\epsilon \notin FIRST(Y_1)$, then $FIRST(X) = FIRST(Y_1)$.
    - If $\epsilon \in FIRST(Y_1)$, then $FIRST(X) = (FIRST(Y_1) - \{\epsilon\}) \cup FIRST(Y_2Y_3)$.
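As a small worked example (an illustrative grammar, not one from these notes), take $E \rightarrow TE'$, $E' \rightarrow +TE' \mid \epsilon$, $T \rightarrow id$. Then $FIRST(T) = \{id\}$, $FIRST(E') = \{+, \epsilon\}$, and since $\epsilon \notin FIRST(T)$, $FIRST(E) = FIRST(T) = \{id\}$.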
Eliminating Left Recursion
Direct Left Recursion
- Find the bodies of all left-recursive productions of a non-terminal $A$ (i.e. $A \rightarrow A\alpha$) and delete them.
- Add the deleted bodies to a new non-terminal $A'$.
- For each new non-terminal $A'$:
    - Delete the leading recursive $A$ from each of its bodies.
    - Add $A'$ at the end of each body.
    - Add an $\epsilon$-production to $A'$.
- Add $A'$ at the end of every remaining (non-recursive) production of $A$ (worked example below).
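Worked example (the standard transformation, shown on a small illustrative grammar): $E \rightarrow E + T \mid T$ is directly left recursive. Deleting the recursive body $E + T$, dropping its leading $E$, appending $E'$, and adding the $\epsilon$-production gives $E' \rightarrow + T E' \mid \epsilon$; appending $E'$ to the remaining body $T$ gives $E \rightarrow T E'$.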
FOLLOW Sets *
- For start symbol $S$, place dollar sign in $FOLLOW(S)$.
- For any production rule $A \rightarrow \alpha B$, $FOLLOW(B) = FOLLOW(A)$.
- For any production rule $A \rightarrow \alpha B \beta$:
    - If $\epsilon \notin FIRST(\beta)$, then $FOLLOW(B) = FIRST(\beta)$.
    - If $\epsilon \in FIRST(\beta)$, then $FOLLOW(B) = (FIRST(\beta) - \{\epsilon\}) \cup FOLLOW(A)$.
For FOLLOW sets, epsilon CANNOT exist within a set.
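Continuing the illustrative grammar from the FIRST example ($E \rightarrow TE'$, $E' \rightarrow +TE' \mid \epsilon$, $T \rightarrow id$, with start symbol $E$): $FOLLOW(E) = \{\$\}$; from $E \rightarrow TE'$, $FOLLOW(E') = FOLLOW(E) = \{\$\}$; and since $\epsilon \in FIRST(E')$, $FOLLOW(T) = (FIRST(E') - \{\epsilon\}) \cup FOLLOW(E) = \{+, \$\}$.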
PREDICT Sets *
Parsing Tables
Grammars and Derivations
Left-most
Right-most
Ambiguity
A grammar is said to be ambiguous if some string it generates has more than one:
- Parse tree
- or derivation tree
- or syntax tree
- or leftmost derivation
- or rightmost derivation
Parser Conflicts
Regular Expressions
Types of Parsers
Top Down Parsing
The parser starts from the start symbol and tries to transform it into the input. Types of Top Down:
- Recursive Descent Parsing - builds the parse tree from the top; the input is read from left to right.
- Backtracking - if one derivation fails, the syntax analyzer restarts the process using different productions of the same non-terminal. It may process the input string more than once to determine the correct production.
- Non-Backtracking
    - Predictive Parser - a form of recursive-descent parsing that does not use backtracking; it uses lookahead to predict which production to use to replace the input string.
    - LL Parser - accepts only LL grammars (LL grammars are a subset of the CFGs). An LL grammar can be parsed with a recursive-descent or a table-driven parser.
Example of Top Down Parsing. Input string: $a + b * c$. Grammar:
$S \rightarrow E$
$E \rightarrow E + T$
$E \rightarrow E * T$
$E \rightarrow T$
$T \rightarrow id$
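The grammar above is left recursive, so a recursive-descent parser cannot use it directly; the left recursion would first be eliminated as in the earlier section, giving $E \rightarrow TE'$, $E' \rightarrow +TE' \mid *TE' \mid \epsilon$, $T \rightarrow id$. Below is a minimal Java sketch of recursive descent for that transformed grammar; the class and token names are illustrative assumptions, not something defined in these notes.

```java
import java.util.List;

// Minimal recursive-descent sketch for E -> T E', E' -> + T E' | * T E' | epsilon, T -> id.
public class RecursiveDescent {
    private final List<String> tokens;  // e.g. ["id", "+", "id", "*", "id"]
    private int pos = 0;

    RecursiveDescent(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    private void expect(String t) {
        if (!peek().equals(t)) throw new RuntimeException("expected " + t + " but saw " + peek());
        pos++;
    }

    // E -> T E'
    void parseE() { parseT(); parseEPrime(); }

    // E' -> + T E' | * T E' | epsilon
    void parseEPrime() {
        if (peek().equals("+") || peek().equals("*")) {
            pos++;           // consume the operator
            parseT();
            parseEPrime();
        }                    // otherwise take the epsilon production
    }

    // T -> id
    void parseT() { expect("id"); }

    public static void main(String[] args) {
        new RecursiveDescent(List.of("id", "+", "id", "*", "id")).parseE();  // a + b * c as tokens
        System.out.println("accepted");
    }
}
```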
Bottom Up Parsing
Starts with the input symbols and constructs the parse tree up to the start symbol. Types of Bottom Up:
- Shift-reduce - uses two steps:
    - shift step - advances the dot to the next input symbol. The symbol is pushed onto the stack, and the shifted symbol is treated as a single node of the parse tree.
    - reduce step - when the parser finds a complete grammar rule (RHS) on top of the stack, it replaces it with the LHS. The top of the stack contains a handle; reducing pops the handle off the stack and pushes the LHS non-terminal in its place.
- LR Parsing - a non-recursive, shift-reduce, bottom-up parser. Three LR parsers exist:
    - SLR Parsing
    - LALR Parsing
    - Canonical LR Parsing
Example of Bottom Up Parsing. Input string: $a + b * c$. The reductions, from the input string up to the start symbol:
$a + b * c$
$T + b * c$
$E + b * c$
$E + T * c$
$E * c$
$E * T$
$E$
$S$
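One possible shift-reduce trace for the same string and grammar, following the reduction order above (shown for illustration; the stack grows to the right):

| Stack | Remaining input | Action |
|---|---|---|
| (empty) | a + b * c | shift a |
| a | + b * c | reduce by T -> id |
| T | + b * c | reduce by E -> T |
| E | + b * c | shift + |
| E + | b * c | shift b |
| E + b | * c | reduce by T -> id |
| E + T | * c | reduce by E -> E + T |
| E | * c | shift * |
| E * | c | shift c |
| E * c | (empty) | reduce by T -> id |
| E * T | (empty) | reduce by E -> E * T |
| E | (empty) | reduce by S -> E |
| S | (empty) | accept |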
Parsers Weakest to Strongest
- LL(1)
- SLR(1)
- LALR(1)
- LR(1)
How can a CFG be LL(1)?
- Must be unambiguous
- No left recursion (indirect or immediate)
- No FIRST-FIRST Conflict
- No FIRST-FOLLOW Conflict
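For example (an illustrative grammar): $A \rightarrow ab \mid ac$ has a FIRST-FIRST conflict because both alternatives start with $a$; left factoring it into $A \rightarrow aA'$, $A' \rightarrow b \mid c$ removes the conflict.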
LR(0) *
A bottom-up parser that scans the input left to right and constructs a rightmost derivation in reverse.
LR is table-driven, similar to non-recursive LL parsers.
Pros of LR:
- can be constructed to recognize all programming language constructs for which a CFG can be written.
- is the most general non-backtracking shift-reduce parsing method known, and can be implemented efficiently.
- can detect syntactic errors as soon as it is possible to do so on a left-to-right scan of the input.
How it works:
An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse.
NOTE: States are a set of items in this case.
An LR(0) item of some grammar G is a production of G with a dot at some position of the body. For example, the production $A \rightarrow XYZ$ produces four items:
$A \rightarrow \boldsymbol\cdot XYZ$
$A \rightarrow X\boldsymbol\cdot YZ$
$A \rightarrow XY \boldsymbol\cdot Z$
$A \rightarrow XYZ \boldsymbol\cdot$
NOTE: for a body of length n, n + 1 LR(0) items are created, which is why 4 items are produced above. If $A \rightarrow \epsilon$ is a production, then only one item exists: $A \rightarrow \boldsymbol\cdot$.
The dot in LR(0) indicates which part of the rule has already been parsed and what is expected next. So, for example: $A \rightarrow X\boldsymbol\cdot YZ$ says that $X$ has been parsed successfully and it is expected that a string can be derived from $YZ$ in the upcoming parse.
Basically, an LR(0) item separates the RHS of production into two parts: parsed and not-yet-parsed part.
At any given time during parsing, we indicate how much of the RHS of a production has been matched to input and how much is still to be matched.
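A small Java sketch of how the $n + 1$ items for one production can be enumerated; the class name and the string representation of an item are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class Lr0Items {
    // Returns items like "A -> . X Y Z", "A -> X . Y Z", ... for body symbols [X, Y, Z].
    static List<String> items(String head, List<String> body) {
        List<String> result = new ArrayList<>();
        for (int dot = 0; dot <= body.size(); dot++) {
            StringBuilder sb = new StringBuilder(head + " ->");
            for (int i = 0; i < body.size(); i++) {
                if (i == dot) sb.append(" .");   // place the dot before symbol i
                sb.append(" ").append(body.get(i));
            }
            if (dot == body.size()) sb.append(" .");  // completed item: dot at the end
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        items("A", List.of("X", "Y", "Z")).forEach(System.out::println);
        // Prints 4 items (n + 1 for a body of length n = 3).
    }
}
```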
NOTE: Shifting is taking the next token from the input and adding it to the current sequence; reducing is taking several tokens that were shifted and combining them into a larger unit based on the set of rules. For example, if you have shifted the tokens ‘New’ and ‘York’, you might reduce them to ‘city’, because ‘New York’ is a city.
LR(0) DOES NOT HAVE A REDUCTION LOOKAHEAD (that is, the parser cannot decide when to apply a grammar rule based on the input token it sees)!!!
IMPORTANT TO KNOW: LR(0) reduces unconditionally whenever a state contains a completed item; SLR(1) decides its reductions using FOLLOW sets; LR(1) carries a true reduction lookahead in each item.
LR(0) Parsing
For a grammar to be LR(0), the parser must be able to determine what action to take without looking at the input (hence the reason why it does not have a reduction lookahead). Decisions to push the next input symbol onto the stack or to perform a reduction are made solely on what’s currently on the stack, w/o peeking at what comes next.
Determine if a grammar is LR(0)
A language L has an LR(0) grammar iff L is a deterministic CFL w/ a prefix property.
Prefix Property - no proper prefix of any valid word is also a valid word in the same language.
EXAMPLE: ‘apple’ is a valid word; however, ‘app’ is not valid in the same language.
So for example, consider the valid word ‘(())':
- ‘(’ is not valid because it does not have a matching closing parenthesis.
- ‘((’ is not valid because it is missing two matching closing parentheses.
- ‘(()’ is not valid because it is incomplete: it is missing a matching closing parenthesis.
So the valid word ‘(())’ satisfies the prefix property because none of its proper prefixes are valid words.
LR(1)
LR(1) items use one lookahead symbol. Each item carries information about the terminal the parser expects to see after the entire production (the lookahead symbol). A parsing table is created so that the parser can make decisions during parsing, considering the current state and the next input symbol.
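For instance, the LR(1) item $[A \rightarrow XY \boldsymbol\cdot Z,\ a]$ pairs an LR(0) item with the lookahead $a$; once the dot reaches the end, $[A \rightarrow XYZ \boldsymbol\cdot,\ a]$, the parser reduces by $A \rightarrow XYZ$ only if the next input symbol is $a$.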
SLR Parsing
Known as Simple LR, it is constructed from the grammar's LR(0) automaton: states are the sets of items from the canonical LR(0) collection; transitions are given by the GOTO function.
SLR(1) *
Is a shift-reduce parser.
Determine if it’s Context-free or Not
Recursive Descent Parsing
A technique that uses the recursion stack as the equivalent of the stack found in a push-down automaton, which is needed to recognize the language of a CFG (page 129).
Lexical Analyzers
Left Factorization
Predictive Parsers
Basic Block and Control Flow Graphs *
Basic block - a consecutive piece of code that is always executed together. Example of a basic block:

    iload_1
    iload_2
    iload_3
    iload 4
    idiv
    iload_1
    imul
    iadd
    ishr
    istore 5
NOTE: The above is the bytecode of q = a >> b + (c/d * a)
What is a leader? A leader is one of the following instructions:
- the first instruction is a leader
- any labeled instruction is a leader
- any instruction following a jump is a leader.
TIP: First find the leaders, then form the basic blocks from them (a sketch follows below).
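A minimal Java sketch of the leader rules above, assuming a toy instruction representation that records whether an instruction is labeled (a jump target) or is a jump; the type and names are illustrative, not a fixed API.

```java
import java.util.ArrayList;
import java.util.List;

public class BasicBlocks {
    record Instr(String text, boolean isJump, boolean isLabeled) {}

    // Rule 1: the first instruction is a leader. Rule 2: any labeled instruction is a leader.
    // Rule 3: any instruction immediately following a jump is a leader.
    static List<Integer> leaders(List<Instr> code) {
        List<Integer> leaders = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            boolean isLeader = (i == 0)
                    || code.get(i).isLabeled()
                    || code.get(i - 1).isJump();
            if (isLeader) leaders.add(i);
        }
        return leaders;
    }

    // A basic block runs from one leader up to (but not including) the next leader.
    static List<List<Instr>> blocks(List<Instr> code) {
        List<Integer> l = leaders(code);
        List<List<Instr>> blocks = new ArrayList<>();
        for (int k = 0; k < l.size(); k++) {
            int end = (k + 1 < l.size()) ? l.get(k + 1) : code.size();
            blocks.add(code.subList(l.get(k), end));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Instr> code = List.of(
                new Instr("i = 0", false, false),
                new Instr("L1: t = a < n", false, true),    // labeled: jump target
                new Instr("ifFalse t goto L2", true, false),
                new Instr("i = i + 1", false, false),
                new Instr("goto L1", true, false),
                new Instr("L2: return", false, true));
        System.out.println("leaders at indices: " + leaders(code));    // [0, 1, 3, 5]
        System.out.println("number of blocks: " + blocks(code).size()); // 4
    }
}
```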
Type Checking *
Involves having the compiler assign a type expression to each component of the source program. It then must determine if type expressions conform to the collection of logical rules (type system).
It is used to catch errors in programs.
Ensures all operators (+, -, *, etc.) are compatible with their operands.
There are two types of type checking:
- Static type checking - performs type checking at compile time (when the program is translated into a language that the computer’s hardware can understand and execute, known as machine code). Variables and parameters are bound to types at compile time. Uses are checked at compile time to ensure no type errors occur. Variables are declared before they are used and, when declared, are bound to a type.
- Dynamic type checking - performs type checking at runtime (while the program is running, after it has been compiled). No binding of variables or parameters to types is done. A variable’s type is determined by the values assigned to it, which happens at runtime. This also means that type checking of expressions happens at runtime, when the variables’ types are known, which leads to more type errors at runtime than in statically typed languages.
Strongly Typed - ALL type errors are detected at compile time; no type-related errors can occur at runtime. Programs will run w/o type errors.
Weakly Typed - type errors cannot be type checked at compile time. Variables can be assigned a value of any type without declaring it beforehand.
Coercion
Automatic conversion of a data type (implicit). Compilers allow coercion for types when there is no loss of precision.
EXAMPLE: adding an integer to a double is only legal if the language supports coercion. The integer must be converted to a double before the addition can happen.
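A minimal Java illustration (the class name is arbitrary): the int operand is widened to double automatically because no precision is lost.

```java
public class CoercionDemo {
    public static void main(String[] args) {
        int a = 2;
        double d = 1.5 + a;    // a is implicitly coerced (widened) to 2.0
        System.out.println(d); // prints 3.5
    }
}
```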
Casting
Manual conversion of a data type (explicit).
EXAMPLE: if you want to convert a double to an integer, you must do it manually. For example, int i = (int)doubleType will convert a double type into an int type.
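A minimal Java illustration of an explicit cast (the class name is arbitrary): the narrowing conversion must be written out, and the fractional part is discarded.

```java
public class CastingDemo {
    public static void main(String[] args) {
        double value = 3.14;
        int truncated = (int) value;      // explicit cast: double -> int, fraction discarded
        // int bad = value;               // without the cast this is a compile-time error
        System.out.println(truncated);    // prints 3
    }
}
```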
NOTE: Any OOP language with inheritance cannot be strongly typed.
Purpose of type checking:
- Type checking at compile time is possible only for statically typed languages.
NOTE: statically typed languages - you must declare the kind of data that each variable will hold before you use it; types are given at declaration time.
Ex: assigning value 3.14 to a variable of integral type is not legal since the decimal part ‘.14’ will be lost in the assignment.
- Apart from checking numeric types, a type checker can also be used to check assignments involving object references.
NOTE: a superclass type can be assigned a reference to an object of a subclass type. So, if A is a superclass of B, then a reference to an object of class B may be stored in a variable of type A.
- It is also the type checker’s task to resolve method invocations (if the language is statically typed)
Type Compatability of Classes
Use the notation $A <:_T B$, where B is a subtype of the supertype A. Any expression of type B can be used anywhere that an expression of type A is expected.
NOTE: a class is not a subclass of itself.
If $A <:_T B$ and $B <:_T C$, then $A <:_T C$ (transitive), but the relation is NOT symmetric ($A <:_T B$ does not imply $B <:_T A$).
Ex:
A v = new B(...); // is legal iff A = B or $A <:_T B$
$T_v := T_e \Longleftrightarrow (A = B) \lor (A <:_T B)$
Basically says “the variable must be of an equal or higher type than the expression to establish assignment compatibility.”
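A minimal Java illustration of the rule (class names are arbitrary): a reference of the subtype may be assigned to a variable of the supertype, but not the other way around.

```java
class A {}
class B extends A {}             // B is a subtype of the supertype A

public class SubtypeDemo {
    public static void main(String[] args) {
        A v = new B();           // legal: the expression's type (B) is a subtype of the variable's type (A)
        // B w = new A();        // does not compile: the relation is not symmetric
        System.out.println(v.getClass().getSimpleName()); // prints B
    }
}
```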