Lecture Notes

RECALL: A Parser’s job is to make sure if the program is syntactically correct.

Types of Parsers:

  1. LL(k) - First L says to scan from Left to Right; Second L says to use left-most derivation (always develop the left-most non-terminal in a sentential form).

NOTE: A good thing about a LL(k) parser is that it is linear–there is no backtracking.

  1. LR(k) - First L says to scan from Left to Right; Second R says to use reverse right-most derivation (start w/ input string, find appropriate grammar rule, then start at the right).

NOTE: The k represents the number of tokens.

Example: $a^b^$

$L(a^b^)$ = {$\epsilon, a, b, aa, ab, bb, aaa, aab, abb, bbb, …$} S -> A A -> aA | B B -> bB | $\epsilon$

A string w $\in$ $L(a^b^)$ <=> $S->^* w$

Let’s try $w = aabb$

$S->{LM} A->{LM} aA ->{LM} aaA ->{LM}aaB ->{LM}aabB ->{LM}aabbB ->_{LM}aabb$

A Question to Consider: What if B could generate strings starting w/ a? S -> A A -> aA | B B -> bB | a | $\epsilon$

NOTE: The above cannot perform with just 1 lookahead.

w = aabb (w/ just 1 lookahead)

In general: A -> $\alpha_1 | \alpha_2$

A is not LL(1) if: $\exist w_1 \in L(\alpha_1)$ ^ $\exist w_2 \in L(\alpha_2)$ ^ . . . $w_1 = aw_1'$ ^ $w_2 = aw_2'$ ^

Define: a set representing tokens that a sentential form can start with: Call it FIRST(.)

$\epsilon: FIRST(\epsilon) = {\epsilon}$ $a \in \sum: FIRST(a) = {a}$ a mixed string $Y_1, Y_2, … Y_n$ $FIRST(Y_1, Y_2, …) \subset FIRST(Y_1) / {\epsilon}$

if $\epsilon\epsilon FIRST(Y_1):$ $Y_1, Y_2, … Y_n ->^* Y_2 … Y_n$ $Y_1 ->^* \epsilon $

if $\epsilon\epsilon FIRST(Y_1):$ $FIRST(Y_1 … Y_n) \subset FIRST(Y_2) / {\epsilon}$

. . . if all $Y_i$ have $\epsilon$ in their FIRST sets, then add $\epsilon$ to $FIRST(Y_1 … Y_n)$

NOTE: LOOK AT PICTURE FOR EXAMPLE 4.3. ALSO LOOK AT CHAPTER 4

In more general.. $A -> \alpha_1 | \alpha_2 | … | \alpha_n$

if $\exist i,j: i != j : FIRST( \alpha_i ) INTERSECT FIRST(\alpha_j) != \empty -> NOT LL(1)$

In most general… $A_1 -> \alpha_{11} | … | \alpha_{1n_1}$ . . . $A_m -> \alpha_{m1} | … | \alpha_{mn_m}$

Pick a non-terminal: $\forall i : 1 \leq i \leq m: $ Pick 2 RHS of $A_i$ $\forall j,k: 1 \leq j < k \leq n_i$

Reading

Chapter 4: Syntax Analysis and Parsers

In a compiler’s POV, the process of lexical analysis (parsing) will determine if a given source program (or string) is syntactically correct.

A syntactically correct program is a member of a language defined by a grammar–there exists a derivation starting at the start symbol of the grammar (for the language) and ending in the program (string).

There are two types of parsing:

  1. LL(k) - top-down parsing where input is read left to right and the derivation is a leftmost derivation, which is guided by a lookahead of k tokens.
  2. LR(k) - bottom-up parsing where input is read left to right and the derivation is a rightost derivation in reverse, which is guided by a lookahead of k tokens.

Example of LL Parsing: $L(a^b^) = {\epsilon, a, b, aa, ab, bb, aaa, … }$ This regular expression can be written as a CFG: $S$ -> $A$ $A$ -> $aA | B$ $B$ -> $bB | \epsilon$

To define the parsing problem in terms of derivations, let $w$ be a string that is syntactically correct w/ respect to the language of a grammar iff there exists a derivation $S$ ->$^*$ $w$ where $S$ is the start symbol of the grammar.

How to derive the string w: In order to see if the string $w$ is syntactically correct, you must derive it. Approach to derive:

  1. Derive all strings of the same length as $w$ then check if $w$ is in that set. (Not recommended since this would lead to exponential time complexity since there are many choices to choose from on the right-hand side)
  2. Guide the derivation in a way that we do NOT take steps that would NOT lead to the correct string (only taking the correct path). This would have a better time complexity than the first. Ex: string $w = bbb$ We would need to derive it from $S$, then we can start by $S$ -> $A$ Since $w$ starts w/ $b$, (where $b$ is the lookahead), we negate the rule $S$ -> $aA$ since it only generates a string starting with $a$.

TLDR; Basically, we are just guiding the leftmost derivation by using the lookahead. By matching the lookahead with the terminals/token strings generated in the right-hand side, we pick the correct right-hand side to get the correct path.

FIRST Sets pg 112

FIRST Sets would tell us which terminals for all possible strings start with. For example, the set: $L(G) = {\epsilon, a, bb} \cup {c^nbd^nb|n \ge 1 }$

would give the FIRST set: FIRST(A) = ${\epsilon, a, b, c }$ This says that we get $a$ from the string $a$, $b$ from the string $bb$, and $c$ from all the strings $c^nbd^nb$

Grammar Transformations

Summary