Lexical Analysis (Chapter 3)

  • Syntax - rules for a syntactically correct program
  • Semantics - meaning of the program.

Lexical Analysis is the process that reads a stream of characters and groups them into tokens based on a set of rules (these rules will create keywords, literals, separators, etc.)

The process for lexical analysis:

  1. Find all terminals in the grammar
  2. Write the scanner

ex:

for( int i = 0; i < 100; ++i )
kind lexeme line column
Token.FOR for 37 54
Token.LPAREN ( 37 58

The lexical analysis uses a scanner that reads a stream of characters, which will in turn output a stream of tokens.

Regular Expressions

In terms of regular expressions, there exists a chomsky language hierarchy. It goes: Regular Expressions > Context-Free > Context Sensitive > Unrestricted

Both regular expressions and context-free grammars have:

  1. non terminals on the left
  2. any combination of terminals/non-terminals on the right

NOTE: IMPORTANT TO REMEMBER THIS.

  • Context-free Grammar - the way symbols are combined to form larger structures does NOT depend on the symbols around them.

Regular expressions are recognized by an NFA or a DFA–there is also a one-to-one correspondence between regular languages and regular expressions.

Strings that make up the tokens are regular expressions. Therefore, we can specify a set of regular expressions, one for each token type, and combine them as a NFA. After converting it into an NFA, we can transform it into a DFA to minimize it. In order to do this, it is advised to known Thompson’s Construction Algorithm on page 75.

Parsing

Parsing determines whether the start symbol can derive the program or not.

MUST REMEMBER: a finite automaton recognizes regular languages.

A LL(1) parser is a Top Down Parser.

  • Top Down Parser - Starting from the highest lvl of structure and breaking it down into smaller components until it reaches the individual components.

TOP DOWN PARSING EXPANDS THE START SYMBOL TO THE WHOLE PROGRAM.

A LL(1) parser mimics a leftmost derivation of the input string by reading it from left to right with one single lookahead.

It is parsed with just one lookahead

To prevent the derivation from leading to the incorrect string, we rely on a lookahead. Ex: consider a string $w = bbb$

Since $w$ starts with $b$, we don’t have to consider the rule $S->aA$

So to derive a string, $B$ must be able to generate a string starting with $b$. Look at page 111

  • Bottom-Up Parser - Reduce the program to the start symbol. Probably on page 181, 294

NOTE: If parsing is successful, then the program is a valid program.

First Sets

The use of a parser is to see if the stream of tokens is syntactically correct or not

For what the parser produces, it will output a parse tree. There are two types of parsers:

  1. LL(k) - input is read from left to right, where derivation is a leftmost derivation
  2. LR(k) - input is read from left to right, where derivation is a rightmost derivation in reverse

For both L’s, they represent the left-to-right scanning. For LL(k), the second L represents left-most derivation. For LR(k), the R represents reverse right-most derivation.

k is usually 1 in most cases.

Follow Sets

On page 135

They are a set of terminals–which can never be epsilon–that indicate what can follow a nonterminal in a sentential form.

Basically, FOLLOW Sets tell you what you are allowed to do after applying a particular rule. So, if you are going to follow a rule for a certain symbol (ex: X), then the FOLLOW Set for that rule (X in this case) would tell you what symbols would come after that symbol.

EX: If you see a symbol X, you can put a symbol Y right after it. Y would be in the FOLLOW Set for X.

Recursive Descent Parsers

On page 129

It is a technique that uses recursion stack of whatever langauge it is implemented in as the equivalent to the stack found in push-down automaton needed for recognizing the language of a context-free grammar.

It basically breaks down the grammar into smaller parts (production rules), where the rules describe how the different parts of the language fit together.

Name Checking

On page 358? Somewhere on Chapter 6

It ensures that all identifies (names) used in a program refer to valid entities (ex: variables, functions, types, etc.) where they adhere to the language’s rules.

Somewhere on Chapter 6

Predictive Parsing & Recursive Descent Parsers

On page 146

Predictive Parsing - parsing technique that analyzes and understands the structure of a language, which is defined by a context-free grammar. It is a top-down parsing approach where the parser predicts the production rule to apply based on the current input symbol and a fixed number of lookahead symbols.

LL(k) Grammars

On page 160

LL Stands for Left-to-right, Leftmost Derivation. It is a parsing method where the parser reads from left to right and constructs the leftmost derivation of the input string.

IT EXPANDS THE NON-TERMINALS FROM THE LEFT SIDE OF THE PRODUCTION RULES.

(K) indicates the number of lookahead symbols the parser considers when deciding which production rule to apply.

LR Parsing

On page 181

It produces a rightmost derivation and build the parse tree from the bottom up by comining smaller trees to form bigger trees.

NFA to DFA

On page 46

Use subset construction to do this.

LR(0)

On page 194

Must be able to determine what action to take in any situation w/o looking at the input (no lookahead). Must be able to determine if we should shift the next input symbol onto the stack or if there is a handle already on top of the stack, which would call for a reduction by the production of which the handle is a right-hand side.

Bottom-up parsing that analyzes and understand the structure of a language defined by context-free grammar.

LR stands for Left-to-right, Rightmost Derivation. It is a parsing method where the parser reads from left to right and constructs the rightmost derivation of the input string.

(0) indicates the parsing doesn’t use any lookahead symbols. It makes parsing decisions based on the current input symbol w/o looking ahead.

SLR(1) Automaton

On page 199

It is a parser that parses context-free grammars. It is the simplified version of LR(1) Parsing

In SLR(1) Parsining, it uses 1 lookahead to determine what it should do (decides to either shift, reduce, accept, or error).

Action-Goto Table

On page 196?

NOTE: DANGLING ELSE PROBLEM ON PAGE 155, 224

Summary