Compilers By Stanford
These are my study notes for CS143 Compilers, offered by Stanford Online; the course resources can also be found on the course homepage. The notes below contain my summaries, lecture materials, and some assessment answers.
Intro
The Structure of a Compiler
 Lexical Analysis – dividing the program text into tokens
 Parsing – syntactic analysis; builds the parse tree
 Semantic Analysis – types, scope
 Optimization
 Code Generation – translation to target code
Lexical Analysis
Tokens
correspond to sets of strings.
 Identifier: strings of letters or digits, starting with a letter
 Integer: a nonempty string of digits
 Keyword: "else" or "if" or "begin" or …
 Whitespace: a nonempty sequence of blanks, newlines, and tabs
An implementation must do two things:
 Recognize substrings corresponding to tokens
 Return the value or lexeme of the token – The lexeme is the substring
The goal of lexical analysis is to:
 Partition the input string into lexemes
 Identify the token of each lexeme
Left-to-right scan => lookahead is sometimes required
Regular Languages
Def. The regular expressions over $\Sigma$ are the smallest set of expressions including:
 ε, denoting {""}
 'c', denoting {"c"}, where c $\in$ $\Sigma$
 A + B = A $\cup$ B, where A, B are rexps over $\Sigma$
 AB = {ab $\mid$ a $\in$ A, b $\in$ B}, where A, B are rexps over $\Sigma$
 $A^* = \bigcup_{i \geq 0} A^i$ (iteration), where A is a rexp over $\Sigma$
L: Exp → Sets of Strings
The meaning function L maps syntax to semantics; the mapping is many (syntax) to one (semantics).
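Since each regular expression denotes a set of strings, the meaning function can be sketched directly as set operations. This is a minimal illustration (the helper names `eps`, `char`, `union`, `concat`, `star` are my own, not from the course); `star` is truncated at a length bound because $A^*$ is an infinite set:

```python
# Regular-expression semantics as sets of strings (illustrative sketch).
def eps():            return {""}          # ε denotes {""}
def char(c):          return {c}           # 'c' denotes {"c"}
def union(a, b):      return a | b         # A + B = A ∪ B
def concat(a, b):     return {x + y for x in a for y in b}  # AB

def star(a, max_len=3):
    # A* = {""} ∪ A ∪ AA ∪ ..., truncated at max_len since it is infinite
    result, frontier = {""}, {""}
    for _ in range(max_len):
        frontier = {x + y for x in frontier for y in a if len(x + y) <= max_len}
        result |= frontier
    return result

# L(('a' + 'b')*), enumerated up to length 2
lang = star(union(char("a"), char("b")), max_len=2)
print(sorted(lang))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```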
Lexical Specification
From regular expressions to a lexical specification – resolving ambiguities:
Rule: Pick longest possible string in L(R) – The “maximal munch”
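A small sketch of maximal-munch tokenization (the token set below is illustrative, not the exact CS143 one): at each position, every rule is tried and the longest match wins, which is why `x1` lexes as one identifier and `==` as one operator rather than two.

```python
import re

# Illustrative lexical specification: (token class, regular expression)
TOKEN_SPEC = [
    ("KEYWORD",    r"if|else|begin"),
    ("IDENT",      r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER",    r"[0-9]+"),
    ("OP",         r"==|<=|=|<"),   # "==" listed before "=" so the longer alternative can match
    ("WHITESPACE", r"[ \t\n]+"),
]

def lex(s):
    tokens, pos = [], 0
    while pos < len(s):
        # Try every rule and keep the longest match ("maximal munch");
        # ties go to the rule listed first (KEYWORD beats IDENT for "if").
        name, m = max(((name, re.match(rx, s[pos:])) for name, rx in TOKEN_SPEC),
                      key=lambda p: p[1].end() if p[1] else -1)
        if m is None or m.end() == 0:
            raise ValueError(f"lexical error at position {pos}")
        if name != "WHITESPACE":          # drop whitespace lexemes
            tokens.append((name, m.group()))
        pos += m.end()
    return tokens

print(lex("if x1 == 42"))
# [('KEYWORD', 'if'), ('IDENT', 'x1'), ('OP', '=='), ('INTEGER', '42')]
```

Note that `ifx` lexes as a single identifier, not the keyword `if` followed by `x`, precisely because of the maximal-munch rule.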
Finite Automata
Regular expressions = specification
Finite automata = implementation
Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
Nondeterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves
Execution of Finite Automata:
 A DFA can take only one path through the state graph – Completely determined by input

NFAs can choose
 Whether to make ε-moves
 Which of multiple transitions for a single input to take
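Because the NFA may "choose", a simulation tracks the set of all states it could currently be in, folding in ε-moves via a closure. A minimal sketch, using an assumed example NFA for the language a*b (states 0, 1, 2; `0 --a--> 0`, `0 --ε--> 1`, `1 --b--> 2`, accepting state 2):

```python
# NFA simulation on state sets, with epsilon-moves.
EPS = None                                   # marker for epsilon transitions
TRANS = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {2}}
START, ACCEPT = 0, {2}

def eps_closure(states):
    # All states reachable from `states` using only epsilon-moves.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in TRANS.get((s, EPS), ()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return seen

def nfa_accepts(s):
    current = eps_closure({START})           # all states reachable before reading input
    for ch in s:
        moved = set().union(*(TRANS.get((q, ch), set()) for q in current))
        current = eps_closure(moved)
    return bool(current & ACCEPT)            # accept if any possible state accepts

print([w for w in ["b", "ab", "aab", "a", "ba"] if nfa_accepts(w)])
# ['b', 'ab', 'aab']
```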
NFA vs. DFA
 NFAs and DFAs recognize the same set of languages (regular languages)
 DFAs are faster to execute – There are no choices to consider (but less compact)
 For a given language NFA can be simpler than DFA (slower but more concise)
 DFA can be exponentially larger than NFA
High-level sketch: Lexical Specification => RegExp => NFA => DFA => Table-driven Implementation of DFA
NFA to DFA
The Trick
 Simulate the NFA
 Each state of DFA = a nonempty subset of states of the NFA
 Start state = the set of NFA states reachable through εmoves from NFA start state
 Add a transition $S \to^a S'$ to the DFA iff $S'$ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
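The steps above can be sketched directly. This is a minimal subset construction over an assumed example NFA for a*b (`0 --a--> 0`, `0 --ε--> 1`, `1 --b--> 2`, accepting state 2); each DFA state is an ε-closed set of NFA states:

```python
# Subset construction: NFA -> DFA.
EPS = None
TRANS = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {2}}   # example NFA (assumed)
ALPHABET, START, NFA_ACCEPT = {"a", "b"}, 0, {2}

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in TRANS.get((s, EPS), ()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return frozenset(seen)

def subset_construction():
    start = eps_closure({START})             # DFA start = eps-closure of NFA start
    dfa, worklist = {}, [start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in ALPHABET:
            # S --a--> S' where S' = eps-closure(move(S, a))
            moved = set().union(*(TRANS.get((q, a), set()) for q in S))
            S2 = eps_closure(moved)
            dfa[S][a] = S2
            if S2 not in dfa:
                worklist.append(S2)
    accepting = {S for S in dfa if S & NFA_ACCEPT}
    return start, dfa, accepting

start, dfa, accepting = subset_construction()
print(len(dfa), "DFA states")   # each DFA state is a subset of NFA states
```

For this NFA the construction yields three DFA states: {0,1} (start), {2} (accepting), and the dead state ∅.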
Remark
 An NFA may be in many states at any time
 How many different states can it be in? If the NFA has N states, it must be in some subset of those N states
 How many subsets are there? $2^N - 1$ nonempty subsets – a finite set of possible configurations
Parsing
| Phase | Input | Output |
| --- | --- | --- |
| Lexer | String of characters | String of tokens |
| Parser | String of tokens | Parse tree |
Context-free grammars (CFGs)
A CFG consists of:
 A set of terminals T
 A set of nonterminals N
 A start symbol S (a nonterminal)
 A set of productions: $X \to Y_1 Y_2 \dots Y_n$ where $X \in N$ and $Y_i \in T \cup N \cup \{\epsilon\}$
The language of a CFG:
Let G be a context-free grammar with start symbol S. Then the language of G is: $\{a_1 \dots a_n \mid S \to^* a_1 \dots a_n \text{ and every } a_i \text{ is a terminal}\}$
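The definition can be made concrete by enumerating the language of a small assumed grammar, E → ( E ) | x, expanding leftmost nonterminals until only terminals remain:

```python
from collections import deque

# Assumed toy grammar: E -> ( E ) | x
GRAMMAR = {"E": [["(", "E", ")"], ["x"]]}

def language(start, max_len=5):
    # Breadth-first expansion of sentential forms, bounded by max_len
    # (the language is infinite: x, (x), ((x)), ...).
    done, queue = set(), deque([[start]])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue
        nts = [i for i, s in enumerate(form) if s in GRAMMAR]
        if not nts:                          # all symbols terminal: a string in L(G)
            done.add("".join(form)); continue
        i = nts[0]                           # expand the leftmost nonterminal
        for rhs in GRAMMAR[form[i]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return sorted(done)

print(language("E"))  # ['((x))', '(x)', 'x']
```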
Terminals
are so-called because there are no rules for replacing them;
 Once generated, terminals are permanent;
 Terminals ought to be tokens of the language.
Derivations
Notes on derivations:

A parse tree has
 Terminals at the leaves
 Nonterminals at the interior nodes
 An in-order traversal of the leaves is the original input
 The parse tree shows the association of operations, the input string does not
Ambiguity
A grammar is ambiguous
if it has more than one parse tree for some string; equivalently, if there is more than one rightmost or leftmost derivation for some string.
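A quick way to see ambiguity is to count parse trees. For the assumed grammar E → E + E | int, every `+` can be the root split, so "int + int + int" already has two distinct parse trees:

```python
# Count parse trees for E -> E + E | int over a token string.
def count_trees(tokens):
    if tokens == ["int"]:
        return 1                             # E -> int: exactly one tree
    # E -> E + E: each "+" can be the top-level operator
    return sum(count_trees(tokens[:i]) * count_trees(tokens[i + 1:])
               for i, t in enumerate(tokens) if t == "+")

print(count_trees("int + int + int".split()))  # 2
```

The two trees correspond to left and right association of `+`; the grammar says nothing about which to pick, which is exactly the ambiguity.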
Error Handling
The purpose of the compiler is:
 To detect nonvalid programs
 To translate the valid ones
Many kinds of possible errors (e.g. in C)
| Error kind | Example | Detected by |
| --- | --- | --- |
| Lexical | … $ … | Lexer |
| Syntax | … x *% … | Parser |
| Semantic | … int x; y = x(3); … | Type checker |
| Correctness | your favorite program | Tester/User |
Syntax Error Handling
Error handler should:
 Report errors accurately and clearly
 Recover from an error quickly
 Not slow down compilation of valid code
Approaches, from simple to complex:
 Panic mode
 Error productions
 Automatic local or global correction
Top-Down Parsing & Recursive Descent Parsing
Start with the top-level nonterminal E – Try the rules for E in order
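A minimal sketch of recursive descent with backtracking, for an assumed grammar E → T + E | T, T → int | ( E ). Each function tries its productions in order and yields every input position it can reach; trying the next alternative when one fails is the backtracking:

```python
# Recursive-descent parsing with backtracking (illustrative sketch).
def parse_E(toks, i):
    # E -> T + E  (tried first)
    for j in parse_T(toks, i):
        if j < len(toks) and toks[j] == "+":
            yield from parse_E(toks, j + 1)
    # E -> T      (backtracking: try the next alternative)
    yield from parse_T(toks, i)

def parse_T(toks, i):
    if i < len(toks) and toks[i] == "int":        # T -> int
        yield i + 1
    if i < len(toks) and toks[i] == "(":          # T -> ( E )
        for j in parse_E(toks, i + 1):
            if j < len(toks) and toks[j] == ")":
                yield j + 1

def accepts(s):
    toks = s.split()
    return any(j == len(toks) for j in parse_E(toks, 0))

print(accepts("int + ( int + int )"))  # True
print(accepts("int +"))                # False
```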
Left Recursion
In general $S \to S\alpha_1 \mid \dots \mid S\alpha_n \mid \beta_1 \mid \dots \mid \beta_m$; all strings derived from S start with one of $\beta_1, \dots, \beta_m$ and continue with several instances of $\alpha_1, \dots, \alpha_n$; Rewrite as:
$S \to \beta_1 S' \mid \dots \mid \beta_m S'$
$S' \to \alpha_1 S' \mid \dots \mid \alpha_n S' \mid \epsilon$
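As a worked example, on an assumed left-recursive expression grammar (here $\alpha_1 = {+}\,T$ and $\beta_1 = T$):

```latex
% Original (left-recursive):
E \to E + T \mid T
% Rewritten (right-recursive), per the rule above:
E  \to T\,E'
E' \to +\,T\,E' \mid \epsilon
```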
Predictive Parsing
Like recursive-descent, but the parser can "predict" which production to use
 By looking at the next few tokens
 No backtracking
Predictive parsers
accept LL(k)
grammars
 L means "left-to-right" scan of input
 L means “leftmost derivation”
 k means “predict based on k tokens of lookahead”
 In practice, LL(1) is used
LL(1) vs. Recursive Descent

In recursive-descent,
 At each step, many choices of production to use
 Backtracking used to undo bad choices

In LL(1),
 At each step, only one choice of production

That is, when a nonterminal A is leftmost in a derivation and the next input symbol is t, there is a unique production A → α to use – or no production to use (an error state)

LL(1) is a recursive descent variant without backtracking
Left-Factoring
: Factor out common prefixes of productions
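For instance, on an assumed grammar whose two productions share the prefix $T$:

```latex
% Before: the parser cannot predict which production to use on seeing T
E \to T + E \mid T
% After left-factoring the common prefix T:
E \to T\,X
X \to +\,E \mid \epsilon
```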
Notes on LL(1) Parsing Tables
If any entry is multiply defined then G is not LL(1)
 If G is ambiguous
 If G is left recursive
 If G is not left-factored
 And in other cases as well
Most programming language CFGs are not LL(1)
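A table-driven LL(1) parser can be sketched in a few lines. The grammar below is assumed for illustration (E → int E', E' → + int E' | ε); the table maps each (nonterminal, lookahead) pair to its unique production, and an empty entry means a syntax error:

```python
# Table-driven LL(1) parsing (illustrative sketch).
TABLE = {
    ("E",  "int"): ["int", "E'"],
    ("E'", "+"):   ["+", "int", "E'"],
    ("E'", "$"):   [],                    # E' -> eps on end-of-input
}
NONTERMINALS = {"E", "E'"}

def ll1_parse(tokens):
    stack, toks, i = ["$", "E"], tokens + ["$"], 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            rhs = TABLE.get((top, toks[i]))
            if rhs is None:
                return False              # no table entry: syntax error
            stack.extend(reversed(rhs))   # push the production right-to-left
        elif top == toks[i]:
            i += 1                        # terminal matches: consume input
        else:
            return False                  # terminal mismatch
    return i == len(toks)

print(ll1_parse(["int", "+", "int"]))   # True
print(ll1_parse(["int", "+"]))          # False
```

With one token of lookahead there is never a choice to make, which is exactly the "no backtracking" property claimed above.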
Bottom-up parsing
is more general than top-down parsing
 And just as efficient
 Builds on ideas in top-down parsing
Bottom-up is the preferred method.
Shift-Reduce Parsing
Bottom-up parsers
construct rightmost derivations (in reverse).
SLR
: Simple LR parser – more powerful than LR(0).
LR: the input is scanned from the Left; the parser produces a Rightmost derivation (in reverse).
Handles
formalize the intuition:
 A handle is a reduction that allows further reductions back to the start symbol.
 We only want to reduce at handles.
To recognize viable prefixes, we must:
 recognize a sequence of partial rhs’s of productions, where
 each partial rhs can eventually reduce to part of the missing suffix of its predecessor
Important Fact about bottomup parsing:
(1). A bottomup parser traces a rightmost derivation in reverse
(2). In shift-reduce parsing, handles appear only at the top of the stack, never inside
(3). For any grammar, the set of viable prefixes is a regular language.
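These facts can be seen in a tiny shift-reduce trace for an assumed grammar E → E + int | int: the parser shifts tokens and reduces whenever the top of the stack is a handle, and the reduce steps read in reverse give a rightmost derivation:

```python
# Shift-reduce parsing sketch for E -> E + int | int.
def shift_reduce(tokens):
    stack, i, trace = [], 0, []
    while True:
        if stack[-3:] == ["E", "+", "int"]:       # handle: reduce by E -> E + int
            stack[-3:] = ["E"]
            trace.append("reduce E -> E + int")
        elif stack == ["int"]:                    # handle: reduce by E -> int
            stack = ["E"]
            trace.append("reduce E -> int")
        elif i < len(tokens):                     # no handle on top: shift
            stack.append(tokens[i]); i += 1
            trace.append("shift " + tokens[i - 1])
        else:
            break
    return stack == ["E"], trace                  # accept if stack reduced to start symbol

ok, trace = shift_reduce("int + int".split())
print(ok)      # True
print(trace)   # shifts and reduces; reduces reversed = rightmost derivation
```

Note the handle check only ever looks at the top of the stack, never inside it, matching fact (2).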
An item
is a production with a "." somewhere on the rhs.
The states of the DFA are “canonical collections of items” or “canonical collections of LR(0) items”.
$X \to \beta.\gamma$ is valid for a viable prefix $\alpha\beta$ if $S' \to^* \alpha X \omega \to \alpha\beta\gamma\omega$ by a rightmost derivation; after parsing $\alpha\beta$, the valid items are the possible tops of the stack of items.
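Computing the closure of a set of LR(0) items shows how these "canonical collections" arise. A sketch over an assumed grammar S' → E, E → E + int | int, where an item is (lhs, rhs, dot position) and closure adds X → .rhs whenever the dot sits before X:

```python
# LR(0) item closure (illustrative sketch).
GRAMMAR = {"S'": [["E"]], "E": [["E", "+", "int"], ["int"]]}

def closure(items):
    # items: set of (lhs, rhs-tuple, dot-position)
    result, changed = set(items), True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot before a nonterminal X
                for prod in GRAMMAR[rhs[dot]]:           # add X -> .prod
                    item = (rhs[dot], tuple(prod), 0)
                    if item not in result:
                        result.add(item); changed = True
    return result

# The start state of the item DFA: closure of { S' -> .E }
start = closure({("S'", ("E",), 0)})
for lhs, rhs, dot in sorted(start):
    print(lhs, "->", " ".join(rhs[:dot]) + "." + " ".join(rhs[dot:]))
```

The resulting set {S' → .E, E → .E + int, E → .int} is one canonical collection: the set of items valid for the empty viable prefix.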