编译原理-语法分析器
Syntax Parser Report
1、Motivation & Content description
Build a syntax parser program to read stream of characters and CFG and show the sequence of devirations by using top-down syntax analyzing method.
Requirements:
(1) Input
Stream of characters CFG(Combination of CFGs of some classes of sentences)
(2) Output
Sequence of derivations if top-down syntax analyzing methods are used. Sequence of reductions if bottom-up syntax analyzing methods are used.
(3)Types of sentencial forms are defined by yourself
(4)Error handling may be included
2、Method
Programming based on LL(1) parsing table:
(a)Construct LL(1) parsing table based on the CFG
(b)Design the program using LL(1) paring table
Firstly, read stream of characters one by one till it cannot be read. In this way we can get a word or a symbol. By reading LL(1) table, the program will decide what should do next, including shift, reduce and accept.
In reading process, analyzer will make sure what is reading (word, number, string, ...). Analyzer will read and write in different way at each state.
To handle errors, when analyzer meets problem, it will stop analyzing and throw the reason of the error. The type of error includes illegal attributes, unclosed strings and others.
3、Assumptions
Note: Sentencial forms in this program mainly come from Java. The types of sentential forms include below:
assignment, condition control, loop(for, while), boolean expression, algebraic expression, comparison...
Here is an example, which is similar to the running case.
Based on those types of sentential forms, define some CFGs initially:
={ identifier, value, if, else, while, do, (, ), {, }, true, false, !, ||, &&, +, -, *, /, %, <, >, <=, >=, ==, !=, =}
=
S → if(A)S else S | if(A) S | while(A) S | do S while(A) | {S} | R=E
A → A || B | B
B → B && C | C
C → !C | (A) | true | false | E op E
E → E + F | E - F | F
F → F * G | F / G | G
G → identifier | value
op → < | > | <= | >= | == | !=
R → identifier
4、build LL(1) parsing table
(1)CFG handle For the initial CFGs, we need to eilminate left recursion and extract maximum common left factors.
S → if(A) S S' | while(A) S | do S while(A)| R=E
S' → else S | $\epsilon$
A → BA'
A' → || BA' | $\epsilon$
B → CB'
B' → && CB' | $\epsilon$
C → !C | (A) | true | false | E op E
E → FE'
E' → + FE' | - FE' | $\epsilon$
F → GF'
F' → * GF' | / GF' | $\epsilon$
G → identifier | value
op → < | > | <= | >= | == | !=
R → identifier
(2)Compute the First Set and Follow Set
| First | Follow | |
|---|---|---|
| S | if, do, while, identifier, { | else, while, #, } |
| S' | else, | # |
| A | !, (, true, false, identifier, value | ) |
| A' | ||, | # |
| B | !, (, true, false, identifier, value | ||, # |
| B' | &&, | # |
| C | !, (, true, false, identifier, value | &&, # |
| E | identifier, value | <, >, <=, >=, ==, !=, # |
| E' | +, -, | # |
| F | identifier, value | +, -, # |
| F' | *, /, | # |
| G | identifier, value | *, /, # |
| op | <, >, <=, >=, ==, != | identifier, value |
| R | identifier | = |
(3)Construct LL(1) parsing table Note:Because the table is too large, some columns are combined(different production will be combined too), and the whole table is devided into 2 tables."id" is "identifier", "v" is "value", "i" is "if", "e" is "else", "d" is "do", "w" is "while", "t" is "true", "f" is "false"
Table1:
| id | v | i | e | d | w | t | f | |
|---|---|---|---|---|---|---|---|---|
| S | S→R=E | S→i(A)SS' | S→dSw(A) | S→w(A)S | ||||
| S' | S'→eS | |||||||
| A | A→BA' | A→BA' | A→BA' | A→BA' | ||||
| A' | ||||||||
| B | B→CB' | B→CB' | B→CB' | B→CB' | ||||
| B' | ||||||||
| C | C→E op E | C→E op E | C→t | C→f | ||||
| E | E→FE' | E→FE' | ||||||
| E' | ||||||||
| F | F→GF' | F→GF' | ||||||
| F' | ||||||||
| G | G→id | G→v | ||||||
| op | ||||||||
| R | R→id |
Table2:
| +,- | *,/ | <,>,...,!= | || | && | = | ( | ) | ! | # | |
|---|---|---|---|---|---|---|---|---|---|---|
| S | acc | |||||||||
| S' | S'→ | |||||||||
| A | A→BA' | A→BA' | ||||||||
| A' | A'→||BA' | A'→ | ||||||||
| B | B→CB' | B→CB' | ||||||||
| B' | B'→&&CB' | B'→ | ||||||||
| C | C→(A) | C→!C | ||||||||
| E | ||||||||||
| E' | E'→ +FE' | -FE' | E'→ | ||||||||
| F | ||||||||||
| F' | F'→ *GF' | /GF' | F'→ | ||||||||
| G | ||||||||||
| op | op→<|...|!= | |||||||||
| R |
5、Code Description
Note: This lab's code is based on Java (1) Data Structures In this program, there is a ParseTable class used to store LL(1) table. In this class, I use 2 ArrayList to store the symbol of lines and rows, and a 2-D array to store the table content. To calculate LL(1) table, I use 2 HashMaps to store First set and Follow set. The map's key is nonterminal, and the value is an arraylist of terminals.
(2) Algorithm: a.Handle productions: For each initial production, we need to eilminate left recursion and extract maximum common left factors. First to elimate left recursion, find the first symbol in the production's content, if it is same to the begin symbol, elimate it by the rule. This should be applied on every nonterminals and get temporary productions. Then to find new productions' maximum common left factors and combine them. Old productions should be deleted and new modified productions should be added.
b.Calculate First Set: For each nonterminal, use a method to find its first symbol in productions. If the first symbol is terminal, add it to its first set, otherwise find the symbol's first set recursively, and add the whole symbol's first set into self's first set. Every nonterminal's first set should be calculated.
c.Calculate Follow Set: For each production, read it and find nonterminals, if a nonterminal is before a terminal, add the terminal to its follow set. If a nonterminal is before a nonterminal, add the nonterminal's first set (except "#") to its follow set. If a production ends with a nonterminal, add (...->#) to its follow set. Every production should be read. Then read again, if a production ends with a nonterminal, add the follow set of the produciton's begin to that nonterminal's follow set.
d.Build LL(1) parsing table: Follow the rule, firstly fill productions into parsetable by first set. Then, if a first set includes "#", add (...->#) into the table by follow set.
e.Analyze: Firstly, read input and devided the input into terminals and store them into an ArrayList (will be used as a queue).Then push "#" and "S" into the stack, then start analyzing. During the analyze, the "OUTPUT" is defined by LL(1) table, if the top of the stack is equal to the queue's first, we'll get "MATCH". If the stack and queue are both empty (only "#" left), the analyze will end with "ACCEPT".
(3) Running Case
In the CFG.txt, first line is terminals, second line is nonterminals, and the rest is the whole CFG. CFG.txt
identifier, value, if, else, while, do, (, ), {, }, true, false, !, ||, &&, +, -, *, /, %, <, >, <=, >=, ==, !=, =
S, A, B, C, E, F, G, op, R
S -> if(A)S else S
S -> if(A)S
S -> while(A) S
S -> do S while(A)
S -> {S}
S -> R=E
A -> A || B
A -> B
B -> B && C
B -> C
C -> !C
C -> (A)
C -> true
C -> false
C -> E op E
E -> E + F
E -> E - F
E -> F
F -> F * G
F -> F / G
F -> G
G -> identifier
G -> value
op -> <
op -> >
op -> <=
op -> >=
op -> ==
op -> !=
R -> identifier
input.txt
if(a<10 && true){
b=a+2*2/4;
}
else{
while(b>=0 || false){
b=0;
}
}
(4) Running Result
output.txt
OUTPUT: S->if(A)SS'
MATCH: if
MATCH: (
OUTPUT: A->BA'
OUTPUT: B->CB'
OUTPUT: C->EopE
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->identifier
MATCH: identifier
OUTPUT: F'->#
OUTPUT: E'->#
OUTPUT: op-><
MATCH: <
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->value
MATCH: value
OUTPUT: F'->#
OUTPUT: E'->#
OUTPUT: B'->&&CB'
MATCH: &&
OUTPUT: C->true
MATCH: true
OUTPUT: B'->#
OUTPUT: A'->#
MATCH: )
OUTPUT: S->{S}
MATCH: {
OUTPUT: S->R=E
OUTPUT: R->identifier
MATCH: identifier
MATCH: =
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->identifier
MATCH: identifier
OUTPUT: F'->#
OUTPUT: E'->+FE'
MATCH: +
OUTPUT: F->GF'
OUTPUT: G->value
MATCH: value
OUTPUT: F'->*GF'
MATCH: *
OUTPUT: G->value
MATCH: value
OUTPUT: F'->/GF'
MATCH: /
OUTPUT: G->value
MATCH: value
OUTPUT: F'->#
OUTPUT: E'->#
MATCH: }
OUTPUT: S'->elseS
MATCH: else
OUTPUT: S->{S}
MATCH: {
OUTPUT: S->while(A)S
MATCH: while
MATCH: (
OUTPUT: A->BA'
OUTPUT: B->CB'
OUTPUT: C->EopE
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->identifier
MATCH: identifier
OUTPUT: F'->#
OUTPUT: E'->#
OUTPUT: op->>=
MATCH: >=
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->value
MATCH: value
OUTPUT: F'->#
OUTPUT: E'->#
OUTPUT: B'->#
OUTPUT: A'->||BA'
MATCH: ||
OUTPUT: B->CB'
OUTPUT: C->false
MATCH: false
OUTPUT: B'->#
OUTPUT: A'->#
MATCH: )
OUTPUT: S->{S}
MATCH: {
OUTPUT: S->R=E
OUTPUT: R->identifier
MATCH: identifier
MATCH: =
OUTPUT: E->FE'
OUTPUT: F->GF'
OUTPUT: G->value
MATCH: value
OUTPUT: F'->#
OUTPUT: E'->#
MATCH: }
MATCH: }
ACCEPT
6、Problems and Comments
(1)Problems and Solutions: In this lab, the hardest part is to build LL(1) table. To make it easier, I transfer the CFG's productions into unified formatted productions. There are different methos with different using in the program, such as get first symbol, get common factor. When I'm about to finish the lab, I realize that I can use an easier way to store the information of CFG, use a hashmap or arraylist to store CFG so that I don't have to read the character to find the symbol again and again.
(2)Feelings and Comments: In this lab, I use the code to finish the process of building LL(1) parsing table and LL(1) analyzing, of course is that it's much harder than solve it by writing. During the coding, I met some problem that never occurs in writing so it took me a lot time to make it correct. So it's important to learn the method in both writing and coding.
After finishing the lab, I have more understanding of top-down analyzing, and I improve the ability of calculating first set and follow set to build a correct LL(1) parsing table.
7.How to Use?
(1). You can get the source code here
(2). In the CFG.txt, give your own CFG. Line 1 should contain terminal and Line 2 should contain nonterminal, then each line contains one production.
(3). In the input.txt, give your code.
(3). Check the file address in the code whether it is right.
(5). Run.
TIPS
懒得翻译第二弹,鬼知道我当时怎么写的这一坨