Parsing (Syntax Analysis)
Overview
Following the lexical analysis phase, where the source program is converted into a stream of tokens, we proceed to the second phase of compilation: syntax analysis, or parsing. The primary purpose of the parser is to determine if the sequence of tokens conforms to the syntactic structure defined by the source language's grammar. It is in this phase that the compiler imposes a hierarchical structure on the token stream, typically by constructing a parse tree. This tree represents the grammatical organization of the input, making explicit the nested relationships between programming constructs, which is essential for subsequent phases such as semantic analysis and code generation.
A thorough understanding of parsing is indispensable for success in the GATE examination, as this topic forms a significant and recurring portion of the Compiler Design syllabus. This chapter provides a systematic treatment of the two principal parsing strategies: top-down and bottom-up. We shall investigate the theoretical foundations of each approach, including the algorithms and data structures that enable their implementation. Mastery of concepts such as the computation of FIRST and FOLLOW sets, the construction of LL and LR parsing tables, and the resolution of parsing conflicts is critical for solving the analytical and numerical problems frequently posed in the exam.
Our exploration will focus on the practical mechanisms of parser construction and operation. We will examine how different classes of grammars lend themselves to specific parsing techniques and analyze the trade-offs between parser power and complexity. By understanding how a parser validates correct syntax and, just as importantly, how it detects and reports errors, we gain a deeper appreciation for the robust engineering that underpins modern programming languages and compilers.
---
Chapter Contents
| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | Top-Down Parsing | Constructing a parse tree from root down. |
| 2 | Bottom-Up Parsing | Building a parse tree from leaves up. |
---
Learning Objectives
After completing this chapter, you will be able to:
- Differentiate between top-down and bottom-up parsing strategies and their respective applications.
- Compute FIRST and FOLLOW sets for a given context-free grammar and construct predictive parsing tables for LL(1) grammars.
- Construct parsing tables for LR(0), SLR(1), CLR(1), and LALR(1) parsers and trace the parsing of an input string.
- Identify and resolve shift-reduce and reduce-reduce conflicts that may arise during bottom-up parser construction.
---
We now turn our attention to Top-Down Parsing...
## Part 1: Top-Down Parsing
Introduction
In the syntax analysis phase of a compiler, parsing is the process of determining if a string of tokens can be generated by a given context-free grammar. Top-down parsing is a fundamental strategy wherein the parser attempts to construct a parse tree for an input string starting from the root node (the start symbol of the grammar) and working its way down to the leaves (the terminals). This approach can be visualized as a process of finding the leftmost derivation for an input string.
The most significant family of top-down parsers is the class of predictive parsers, which operate without backtracking. These parsers make decisions about which production to apply based on the next input symbol, known as the lookahead symbol. The most common form of predictive parser, and the one most relevant for the GATE examination, is the LL(1) parser. The "LL(1)" designation signifies that the parser processes the input from Left to right, constructs a Leftmost derivation, and uses one lookahead symbol to make its parsing decisions. Our study will focus on the theoretical underpinnings required to construct such a parser, namely the computation of First and Follow sets and the construction of the LL(1) parsing table.
Top-down parsing is a parsing strategy that attempts to find a leftmost derivation for an input string by building a parse tree from the root downwards. It begins with the start symbol of the grammar and iteratively applies production rules to expand non-terminals until the terminal string matches the input.
---
Key Concepts
## 1. FIRST and FOLLOW Sets
To construct a predictive parser, we must be able to choose the correct production for a non-terminal based on the upcoming input token. This is achieved by pre-computing two sets of terminals for each non-terminal in the grammar: the `FIRST` set and the `FOLLOW` set.
### The `FIRST` Set
The `FIRST` set of a string of grammar symbols $\alpha$, denoted $FIRST(\alpha)$, is the set of terminals that can begin a string derived from $\alpha$. If $\alpha$ can derive the empty string $\epsilon$, then $\epsilon$ is also included in $FIRST(\alpha)$.
Rules for Computing `FIRST` Sets:
Let us consider a grammar symbol $X$.
- If there is a production $X \to a\alpha$, where $a$ is a terminal, then add $a$ to $FIRST(X)$.
- If there is a production $X \to \epsilon$, then add $\epsilon$ to $FIRST(X)$.
- If there is a production $X \to Y_1 Y_2 \ldots Y_k$, then we add all symbols from $FIRST(Y_1)$ to $FIRST(X)$, excluding $\epsilon$. If $FIRST(Y_1)$ contains $\epsilon$, we then consider $FIRST(Y_2)$ and add its contents (excluding $\epsilon$) to $FIRST(X)$, and so on. If $FIRST(Y_i)$ contains $\epsilon$ for all $i = 1, \ldots, k$, then we add $\epsilon$ to $FIRST(X)$.

$FIRST(\alpha)$ for a string $\alpha = X_1 X_2 \ldots X_n$ is computed similarly. We add $FIRST(X_1) \setminus \{\epsilon\}$ to $FIRST(\alpha)$. If $\epsilon \in FIRST(X_1)$, we add $FIRST(X_2) \setminus \{\epsilon\}$ to $FIRST(\alpha)$, and so on. If $\epsilon \in FIRST(X_i)$ for all $i$, then we add $\epsilon$ to $FIRST(\alpha)$.
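The rules above can be turned into a small fixed-point computation. The sketch below is illustrative, not a production implementation: it assumes a grammar encoded as a dict from each non-terminal to a list of right-hand sides (tuples of symbols), with the empty tuple standing for an $\epsilon$-body and the string `"eps"` as our own sentinel for $\epsilon$.

```python
# Fixed-point computation of FIRST sets (illustrative sketch).
# Grammar: dict mapping each non-terminal to a list of productions;
# a production is a tuple of symbols; () represents an epsilon body.
EPS = "eps"

def compute_first(grammar):
    nonterms = set(grammar)
    first = {A: set() for A in nonterms}

    def first_of_string(symbols):
        """FIRST of a string X1 X2 ... Xn, applying the rules left to right."""
        result = set()
        for X in symbols:
            if X not in nonterms:          # terminal: it begins the string
                result.add(X)
                return result
            result |= first[X] - {EPS}
            if EPS not in first[X]:        # X cannot vanish, so stop here
                return result
        result.add(EPS)                    # every symbol can derive epsilon
        return result

    changed = True
    while changed:                         # iterate until no set grows
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                new = first_of_string(body) if body else {EPS}
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

# Example grammar: S -> A b | c,  A -> a A | epsilon
g = {"S": [("A", "b"), ("c",)], "A": [("a", "A"), ()]}
print(compute_first(g))   # FIRST(A) = {a, eps}, FIRST(S) = {a, b, c}
```

Note how the nullable case is handled: because $\epsilon \in FIRST(A)$, the production $S \to A\,b$ contributes both $a$ and $b$ to $FIRST(S)$.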
### The `FOLLOW` Set
The `FOLLOW` set of a non-terminal $A$, denoted $FOLLOW(A)$, is the set of terminals that can appear immediately to the right of $A$ in some sentential form. If $A$ can be the rightmost symbol in a sentential form, then the end-of-input marker `$` is also included in $FOLLOW(A)$.
Rules for Computing `FOLLOW` Sets:
- Rule 1: Place `$` in $FOLLOW(S)$, where $S$ is the start symbol.
- Rule 2: For a production $A \to \alpha B \beta$, add everything in $FIRST(\beta)$, except $\epsilon$, to $FOLLOW(B)$.
- Rule 3: For a production $A \to \alpha B$, or a production $A \to \alpha B \beta$ where $FIRST(\beta)$ contains $\epsilon$, add everything in $FOLLOW(A)$ to $FOLLOW(B)$.
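The three FOLLOW rules also admit a fixed-point sketch. This version assumes the same grammar encoding as before and takes already-computed FIRST sets as input (the FIRST sets in the example are written by hand); names like `compute_follow` are our own.

```python
# Fixed-point computation of FOLLOW sets (sketch; FIRST sets are
# assumed to be available already, e.g. from a FIRST algorithm).
EPS, END = "eps", "$"

def compute_follow(grammar, first, start):
    nonterms = set(grammar)
    follow = {A: set() for A in nonterms}
    follow[start].add(END)                 # Rule 1: $ in FOLLOW(start)

    def first_of(symbols):
        out = set()
        for X in symbols:
            f = first.get(X, {X})          # a terminal's FIRST is itself
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in nonterms:
                        continue
                    beta = body[i + 1:]
                    f_beta = first_of(beta)
                    new = f_beta - {EPS}   # Rule 2: FIRST(beta) \ {eps}
                    if EPS in f_beta:      # Rule 3: beta can vanish
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

# S -> A b | c,  A -> a A | epsilon  (FIRST sets written by hand here)
g = {"S": [("A", "b"), ("c",)], "A": [("a", "A"), ()]}
first = {"S": {"a", "b", "c"}, "A": {"a", EPS}}
print(compute_follow(g, first, "S"))   # FOLLOW(S) = {$}, FOLLOW(A) = {b}
```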
Worked Example:
Problem: Compute the `FIRST` and `FOLLOW` sets for the following grammar, where is the start symbol.
Solution:
Computing `FIRST` Sets:
Step 1: Initialize `FIRST` sets.
, , .
, , .
Step 2: Compute .
From , we add to .
From , we add to .
So, .
Step 3: Compute .
From , we add to .
From , we add to .
So, .
Step 4: Compute .
From , we look at . We add everything from except to .
.
Since , we must also consider . We add all of to .
.
Final `FIRST` Sets:
---
Computing `FOLLOW` Sets:
Step 1: Initialize `FOLLOW` sets and apply Rule 1.
$FOLLOW(S) = \{\$\}$ by Rule 1, since $S$ is the start symbol; all other `FOLLOW` sets start empty.
Step 2: Analyze productions for Rule 2 and 3.
Consider production .
- To find , we look at the symbol after , which is . By Rule 2, we add to .
- To find , we observe that is at the end of the production. By Rule 3, we add everything in to .
Consider production .
- To find , we see that is at the end. By Rule 3, we add everything in to , which adds nothing new.
Consider production .
- To find , we see that is at the end. By Rule 3, we add everything in to , which adds nothing new.
Final `FOLLOW` Sets:
$FOLLOW(S) = \{\$\}$
$FOLLOW(B) = \{\$\}$
---
## 2. LL(1) Grammars
A grammar is said to be LL(1) if it is possible to construct a predictive parsing table for it. The existence of such a table is guaranteed if and only if the grammar satisfies two specific conditions. These conditions ensure that for any non-terminal and any lookahead symbol, there is at most one production rule that can be applied.
A context-free grammar is LL(1) if and only if for any two distinct productions $A \to \alpha$ and $A \to \beta$, the following conditions hold:
- $FIRST(\alpha) \cap FIRST(\beta) = \emptyset$.
- If $\epsilon \in FIRST(\beta)$, then $FIRST(\alpha) \cap FOLLOW(A) = \emptyset$. (Symmetrically, if $\epsilon \in FIRST(\alpha)$, then $FIRST(\beta) \cap FOLLOW(A) = \emptyset$.)
The first condition ensures that if we have two choices for expanding a non-terminal $A$, their `FIRST` sets are disjoint. This means the lookahead symbol will uniquely determine which production to use. The second condition handles the case where one production can derive $\epsilon$. If $\beta$ can produce $\epsilon$, the parser might choose the production $A \to \beta$. In this case, the next input symbol must belong to $FOLLOW(A)$. This condition ensures that the symbols in $FOLLOW(A)$ do not overlap with the symbols that could start a derivation from the other choice, $\alpha$.
Properties that prevent a grammar from being LL(1):
- Ambiguity: An ambiguous grammar can generate more than one parse tree for the same input string. Such grammars can never be LL(k) for any k. The classic "dangling else" problem is a prime example.
- Left Recursion: A grammar is left-recursive if it has a non-terminal $A$ such that there is a derivation $A \Rightarrow^{+} A\alpha$ for some string $\alpha$. Top-down parsers cannot handle left-recursive grammars as they would enter an infinite loop.
- Common Prefixes (Lack of Left Factoring): If two productions for a non-terminal start with the same sequence of symbols, e.g., $A \to \alpha\beta_1 \mid \alpha\beta_2$, the parser cannot decide which production to choose by looking at the first few symbols of the input. This violates the first LL(1) condition.
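Left recursion can be removed mechanically: a pair $A \to A\alpha \mid \beta$ becomes $A \to \beta A'$ and $A' \to \alpha A' \mid \epsilon$ with $A'$ a fresh non-terminal. A minimal sketch of this transformation (immediate left recursion only, grammar encoding and naming convention are our own):

```python
# Eliminating immediate left recursion (sketch): A -> A a | b becomes
# A -> b A',  A' -> a A' | eps, where A' is a fresh non-terminal.
def remove_immediate_left_recursion(grammar):
    out = {}
    for A, bodies in grammar.items():
        rec    = [b[1:] for b in bodies if b and b[0] == A]    # A -> A alpha
        nonrec = [b for b in bodies if not (b and b[0] == A)]  # A -> beta
        if not rec:
            out[A] = list(bodies)          # nothing to do for this A
            continue
        A2 = A + "'"                       # fresh non-terminal A'
        out[A]  = [beta + (A2,) for beta in nonrec]
        out[A2] = [alpha + (A2,) for alpha in rec] + [()]      # () is eps
    return out

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | eps
g = {"E": [("E", "+", "T"), ("T",)]}
print(remove_immediate_left_recursion(g))
# {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()]}
```

This handles only *immediate* left recursion; indirect left recursion (through a chain of non-terminals) requires the full ordering-based algorithm.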
## 3. Construction of an LL(1) Parsing Table
The LL(1) parsing table is a two-dimensional array, `M[A, a]`, where `A` is a non-terminal and `a` is a terminal or the end marker `$`. Each entry `M[A, a]` contains the production rule that the parser should use to expand non-terminal `A` when the current lookahead symbol is `a`.
For a given grammar $G$ with `FIRST` and `FOLLOW` sets computed, we construct the parsing table as follows:
For each production $A \to \alpha$ in the grammar, perform the following two steps:
- For each terminal $a$ in $FIRST(\alpha)$, add the production $A \to \alpha$ to the table entry $M[A, a]$.
- If $\epsilon$ is in $FIRST(\alpha)$, then for each terminal $b$ in $FOLLOW(A)$ (including `$`), add the production $A \to \alpha$ to the entry $M[A, b]$.
If any table entry ends up with more than one production, the grammar is not LL(1). All other empty entries are error states.
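The two-step fill can be sketched directly; the table is keyed by (non-terminal, terminal) pairs, and a multiply-defined cell is recorded as a conflict. The encoding (dict grammar, `"eps"`/`"$"` sentinels) is our own assumption, as before.

```python
# Filling the LL(1) table M[A, a] (sketch). A grammar is LL(1) iff no
# cell receives two productions. FIRST/FOLLOW are assumed precomputed.
EPS, END = "eps", "$"

def build_ll1_table(grammar, first, follow):
    def first_of(symbols):
        out = set()
        for X in symbols:
            f = first.get(X, {X})          # a terminal's FIRST is itself
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    table, conflicts = {}, []
    for A, bodies in grammar.items():
        for body in bodies:
            f = first_of(body)
            targets = f - {EPS}            # step 1: FIRST entries
            if EPS in f:                   # step 2: epsilon -> FOLLOW(A)
                targets |= follow[A]
            for a in targets:
                if (A, a) in table:        # multiply defined: not LL(1)
                    conflicts.append((A, a))
                table[(A, a)] = (A, body)
    return table, conflicts

# S -> A b | c,  A -> a A | epsilon
g = {"S": [("A", "b"), ("c",)], "A": [("a", "A"), ()]}
first = {"S": {"a", "b", "c"}, "A": {"a", EPS}}
follow = {"S": {END}, "A": {"b"}}
table, conflicts = build_ll1_table(g, first, follow)
print(table[("A", "b")])   # ('A', ()) -- the epsilon production, via FOLLOW(A)
print(conflicts)           # [] -- the grammar is LL(1)
```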
Worked Example:
Problem: Construct the LL(1) parsing table for the classic expression grammar:
$E \to T E'$, $E' \to + T E' \mid \epsilon$, $T \to F T'$, $T' \to * F T' \mid \epsilon$, $F \to ( E ) \mid id$
Solution:
Step 1: Compute `FIRST` and `FOLLOW` sets.
(The detailed computation is omitted for brevity, but is a necessary prerequisite).
- $FIRST(E) = FIRST(T) = FIRST(F) = \{(, id\}$; $FIRST(E') = \{+, \epsilon\}$; $FIRST(T') = \{*, \epsilon\}$
- $FOLLOW(E) = FOLLOW(E') = \{), \$\}$
- $FOLLOW(T) = FOLLOW(T') = \{+, ), \$\}$
- $FOLLOW(F) = \{+, *, ), \$\}$
Let us consider a few sample entries:
- Production $E \to T E'$: $FIRST(T E') = \{(, id\}$, so this production goes into $M[E, (]$ and $M[E, id]$.
- Production $E' \to + T E'$: $FIRST(+ T E') = \{+\}$, so it goes into $M[E', +]$.
- Production $E' \to \epsilon$: since $\epsilon \in FIRST(\epsilon)$, it goes into $M[E', b]$ for every $b \in FOLLOW(E') = \{), \$\}$.
- Production $F \to id$: $FIRST(id) = \{id\}$, so it goes into $M[F, id]$.
Following this procedure for all productions yields the complete LL(1) parsing table.
---
Problem-Solving Strategies
In an exam, you may not have time for the full iterative algorithm. For `FIRST` sets, mentally trace derivations: "What is the absolute first terminal this non-terminal can produce?". For `FOLLOW` sets, scan the right-hand sides of all productions where the non-terminal appears. Ask: "What can come immediately after this non-terminal?" This mental scan is often sufficient for simpler grammars found in GATE questions.
---
Common Mistakes
- ❌ Confusing `FIRST` and `FOLLOW` for $\epsilon$-productions: For a production $A \to \epsilon$, students mistakenly place the production using `FIRST` entries; the correct rule is to add it to $M[A, b]$ for every $b$ in $FOLLOW(A)$.
- ❌ Forgetting `$`: The end marker `$` is crucial. It must always be added to the `FOLLOW` set of the start symbol initially. Forgetting this can lead to an incomplete parsing table.
- ❌ Incorrectly applying Rule 3 for `FOLLOW`: For a production $A \to \alpha B$, students often forget to add all of $FOLLOW(A)$ to $FOLLOW(B)$.
---
Practice Questions
:::question type="MCQ" question="Consider the grammar: X → Y z, Y → y Y | ε, where X is the start symbol. What are the correct `FIRST` and `FOLLOW` sets for the non-terminal Y?" options=["First(Y) = {y, ε}, Follow(Y) = {z}","First(Y) = {y}, Follow(Y) = {z, $}","First(Y) = {y, ε}, Follow(Y) = {z, $}","First(Y) = {y, z}, Follow(Y) = {$}"] answer="First(Y) = {y, ε}, Follow(Y) = {z}" hint="To find Follow(Y), look at the productions where Y appears on the RHS. What symbol follows Y in the production X → Y z?" solution="
Step 1: Compute First(Y).
From the production Y → y Y, we add y to First(Y).
From the production Y → ε, we add ε to First(Y).
So, First(Y) = {y, ε}.
Step 2: Compute Follow(Y).
We look for occurrences of Y on the right-hand side of productions. Y appears in X → Y z and in Y → y Y.
From X → Y z: by the rule for a production A → αBβ, we add First(β) \ {ε} to Follow(Y). Here β = z, and since z is a terminal, First(z) = {z}. We add z to Follow(Y).
From Y → y Y: Y is at the end, so we add Follow(Y) to itself, which contributes nothing new.
Note that we do not add $ to Follow(Y), as there is no rule that would justify it: X is the start symbol, and no production places Y at the end of a sentential form.
Result:
The correct sets are First(Y) = {y, ε} and Follow(Y) = {z}.
"
:::
:::question type="MSQ" question="Consider the grammar G: E → E + E, E → E * E, E → ( E ) | id. Which of the following statements about G is/are TRUE?" options=["G is LL(1).","G is ambiguous.","G is left-recursive.","The language generated by G represents arithmetic expressions."] answer="G is ambiguous.,G is left-recursive.,The language generated by G represents arithmetic expressions." hint="Check for direct left recursion in the productions. Does this grammar suffer from the classic ambiguity issues of expression grammars?" solution="
Analysis of Statements:
Yes. The productions E → E + E and E → E * E are examples of direct left recursion, where a non-terminal appears as the first symbol on the right-hand side of its own production. A top-down parser would enter an infinite loop trying to expand E → E + E or E → E * E. So, "G is left-recursive" is TRUE.
No grammar that is left-recursive can be LL(1). The presence of left recursion immediately disqualifies it. So, "G is LL(1)" is FALSE.
Yes, this is the classic ambiguous expression grammar. For the string `id+id*id`, two different leftmost derivations (and thus two parse trees) can be constructed, one corresponding to `(id+id)*id` and another to `id+(id*id)`. So, "G is ambiguous" is TRUE.
The grammar generates strings like `id`, `id+id`, `id*id`, `(id+id)*id`, etc. This corresponds to arithmetic expressions involving addition, multiplication, parentheses, and an identifier. So, "The language generated by G represents arithmetic expressions" is TRUE.
Conclusion: The correct options are that the grammar is ambiguous, left-recursive, and generates arithmetic expressions.
"
:::
:::question type="NAT" question="Consider the grammar: , , . What is the cardinality of the set ?" answer="2" hint="Follow sets are defined on the grammar as given; no transformation is needed. Examine every production where A appears on the right-hand side and collect the terminals that can immediately follow it." solution="
Key point: Follow sets are defined by the grammar itself, regardless of whether the grammar is suitable for a particular parsing method. Eliminating left recursion is needed only to *build* a top-down parser; it plays no role in the definition of the Follow set, so we compute Follow(A) directly on the given grammar.
Step 1: Compute Follow(A).
Look for every occurrence of A on the right-hand side of a production. For an occurrence in a production of the form X → αAβ, add First(β) \ {ε} to Follow(A) (Rule 2); if β can derive ε, or A is rightmost, also add Follow(X) (Rule 3).
Step 2: Collect the contributions.
Each of the two right-hand-side occurrences of A is immediately followed by a terminal, and the two terminals are distinct. These two symbols make up Follow(A), and no rule adds anything further.
Result:
The cardinality of Follow(A) is 2.
Final Answer: 2
"
:::
---
Summary
- Master `FIRST` and `FOLLOW` Sets: The ability to correctly and quickly compute these sets is non-negotiable. Nearly all top-down parsing questions depend on them. Remember the specific rules for $\epsilon$-productions and the `$` end marker.
- Know the Two LL(1) Conditions: A grammar is LL(1) if for any non-terminal $A$ with distinct productions $A \to \alpha$ and $A \to \beta$: (1) $FIRST(\alpha) \cap FIRST(\beta) = \emptyset$, and (2) if $\epsilon \in FIRST(\beta)$, then $FIRST(\alpha) \cap FOLLOW(A) = \emptyset$. Be prepared to apply these conditions to determine if a grammar is LL(1).
- Understand the Parsing Table Algorithm: Know the two-part algorithm for filling the table: use `FIRST` sets for non-epsilon productions and `FOLLOW` sets for epsilon productions. Multiple entries in a single cell imply the grammar is not LL(1).
---
What's Next?
This topic is a cornerstone of syntax analysis. To build a complete understanding of parsing, we must now turn our attention to the alternative approach.
- Bottom-Up Parsing (LR Parsing): While top-down parsing builds the tree from the root down, bottom-up parsing starts from the leaves (the input string) and works its way up to the root. This method is more powerful and can handle a larger class of grammars, including the left-recursive ones that fail for LL(1) parsers. Concepts like SLR, CLR, and LALR parsers are the logical next step.
- Syntax Directed Translation (SDT): Parsing is not just about recognition; it's about translation. SDT involves augmenting the grammar with semantic actions to perform tasks like type checking or intermediate code generation as the parsing proceeds. Understanding parsing is a prerequisite for studying SDT.
---
Now that you understand Top-Down Parsing, let's explore Bottom-Up Parsing which builds on these concepts.
---
## Part 2: Bottom-Up Parsing
Introduction
In the syntax analysis phase of compilation, parsing is the process of determining if a string of tokens can be generated by a grammar. Bottom-up parsing, as the name suggests, is a strategy that attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top). The parser identifies the most fundamental syntactic units of the language first and progressively combines them into larger, more complex structures, until the single start symbol is derived.
This method is often described as "shift-reduce" parsing. The parser maintains a stack and repeatedly "shifts" input tokens onto the stack until the top of the stack matches the right-hand side of some grammar production. At this point, the parser "reduces" the matched symbols on the stack to the non-terminal on the left-hand side of the production. This process continues until the entire input string is consumed and the stack contains only the start symbol, signifying a successful parse. The sequence of reductions, when read in reverse, corresponds to a rightmost derivation of the input string.
The LR family of parsers—including LR(0), SLR(1), LALR(1), and CLR(1)—are the most powerful and widely used bottom-up parsing algorithms. They are capable of parsing a very large class of context-free grammars efficiently and deterministically. A thorough understanding of their construction and operation is therefore indispensable for the study of compiler design.
Bottom-up parsing is an algorithm that attempts to construct a parse tree for an input string by starting from the input string itself (the leaves of the tree) and attempting to rewrite it to the grammar's start symbol (the root of the tree). The parser aims to find a sequence of reductions that trace a rightmost derivation in reverse.
---
Key Concepts
## 1. Shift-Reduce Parsing
The fundamental mechanism underlying most bottom-up parsers is the shift-reduce algorithm. This algorithm utilizes a stack to hold grammar symbols and an input buffer containing the remainder of the string to be parsed. The parser performs one of four possible actions at each step.
Parser Actions:
The core challenge in shift-reduce parsing is deciding whether to shift or to reduce at each step, and if reducing, by which production. This decision is what distinguishes the various types of LR parsers.
A key concept in this process is the "handle." A handle is a substring that matches the right-hand side of a production and whose reduction represents one step back along the rightmost derivation. A bottom-up parser's task is essentially to find and reduce a sequence of handles.
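The shift-reduce loop can be made concrete with a toy grammar. The sketch below uses S → ( S ) | x and a deliberately naive "oracle" that reduces greedily whenever the stack top matches a right-hand side; a real LR parser would consult its parsing table to make this decision. The grammar and all names here are our own illustration.

```python
# A toy shift-reduce loop for the grammar S -> ( S ) | x  (sketch).
# Greedy reduction happens to be correct for this grammar; real LR
# parsers decide shift-vs-reduce from a table instead.
def shift_reduce(tokens):
    rhs_to_lhs = {("x",): "S", ("(", "S", ")"): "S"}
    stack, trace = [], []
    tokens = list(tokens) + ["$"]          # $ marks end of input
    i = 0
    while True:
        # try to reduce a handle sitting on top of the stack
        for rhs, lhs in rhs_to_lhs.items():
            if tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if tokens[i] == "$":           # nothing to shift or reduce
                break
            stack.append(tokens[i])        # shift the next input token
            trace.append(f"shift {tokens[i]}")
            i += 1
    return stack == ["S"], trace           # accept iff only S remains

ok, trace = shift_reduce(["(", "(", "x", ")", ")"])
print(ok)          # True
for step in trace:
    print(step)
```

Reading the `reduce` lines of the trace bottom-up reproduces a rightmost derivation of `((x))`, exactly as the text describes.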
Parsing Conflicts:
The decision-making process can lead to conflicts:
* Shift-Reduce Conflict: The parser cannot decide whether to shift the next input symbol onto the stack or to reduce a handle that is already on top of the stack.
* Reduce-Reduce Conflict: The parser finds that the top of the stack could be reduced by more than one production rule.
---
## 2. The LR Parser Family
The name LR($k$) stands for scanning the input from Left-to-right, producing a Rightmost derivation in reverse, using $k$ symbols of lookahead. For GATE, we are primarily concerned with $k = 0$ and $k = 1$: the LR(0), SLR(1), LALR(1), and CLR(1) parsers.
The LR parser classes form a power hierarchy: $LR(0) \subset SLR(1) \subset LALR(1) \subset CLR(1)$. A parser in an outer, more powerful class can parse any grammar handled by a class contained within it. For example, any grammar that is SLR(1) is also LALR(1) and CLR(1). It is also important to note that all LR($k$) grammars are unambiguous. However, there exist unambiguous grammars that are not LR($k$) for any $k$.
---
## 3. Construction of LR Parsing Tables
The behavior of an LR parser is driven by a parsing table constructed from the grammar. This construction relies on two fundamental operations: `CLOSURE` and `GOTO`.
Augmented Grammar:
Before construction begins, we augment the grammar with a new start production $S' \to S$, where $S$ is the original start symbol. This gives the parser a clear accept state: the parser accepts when it reduces by $S' \to S$.
LR(0) Items:
An LR(0) item is a production from the grammar with a dot '$\bullet$' at some position on the right-hand side. For instance, a production $A \to XYZ$ gives rise to four LR(0) items: $A \to \bullet XYZ$, $A \to X \bullet YZ$, $A \to XY \bullet Z$, and $A \to XYZ \bullet$.
The dot indicates how much of the production's right-hand side we have seen so far.
### The CLOSURE Operation
The `CLOSURE` operation identifies the complete set of productions we might expect to see at a certain point in the parse. If we have an item $A \to \alpha \bullet B \beta$, it means we expect to see a string derivable from $B$ next. Therefore, we must add all items of the form $B \to \bullet \gamma$ to our set. This process is repeated until no new items can be added.
Variables:
- $I$: A set of LR(0) items
- $CLOSURE(I)$: The resulting closure set
When to use: To compute the full set of items for a parser state (a canonical collection).
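The repeat-until-stable character of `CLOSURE` makes it another natural fixed-point loop. In this sketch (encoding and names are our own), an item is a pair `(production, dot)` where a production is `(lhs, rhs-tuple)`.

```python
# CLOSURE of a set of LR(0) items (sketch). An item is a pair
# (production, dot): production = (lhs, rhs-tuple), dot = position.
def closure(items, grammar):
    nonterms = set(grammar)
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs), dot in list(result):
            if dot < len(rhs) and rhs[dot] in nonterms:
                B = rhs[dot]               # the dot sits before non-terminal B
                for body in grammar[B]:    # add B -> . body for every body
                    item = ((B, body), 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

# Augmented grammar: S' -> S,  S -> ( S ) | x
g = {"S'": [("S",)], "S": [("(", "S", ")"), ("x",)]}
I0 = closure({(("S'", ("S",)), 0)}, g)
print(len(I0))   # 3 items: S' -> .S,  S -> .(S),  S -> .x
```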
### The GOTO Operation
The `GOTO(I, X)` function defines the transition from a state $I$ on seeing a grammar symbol $X$ (either a terminal or a non-terminal). It is computed by taking all items in state $I$ where the dot is immediately followed by $X$, advancing the dot over $X$, and then taking the closure of this new set of items.
Variables:
- $I$: The current set of items (a state)
- $X$: A grammar symbol
- $J$: An intermediate set of items before closure
When to use: To compute the transitions between parser states.
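`GOTO` is then a one-liner on top of `CLOSURE`: select, advance, close. The sketch below re-defines `closure` so the fragment stands on its own; encoding and names are our own, as before.

```python
# GOTO(I, X): advance the dot over X in every item of I, then close
# the result (sketch; closure is re-defined here to stay self-contained).
def closure(items, grammar):
    nonterms, result, changed = set(grammar), set(items), True
    while changed:
        changed = False
        for (lhs, rhs), dot in list(result):
            if dot < len(rhs) and rhs[dot] in nonterms:
                for body in grammar[rhs[dot]]:
                    if ((rhs[dot], body), 0) not in result:
                        result.add(((rhs[dot], body), 0))
                        changed = True
    return result

def goto(items, X, grammar):
    kernel = {((lhs, rhs), dot + 1)          # advance the dot over X
              for (lhs, rhs), dot in items
              if dot < len(rhs) and rhs[dot] == X}
    return closure(kernel, grammar)

# Augmented grammar: S' -> S,  S -> ( S ) | x
g = {"S'": [("S",)], "S": [("(", "S", ")"), ("x",)]}
I0 = closure({(("S'", ("S",)), 0)}, g)
print(len(goto(I0, "(", g)))   # 3: S -> (.S), plus S -> .(S) and S -> .x
print(len(goto(I0, "S", g)))   # 1: just S' -> S.
```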
Worked Example:
Problem: For the augmented grammar below, compute $GOTO(I_0, E)$.
$E' \to E$, $E \to E + T \mid T$, $T \to T * F \mid F$, $F \to ( E ) \mid id$
Solution:
Step 1: Compute the initial state $I_0 = CLOSURE(\{E' \to \bullet E\})$.
We start with $E' \to \bullet E$. The dot is before non-terminal $E$, so we add all productions for $E$: $E \to \bullet E + T$ and $E \to \bullet T$.
Now, in the new items, the dot is before $E$ (already processed) and $T$. We add all productions for $T$: $T \to \bullet T * F$ and $T \to \bullet F$.
Finally, the dot is before $F$. We add the productions for $F$: $F \to \bullet ( E )$ and $F \to \bullet id$.
This set is now closed.
Step 2: Compute the kernel of $GOTO(I_0, E)$.
We look for items in $I_0$ where the dot is before $E$. These are $E' \to \bullet E$ and $E \to \bullet E + T$. We advance the dot over $E$ in these items, giving $E' \to E \bullet$ and $E \to E \bullet + T$.
Step 3: Compute the closure of this kernel to get the final set $GOTO(I_0, E)$.
The first item, $E' \to E \bullet$, has the dot at the end, so no new items are added from it.
The second item, $E \to E \bullet + T$, has the dot before a terminal `+`, so no new items are added.
Therefore, the closure is just the kernel itself.
Answer: The set $GOTO(I_0, E)$ contains 2 items.
---
## 4. SLR, LALR, and CLR Parsers
The differences between the main types of LR parsers lie in how they decide to reduce.
### SLR(1) - Simple LR
The SLR(1) parser constructs its states using the LR(0) items (as shown above). When a state contains an item of the form $A \to \alpha \bullet$, the parser decides to reduce by this production if the next input symbol, say $a$, is in the FOLLOW set of the non-terminal $A$.
- Conflict: A shift-reduce conflict occurs in a state containing a reduce item $A \to \alpha \bullet$ together with a shift item $B \to \beta \bullet a \gamma$ where $a \in FOLLOW(A)$; a reduce-reduce conflict occurs when two reduce items $A \to \alpha \bullet$ and $B \to \beta \bullet$ have intersecting FOLLOW sets.
### CLR(1) - Canonical LR
The CLR(1) parser uses a more powerful item, the LR(1) item, which is of the form $[A \to \alpha \bullet \beta, a]$. Here, $a$ is a terminal symbol called the lookahead. This item implies that we are looking for a string derivable from $\beta$ next, and that the completed $A$ must be followed by the terminal $a$.
The `CLOSURE` and `GOTO` operations are modified to propagate these lookaheads. The reduction rule is much more precise:
- Reduction Rule: In a state containing the item $[A \to \alpha \bullet, a]$, reduce by $A \to \alpha$ only if the next input symbol is exactly $a$.
This precision resolves many conflicts that an SLR(1) parser would have, but it results in a much larger number of states.
### LALR(1) - Look-Ahead LR
The LALR(1) parser is a compromise between SLR(1) and CLR(1). It is constructed by taking the CLR(1) collection of states and merging any states that have the same set of core LR(0) items (i.e., ignoring the lookaheads). The lookaheads for the merged items are simply the union of the original lookaheads.
- Advantage: The number of states is the same as in an SLR(1) parser.
- Power: It is more powerful than SLR(1) but less powerful than CLR(1).
- Conflict Propagation: Merging states can potentially introduce reduce-reduce conflicts that were not present in the original CLR(1) states. However, it cannot introduce new shift-reduce conflicts. If a grammar is CLR(1), it is LALR(1) only if no such new R-R conflicts are introduced.
The power hierarchy is $LR(0) \subset SLR(1) \subset LALR(1) \subset CLR(1)$. All LR grammars are unambiguous. The key difference lies in the information used for reduction decisions:
- SLR(1): FOLLOW sets (context-insensitive).
- CLR(1): Precise lookaheads computed for each item (context-sensitive).
- LALR(1): A practical merge of CLR(1) states, providing a good balance of power and size.
---
## 5. Operator Precedence Parsing
This is a specialized form of shift-reduce parsing applicable only to a class of grammars known as operator grammars. It does not require building a full LR table. Instead, it uses precedence relations between terminal symbols to decide whether to shift or reduce.
The three precedence relations are:
- $a \lessdot b$: $a$ yields precedence to $b$.
- $a \gtrdot b$: $a$ takes precedence over $b$.
- $a \doteq b$: $a$ has the same precedence as $b$.
The parser uses a stack and these relations to find handles. A handle is located when the symbol at the top of the stack has the `takes precedence` relation ($\gtrdot$) with the next input symbol.
Worked Example:
Problem: Evaluate the expression using the following custom operator precedence and associativity rules.
| Operator | Precedence | Associativity |
| :---: | :---: | :---: |
| + | High | Right |
| * | Medium | Left |
| - | Low | Left |
Solution:
We can think of this as a shift-reduce process guided by precedence. The highest precedence operator forms a handle first.
Step 1: Analyze the expression .
The operators are `*`, `+`, `-`. According to the table, `+` has the highest precedence.
Step 2: Identify the first handle based on highest precedence.
The sub-expression around the `+` operator is . We evaluate this first.
Step 3: Analyze the remaining expression .
The remaining operators are `*` (Medium) and `-` (Low). The `*` operator has higher precedence.
Step 4: Evaluate the next handle.
The sub-expression around the `*` operator is evaluated next; it yields 10.
Step 5: Evaluate the final expression.
Answer: 46
(Note: Associativity rules (Left/Right) are used to resolve ambiguity when multiple operators of the same precedence appear consecutively. For instance, if `+` were left-associative, in an expression `a + b + c`, `(a+b)` would be evaluated first.)
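Precedence-and-associativity-driven evaluation can be sketched as a precedence-climbing loop. The table below mirrors the chapter's custom ranking (`+` highest and right-associative, `*` medium, `-` low), but the expression evaluated is our own illustrative one, not the one from the worked example; higher numbers bind tighter.

```python
# Precedence-climbing evaluation under a custom precedence/associativity
# table (sketch). Higher number = binds tighter.
PREC  = {"+": 3, "*": 2, "-": 1}          # '+' highest, '-' lowest
RIGHT = {"+"}                              # only '+' is right-associative

def evaluate(tokens, min_prec=0):
    """tokens: flat list like [2, '*', 3, '+', 5]; consumed destructively."""
    left = tokens.pop(0)                   # operand (plain numbers only here)
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # right-associative ops allow equal precedence on their right side
        next_min = PREC[op] if op in RIGHT else PREC[op] + 1
        right = evaluate(tokens, next_min)
        left = {"+": left + right, "*": left * right, "-": left - right}[op]
    return left

# '+' binds tightest here, so 2 * 3 + 5 - 1 groups as (2 * (3 + 5)) - 1
print(evaluate([2, "*", 3, "+", 5, "-", 1]))   # 2 * 8 = 16, then 16 - 1 = 15
```

With the usual arithmetic table (`*` above `+`), the same loop would instead produce 10, which is exactly the point of the exercise: the grouping follows the table, not convention.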
---
Problem-Solving Strategies
When asked to determine if a grammar is SLR(1), LALR(1), or CLR(1), you rarely need to construct the full parser. Instead, look for classic conflict patterns.
- For SLR(1): Find a state with a reduce item $A \to \alpha \bullet$ and a shift item $B \to \beta \bullet a \gamma$. If $a$ is in $FOLLOW(A)$, there is a shift-reduce conflict.
- For LALR(1) vs. CLR(1): The classic example is a grammar where two different CLR(1) states, say $I_i$ and $I_j$, have the same core of LR(0) items but different lookaheads on their reduce items. When they are merged into a single LALR(1) state, the lookahead sets are unioned; if the unioned lookaheads of two distinct reduce items now overlap, it becomes a reduce-reduce conflict in the LALR(1) parser.
---
Common Mistakes
- ❌ Incorrect CLOSURE Calculation: Forgetting to recursively add items. If you add an item $B \to \bullet C \gamma$, you must then check for productions of $C$ and add their items with the dot at the beginning as well.
- ❌ Confusing FOLLOW with Lookaheads: Using FOLLOW sets for CLR(1) or LALR(1) reduction decisions; these parsers reduce only on an item's own lookaheads, which may be a proper subset of the FOLLOW set.
- ❌ Forgetting the Augmented Production: Starting the `CLOSURE` process with the original start production instead of the augmented item $S' \to \bullet S$.
- ❌ Assuming Unambiguous implies LR: Believing that any unambiguous grammar is parseable by an LR parser.
---
Practice Questions
:::question type="MCQ" question="Which of the following parser types can be classified as a bottom-up parser?" options=["Recursive Descent Parser","LL(1) Parser","Operator Precedence Parser","Predictive Parser"] answer="Operator Precedence Parser" hint="Consider the fundamental strategy of each parser. Does it build the parse tree from leaves to root or root to leaves?" solution="Recursive Descent, LL(1), and Predictive Parsers are all Top-Down parsers; they start from the grammar's start symbol and try to derive the input string. An Operator Precedence Parser is a type of Shift-Reduce parser, which is a bottom-up technique."
:::
:::question type="NAT" question="Consider the following augmented grammar, where the set of terminals is {x, y, z}:
S' → S
S → xA
A → yA | z
Let I1 = GOTO(I0, x). The number of items in the set I1 is ______." answer="3" hint="First, compute the set I0 by applying the CLOSURE operation to {S' → .S}. Then, identify all items in I0 where the dot is before an 'x'. Move the dot past 'x' for these items and compute the closure of the resulting set." solution="
Step 1: Compute I0 = CLOSURE({S' → .S}).
Starting set: {S' → .S}
Add S → .xA for the non-terminal S appearing after the dot: {S' → .S, S → .xA}
The dot in S → .xA precedes the terminal x, so nothing more is added.
This set is closed. So, I0 = {S' → .S, S → .xA}.
Step 2: Identify the kernel of I1 = GOTO(I0, x).
Look for items in I0 with the dot immediately before x. There is one: S → .xA.
Advancing the dot gives the kernel: {S → x.A}.
Step 3: Compute the closure of the kernel.
The dot is before the non-terminal A. We must add all productions for A with the dot at the beginning.
The new items are A → .yA and A → .z.
The final set is I1 = {S → x.A, A → .yA, A → .z}.
Result: The set I1 contains 3 items.
"
:::
:::question type="MSQ" question="Consider the hierarchy of LR parsers. Which of the following statements is/are TRUE?" options=["Every grammar that is LALR(1) is also CLR(1).","A grammar that has a reduce-reduce conflict in its LALR(1) parsing table might not have a reduce-reduce conflict in its CLR(1) parsing table.","The number of states in an LALR(1) parser is always less than or equal to the number of states in a CLR(1) parser for the same grammar.","Every unambiguous grammar is SLR(1)."] answer="A,B,C" hint="Analyze the power relationship and the construction method for each parser type. Recall how LALR(1) states are formed from CLR(1) states." solution="
- A: This is true by definition of the hierarchy. The set of LALR(1) grammars is a proper subset of the set of CLR(1) grammars.
- B: This is true. The LALR(1) construction merges CLR(1) states with the same core. This merging can combine two reduce items with different lookaheads into a single state, potentially creating a new reduce-reduce conflict.
- C: This is true. The LALR(1) parser is constructed by merging states of the CLR(1) parser, so its state count can only be less than or equal to the CLR(1) parser's state count.
- D: This is false. There are many unambiguous grammars that are not SLR(1). A classic example is one that requires more precise lookaheads than what the FOLLOW set can provide.
"
:::
:::question type="NAT" question="An expression `8 - 4 - 2 * 3 * 2` is evaluated according to the following rules: `*` has higher precedence and is right-associative. `-` has lower precedence and is right-associative. The value of the expression is ______." answer="16" hint="Evaluate the highest precedence operator first. For operators of the same precedence, apply the associativity rule. Right-associativity means grouping from the right." solution="
Step 1: The expression is `8 - 4 - 2 * 3 * 2`. The `*` operator has higher precedence.
Step 2: The sub-expression with `*` is `2 * 3 * 2`. Since `*` is right-associative, the rightmost `*` is evaluated first: `3 * 2 = 6`, and then `2 * 6 = 12`.
Step 3: The expression becomes `8 - 4 - 12`.
Step 4: The `-` operator is right-associative. Therefore, the rightmost `-` is evaluated first: `4 - 12 = -8`.
Step 5: Compute the final result: `8 - (-8) = 16`.
Result: The value is 16.
"
:::
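The grouping rule for right-associative operators can be checked mechanically. The sketch below uses an illustrative `fold_right` helper (not a standard library function) to evaluate each operator chain of `8 - 4 - 2 * 3 * 2` grouped from the right.

```python
from functools import reduce

def fold_right(values, op):
    """Apply a binary operator over a list with right associativity:
    [a, b, c] evaluates as op(a, op(b, c))."""
    return reduce(lambda acc, v: op(v, acc), reversed(values[:-1]), values[-1])

# Highest precedence first: the chain 2 * 3 * 2, grouped from the right.
product = fold_right([2, 3, 2], lambda a, b: a * b)       # 2 * (3 * 2) = 12

# Then the '-' chain 8 - 4 - 12, also grouped from the right.
result = fold_right([8, 4, product], lambda a, b: a - b)  # 8 - (4 - 12) = 16
print(result)  # 16
```

Note that associativity only changes the answer for non-commutative, non-associative operators such as `-`: left-associative grouping of `8 - 4 - 12` would give `-8`, not `16`.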
---
Summary
- Bottom-Up Parsing Strategy: The core idea is Shift-Reduce. The parser builds the parse tree from the leaves (input tokens) up to the root (start symbol). The critical challenge is resolving shift-reduce and reduce-reduce conflicts.
- LR Parser Hierarchy and Power: The strict hierarchy of parsing power is LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ CLR(1). All LR grammars are unambiguous. CLR(1) is the most powerful but has the most states, while LALR(1) offers a practical compromise widely used in real-world compilers.
- Conflict Resolution Mechanism: The key differentiator is the information used to make a reduction decision. SLR(1) uses the general FOLLOW set of the non-terminal. CLR(1) and LALR(1) use more specific lookahead symbols, which resolves more conflicts than the SLR(1) approach.
- Construction Fundamentals (CLOSURE and GOTO): Questions requiring you to compute the number of items in a state are common. A mastery of the `CLOSURE` and `GOTO` algorithms for LR(0) items is essential for solving these accurately and quickly.
---
What's Next?
This topic connects to:
- Top-Down Parsing (LL Parsers): It is crucial to contrast the bottom-up approach with the top-down strategies of LL(1) and recursive descent parsing. Understand the types of grammars each can handle (e.g., left-recursive grammars are problematic for top-down parsers but handled naturally by bottom-up parsers).
- Syntax Directed Translation (SDT): Parsing is not an end in itself; it drives translation. Learn how semantic actions are attached to grammar productions and executed during the reduction steps of a bottom-up parser to build Abstract Syntax Trees (ASTs) or generate intermediate code.
---
Chapter Summary
In this chapter, we have explored the principles and techniques of syntax analysis, the second phase of the compilation process. As we conclude, it is essential to consolidate the core concepts that are fundamental for the GATE examination.
- The Role of the Parser: The primary function of the parser is to accept a stream of tokens from the lexical analyzer and verify if this stream can be generated by the grammar for the source language. It constructs a parse tree, which represents the syntactic structure of the program, for use in subsequent compilation phases.
- Top-Down vs. Bottom-Up Parsing: We have examined two fundamental strategies. Top-down parsing, exemplified by LL(1) parsers, attempts to construct the parse tree from the root (start symbol) downwards. Bottom-up parsing, characteristic of LR parsers, builds the tree from the leaves (the input string) upwards to the root.
- Predictive Parsing (LL(1)): For a grammar to be LL(1), it must be unambiguous, free of left recursion, and free of common prefixes. The construction of an LL(1) parser is driven by a parsing table, which is populated using the FIRST and FOLLOW sets of the non-terminals in the grammar. A conflict in this table indicates that the grammar is not LL(1).
- LR Parsing Hierarchy: The family of LR parsers (LR(0), SLR(1), LALR(1), CLR(1)) represents the most powerful class of parsers for deterministic context-free languages. Their power is distinguished by the lookahead information used to make parsing decisions. The hierarchy of parsing power is: LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ CLR(1).
- Parsing Conflicts: The central challenge in constructing LR parsers is the resolution of conflicts. A Shift-Reduce (S/R) conflict occurs when the parser can either shift the next input symbol or reduce by a completed production. A Reduce-Reduce (R/R) conflict occurs when multiple reduction moves are possible. A grammar for which a conflict-free LR parsing table cannot be built is not in that LR class.
- Canonical Item Sets: The construction of all LR parsing tables is based on the creation of a DFA whose states are sets of LR(0) or LR(1) items. The core operations for this construction are the `Closure` and `Goto` functions, which systematically explore all possible configurations of the parser.
---
Chapter Review Questions
:::question type="MCQ" question="Consider the standard expression grammar G:
E → E + T | T
T → T * F | F
F → ( E ) | id
Which of the following statements about G is FALSE?" options=["The grammar G is unambiguous.","The grammar G is not LL(1).","The grammar G is LALR(1).","A predictive parser can be constructed for G without any modification."] answer="D" hint="Recall the requirements for a grammar to be suitable for predictive (LL(1)) parsing. Consider the implications of left recursion." solution="
A. The grammar G is unambiguous. This statement is True. The grammar is structured to enforce the precedence of * over + and left-associativity for both operators, thereby ensuring that any given expression has only one unique parse tree.
B. The grammar G is not LL(1). This statement is True. The productions E → E + T and T → T * F are left-recursive. A grammar containing left recursion can never be LL(1), as a top-down parser would enter an infinite loop.
C. The grammar G is LALR(1). This statement is True. This is a classic example of a grammar that is not LL(1) but is parsable by bottom-up methods. It is LALR(1) and, by extension, CLR(1). Most programming language expression grammars are designed to be LALR(1).
D. A predictive parser can be constructed for G without any modification. This statement is False. A predictive parser is an LL(1) parser. As established in point B, the grammar is not LL(1) due to left recursion. To build a predictive parser, the grammar must first be modified to eliminate left recursion. For example, E → E + T | T would be transformed into:
E → T E'
E' → + T E' | ε
Since the statement claims construction is possible without modification, it is false.
"
:::
:::question type="NAT" question="Consider the following context-free grammar:
S → AA
A → aA | b
Calculate the number of states in the canonical collection of LR(0) items (the DFA of item sets) for this grammar." answer="7" hint="Begin with the augmented grammar S' → S. Systematically compute the closure and goto functions for each new set of items until no new sets are generated. Remember to count the initial state." solution="
We start with the augmented grammar: S' → S, S → AA, A → aA | b.
I0 = CLOSURE({S' → .S}) = {S' → .S, S → .AA, A → .aA, A → .b}
- GOTO(I0, S) = I1 = {S' → S.} (Accept state)
- GOTO(I0, A) = I2 = {S → A.A, A → .aA, A → .b}
- GOTO(I0, a) = I3 = {A → a.A, A → .aA, A → .b}
- GOTO(I0, b) = I4 = {A → b.}
Now we process the new states:
- From I2: GOTO(I2, A) = I5 = {S → AA.}; GOTO(I2, a) = I3; GOTO(I2, b) = I4.
- From I3: GOTO(I3, A) = I6 = {A → aA.}; GOTO(I3, a) = I3; GOTO(I3, b) = I4.
- States I4, I5, and I6 have no GOTO transitions, as the dot is at the end of their items.
- State I1 is the accept state.
No new states are generated. The complete set of states is {I0, I1, I2, I3, I4, I5, I6}.
Therefore, the total number of states is 7.
"
:::
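State counts like this can be verified by brute force. Below is a small, self-contained sketch (illustrative encoding, assuming the classic grammar S → AA, A → aA | b) that enumerates the canonical LR(0) collection with a work-list over CLOSURE and GOTO.

```python
# An LR(0) item is a pair (production_index, dot_position).
GRAMMAR = [
    ("S'", ("S",)),       # 0: augmented production
    ("S",  ("A", "A")),   # 1
    ("A",  ("a", "A")),   # 2
    ("A",  ("b",)),       # 3
]
NONTERMINALS = {"S'", "S", "A"}

def closure(items):
    """Add [B -> .gamma] for every non-terminal B after a dot."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (head, _) in enumerate(GRAMMAR):
                if head == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)

def goto(state, symbol):
    """Advance the dot over `symbol` and take the closure."""
    moved = {(p, d + 1) for (p, d) in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved) if moved else None

symbols = {"S", "A", "a", "b"}
states = {closure({(0, 0)})}   # start from CLOSURE({[S' -> .S]})
work = list(states)
while work:
    state = work.pop()
    for sym in symbols:
        nxt = goto(state, sym)
        if nxt and nxt not in states:
            states.add(nxt)
            work.append(nxt)

print(len(states))  # 7
```

The work-list loop mirrors the by-hand procedure: each newly discovered item set is queued, and the algorithm stops when every GOTO transition leads to an already-known state.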
:::question type="MSQ" question="Consider the following grammar and a state from its canonical collection of LR(0) items:
Grammar G:
S → L = R | R
L → * R | id
R → L
State I2 = {S → L.= R, R → L.}
Which of the following statements is/are correct?" options=["State I2 contains a Shift-Reduce conflict for an SLR(1) parser.","The grammar G is ambiguous.","The conflict in state I2 can be resolved by a CLR(1) or LALR(1) parser.","The grammar G is LL(1)."] answer="A,C" hint="To check for an SLR(1) conflict, compare the shift-on symbol with the FOLLOW set of the non-terminal on the left-hand side of the reduce item. Consider if more powerful parsers with precise lookaheads could resolve this." solution="
Let's analyze the given state for an SLR(1) parser.
Let's compute FOLLOW(R):
- From S → L = R, the symbol `=` is in FOLLOW(L).
- From S → R, anything in FOLLOW(S) is in FOLLOW(R). FOLLOW(S) contains `$`.
- From L → * R, where R ends the production, anything in FOLLOW(L) is in FOLLOW(R).
- This means FOLLOW(R) contains at least `{=, $}`.
In state I2, on input symbol `=`, we have two possible actions:
- Shift (from the item S → L.= R)
- Reduce by R → L (since `=` is in FOLLOW(R))
This constitutes a Shift-Reduce conflict. Therefore:
A. State I2 contains a Shift-Reduce conflict for an SLR(1) parser. This statement is Correct. The conflict exists on the terminal `=`.
B. The grammar G is ambiguous. This statement is Incorrect. This is a well-known example of a grammar that is not SLR(1) but is LALR(1) and unambiguous. It models assignment statements with l-values and r-values.
C. The conflict in state I2 can be resolved by a CLR(1) or LALR(1) parser. This statement is Correct. A CLR(1) parser would associate a specific lookahead with the reduce item: in this state the item is [R → L., $], because the reduction R → L applies here only when the whole input is an r-value (derived via S → R). The shift action on `=` comes from the production S → L = R. Since the reduce lookahead (`$`) and the shift symbol (`=`) are disjoint, a CLR(1) parser (and, for this grammar, an LALR(1) parser) can resolve the conflict.
D. The grammar G is LL(1). This statement is Incorrect. The two alternatives of S, namely L = R and R, can both begin with `*` or `id` (since R → L), so FIRST(L = R) and FIRST(R) overlap, and a predictive parser cannot choose a production with a single symbol of lookahead.
"
:::
:::question type="MCQ" question="For the grammar below, what is the FOLLOW set of the non-terminal A?
S → ACB | CbB | Ba
A → da | BC
B → g | ε
C → h | ε" options=["{h, g, $}","{g, h, b}","{h, g, a}","{$, b, a}"] answer="A" hint="FOLLOW(A) is determined by what can appear immediately to its right. Consider the production S -> ACB. You need to compute FIRST(CB). Remember to account for the possibility that CB can derive epsilon." solution="
We need to compute FOLLOW(A). We apply the rules for computing FOLLOW sets.
- From S → ACB, everything in FIRST(CB) except ε is in FOLLOW(A).
- We need to calculate FIRST(CB).
- FIRST(C) = {h, ε}
- FIRST(B) = {g, ε}
- Since C can derive ε, FIRST(CB) = (FIRST(C) − {ε}) ∪ FIRST(B) = {h, g, ε}.
- Therefore, we add {h, g} to FOLLOW(A).
- If the string to the right of A can derive ε, everything in FOLLOW of the production's left-hand side is also in FOLLOW(A).
- In our case, from S → ACB, since FIRST(CB) contains ε (as both C and B can be ε), we must add everything in FOLLOW(S) to FOLLOW(A).
- FOLLOW(S) contains $, so FOLLOW(A) = {h, g} ∪ {$} = {h, g, $}.
Combining all results, the final set is FOLLOW(A) = {h, g, $}.
"
:::
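FIRST and FOLLOW computations like this one can be automated with a simple fixed-point iteration. The sketch below is illustrative (names and encoding are not from any standard tool) and uses the grammar S → ACB | CbB | Ba, A → da | BC, B → g | ε, C → h | ε, with the empty string standing for ε.

```python
# Grammar as head -> list of bodies; [] is an epsilon production.
GRAMMAR = {
    "S": [["A", "C", "B"], ["C", "b", "B"], ["B", "a"]],
    "A": [["d", "a"], ["B", "C"]],
    "B": [["g"], []],
    "C": [["h"], []],
}
EPS = ""

first = {nt: set() for nt in GRAMMAR}
follow = {nt: set() for nt in GRAMMAR}
follow["S"].add("$")  # end-marker follows the start symbol

def first_of(seq):
    """FIRST of a sentential form; includes ε iff every symbol is nullable."""
    out = set()
    for sym in seq:
        syms = first[sym] if sym in GRAMMAR else {sym}
        out |= syms - {EPS}
        if EPS not in syms:
            return out
    out.add(EPS)
    return out

changed = True
while changed:  # iterate until neither FIRST nor FOLLOW grows
    changed = False
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            for t in first_of(body):          # FIRST rule
                if t not in first[head]:
                    first[head].add(t); changed = True
            for i, sym in enumerate(body):    # FOLLOW rules
                if sym not in GRAMMAR:
                    continue
                rest = first_of(body[i + 1:])
                add = (rest - {EPS}) | (follow[head] if EPS in rest else set())
                for t in add:
                    if t not in follow[sym]:
                        follow[sym].add(t); changed = True

print(sorted(follow["A"]))  # ['$', 'g', 'h']
```

The fixed-point loop is necessary because FOLLOW sets depend on each other (for example, FOLLOW(A) absorbs FOLLOW(S)), so a single pass over the productions is not guaranteed to be complete.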
---
What's Next?
Having completed our study of Parsing, you have mastered the principles of syntactic structure recognition, a cornerstone of compiler design. This knowledge forms a critical bridge between the lexical and semantic phases of compilation.
Key connections:
- Relation to Previous Chapters: This chapter builds directly upon Lexical Analysis. The token stream, which we treated as the fundamental input here, is the direct output of the lexical analyzer. Our work also represents a practical application of the theory from Formal Languages and Automata, where we moved from the regular expressions and finite automata of lexical analysis to the context-free grammars and pushdown automata that model parsing.
- Foundation for Future Chapters: The parse tree, which is the ultimate output of the syntax analysis phase, is the central data structure for the next stages of compilation.