
Please read this disclaimer before
proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only for the respective group / learning
community as addressed. If you are not the addressee, you should not disseminate,
distribute or copy it through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake, and delete this document from your
system. If you are not the intended recipient, you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.

22CS601
Compiler Design (Lab Integrated)

Department: CSE

Batch / Year: 2022 - 2026 / III Year CSE

Created by:

Dr. P. Ezhumalai, Professor / CSE / RMDEC
Dr. Selvi S, Professor / CSE / RMKEC
Dr. A. K. Jaithunbi, ASP / CSE / RMDEC
Dr. Pranamita Nanda, ASP / CSE / RMKCET (Course Coordinator)
Mr. Prabhu V, Associate Professor / CSE / RMKEC
Mr. D. Jayakumar, Asst. Professor / CSE / RMDEC

Date: 18.12.2024

CONTENTS

1. Course Objectives
2. Prerequisites
3. Syllabus
4. Course Outcomes
5. CO - PO / PSO Mapping
6. Lecture Plan
7. Activity Based Learning
8. Lecture Notes
9. Assignments
10. Part A Questions & Answers
11. Part B Questions
12. Supportive Online Certification Courses
13. Real-time Applications
14. Contents beyond the Syllabus
15. Assessment Schedule
16. Prescribed Text Books & Reference Books
17. Mini Project Suggestions

1. COURSE OBJECTIVES

• To study the different phases of compiler

• To understand the techniques for tokenization and parsing

• To understand the conversion of source program into an intermediate representation

• To learn the different techniques used for assembly code generation

• To analyze various code optimization techniques

2. PRE REQUISITES

Semester VI : 22CS601 Compiler Design (Lab Integrated)
Semester V  : 22CS502 Theory of Computation (Lab Integrated)
Semester III: 22MA301 Discrete Mathematics
Semester II : 22CS201 Data Structures (Lab Integrated)
Semester II : 22IT202 DBMS (Lab Integrated)
Semester I  : 22CS101 Problem Solving using C++ (Lab Integrated)

3. SYLLABUS

22CS601 COMPILER DESIGN (LAB INTEGRATED)    L T P C : 3 0 2 4

OBJECTIVES

• To study the different phases of compiler
• To understand the techniques for tokenization and parsing
• To understand the conversion of source program into an intermediate representation
• To learn the different techniques used for assembly code generation
• To analyze various code optimization techniques

UNIT I INTRODUCTION TO COMPILERS 9 + 6 = 15


Introduction–Structure of a Compiler–Role of the Lexical Analyzer - Input Buffering -
Specification of Tokens - Recognition of Tokens–The Lexical Analyzer Generator LEX- Finite
Automata - Regular Expressions to NFA-Optimization of DFA based pattern matches -
Conversion from NFA to DFA - Minimization of DFA.
UNIT II SYNTAX ANALYSIS 9 + 6 = 15
Role of the Parser - Context-free grammars – Derivation Trees – Ambiguity in Grammars and
Languages- Writing a grammar – Types of parsing - Top-Down Parsing – Predictive parser or
LL(1) Parser –Bottom-Up Parsing – Shift Reduce Parser - LR Parsers - SLR, CLR, LALR Parser -
Parser Generators YACC.
UNIT III INTERMEDIATE CODE GENERATION 9 + 6 = 15
Syntax Directed Definitions - Evaluation Orders for Syntax Directed Definitions – Application of
Syntax Directed Translation - Intermediate Languages - Syntax Tree – Three Address Code –
Implementation of Three address code – Declarations - Translation of Expressions - Type
Checking
UNIT IV RUN-TIME ENVIRONMENT AND CODE GENERATION 9 + 6 = 15
Run Time Environment: Storage Organization-Storage allocation strategies - Access to nonlocal
data on stack – Heap management - Parameter Passing - Issues in the design of Code
Generator – Design of simple Code Generator –Register allocation and assignment.
UNIT V CODE OPTIMIZATION 9 + 6 = 15

Principal sources of optimization – Peephole Optimization – DAG construction - Basic blocks and
flow graph - Optimization in Basic blocks – Data flow analysis.

3. SYLLABUS

LIST OF EXPERIMENTS:
1. Develop a lexical analyzer to recognize a few patterns in C (e.g., identifiers,
constants, comments, operators). Create a symbol table while recognizing
identifiers.
2. Design a lexical analyzer for the given language. The lexical analyzer should
ignore redundant spaces, tabs, new lines, comments, etc.
3. Implement a lexical analyzer using the Lex tool.
4. Design a predictive parser for the given language.
5. Implement an arithmetic calculator using LEX and YACC.
6. Generate three-address code for a simple program using LEX and YACC.
7. Implement simple code optimization techniques (constant folding, strength
reduction and algebraic transformation).
8. Implement the back-end of the compiler, for which three-address code is
given as input and 8086 assembly language code is produced as output.

4. COURSE OUTCOMES

Upon completion of the course, the students will be able to:

CO1: Explain the different phases of compiler (K2)
CO2: Describe the fundamental components of a compiler (K3)
CO3: Design and implement a lexical analyzer using finite automata and regular expressions (K3)
CO4: Compare various parsing techniques (K3)
CO5: Implement code optimization techniques with simple code generators (K3)
CO6: Develop code generation strategies that translate intermediate representations into target machine code (K3)

• HKL = Highest Knowledge Level

5. CO - PO / PSO MAPPING

CO   HKL  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1  K2    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -
CO2  K3    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -
CO3  K3    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -
CO4  K3    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -
CO5  K3    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -
CO6  K3    3   2   1   -   -   -   -   1   1    1    -    1    2    -    -

• Correlation Level:
• 1. Slight (Low)
• 2. Moderate (Medium)
• 3. Substantial (High)
If there is no correlation, put "-".

6. LECTURE PLAN

UNIT – I INTRODUCTION TO COMPILERS

All nine sessions address CO1, use text book T1, and are delivered through MD1 & MD5; the actual delivery dates matched the proposed dates.

1. 18.12.2024: Introduction to compilers (K2). Outcome: define compiler and discuss the purpose of a compiler.
2. 19.12.2024: Structure of a Compiler - Phases (K3). Outcome: define the structure of a compiler.
3. 20.12.2024: Role of the Lexical Analyzer, Input Buffering (K2). Outcome: define the role of the lexical analyzer.
4. 21.12.2024: Specification of Tokens - Recognition of Tokens (K3). Outcome: explain the specification and recognition of tokens.
5. 23.12.2024: The Lexical Analyzer Generator LEX (K3). Outcome: explain how to use the LEX tool to design a lexical analyzer.
6. 24.12.2024: Finite Automata, Regular Expressions to NFA (K3). Outcome: define finite automata and its types.
7. 26.12.2024: Optimization of DFA based pattern matches (K3). Outcome: explain the ways to convert a regular expression to automata.
8. 27.12.2024: Conversion from NFA to DFA (K3). Outcome: discuss the steps to convert an NFA to a DFA.
9. 02.01.2025: Minimization of Automata (K3). Outcome: apply the minimization algorithm to reduce the states in a DFA.

• ASSESSMENT COMPONENTS
• AC 1. Unit Test
• AC 2. Assignment
• AC 3. Course Seminar
• AC 4. Course Quiz
• AC 5. Case Study
• AC 6. Record Work
• AC 7. Lab / Mini Project
• AC 8. Lab Model Exam
• AC 9. Project Review

• MODES OF DELIVERY
• MD 1. Oral Presentation
• MD 2. Tutorial
• MD 3. Seminar
• MD 4. Hands On
• MD 5. Videos
• MD 6. Field Visit
7. ACTIVITY BASED LEARNING

• To understand the basic concepts of compilers, students take a quiz as an activity.

• The link for the quiz is given below:

https://forms.gle/2HmsNdtCTFCyfBSU7

Exercises:

1. Draw a DFA for the language accepting strings ending with ‘abb’ over
input alphabets ∑ = {a, b}

2. Draw a DFA for the language accepting strings not having substring ‘001’ over
input alphabets ∑ = {0, 1}
3. Construct a DFA that accepts a language L over input alphabets ∑={a, b} such that L is
the set of all strings starting with ‘ba’ or ‘ab’

4. Design an NFA with ∑ = {0, 1} in which double '1' is followed by double '1'.

5. Design an NFA with ∑ = {0, 1} that accepts all strings in which the third symbol
from the right end is always 0.

Practice constructing the above automata using the JFLAP software:
http://www.jflap.org/
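As a quick self-check for Exercise 1, a DFA can also be simulated in a few lines of Python. This is an illustrative sketch only; the state names q0..q3 and the transition table are one possible solution, not part of the course material.

```python
# DFA for strings over {a, b} that end in 'abb' (one possible construction).
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q3",
    ("q3", "a"): "q1", ("q3", "b"): "q0",
}

def accepts(word: str) -> bool:
    """Run the DFA from start state q0; accept iff it halts in q3."""
    state = "q0"
    for ch in word:
        state = DELTA[(state, ch)]
    return state == "q3"

print(accepts("aabb"))   # True: the string ends in 'abb'
print(accepts("abba"))   # False
```

Tools such as JFLAP animate exactly this table-driven simulation.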

8. LECTURE NOTES

UNIT – I INTRODUCTION TO COMPILERS

INTRODUCTION
A compiler is a program that can read a program in one language and translate it
into an equivalent program in another language.
Types of Compiler:
Cross Compiler: A cross compiler is a compiler that can create executable code
for a platform other than the one on which the compiler is running.
Source-to-Source Compiler: It takes the source code of a program written in
one programming language as its input and generates the equivalent source code
in a different programming language.
OVERVIEW OF LANGUAGE PROCESSING SYSTEM

Preprocessor

A preprocessor produces input to compilers. It may perform the following
functions:

1. Macro processing: A preprocessor may allow a user to define macros that are
shorthands for longer constructs.

2. File inclusion: A preprocessor may include header files into the program text.

3. Rational preprocessor: These preprocessors augment older languages with more
modern flow-of-control and data-structuring facilities.

4. Language extensions: These preprocessors attempt to add capabilities to the
language by what amounts to built-in macros.

COMPILER

A compiler is a translator program that takes a program written in a high-level
language (HLL), the source program, and translates it into an equivalent program
in a machine-level language (MLL), the target program. An important part of a
compiler is reporting errors to the programmer.

    Source Program --> COMPILER --> Target Program
                          |
                    Error Messages

Executing a program written in a HLL programming language basically consists of
two parts: the source program must first be compiled (translated) into an object
program; then the resulting object program is loaded into memory and executed.

    Source Program --> COMPILER --> Object Program
    Input --> OBJECT PROGRAM --> Output

ASSEMBLER:

Programmers found it difficult to write or read programs in machine language.
They began to use a mnemonic (symbol) for each machine instruction, which they
would subsequently translate into machine language. Such a mnemonic machine
language is now called an assembly language. Programs known as assemblers were
written to automate the translation of assembly language into machine language.
The input to an assembler program is called the source program; the output is a
machine language translation (object program).

Loader and Link-editor:

Once the assembler produces an object program, that program must be placed
into memory and executed. The assembler could place the object program directly
in memory and transfer control to it, thereby causing the machine language
program to be executed. However, this would waste core by leaving the assembler
in memory while the user's program was being executed. Also, the programmer
would have to retranslate the program with each execution, thus wasting
translation time. To overcome this problem of wasted translation time and memory,
system programmers developed another component called a loader.

A loader is a program that places programs into memory and prepares them for
execution. It would be more efficient if subroutines could be translated into object
form that the loader could "relocate" directly behind the user's program. The task
of adjusting programs so that they may be placed in arbitrary core locations is
called relocation. Relocating loaders perform four functions.

TRANSLATOR
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program
translation, the translator performs another very important role: error detection.
Any violation of the HLL (High-Level Language) specification is detected and
reported to the programmer. The important roles of a translator are:

• Translating the HLL program input into an equivalent machine language (ML) program.

• Providing diagnostic messages wherever the programmer violates the
specification of the HLL.

TYPES OF TRANSLATORS:

1) INTERPRETER 2) COMPILER 3) PREPROCESSOR

INTERPRETER: An interpreter is a program that appears to execute a source
program as if it were machine language.

    Source Program + Input --> INTERPRETER --> Output

Languages such as BASIC, COBOL and LISP can be translated using interpreters.
JAVA also uses an interpreter. The process of interpretation can be carried out in
the following phases:

1. Lexical analysis

2. Syntax analysis

3. Semantic analysis

4. Direct execution

Advantages:

• Modification of the user program can easily be made and implemented as
execution proceeds.

• The type of object that a variable denotes may change dynamically.

• Debugging a program and finding errors is a simplified task for a program
used for interpretation.

• The interpreter for the language makes it machine independent.

Disadvantages:

• The execution of the program is slower.
• Memory consumption is higher.
STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler:
A compiler operates in phases. A phase is a logically interrelated operation that
takes the source program in one representation and produces output in another
representation. The compilation process is partitioned into a number of
sub-processes called 'phases'.

The phases of a compiler are divided into two parts:

a. Analysis
b. Synthesis

The analysis phase is also called the front end; here the source program is broken
up into constituent pieces and a grammatical structure is imposed on them. It
includes the first four phases of the compiler.

The synthesis phase is also called the back end of the compiler; here the desired
target program is constructed from the intermediate representation. It includes
the last two phases of the compiler.

The phases of a compiler are:

1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation

Lexical Analysis
Lexical analysis is the first phase of the compiler. Another name for the lexical
analyzer is text scanner. Lexical analysis scans the source code as a stream of
characters and converts it into meaningful lexemes. The lexical analyzer represents
these lexemes in the form of tokens. The form of a token is
<token name, attribute value>.

Syntax Analysis:
The second phase of compilation is called syntax analysis or parsing. In this phase,
the syntax analyzer reads the tokens as input and produces the parse tree. In the
syntax analysis phase, the parser checks whether the expression made by the
tokens is syntactically correct or not. The parse tree is also called the derivation
tree. Parse trees are generally constructed to check for ambiguity in the given
grammar. There are certain rules associated with the derivation tree.

Semantic Analysis:
In this phase, semantic analysis reads the parse tree as the input and produces the
same parse tree with semantic consistency. The semantic analyzer uses the syntax
tree and the information in the symbol table to check the source program for
semantic consistency with the language definition. It also gathers type information
and saves it in either the syntax tree or the symbol table.

An important part of semantic analysis is type checking, where the compiler checks
that each operator has matching operands. For example, many programming
language definitions require an array index to be an integer; the compiler must
report an error if a floating-point number is used to index an array.

Intermediate Code Generation:

In the process of translating a source program into target code, a compiler may
construct one or more intermediate representations, which can have a variety of
forms. Syntax trees are a form of intermediate representation; they are commonly
used during syntax and semantic analysis. This intermediate representation should
have two important properties:

• It should be easy to produce.

• It should be easy to translate into the target machine.

Code Optimization :-
The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result. Usually better means faster,
but other objectives may be desired, such as shorter code, or target code that
consumes less power. Optimization can be assumed as something that removes
unnecessary code lines, and arranges the sequence of statements in order to speed
up the program execution without wasting resources.

Code Generation :
Code generation is the final stage of the compilation process. The code generator
takes as input an intermediate representation of the source program and maps it
into the target language. If the target language is machine code, registers or
memory locations are selected for each of the variables used by the program. Then,
the intermediate instructions are translated into sequences of machine instructions
that perform the same task.

Symbol table:
A symbol table is a data structure maintained by the compiler to record the
identifiers used in the source program and to collect information about various
attributes of each identifier.

These attributes may provide information about:

• the storage allocated for a name,
• its type,
• its scope (where in the program its value may be used), and, in the case of
procedure names, such things as the number and types of its arguments, the
method of passing each argument (for example, by value or by reference), and
• the type returned.

The code generator enters and uses detailed information stored in the symbol
table about the storage assigned to identifiers.
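As a concrete illustration, a symbol table can be as simple as a map from names to attribute records. The sketch below is illustrative only; the class name and fields are assumptions, not the course's implementation.

```python
# Minimal symbol-table sketch: identifier -> record of attributes.
class SymbolTable:
    def __init__(self):
        self._entries = {}

    def insert(self, name, type_, scope="global", **attrs):
        # Record the identifier with its attributes (type, scope, etc.).
        self._entries[name] = {"type": type_, "scope": scope, **attrs}

    def lookup(self, name):
        # Return the attribute record, or None if the name is unknown.
        return self._entries.get(name)

table = SymbolTable()
table.insert("sum", "int", scope="local", offset=0)
print(table.lookup("sum")["type"])   # int
```

Real compilers add scoping (a stack of such tables) and richer attribute records, but the lookup/insert interface is the same idea.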

Error Handler:
The error handler is connected with all six phases of the compiler. Each phase of
the compiler can encounter errors, and each error is handled by the error-handling
mechanism of the respective phase.
The syntax analysis phase detects a token stream that violates the structure rules
(syntax) of the language.
The semantic analysis phase detects constructs that have the right syntactic
structure but no meaning to the operation involved.

Problems:
X = Y + 10
X = (a + b) * C
Compiler Construction tools:

The compiler writer, like any software developer, can profitably use modern
Software development environments containing tools such as language editors,
debuggers, version managers, profilers, test harnesses, and so on. In addition to
these general software-development tools, other more specialized tools have been
created to help implement various phases of a compiler.

These tools use specialized languages for specifying and implementing specific
components, and many use quite sophisticated algorithms. The most successful
tools are those that hide the details of the generation algorithm and produce
components that can be easily integrated into the remainder of the compiler.

Some commonly used compiler-construction tools include

1. Parser generators that automatically produce syntax analyzers from a
grammatical description of a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression
description of the tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for
walking a parse tree and generating intermediate code.

4. Code-generator generators that produce a code generator from a collection of
rules for translating each operation of the intermediate language into the machine
language for a target machine.

5. Data-flow analysis engines that facilitate the gathering of information about
how values are transmitted from one part of a program to each other part.
Data-flow analysis is a key part of code optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for
constructing various phases of a compiler.

LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of
tokens. A program or function which performs lexical analysis is called a lexical analyzer or
scanner. A lexer often exists as a single function which is called by a parser or another
function.
THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer is the first phase of a compiler. Its main task is to read the
input characters and produce as output a sequence of tokens that the parser uses
for syntax analysis.

    source program --> Lexical Analyzer --token--> Parser
                       Lexical Analyzer <--get next token-- Parser
          (both the lexical analyzer and the parser consult the symbol table)

Upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
ISSUES OF LEXICAL ANALYZER:
There are three reasons for keeping lexical analysis as a separate phase:
• To make the design simpler.
• To improve the efficiency of the compiler.
• To enhance compiler portability.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of
characters is called tokenization.

A token can look like anything that is useful for processing an input text stream or
text file. Consider this expression in the C programming language: sum=3+2;
Lexeme | Token type
sum    | Identifier
=      | Assignment operator
3      | Number
+      | Addition operator
2      | Number
;      | End of statement
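The tokenization in the table above can be sketched with a small regex-based scanner. This is an illustrative Python sketch, not the lab's C/Lex implementation; the master-regex trick (one named group per token class) is one common way to build such a scanner.

```python
import re

# Token classes for the expression sum=3+2; names follow the table above.
TOKEN_SPEC = [
    ("Identifier",          r"[A-Za-z_][A-Za-z0-9_]*"),
    ("Number",              r"\d+"),
    ("Assignment operator", r"="),
    ("Addition operator",   r"\+"),
    ("End of statement",    r";"),
    ("Skip",                r"\s+"),        # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<g{i}>{p})" for i, (_, p) in enumerate(TOKEN_SPEC)))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        name = TOKEN_SPEC[int(m.lastgroup[1:])][0]   # which group matched
        if name != "Skip":
            tokens.append((m.group(), name))
    return tokens

print(tokenize("sum=3+2;"))
# [('sum', 'Identifier'), ('=', 'Assignment operator'), ('3', 'Number'),
#  ('+', 'Addition operator'), ('2', 'Number'), (';', 'End of statement')]
```

Lex works the same way in spirit: each pattern gets an action, and the longest match at the current position wins.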

LEXEME:
A collection or group of characters forming a token is called a lexeme.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.

Attributes for Tokens


Some tokens have attributes that can be passed back to the parser. The lexical
analyzer collects information about tokens into their associated attributes. The
attributes influence the translation of tokens.

a. Constant : value of the constant


b. Identifiers: Pointer to the corresponding symbol table entry.

ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:
The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: deletion of successive characters from the token until the
error is resolved.
INPUT BUFFERING TECHNIQUES:

The lexical analyzer scans the input from left to right one character at a time. It
uses two pointers, the begin pointer (bp) and the forward pointer (fp), to keep
track of the portion of the input scanned.

Initially both pointers point to the first character of the input string, as shown
below.

The forward pointer moves ahead to search for the end of the lexeme. As soon as
a blank space is encountered, it indicates the end of the lexeme. In the above
example, as soon as fp encounters a blank space, the lexeme "int" is identified.
When fp encounters white space, it ignores it and moves ahead; then both bp and
fp are set to the next token. The input characters are thus read from secondary
storage, but reading in this way from secondary storage is costly, hence a buffering
technique is used: a block of data is first read into a buffer, and then scanned by
the lexical analyzer. There are two methods used in this context, the one-buffer
scheme and the two-buffer scheme, explained below.

One-buffer scheme: In this scheme, only one buffer is used to store the input
string. The problem with this scheme is that if a lexeme is very long, it crosses the
buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which
overwrites the first part of the lexeme.
Two-buffer scheme: To overcome the problem of the one-buffer scheme, in this
method two buffers are used to store the input string. The first buffer and second
buffer are scanned alternately; when the end of the current buffer is reached, the
other buffer is filled. The only problem with this method is that if the length of the
lexeme is longer than the length of a buffer, then the input cannot be scanned
completely.
Initially both bp and fp point to the first character of the first buffer. Then fp
moves towards the right in search of the end of the lexeme. As soon as a blank
character is recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an end-of-buffer
character should be placed at the end of the first buffer. Similarly, the end of the
second buffer is recognized by the end-of-buffer mark present at its end.
When fp encounters the first eof, the end of the first buffer is recognized and the
second buffer starts filling. In the same way, when the second eof is obtained, it
indicates the end of the second buffer. Alternately both buffers can be filled up
until the end of the input program, and the stream of tokens is identified. This eof
character introduced at the end is called a sentinel, and it is used to identify the
end of a buffer.
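The reload-on-sentinel behaviour can be simulated in a few lines. The sketch below is illustrative only: the block size, the use of '\0' as a stand-in sentinel, and the function name are assumptions, and it assumes the sentinel character never occurs in the source text.

```python
EOF = "\0"   # stand-in for the eof sentinel; assumed absent from the source

def tokens_with_sentinel(source, n=4):
    """Scan `source` in n-character blocks, each terminated by a sentinel,
    splitting lexemes on blanks. One comparison per character detects both
    'end of block' (reload) and 'end of input' (stop)."""
    buf, fp, loaded = source[:n] + EOF, 0, n
    out, lexeme = [], ""
    while True:
        ch = buf[fp]
        if ch == EOF:
            if loaded < len(source):          # sentinel hit: reload next block
                buf = source[loaded:loaded + n] + EOF
                loaded += n
                fp = 0
                continue
            break                             # true end of input
        if ch == " ":
            if lexeme:
                out.append(lexeme)
            lexeme = ""
        else:
            lexeme += ch
        fp += 1
    if lexeme:
        out.append(lexeme)
    return out

print(tokens_with_sentinel("int x = 10"))   # ['int', 'x', '=', '10']
```

The point of the sentinel is visible in the loop: the scanner tests only `ch == EOF` instead of separately checking a buffer-end index on every character.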

Code to advance the forward pointer (without sentinels):

    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else
        forward := forward + 1;

SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read. We can combine the buffer-end test with the test for the current character
if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character
that cannot be part of the source program, and a natural choice is the character eof.

Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.

Code to advance the forward pointer (with sentinels):

    forward := forward + 1;
    if forward↑ = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else /* eof within a buffer signifying end of input */
            terminate lexical analysis
    end

SPECIFICATION OF TOKENS:
There are 3 specifications of Tokens:
1) Strings
2) Language
3) Regular expression

Strings and Languages:


An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that
alphabet. A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms
for "string." The length of a string s, usually written |s|, is the number of
occurrences of symbols in s. For example, banana is a string of length six. The
empty string, denoted ε, is the string of length zero.

Operations on strings
The following string-related terms are commonly used:
1.A prefix of string s is any string obtained by removing zero or more symbols
from the end of string s.
For example, ban is a prefix of banana.
2.A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana is a suffix of banana.
3.A substring of s is obtained by deleting any prefix and any suffix from s. For
example, nan is a substring of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more symbols from
not necessarily consecutive positions of s. For example, baan is a subsequence of
banana.
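These string operations are easy to check mechanically. The sketch below is illustrative: Python's built-ins cover prefix, suffix and substring, while the subsequence test is written by hand since there is no built-in for it.

```python
s = "banana"

def is_subsequence(t, s):
    # Each character of t must appear in s, in order (not necessarily adjacent).
    it = iter(s)
    return all(ch in it for ch in t)   # `in` consumes the iterator as it scans

print(s.startswith("ban"))        # True:  ban is a prefix of banana
print(s.endswith("nana"))         # True:  nana is a suffix of banana
print("nan" in s)                 # True:  nan is a substring of banana
print(is_subsequence("baan", s))  # True:  baan is a subsequence of banana
print(len(s))                     # 6:     |banana| = 6
```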

Operations on languages:
The following are the operations that can be applied to languages:
1.Union
2.Concatenation
3.Kleene closure
4.Positive closure.

The following example shows the operations on languages:

Let L = {0,1} and S = {a,b,c}
1. Union: L U S = {0,1,a,b,c}
2. Concatenation: L.S = {0a,0b,0c,1a,1b,1c}
3. Kleene closure: L* = {ε,0,1,00,.....}
4. Positive closure: L+ = {0,1,00,….}
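For finite languages these four operations can be computed directly. The sketch below is illustrative; since the closures are infinite, only strings up to a length bound are enumerated.

```python
from itertools import product

L = {"0", "1"}
S = {"a", "b", "c"}

def concat(A, B):
    # Concatenation: every string of A followed by every string of B.
    return {x + y for x, y in product(A, B)}

def kleene(A, max_len):
    # Kleene closure, truncated to strings of length <= max_len.
    result, layer = {""}, {""}
    while True:
        layer = {w for w in concat(layer, A) if len(w) <= max_len}
        if not layer - result:
            break
        result |= layer
    return result

print(sorted(L | S))                 # union: ['0', '1', 'a', 'b', 'c']
print(sorted(concat(L, S)))          # concatenation: ['0a', '0b', ..., '1c']
print(sorted(kleene(L, 2)))          # L* up to length 2 (includes '')
print(sorted(kleene(L, 2) - {""}))   # L+ drops only the empty string
```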

Regular Expressions:
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some alphabet Σ and
the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole
member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is,
the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
   a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
   b) (r)(s) is a regular expression denoting the language L(r)L(s).
   c) (r)* is a regular expression denoting (L(r))*.
   d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
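Rules 4-6 mean, for example, that a|bc* is grouped as a|(b(c*)). Python's re module follows the same precedence conventions, so the grouping can be checked mechanically; the sketch below is illustrative.

```python
import re

# The terse pattern and its fully parenthesized form match the same strings.
terse = re.compile(r"a|bc*")
spelt = re.compile(r"a|(b(c*))")

for w in ["a", "b", "bc", "bccc", "ab", "c"]:
    assert bool(terse.fullmatch(w)) == bool(spelt.fullmatch(w))

print(bool(terse.fullmatch("bccc")))   # True: b followed by zero or more c's
print(bool(terse.fullmatch("ab")))     # False: | is alternation, not concatenation
```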

Regular set
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and Write r = s.

There are a number of algebraic laws for regular expressions that can be used to
manipulate into equivalent forms.
For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is associative.

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If ∑ is an
alphabet of basic symbols, then a regular definition is a sequence of definitions of
the form:

d1 → r1
d2 → r2
………
dn → rn

1. Each di is a distinct name.

2. Each ri is a regular expression over the alphabet ∑ U {d1, d2, ..., di-1}.

Example: Identifiers are the set of strings of letters and digits beginning with a
letter. The regular definition for this set is:

letters → A | B |…….| Z | a | b |……| z
digits → 0 | 1 |…….| 9
id → letters ( letters | digits )*
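The identifier definition transliterates directly into a single regex, with letters and digits expanded to character classes. An illustrative Python sketch:

```python
import re

# letters -> [A-Za-z], digits -> [0-9], id -> letters (letters|digits)*
letters = "[A-Za-z]"
digits = "[0-9]"
ident = re.compile(f"{letters}({letters}|{digits})*")

print(bool(ident.fullmatch("count1")))   # True: starts with a letter
print(bool(ident.fullmatch("1count")))   # False: starts with a digit
```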

Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational shorthands for them.

1. One or more instances (+):
• The unary postfix operator + means "one or more instances of".
• If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
• Thus the regular expression a+ denotes the set of all strings of one or more a's.
• The operator + has the same precedence and associativity as the operator *.

2. Zero or one instance (?):
• The unary postfix operator ? means "zero or one instance of".
• The notation r? is a shorthand for r | ε.
• If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.

3. Character Classes:
• The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
• A character class such as [a-z] denotes the regular expression a | b | c | ... | z.
• We can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
Non-regular Set:
A language which cannot be described by any regular expression is a non-regular
set. Example: The set of all strings of balanced parentheses and repeating strings
cannot be described by a regular expression. This set can be specified by a context-
free grammar.

RECOGNITION OF TOKENS:

Consider the following grammar fragment:

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
For this language fragment the lexical analyzer will recognize the keywords if, then,
else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we
assume keywords are reserved; that is, they cannot be used as identifiers.
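A sketch of how a lexical analyzer might recognize these tokens using Python's `re` module (the pattern names and the `tokenize` helper are our own illustration, not a prescribed implementation):

```python
import re

# Token patterns transcribed from the regular definitions above.
# Keywords are reserved, so an ID lexeme is re-checked against them.
TOKEN_SPEC = [
    ("NUM",   r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),  # digit+ (. digit+)? (E (+|-)? digit+)?
    ("ID",    r"[A-Za-z][A-Za-z0-9]*"),         # letter (letter|digit)*
    ("RELOP", r"<=|<>|>=|<|=|>"),               # longer alternatives first
    ("WS",    r"\s+"),                          # whitespace, skipped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}

def tokenize(text):
    """Return (token, lexeme) pairs; unmatched characters are skipped."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue
        if kind == "ID" and lexeme in KEYWORDS:
            kind = lexeme.upper()               # reserved word
        tokens.append((kind, lexeme))
    return tokens
```

Note that the inner groups of NUM are non-capturing `(?:...)` so that the named-group dispatch on `m.lastgroup` works correctly.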

Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a
lexical analyzer is called by the parser to get the next token. It is used to keep
track of information about the characters that are seen as the forward pointer
scans the input.

FINITE AUTOMATA
Finite Automata is one of the mathematical models that consist of a number of
states and edges. It is a transition diagram that recognizes a regular expression or
grammar.

Types of Finite Automata

There are two types of finite automata:
• Non-deterministic Finite Automata (NFA)
• Deterministic Finite Automata (DFA)

Non-deterministic Finite Automata
NFA is a mathematical model that consists of a five-tuple M = {Qn, Ʃ, δ, q0, fn}, where
Qn – finite set of states
Ʃ – finite set of input symbols
δ – transition function that maps state-symbol pairs to sets of states
q0 – starting state
fn – set of final states
Deterministic Finite Automata (DFA):
DFA is a special case of an NFA in which
(i) no state has an ε-transition, and
(ii) there is at most one transition from each state on any input.
A DFA is denoted by the five-tuple M = {Qd, Ʃ, δ, q0, fd}, where
Qd – finite set of states
Ʃ – finite set of input symbols
δ – transition function that maps each state-symbol pair to a single state
q0 – starting state
fd – set of final states
Converting a Regular Expression into a Non-Deterministic Finite Automaton
(Thompson’s Algorithm)

There are only 5 rules, one for each type of RE:

Let X = {a, b}.
• The regular expression a|b denotes the language {a, b}.
• (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet X. Another regular expression that denotes the same language is aa|ab|ba|bb.
• a* denotes the set of all strings of zero or more a's, i.e. {ε, a, aa, aaa, aaaa, ...}.
• (a|b)* denotes the set of all strings of zero or more occurrences of a or b, i.e. all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression that denotes the same language is (a*b*)*.
• a|a*b denotes the language {a, b, ab, aab, aaab, ...}, i.e. the string a, together with all strings of zero or more a's ending with b.
FINITE AUTOMATA

o Finite automata are used to recognize patterns.
o A finite automaton takes a string of symbols as input and changes its state accordingly. When the desired symbol is found, the transition occurs.
o At the time of transition, the automaton can either move to the next state or stay in the same state.
o A finite automaton has two possible outcomes for an input string: accept or reject. When the input string is processed successfully and the automaton reaches a final state, the string is accepted.

Formal Definition of FA

A finite automaton is a 5-tuple (Q, ∑, δ, q0, F), where:

1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: Q × ∑ → Q: transition function

The algorithm constructs NFAs with only one final state. For example, the third rule
indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and
B which are represented as two boxes with one start and one final state for each
box. Then the NFA for AB is constructed by connecting the final state of A to the
start state of B using an empty transition.

For example, the RE (a|b)c is mapped to the following NFA:
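A minimal Python sketch of Thompson's rules for symbols, union and concatenation (the star rule is omitted for brevity; the fragment representation and names are our own). Each constructor returns an NFA fragment with a single start and a single final state, with `None` marking ε-edges:

```python
import itertools

_ids = itertools.count()   # fresh state numbers

def symbol(a):
    """Rule for a single symbol a: start --a--> final."""
    s, f = next(_ids), next(_ids)
    return {"start": s, "final": f, "edges": [(s, a, f)]}

def union(n1, n2):
    """Rule for r|s: a new start/final pair joined by eps-edges (None)."""
    s, f = next(_ids), next(_ids)
    edges = n1["edges"] + n2["edges"] + [
        (s, None, n1["start"]), (s, None, n2["start"]),
        (n1["final"], None, f), (n2["final"], None, f)]
    return {"start": s, "final": f, "edges": edges}

def concat(n1, n2):
    """Rule for rs: final of r linked to start of s by an eps-edge."""
    edges = n1["edges"] + n2["edges"] + [(n1["final"], None, n2["start"])]
    return {"start": n1["start"], "final": n2["final"], "edges": edges}

def accepts(nfa, word):
    """Simulate the NFA, following eps-edges (None) for free."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for p, a, r in nfa["edges"]:
                if p == q and a is None and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    current = closure({nfa["start"]})
    for c in word:
        moved = {r for p, a, r in nfa["edges"] if p in current and a == c}
        current = closure(moved)
    return nfa["final"] in current

# The NFA for (a|b)c from the text:
nfa = concat(union(symbol("a"), symbol("b")), symbol("c"))
```

With this fragment, `accepts(nfa, "ac")` and `accepts(nfa, "bc")` hold, while "c" alone is rejected.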

Non Deterministic Finite Automaton into a Deterministic Finite


Automaton: (Subset Construction Algorithm)

The next step is to convert a NFA to a DFA (called subset construction). Suppose
that you assign a number to each NFA state. The DFA states generated by subset
construction have sets of numbers, instead of just one number. For example, a DFA
state may have been assigned the set {5,6,8}. This indicates that arriving to the
state labeled {5,6,8} in the DFA is the same as arriving to the state 5, the state 6,
or the state 8 in the NFA when parsing the same input. (Recall that a particular
input sequence when parsed by a DFA, leads to a unique state, while when parsed
by a NFA it may lead to multiple states.)
First we need to handle transitions that lead to other states for free (without consuming any input). These are the ε-transitions. We define the ε-closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more ε-transitions. For example, the ε-closure of node 1 in the left figure below

is the set {1,2}. The start state of the constructed DFA is labeled by the ε-closure of the NFA start state. For every DFA state labeled by some set {s1, s2, ..., sn} and for every character c in the language alphabet, you find all the NFA states reachable from s1, s2, ..., or sn using c arrows and you union together the ε-closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label. For example, node {1,2} in the DFA above has an arrow to {3,4,5} for the character a, since the NFA node 3 can be reached from 1 on a and nodes 4 and 5 can be reached from 2. The b arrow for node {1,2} goes to the error node, which is associated with an empty set of NFA nodes.
Converting NFAs to DFAs

To convert an NFA to a DFA, we must find a way to remove all ε-transitions and to ensure that there is one transition per symbol in each state. We do this by constructing a DFA in which each state corresponds to a set of states from the NFA. In the DFA, transitions from a state S by some symbol go to the state S' that consists of all the possible NFA states that could be reached by that symbol from some NFA state q contained in the present DFA state S. The resulting DFA "simulates" the given NFA in the sense that a single DFA transition represents many simultaneous NFA transitions.

The first concept we need is the ε-closure. The ε-closure of an NFA state q is the set containing q along with all states in the automaton that are reachable by any number of ε-transitions from q. Likewise, we can define the ε-closure of a set of states to be the set of states reachable by ε-transitions from its members; in other words, it is the union of the ε-closures of its elements.

To convert our NFA to its DFA counterpart, we begin by taking the ε-closure of the start state q0 of our NFA and constructing a new start state S0 in our DFA corresponding to that ε-closure. Next, for each symbol in our alphabet, we record the set of NFA states that we can reach from S0 on that symbol. For each such set, we make a DFA state corresponding to its ε-closure, taking care to do this only once for each set.
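The subset construction described above can be sketched as follows (a simplified Python illustration; the NFA encoding as a dict of (state, symbol) pairs is our own):

```python
def eclosure(states, eps):
    """All states reachable from `states` by zero or more eps-moves."""
    stack, result = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in result:
                result.add(r)
                stack.append(r)
    return frozenset(result)

def subset_construction(delta, eps, start, alphabet):
    """DFA states are frozensets of NFA states (the 'labels' in the text)."""
    start_set = eclosure({start}, eps)
    dfa, worklist = {}, [start_set]
    while worklist:
        S = worklist.pop()
        if S in dfa:                  # already expanded
            continue
        dfa[S] = {}
        for a in alphabet:
            moved = set().union(*(delta.get((q, a), set()) for q in S))
            T = eclosure(moved, eps)  # the empty frozenset is the error node
            dfa[S][a] = T
            if T not in dfa:
                worklist.append(T)
    return start_set, dfa
```

For instance, for an NFA accepting strings over {a, b} ending in "ab" (δ(0,a) = {0,1}, δ(0,b) = {0}, δ(1,b) = {2}, no ε-moves), the construction yields the three DFA states {0}, {0,1} and {0,2}.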
Construction of Optimized DFA from Regular Expression

Construction of a DFA directly from a regular expression (RE) by computing the functions nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.

• nullable(n): true for a *-node and for a node labeled ε; false for other nodes.
• firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.
• lastpos(n): set of positions that correspond to the last symbol of strings in n's subtree.
• followpos(i): set of positions that can follow position i in a valid string.
Rules for computing nullable(n), firstpos(n) and lastpos(n)

Node n                 | nullable(n)                   | firstpos(n)                                                        | lastpos(n)
A leaf labeled ε       | true                          | ∅                                                                  | ∅
A leaf with position i | false                         | {i}                                                                | {i}
n = c1 | c2            | nullable(c1) or nullable(c2)  | firstpos(c1) ∪ firstpos(c2)                                        | lastpos(c1) ∪ lastpos(c2)
n = c1 . c2            | nullable(c1) and nullable(c2) | if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1) | if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
n = c1*                | true                          | firstpos(c1)                                                       | lastpos(c1)

Computation of followpos
One position of a regular expression can follow another in the following ways:
• If n is a cat-node (.) with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
• If n is a star-node (*) and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).

Building Syntax tree
Example

Step 1: (a|b)*abb → (a|b)*abb #
Node | followpos
1    | {1, 2, 3}
2    | {1, 2, 3}
3    | {4}
4    | {5}
5    | {6}
6    | ∅
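The followpos table above can be reproduced mechanically by walking the syntax tree. In this sketch the tree of (a|b)*abb# is hard-coded as nested tuples (the encoding is ours):

```python
# Leaves are ("leaf", position, symbol); inner nodes are
# ("or" | "cat" | "star", children...). Tree for (a|b)*abb#:
tree = ("cat",
        ("cat",
         ("cat",
          ("cat",
           ("star", ("or", ("leaf", 1, "a"), ("leaf", 2, "b"))),
           ("leaf", 3, "a")),
          ("leaf", 4, "b")),
         ("leaf", 5, "b")),
        ("leaf", 6, "#"))

followpos = {i: set() for i in range(1, 7)}

def visit(n):
    """Return (nullable, firstpos, lastpos) and fill in followpos."""
    kind = n[0]
    if kind == "leaf":
        return False, {n[1]}, {n[1]}
    if kind == "star":
        nul, first, last = visit(n[1])
        for i in last:                     # star rule
            followpos[i] |= first
        return True, first, last
    n1, f1, l1 = visit(n[1])
    n2, f2, l2 = visit(n[2])
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    for i in l1:                           # cat rule
        followpos[i] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last

visit(tree)
```

Running this reproduces exactly the followpos sets in the table: positions 1 and 2 map to {1,2,3}, position 3 to {4}, and so on.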

Step 4

A = firstpos(n0) = {1,2,3}

Move[A,a] = {1,3}
  followpos(1) ∪ followpos(3) = {1,2,3,4} = B
Move[A,b] = {2}
  followpos(2) = {1,2,3} = A
Move[B,a] = {1,3}
  followpos(1) ∪ followpos(3) = {1,2,3,4} = B
Move[B,b] = {2,4}
  followpos(2) ∪ followpos(4) = {1,2,3,5} = C
Move[C,a] = {1,3} → B
Move[C,b] = {2,5}
  followpos(2) ∪ followpos(5) = {1,2,3,6} = D
Move[D,a] = {1,3} → B
Move[D,b] = {2}
  followpos(2) = {1,2,3} = A

Finite Automata Model:

Finite automata can be represented by input tape and finite control.

Input tape: It is a linear tape having some number of cells. Each input symbol is
placed in each cell.

Finite control: The finite control decides the next state on receiving particular
input from input tape. The tape reader reads the cells one by one from left to right,
and at a time only one input symbol is read.

Concepts of Automata Theory:

Alphabet:
An alphabet is a finite, non-empty set of symbols. It is denoted by Σ.
Example:
i) Σ = {a, b}, an alphabet of 2 symbols a and b.
ii) Σ = {0, 1}, the binary alphabet.
iii) Σ = {0, 1, 2}, an alphabet of 3 symbols 0, 1 and 2.
iv) Σ = {a, b, ..., z}, the set of all lower-case letters.
v) The set of all ASCII characters, or the set of all printable ASCII characters.

String:

A string or sometimes word is a finite sequence of symbols


chosen from somealphabet.

Example:
i) 01101 is a string from the binary alphabet Σ = {0,1}
The string 111 is another string chosen from this alphabet

ii) If Σ = {a,b} then abab, aabba area all strings over the alphabet Σ = {a,b}.

iii) If Σ = {a} then a, aa, aaa, are all strings over the alphabet Σ = {a}.

Length of a String:
Let w be a string; the length of the string, │w│, is the number of symbols composing the string.

Example
i) If w = abcd then │w│ = 4.
ii) If x = 0101010 then │x│ = 7.
iii) │ε│ = 0.

Empty or Null String:
The empty string is the string with zero occurrences of symbols, i.e. the string whose length is zero. It is denoted by ε, and │ε│ = 0.
Note that there are only two distinct symbols, 0 and 1, in the string 01101, but there are five positions for symbols, so its length is 5. In general, "the number of symbols" is used loosely when "the number of positions" is meant.

Substring:
A string v that appears within another string w (that is, w = uvx for some strings u and x) is called a substring of w.
If w = uv, then u and v are said to be a prefix and a suffix of w, respectively.

Example:
i) If w = abbab is a string over the alphabet Σ = {a, b}, then a, ab, abb, ba, bab, ... are all substrings of w.
ii) w = 123
    Prefixes = {ε, 1, 12, 123}
    Suffixes = {ε, 3, 23, 123}
iii) w = abbab
    Prefixes = {ε, a, ab, abb, abba, abbab}
    Suffixes = {ε, b, ab, bab, bbab, abbab}

Reverse of a String:

i) The reverse of a string is obtained by writing its symbols in reverse order.
ii) Let w be a string; its reverse is denoted wR.
    i.e., if w = a1a2a3...am then wR = am...a3a2a1.
iii) If u = 100010 then uR = 010001.

Properties of string operations:

i) Concatenation is associative: for all strings u, v and w, (uv)w = u(vw).
ii) If u and v are strings, then the length of their concatenation is the sum of the individual lengths, i.e. │uv│ = │u│ + │v│.

Example:
x = abc and y = 123
xy = abc123
│xy│ = 6, i.e. │x│ = 3, │y│ = 3, hence │xy│ = │x│ + │y│.

Languages:
For any alphabet Σ, any subset L of Σ* is a language. A set of strings from an alphabet is called a language.

Example:
i) English is a language over Σ = {a, b, c, ..., z}.
ii) The set of binary strings {0, 1, 01, 0101, ...} is a language over Σ = {0, 1}.
iii) If Σ = {a, b} is an alphabet, then Σ* = {ε, a, b, aa, ab, ...} is a language.

Notations:

1. {λ} or {ε}: the empty-string (null-string) language. It is a language over every alphabet and it contains exactly one string, ε (also written λ).
2. Φ: the empty language. It contains no strings.
3. Σ*: the universal language. It contains all strings over the alphabet Σ.

Power of an alphabet:

Let Σ be an alphabet. Then
Σ* denotes the set of all strings over the alphabet Σ, and
Σm denotes the set of all strings over the alphabet Σ of length m.

Example: If Σ = {0, 1} then
Σ0 = {ε}, the empty string only.
Σ1 = {0, 1}, the set of all strings of length one over Σ = {0, 1}.
Σ2 = {00, 01, 10, 11}, the set of all strings of length two over Σ = {0, 1}.
Σ3 = {000, 001, 010, 011, 100, 101, 110, 111}, the set of all strings of length three over Σ = {0, 1}.

The set of all strings over an alphabet Σ is conventionally denoted Σ*.
For instance, {0,1}* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, ...}.
Put another way,
Σ* = Σ0 ∪ Σ1 ∪ Σ2 ∪ Σ3 ∪ ...
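These powers can be enumerated directly; a small Python sketch using itertools.product (the function name is ours):

```python
from itertools import product

def power(sigma, m):
    """Sigma^m: the set of all strings of length m over sigma."""
    return {"".join(t) for t in product(sigma, repeat=m)}
```

For Σ = {0, 1}, `power(("0", "1"), 2)` yields the four strings 00, 01, 10 and 11, and in general │Σm│ = │Σ│^m.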

Sometimes, we wish to exclude the empty string from the set of strings. The set of nonempty strings over alphabet Σ is denoted Σ+; thus two appropriate equivalences are

Σ+ = Σ1 ∪ Σ2 ∪ Σ3 ∪ ...
Σ* = Σ+ ∪ {ε}

Kleene Closure:
Let Σ be an alphabet. The Kleene closure Σ* denotes the set of all strings over the alphabet Σ (including ε, the empty string).

Example:
i) If Σ = {a} then Σ* = {ε, a, aa, ...}.
ii) If Σ = {0, 1} then Σ* = {ε, 0, 1, 00, 11, 10, ...}

Positive Closure

• The set of all non-empty strings over Σ.
• Σ+ = Σ1 ∪ Σ2 ∪ Σ3 ∪ Σ4 ∪ ...
• Σ* = Σ+ ∪ Σ0

Types of Automata

There are two types of finite automata:

1. DFA(deterministic finite automata)

2. NFA(non-deterministic finite automata)

1. Deterministic Finite Automata (DFA)

• DFA refers to deterministic finite automata.
• Deterministic refers to the uniqueness of the computation: the machine reads the input string one symbol at a time, and each move is fully determined.
• In a DFA, there is only one path for a specific input from the current state to the next state.
• A DFA does not allow the null move, i.e., a DFA cannot change state without consuming an input character.
• A DFA can contain multiple final states. DFAs are used in lexical analysis in compilers.
In the following diagram, we can see that from state q0 for input a, there is only
one path which is going to q1. Similarly, from q0, there is only one path for input b
going to q2.

Graphical Representation of DFA

A DFA can be represented by a digraph called a state diagram, in which:
1. The states are represented by vertices.
2. The arcs labeled with an input character show the transitions.
3. The initial state is marked with an arrow.
4. The final state is denoted by a double circle.
Example 1:
1. Q = {q0, q1, q2}
2. ∑ = {0, 1}
3. q0 = {q0}
4. F = {q2}
Solution:
Transition Diagram:

Transition Table:

Present State | Next State for Input 0 | Next State for Input 1
→q0           | q0                     | q1
q1            | q2                     | q1
*q2           | q2                     | q2
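Running a DFA is a single deterministic walk through the table. A sketch using the transition table of Example 1 (the transition diagram itself is not reproduced here; reading the table, q2 is reached exactly when the input contains the substring 10):

```python
# DFA transition table of Example 1: delta[state][symbol] -> next state.
delta = {
    "q0": {"0": "q0", "1": "q1"},
    "q1": {"0": "q2", "1": "q1"},
    "q2": {"0": "q2", "1": "q2"},
}
START, FINALS = "q0", {"q2"}   # *q2 marks the final state

def dfa_accepts(w):
    """One unique move per symbol; accept iff we end in a final state."""
    state = START
    for c in w:
        state = delta[state][c]
    return state in FINALS
```

For instance, `dfa_accepts("0110")` holds (the run ends in q2), while "01" is rejected.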

Example 2:
DFA with ∑ = {0, 1} that accepts all strings starting with 0.
Solution:

Explanation:
o In the above diagram, we can see that on input 0 in state q0 the DFA changes state to q1, the final state, and stays there for any string starting with 0. It can accept 00, 01, 000, 001, etc. It can't accept any string which starts with 1, because it will never reach the final state on a string starting with 1.

Example 3:
DFA with ∑ = {0, 1} that accepts all strings ending with 0.
Solution:

Explanation:
In the above diagram, we can see that on given 0 as input to DFA in state q0, the
DFA changes state to q1. It can accept any string which ends with 0 like 00, 10,
110, 100....etc. It can't accept any string which ends with 1, because it will never
go to the final state q1 on 1 input, so the string ending with 1, will not be accepted
or will be rejected.

Example 4:
Design a FA with ∑ = {0, 1} accepts those string which starts with 1 and ends with
0.

Solution:
The FA will have a start state q0 from which only the edge with input 1 will go to
the next state.

In state q1, if we read 1, we will be in state q1, but if we read 0 at state q1, we will
reach to state q2 which is the final state. In state q2, if we read either 0 or 1, we
will go to q2 state or q1 state respectively. Note that if the input ends with 0, it will
be in the final state.

Example 5:

Design a FA with ∑ = {0, 1} accepts the only input 101.

Solution:

In the given solution, we can see that only the input 101 will be accepted; no paths are shown for other inputs.

Example 6:

Design FA with ∑ = {0, 1} accepts the set of all strings with three consecutive 0's.

Solution:

The strings that will be generated for this particular language are 000, 0001, 1000, 10001, ..., in which 0 always appears in a clump of three. The transition graph is as follows:

Example 7:

Design a DFA for L(M) = {w | w ∈ {0, 1}*, where w is a string that does not contain consecutive 1's}.

Solution:

When two consecutive 1's occur, the DFA moves to a dead state:

Here a single 1, or 1's separated by 0's, are acceptable, hence

The states q0, q1, q2 are the final states. The DFA will accept the strings that do not contain consecutive 1's, like 10, 010, 101, etc.

Example 8:

Design a FA with ∑ = {0, 1} that accepts the strings with an even number of 0's followed by a single 1.

Solution:

The DFA can be shown by a transition diagram as:

2. NFA (Non-Deterministic Finite Automata)

• NFA stands for non-deterministic finite automata.
• For a particular input, an NFA can move to any number of next states.
• It can accept the null (ε) move.
• It is easier to construct an NFA than a DFA for a given regular language.
• The finite automata are called NFA when there exist many paths for a specific input from the current state to the next state.
• Not every NFA is a DFA, but each NFA can be translated into an equivalent DFA.
• NFA is defined in the same way as DFA, but with two exceptions: it may contain multiple next states, and it may contain ε-transitions.

In the following image, from state q0 for input a, there are two next states q1 and
q2, similarly, from q0 for input b, the next states are q0 and q1. Thus it is not fixed
or determined that with a particular input where to go next. Hence this FA is called
non-deterministic finite automata.

Formal definition of NFA:

An NFA is also defined by a five-tuple, the same as a DFA, but with a different transition function, as shown below:

δ: Q × ∑ → 2^Q

where,

1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: Q × ∑ → 2^Q: transition function

Graphical Representation of an NFA

An NFA can be represented by a digraph called a state diagram, in which:

1. The states are represented by vertices.
2. The arcs labeled with an input character show the transitions.
3. The initial state is marked with an arrow.
4. The final state is denoted by a double circle.
Example 1:
1. Q = {q0, q1, q2}
2. ∑ = {0, 1}
3. q0 = {q0}
4. F = {q2}
Solution:
Transition diagram:

Transition Table:

Present State | Next State for Input 0 | Next State for Input 1
→q0           | q0, q1                 | q1
q1            | q2                     | q0
*q2           | q2                     | q1, q2
In the above diagram, we can see that when the current state is q0, on input 0, the
next state will be q0 or q1, and on 1 input the next state will be q1. When the
current state is q1, on input 0 the next state will be q2 and on 1 input, the next
state will be q0. When the current state is q2, on 0 input the next state is q2, and
on 1 input the next state will be q1 or q2.

Example 2:
NFA with ∑ = {0, 1} that accepts all strings containing 01.
Solution:

Transition Table:

Present State | Next State for Input 0 | Next State for Input 1
→q0           | q1                     | ∅
q1            | ∅                      | q2
*q2           | q2                     | q2
Example 3:
NFA with ∑ = {0, 1} that accepts all strings of length at least 2.
Solution:

Transition Table:

Present State | Next State for Input 0 | Next State for Input 1
→q0           | q1                     | q1
q1            | q2                     | q2
*q2           | ∅                      | ∅
Example 4:
Design a NFA for the transition table as given below:

Present State | 0      | 1
→q0           | q0, q1 | q0, q2
q1            | q3     | ∅
q2            | q2, q3 | q3
→q3           | q3     | q3

Solution:
The transition diagram can be drawn by using the mapping function as given in the
table.

Here,
δ(q0, 0) = {q0, q1}
δ(q0, 1) = {q0, q2}
δ(q1, 0) = {q3}
δ(q2, 0) = {q2, q3}
δ(q2, 1) = {q3}
δ(q3, 0) = {q3}
δ(q3, 1) = {q3}
Example 5:
Design an NFA with ∑ = {0, 1} that accepts all strings ending with 01.
Solution:

Hence, NFA would be:

Example 6:
Design an NFA with ∑ = {0, 1} in which double '1' is followed by double '0'.
Solution:
The FA with double 1 is as follows:

It should be immediately followed by double 0.


Then,

Now before double 1, there can be any string of 0 and 1. Similarly, after double 0,
there can be any string of 0 and 1.

Hence the NFA becomes:

Now consider the string 01100011:

q0 → q0 → q1 → q2 → q3 → q4 → q4 → q4 → q4
Example 7:
Design an NFA in which all the strings contain the substring 1110.
Solution:
The language consists of all the strings containing the substring 1110. The partial transition diagram can be:

Now as 1110 could appear anywhere in the string, we add inputs 0 and 1 so that the substring 1110 of the language can be maintained. Hence the NFA becomes:

Transition table for the above transition diagram can be given below:

Present State | 0  | 1
→q1           | q1 | q1, q2
q2            | ∅  | q3
q3            | ∅  | q4
q4            | q5 | ∅
*q5           | q5 | q5
Consider the string 111010:
1. δ(q1, 111010) = δ(q1, 11010)
2. = δ(q1, 1010)
3. = δ(q2, 010)

We got stuck, as there is no path from q2 for input symbol 0. We can process the string 111010 in another way:

1. δ(q1, 111010) = δ(q2, 11010)
2. = δ(q3, 1010)
3. = δ(q4, 010)
4. = δ(q5, 10)
5. = δ(q5, 0)
6. = δ(q5, ε)

As state q5 is an accept state and the input is completely scanned with the machine in a final state, the string is accepted.

Example 8:
Design an NFA with ∑ = {0, 1} that accepts all strings in which the third symbol from the right end is always 0.

Solution:

Thus we get the third symbol from the right end as '0' always. The NFA can be:

The above image is an NFA because in state q0 with input 0, we can either go to
state q0 or q1.

Finite Automata with Epsilon-Transition:


The ε-transitions in an NFA allow moving from one state to another without consuming any symbol from the input set Σ.
Definition (or formal notation) for an ε-NFA:
The language accepted by an NFA with ε, denoted by M = (Q, Σ, δ, q0, F), can be defined as follows.
Let M = (Q, Σ, δ, q0, F) be an NFA with ε, where
Q is a finite set of states,
Σ is the set of input symbols,
δ is a transition function from Q × (Σ ∪ {ε}) to 2^Q,
q0 is a start state, and
F is a set of final states such that F ⊆ Q.
The set of strings accepted by the NFA can be written as
L(M) = {w | w ∈ Σ* and the δ transitions for w from q0 reach a state in F}
Epsilon Closure: ε-closure()
The epsilon closure of a given state X is the set of states which can be reached from X with only ε (null) moves, including the state X itself.

ε-closure(1) = {1, 2, 3, 5, 8}
ε-closure(2) = {2, 3, 5}
ε-closure(3) = {3}
ε-closure(4) = {4, 7, 8, 2, 3, 5}
ε-closure(5) = {5}
ε-closure(6) = {6, 7, 8, 2, 3, 5}
ε-closure(7) = {7, 8, 2, 3, 5}
ε-closure(8) = {8}
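The ε-closures above can be computed with a simple reachability search. The ε-edges below are reconstructed from the closure list (the diagram itself is not reproduced here), so treat them as an assumption:

```python
# eps-edges consistent with the closure list above (reconstructed):
# 1->2, 1->8, 2->3, 2->5, 4->7, 6->7, 7->2, 7->8
EPS = {1: {2, 8}, 2: {3, 5}, 4: {7}, 6: {7}, 7: {2, 8}}

def eps_closure(state):
    """States reachable from `state` via zero or more eps-moves."""
    stack, seen = [state], {state}
    while stack:
        q = stack.pop()
        for r in EPS.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen
```

With these edges, `eps_closure(1)` returns {1, 2, 3, 5, 8}, matching the first entry of the list.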

Some important points about DFA and NFA:

1. Every DFA is an NFA, but not every NFA is a DFA.
2. There can be multiple final states in both NFA and DFA.
3. DFA is used in lexical analysis in a compiler.
4. NFA is more of a theoretical concept.

REGULAR EXPRESSIONS

o The language accepted by a finite automaton can be easily described by a simple expression called a regular expression. It is the most effective way to represent any regular language.
o The languages accepted by regular expressions are referred to as regular languages.
o A regular expression can also be described as a sequence of patterns that defines a string.

o Regular expressions are used to match character combinations in strings. String-searching algorithms use these patterns to perform operations on a string.

For instance:

In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx, xxx, xxxx, ...}.

In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx, xxxx, ...}.

Operations on Regular Language

The various operations on regular language are:

Union: If L and M are two regular languages, then their union L ∪ M is also regular.

L ∪ M = {s | s is in L or s is in M}

Intersection: If L and M are two regular languages, then their intersection L ∩ M is also regular.

L ∩ M = {s | s is in L and s is in M}

Kleene closure: If L is a regular language, then its Kleene closure L* is also a regular language.

L* = zero or more occurrences of language L.

Example 1:

Write the regular expression for the language accepting all combinations of a's,
over the set

∑ = {a}

Solution:

All combinations of a's means a may appear zero, one, two or more times. If a appears zero times, that means the null string. That is, we expect the set {ε, a, aa, aaa, ...}. So we give a regular expression for this as:

R = a*

That is, the Kleene closure of a.

Example 2:

Write the regular expression for the language accepting all combinations of a's
except the null string, over the set ∑ = {a}

Solution:

The regular expression has to be built for the language

L = {a, aa, aaa, ...}

This set indicates that there is no null string. So we can denote the regular expression as:

R = a+
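Both shorthands map directly onto a conventional regex engine; a small Python check (helper names are ours):

```python
import re

star = re.compile(r"a*")   # R = a*: zero or more a's, includes the empty string
plus = re.compile(r"a+")   # R = a+: one or more a's, excludes the empty string

def in_star(s):
    return star.fullmatch(s) is not None

def in_plus(s):
    return plus.fullmatch(s) is not None
```

The only string on which the two differ is the empty string: `in_star("")` holds while `in_plus("")` does not.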

RELATING REGULAR EXPRESSIONS AND FINITE AUTOMATA

Below, we shall see the design of finite automata from given regular expressions.

Regular Expression 1: Regular language,
L1 = a(a+b)*

The language of the given RE is

{a, aa, ab, aba, abb, ...}

Strings always start with 'a'. Its finite automaton will be as below-

In the above transition diagram, we can see that the initial state 'Y', on getting 'a' as the input, transits to a final state 'Z', and so on for the remaining states. Thus this FA accepts all the strings of the given RE language.

Regular Expression 2: Regular language,


L2 = (a+b)*a

The language of the given RE is,

{aaa, aba, baa, bba}

Strings always end with 'a'. Its finite automaton will be as below-

In the above transition diagram, we can see that the initial state 'Y', on getting 'a' as the input, transits to a final state 'Z', and on getting 'b' as the input it remains in the same state, and so on for the remaining states. Thus this FA accepts all the strings of the given RE language.

Regular Expression 3: Regular language,


L3 = (a+b)*a(a+b)*

The language of the given RE is,

{aaa, aba, baa, bba}

Strings containing at least one 'a'. Its finite automaton will be as below-

In the above transition diagram, we can see that the initial state 'Y', on getting 'b' as the input, remains in the same state, and on getting 'a' as the input it transits to a final state 'Z', and so on for the remaining states. Thus this FA accepts all the strings of the given RE language.

Regular Expression 4: Regular language,


L4 = (a(a+b)*b)+(b(a+b)*a)

The language of the given RE is,

{aab, abb, baa, bba}

Strings start and end with different symbols. Its finite automaton will be as below-

In the above transition diagram, we can see that the initial state 'V', on getting 'a' as the input, transits to a state 'W', and so on for the remaining states. Thus this FA accepts all the strings of the given RE language.

Regular Expression 5: Regular language,


L5 = (a(a+b)*a)+(b(a+b)*b)+a+b+ε

The language of the given RE is,

{ε, a, b, aba, bab, bbab}

Strings start and end with the same symbol. Its finite automaton will be as below-

In the above transition diagram, we can see that the initial and final state 'V', on getting 'a' as the input, transits to another final state 'W', and so on for the remaining states. Thus this FA accepts all the strings of the given RE language.

Conversion from NFA to DFA
Example 1
Consider a Non-deterministic finite automata (NFA) and convert that NFA into
equivalent Deterministic Finite Automata (DFA).

Solution

Let’s construct NFA transition table for the given diagram −

States\inputs | a       | b
→q0           | {q0,q1} | q0
q1            | q2      | q1
q2            | q3      | q3
q3            | -       | q2
A DFA cannot be in multiple states at once, so consider {q0, q1} as a single state while constructing the DFA.
Let's convert the above table into an equivalent DFA.

States\inputs | a          | b
→q0           | [q0,q1]    | q0
[q0,q1]       | [q0q1q2]   | [q0q1]
*[q0q1q2]     | [q0q1q2q3] | [q0q1q3]
*[q0q1q2q3]   | [q0q1q2q3] | [q0q1q2q3]
*[q0q1q3]     | [q0q1q2]   | [q0q1q2]


In the NFA the final states are q2 and q3; wherever q2 or q3 is present, that DFA state becomes a final state.

Now the transition diagram for DFA is as follows −


Example 2

Convert the given NFA to DFA.

State | 0        | 1
→q0   | {q0, q1} | {q1}
*q1   | ϕ        | {q0, q1}

Solution: For the given transition diagram we will first construct the transition
table.

Now we will obtain δ' transition for state q0.

δ'([q0], 0) = {q0, q1}


= [q0, q1] (new state generated)
δ'([q0], 1) = {q1} = [q1]

The δ' transition for state q1 is obtained as:

δ'([q1], 0) = ϕ
δ'([q1], 1) = [q0, q1]

Now we will obtain the δ' transition on [q0, q1]:

δ'([q0, q1], 0) = δ(q0, 0) ∪ δ(q1, 0)
              = {q0, q1} ∪ ϕ
              = {q0, q1}
              = [q0, q1]

Similarly,

δ'([q0, q1], 1) = δ(q0, 1) ∪ δ(q1, 1)
              = {q1} ∪ {q0, q1}
              = {q0, q1}
              = [q0, q1]

As in the given NFA, q1 is a final state, then in DFA wherever, q1 exists that state
becomes a final state. Hence in the DFA, final states are [q1] and [q0, q1].
Therefore set of final states F = {[q1], [q0, q1]}.

The transition table for the constructed DFA will be:

State     | 0        | 1
→[q0]     | [q0, q1] | [q1]
*[q1]     | ϕ        | [q0, q1]
*[q0, q1] | [q0, q1] | [q0, q1]
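As a sanity check, the constructed DFA should accept exactly the strings the original NFA accepts. A Python sketch encoding both machines from the tables above (the state names A, B, C and the explicit dead state are our own relabeling of [q0], [q1], [q0, q1] and ϕ):

```python
from itertools import product

# NFA of Example 2: delta[(state, symbol)] -> set of next states.
NFA = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q1"},
       ("q1", "1"): {"q0", "q1"}}        # q1 has no move on 0
NFA_FINAL = {"q1"}

# DFA from the table above, with A = [q0], B = [q1], C = [q0, q1]
# and an explicit dead state for ϕ.
DFA = {"A": {"0": "C", "1": "B"},
       "B": {"0": "dead", "1": "C"},
       "C": {"0": "C", "1": "C"},
       "dead": {"0": "dead", "1": "dead"}}
DFA_FINAL = {"B", "C"}

def nfa_accepts(w):
    cur = {"q0"}
    for c in w:
        cur = set().union(*(NFA.get((q, c), set()) for q in cur))
    return bool(cur & NFA_FINAL)

def dfa_accepts(w):
    s = "A"
    for c in w:
        s = DFA[s][c]
    return s in DFA_FINAL

# The two machines agree on every string of length up to 5:
agree = all(nfa_accepts("".join(w)) == dfa_accepts("".join(w))
            for n in range(6) for w in product("01", repeat=n))
```

Exhaustively checking short strings like this is a quick way to catch a transcription error in a hand-built subset construction.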

The Transition diagram will be:

We can also rename the states of the DFA.

Suppose

1. A = [q0]
2. B = [q1]
3. C = [q0, q1]

With these new names the DFA will be as follows:

Example 3
Let us consider the NDFA shown in the figure below.

q | δ(q,0)      | δ(q,1)
a | {a,b,c,d,e} | {d,e}
b | {c}         | {e}
c | ∅           | {b}
d | {e}         | ∅
e | ∅           | ∅

Using the subset construction algorithm above, we find its equivalent DFA. The state table of the DFA is shown below.

q           | δ(q,0)      | δ(q,1)
[a]         | [a,b,c,d,e] | [d,e]
[a,b,c,d,e] | [a,b,c,d,e] | [b,d,e]
[d,e]       | [e]         | ∅
[b,d,e]     | [c,e]       | [e]
[e]         | ∅           | ∅
[c,e]       | ∅           | [b]
[b]         | [c]         | [e]
[c]         | ∅           | [b]

The state diagram of the DFA is as follows −

SCANNER GENERATOR (LEX, FLEX)
Lex (lexical analyzer generator) is a tool that automatically generates a lexical
analyzer (a finite automaton). It takes a LEX source program as its input and
produces a lexical analyzer as its output. The lexical analyzer converts the input
string entered by the user into tokens.
LEX is a program generator designed for lexical processing of character
input streams. It can build anything from a simple text-search program that looks
for patterns in its input to the scanner of a C compiler that transforms a program
into optimized code.

In programs with structured input, two tasks occur over and over: dividing
the input into meaningful units, and discovering the relationships among the
units. For a C program the units are variable names, constants, and strings. This
division into units (called tokens) is known as lexical analysis, or LEXING. LEX
helps by taking a set of descriptions of possible tokens and producing a routine
called a lexical analyzer, LEXER, or scanner.

LEX Source Program
It is a language used for specifying or representing Lexical Analyzer.
There are two parts of the LEX source program −

 Auxiliary Definitions

 Translation Rules

 Auxiliary Definitions
It denotes regular-expression definitions of the form

D1 = R1
D2 = R2
...
Dn = Rn

Where

 Distinct Name (Di) → shortcut name of a Regular Expression

 Regular Expression (Ri) → notation to represent a collection of input symbols.
Example
Auxiliary Definition for Identifiers −

Auxiliary Definition for Signed Numbers
integer=digit digit*
sign = + | -

signedinteger = sign integer


Auxiliary Definition for Decimal Numbers
decimal = signedinteger . integer | sign.integer
Auxiliary Definition for Exponential Numbers
Exponential-No = (decimal | signedinteger) E signedinteger
Auxiliary Definition for Real Numbers

Real-No = decimal | Exponential-No
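In flex's own notation, the auxiliary definitions above could be written roughly as follows (a sketch; the definition names mirror the text and have not been run through flex):

```lex
digit          [0-9]
integer        {digit}+
sign           [+-]
signedinteger  {sign}{integer}
decimal        {signedinteger}\.{integer}
exponential    ({decimal}|{signedinteger})E{signedinteger}
real           {decimal}|{exponential}
```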


 Translation Rules
It is a set of rules or actions which tells the lexical analyzer what it has to do or
what it has to return to the parser on encountering a token.

It consists of statements of the form −


P1 {Action1}
P2 {Action2}

.
.
.
Pn {Actionn}
Where
Pi → Pattern or Regular Expression consisting of input alphabets and Auxiliary
definition names.
Actioni → It is a piece of code that gets executed whenever a token is recognized.
Each Actioni specifies a set of statements to be executed whenever each regular
expression or pattern Pi matches with the input string.

Example
Translation Rules for "Keywords"

We can see that if Lexical Analyzer is given the input "begin", it will recognize the
token "begin" and Lexical Analyzer will return 1 as integer code to the parser.

Translation Rules for "Identifiers"


letter (letter + digit)* {Install ( );return 6}
If Lexical Analyzer is given the token which is an "identifier", then the Action taken
by the Lexical Analyzer is to install or store the name in the symbol table & return
value 6 as integer code to the parser.
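Putting the two kinds of rules together, the translation-rules section described in this unit would look roughly like this (the token codes 1 and 6 follow the text, and Install() is the symbol-table routine mentioned above; both are illustrative):

```lex
begin                    {return 1;}              /* keyword "begin" */
letter(letter + digit)*  {Install(); return 6;}   /* identifier */
```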

Flex (Fast Lexical Analyzer Generator) is a tool for generating lexical analyzers
(scanners or lexers), written by Vern Paxson in C around 1987. It is used together
with the Berkeley Yacc parser generator or the GNU Bison parser generator. Flex
and Bison are more flexible than Lex and Yacc and produce faster code.

Bison produces a parser from the input file provided by the user. When flex is
provided with a .l file it generates the function yylex(), which the parser calls
to retrieve tokens from the token stream.
Note: The function yylex() is the main flex function that runs the Rules Section,
and .l is the extension used to save flex programs.

The image below describes how Flex is used:

Step 1: An input file named lex.l, written in the lex language, describes the lexical
analyzer to be generated. The lex compiler transforms lex.l into a C program, in a
file that is always named lex.yy.c.
Step 2: The C compiler compiles lex.yy.c into an executable file called a.out.

Step 3: The executable a.out takes a stream of input characters and produces a
stream of tokens.

Program Structure:
In the input file, there are 3 sections:
1. Definition Section: The definition section contains declarations of
variables, regular definitions, and manifest constants. In the definition section,
text is enclosed in "%{ %}" brackets; anything written inside these brackets is
copied directly to the file lex.yy.c.

Syntax:
%{

// Definitions

%}

2. Rules Section: The rules section contains a series of rules of the
form "pattern action"; the pattern must be unindented, and the action must begin
on the same line, enclosed in {} brackets. The rules section is enclosed between
"%%" and "%%".
Syntax:
%%
pattern action
%%
Examples: Table below shows some of the pattern matches.

Pattern | It can match with

[0-9] | any digit between 0 and 9

[0+9] | either 0, + or 9

[0, 9] | either 0, ',' (comma), ' ' (space) or 9

[0 9] | either 0, ' ' (space) or 9

[-09] | either -, 0 or 9

[-0-9] | either - or any digit between 0 and 9

[0-9]+ | one or more digits between 0 and 9

[^a] | all the other characters except a

[^A-Z] | all the other characters except the uppercase letters

a{2,4} | either aa, aaa or aaaa

a{2,} | two or more occurrences of a

a{4} | exactly 4 a's, i.e. aaaa

. | any character except newline

a* | 0 or more occurrences of a

a+ | 1 or more occurrences of a

[a-z] | all lowercase letters

[a-zA-Z] | any alphabetic letter

w(x|y)z | wxz or wyz


3. User Code Section: This section contains C statements and additional
functions. These functions can also be compiled separately and linked with the
lexical analyzer.
Basic Program Structure:
%{
// Definitions
%}
%%
Rules
%%

User code section


How to run the program:
To run the program, first save it with the extension .l or .lex. Then run the
commands below in a terminal.
Step 1: lex filename.l or lex filename.lex, depending on the extension the file is
saved with
Step 2: gcc lex.yy.c
Step 3: ./a.out
Step 4: Provide the input to program in case it is required
Note: Press Ctrl+D, or use a rule that returns 0, to stop taking input from the
user. Refer to the outputs of the programs below if in doubt about how to run them.

Example 1: Count the number of characters in a string

/*** Definition Section has one variable


which can be accessed inside yylex()
and main() ***/

%{
int count = 0;
%}
/*** Rule Section has three rules, first rule
matches with capital letters, second rule
matches with any character except newline and
third rule does not take input after the enter***/

%%
[A-Z] {printf("%s capital letter\n", yytext);
count++;}

. {printf("%s not a capital letter\n", yytext);}


\n {return 0;}
%%
/*** Code Section prints the number of
capital letters present in the given input ***/
int yywrap(){ return 1; }  /* 1 = no more input */

int main(){
// Explanation:
// yywrap() - wraps the above rule section
/* yyin - takes the file pointer
which contains the input*/
/* yylex() - this is the main flex function
which runs the Rule Section*/

// yytext is the text in the buffer


// Uncomment the lines below
// to take input from file
// FILE *fp;
// char filename[50];
// printf("Enter the filename: \n");
// scanf("%s",filename);
// fp = fopen(filename,"r");
// yyin = fp;
yylex();

printf("\nNumber of Capital letters "
       "in the given input - %d\n", count);
return 0;
}

Example 2: Count the number of characters and number of lines in the
input

/* Declaring two counters, one for the number
of lines, the other for the number of characters */

%{
int no_of_lines = 0;
int no_of_chars = 0;
%}
/***rule 1 counts the number of lines,
rule 2 counts the number of characters
and rule 3 specifies when to stop

taking input***/
%%
\n ++no_of_lines;
. ++no_of_chars;
end return 0;
%%
/*** User code section***/
int yywrap(){ return 1; }  /* 1 = no more input */

int main(int argc, char **argv)


{
yylex();
printf("number of lines = %d, number of chars = %d\n",
no_of_lines, no_of_chars );
return 0;
}

Output:

9. ASSIGNMENT : UNIT – I

Set Number (Category) | Questions

Set - 1 (Toppers)
(i) Show how the statement a := b + c * 60 is processed by the compiler. Show
the output at each stage of the compiler and the contents of the symbol table.
(CO1, K2)
(ii) Write a LEX program to find the total number of vowels and consonants in a
given input string. (CO1, K3)

Set - 2 (Above Average)
(i) Describe the various phases of the compiler and trace the program segment
a := b + c * 4 through all phases. (CO1, K2)
(ii) Write a LEX program to count the total number of words, lines and
characters in a given input file. (CO1, K3)

Set - 3 (Average)
(i) What is a compiler? Explain the various phases of a compiler in detail.
(CO1, K1)
(ii) Define the terms: lexeme, token and pattern. (CO1, K1)

Set - 4 (Below Average)
(i) Explain the grouping of phases in detail. (CO1, K1)
(ii) List the various compiler construction tools and explain them in detail.
(CO1, K1)

Set - 5 (Slow Learners)
(i) Explain input buffering techniques. (CO1, K1)
(ii) Discuss the role of the lexical analyzer in detail. (CO1, K1)

10. PART A : Q & A : UNIT – I
SNo Questions and Answers CO K
1 What are the two parts of a compilation? Explain
briefly.
Analysis and Synthesis are the two parts of compilation.
(Front-end & Back-end)
● The analysis part breaks up the source program K1
into constituent pieces and creates an
intermediate representation of the source program.
● The synthesis part constructs the desired target
program from the intermediate representation.
2 List the various compiler construction tools.
● Parser generators
● Scanner Generator
● Syntax-directed translation engines K1
● Automatic code generators
● Dataflow Engines
● Compiler construction tool kits
3 Differentiate compiler and interpreter.
● The machine-language target program produced
by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. K1
● An interpreter, however, can usually give better
CO1
error diagnostics than a compiler, because it executes
the source program statement by statement.
4 Define tokens, Patterns, and lexemes.
● Tokens- Sequence of characters that have a collective
meaning. A token of a language is a category of its
lexemes.
● Patterns- There is a set of strings in the input for
which the same token is produced as output. This K1
set of strings is described by a rule called a
pattern associated with the token.
● Lexeme- A sequence of characters in the source
program that is matched by the pattern for a token.
5 Describe the possible error recovery actions in lexical
analyzer.
·Panic mode recovery
·Deleting an extraneous character
·Inserting a missing character K1
·Replacing an incorrect character by a correct character
·Transposing two adjacent characters

6 Write the regular expression for identifier.
letter -> A | B | ……| Z | a | b | …..| z
K2
digit -> 0 | 1 |…….|9
id -> letter (letter | digit ) *
7 Write the regular expression for Number.
digit -> 0 | 1 |…….| 9
digits -> digit digit*
optional_fraction -> . digits | € K2
optional_exponent -> ( E ( + | - | € ) digits ) | €
num -> digits optional_fraction optional_exponent
8 List the phases that constitute the front end of a
compiler.
The front end consists of those phases, or parts of
phases, that depend primarily on the source language
and are largely independent of the target
machine. These include
● Lexical and Syntactic analysis K1
● The creation of symbol table
● Semantic analysis CO1
● Generation of intermediate code
A certain amount of code optimization can be done by
the front end as well, and error handling goes along
with each of these phases.
9 Mention the back-end phases of a compiler.
The back end of the compiler includes those portions
that depend on the target machine and generally
do not depend on the source
language, just the intermediate language. These K1
include code optimization and
code generation, along with error handling and symbol-
table operations.
10 What is the role of lexical analysis phase?
Lexical analyzer reads the source program one
character at a time, and grouped into a sequence
of atomic units called tokens. Identifiers, keywords,
constants, operators and punctuation symbols such as
commas, parenthesis, are typical tokens. K1
Identifiers are entered into the Symbol Table.
Removes comments, whitespaces (blanks, newline , tab)
Keeps track of the line numbers.
Identifies Lexical Errors and appropriate error messages.


11 What is Sentinel ?
For each character read, we make two tests in lexical
analysis phase: one for the end of the buffer, and one
to determine what character is read (the latter may be K1
a multiway branch). We can combine the buffer-end
test with the test for the current character if we extend
each buffer to hold a sentinel character at the end. The
sentinel is a special character that cannot be part of
the source program, and a natural choice is the
character eof.
12 Define Symbol Table.
A symbol table is a data structure containing a record for
each identifier, with fields for the attributes of the
identifier. The data structure allows us to find the record
for each identifier quickly and to store or retrieve data K1
from that record quickly. CO1
Whenever an identifier is detected by a lexical analyzer,
it is entered into the symbol table. The attributes of an
identifier cannot be determined by the lexical analyzer.
13 Define a preprocessor.
A source program may be divided into modules
stored in separate files. The task of collecting the
source program is sometimes entrusted to a separate K1
program, called a preprocessor. The preprocessor may
also expand shorthands, called macros, into source
language statements.
14 What are the functions of the pre-processors?
The task of collecting the source program is sometimes
entrusted to a separate program, called a preprocessor. K1
The preprocessor may also expand shorthands, called
macros, into source language statements.
15 What are the various parts in LEX program?
Lex program will be in following form
declarations
%%
translation rules K2
%%
auxiliary functions

16 How will you group the phases of the compiler?


The front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation
might be grouped together into one pass. Code
optimization might be an optional pass. K1
Then there could be a back-end pass consisting of code
generation for a particular target machine.

17 Name some variety of intermediate forms.


· Postfix notation or polish notation.
· Syntax tree
· Three address code K1
· Quadruple
· Triple

18 Write a regular expression to describe a


languages consists of strings made of even
numbers a and b.
K1
r=((a|b)(a|b))*
CO1

19 Give the transition diagram for an identifier.

K1

20 Write a short notes on LEX.


A LEX source program is a specification of lexical
analyzer consisting of set of regular expressions
together with an action for each regular expression. The
action is a piece of code, which is to be executed K1
whenever a token specified by the corresponding
regular expression is recognized. The output of a LEX is
a lexical analyzer program constructed from the LEX
source specification.



21 What is Translator and mention its types?
Translator : It is a program that translates one language to
another.

Types of Translator: K1
1.Interpreter
2.Compiler
3.Assembler
22 Mention the uses of SYMBOL TABLE MANAGEMENT
* Symbol table is used to store all the information about
identifiers used in the program. K1
* It is a data structure containing a record for each
identifier, with fields for the attributes of the identifier.
23 What is Parser Generators?
- These produce syntax analyzers, normally from
input that is based on a context-free grammar.
K1
- It consumes a large fraction of the running
time of a compiler.
- Example - YACC CO1
(Yet Another Compiler Compiler).
24 What is Scanner Generator?

- These generate lexical analyzers, normally from a


specification based on regular expressions. K1
- The basic organization of lexical analyzers is based on
finite automata.
25 Write a short notes on Assembler:
It translates assembly level language to machine code. K1
Example: Microprocessor 8085, 8086.

26. Mention the parts of compilation:


There are 2 parts to compilation:
1. Analysis
2. Synthesis
Analysis part breaks down the source program into K1
constituent pieces and creates an
intermediate representation of the source program.
Synthesis part constructs the desired target program from
the intermediate representation



27 List the Software tools used in Analysis part.
1. Structure editors
2. Pretty printers K1
3. Static checkers
4. Interpreters
28 Mention the various phases of Compiler.
Main phases:
1) Lexical analysis
2) Syntax analysis K1
3) Semantic analysis
4) Intermediate code generation
5) Code optimization
6) Code generation
Sub-Phases:
1) Symbol table management
2) Error handling
29 What are the uses of Semantic Analysis?
-> It is the third phase of the compiler.
-> It gets input from syntax analysis as a K1
parse tree and checks whether the parse tree is
CO1
semantically consistent with the language definition.
-> It performs type checking and inserts the required
type conversions (coercions) between data types.
30 Mention the benefits of Code Optimization.
* It is the fifth phase of the compiler.
* It gets the intermediate code as input and produces
optimized intermediate code as
output.
* This phase reduces the redundant code and attempts to
improve the intermediate code so that faster-running
machine code will result.
* During the code optimization, the result of the program
is not affected.
K1
* To improve the code generation, the optimization
involves
- deduction and removal of dead code (unreachable
code).
- calculation of constants in expressions and terms.
- collapsing repeated (common) subexpressions into a temporary variable.
- loop unrolling.
- moving code outside the loop.
- removal of unwanted temporary variables.
11. PART B QUESTIONS : UNIT – I

(CO1, K2)

1. Describe the various phases of compiler and trace it with the program segment
(position := initial + rate * 60).
2. Explain in detail the process of compilation. Illustrate the output of each phase of
the compilation for the input “a = (b+c) * (b+c) *2”
3. Discuss Input buffering techniques in detail.
4. Construct DFA for the Regular expression (a | b)*a.
5. Construct DFA using the regular expression (a/b)*abb.
6. Construct the NFA from the (a|b)*a(a|b) using Thompson’s construction
algorithm.
7. Construct the DFA for the augmented regular expression (a | b )* # directly
using syntax tree.
8. Write an algorithm for minimizing the number of states of a DFA.

PART C QUESTIONS
(CO1, K3)

1. Explain the structure of a LEX program with an example.


2. Illustrate how does LEX work?
3. Consider the regular expression below which can be used as part of a
specification of the definition of exponents in floating-point numbers. Assume that
the alphabet consists of numeric digits (‘0’ through ‘9’) and alphanumeric
characters ('a' through 'z' and 'A' through 'Z') with the addition of a selected small
set of punctuation and special characters (say, in this example, only the characters
'+' and '-' are relevant). Also, in this representation of regular expressions the
character '.' denotes concatenation.
Exponent = (+|-|€) . (E | e) . (digit)+
(i) Derive an NFA capable of recognizing its language using Thompsons’
construction.
(ii) Derive the DFA for the NFA found in i) above using subset construction.
(iii) Minimize the DFA found in (ii) above.
4. Write Lex program to identify the following tokens - relational operators , arithmetic
operators and keywords (if, while, do, switch, for ).

12. Supportive online Certification courses

NPTEL : https://nptel.ac.in/courses/106105190
Swayam : https://www.classcentral.com/course/swayam-compiler-design-12926
coursera : https://www.coursera.org/learn/nand2tetris2
Udemy : https://www.udemy.com/course/introduction-to-compiler-construction-and-design/
Mooc : https://www.mooc-list.com/course/compilers-coursera
Edx : https://www.edx.org/course/compilers

13. Real time Applications in day to day life and to Industry

1. Characterizing Tokens in the JavaCC Grammar File


2. Use the StreamTokenizer object to implement an interactive
calculator
3. Regular expressions which can be useful in the real-world
applications
○ Email validation :
○ Password validation:
○ Valid date format:
○ Empty string validation
○ Phone number validation
○ Credit card number Validation
4. Regular expressions are useful in a wide variety of text
processing tasks, include data validation, data scraping
(especially web scraping), data wrangling, simple parsing, the
production of syntax highlighting systems, and many other tasks.
5. While regexps would be useful on Internet search engines,
processing them across the entire database could consume
excessive computer resources depending on the complexity and
design of the regex.
6. One salient example of a commercial, real-life application of
deterministic finite automata (DFAs) is Microsoft's historical
reliance on finite-state technology from Xerox PARC. Microsoft has
used two-level automata for spell-checking and other functions in
various products for three decades.

14. CONTENTS BEYOND SYLLABUS : UNIT – I

1. Lexical analysis and Java:


Learn how to convert human readable text into machine readable data using
the StringTokenizer and StreamTokenizer classes

The purpose of lexical analyzers is to take a stream of input characters and


decode them into higher level tokens that a parser can understand. Parsers
consume the output of the lexical analyzer and operate by analyzing the sequence
of tokens returned. The parser matches these sequences to an end state, which
may be one of possibly many end states. The end states define the goals of the
parser.
When an end state is reached, the program using the parser does some
action -- either setting up data structures or executing some action-specific code.
Additionally, parsers can detect -- from the sequence of tokens that have been
processed -- when no legal end state can be reached; at that point the parser
identifies the current state as an error state. It is up to the application to decide
what action to take when the parser identifies either an end state or an error state.
The standard Java class base includes a couple of lexical analyzer classes,
however it does not define any general-purpose parser classes. In this column I'll
take an in-depth look at the lexical analyzers that come with Java.

Java's lexical analyzers


The Java Language Specification, version 1.0.2, defines two lexical analyzer
classes, StringTokenizer and StreamTokenizer. From their names you can deduce
that StringTokenizer uses String objects as its input, and StreamTokenizer uses
InputStream objects.

As a lexical analyzer, StringTokenizer could be formally defined as shown below.

[~delim1,delim2,...,delimN] :: Token

Detailed description can be obtained from the link,


https://www.infoworld.com/article/2076874/lexical-analysis-and-java--part-1.html

Contents beyond the Syllabus

2. Lexical Analyzer in C++

The steps below outline building a lexical analyzer in C++.


 Include header files.
 Write a function to split the sentence into tokens.
 Define tokens and rules for each token type.
 Write code to output tokens from the input sentence.
 Test and debug your code until it works correctly.
A compiler is responsible for converting a high-level language into machine
language. There are several phases involved in this, and lexical analysis is the
first phase.

The lexical analyzer reads the characters from the source code and converts them into tokens.

Different tokens or lexemes are:

 Keywords
 Identifiers
 Operators
 Constants

Detailed description can be obtained from the link,

https://www.thecrazyprogrammer.com/2017/02/lexical-analyzer-in-c.html

https://favtutor.com/blogs/lexical-analyzer-cpp

https://www.delftstack.com/howto/cpp/lexical-analyzer-cpp/

15. ASSESSMENT SCHEDULE

• Tentative schedule for the Assessment During 2024-2025 Even Semester:

S.No | Name of the Assessment | Start Date | End Date | Portion

1 | Unit Test 1 | 06.01.2025 | 11.01.2025 | UNIT 1

2 | Internal Assessment Test 1 | 28.01.2025 | 03.02.2025 | UNIT 1 & 2

3 | Unit Test 2 | 17.02.2025 | 22.02.2025 | UNIT 3

4 | Internal Assessment Test 2 | 10.03.2025 | 15.03.2025 | UNIT 3 & 4

5 | Model Examination | 03.04.2025 | 17.04.2025 | ALL 5 UNITS

16. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS

• TEXT BOOKS:

Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, “Compilers:


Principles, Techniques and Tools”, Second Edition, Pearson Education
Limited, 2014.

• REFERENCE BOOKS:
1.Randy Allen, Ken Kennedy, “Optimizing Compilers for Modern Architectures: A
Dependence-based Approach”, Morgan Kaufmann Publishers, 2002.
2.Steven S. Muchnick, “Advanced Compiler Design and Implementation”, Morgan
Kaufmann Publishers - Elsevier Science, India, Indian Reprint, 2003.
3.Keith D Cooper and Linda Torczon, “Engineering a Compiler”, Morgan
Kaufmann Publishers, Elsevier Science, 2004.
4.V. Raghavan, “Principles of Compiler Design”, Tata McGraw Hill Education
Publishers, 2010.
5. Allen I. Holub, “Compiler Design in C”, Prentice-Hall Software Series, 1993.

17. MINI PROJECT SUGGESTION
• Objective:
To design DFA, Convertors of NFA to DFA and Lex program for languages like
C,PASCAL , JAVA , etc. (Mini Project)
• Planning:
• This method is mostly used to improve the ability of students in application
domain and also to reinforce knowledge imparted during the lecture.

• Being a technical institute, this method is extensively used to provide empirical


evidence of theory learnt.
• Students are asked to prepare mini projects involving application of the
concepts, principles or laws learnt.
• The faculty guides the students at various stages of developing the project
and gives timely inputs for the development of the model.

Projects:
Set Number (Category) | Mini Project Title

Set - 1 (Toppers): Convert the following regular expression into a DFA using the
JFLAP tool: 10 + (0 + 11) 0* 1 (CO1, K3)

Set - 2 (Above Average): Prepare a project to convert a regular expression to a
DFA (from the compiler point of view). (CO1, K3)

Set - 3 (Average): Design a DFA that describes the behavior of a vending machine
which accepts dollars and quarters and charges $1.25 per soda. Once the
machine receives at least $1.25, the soda can be dispensed to the user. The
machine will not dispense a soda until at least $1.25 has been deposited, and it
will not accept more money once it has already received greater than or equal to
$1.25. (CO1, K3)

Set - 4 (Below Average): Design a DFA to implement a Pac-Man game. The
Pac-Man game play requires the player to navigate through a maze, eating pellets
and avoiding the ghosts who chase him through the maze. Occasionally, Pac-Man
can turn the tables on his pursuers by eating a power pellet, which temporarily
grants him the power to eat the ghosts. When this occurs, the ghosts' behavior
changes, and instead of chasing Pac-Man they try to avoid him. If the ghosts are
eaten by Pac-Man the game will be regenerated. (CO1, K3)

Set - 5 (Slow Learners): Write a LEX program to find the biggest among three
numbers. (CO1, K3)
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of Educational
Institutions. If you have received this document through email in error, please notify the system manager. This
document contains proprietary information and is intended only to the respective group / learning community as
intended. If you are not the addressee you should not disseminate, distribute or copy through e-mail. Please notify
the sender immediately by e-mail if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any
action in reliance on the contents of this information is strictly prohibited.
