
COURSE GUIDE

CIT 316
COMPILER CONSTRUCTION I (PRINCIPLES
AND TECHNIQUES OF COMPILER)

Course Team A.A. Afolorunso (Course Developer/Writer) – NOUN
Dr. Oludele Awodele (Course Editor) – Babcock University, Ilisan-Remo
Prof. Kehinde Obidairo (Programme Leader) – NOUN

NATIONAL OPEN UNIVERSITY OF NIGERIA


CIT 316 COURSE GUIDE

National Open University of Nigeria Headquarters
Nnamdi Azikiwe Expressway
Airport Road
Jabi, Abuja

e-mail: [email protected]
URL: www.nou.edu.ng

Published by:
National Open University of Nigeria

Printed 2013, 2022

ISBN: 978-058-470-6

All Rights Reserved


CONTENTS

Introduction
What You Will Learn in This Course
Course Aims
Course Objectives
Working through this Course
Course Materials
Study Units
Textbooks and References
Assignment File
Presentation Schedule
Assessment
Tutor-Marked Assignment
Final Examination and Grading
Course Marking Scheme
Course Overview
How to Get the Best from This Course
Tutors and Tutorials
Summary


INTRODUCTION

CIT 316 – Compiler Construction I is a three (3) credit unit course of 17 units. With the increasing diversity and complexity of computers and their applications, the development of efficient, reliable software has become increasingly dependent on automatic support from compilers and other programme analysis and translation tools. This course covers principal topics in understanding and transforming programmes at the code block, function, programme, and behaviour levels. Specific techniques for imperative languages include data flow, dependence, inter-procedural, and profiling analyses, resource allocation, and multi-grained parallelism on both CPUs and GPUs.

It is a course for B.Sc. Computer Science major students, and is normally taken in a student's fourth year. It should appeal to anyone who is interested in the design and implementation of programming languages. Anyone who does a substantial amount of programming should find the material valuable.

This course is divided into four modules. The first module deals with the review of grammars, languages and automata, and an introduction to compilers.

The second module treats lexical analysis extensively, discussing such concepts as the scanner and lexical analyser generation.

The third module deals with syntax analysis and discusses context-free grammars, LL(k), LR(k), operator-precedence grammars, etc. The implementation of various parsing techniques is also discussed.

The fourth module, which is the concluding module of the course, discusses code generation issues such as symbol tables, intermediate representation, code optimisation techniques and code generation.

This Course Guide gives you a brief overview of the course contents,
course duration, and course materials.

WHAT YOU WILL LEARN IN THIS COURSE

The main purpose of this course is to acquaint students with software tools and techniques which are applicable both to compilers and to the implementation of system utility routines, command interpreters, etc. We intend to achieve this through the aims set out below.


COURSE AIMS

First, students will learn the key techniques in modern compiler construction, preparing them for industry demand for compiler engineers.

Second, students will understand the rationale of various programme analysis and optimisation techniques, and be able to improve their programming skills accordingly.

The third goal is to build the foundation for students to pursue research in the areas of compilers, programme analysis, programme modelling, and operating systems.

COURSE OBJECTIVES

Certain objectives have been set out to ensure that the course
achieves its aims. Apart from the course objectives, every unit of this
course has set objectives. In the course of the study, you will need to
confirm, at the end of each unit, if you have met the objectives set at
the beginning of each unit. Upon completing this course you should
be able to:

• recognise various classes of grammars, languages, and automata, and employ these to solve common software problems
• explain the major steps involved in compiling a high-level programming language down to a low-level target machine language
• construct and use the major components of a modern compiler
• work together effectively in teams on a substantial software implementation project.

Related Courses
Prerequisites: CIT 342; Computer Science students only

WORKING THROUGH THIS COURSE

In order to have a thorough understanding of the course units, you will need to read and understand the contents, practise the steps by designing a compiler of your own for a known language, and be committed to learning and implementing your knowledge.

This course is designed to cover approximately seventeen weeks, and it will require your devoted attention. You should do the exercises in the Tutor-Marked Assignments and submit them to your tutors.

COURSE MATERIALS

These include:

1. Course Guide
2. Study Units
3. Recommended Texts
4. A file for your assignments and for records to monitor your
progress.

STUDY UNITS

There are 17 study units in this course:

Module 1 Introduction to Compilers

Unit 1 Review of Grammars, Languages and Automata
Unit 2 What is a Compiler?
Unit 3 The Structure of a Compiler

Module 2 Lexical Analysis

Unit 1 The Scanner
Unit 2 Hand Implementation of Lexical Analyser
Unit 3 Automatic Generation of Lexical Analyser
Unit 4 Implementing a Lexical Analyser

Module 3 Syntax Analysis

Unit 1 Context-Free Grammars
Unit 2 Bottom-Up Parsing Techniques
Unit 3 Precedence Parsing
Unit 4 Top-Down Parsing Techniques
Unit 5 LR Parsers

Module 4 Code Generation

Unit 1 Error Handling
Unit 2 Symbol Tables
Unit 3 Intermediate Code Generation
Unit 4 Code Generation
Unit 5 Code Optimisation

Make use of the course materials and do the exercises to enhance your learning.


TEXTBOOKS AND REFERENCES

Aho, Alfred, Sethi, Ravi & Ullman, Jeffrey. Compilers: Principles, Techniques, and Tools. ISBN 0201100886. The classic Dragon book.

Appel, A. Modern Compiler Implementation in Java, 2nd ed. Cambridge University Press, 2002.

Appel, Andrew. Modern Compiler Implementation in C/Java/ML (respectively ISBN 0-521-58390-X, ISBN 0-521-58388-8, ISBN 0-521-58274-1). A set of cleanly written texts on compiler design, studied from various methodological perspectives.

Brown, P.J. Writing Interactive Compilers and Interpreters. ISBN 047127609X. Useful practical advice, not much theory.

Fischer, Charles & LeBlanc, Richard. Crafting a Compiler. ISBN 0805332014. Uses an Ada-like pseudo-code.

Fischer, LeBlanc & Cytron. Crafting a Compiler Implementation. Addison-Wesley.

Holub, Allen. Compiler Design in C. ISBN 0131550454. Extensive examples in C.

Hunter, R. The Design and Construction of Compilers. ISBN 0471280542. Several chapters on the theory of syntax analysis, plus discussion of parsing difficulties caused by features of various source languages.

Cooper, Keith D. & Torczon, Linda. Engineering a Compiler. Morgan Kaufmann Publishers, 2004.

Pemberton, S. & Daniels, M.C. Pascal Implementation: The P4 Compiler. ISBN 0853123586 (Discussion) and ISBN 085312437X (Compiler listing). Complete listing and readable commentary for a Pascal compiler written in Pascal.

Allen, Randy & Kennedy, Ken. Optimising Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2001.

Weinberg, G.M. The Psychology of Computer Programming: Silver Anniversary Edition. ISBN 0932633420. Interesting insights and anecdotes.


Wirth, Niklaus. Compiler Construction. ISBN 0201403536. From the inventor of Pascal, Modula-2 and Oberon-2; examples in Oberon.

ASSIGNMENT FILE

These are of two types: the self-assessment exercises and the Tutor-Marked Assignments. The self-assessment exercises will enable you to monitor your performance by yourself, while the Tutor-Marked Assignment is a supervised assignment. The assignments take a certain percentage of your total score in this course. The Tutor-Marked Assignments will be assessed by your tutor within a specified period. The examination at the end of this course will aim at determining the level of mastery of the subject matter. This course includes twelve Tutor-Marked Assignments and each must be done and submitted accordingly. Your best scores, however, will be recorded for you. Be sure to send these assignments to your tutor before the deadline to avoid loss of marks.

PRESENTATION SCHEDULE

The Presentation Schedule included in your course materials gives you the important dates for the completion of tutor-marked assignments and attending tutorials. Remember, you are required to submit all your assignments by the due date. You should guard against lagging behind in your work.

ASSESSMENT

There are two aspects to the assessment of the course. First are the tutor-marked assignments; second is a written examination.

In tackling the assignments, you are expected to apply information and knowledge acquired during this course. The assignments must be submitted to your tutor for formal assessment in accordance with the deadlines stated in the Assignment File. The work you submit to your tutor for assessment will count for 30% of your total course mark.

At the end of the course, you will need to sit for a final three-hour examination. This will count for 70% of your total course mark.


TUTOR-MARKED ASSIGNMENTS (TMAs)

There are twenty-two tutor-marked assignments in this course. You need to submit all the assignments. The total marks for the best three (3) assignments will be 30% of your total course mark.

Assignment questions for the units in this course are contained in the
Assignment File. You should be able to complete your assignments
from the information and materials contained in your set textbooks,
reading and study units. However, you may wish to use other
references to broaden your viewpoint and provide a deeper
understanding of the subject.

When you have completed each assignment, send it, together with a tutor-marked assignment form, to your tutor. Make sure that each assignment reaches your tutor on or before the deadline given. If, however, you cannot complete your work on time, contact your tutor before the assignment is due to discuss the possibility of an extension.

FINAL EXAMINATION AND GRADING

The final examination for the course will carry 70% of the total marks available for this course. The examination will cover every aspect of the course, so you are advised to revise all your corrected assignments before the examination.

This course endows you with the status of a teacher and that of a learner. This means that you teach yourself and that you learn, as your learning capabilities would allow. It also means that you are in a better position to determine and to ascertain the what, the how, and the when of your learning. No teacher imposes any method of learning on you.

The course units are similarly designed, with the introduction following the contents, then a set of objectives, and then the dialogue and so on.

The objectives guide you as you go through the units to ascertain your knowledge of the required terms and expressions.


COURSE MARKING SCHEME

This table shows how the actual course marking is broken down.

Assessment           Marks
Assignments 1–4      Four assignments; the best three marks of the four count for 30% of course marks
Final Examination    70% of overall course marks
Total                100% of course marks

COURSE OVERVIEW

Unit    Title of Work                                  Weeks Activity   Assessment (End of Unit)
        Course Guide                                   Week 1
Module 1  Introduction to Compilers
1       Unit 1 Review of Grammars, Languages
        and Automata                                   Week 1           Assignment 1
2       Unit 2 What is a Compiler?                     Week 2           Assignment 2
3       Unit 3 The Structure of a Compiler             Week 2           Assignment 3
Module 2  Lexical Analysis
1       Unit 1 The Scanner                             Week 3           Assignment 5
2       Unit 2 Hand Implementation of
        Lexical Analyser                               Week 3           Assignment 6
3       Unit 3 Automatic Generation of
        Lexical Analyser                               Week 4           Assignment 7
4       Unit 4 Implementing a Lexical Analyser         Week 4
Module 3  Syntax Analysis
1       Unit 1 Context-Free Grammars                   Week 5           Assignment 8
2       Unit 2 Bottom-Up Parsing Techniques            Week 6           Assignment 9
3       Unit 3 Precedence Parsing                      Weeks 7–8        Assignment 10
4       Unit 4 Top-Down Parsing Techniques             Weeks 8–9        Assignment 11
5       Unit 5 LR Parsers                              Weeks 10–11      Assignment 12
Module 4  Code Generation
1       Unit 1 Error Handling                          Week 12          Assignment 13
2       Unit 2 Symbol Tables                           Week 13          Assignment 14
3       Unit 3 Intermediate Code Generation            Week 14          Assignment 15
4       Unit 4 Code Generation                         Week 15          Assignment 16
5       Unit 5 Code Optimisation                       Week 16          Assignment 17
        Revision                                       Week 16
        Examination                                    Week 17
        Total                                          17 weeks


HOW TO GET THE BEST FROM THIS COURSE

In distance learning the study units replace the university lecturer. This is one of the great advantages of distance learning; you can read and work through specially designed study materials at your own pace, and at a time and place that suit you best. Think of it as reading the lecture instead of listening to a lecturer. In the same way that a lecturer might set you some reading to do, the study units tell you when to read your set books or other material. Just as a lecturer might give you an in-class exercise, your study units provide exercises for you to do at appropriate points.

Each of the study units follows a common format. The first item is
an introduction to the subject matter of the unit and how a particular
unit is integrated with the other units and the course as a whole. Next
is a set of learning objectives. These objectives enable you to know
what you should be able to do by the time you have completed the
unit. You should use these objectives to guide your study. When you
have finished the units you must go back and check whether you have
achieved the objectives. If you make a habit of doing this you will
significantly improve your chances of passing the course.

Remember that your tutor’s job is to assist you. When you need help,
don’t hesitate to call and ask your tutor to provide it.

1. Read this Course Guide thoroughly.
2. Organise a study schedule. Refer to the ‘Course Overview’ for more details. Note the time you are expected to spend on each unit and how the assignments relate to the units. Whatever method you choose to use, you should decide on it and write in your own dates for working on each unit.
3. Once you have created your own study schedule, do everything you can to stick to it. The major reason that students fail is that they lag behind in their course work.
4. Turn to Unit 1 and read the introduction and the objectives for the unit.
5. Assemble the study materials. Information about what you need for a unit is given in the ‘Overview’ at the beginning of each unit. You will almost always need both the study unit you are working on and one of your set books on your desk at the same time.
6. Work through the unit. The content of the unit itself has been arranged to provide a sequence for you to follow. As you work through the unit you will be instructed to read sections from your set books or other articles. Use the unit to guide your reading.


7. Review the objectives for each study unit to confirm that you
have achieved them. If you feel unsure about any of the
objectives, review the study material or consult your tutor.
8. When you are confident that you have achieved a unit’s
objectives, you can then start on the next unit. Proceed unit by
unit through the course and try to pace your study so that you
keep yourself on schedule.
9. When you have submitted an assignment to your tutor for
marking, do not wait for its return before starting on the next
unit. Keep to your schedule. When the assignment is
returned, pay particular attention to your tutor’s comments,
both on the tutor-marked assignment form and also written on
the assignment. Consult your tutor as soon as possible if you
have any questions or problems.
10. After completing the last unit, review the course and prepare
yourself for the final examination. Check that you have
achieved the unit objectives (listed at the beginning of each
unit) and the course objectives (listed in this Course Guide).

TUTORS AND TUTORIALS

There are 12 hours of tutorials provided in support of this course. You will be notified of the dates, times and location of these tutorials, together with the name and phone number of your tutor, as soon as you are allocated a tutorial group.

Your tutor will mark and comment on your assignments, keep a close
watch on your progress and on any difficulties you might encounter
and provide assistance to you during the course. You must mail or
submit your tutor-marked assignments to your tutor well before the
due date (at least two working days are required). They will be
marked by your tutor and returned to you as soon as possible.

Do not hesitate to contact your tutor by telephone or e-mail if you need help. The following might be circumstances in which you would find help necessary. Contact your tutor if:

• you do not understand any part of the study units or the assigned readings
• you have difficulty with the self-tests or exercises
• you have a question or problem with an assignment, with your tutor’s comments on an assignment, or with the grading of an assignment.

You should try your best to attend the tutorials. This is the only chance to have face-to-face contact with your tutor and to ask questions which are answered instantly. You can raise any problem encountered in the course of your study. To gain the maximum benefit from course tutorials, prepare a question list before attending them. You will learn a lot from participating actively in discussions.

SUMMARY

Compiler Construction I will introduce you to the concepts associated with programming, programming languages and the compilation process. The content of the course material was planned and written to ensure that you acquire the proper knowledge and skills for the appropriate situations. Real-life situations have been created to enable you to identify with and create some of your own. The essence is to help you in acquiring the necessary knowledge and competence by equipping you with the necessary tools to accomplish this.

We hope that by the end of this course you would have acquired the
required knowledge to view compilers, programming languages and
programming environments in a new way.

We wish you success with the course and hope that you will find it
both interesting and useful.

MAIN COURSE

CONTENTS

Module 1 Introduction to Compilers
Unit 1 Review of Grammars, Languages and Automata
Unit 2 What is a Compiler?
Unit 3 The Structure of a Compiler

Module 2 Lexical Analysis
Unit 1 The Scanner
Unit 2 Hand Implementation of Lexical Analyser
Unit 3 Automatic Generation of Lexical Analyser
Unit 4 Implementing a Lexical Analyser

Module 3 Syntax Analysis
Unit 1 Context-Free Grammars
Unit 2 Bottom-Up Parsing Techniques
Unit 3 Precedence Parsing
Unit 4 Top-Down Parsing Techniques
Unit 5 LR Parsers

Module 4 Code Generation
Unit 1 Error Handling
Unit 2 Symbol Tables
Unit 3 Intermediate Code Generation
Unit 4 Code Generation
Unit 5 Code Optimisation
CIT 316 MODULE 1

MODULE 1 INTRODUCTION TO COMPILERS

Unit 1 Review of Grammars, Languages and Automata
Unit 2 What is a Compiler?
Unit 3 The Structure of a Compiler

UNIT 1 REVIEW OF GRAMMARS, LANGUAGES AND AUTOMATA

CONTENTS

1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Formal Grammar
3.1.1 The Syntax of Grammars
3.1.2 The Semantics of Grammars
3.2 Types of Grammars and Automata
3.2.1 Type-0: Unrestricted Grammars
3.2.2 Type-1: Context-Sensitive Grammars
3.2.3 Type-2: Context-Free Grammars
3.2.4 Type-3: Regular Grammars
3.2.5 Analytic Grammars
3.3 Chomsky Hierarchy
3.3.1 The Hierarchy
3.4 Formal Languages
3.4.1 Words over an Alphabet
3.4.2 Language-Specification Formalisms
3.4.3 Operations on Languages
3.4.4 Uses of Formal Languages
3.5 Programming Languages
3.5.1 Formal Theories, Systems and Proofs
3.5.2 Interpretations and Models
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Reading

1.0 INTRODUCTION

As you learnt in CIT 342: Formal Languages and Automata Theory, in the field of Computer Science there are different types of grammars on which different languages are defined. For each of these grammars, there is a class of automata that can parse/recognise strings formed from the grammar. The set of all strings that can be generated from the grammar constitutes the language of the grammar.

In this unit, you will be taken through some of the things you learnt
previously on formal grammar, formal language and automata.

Now let us go through your study objectives for this unit.

2.0 OBJECTIVES

At the end of this unit, you should be able to:

• define formal grammar
• define alphabet, words and strings
• state the types of formal grammars we have in the field of Computer Science
• describe the class of automata that can recognise strings generated by each grammar
• identify strings that are generated by a particular grammar
• describe the Chomsky hierarchy
• explain the relevance of formal grammar and language to computer programming.

3.0 MAIN CONTENT

3.1 Formal Grammar

A formal grammar (sometimes called a grammar) is a set of rules of a specific kind, for forming strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax. A grammar describes only the form of the strings and not the meaning or what can be done with them in any context.

A formal grammar is a set of rules for rewriting strings, along with a "start symbol" from which rewriting must start. Therefore, a grammar is usually thought of as a language generator. However, it can also sometimes be used as the basis for a "recogniser". A recogniser is a function in computing that determines whether a given string belongs to the language or is grammatically incorrect. To describe such recognisers, formal language theory uses separate formalisms, known as Automata Theory.

The process of recognising an utterance (a string in natural languages) by breaking it down into a set of symbols and analysing each one against the grammar of the language is referred to as parsing. Most languages have the meanings of their utterances structured according to their syntax – a practice known as compositional semantics. As a result, the first step to describing the meaning of an utterance in a language is to break it down into parts and look at its analysed form (known as its parse tree in Computer Science).

A grammar mainly consists of a set of rules for transforming strings. (If it consists of these rules only, it would be a semi-Thue system.) To generate a string in the language, one begins with a string consisting of a single start symbol. The production rules are then applied in any order, until a string that contains neither the start symbol nor designated non-terminal symbols is produced. The language formed by the grammar consists of all distinct strings that can be generated in this manner. Any particular sequence of production rules on the start symbol yields a distinct string in the language. If there are multiple ways of generating the same single string, the grammar is said to be ambiguous.

Example 3.1

Assuming the alphabet consists of a and b, the start symbol is S, and we have the following production rules:

1. S → aSb
2. S → ba

then we start with S, and can choose a rule to apply to it. If we choose rule 1, we obtain the string aSb. If we choose rule 1 again, we replace S with aSb and obtain the string aaSbb. If we now choose rule 2, we replace S with ba and obtain the string aababb, and are done. We can write this series of choices more briefly, using symbols: S ⇒ aSb ⇒ aaSbb ⇒ aababb. The language of the grammar is then the infinite set {aⁿbabⁿ | n ≥ 0} = {ba, abab, aababb, aaababbb, …}, where aᵏ is a repeated k times (and n in particular represents the number of times production rule 1 has been applied).
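The derivation above can be replayed mechanically. Below is a minimal Python sketch (ours, not part of the course text) that applies the two production rules in the order chosen in the example; the helper name `apply_rule` is our own invention.

```python
# Replay the derivation S => aSb => aaSbb => aababb by applying
# each chosen production rule to the leftmost occurrence of its head.

def apply_rule(string, head, body):
    """Rewrite the leftmost occurrence of `head` in `string` as `body`."""
    return string.replace(head, body, 1)

# Rule 1: S -> aSb, Rule 2: S -> ba
rules = {1: ("S", "aSb"), 2: ("S", "ba")}

sentential = "S"
for choice in (1, 1, 2):   # the sequence of rule choices in Example 3.1
    head, body = rules[choice]
    sentential = apply_rule(sentential, head, body)

print(sentential)  # aababb
```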

3.1.1 The Syntax of Grammars

In the classic formalisation of generative grammars first proposed by Noam Chomsky in the 1950s, a grammar G consists of the following components:

• a finite set N of non-terminal symbols, none of which appear in strings formed from G
• a finite set Σ of terminal symbols that is disjoint from N
• a finite set P of production rules, each rule of the form

  (Σ ∪ N)* N (Σ ∪ N)* → (Σ ∪ N)*

  where * is the Kleene star operator and ∪ denotes set union. That is, each production rule maps from one string of symbols to another, where the first string (the "head") contains at least one non-terminal symbol. In the case that the second string (the "body") consists solely of the empty string – i.e. it contains no symbols at all – it may be denoted with a special notation (often Λ, e or ε) in order to avoid confusion
• a distinguished symbol S ∈ N, that is, the start symbol.


A grammar is formally defined as the tuple (N, Σ, P, S). Such a formal
grammar is often called a rewriting system or a phrase structure
grammar in the literature.
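The 4-tuple (N, Σ, P, S) maps naturally onto a small data structure. The sketch below is our illustration, not part of the course text; the field names are hypothetical.

```python
# Model a formal grammar as the tuple (N, Sigma, P, S).
from typing import NamedTuple, Tuple, FrozenSet

class Grammar(NamedTuple):
    nonterminals: FrozenSet[str]              # N
    terminals: FrozenSet[str]                 # Sigma, disjoint from N
    productions: Tuple[Tuple[str, str], ...]  # P: (head, body) pairs
    start: str                                # S, the start symbol

# The grammar of Example 3.1: S -> aSb | ba
G = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    productions=(("S", "aSb"), ("S", "ba")),
    start="S",
)
print(G.nonterminals.isdisjoint(G.terminals))  # True
```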

3.1.2 The Semantics of Grammars

The operation of a grammar can be defined in terms of relations on strings:

• given a grammar G = (N, Σ, P, S), the binary relation ⇒G (pronounced as "G derives in one step") on strings in (Σ ∪ N)* is defined by: x ⇒G y if and only if there exist strings u, v, p, q in (Σ ∪ N)* such that x = upv, p → q is a rule in P, and y = uqv
• the relation ⇒G* (pronounced as "G derives in zero or more steps") is defined as the reflexive transitive closure of ⇒G
• a sentential form is a member of (Σ ∪ N)* that can be derived in a finite number of steps from the start symbol S; that is, a sentential form is a member of {w ∈ (Σ ∪ N)* | S ⇒G* w}. A sentential form that contains no non-terminal symbols (i.e. is a member of Σ*) is called a sentence
• the language of G, denoted as L(G), is defined as all those sentences that can be derived in a finite number of steps from the start symbol S, that is, the set {w ∈ Σ* | S ⇒G* w}.

Note that the grammar G = (N, Σ, P, S) is effectively the semi-Thue system (N ∪ Σ, P), rewriting strings in exactly the same way; the only difference is that we distinguish specific non-terminal symbols, which must be rewritten in rewrite rules, and are only interested in rewritings from the designated start symbol S to strings without non-terminal symbols.
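The one-step derivation relation can be computed directly: every y with x ⇒G y arises by rewriting a single occurrence of some production head inside x. A small sketch of this (ours; the function name is made up):

```python
def derives_in_one_step(x, productions):
    """All strings y such that x =>_G y, i.e. x with one occurrence
    of some production head replaced by the corresponding body."""
    results = set()
    for head, body in productions:
        i = x.find(head)
        while i != -1:                 # every occurrence, not just the first
            results.add(x[:i] + body + x[i + len(head):])
            i = x.find(head, i + 1)
    return results

P = [("S", "aSb"), ("S", "ba")]
print(sorted(derives_in_one_step("aSb", P)))  # ['aaSbb', 'abab']
```

Taking the reflexive transitive closure of this relation gives ⇒G*, and restricting the reachable strings to those over Σ alone gives L(G).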


Example 3.2

Note that for these examples, formal languages are specified using set-builder notation.

Consider the grammar G where N = {S, B}, Σ = {a, b, c}, S is the start symbol, and P consists of the following production rules:

1. S → aBSc
2. S → abc
3. Ba → aB
4. Bb → bb

This grammar defines the language L(G) = {aⁿbⁿcⁿ | n ≥ 1}, where aⁿ denotes a string of n consecutive a's. Thus, the language is the set of strings that consist of one or more a's, followed by the same number of b's, and then by the same number of c's.

Some examples of the derivation of strings in L(G) are:

S ⇒₂ abc
S ⇒₁ aBSc ⇒₂ aBabcc ⇒₃ aaBbcc ⇒₄ aabbcc
S ⇒₁ aBSc ⇒₁ aBaBScc ⇒₂ aBaBabccc ⇒₃ aaBBabccc ⇒₃ aaBaBbccc ⇒₃ aaaBBbccc ⇒₄ aaaBbbccc ⇒₄ aaabbbccc

(Note on notation: P ⇒ᵢ Q reads "string P generates string Q by means of production i".)
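Since no rule of this grammar shortens a string, all sentences up to a given length can be enumerated exhaustively. The sketch below (ours, not from the course text) searches all rewritings under a length bound and confirms that the short sentences have the aⁿbⁿcⁿ shape described above.

```python
def sentences(productions, start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.
    Assumes no production shortens a string, so pruning by length is safe."""
    frontier, seen, out = [start], {start}, set()
    while frontier:
        x = frontier.pop()
        if all(c.islower() for c in x):   # no non-terminals left: a sentence
            out.add(x)
            continue
        for head, body in productions:
            i = x.find(head)
            while i != -1:                # rewrite every occurrence of the head
                y = x[:i] + body + x[i + len(head):]
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    frontier.append(y)
                i = x.find(head, i + 1)
    return out

P = [("S", "aBSc"), ("S", "abc"), ("Ba", "aB"), ("Bb", "bb")]
print(sorted(sentences(P, "S", 9), key=len))  # ['abc', 'aabbcc', 'aaabbbccc']
```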

3.2 Types of Grammars and Automata

In the field of Computer Science, there are four basic types of grammars:

a. Type-0 grammars (unrestricted grammars) include all formal grammars.
b. Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages.
c. Type-2 grammars (context-free grammars) generate the context-free languages.
d. Type-3 grammars (regular grammars) generate the regular languages.


The difference between these types is that they have increasingly strict production rules and can express fewer formal languages. Two important types are context-free grammars (Type 2) and regular grammars (Type 3). The languages that can be described with such a grammar are called context-free languages and regular languages, respectively. Although much less powerful than unrestricted grammars (Type 0), which can in fact express any language that can be accepted by a Turing machine, these two restricted types of grammars are most often used because parsers for them can be efficiently implemented.

Recently, other types of classification of grammars, such as analytic grammars, have been identified.

3.2.1 Type-0: Unrestricted Grammars

These grammars generate exactly all languages that can be recognised by a Turing machine. These languages are also known as the Recursively Enumerable Languages. Note that this is different from the recursive languages, which can be decided by an always-halting Turing machine.

3.2.2 Type-1: Context-Sensitive Grammars

These grammars have rules of the form αAβ → αγβ, with A being a non-terminal and α, β and γ strings of terminals and non-terminals. The strings α and β may be empty, but γ must be non-empty. The rule S → ε is allowed if S does not appear on the right side of any rule. The languages described by these grammars are exactly all languages that can be recognised by a linear bounded automaton (a non-deterministic Turing machine whose tape is bounded by a constant times the length of the input).

3.2.3 Type-2: Context-Free Grammars

A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single non-terminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called context-free languages.

The language defined above is not a context-free language, and this can
be strictly proven using the pumping lemma for context-free languages,
but for example the language {aⁿbⁿ | n ≥ 1} (at least one a followed by
the same number of b's) is context-free, as it can be defined by the
grammar G2 with N = {S}, Σ = {a, b}, S the start symbol, and the
following production rules:

CIT 316 MODULE 1

1. S → aSb
2. S → ab

A context-free language can be recognised in O(n³) time by an algorithm
such as Earley's algorithm. That is, for every context-free language, a
machine can be built that takes a string as input and determines in O(n³)
time whether the string is a member of the language, where n is the
length of the string. Further, some important subsets of the context-free
languages can be recognised in linear time using other algorithms.

These are exactly all languages that can be recognised by a
non-deterministic pushdown automaton. Context-free languages are the
theoretical basis for the syntax of most programming languages.
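To see how a pushdown automaton recognises a context-free language such as {aⁿbⁿ | n ≥ 1}, consider the following illustrative Python sketch (not part of the original text). The automaton's stack is simulated with a counter: push for each a, pop for each b, and accept when the stack empties at exactly the right moment:

```python
def is_anbn(s: str) -> bool:
    """Recognise {a^n b^n | n >= 1} the way a pushdown automaton would:
    push for each 'a', pop for each 'b', accept on an empty stack."""
    stack = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:           # an 'a' after a 'b' can never be derived
                return False
            stack += 1           # push
        elif ch == "b":
            seen_b = True
            if stack == 0:       # pop from an empty stack: reject
                return False
            stack -= 1           # pop
        else:
            return False
    return seen_b and stack == 0

print(is_anbn("aabb"))   # True
print(is_anbn("aab"))    # False
print(is_anbn(""))       # False
```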

3.2.4 Type-3: Regular Grammars

In regular grammars, the left hand side is again only a single non-
terminal symbol, but now the right-hand side is also restricted. The right
side may be the empty string, or a single terminal symbol, or a single
terminal symbol followed by a non-terminal symbol, but nothing else.
(Sometimes a broader definition is used: one can allow longer strings of
terminals or single non-terminals without anything else, making
languages easier to denote while still defining the same class of
languages.)

The language defined above is not regular, but the language
{aⁿbᵐ | n, m ≥ 1} (at least one a followed by at least one b, where the
numbers may be different) is, as it can be defined by the grammar G3
with N = {S, B}, Σ = {a, b}, S the start symbol, and the
following production rules:

1. S → aS
2. S → aB
3. B → bB
4. B → b

All languages generated by a regular grammar can be recognised in
linear time by a finite state machine. Although, in practice, regular
grammars are commonly expressed using regular expressions, some
forms of regular expression used in practice do not strictly generate the
regular languages and do not show linear recognitional performance due
to those deviations. Regular languages are commonly used to define
search patterns and the lexical structure of programming languages.
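As an illustration of linear-time recognition (this sketch is not in the original text), the following Python function runs a small deterministic finite state machine for the language {aⁿbᵐ | n, m ≥ 1}, i.e. the regular expression a+b+. Each input character causes exactly one state transition, so the running time is linear in the length of the string:

```python
def is_a_plus_b_plus(s: str) -> bool:
    """Run a three-state DFA for the regular language {a^n b^m | n, m >= 1}
    (one or more a's followed by one or more b's), i.e. the regex a+b+."""
    # States: 'start' (no input yet), 'A' (seen a's), 'B' (seen a's then b's).
    state = "start"
    for ch in s:
        if state == "start" and ch == "a":
            state = "A"
        elif state == "A" and ch == "a":
            state = "A"
        elif state in ("A", "B") and ch == "b":
            state = "B"
        else:
            return False         # dead state: reject immediately
    return state == "B"          # 'B' is the only accepting state

print(is_a_plus_b_plus("aaabb"))  # True
print(is_a_plus_b_plus("ab"))     # True
print(is_a_plus_b_plus("aba"))    # False
```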

CIT 316 COMPILER CONSTRUCTION

3.2.5 Analytic Grammars

Though there is a tremendous body of literature on parsing algorithms,
most of these algorithms assume that the language to be parsed is
initially described by means of a generative formal grammar, and that
the goal is to transform this generative grammar into a working parser.
Strictly speaking, a generative grammar does not in any way correspond
to the algorithm used to parse a language, and various algorithms have
different restrictions on the form of production rules that are considered
well-formed.

An alternative approach is to formalise the language in terms of an
analytic grammar in the first place, which more directly corresponds to
the structure and semantics of a parser for the language. Examples of
analytic grammar formalisms include the following:

• the Language Machine directly implements unrestricted analytic
grammars. Substitution rules are used to transform an input to
produce outputs and behaviour. The system can also produce the
lm-diagram which shows what happens when the rules of an
unrestricted analytic grammar are being applied.
• top-down parsing language (TDPL): a highly minimalist analytic
grammar formalism developed in the early 1970s to study the
behaviour of top-down parsers.
• link grammars: a form of analytic grammar designed for
linguistics, which derives syntactic structure by examining the
positional relationships between pairs of words.
• parsing expression grammars (PEGs): a more recent
generalisation of TDPL designed around the practical
expressiveness needs of programming language and compiler
writers.

SELF-ASSESSMENT EXERCISE I

i. In the context of Computer Science, what do you understand by
the word ‘grammar’?
ii. Enumerate the components that make up the syntax of grammars
iii. According to this course, list and describe the types of grammars
iv. Given the grammar G with following production rules, S → a | aS
| bS, determine whether the following strings can be generated by
the grammar
(i) babaab (ii) aaabbbab (iii) bbaaba


3.3 Chomsky Hierarchy

The Chomsky hierarchy (occasionally referred to as the
Chomsky–Schützenberger hierarchy) is a containment hierarchy of classes
of formal grammars.

This hierarchy of grammars was described by Noam Chomsky in 1956.
It is also named after Marcel-Paul Schützenberger who played a crucial
role in the development of the theory of formal languages.

3.3.1 The Hierarchy

The Chomsky hierarchy consists of the levels of grammars presented in
Sections 3.2.1 through 3.2.4 above.

Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Recursively enumerable

Fig. 1: Set Inclusions Described by the Chomsky Hierarchy

Note that the set of grammars corresponding to recursive languages is
not a member of this hierarchy.

Every regular language is context-free; every context-free language not
containing the empty string is context-sensitive; every context-sensitive
language is recursive; and every recursive language is recursively
enumerable. These are all proper inclusions, meaning that
there exist recursively enumerable languages which are not context-
sensitive, context-sensitive languages which are not context-free and
context-free languages which are not regular.

The following table summarises each of Chomsky's four types of
grammars, the class of language it generates, the type of automaton that
recognises it, and the form its rules must have.


Table 1: Summary of the Languages, Automata and Production
Rules of Chomsky's Four Types of Grammars

Grammar   Languages                 Automaton                          Production rules (constraints)
Type-0    Recursively enumerable    Turing machine                     (no restrictions)
Type-1    Context-sensitive         Linear-bounded non-deterministic   αAβ → αγβ
                                    Turing machine
Type-2    Context-free              Non-deterministic pushdown         A → γ
                                    automaton
Type-3    Regular                   Finite state automaton             A → a and A → aB

However, there are further categories of formal languages that you can
read more about in the further reading.

3.4 Formal Language

A formal language is a set of words, i.e. finite strings of letters,
symbols, or tokens. The set from which these letters are taken is called
the alphabet over which the language is defined. A formal language is
often defined by means of a formal grammar (also called its formation
rules); accordingly, words that belong to a formal language are
sometimes called well-formed words (or well-formed formulas). Formal
languages are studied in computer science and linguistics; the field of
formal language theory studies the purely syntactical aspects of such
languages (that is, their internal structural patterns).

3.4.1 Words over an Alphabet

An alphabet, in the context of formal languages, can be any set,
although it often makes sense to use an alphabet in the usual sense of the
word, or more generally a character set such as ASCII. Alphabets can
also be infinite; e.g. first-order logic is often expressed using an alphabet
which, besides symbols such as ∧, ¬, ∀ and parentheses, contains
infinitely many elements x0, x1, x2, … that play the role of variables. The
elements of an alphabet are called its letters.


A word over an alphabet can be any finite sequence, or string, of letters.
The set of all words over an alphabet Σ is usually denoted by Σ* (using
the Kleene star). For any alphabet there is only one word of length 0, the
empty word, which is often denoted by e, ε or λ. By concatenation one
can combine two words to form a new word, whose length is the sum of
the lengths of the original words. The result of concatenating a word
with the empty word is the original word.
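These definitions can be made concrete with a short Python sketch (added here for illustration only). It enumerates a finite slice of Σ* for a two-letter alphabet and checks the concatenation and empty-word properties just stated:

```python
from itertools import product

# Enumerate the words over Σ = {a, b} up to length 2 -- a finite slice of Σ*.
sigma = ["a", "b"]
words = [""]                               # length 0: just the empty word ε
for length in (1, 2):
    words += ["".join(p) for p in product(sigma, repeat=length)]
print(words)                               # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']

# Concatenation adds lengths, and ε is its identity element.
v, w = "ab", "ba"
assert len(v + w) == len(v) + len(w)
assert v + "" == v and "" + v == v
```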

In some applications, especially in logic, the alphabet is also known as
the vocabulary and words are known as formulas or sentences; this
breaks the letter/word metaphor and replaces it by a word/sentence
metaphor.

Therefore, a formal language L over an alphabet Σ is just a subset of Σ*,
that is, a set of words over that alphabet.

In computer science and mathematics, which do not usually deal with
natural languages, the adjective "formal" is often omitted as redundant.
While formal language theory usually concerns itself with formal
languages that are described by some syntactical rules, the actual
definition of the concept "formal language" is only as above: a (possibly
infinite) set of finite-length strings, no more nor less. In practice, there
are many languages that can be described by rules, such as regular
languages or context-free languages. The notion of a formal grammar
may be closer to the intuitive concept of a "language," one described by
syntactic rules. By an abuse of the definition, a particular formal
language is often thought of as being equipped with a formal grammar
that describes it.

Examples 3.3
The following rules describe a formal language L over the
alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, =}:
a. Every nonempty string that does not contain + or = and does not
start with 0 is in L. The string 0 is in L.
b. (Continues in rule c below.)
c. A string containing = is in L if and only if there is exactly one =,
and it separates two valid strings in L.
d. A string containing + but not = is in L if and only if every + in
the string separates two valid strings in L.
e. No string is in L other than those implied by the previous rules.
f. Under these rules, the string "23+4=555" is in L, but the string
"=234=+" is not. This formal language expresses natural
numbers, well-formed addition statements, and well-formed
addition equalities, but it expresses only what they look like (their
syntax), not what they mean (semantics). For instance, nowhere
in these rules is there any indication that 0 means the number
zero or that + means addition.
g. For finite languages one can simply enumerate all well-formed
words. For example, we can describe a language L as just
L = {"a", "b", "ab", "cba"}.
However, even over a finite (non-empty) alphabet such as
Σ = {a, b} there are infinitely many words: "a", "abb", "ababba",
"aaababbbbaab”… Therefore formal languages are typically
infinite, and describing an infinite formal language is not as
simple as writing L = {"a", "b", "ab", "cba"}. Here are some
examples of formal languages:
L = Σ*, the set of all words over Σ;
L = {a}* = {an}, where n ranges over the natural numbers and an
means "a" repeated n times (this is the set of words consisting
only of the symbol "a");
h. the set of syntactically correct programmes in a given
programming language (the syntax of which is usually defined by
a context-free grammar);
i. the set of inputs upon which a certain Turing machine halts; or
j. the set of maximal strings of alphanumeric ASCII characters on
this line, (i.e., the set {"the", "set", "of", "maximal", "strings",
"alphanumeric", "ASCII", "characters", "on", "this", "line", "i",
"e"}).
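Because the rules above happen to describe a regular language, they can be re-encoded as a single regular expression. The following Python sketch (an illustration added to the text, not part of the original rules) checks the membership claims made in the example:

```python
import re

NUMBER = r"(0|[1-9][0-9]*)"                # a natural number, no leading zeros
SUM = NUMBER + r"(\+" + NUMBER + r")*"     # numbers separated by '+'
LANG = re.compile(r"^" + SUM + r"(=" + SUM + r")?$")   # at most one '='

def in_L(s: str) -> bool:
    """Check membership in the language L of Examples 3.3, re-encoded
    as one regular expression over Σ = {0..9, +, =}."""
    return LANG.fullmatch(s) is not None

print(in_L("23+4=555"))   # True: well-formed, even though 23 + 4 != 555
print(in_L("=234=+"))     # False
print(in_L("007"))        # False: leading zero
```

Note that the check is purely syntactic, exactly as the example says: "23+4=555" is in L even though the equation is arithmetically false.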

3.4.2 Language-Specification Formalisms

Formal language theory rarely concerns itself with particular languages
(except as examples), but is mainly concerned with the study of various
types of formalisms to describe languages. For instance, a language can
be given as:

• those strings generated by some formal grammar;
• those strings described or matched by a particular regular
expression;
• those strings accepted by some automaton, such as a Turing
machine or finite state automaton;
• those strings for which some decision procedure (an algorithm
that asks a sequence of related YES/NO questions) produces the
answer YES.

Typical questions asked about such formalisms include:

• What is their expressive power? (Can formalism X describe every
language that formalism Y can describe? Can it describe other
languages?)
• What is their recognisability? (How difficult is it to decide
whether a given word belongs to a language described by
formalism X?)
• What is their comparability? (How difficult is it to decide
whether two languages, one described in formalism X and one in
formalism Y, or in X again, are actually the same language?).

Surprisingly often, the answer to these decision problems is "it cannot be
done at all", or "it is extremely expensive" (with a precise
characterisation of how expensive exactly). Therefore, formal language
theory is a major application area of computability theory and
complexity theory. Formal languages may be classified in the Chomsky
hierarchy based on the expressive power of their generative grammar as
well as the complexity of their recognising automaton. Context-free
grammars and regular grammars provide a good compromise between
expressivity and ease of parsing, and are widely used in practical
applications.

3.4.3 Operations on Languages

Certain operations on languages are common. This includes the standard
set operations, such as union, intersection, and complement. Another
class of operation is the element-wise application of string operations.

Example 3.4
Suppose L1 and L2 are languages over some common alphabet.
a. The concatenation L1L2 consists of all strings of the form vw
where v is a string from L1 and w is a string from L2.
b. The intersection L1 ∩ L2 of L1 and L2 consists of all strings which
are contained in both languages.
c. The complement ¬L of a language with respect to a given
alphabet consists of all strings over the alphabet that are not in
the language.
d. The Kleene star: the language consisting of all words that are
concatenations of 0 or more words in the original language.
e. Reversal:
• Let e be the empty word, then eR = e, and
• for each non-empty word w = x1…xn over some alphabet, let
wR = xn…x1;
• then for a formal language L, LR = {wR | w ∈ L}.

Another common string operation is string homomorphism.
Such string operations are used to investigate closure properties of
classes of languages. A class of languages is closed under a particular
operation when the operation, applied to languages in the class, always
produces a language in the same class again. For instance, the context-
free languages are known to be closed under union, concatenation, and
intersection with regular languages, but not closed under intersection or
complement.
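On small finite languages, the operations of Example 3.4 are easy to demonstrate directly with Python sets. The sketch below is an illustration added to the text; the languages chosen are arbitrary:

```python
# Element-wise language operations demonstrated on small finite
# languages represented as Python sets of strings.
L1 = {"a", "ab"}
L2 = {"b", "ba"}

concat = {v + w for v in L1 for w in L2}          # concatenation L1L2
print(sorted(concat))                             # ['ab', 'aba', 'abb', 'abba']

print(L1 & L2)                                    # intersection: set()

reverse = {w[::-1] for w in {"ab", "cba"}}        # reversal L^R
print(sorted(reverse))                            # ['abc', 'ba']
```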

3.4.4 Uses of Formal Languages

Formal languages are often used as the basis for richer constructs
endowed with semantics. In computer science they are used, among
other things, for the precise definition of data formats and the syntax of
programming languages.
Formal languages play a crucial role in the development of compilers. A
compiler is typically produced by means of a compiler-compiler, which
may be a single programme or may be separated into tools like lexical
analyser generators (e.g. lex) and parser generators (e.g. yacc). Since
formal languages
alone do not have semantics, other formal constructs are needed for the
formal specification of programme semantics.

Formal languages are also used in logic and in the foundations of
mathematics to represent the syntax of formal theories. Logical systems
can be seen as a formal language with additional constructs, like proof
calculi, which define a consequence relation. Tarski's definition of
truth in terms of a T-schema for first-order logic is an example of a
fully interpreted formal language; all its sentences have meanings that
make them either true or false.

SELF-ASSESSMENT EXERCISE II

i. Define formal languages.
ii. What is the relationship between grammar and language?

3.5 Programming Languages

A compiler usually has two distinct components/parts: the analysis part,
which breaks up the source programme into constituent pieces and creates
an intermediate representation of the source programme, and the synthesis
part, which constructs the desired target programme from the intermediate
representation.

A lexical analyser, generated by a tool like lex, identifies the tokens of
the programming language grammar, e.g. identifiers or keywords, which
are themselves expressed in a simpler formal language, usually by
means of regular expressions. At the most basic conceptual level, a
parser, generated by a parser generator like yacc, attempts to decide if
the source programme is valid, that is if it belongs to the programming
language for which the compiler was built. Of course, compilers do
more than just parse the source code; they translate it into some
executable format; because of this, a parser usually outputs more than a
yes/no answer: typically an abstract syntax tree (AST), which is used in
subsequent stages of the compiler to eventually generate an executable
containing machine code that runs directly on the hardware, or some
intermediate code that requires a virtual machine to execute.

3.5.1 Formal Theories, Systems and Proofs

Fig. 2: Syntactic Divisions within a Formal System

Figure 2 shows the syntactic divisions within a formal system. Symbols
and strings of symbols may be broadly divided into nonsense and well-
formed formulas. The set of well-formed formulas is divided into
theorems and non-theorems. However, quite often, a formal system will
simply define all of its well-formed formulas as theorems.

In mathematical logic, a formal theory is a set of sentences expressed in
a formal language.

A formal system (also called a logical calculus or a logical system)
consists of a formal language together with a deductive apparatus (also
called a deductive system). The deductive apparatus may consist of a set
of transformation rules which may be interpreted as valid rules of
inference or a set of axioms, or have both. A formal system is used to
derive one expression from one or more other expressions. Although a
formal language can be identified with its formulas, a formal system
cannot be likewise identified by its theorems. Two formal systems FS
and FS′ may have all the same theorems and yet differ in some
significant proof-theoretic way (a formula A may be a syntactic
consequence of a formula B in one but not in the other, for instance).


A formal proof or derivation is a finite sequence of well-formed
formulas (which may be interpreted as propositions) each of which is an
axiom or follows from the preceding formulas in the sequence by a rule
of inference. The last sentence in the sequence is a theorem of a formal
system. Formal proofs are useful because their theorems can be
interpreted as true propositions.

3.5.2 Interpretations and Models

Formal languages are entirely syntactic in nature but may be given
semantics that give meaning to the elements of the language. For
instance, in mathematical logic, the set of possible formulas of a
particular logic is a formal language, and an interpretation assigns a
meaning to each of the formulas - usually, a truth value.

The study of interpretations of formal languages is called formal
semantics. In mathematical logic, this is often done in terms of model
theory. In model theory, the terms that occur in a formula are interpreted
as mathematical structures, and fixed compositional interpretation rules
determine how the truth value of the formula can be derived from the
interpretation of its terms; a model for a formula is an interpretation of
terms such that the formula becomes true.

4.0 CONCLUSION

In this unit you have been taken through a brief revision of formal
grammars, formal languages and automata because of the crucial roles they
play in the development of compilers. You should read more on these
various topics before proceeding to the next unit. In the subsequent
units, you will be learning about compiler construction. Precisely, unit 2
of this module will introduce you to the concept of compilers.

5.0 SUMMARY

In this unit, you learnt that a formal grammar is a set of rules of a
specific kind for forming strings in a formal language. It has four
components that form its syntax and a set of operations that can be
performed on it, which form its semantics.

Each type of grammar is recognised by a particular type of automaton.
For example, type-2 grammars are recognised by pushdown automata
while type-3 grammars are recognised by finite state automata.


Parsing is the process of recognising an utterance by breaking it down to
a set of symbols and analysing each one against the grammar of the
language.

According to the Chomsky hierarchy, there are four types of grammars. The
difference between these types is that they have increasingly strict
production rules and can express fewer formal languages.

A formal language is a set of words, i.e. finite strings of letters,
symbols or tokens. The set from which these letters are taken is called
the alphabet over which the language is defined. A formal language is
often defined by means of a formal grammar.

Formal languages play a crucial role in the development of compilers
and precise definition of the syntax of programming languages. A
formal system consists of a formal language together with a deductive
apparatus.

6.0 TUTOR-MARKED ASSIGNMENT

i. Name the class of automata that are used in recognising the
following grammars:
a. Regular grammars
b. Context-sensitive grammars
c. Type-0 grammars
d. Context-free grammars
ii. What are the use(s) of formal languages?
iii. What does it mean to say a class of languages is closed under a
particular operation? Hence or otherwise suppose L1 and L2 are
languages over some common alphabet; state (with appropriate
examples) the standard operations that can be performed on the
languages.
iv. From what you have learnt so far in this course, justify the
relevance of formal languages to computer programming.
v. Briefly discuss the Chomsky hierarchy. What is the relationship
among the various types of grammars described in the Chomsky
hierarchy?

7.0 REFERENCES/FURTHER READING

Avron, A. (1994). "What is a Logical System?" Chapter 8. In: D. M.
Gabbay (Ed.). What is a Logical System? Oxford University Press.

Chomsky, N. & Schützenberger, M. P. (1963). "The Algebraic Theory of
Context-Free Languages". In: P. Braffort & D. Hirschberg (Eds.).
Computer Programming and Formal Systems. Amsterdam: North-Holland.
pp. 118–161.

Chomsky, N. (1956). "Three Models for the Description of Language".
IRE Transactions on Information Theory 2: 113–124. Retrieved from
http://www.chomsky.info/articles/195609--.pdf.

Chomsky, N. (1959). "On Certain Formal Properties of Grammars".
Information and Control 2(2): 137–167.

Davis, M. E., Sigal, R. & Weyuker, E. J. (1994). Computability,
Complexity, and Languages: Fundamentals of Theoretical Computer
Science. Boston: Academic Press, Harcourt, Brace. pp. 327.

Ginsburg, S. (1975). Algebraic and Automata-Theoretic Properties of
Formal Languages. North-Holland.

Hofstadter, D. (1979). Gödel, Escher, Bach: An Eternal Golden Braid.
Basic Books.

Rozenberg, G. & Salomaa, A. (Eds.) (1997). Handbook of Formal
Languages: Volumes I-III. Springer. ISBN 3-540-61486-9.

Hamilton, A. G. (1978). Logic for Mathematicians. Cambridge University
Press. ISBN 0-521-21838-1.

Harrison, M. A. (1978). Introduction to Formal Language Theory.
Addison-Wesley.

Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata
Theory, Languages, and Computation. Reading, Massachusetts:
Addison-Wesley.

Suppes, P. (1957). Introduction to Logic. D. Van Nostrand.


UNIT 2 WHAT IS A COMPILER?

CONTENTS

1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Translators
3.2 Why do we need Translators?
3.3 What is a Compiler?
3.4 The Challenges in Compiler Development
3.5 Compiler Architecture
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Reading

1.0 INTRODUCTION

In the previous unit you were taken through some basic concepts you
learnt in an earlier course. This was done because of their
relevance/importance to your understanding of this course.

In this unit you will be introduced to the concept of compilers and their
importance to programme development.

Now let us go through your study objectives for this unit.

2.0 OBJECTIVES

At the end of this unit, you should be able to:

• define a compiler and state its importance in the programming world
• distinguish between a translator, a compiler and an interpreter
• discuss the major challenges to be faced in building compilers
• state the qualities of compilers
• mention some of the knowledge required for building compilers
• describe the architecture of a compiler.

3.0 MAIN CONTENT

3.1 Translators

A translator is a programme that takes as input a programme written in
one programming language (the source language) and produces as
output a programme in another language (the object or target language).
If the source language is a high-level language such as COBOL,
PASCAL, etc. and the object language is a low-level language such as
an assembly language or machine language, then such a translator is
called a Compiler.

Executing a programme written in a high-level programming language is
basically a two-step process, as illustrated in Figure 1. The source
programme must first be compiled, that is, translated into an object
programme. Then the resulting object programme is loaded into memory
and executed.

Source Programme -> Compiler -> Object Programme

Object Programme + Input -> Execution

Fig. 1: Compilation and Execution

Certain other translators transform a programming language into a
simplified language, called intermediate code, which can be directly
executed using a programme called an interpreter. You can think of the
intermediate code as the machine language of an abstract computer
designed to execute the source code.

There are other important types of translators, besides compilers. If the
source language is assembly language and the target language is
machine language, then the translator is called an assembler. The term
preprocessor is used for translators that take programmes in one high-
level language into equivalent programmes in another high level
language. For example, there many FORTRAN preprocessors that map
‘structured’ versions of FORTRAN into conventional FORTRAN.

3.2 Why Do We Need Translators?

We need translators to overcome the rigour of programming in machine
language, which involves communicating directly with a computer in
terms of bits, register, and primitive machine operations. As you have
learnt in earlier courses in this programme, a machine language
programme is a sequence of 0’s and 1’s, therefore, programming a
complex algorithm in such a language is terribly tedious and prone to
mistakes.


Translators free the programmer from the hassles of programming in
machine language.

3.3 What is a Compiler?

A compiler is a programme that translates a source programme written
in some high-level programming language (such as Java) into machine
code for some computer architecture (such as the Intel Pentium
architecture). The generated machine code can later be executed many
times against different data each time.

An interpreter reads an executable source programme written in a
high-level programming language as well as data for this programme, and it
runs the programme against the data to produce some results. One
example is the UNIX shell interpreter, which runs operating system
commands interactively.

You should note that both interpreters and compilers (like any other
programme) are written in some high-level programming language
(which may be different from the language they accept) and they are
translated into machine code. For example, a Java interpreter can be
completely written in Pascal, or even Java. The interpreter source
programme is machine independent since it does not generate machine
code. (Note the difference between generating machine code and being
translated into machine code.) An interpreter is generally slower than a compiler
because it processes and interprets each statement in a programme as
many times as the number of the evaluations of this statement. For
example, when a for-loop is interpreted, the statements inside the for-
loop body will be analysed and evaluated on every loop step. Some
languages, such as Java and Lisp, come with both an interpreter and a
compiler. Java source programmes (Java classes with .java extension)
are translated by the java compiler into byte-code files (with .class
extension). The Java interpreter, java, called the Java Virtual Machine
(JVM), may actually interpret byte codes directly or may internally
compile them to machine code and then execute that code.
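This performance point can be demonstrated with Python's own compile and eval built-ins (an illustrative sketch, not part of the original text): evaluating source text repeatedly re-analyses it each time, while compiling once produces bytecode that can simply be run again and again. Exact timings vary by machine:

```python
import timeit

expr = "(x + 1) * (x - 1)"

# Interpreter-style: the source text is re-analysed on every evaluation.
interp = timeit.timeit(lambda: eval(expr, {"x": 7}), number=2000)

# Compiler-style: translate once to bytecode, then just run it.
code = compile(expr, "<expr>", "eval")
comp = timeit.timeit(lambda: eval(code, {"x": 7}), number=2000)

print(eval(code, {"x": 7}))   # 48
print(comp < interp)          # compiling first is typically faster
```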

As was mentioned in Section 3.1, compilers and interpreters are not the
only examples of translators. The table below lists a few more:


Table 1: Table of Translators, Source Languages and Target Languages

Source Language       Translator                        Target Language
LaTeX                 text formatter                    PostScript
SQL                   database query optimiser          query evaluation plan
Java                  javac compiler                    Java byte code
Java                  cross-compiler                    C++ code
English text          natural language understanding    semantics (meaning)
regular expressions   JLex scanner generator            a scanner in Java
BNF of a language     CUP parser generator              a parser in Java

This course deals mainly with compilers for high-level programming
languages, but the same techniques apply to interpreters or to any other
compilation scheme.

SELF-ASSESSMENT EXERCISE

i. What is a translator?
ii. What is the importance of translators in programming?
iii. Distinguish among a translator, compiler and an interpreter.

3.4 The Challenges in Compiler Development

There are various challenges involved in developing compilers; some of
these are itemised below:

1) Many variations:
a. many programming languages (e.g. FORTRAN, C++,
Java)
b. many programming paradigms (e.g. object-oriented,
functional, logic)
c. many computer architectures (e.g. MIPS, SPARC, Intel,
alpha)
d. many operating systems (e.g. Linux, Solaris, Windows)
2) Qualities of a compiler: these concern the qualities that a
compiler must possess in order to be effective and useful. These
are listed below in order of importance:
a. the compiler itself must be bug-free
b. it must generate correct machine code
c. the generated machine code must run fast
d. the compiler itself must run fast (compilation time must be
proportional to programme size)
e. the compiler must be portable (i.e. modular, supporting
separate compilation)
f. it must print good diagnostics and error messages
g. the generated code must work well with existing
debuggers
h. must have consistent and predictable optimisation.
3) In-depth knowledge:
Building a compiler requires in-depth knowledge of:
a. programming languages (parameter passing, variable
scoping, memory allocation, etc.)
b. theory (automata, context-free languages, etc.)
c. algorithms and data structures (hash tables, graph
algorithms, dynamic programming, etc.)
d. computer architecture (assembly programming)
e. software engineering.

You should try building a non-trivial compiler for a Pascal-like language
as the course project. This will give you hands-on experience of
system implementation that combines all this knowledge.

3.5 Compiler Architecture

As earlier mentioned, a compiler can be viewed as a programme that
accepts a source code (such as a Java programme) and generates
machine code for some computer architecture. Suppose that you want to
build compilers for n programming languages (e.g. FORTRAN, C, C++,
Java, BASIC, etc.) and you want these compilers to run on m different
architectures (e.g. MIPS, SPARC, Intel, alpha, etc.). If you do that
naively, you need to write n*m compilers, one for each language-
architecture combination.

The holy grail of portability in compilers is to do the same thing by
writing n + m programmes only. You can do this by using a universal
Intermediate Representation (IR) and you make the compiler a two-
phase compiler. An IR is typically a tree-like data structure that captures
the basic features of most computer architectures. One example of an IR
tree node is a representation of a 3-address instruction, such as d s1 +
s2 that gets two source addresses, s1 and s2, (i.e. two IR trees) and
produces one destination address, d. The first phase of this compilation
scheme, called the front-end, maps the source code into IR, and the
second phase, called the back-end, maps IR into machine code. That
way, for each programming language you want to compile, you write
one front-end only, and for each computer architecture, you write one
back-end. So, totally you have n + m components.
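The n + m economy described above can be sketched in a few lines of Python. Everything here (the stub front-ends and back-ends, and the shape of the IR) is illustrative only, not taken from any real compiler:

```python
# Sketch: with a shared IR, n front-ends plus m back-ends replace n*m compilers.
# All names and behaviours here are illustrative stubs.

def c_front_end(source):
    return ("IR", source)             # maps C source to a (trivial) IR

def java_front_end(source):
    return ("IR", source)             # maps Java source to the same IR

def mips_back_end(ir):
    return f"MIPS code for {ir[1]}"   # maps IR to MIPS assembly

def x86_back_end(ir):
    return f"x86 code for {ir[1]}"    # maps IR to x86 assembly

front_ends = {"C": c_front_end, "Java": java_front_end}    # n = 2 components
back_ends = {"MIPS": mips_back_end, "x86": x86_back_end}   # m = 2 components

def compile_source(language, architecture, source):
    ir = front_ends[language](source)     # phase 1 (front-end): source -> IR
    return back_ends[architecture](ir)    # phase 2 (back-end): IR -> machine code

# 2 + 2 = 4 components serve all 2 * 2 = 4 language-architecture pairs.
print(compile_source("Java", "MIPS", "x = 1"))   # MIPS code for x = 1
```

Adding a fifth language means writing one more front-end, not four more compilers.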

CIT 316 COMPILER CONSTRUCTION

But the above ideal separation of compilation into two phases does not
work very well for real programming languages and architectures.
Ideally, you must encode all knowledge about the source programming
language in the front end, you must handle all machine architecture
features in the back end, and you must design your IRs in such a way
that all language and machine features are captured properly.

A typical real-world compiler usually has multiple phases (this will be treated in greater detail in unit 3 of this module). This increases the compiler's portability and simplifies retargeting. The front end consists of the following phases:

• scanning: a scanner groups input characters into tokens
• parsing: a parser recognises sequences of tokens according to
some grammar and generates Abstract Syntax Trees (ASTs)
• semantic analysis: performs type checking (i.e. checking whether
the variables, functions etc. in the source programme are used
consistently with their definitions and with the language
semantics) and translates ASTs into IRs
• optimisation: optimises IRs.

The back end consists of the following phases:

• instruction selection: maps IRs into assembly code
• code optimisation: optimises the assembly code using control-flow and data-flow analyses, register allocation, etc.
• code emission: generates machine code from assembly code.

The generated machine code is written in an object file. This file is not
executable since it may refer to external symbols (such as system calls).
The operating system provides the following utilities to execute the
code:

• linking: A linker takes several object files and libraries as input and produces one executable object file. It retrieves from the
input files (and puts them together in the executable object file)
the code of all the referenced functions/procedures and it resolves
all external references to real addresses. The libraries include the
operating system libraries, the language-specific libraries, and,
maybe, user-created libraries.
• loading: A loader loads an executable object file into memory,
initialises the registers, heap, data, etc. and starts the execution of
the programme.
• Relocatable shared libraries allow effective memory use when
many different applications share the same code.


4.0 CONCLUSION

In this unit, you have been taken through the definition, functions, and architecture of a compiler. In the next unit, you will learn more about the structure of a compiler and the various phases involved in the compilation process.

5.0 SUMMARY

In this unit, you learnt that:

• a translator is a programme that takes as input a programme written in one programming language (the source language) and
produces as output a programme in another language (the object
or target language)
• a compiler is a programme that translates a source programme
written in some high-level programming language (such as Java)
into machine code for some computer architecture (such as the
Intel Pentium architecture)
• an interpreter reads an executable source programme written in a
high-level programming language as well as data for this
programme, and it runs the programme against the data to
produce some results
• translators are needed to free the programmer from the hassles of
programming in machine language and its attendant problems
• there are several challenges to be surmounted in developing a
compiler.

6.0 TUTOR-MARKED ASSIGNMENT

i. What are the challenges involved in developing compilers?
ii. Enumerate the essential qualities of a compiler.
iii. Distinguish between the function of loader and a linker.
iv. Outline some specific knowledge required for building a compiler.

7.0 REFERENCES/FURTHER READING

Aho, A. V. & Ullman, J. D. (1977). Principles of Compiler Design. Addison-Wesley Publishing Company. ISBN 0-201-00022-9.

Chomsky, Noam & Schützenberger, Marcel P. (1963). "The Algebraic Theory of Context-Free Languages". In: P. Braffort & D. Hirschberg (Eds.). Computer Programming and Formal Languages. Amsterdam: North Holland. pp. 118–161.

Davis, Martin E., Sigal, Ron & Weyuker, Elaine J. (1994). Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science. Boston: Academic Press, Harcourt, Brace. p. 327. ISBN 0-12-206382-1.

Rozenberg, Grzegorz & Salomaa, Arto (Eds.) (1997). Handbook of Formal Languages. Volumes I–III. Springer. ISBN 3-540-61486-9.

Hopcroft, John E. & Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, Massachusetts: Addison-Wesley. ISBN 0-201-02988-X.

Harrison, Michael A. (1978). Introduction to Formal Language Theory. Addison-Wesley.


UNIT 3 THE STRUCTURE OF A COMPILER

CONTENTS

1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 The Structure of a Compiler
3.2 Phases of a Compiler
3.2.1 The Lexical Analyser
3.2.2 The Syntax Analyser
3.2.3 The Intermediate Code Generator
3.2.4 Code Optimisation
3.2.5 Code Generation
3.2.6 The Table Management or Bookkeeping
3.2.7 The Error Handler
3.3 Passes
3.4 Reducing the Number of Passes
3.5 Cross Compilation
3.6 Operations of a Compiler
3.7 T-Diagrams
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Reading

1.0 INTRODUCTION

In the previous unit you were introduced to the concept of compilers and
their roles in programming.

In this unit you will learn about the structure of a compiler and the
various phases involved in the compilation process.

Now let us go through your study objectives for this unit.

2.0 OBJECTIVES

At the end of this unit, you should be able to:

• list the various components of a compiler
• describe the activities that take place at each of the compilation
phases
• explain cross compilation
• analyse hand implementation.


3.0 MAIN CONTENT

3.1 The Structure of a Compiler

We can identify four components:

1. Front-end
2. Back-end
3. Tables of information
4. Runtime library

i) Front-End: the front-end is responsible for the analysis of the structure and meaning of the source text. This end is usually the
analysis part of the compiler. Here we have the syntactic
analyser, semantic analyser, and lexical analyser. This part has
been automated.
ii) Back-End: the back-end is responsible for generating the target language. Here we have the intermediate code optimiser, code generator and code optimiser. This part has not been automated to the same degree as the front-end.
iii) Tables of Information: It includes the symbol-table and there
are some other tables that provide information during compilation
process.
iv) Run-Time Library: It is used for run-time system support.

Languages for Writing a Compiler

a. Machine language
b. Assembly language
c. High-level language, or a high-level language with bootstrapping facilities, for flexibility and ease of porting.

3.2 Phases of a Compiler

A compiler takes as input a source programme and produces as output an equivalent sequence of machine instructions. This process is so
complex that it is not reasonable, either from a logical point of view or
from an implementation point of view, to consider the compilation
process as occurring in one single step. For this reason, it is customary
to partition the compilation process into a series of sub-processes called
phases as shown in the figure 1 below. A phase is a logically cohesive
operation that takes as input one representation of the source programme
and produces as output another representation.


3.2.1 The Lexical Analyser

This is the first phase and it is also referred to as the scanner. It separates the characters of the source language into groups that logically belong together; these groups are called tokens. The usual tokens are keywords, such as DO or IF, identifiers, such as X or NUM, operator symbols, such as <= or +, and punctuation symbols, such as parentheses or commas. The output of the lexical analyser is a stream of tokens, which is passed to the next phase, the syntax analyser or parser. The tokens in this stream can be represented by codes which we may regard as integers. Thus, DO might be represented by 1, + by 2, and "identifier" by 3. In the case of a token like "identifier", a second quantity, telling which of those identifiers used by the programme is represented by this instance of the token "identifier", is passed along with the integer code for "identifier". For example, in the FORTRAN statement:

IF (5 .EQ. MAX) GO TO 100

we find the following eight tokens: IF; (; 5; .EQ.; MAX; ); GOTO; 100.
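The grouping of these eight tokens can be sketched with a small regular-expression scanner in Python. The token class names and patterns below are assumptions made for this sketch, not part of any real FORTRAN compiler:

```python
import re

# Token classes and patterns for this sketch only (not a real FORTRAN lexer).
# Keywords are tried before identifiers so that IF is not scanned as a name.
TOKEN_SPEC = [
    ("KEYWORD",    r"IF|GO\s*TO"),   # blanks are insignificant: GO TO == GOTO
    ("RELOP",      r"\.EQ\."),
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Z][A-Z0-9]*"),
    ("LPAREN",     r"\("),
    ("RPAREN",     r"\)"),
    ("SKIP",       r"\s+"),          # whitespace separates tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(statement):
    """Return the stream of (token class, lexeme) pairs for one statement."""
    tokens = []
    for m in MASTER.finditer(statement):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(scan("IF (5 .EQ. MAX) GO TO 100"))   # eight (class, lexeme) pairs
```

A real scanner would also attach the second quantity mentioned above, e.g. a symbol-table index for each identifier.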

                 Source Programme
                        |
                        v
                 Lexical Analysis
                        |
                        v
                 Syntax Analysis
                        |
                        v
Table Management <-- Intermediate Code Generation --> Error Handling
                        |
                        v
                 Code Optimisation
                        |
                        v
                 Code Generation
                        |
                        v
                 Target Programme

Fig. 1: Phases of a Compiler (table management and error handling interact with all phases)


3.2.2 The Syntax Analyser

This groups tokens together into syntactic structures. For example, the
three tokens representing A+B might be grouped into a syntactic
structure called an expression. Expressions might further be combined to
form statements. Often the syntactic structure can be regarded as a tree
whose leaves are the tokens. The interior nodes of the tree represent
strings of tokens that logically belong together. The parser has two
functions. It checks that the tokens appearing in its input, which is the
output of the lexical analyser, occur in patterns that are permitted by the
specification for the source language. It also imposes on the tokens a
tree-like structure that is used by the subsequent phases of the compiler.
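The grouping of A+B into an expression node can be sketched with a tiny parser. The tuple representation of tree nodes (operator first, then the two subtrees) is an assumption of this sketch:

```python
# Sketch: group a token list such as ["A", "+", "B"] into a tree whose leaves
# are the tokens and whose interior nodes represent expressions.

def parse_expression(tokens):
    pos = 0

    def next_token():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    tree = next_token()                    # leftmost operand: a leaf
    while pos < len(tokens) and tokens[pos] == "+":
        next_token()                       # consume the "+"
        tree = ("+", tree, next_token())   # interior node over two subtrees
    return tree

print(parse_expression(["A", "+", "B"]))   # ('+', 'A', 'B')
```

For A+B+C the loop builds the left-associative tree ('+', ('+', 'A', 'B'), 'C'), showing how expressions combine into larger structures.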

3.2.3 The Intermediate Code Generator

This uses the structure produced by the syntax analyser to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. These instructions can be viewed as simple macros, like the macro ADD2. The primary difference between intermediate code and assembly code is that the intermediate code need not specify the registers to be used for each operation.
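The translation into one-operator instructions can be sketched as a walk over an expression tree. The tuple trees and the naming of temporaries (t1, t2, ...) are assumptions of this sketch; note that no registers are mentioned:

```python
# Sketch: emit three-address intermediate code from an expression tree.
# Each instruction has one operator and a fresh temporary as its destination.

def gen_intermediate(tree, code, counter):
    if isinstance(tree, str):      # a leaf: variable name or constant
        return tree
    op, left, right = tree
    l = gen_intermediate(left, code, counter)
    r = gen_intermediate(right, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"        # a temporary, not a machine register
    code.append(f"{temp} = {l} {op} {r}")
    return temp

code = []
gen_intermediate(("+", "A", ("*", "B", "C")), code, [0])   # A + B * C
for line in code:
    print(line)
# t1 = B * C
# t2 = A + t1
```

Deciding which registers hold t1 and t2 is deferred to the code-generation phase.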

3.2.4 Code Optimisation

This is an optional phase designed to improve the intermediate code so that the ultimate object programme runs faster and/or takes less space.
Its output is another intermediate code programme that does the same
job as the original, but perhaps in a way that saves time and/or space.
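One simple improvement of this kind is constant folding, sketched below on the same style of tuple trees; the representation is an assumption of this sketch:

```python
# Sketch: constant folding. A subtree whose operands are all known constants
# is replaced at compile time by its value, so no code is emitted for it.

def fold(tree):
    if not isinstance(tree, tuple):
        return tree                        # leaf: variable name or number
    op, left, right = tree
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        if op == "+":
            return left + right
        if op == "*":
            return left * right
    return (op, left, right)

print(fold(("+", "x", ("*", 2, 3))))   # ('+', 'x', 6): 2 * 3 folded to 6
```

The folded programme computes the same result as the original, but with one multiplication removed, which is exactly the time/space saving described above.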

3.2.5 Code Generation

This is the final phase and it produces the object code by deciding on the
memory locations for data, selecting code to access each datum, and
selecting the registers in which each computation is to be done.
Designing a code generator that produces truly efficient object
programmes is one of the most difficult parts of a compiler design, both
practically and theoretically.


3.2.6 The Table Management or Bookkeeping

This portion of the compiler keeps track of the names used by the
programme and records essential information about each, such as its
type (integer, real, etc.). The data structure used to record this
information is called a symbol table.
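A symbol table is commonly implemented as a hash table keyed by name. The attribute fields in this Python sketch are illustrative:

```python
# Sketch: a symbol table mapping each name to its recorded attributes.

symbol_table = {}

def declare(name, sym_type):
    if name in symbol_table:
        raise ValueError(f"'{name}' already declared")
    symbol_table[name] = {"type": sym_type}   # record essential information

def lookup(name):
    if name not in symbol_table:
        raise KeyError(f"'{name}' used before declaration")
    return symbol_table[name]

declare("NUM", "integer")
declare("X", "real")
print(lookup("NUM")["type"])   # integer
```

Every phase can consult lookup(), e.g. the semantic analyser uses the recorded type to check that NUM is used consistently with its declaration.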

3.2.7 The Error Handler

This is invoked when a flaw in the source programme is detected. It must warn the programmer by issuing a diagnostic, and adjust the information being passed from phase to phase so that each phase can proceed. It is desirable that compilation be completed on flawed programmes, at least through the syntax-analysis phase, so that as many errors as possible can be detected in one compilation. Both the table management and error handling routines interact with all phases of the compiler.

SELF-ASSESSMENT EXERCISE

i. List the various phases of the compilation process.
ii. Why is error handler important in a compiler?
iii. Which is the first phase of the compiler?

3.3 Passes

In an implementation of a compiler, portions of one or more phases are combined into a module called a pass. A pass reads the source
programme or the output of the previous pass, makes the
transformations specified by its phases, and writes output into an
intermediate file, which may then be read by a subsequent pass. If
several phases are grouped into one pass, then the operation of the
phases may be interleaved, with control alternating among several
phases.

The number of passes, and the grouping of phases into passes, are
usually dictated by a variety of considerations germane to a particular
language and machine, rather than by any mathematical optimality
criteria.

The structure of the source language has a strong effect on the number of passes. Certain languages require at least two passes to generate code
easily. For example, languages such as PL/I or ALGOL 68 allow the
declaration of a name to occur after uses of that name. Code for
expression containing such a name cannot be generated conveniently
until the declaration has been seen.


3.4 Reducing the Number of Passes

Since each phase is a transformation on a stream of data representing an intermediate form of the source programme, you may wonder how
several phases can be combined into one pass without the reading and
writing of intermediate files. In some cases, one phase produces its output
with little or no memory of prior inputs. Lexical analysis is typical. In
this situation, a small buffer serves as the interface between passes. In
other cases, you may merge phases into one pass by means of a
technique known as ‘back patching’. In general terms, if the output of a
phase cannot be determined without looking at the remainder of the
phase’s input, the phase can generate output with ‘slots’ which can be
filled in later, after more of the input is read.
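Back patching can be sketched as follows; the instruction format used here is an assumption of this sketch:

```python
# Sketch of back patching: a jump is emitted with an empty 'slot' for its
# target, and the slot is filled in once the target address becomes known.

code = []

def emit_jump():
    code.append(["JUMP", None])    # target not yet known: leave a slot
    return len(code) - 1           # remember which instruction to patch

def backpatch(index, target):
    code[index][1] = target        # fill the slot

j = emit_jump()                    # e.g. a forward jump in an if-statement
code.append(["LOAD", "X"])
backpatch(j, len(code))            # the target address is now known
print(code)                        # [['JUMP', 2], ['LOAD', 'X']]
```

Because the slot is patched in memory, no intermediate file needs to be re-read, which is exactly how two phases can share one pass.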

3.5 Cross Compilation

Consider porting a compiler for C, written in C, from an existing machine A to an existing machine B.

Steps to Cross Compilation

a. Write a new back-end in C to generate code for computer B.
b. Compile the new back-end with the existing C compiler running on computer A.
c. We now have a compiler running on computer A and generating code for computer B.
d. Use this new compiler to generate a complete compiler for computer B; in other words, compile the compiler itself on computer A, generating code for computer B.
e. We now have a complete compiler for computer B that will run on computer B.
f. Copy this new compiler across and run it on computer B (this is cross compilation).

3.6 Operations of a Compiler

As you have seen in section 3.1 of this unit, the operation of a compiler
includes:

Front-end:
• lexical analysis (or scanning)
• syntax analysis
• semantic analysis

Back-end:
• intermediate code optimisation
• code generation
• code optimisation


In this course, we will concern ourselves mostly with the front-end, i.e. those parts of compilation that can be automated, which are the lexical, syntax and, to some extent, semantic analyses.

3.7 T-Diagrams

T-diagrams are used to describe the nature of a compilation. A T-diagram is usually in the form of a T and is represented diagrammatically as in figure 2 below:

                  Compiler
  +--------------------------------------+
  |  Source                    Target    |
  |  Language                  Language  |
  +----------+                +----------+
             | Implementation |
             |    Language    |
             +----------------+

Fig. 2: T-Diagrams

4.0 CONCLUSION

In this unit, you have been taken through the structure of a compiler, its phases, and the functions of each phase. In the next unit, you will learn more about the compilation process, including cross compilation and hand implementation.

5.0 SUMMARY

In this unit, you learnt that:

• the compilation process can be partitioned into a series of sub-processes called phases
• several phases can be grouped into one pass, so that the operation
of the phases may be interleaved, with control alternating among
several phases
• the number of passes, and the grouping of phases into passes, are
usually dictated by a variety of considerations germane to a
particular language and machine
• the operations of a compiler can be classified into two: the front-end, comprising lexical analysis (or scanning), syntax analysis, and semantic analysis; and the back-end, comprising intermediate code optimisation, code generation, and code optimisation.

6.0 TUTOR-MARKED ASSIGNMENT

i. Describe what happens at each phase of the compilation process.
ii. Distinguish between the intermediate code generator and the code
generator phase of the compiler.
iii. Which of the phases of the compiler is optional?
iv. In your own opinion, would an error handler be part of an interpreter?
v. Enumerate the steps in cross compilation.

7.0 REFERENCES/FURTHER READING

Aho, A. V. & Ullman, J. D. (1977). Principles of Compiler Design. Addison-Wesley Publishing Company. ISBN 0-201-00022-9.

Chomsky, Noam & Schützenberger, Marcel P. (1963). "The Algebraic Theory of Context-Free Languages". In: P. Braffort & D. Hirschberg (Eds.). Computer Programming and Formal Languages. Amsterdam: North Holland. pp. 118–161.
