What are compilers?

Compilers are programs that read a program written in one (source) language and translate it into an equivalent program in another (target) language.

Compilers need to perform two main tasks:

  1. Analysis: breaking up the source program into constituent parts and creating an intermediate representation of it
  2. Synthesis: constructing the desired target program from the intermediate representation

A compiler is often designed as a number of phases:

lexical analysis
reads a stream of characters and groups them into a stream of tokens (logically cohesive character sequences). This often requires maintaining a symbol table
syntax analysis
imposes a hierarchical structure on the token stream (parse tree)
semantic analysis
checks the parse tree for semantic errors (e.g. type errors)
intermediate code generation
produces a program for an abstract machine
code optimization
optimizes the intermediate code representation in terms of efficiency

Interpreters, by contrast, take both the program and its input data and produce the output directly. Interpreters do their computation “online”.

Cousins of the Compiler

  1. Preprocessors
    1. Macro Processing
    2. File inclusion
    3. Language Extension
  2. Assemblers
  3. Loaders and Link-editors

The Economy of Programming Languages

Why are there so many programming languages? Application domains have distinctive and conflicting needs. For example, scientific computing needs good support for floating point, arrays and parallelism; Julia is a good example of a language designed for scientific computing. Systems programming requires fine control over resources and must satisfy real-time constraints; languages like C are suited for these applications.

Programmer training is the dominant cost for a programming language. It is difficult to modify a language, but easy to start a new one.

Lexical Analysis

The lexical analyzer reads the input characters of the source program, groups them into lexemes, and outputs a sequence of tokens to the parser.

It may also:

  1. Strip out comments and whitespace
  2. Correlate error messages from the compiler with positions in the source program (e.g. line numbers)
  3. Expand macros

Term Definitions

token
a pair consisting of a token name (an abstract symbol representing a kind of lexical unit) and an optional attribute value
pattern
a description of the form that the lexemes of a token can take
lexeme
a sequence of characters matching the pattern for a token

E.g. printf("Total = %d", score);

Tokens        Lexemes
id            printf, score
literal       "Total = %d"
punctuation   ( , ) ;

The resulting token stream is:

ID(printf) LPAREN LITERAL("Total = %d") COMMA ID(score) RPAREN SEMI

Tokens fall into several categories:

Category      Examples
Keywords      if else return
Operators     > < <> != !
Identifiers   pi score temp1
Constants     3.14 “hello”
Punctuation   ( ) ; :

Attributes for tokens distinguish two lexemes that belong to the same token; typically the attribute value is a pointer to the symbol-table entry for the lexeme, e.g.

<id, pointer to symbol-table entry for pi>, <id, pointer to symbol-table entry for score>

Regular Languages

Since the lexical analyzer needs to split the input into token classes, we must specify the set of strings that belongs to each token class. To do so, we use regular languages.
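For instance, identifier and number token classes can be specified with regular definitions like the following (the exact definitions are illustrative, not taken from these notes):

letter -> [A-Za-z]
digit  -> [0-9]
id     -> letter (letter | digit)*
number -> digit+ (. digit+)?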

Figure 1: Cycle of constructions

Syntax Definition

We use context-free grammars to specify the syntax of a language.

A grammar for arithmetic expressions can be constructed from a table showing the associativity and precedence of operators.

left-associative: + -
left-associative: * /

Two different non-terminals can be constructed for the two levels of precedence:

factor -> digit | (expr)
term -> term * factor
  | term / factor
  | factor
expr -> expr + term
  | expr - term
  | term
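For example, the grammar derives 9 + 5 * 2 with * grouping more tightly than +, as this leftmost derivation shows:

expr => expr + term
     => term + term
     => factor + term
     => 9 + term
     => 9 + term * factor
     => 9 + factor * factor
     => 9 + 5 * factor
     => 9 + 5 * 2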

Parsing

Parsers use pushdown automata to do parsing.

Recursive-Descent Parsing

A recursive-descent parser consists of a set of procedures, one for each non-terminal (a small sketch is given below). The construction of both top-down and bottom-up parsers is aided by two functions: First and Follow.

\(First(\alpha)\), where \(\alpha\) is any string of grammar symbols, is the set of terminals that begin strings derived from \(\alpha\). If \(\alpha\) derives \(\epsilon\), then \(\epsilon\) is also in \(First(\alpha)\).

\(Follow(A)\), for a nonterminal \(A\), is the set of terminals \(a\) that can appear immediately to the right of \(A\) in some sentential form: the set of terminals \(a\) such that there exists a derivation of the form \(S \overset{*}{\Rightarrow} \alpha A a \beta\).
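As an illustration (not part of the original notes), here is a minimal recursive-descent parser sketch in C for the expression grammar above. Since a recursive-descent parser cannot handle left-recursive productions directly, the left recursion in expr and term is rewritten as iteration, and single-character tokens read with getchar are assumed for simplicity:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static int lookahead;              /* one-token lookahead (a single character here) */

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s\n", msg);
    exit(1);
}

static void match(int t) {
    if (lookahead == t)
        lookahead = getchar();     /* consume the token and advance */
    else
        error("unexpected token");
}

static void expr(void);            /* forward declaration for the mutual recursion */

/* factor -> digit | ( expr ) */
static void factor(void) {
    if (isdigit(lookahead)) {
        match(lookahead);
    } else if (lookahead == '(') {
        match('(');
        expr();
        match(')');
    } else {
        error("expected a digit or '('");
    }
}

/* term -> factor { (* or /) factor }   -- left recursion replaced by a loop */
static void term(void) {
    factor();
    while (lookahead == '*' || lookahead == '/') {
        match(lookahead);
        factor();
    }
}

/* expr -> term { (+ or -) term } */
static void expr(void) {
    term();
    while (lookahead == '+' || lookahead == '-') {
        match(lookahead);
        term();
    }
}

int main(void) {
    lookahead = getchar();
    expr();
    if (lookahead == '\n' || lookahead == EOF)
        puts("accepted");
    else
        error("trailing input");
    return 0;
}

Given the input (9+5)*2 followed by a newline, this sketch prints accepted.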

Bottom-up Parsing

The parse tree for an input string is constructed beginning from the leaves (bottom) and working up towards the root (the top).

Figure 2: Bottom up parsing

LR grammars can be parsed with shift-reduce parsers.

One can think of bottom-up parsing as the process of “reducing” a string \(w\) to the start symbol of the grammar. At each reduction, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production.

Bottom-up parsing during a left-to-right scan of the input constructs a right-most derivation in reverse.

A stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string \(\beta\) of grammar symbols on top of the stack. It then reduces \(\beta\) to the head of the appropriate production. The parser repeats this cycle until it has detected an error, or until the stack contains the start symbol and the input is empty.

A stack is used in shift-reduce parsing because the handle will always eventually appear on top of the stack, never inside it.
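As a worked example (using the expression grammar from the Syntax Definition section, with $ marking the bottom of the stack and the end of the input), a shift-reduce parse of 9 + 5 * 2 proceeds roughly as follows:

Stack                   Input          Action
$                       9 + 5 * 2 $    shift
$ 9                     + 5 * 2 $      reduce by factor -> digit
$ factor                + 5 * 2 $      reduce by term -> factor
$ term                  + 5 * 2 $      reduce by expr -> term
$ expr                  + 5 * 2 $      shift
$ expr +                5 * 2 $        shift
$ expr + 5              * 2 $          reduce by factor -> digit
$ expr + factor         * 2 $          reduce by term -> factor
$ expr + term           * 2 $          shift
$ expr + term *         2 $            shift
$ expr + term * 2       $              reduce by factor -> digit
$ expr + term * factor  $              reduce by term -> term * factor
$ expr + term           $              reduce by expr -> expr + term
$ expr                  $              accept

Note that when expr + term is on the stack with * as the next input symbol, the parser shifts instead of reducing by expr -> expr + term; this is how the higher precedence of * is honoured.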

Shift-reduce conflicts

  1. Cannot decide whether to shift or reduce (shift/reduce conflict)
  2. Cannot decide which of several reductions to make (reduce/reduce conflict)

These grammars are not in the \(LR(k)\) class of grammars.

In \(LR(k)\), L stands for left-to-right scanning, R stands for rightmost derivation, and \(k\) is the number of symbols of lookahead. LR parsers are table-driven. The LR-parsing method is the most general nonbacktracking shift-reduce parsing method known. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods.

The main downside is that constructing an LR parser by hand is tedious.

Syntax Directed Translation

Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. e.g. consider

expr -> expr_1 + term

We can translate expr by attaching a semantic action within the production body:

expr -> expr_1 + term { print "+" }

The position of the semantic action within the production body determines the order in which the action is executed.

The most general approach to SDT is to construct a parse tree or syntax tree, and compute the values of attributes by visiting the nodes of the tree. In most cases, SDT can be performed without explicit construction of the tree.

L-attributed translations (left-to-right) encompass virtually all translations that can be performed during parsing.

S-attributed translations (synthesized) are a smaller class that can be performed easily in connection with a bottom-up parse.

Syntax-Directed Definitions

An SDD is a CFG together with attributes and rules. Attributes are associated with grammar symbols, and rules are associated with productions. We write \(X.a\), where \(X\) is a symbol and \(a\) is an attribute.

There are two kinds of attributes for non-terminals:

synthesized attribute
defined by a semantic rule associated with the production at the parse-tree node itself
inherited attribute
defined by a semantic rule associated with the production at the parent of the parse-tree node
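For example, an SDD that evaluates expressions with the grammar from the Syntax Definition section can give each nonterminal a single synthesized attribute val (a standard textbook example, not taken verbatim from these notes):

Production                 Semantic rule
expr -> expr_1 + term      expr.val = expr_1.val + term.val
expr -> expr_1 - term      expr.val = expr_1.val - term.val
expr -> term               expr.val = term.val
term -> term_1 * factor    term.val = term_1.val * factor.val
term -> term_1 / factor    term.val = term_1.val / factor.val
term -> factor             term.val = factor.val
factor -> ( expr )         factor.val = expr.val
factor -> digit            factor.val = digit.lexval

Since every attribute here is synthesized, this SDD can be evaluated in any bottom-up order over the parse tree.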

Evaluating an SDD at the nodes of a parse tree

We can construct an annotated parse tree, i.e. a parse tree showing the values of the attributes at each node. With synthesized attributes only, the attributes can be evaluated in any bottom-up order. For an SDD with both inherited and synthesized attributes, there is no guarantee that there is even one order in which to evaluate the attributes at the nodes. There are, however, useful subclasses of SDDs that are sufficient to guarantee that an order of evaluation exists.

The dependency graph characterizes the possible orders in which we can evaluate the attributes at various nodes in the parse tree.

An SDD is S-attributed if every attribute is synthesized. In this scenario, we can evaluate attributes in any bottom-up order, for example using post-order traversal of the parse tree.

In L-attributed SDDs, dependency-graph edges can only go from left-to-right, and not right-to-left. This means that each attribute must be either:

  1. Synthesized, or
  2. Inherited, but with rules limited as follows: If there is a production \( A \rightarrow X_1 X_2 \dots X_n \) and there is an inherited attribute \(X_i.a\) computed by a rule associated with this production, then the rule may use only:
    1. Inherited attributes associated with the head \(A\)
    2. Either inherited or synthesized attributes associated with the occurrences of symbols \(X_1, X_2 \dots X_{i-1}\) located to the left of \(X_i\).
    3. Inherited or synthesized attributes associated with \(X_i\) itself, in a way that no cycles are formed in the dependency graph by attributes of this \(X_i\).

Side Effects

We can control side effects in SDDs by:

  1. Permitting incidental side effects that do not constrain attribute evaluation
  2. Constraining the allowable evaluation orders, so that the same translation is produced for any allowable order

Semantic rules executed for their side effects, such as printing, are treated as the definitions of dummy synthesized attributes associated with the head of the production. The modified SDD produces the same translation under any topological sort, since the print statement is executed at the end.
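A minimal sketch of this idea, reusing the val attribute from the evaluation example above (the nonterminal L and the end marker n are assumptions made for the example):

Production        Semantic rule
L -> expr n       L.dummy = print(expr.val)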

Syntax-Directed Translation Schemes

An SDT is a CFG with program fragments embedded within production bodies. The program fragments are called semantic actions, and can appear at any position within the production body.

Run Time Environment

The run-time environment deals with the layout and allocation of storage locations for the objects named in the source program, the linkages between procedures and the mechanisms for passing parameters, as well as interfaces to the operating system, I/O devices and other programs.

Storage Organization

An example of the run-time representation of an object program in the logical address space is shown below:
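(The figure is not reproduced here; a typical layout, from one end of the logical address space to the other, looks roughly like this, with the heap and the stack growing towards each other through the free area:)

+--------------+
| Code         |
| Static       |
| Heap         |
| Free memory  |
| Stack        |
+--------------+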

The operating system maps logical addresses into physical addresses. Run-time storage typically comes in blocks of contiguous bytes, where a byte is the smallest unit of addressable memory.

aligned
addresses must be a multiple of the required alignment (e.g. divisible by 4 for word-aligned data)
padding
space left unused in order to satisfy alignment constraints

The size of the generated target code is fixed at compile time, so the compiler can place the executable target code in a statically determined area called Code. The size of program objects such as global constants and data is also known at compile time, and these are placed in a statically determined area called Static.

To maximize utilization of space at run-time, the stack and heap are at opposite ends of the remainder of the address space. In practice, the stack grows towards lower addresses, and the heap towards higher, but here we assume the opposite so we can use positive offsets for notational convenience.

Dynamic storage allocation is handled with 2 strategies:

stack storage
names local to a procedure are allocated space on the stack
heap storage
data that may outlive the call to the procedure are allocated here

Garbage collection enables the run-time system to detect useless data elements and reuse their storage.

Activation Trees

Activation Records

Procedure calls and returns are managed by a run-time stack called the control stack. Each live activation has an activation record (frame) on the control stack, with the root of the activation tree at the bottom, and the entire sequence of activation records on the stack corresponding to the path in the activation tree to the activation where control currently resides.

An activation record may contain:

temporary values
arising from evaluation of expressions etc.
local data
belonging to the procedure whose activation record this is
saved machine status
return address (program counter), contents of registers that must be restored when return occurs
access link
used to locate data needed by the called procedure but found elsewhere (e.g. in another activation record)
control link
pointing to the activation record of the caller
return value
space must be allocated for the return value of the called function, if any
parameters
the actual parameters used by the calling procedure; these are, however, often placed in registers instead of on the stack when possible
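Put together, a general activation record might be laid out roughly as follows (the exact order of fields varies between compilers; this sketch simply follows the list above):

+------------------------+
| actual parameters      |
| returned values        |
| control link           |
| access link            |
| saved machine status   |
| local data             |
| temporaries            |
+------------------------+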

Calling Sequences

A calling sequence consists of code that allocates an activation record on the stack and enters information into its fields. A return sequence similarly consists of code that restores the state of the machine so that the calling procedure can continue its execution after the call.

It is desirable to put as much of the calling sequence as possible into the callee, since this reduces the amount of code generated, but the callee cannot know everything. These are some guiding principles:

  1. Values communicated between the caller and callee are placed at the beginning of the callee’s activation record, so they are as close as possible to the caller’s activation record. The caller can compute the values of the actual parameters of the call and place them on top of its own activation record without having to create the entire activation record for the callee. It also allows for procedures that take a varying number of arguments.
  2. Fixed length items are placed in the middle. These include control links, access links and the machine status fields.
  3. Items whose size may not be known early enough are placed at the end of the activation record. Most local variables have a fixed length, which can be determined by the compiler by examining the type of the variable. However, the size of some local variables, such as dynamically sized arrays, cannot be determined until execution.
  4. We must locate the top-of-stack pointer. A common approach is to have it point to the end of the fixed-length fields in the activation record. Fixed-length data can be accessed by fixed offsets relative to the top-of-stack pointer. The offsets to variable-length fields are then calculated at run-time, using a positive offset from the top-of-stack pointer.

Variable-Length Data on the Stack

A common scheme that works for dynamically sized arrays is to have a pointer to the array stored on the stack. These pointers are at known offsets from the top-of-stack pointer, so the target code can access array elements through them.
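A rough sketch of this scheme for a procedure p with a dynamically sized local array a (the names are illustrative):

activation record of p
  control link, saved status, ...
  pointer to a             <- at a fixed offset from the top-of-stack pointer
  fixed-size local data
array a                    <- storage allocated here; size known only at run time
activation records of procedures called by p follow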

Data Access

Heap Management

Tools

Lex

A Lex compiler takes a Lex source program and outputs a program in its implementation language (originally C; jFlex produces Java). The resulting program takes an input stream as input and produces a sequence of tokens as output.

A lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions

The declaration section includes declarations of variables, manifest constants, and regular definitions.

The translation rules have the form: Pattern { Action }.

Each pattern is a regular expression, which may use the definitions of the declaration section. The actions are fragments of code.

The auxiliary functions section contains functions used in these actions.
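For illustration, a small flex-style Lex program that recognizes a few of the token classes above might look like the following sketch; the token codes (IF, ID, NUMBER) and the symbol-table helpers install_id and install_num are assumptions standing in for whatever the parser and symbol table actually provide:

%{
/* Token codes; in practice these usually come from the parser generator.
   IF, ID, NUMBER and the install_* helpers below are illustrative only. */
#define IF     1
#define ID     2
#define NUMBER 3

int yylval;                /* attribute value of the most recent token */
int install_id(void);      /* enter yytext into the symbol table, return its index */
int install_num(void);
%}

%option noyywrap

delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?

%%

{ws}       { /* skip whitespace; no token is returned to the parser */ }
if         { return IF; }
{id}       { yylval = install_id();  return ID; }
{number}   { yylval = install_num(); return NUMBER; }

%%

/* Auxiliary functions used in the actions above (sketch only). */
int install_id(void)  { return 0; }
int install_num(void) { return 0; }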