Compiler design

This section deals with the design of the compiler for this language. For reasons of simplicity, the compiler will not feature any code optimisation at this time. A grammar for the language is presented in Appendix A. A syntax-directed defintion is not presented for the moment, although this will appear in the final project.

Compiler Overview

This compiler will be a one-pass compiler. In the initial revision, the compiler will not produce any intermediate files - this is to ensure that an initial version of the compiler and interpreter can be developed and tested in the shortest possible time. The design of the compiler will take into account the possibility of imported symbols in the future.

Parse
|
Resolve symbols
|
Evaluate constants
|
Semantic analysis
|
Produce intermediate code
|
(Optional) Optimise
|
Produce code

Figure 6 - compiler execution overview

This is a working overview of the stages the compiler will take. It deviates from the classical design as presented in [1] due partially to the design of the language, in particular the unusual cases of constants and the implicit forward definition of all functions and variables, which must be resolved after parsing. Lexical and syntactic analysis is covered in the 'parse' phase.

Parser output

This section is primarily a place-holder and contains notes about the data structures output by the parser. It will eventually by replaced by a section containing a more formal definition of the parser's functions. It may not be clear at this point what a 'block' is - it is any section of source code that is enclosed by square brackets.

Levels and scope

A symbol is a name associated with a type (object), variable or function. It may be defined as a global variable (top-level), or as part of an object definition or code block. As definitions and code blocks can be defined recursively, a symbol can therefore be considered to be on a 'level', which essentially means the number of unclosed square brackets that precede the declaration of a symbol. Each symbol also has a scope associated with it - this is either global or until the closing square bracket that terminates the level that the symbol was declared on.

Symbol resolution and constant evaluation

In many languages it is a usual requirement that a function, type or global variable must be declared before use. This language is unusual in that it specifically allows these constructs to be used before they are declared. No checks on the existance of symbols can therefore be made before all code has been analysed (and in later versions of the compiler, also until all symbol tables have been loaded). Symbol resolution is therefore performed after the files have been parsed.

The first phase of symbol resolution is to check for any errors in the tables produced by the parser. In particular, no two named objects should have the same name. That is, no type should share a name with a variable or function defined at the same level as the original definition. In addition, no symbol should have a name that is a reserved word, but this condition should be noticed as a syntax error by the parser.

The next phase is to discover which declaration each symbol reference in the parsed code refers to. A reference also has a level and scope at which it appears (worked out in the same way that a declaration's level is). The reference is resolved to be a symbol with the correct name that is in the same scope as the reference and on the highest possible level. The only error that can be produced by this phase is the case where a symbol that is not defined is referred to.

Several constructs allow the use of symbols defined in other objects (for example, the function foo.bar()). In these cases, the leftmost symbol reference is in the scope of the enclosing block, and the reference to the right of it is in the scope of the type definition of the left reference. Scope changes in a similar manner for symbols that are further to the right, and is resolved in a left-to-right fashion.

The parser will also produce expressions for the values of constants. Constants cannot have any value that includes a variable or function of any sort, so the next phase of compilation will evaluate the constants to a single value. It is an error to find a constant that contains any variable or function, and it is also an error to find a constant that in some way refers to itself - for instance, as in the following code:

constant a = b+12;
constant b = a*24;

which is syntactically correct, but nonsense from a semantic point of view.

Intermediate code design

The nature of the design of the bytecode itself makes it a very appropriate language to use in the definition of the intermediate code produced by the compiler. However, the bytecode is not especially expressive, and it would simplify the design of the compiler considerably if it could include subexpressions as arguments to operations, and make references to as-yet unallocated symbols. The intermediate code will therefore have a tree structure, which is written down using a structure similar to LISP:

(MOVE (ADD (ADD #1 #2) #3) foo) - adds 3 to the result of 1+2, and stores the resulting value in a location specified by 'foo' (which is a resolved symbol).

This code would compile to:

ADD  #1 #2 -> (stack)
ADD  (stack) #3 -> (stack)
MOVE (stack) -> foo

although there are several obvious optimisations a compiler could make to this code. This design minimises the amount of work the compiler has to do converting the intermediate form of the code to the output form. An optimiser may want to store direct to a location rather than using the stack and MOVE. This form is written (ADD (ADD #1 #2) #3 -> foo), which is functionally equivalent to the above.


Andrew Hunter
Last modified: Tue Jan 2 14:37:25 GMT 2001