This section deals with the design of the compiler for this language. For simplicity, the compiler will not feature any code optimisation at this time. A grammar for the language is presented in Appendix A. A syntax-directed definition is not presented for the moment, although it will appear in the final project.
This compiler will be a one-pass compiler. In the initial revision, the compiler will not produce any intermediate files - this is to ensure that an initial version of the compiler and interpreter can be developed and tested in the shortest possible time. The design of the compiler will take into account the possibility of imported symbols in the future.
Parse
Resolve symbols
Evaluate constants
Produce intermediate code
This is a working overview of the stages the compiler will take. It deviates from the classical design as presented in  due in part to the design of the language - in particular, the unusual cases of constants and the implicit forward definition of all functions and variables, which must be resolved after parsing. Lexical and syntactic analysis are covered in the 'parse' phase.
This section is primarily a place-holder and contains notes about the data structures output by the parser. It will eventually be replaced by a section containing a more formal definition of the parser's functions. It may not be clear at this point what a 'block' is - it is any section of source code that is enclosed by square brackets.
A symbol is a name associated with a type (object), variable or function. It may be defined as a global variable (top-level), or as part of an object definition or code block. As definitions and code blocks can be defined recursively, a symbol can therefore be considered to be on a 'level', which essentially means the number of unclosed square brackets that precede the declaration of a symbol. Each symbol also has a scope associated with it - this is either global or until the closing square bracket that terminates the level that the symbol was declared on.
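The notion of a symbol's level and scope can be sketched as follows. This is a minimal illustration in Python; the record shape and names (`Symbol`, `level`, `block_id`) are assumptions for the sketch, not the project's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str      # identifier as written in the source
    kind: str      # "type", "variable" or "function"
    level: int     # number of unclosed '[' preceding the declaration
    block_id: int  # block the symbol was declared in (0 = top level)

def is_global(sym: Symbol) -> bool:
    # A symbol declared outside any block (level 0) has global scope;
    # otherwise its scope ends at the ']' that closes its block.
    return sym.level == 0
```

A top-level declaration thus has level 0 and global scope, while a symbol declared inside two nested blocks has level 2 and is visible only until the second closing bracket.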
Many languages require that a function, type or global variable be declared before use. This language is unusual in that it specifically allows these constructs to be used before they are declared. No checks on the existence of symbols can therefore be made until all code has been analysed (and, in later versions of the compiler, until all symbol tables have been loaded). Symbol resolution is therefore performed after the files have been parsed.
The first phase of symbol resolution is to check for any errors in the tables produced by the parser. In particular, no two named objects may share a name: no type may share a name with a variable or function defined at the same level as the original definition. In addition, no symbol may have a name that is a reserved word, although this condition should already have been caught as a syntax error by the parser.
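The duplicate-name check might be sketched as below; the `(name, level, block)` record shape is an assumption for illustration, standing in for the parser's tables.

```python
# Report any two symbols that share a name at the same level in the
# same block - the error condition described above.
def check_duplicates(symbols):
    seen = set()
    errors = []
    for name, level, block in symbols:
        key = (name, level, block)
        if key in seen:
            errors.append(f"duplicate symbol '{name}' at level {level}")
        seen.add(key)
    return errors
```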
The next phase is to discover which declaration each symbol reference in the parsed code refers to. A reference also has a level and scope at which it appears (worked out in the same way as a declaration's level). The reference is resolved to the symbol with the correct name that is in the same scope as the reference and on the highest possible level. The only error this phase can produce is a reference to a symbol that is never defined.
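This resolution rule can be sketched as follows, under assumed data structures: `decls` is a list of declaration records and `enclosing` maps each block to the chain of blocks containing it, innermost first (both names are hypothetical).

```python
# Resolve a reference: of the declarations whose block encloses the
# reference, pick the one on the highest (deepest) level.
def resolve(name, ref_block, decls, enclosing):
    chain = enclosing[ref_block]
    candidates = [d for d in decls
                  if d["name"] == name and d["block"] in chain]
    if not candidates:
        # The only error this phase can produce.
        raise NameError(f"undefined symbol: {name}")
    return max(candidates, key=lambda d: d["level"])
```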
Several constructs allow the use of symbols defined in other objects. In these cases, the leftmost symbol reference is in the scope of the enclosing block, and the reference to its right is in the scope of the type definition of the left reference. Scope changes in a similar manner for symbols further to the right, and resolution proceeds in a left-to-right fashion.
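The left-to-right rule can be sketched as below; `lookup` and `type_scope_of` are assumed helpers standing in for the symbol tables, not part of the actual design.

```python
# Resolve a qualified reference such as "a.b": the leftmost name is
# looked up in the enclosing scope, and each subsequent name in the
# scope of the previous symbol's type definition.
def resolve_qualified(parts, scope, lookup, type_scope_of):
    sym = lookup(parts[0], scope)      # leftmost: enclosing block's scope
    for part in parts[1:]:
        scope = type_scope_of(sym)     # switch to the left symbol's type
        sym = lookup(part, scope)
    return sym
```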
The parser will also produce expressions for the values of constants. Constants cannot have any value that includes a variable or function of any sort, so the next phase of compilation will evaluate each constant to a single value. It is an error to find a constant that contains any variable or function, and it is also an error to find a constant that refers, directly or indirectly, to itself - for instance, as in the following code:
constant a = b+12;
constant b = a*24;
which is syntactically correct, but nonsense from a semantic point of view.
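A minimal sketch of this evaluation phase, assuming constants parse to small expression trees (integer literals, names of other constants, and binary + and * nodes - an assumption for the sketch). The `visiting` set detects self-reference such as the a/b pair above.

```python
# Evaluate the named constant, caching results and rejecting cycles.
def eval_constant(name, exprs, cache=None, visiting=None):
    cache = {} if cache is None else cache
    visiting = set() if visiting is None else visiting
    if name in cache:
        return cache[name]
    if name in visiting:
        raise ValueError(f"constant '{name}' refers to itself")
    visiting.add(name)
    value = eval_expr(exprs[name], exprs, cache, visiting)
    visiting.discard(name)
    cache[name] = value
    return value

def eval_expr(expr, exprs, cache, visiting):
    if isinstance(expr, int):
        return expr                      # literal value
    if isinstance(expr, str):            # reference to another constant
        return eval_constant(expr, exprs, cache, visiting)
    op, lhs, rhs = expr
    l = eval_expr(lhs, exprs, cache, visiting)
    r = eval_expr(rhs, exprs, cache, visiting)
    return l + r if op == "+" else l * r
```

Evaluating `a` from the erroneous pair above enters `b`, which refers back to `a` while it is still being evaluated, so the cycle is reported as an error.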
The nature of the design of the bytecode itself makes it a very appropriate language to use in the definition of the intermediate code produced by the compiler. However, the bytecode is not especially expressive, and it would simplify the design of the compiler considerably if it could include subexpressions as arguments to operations, and make references to as-yet unallocated symbols. The intermediate code will therefore have a tree structure, which is written down using a structure similar to LISP:
(MOVE (ADD (ADD #1 #2) #3) foo) - adds 3 to the result of 1+2, and stores the resulting value in a location specified by 'foo' (which is a resolved symbol).
This code would compile to:
ADD #1 #2 -> (stack)
ADD (stack) #3 -> (stack)
MOVE (stack) -> foo
although there are several obvious optimisations a compiler could make to this code. This design minimises the amount of work the compiler has to do converting the intermediate form of the code to the output form. An optimiser may want to store direct to a location rather than using the stack and MOVE. This form is written (ADD (ADD #1 #2) #3 -> foo), which is functionally equivalent to the above.
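The flattening step can be sketched as below, with nested Python tuples standing in for the real tree structure (an assumption for the sketch), mirroring the MOVE/ADD example above.

```python
# Flatten a tree-structured intermediate operation into linear,
# stack-based instructions.  Each operation leaves its result on the
# stack; MOVE stores the stacked value to its destination symbol.
def compile_node(node, out):
    if isinstance(node, int):
        return f"#{node}"                 # literal: immediate operand
    op, *args = node
    if op == "MOVE":
        src = compile_node(args[0], out)
        out.append(f"MOVE {src} -> {args[1]}")
        return args[1]
    operands = [compile_node(a, out) for a in args]
    out.append(f"{op} {' '.join(operands)} -> (stack)")
    return "(stack)"

code = []
compile_node(("MOVE", ("ADD", ("ADD", 1, 2), 3), "foo"), code)
# 'code' now holds the three instructions shown above
```

The sketch handles only the single-chain case shown; a real compiler would need temporaries when more than one operand of an operation comes from the stack, which is one of the obvious optimisation and correctness concerns noted above.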