Appendix A - Language Grammar

Notation

This grammar begins with lexical rules. These are written in the form TOKEN -> /regexp/. The grammar continues with rules in BNF, using the notation PRODUCTION -> rule | rule | rule .... A rule consists of a list of elements - each element may be a production, a token or a constant character. Comments are lines prefixed with a hash '#'. Whitespace should seperate tokens, and otherwise be ignored. The special token 'Empty' is used to represent an epsilon rule. This grammar is intended to be LALR(1).

Grammar

# Tokens
IDENTIFIER -> /[A-Za-z_][A-Za-z0-9_]*/
INTEGER    -> /integer/
BYTE       -> /byte/
REAL       -> /real/
ARRAY      -> /array/
OF         -> /of/
REFERENCE  -> /reference/
FUNCTION   -> /function/
OBJECT     -> /object/
CONSTANT   -> /constant/
ENUM       -> /enum/

EQ         -> /==/
NE         -> /!=/
LE         -> /<=/
GE         -> />=/

STATIC     -> /static/
PUBLIC     -> /public/
PROTECTED  -> /protected/
PRIVATE    -> /private/

CATCH      -> /->/
THROW      -> /throw/
LOOP       -> /loop/
NEXT       -> /next/
IF         -> /if/
ELSE       -> /else/
INHERITS   -> /inherits/
FROM       -> /from/
SIZEOF     -> /sizeof/
NIL        -> /nil/
NEW        -> /new/

NUMBER     -> /[0-9][0-9]*/
STRING     -> /"([^"]|(\"))*"/

COMMENT    -> //\/\/.*\n

# Grammar
Type -> IDENTIFIER OptionalTypeVars      |
        INTEGER                          |
        REAL                             |
        BYTE                             |
        ARRAY '[' Expression ']' OF Type |
        ARRAY '[' ']' OF Type            |
        REFERENCE '{' FunctionType '}'

FunctionType -> Type |
                FUNCTION OptionalTypeVars '(' ArgumentDeclaration ')'

Scope -> STATIC    |
         PUBLIC    |
         PROTECTED |
         PRIVATE

VariableDeclaration -> Scope SimpleVarDeclaration                |
                       Scope SimpleVarDeclaration '=' Expression |
                       SimpleVarDeclaration                      |
                       SimpleVarDeclaration '=' Expression

SimpleVarDeclaration -> Type IDENTIFIER

FunctionDeclaration -> Scope FUNCTION Type IDENTIFIER OptionalTypeVars 
                       '(' ArgumentDeclaration ')' CodeBlock           |
                       FUNCTION Type IDENTIFIER OptionalTypeVars '('
                       ArgumentDeclaration ')' CodeBlock

OptionalTypeVars -> '<' TypeVars '>' | Empty

TypeVars -> IDENTIFIER | TypeVars ',' IDENTIFIER

ArgumentDeclaration -> SimpleVarDeclaration |
                       ArgumentDeclaration ';' SimpleVarDeclaration

CodeBlock -> '[' StatementList ']'

StatementList -> Statement ';'               |
                 StatementList Statement ';' |
                 Empty

Statement -> Expression          |
             LoopStatement       |
             IfStatement         |
             CatchStatement      |
             ThrowStatement      |
             CodeBlock           |
             VariableDeclaration |
             ConstantDeclaration |
             EnumDeclaration

Expression -> Element                       |
              Expression LE Expression      |
              Expression '<' Expression     |
              Expression '>' Expression     |
              Expression GE Expression      |
              Expression NE Expression      |
              Expression EQ Expression      |
              Expression '+' Expression     |
              Expression '-' Expression     |
              Expression '*' Expression     |
              Expression '/' Expression     |
              Expression '%' Expression     |
              '&' Expression                |
              Expression '[' Expression ']' |
              Expression '.' Element |
              Expression '=' Expression

Element -> IDENTIFIER                              |
           IDENTIFIER '(' OptionalArgumentList ')' |
           NUMBER                                  |
           NUMBER '.' NUMBER                       |
           STRING                                  |
           SIZEOF '(' Expression ')'               |
           REAL '(' Expression ')'                 |
           '(' Expression ')'                      |
           NIL                                     |
           NEW '(' Type ')'                        |

OptionalArgumentList -> Empty        |
                        ArgumentList

ArgumentList -> Expression                  |
                ArgumentList ',' Expression

LoopStatement -> LOOP '(' Expression ')' OptionalNext CodeBlock

OptionalNext -> NEXT CodeBlock |
                Empty

IfStatement -> IF '(' Expression ')' CodeBlock OptionalElse

OptionalElse -> ELSE CodeBlock |
                Empty

ThrowStatement -> THROW Expression

CatchStatement -> Type IDENTIFIER CATCH CodeBlock

ConstantDeclaration -> CONSTANT IDENTIFIER '=' Expression

EnumDeclaration -> ENUM '[' EnumList ']'

EnumList -> EnumElement ';' |
            EnumList EnumElement ';'

EnumElement -> IDENTIFIER |
               IDENTIFIER '=' Expression

ObjectDeclaration -> OBJECT IDENTIFIER OptionalTypeVars OptionalInherits
                     '[' DeclarationList ']'

OptionalInherits -> Empty |
                    INHERITS FROM IDENTIFIER

DeclarationList -> Declaration ';' |
                   DeclarationList Declaration ';'

Declaration -> VariableDeclaration |
               ConstantDeclaration |
               EnumDeclaration     |
               FunctionDeclaration |
               ObjectDeclaration

Start -> DeclarationList

There are around 184 shift/reduce conflicts in this grammar with an LALR(1) parser. 182 of these should be resolved by using an operator-precedence parsing algorithm, with the following precedence (highest at top-left, lowest at bottom-right):

'*' '/' '%'
'+' '-'
EQ NE GE '>' '<' LE
'&'
'='
'['
'.'

'.' and '&' should be right-associative, and the rest should be left-associative - that is, 1+2+3+4 should parse as (1+(2+(3+4))), but a.b.c.d should parse as (((a.b).c).d).

The remaining 2 conflicts should be resolved in favour of shifting. The first of these conflicts is between real numbers (NUMBER '.' NUMBER, in the Element production) and the object field operator (Expression '.' Expression in the Expression production), as the Element can be reduced to an integer number, which can have the field operator applied to it, or the '.' can be shifted to produce a real number Element. The former behaviour is nonsense from a semantic point of view, so a shift should be performed. The second is somewhat more involved - an IDENTIFIER at the start of a statement can either be a type name (making up a Type, as part of a catch statement or variable declaration) or a variable/function name (making up an Element, as part of an expression). Unfortunately, a type identifier may be followed by a list of type variable values in angular brackets. The open angular bracket '<' is also the 'less than' operator, and so if the IDENTIFIER is reduced to an element, an expression is the result - which will produce a parse error for a type definition like "foo<bar> baz;". If the '<' is shifted as part of a OptionalTypeVars production, the type definitions will be parsed correctly, but the (technically valid, but also purposeless) statement "baz<quux;" will be treated as a parse error without more lookahead or backtracking. As it makes no sense to make a comparason without storing the result anywhere, this conflict should be resolved by shifting the '<' to form a type definition. As most LALR(1) parsers such as yacc or bison resolve shift/reduce conflicts to a shift by default, neither of these two conflicts should pose a problem to implementation of the language.

Both of these conflicts are genuinely ambiguous aspects of the language: Note that foo<bar> baz; could either be the declaration it is obviously meant to be, or the meaningless expression foo<bar>baz;. The first can be resolved with a lexer rule, but an explicit resolution of the second would require a more involved grammar.

Andrew Hunter

Last modified: Mon Jan 8 17:54:13 GMT 2001