This grammar begins with lexical rules. These are written in the
form TOKEN -> /regexp/
. The grammar
continues with rules in BNF, using the notation PRODUCTION
-> rule | rule | rule ...
. A rule consists of a list of
elements - each element may be a production, a token or a
constant character. Comments are lines prefixed
with a hash '#'. Whitespace should seperate tokens, and
otherwise be ignored. The special token 'Empty' is used to
represent an epsilon rule. This grammar is intended to be LALR(1).
# Tokens IDENTIFIER -> /[A-Za-z_][A-Za-z0-9_]*/ INTEGER -> /integer/ BYTE -> /byte/ REAL -> /real/ ARRAY -> /array/ OF -> /of/ REFERENCE -> /reference/ FUNCTION -> /function/ OBJECT -> /object/ CONSTANT -> /constant/ ENUM -> /enum/ EQ -> /==/ NE -> /!=/ LE -> /<=/ GE -> />=/ STATIC -> /static/ PUBLIC -> /public/ PROTECTED -> /protected/ PRIVATE -> /private/ CATCH -> /->/ THROW -> /throw/ LOOP -> /loop/ NEXT -> /next/ IF -> /if/ ELSE -> /else/ INHERITS -> /inherits/ FROM -> /from/ SIZEOF -> /sizeof/ NIL -> /nil/ NEW -> /new/ NUMBER -> /[0-9][0-9]*/ STRING -> /"([^"]|(\"))*"/ COMMENT -> //\/\/.*\n # Grammar Type -> IDENTIFIER OptionalTypeVars | INTEGER | REAL | BYTE | ARRAY '[' Expression ']' OF Type | ARRAY '[' ']' OF Type | REFERENCE '{' FunctionType '}' FunctionType -> Type | FUNCTION OptionalTypeVars '(' ArgumentDeclaration ')' Scope -> STATIC | PUBLIC | PROTECTED | PRIVATE VariableDeclaration -> Scope SimpleVarDeclaration | Scope SimpleVarDeclaration '=' Expression | SimpleVarDeclaration | SimpleVarDeclaration '=' Expression SimpleVarDeclaration -> Type IDENTIFIER FunctionDeclaration -> Scope FUNCTION Type IDENTIFIER OptionalTypeVars '(' ArgumentDeclaration ')' CodeBlock | FUNCTION Type IDENTIFIER OptionalTypeVars '(' ArgumentDeclaration ')' CodeBlock OptionalTypeVars -> '<' TypeVars '>' | Empty TypeVars -> IDENTIFIER | TypeVars ',' IDENTIFIER ArgumentDeclaration -> SimpleVarDeclaration | ArgumentDeclaration ';' SimpleVarDeclaration CodeBlock -> '[' StatementList ']' StatementList -> Statement ';' | StatementList Statement ';' | Empty Statement -> Expression | LoopStatement | IfStatement | CatchStatement | ThrowStatement | CodeBlock | VariableDeclaration | ConstantDeclaration | EnumDeclaration Expression -> Element | Expression LE Expression | Expression '<' Expression | Expression '>' Expression | Expression GE Expression | Expression NE Expression | Expression EQ Expression | Expression '+' Expression | Expression '-' Expression | Expression '*' Expression | Expression '/' Expression | Expression '%' Expression | '&' Expression | Expression '[' Expression ']' | Expression '.' Element | Expression '=' Expression Element -> IDENTIFIER | IDENTIFIER '(' OptionalArgumentList ')' | NUMBER | NUMBER '.' NUMBER | STRING | SIZEOF '(' Expression ')' | REAL '(' Expression ')' | '(' Expression ')' | NIL | NEW '(' Type ')' | OptionalArgumentList -> Empty | ArgumentList ArgumentList -> Expression | ArgumentList ',' Expression LoopStatement -> LOOP '(' Expression ')' OptionalNext CodeBlock OptionalNext -> NEXT CodeBlock | Empty IfStatement -> IF '(' Expression ')' CodeBlock OptionalElse OptionalElse -> ELSE CodeBlock | Empty ThrowStatement -> THROW Expression CatchStatement -> Type IDENTIFIER CATCH CodeBlock ConstantDeclaration -> CONSTANT IDENTIFIER '=' Expression EnumDeclaration -> ENUM '[' EnumList ']' EnumList -> EnumElement ';' | EnumList EnumElement ';' EnumElement -> IDENTIFIER | IDENTIFIER '=' Expression ObjectDeclaration -> OBJECT IDENTIFIER OptionalTypeVars OptionalInherits '[' DeclarationList ']' OptionalInherits -> Empty | INHERITS FROM IDENTIFIER DeclarationList -> Declaration ';' | DeclarationList Declaration ';' Declaration -> VariableDeclaration | ConstantDeclaration | EnumDeclaration | FunctionDeclaration | ObjectDeclaration Start -> DeclarationList
There are around 184 shift/reduce conflicts in this grammar with an LALR(1) parser. 182 of these should be resolved by using an operator-precedence parsing algorithm, with the following precedence (highest at top-left, lowest at bottom-right):
'*' '/' '%' '+' '-' EQ NE GE '>' '<' LE '&' '=' '[' '.'
'.' and '&' should be right-associative, and the rest should be
left-associative - that is, 1+2+3+4
should parse as
(1+(2+(3+4)))
, but a.b.c.d
should
parse as (((a.b).c).d)
.
The remaining 2 conflicts should be resolved in favour of
shifting. The first of these conflicts is between real numbers
(NUMBER '.' NUMBER
, in the Element
production) and the object field operator (Expression '.'
Expression
in the Expression production), as the
Element can be reduced to an integer number, which can have the
field operator applied to it, or the '.' can be shifted to
produce a real number Element. The former behaviour is nonsense
from a semantic point of view, so a shift should be
performed. The second is somewhat more involved - an IDENTIFIER
at the
start of a statement can either be a type name (making up a
Type, as part of a catch statement or variable declaration) or a
variable/function name (making up an Element, as part of
an expression). Unfortunately, a type identifier may be followed
by a list of type variable values in angular brackets. The open
angular bracket '<' is also the 'less than' operator, and so
if the IDENTIFIER is reduced to an element, an expression is the
result - which will produce a parse error for a type definition
like "foo<bar> baz;
". If the '<' is shifted
as part of a OptionalTypeVars production, the type
definitions will be parsed correctly, but the (technically
valid, but also purposeless) statement
"baz<quux;
" will be treated as a parse error
without more lookahead or backtracking. As it makes no sense to
make a comparason without storing the result anywhere, this
conflict should be resolved by shifting the '<' to form a type
definition. As most LALR(1) parsers such as yacc or
bison resolve shift/reduce conflicts to a shift by
default, neither of these two conflicts should pose a problem to
implementation of the language.
Both of these conflicts are genuinely ambiguous aspects of the
language: Note that foo<bar> baz;
could
either be the declaration it is obviously meant to be, or the
meaningless expression foo<bar>baz;
. The
first can be resolved with a lexer rule, but an explicit
resolution of the second would require a more involved grammar.