This grammar begins with lexical rules. These are written in the
form TOKEN -> /regexp/. The grammar
continues with rules in BNF, using the notation PRODUCTION
-> rule | rule | rule .... A rule consists of a list of
elements - each element may be a production, a token or a
constant character. Comments are lines prefixed
with a hash '#'. Whitespace should seperate tokens, and
otherwise be ignored. The special token 'Empty' is used to
represent an epsilon rule. This grammar is intended to be LALR(1).
# Tokens
IDENTIFIER -> /[A-Za-z_][A-Za-z0-9_]*/
INTEGER -> /integer/
BYTE -> /byte/
REAL -> /real/
ARRAY -> /array/
OF -> /of/
REFERENCE -> /reference/
FUNCTION -> /function/
OBJECT -> /object/
CONSTANT -> /constant/
ENUM -> /enum/
EQ -> /==/
NE -> /!=/
LE -> /<=/
GE -> />=/
STATIC -> /static/
PUBLIC -> /public/
PROTECTED -> /protected/
PRIVATE -> /private/
CATCH -> /->/
THROW -> /throw/
LOOP -> /loop/
NEXT -> /next/
IF -> /if/
ELSE -> /else/
INHERITS -> /inherits/
FROM -> /from/
SIZEOF -> /sizeof/
NIL -> /nil/
NEW -> /new/
NUMBER -> /[0-9][0-9]*/
STRING -> /"([^"]|(\"))*"/
COMMENT -> //\/\/.*\n
# Grammar
Type -> IDENTIFIER OptionalTypeVars |
INTEGER |
REAL |
BYTE |
ARRAY '[' Expression ']' OF Type |
ARRAY '[' ']' OF Type |
REFERENCE '{' FunctionType '}'
FunctionType -> Type |
FUNCTION OptionalTypeVars '(' ArgumentDeclaration ')'
Scope -> STATIC |
PUBLIC |
PROTECTED |
PRIVATE
VariableDeclaration -> Scope SimpleVarDeclaration |
Scope SimpleVarDeclaration '=' Expression |
SimpleVarDeclaration |
SimpleVarDeclaration '=' Expression
SimpleVarDeclaration -> Type IDENTIFIER
FunctionDeclaration -> Scope FUNCTION Type IDENTIFIER OptionalTypeVars
'(' ArgumentDeclaration ')' CodeBlock |
FUNCTION Type IDENTIFIER OptionalTypeVars '('
ArgumentDeclaration ')' CodeBlock
OptionalTypeVars -> '<' TypeVars '>' | Empty
TypeVars -> IDENTIFIER | TypeVars ',' IDENTIFIER
ArgumentDeclaration -> SimpleVarDeclaration |
ArgumentDeclaration ';' SimpleVarDeclaration
CodeBlock -> '[' StatementList ']'
StatementList -> Statement ';' |
StatementList Statement ';' |
Empty
Statement -> Expression |
LoopStatement |
IfStatement |
CatchStatement |
ThrowStatement |
CodeBlock |
VariableDeclaration |
ConstantDeclaration |
EnumDeclaration
Expression -> Element |
Expression LE Expression |
Expression '<' Expression |
Expression '>' Expression |
Expression GE Expression |
Expression NE Expression |
Expression EQ Expression |
Expression '+' Expression |
Expression '-' Expression |
Expression '*' Expression |
Expression '/' Expression |
Expression '%' Expression |
'&' Expression |
Expression '[' Expression ']' |
Expression '.' Element |
Expression '=' Expression
Element -> IDENTIFIER |
IDENTIFIER '(' OptionalArgumentList ')' |
NUMBER |
NUMBER '.' NUMBER |
STRING |
SIZEOF '(' Expression ')' |
REAL '(' Expression ')' |
'(' Expression ')' |
NIL |
NEW '(' Type ')' |
OptionalArgumentList -> Empty |
ArgumentList
ArgumentList -> Expression |
ArgumentList ',' Expression
LoopStatement -> LOOP '(' Expression ')' OptionalNext CodeBlock
OptionalNext -> NEXT CodeBlock |
Empty
IfStatement -> IF '(' Expression ')' CodeBlock OptionalElse
OptionalElse -> ELSE CodeBlock |
Empty
ThrowStatement -> THROW Expression
CatchStatement -> Type IDENTIFIER CATCH CodeBlock
ConstantDeclaration -> CONSTANT IDENTIFIER '=' Expression
EnumDeclaration -> ENUM '[' EnumList ']'
EnumList -> EnumElement ';' |
EnumList EnumElement ';'
EnumElement -> IDENTIFIER |
IDENTIFIER '=' Expression
ObjectDeclaration -> OBJECT IDENTIFIER OptionalTypeVars OptionalInherits
'[' DeclarationList ']'
OptionalInherits -> Empty |
INHERITS FROM IDENTIFIER
DeclarationList -> Declaration ';' |
DeclarationList Declaration ';'
Declaration -> VariableDeclaration |
ConstantDeclaration |
EnumDeclaration |
FunctionDeclaration |
ObjectDeclaration
Start -> DeclarationList
There are around 184 shift/reduce conflicts in this grammar with an LALR(1) parser. 182 of these should be resolved by using an operator-precedence parsing algorithm, with the following precedence (highest at top-left, lowest at bottom-right):
'*' '/' '%' '+' '-' EQ NE GE '>' '<' LE '&' '=' '[' '.'
'.' and '&' should be right-associative, and the rest should be
left-associative - that is, 1+2+3+4 should parse as
(1+(2+(3+4))), but a.b.c.d should
parse as (((a.b).c).d).
The remaining 2 conflicts should be resolved in favour of
shifting. The first of these conflicts is between real numbers
(NUMBER '.' NUMBER, in the Element
production) and the object field operator (Expression '.'
Expression in the Expression production), as the
Element can be reduced to an integer number, which can have the
field operator applied to it, or the '.' can be shifted to
produce a real number Element. The former behaviour is nonsense
from a semantic point of view, so a shift should be
performed. The second is somewhat more involved - an IDENTIFIER at the
start of a statement can either be a type name (making up a
Type, as part of a catch statement or variable declaration) or a
variable/function name (making up an Element, as part of
an expression). Unfortunately, a type identifier may be followed
by a list of type variable values in angular brackets. The open
angular bracket '<' is also the 'less than' operator, and so
if the IDENTIFIER is reduced to an element, an expression is the
result - which will produce a parse error for a type definition
like "foo<bar> baz;". If the '<' is shifted
as part of a OptionalTypeVars production, the type
definitions will be parsed correctly, but the (technically
valid, but also purposeless) statement
"baz<quux;" will be treated as a parse error
without more lookahead or backtracking. As it makes no sense to
make a comparason without storing the result anywhere, this
conflict should be resolved by shifting the '<' to form a type
definition. As most LALR(1) parsers such as yacc or
bison resolve shift/reduce conflicts to a shift by
default, neither of these two conflicts should pose a problem to
implementation of the language.
Both of these conflicts are genuinely ambiguous aspects of the
language: Note that foo<bar> baz; could
either be the declaration it is obviously meant to be, or the
meaningless expression foo<bar>baz;. The
first can be resolved with a lexer rule, but an explicit
resolution of the second would require a more involved grammar.