ANTLR4 C# runtime extremely slow at parsing

I'm migrating my custom DSL from GoldParser to ANTLR4, but I'm stuck at the parsing step because it takes too long to finish. A 1000-line source file takes 34 seconds to parse, versus a few milliseconds with GoldParser.
This is the C# code I use for parsing:
var input = new AntlrInputStream(prg);
var lexer = new PCLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new PCParser(tokens);
var tree = parser.programma(); // root rule is "programma"
I suspect the problem is in the grammar, which has a lot of ambiguities. Those ambiguities are in fact why I decided to migrate from GoldParser: I could not improve the grammar any further there, and I realized it would be much easier to rewrite it in ANTLR4 and not worry about ambiguities.
My question is: is there anything I can do to get parsing times on the order of milliseconds, or is ANTLR4 just inherently slow? I'm new to ANTLR and don't know what to expect.
Regarding the grammar, it's a sort of pseudo-C:
grammar PC;
fragment Number : [0-9] ;
fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment DoubleStringCharacterM : ~["] ;
fragment SingleStringCharacterM : ~['] ;
BlockComment : '/*' .*? '*/' -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
WhiteSpaces : [\t\u000B\u000C\u0020\u00A0]+ -> skip ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
Quote : '\'' ;
DoubleQuote : '"' ;
NullLiteral : 'null' ;
BoolLiteral : 'true' | 'false' ;
IntLiteral : (Number)+ ;
FloatLiteral : (Number)* '.' (Number)+ ;
StringLiteral : DoubleQuote DoubleStringCharacter* DoubleQuote ;
StringLiteralJs : Quote SingleStringCharacter* Quote ;
StringLiteralM : '#' DoubleQuote DoubleStringCharacter* DoubleQuote ;
StringLiteralJsM : '#' Quote SingleStringCharacter* Quote ;
Or_op : 'or' | '||' ;
And_op : 'and' | '&&' ;
Not_op : 'not' | '!' ;
Not_eq : '!=' | '<>' ;
programma : interfaccia? dichiarazione* ;
interfaccia : 'interfaccia' '{' oggettoInterfaccia* '}' ;
oggettoInterfaccia : Identifier Identifier '{' definizioneProprieta* '}' ;
definizioneProprieta : Identifier '=' valoreProprieta ';'
| oggettoInterfaccia;
valoreProprieta : BoolLiteral | IntLiteral | FloatLiteral | StringLiteral | StringLiteralM | Identifier ;
dichiarazione : dichiarazioneReference
| dichiarazioneUsing
| dichiarazioneClass
| dichiarazioneFunzione
| dichiarazioneVariabile
;
dichiarazioneReference : 'reference' StringLiteral ';' ;
dichiarazioneUsing : 'using' Identifier '=' StringLiteral ';' ;
dichiarazioneClass : 'class' Identifier ';' ;
dichiarazioneFunzione : Identifier Identifier '(' parametri ')' '{' stmList '}' ;
parametri : parametro (',' parametro)* ;
parametro : Identifier
| Identifier Identifier
;
dichiarazioneVariabile : Identifier listaVariabili ';' ;
listaVariabili : variabile (',' variabile)* ;
variabile : Identifier
| Identifier '=' exprOrArray
;
stmList : stm* ;
stm : blocco
| dichiarazioneVariabile
| etichetta
| istruzioneIf
| istruzioneWhile
| istruzioneFor
| istruzioneDo
| istruzioneGoto
| istruzioneBreak
| istruzioneContinue
| istruzioneReturn
| expr ';'
| assegnamento ';'
| ';'
| 'ConnectEvent' '(' Identifier ',' Identifier ',' Identifier ')' ';'
| istruzioneTry
;
blocco : '{' stmList '}' ;
istruzioneIf : 'if' '(' expr ')' stm ( 'else' stm )? ;
istruzioneFor : 'for' '(' stm condizioneFor ';' incrementoFor? ')' stm ;
condizioneFor : expr? ;
incrementoFor : expr
| assegnamento
;
istruzioneWhile : 'while' '(' expr ')' stm ;
istruzioneDo : 'do' stm 'while' '(' expr ')' ; // TODO si deve aggiungere ';' ?
etichetta : Identifier ':' ;
istruzioneGoto : 'goto' Identifier ';' ;
istruzioneBreak : 'break' ';' ;
istruzioneContinue : 'continue' ';' ;
istruzioneReturn : 'return' exprOrArray ';' | 'return' ';' ;
istruzioneTry : 'try' blocco 'catch' '(' Identifier ')' blocco ;
assegnamento : Identifier '=' exprOrArray
| Identifier '[' expr ']' '=' exprOrArray
| Identifier '.' Identifier '=' exprOrArray
;
exprOrArray : expr
| '{' exprList '}'
;
exprList : exprOrArray ',' exprList
| exprOrArray
;
expr : expr '+=' expr
| expr '-=' expr
| expr '?' expr ':' expr
| expr Or_op expr
| expr And_op expr
| expr '==' expr
| expr Not_eq expr
| expr '<' expr
| expr '>' expr
| expr '<=' expr
| expr '>=' expr
| expr 'as' Identifier
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '%' expr
| expr Not_op expr
| '-' expr
| '+' expr
| '--' expr
| '++' expr
| expr '--'
| expr '++'
| expr '[' expr ']'
| callFun
| Identifier '.' Identifier '(' methodParams ')'
| Identifier '.' Identifier
| Identifier
| literal
| '(' expr ')'
;
methodParams : methodParam (',' methodParam)* ;
methodParam : exprOrArray ;
callFun : Identifier '(' methodParams ')'
| 'new' Identifier '(' methodParams ')'
;
literal : NullLiteral
| BoolLiteral
| IntLiteral
| FloatLiteral
| StringLiteral
| StringLiteralJs
| StringLiteralM
| StringLiteralJsM
;

If your grammar had ambiguities, the Gold Parser (my understanding: LALR(1)) would not parse source text correctly. [I assume you are ignoring the complaints it should produce about shift-reduce and reduce-reduce conflicts?] It would pick one of the parses. And, being LALR(1), it will do so in linear time, so it is no surprise that it is fast; this is a key utility of LALR(1) parsers.
Ambiguity in a grammar often (not always) means that there are parses you should have eliminated, but did not. If Gold is picking among parses, and some are wrong, there is no reason to believe you are getting a correct parse.
So, in fact, if you can get the wrong answer in milliseconds with Gold, why does it matter if ANTLR gets the wrong answer somewhat more slowly?
I suggest you remove the ambiguities. (As a starting place, your expression subgrammar looks highly ambiguous to me). I think ANTLR will "speed up".
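To give a feel for how ambiguous a rule of the shape expr : expr op expr | atom is when read as a plain context-free grammar, the following Python sketch counts the distinct parse trees for a chain of operands. It is an illustration only: count_parse_trees is a hypothetical helper, and the single generic binary operator is an assumption. It counts trees for the bare rule shape; how a given tool (Gold or ANTLR) chooses among them is a separate matter.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parse_trees(operands):
    """Number of parse trees for `operands` atoms joined by binary operators
    under the rule shape  expr : expr op expr | atom ."""
    if operands == 1:
        return 1  # a single atom has exactly one tree
    # Pick the operator that becomes the root: k atoms go to the left
    # subtree and the remaining atoms go to the right subtree.
    return sum(
        count_parse_trees(k) * count_parse_trees(operands - k)
        for k in range(1, operands)
    )


for n in (2, 3, 4, 10):
    print(n, count_parse_trees(n))
```

Even a - b - c has two trees, (a - b) - c and a - (b - c), and they evaluate differently, which is why picking "one of the parses" is not the same as picking the right one.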

Related

antlr getting only few symbols

I have this grammar:
grammar Hello;
oclFile : ( 'package' packageName
oclExpressions
'endpackage'
)+;
packageName : pathName;
oclExpressions : ( constraint )*;
constraint : contextDeclaration ( Stereotype '#' number name? ':' oclExpression)+;
contextDeclaration : 'context' ( operationContext | classifierContext );
classifierContext : ( name ':' name ) | name;
operationContext : name '::' operationName '(' formalParameterList ')' ( ':' returnType )?;
Stereotype : ( 'pre' | 'post' | 'inv' );
operationName : name | '=' | '+' | '-' | '<' | '<=' | '>=' | '>' | '/' | '*' | '<>' | 'implies' | 'not' | 'or' | 'xor' | 'and';
formalParameterList : ( name ':' typeSpecifier (',' name ':' typeSpecifier )*)?;
typeSpecifier : simpleTypeSpecifier | collectionType;
collectionType : collectionKind '(' simpleTypeSpecifier ')';
oclExpression : ( letExpression )* expression;
returnType : typeSpecifier;
expression : logicalExpression;
letExpression : 'let' name ( '(' formalParameterList ')' )? ( ':' typeSpecifier )? '=' expression ';';
ifExpression : 'if' expression 'then' expression 'else' expression 'endif';
logicalExpression : relationalExpression ( logicalOperator relationalExpression)*;
relationalExpression : additiveExpression (relationalOperator additiveExpression)?;
additiveExpression : multiplicativeExpression ( addOperator multiplicativeExpression)*;
multiplicativeExpression : unaryExpression ( multiplyOperator unaryExpression)*;
unaryExpression : ( unaryOperator postfixExpression) | postfixExpression;
postfixExpression : primaryExpression ( ('.' | '->')propertyCall )*;
primaryExpression : literalCollection | literal | propertyCall | '(' expression ')' | ifExpression;
propertyCallParameters : '(' ( declarator )? ( actualParameterList )? ')';
literal : number | enumLiteral;
enumLiteral : name '::' name ( '::' name )*;
simpleTypeSpecifier : pathName;
literalCollection : collectionKind '{' ( collectionItem (',' collectionItem )*)? '}';
collectionItem : expression ('..' expression )?;
propertyCall : pathName ('#' number)? ( timeExpression )? ( qualifiers )? ( propertyCallParameters )?;
qualifiers : '[' actualParameterList ']';
declarator : name ( ',' name )* ( ':' simpleTypeSpecifier )? ( ';' name ':' typeSpecifier '=' expression )? '|';
pathName : name ( '::' name )*;
timeExpression : '#' 'pre';
actualParameterList : expression (',' expression)*;
logicalOperator : 'and' | 'or' | 'xor' | 'implies';
collectionKind : 'Set' | 'Bag' | 'Sequence' | 'Collection';
relationalOperator : '=' | '>' | '<' | '>=' | '<=' | '<>';
addOperator : '+' | '-';
multiplyOperator : '*' | '/';
unaryOperator : '-' | 'not';
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
DIGITS : '0'..'9' ;
name : (LOWERCASE | UPPERCASE | '_') ( LOWERCASE | UPPERCASE | DIGITS | '_' )* ;
number : DIGITS (DIGITS)* ( '.' DIGITS (DIGITS)* )?;
WS : [ \t\r\n]+ -> skip ;
and this input:
package RobotsTestModel
context motor
inv#0:
self.power = 100
endpackage
but I get only "mot" in the variable text:
public override bool VisitConstraint([NotNull] HelloParser.ConstraintContext context)
{
VisitContextDeclaration(context.contextDeclaration());
string text = context.contextDeclaration().classifierContext().GetText();
It's also strange that it sometimes retrieves "motoo" when I change the input, though not consistently. It reports as if it expects Stereotype instead of "or", the last part of "motor". Where should I look to fix this?
Your grammar produces the tokens m, o, t and or from the input motor, because name is a parser rule, not a lexer rule, so it plays no part in token recognition. The name rule matches only the m, o and t tokens; it does not expect the or token, which is why you see these strange results.
You should make lexer rules for names and numbers instead of parser ones:
NAME : [a-zA-Z_]+ [a-zA-Z0-9_]* ;
NUMBER : [0-9]+ ('.' [0-9]+)? ;
So motor will be a single token.
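To see why motor splits this way, here is a minimal, language-neutral sketch in Python of the longest-match rule a lexer applies. The tokenize helper and the rule tables are hypothetical illustrations, not ANTLR's API or generated code. The sketch assumes the literal 'or' becomes an implicit token listed before the other rules, and it models only the few rules needed for this input.

```python
import re


def tokenize(text, rules):
    """Split text using longest match; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer matches win, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Original grammar: the only letter-level lexer rule matches one character,
# alongside the implicit token created by the literal 'or'.
ORIGINAL_RULES = [
    ("'or'", r"or"),
    ("LOWERCASE", r"[a-z]"),
]

# Suggested fix: NAME is a lexer rule, so a whole identifier is one token.
FIXED_RULES = [
    ("'or'", r"or"),
    ("NAME", r"[a-zA-Z_]+[a-zA-Z0-9_]*"),
]

original_tokens = tokenize("motor", ORIGINAL_RULES)
fixed_tokens = tokenize("motor", FIXED_RULES)
print(original_tokens)
print(fixed_tokens)
```

With the original rules the parser rule name only ever sees single-letter tokens followed by an unexpected 'or' token; with a NAME lexer rule the longest match covers the whole word.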

Possible generation bug for C# in Antlr?

Using ANTLR 4.3 and this grammar:
http://www.harward.us/~nharward/antlr/OracleNetServicesV3.g
the following *Lexer.cs code is generated for C#:
private void WHITESPACE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 1: skip(); break;
}
}
private void NEWLINE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 2: skip(); break;
}
}
private void COMMENT_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 0: skip(); break;
}
}
But the method skip() in the runtime is defined as:
public virtual void Skip()
Which of course gives a compilation error.
The same skip() method is generated with Antlr 3.5.2 as well.
Is this a bug, or am I doing something wrong?
You can easily make this v4-compatible and language-independent by using the lexer command
-> channel(HIDDEN)
Here is an updated grammar that implements this change:
configuration_file
: ( parameter )*
;
parameter
: keyword EQUALS ( value
| LEFT_PAREN value_list RIGHT_PAREN
| ( LEFT_PAREN parameter RIGHT_PAREN )+
)
;
keyword
: WORD
;
value
: WORD
| QUOTED_STRING
;
value_list
: value ( COMMA value )*
;
QUOTED_STRING
: SINGLE_QUOTE ~'\''* SINGLE_QUOTE
| DOUBLE_QUOTE ~'"'* DOUBLE_QUOTE
;
WORD
: ( 'A' .. 'Z'
| 'a' .. 'z'
| '0' .. '9'
| '<'
| '>'
| '/'
| '.'
| ':'
| ';'
| '-'
| '_'
| '$'
| '+'
| '*'
| '&'
| '!'
| '%'
| '?'
| '#'
| '\\' .
)+
;
LEFT_PAREN
: '('
;
RIGHT_PAREN
: ')'
;
EQUALS
: '='
;
COMMA
: ','
;
SINGLE_QUOTE
: '\''
;
DOUBLE_QUOTE
: '"'
;
COMMENT
: '#' ( ~( '\n' ) )* -> channel(HIDDEN)
;
WHITESPACE
: ( '\t'
| ' '
) -> channel(HIDDEN)
;
NEWLINE
: ( '\r' )? '\n' -> channel(HIDDEN)
;
As written in my comment, it's because of the skip() in the grammar file, which is Java-dependent.
So there is no bug in Antlr. :)

ANTLR4 - How do I find out where in the input program ANTLR cannot make a deterministic decision

I have written a grammar in ANTLR 4 and wanted to test it by parsing an input program, but it threw a ViableException during parsing. Now I wonder whether the grammar is non-deterministic or just badly written for ANTLR 4, and how to fix it. Is there any helpful tool for optimizing and validating LL(*) grammars? I have found only some default tools, which either did not discover any error or did not validate the grammar correctly. ANTLR reports the error at line 1, quoting line 1 and part of line 2 (two standalone commands), and I do not know whether the error is exactly there or could be anywhere deeper in the program.
Error message caught at line 1 by my SyntaxError listener, based on ANTLR's BaseErrorListener:
2015-05-02 14:31:45 [ERROR] [Templates\Java\DocBook\Plan Examples\One Document.PeKLXT]: Exception (Utils.ParsingException) was thrown... Message: Parsing failed... First Error at: no viable alternative at input '##import: "XDoc_Generator.dll";\t//XDocProject dll\n\n\n\n##registerVar: XDoc_Generator.XDocProject ' errors: 1
Caused by: v PeKLXT.NET.Engine.parse(String inputProgramFile) v c:\Users\petrk_000\Documents\Visual Studio 2012\Projects\Argutec_XDoc\PeKLXT.NET\Engine.cs:řádek 51
As you can see, ANTLR put only the first rule and part of the second rule into the error message. So is the error in this context, or can it be somewhere deeper in the program?
'##import: "XDoc_Generator.dll";\t//XDocProject dll
\n\n\n\n##registerVar: XDoc_Generator.XDocProject '
The whole code for these 2 rules:
##import: "XDoc_Generator.dll"; //XDocProject dll
##registerVar: XDoc_Generator.XDocProject #project = (XDocProject)(#this);
And here is the grammar for these rules. I have checked it for determinism, including with an online tool, but I have not found a mismatch:
PeKLXT.g4:
compileUnit
: (LINECOMMENT | MULTILINECOMMENT)* importCommand*(registerGlobalVariableCommand LINECOMMENT?)* outerBodyCommand* EOF #PeKLXTProgram
| (LINECOMMENT | MULTILINECOMMENT)* importCommand* (MULTILINECOMMENT)? functionDeclaration)+ EOF #FunctionCodeFile
;
Function:
functionDeclaration
: '##function' ':' InnerIdentifier '(' parameters? ')' innerBody ';'?
;
OuterBody command: each command starts with a different keyword!
outerBodyCommand
: createCommand
| outerOpenCommand
| outerIfCommand
| outerLoopCommand
| setVariableCommand
;
Other used commands and expressions for this context:
importCommand
: '##import' ':' stringExpression ';' LINECOMMENT?
;
registerGlobalVariableCommand
: '##registerVar' ':' type emptyArrayDimensions? identifier ('=' assignmentExpression)? ';' //local variable starts with ##
| '##registerVar' ':' '$xdocument' '=' stringExpression ';'
| '##registerVar' ':' stat='$$$static' '=' dolarVariableExpression ';'
;
emptyArrayDimensions
: emptyIndexerUnit+
;
emptyIndexerUnit
: '[' (',')* ']'
;
assignmentExpression
: conditionExpression
;
String expressions:
stringExpression
: stringForm (AdditiveOp stringForm)*
;
stringForm
: unaryExpression
;
stringMethod
: StringMethodName '(' parameterValues? ')'
;
Unary expression (the basic expression unit; it is equivalent to a condition expression when there are no operators between two or more expressions):
unaryExpression
: strictlyUnaryExpression
| op='+' strictlyUnaryExpression
| op='-' strictlyUnaryExpression
| op='!' strictlyUnaryExpression
| '(' unaryExpression ')'
;
strictlyUnaryExpression
: literal('->' stringMethod)*
| objectCall('->' stringMethod)*
| castExpression ('->' stringMethod)*
| dolarVariableExpression ('->' stringMethod)*
;
literal
: StringLiteral
| IntegerLiteral
| DecimalFloatingPointLiteral
| BooleanLiteral
| NullLiteral
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
Dollar Variable Expression:
dolarVariableExpression
: dolarVariableContext
| '$$$tmp' ':' innerSelect
;
innerSelect
: '##innerSelect' ':' (first='first'|last='last')? '(' innerSelect ')' ('where' '(' conditionExpression ')')?
| '##innerSelect' ':' (first='first'|last='last')? '(' localDolarIdentifier ':' dolarVariableContext ')' ('where' '(' conditionExpression ')')?
;
Types and identifiers:
//types
type
: PrimitiveType
| typeIdentifier generics?
;
//string [,,] neco = new string [2, 2,4] {{{"5", "5", "5", "5"}, {"5", "5", "5", "5"}}, {{"5", "5", "5", "5"}, {"5", "5", "5", "5"}}};
emptyArrayDimensions
: emptyIndexerUnit+
;
generics
: '<' type (',' type)* '>'
;
//identifiers
//complexIdentifier
//: identifier ('.' InnerIdentifier)*
//;
identifier
: '#' InnerIdentifier #IdentifierGlobal
| localIdentifier #IdentifierLocal
| '###' InnerIdentifier #IdentifierStatic
;
localIdentifier
: '##' InnerIdentifier
;
typeIdentifier
: InnerIdentifier ('.' InnerIdentifier)*
;
dolarVariableContext
: dolarPartCall ('.' dolarPartCall)* (InnerIdentifier)?
//| '$$' Pointer ('.' dolarPartCall)* (InnerIdentifier)?
| localDolarIdentifier (indexerGet)? ('.' dolarPartCall)* (InnerIdentifier)?
;
localDolarIdentifier
: '$$' InnerIdentifier
;
dolarPartCall
: '$' InnerIdentifier (indexerGet)?
| '$' Pointer (indexerGet)?
;
fragment InnerIdentifier
: IdentifierStart IdentifierPart*
;
IdentifierStart : [_a-zA-Z]+;
IdentifierPart : [_a-zA-Z0-9]+;
Characters to skip:
WS : [ \t\r\n\u000C]+ -> skip
;
MULTILINECOMMENT
: '/*' .*? '*/'
;
LINECOMMENT
: '//' ~[\r\n]*
;
COMMENT
: '<!--' (.)*? '-->' -> skip
;
Thanks a lot for help!

ANTLR Grammar and generated code problems

I'm trying to create an expression parser using ANTLR.
The expression will go inside an if statement, so its root is a condition.
I have the following grammar, which "compiles" to parser/lexer files with no problems. However, the generated code itself has some errors: essentially two "empty" if statements,
i.e.
if (())
I'm not sure what I'm doing wrong; any help will be greatly appreciated.
Thanks.
Grammar .g file below:
grammar Expression;
options {
language=CSharp3;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
@parser::namespace { Antlr3 }
@lexer::namespace { Antlr3 }
public parse
: orcond EOF -> ^(ROOT orcond)
;
orcond
: andcond ('||' andcond)*
;
andcond
: condition ('&&' condition)*
;
condition
: exp (('<' | '>' | '==' | '!=' | '<=' | '>=')^ exp)?
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' parenthesisvalid ')' -> parenthesisvalid
;
parenthesisvalid
: fullobjectref
| orcond
;
fullobjectref
: objectref ('.' objectref)?
;
objectref
: objectname ('()' | '(' params ')' | '[' params ']')?
;
objectname
: (('a'..'z') | ('A'..'Z'))^ (('a'..'z') | ('A'..'Z') | ('0'..'9') | '_')*
;
params
: paramitem (',' paramitem)?
;
paramitem
: unaryExp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
Don't use the range operator, .., inside parser rules.
Remove the parser rule objectname and create the lexer rule:
Objectname
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;
Additionally, it is neater to use fragment rules. In that case your grammar would look something like this:
Objectname : LETTER (LETTER | DIGIT | '_')*;
Number: DIGIT+ ('.' DIGIT+)?;
fragment DIGIT : '0'..'9' ;
fragment LETTER : ('a'..'z' | 'A'..'Z');

ANTLR parser and tree grammars for one simple language

Edit:
Here are the updated tree and parser grammars:
parser grammar:
options {
language = CSharp2;
output=AST;
}
tokens {
UNARY_MINUS;
CALL;
}
program : (function)* main_function
;
function: 'function' IDENTIFIER '(' (parameter (',' parameter)*)? ')' 'returns' TYPE declaration* statement* 'end' 'function'
-> ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
;
main_function
: 'function' 'main' '(' ')' 'returns' TYPE declaration* statement* 'end' 'function'
-> ^('function' 'main' TYPE declaration* statement*)
;
parameter
: 'param' IDENTIFIER ':' TYPE
-> ^('param' IDENTIFIER TYPE)
;
declaration
: 'variable' IDENTIFIER ( ',' IDENTIFIER)* ':' TYPE ';'
-> ^('variable' TYPE IDENTIFIER+ )
| 'array' array ':' TYPE ';'
-> ^('array' array TYPE)
;
statement
: ';'! | block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement
;
call_statement
: call ';'!
;
return_statement
: 'return' expression ';'
-> ^('return' expression)
;
block : 'begin' declaration* statement* 'end'
-> ^('begin' declaration* statement*)
| '{' declaration* statement* '}'
-> ^('{' declaration* statement*)
;
assignment
: IDENTIFIER ':=' expression ';'
-> ^(':=' IDENTIFIER expression )
| array ':=' expression ';'
-> ^(':=' array expression)
;
array : IDENTIFIER '[' expression (',' expression)* ']'
-> ^(IDENTIFIER expression+)
;
if_statement
: 'if' '(' expression ')' 'then' statement ('else' statement)? 'end' 'if'
-> ^('if' expression statement statement?)
;
switch_statement
: 'switch' '(' expression ')' case_part+ ('default' ':' statement)? 'end' 'switch'
-> ^('switch' expression case_part+ statement?)
;
case_part
: 'case' literal (',' literal)* ':' statement
-> ^('case' literal+ statement)
;
literal
: INTEGER | FLOAT | BOOLEAN | STRING
;
while_do_statement
: 'while' '(' expression ')' 'do' statement 'end' 'while'
-> ^('while' expression statement)
;
for_statement
: 'for' '(' IDENTIFIER ':=' expression 'to' expression ')' 'do' statement 'end' 'for'
-> ^('for' IDENTIFIER expression expression statement)
;
expression
: conjuction ( 'or'^ conjuction)*
;
conjuction
: equality ('and'^ equality)*
;
equality: relation (('=' | '/=')^ relation)?
;
relation: addition (('<' | '<=' | '>' | '>=')^ addition)?
;
addition: multiplication (('+' | '-')^ multiplication)*
;
multiplication
: unary_operation (('*' | '/' | '%')^ unary_operation)*
;
unary_operation
: '-' primary
-> ^(UNARY_MINUS primary)
| 'not' primary
-> ^('not' primary)
| primary
;
primary : IDENTIFIER
| array
| literal
| '('! expression ')'!
| '(' TYPE ')' '(' expression ')'
-> ^(TYPE expression)
| call
;
call : IDENTIFIER '(' arguments ')'
-> ^(CALL IDENTIFIER arguments)
;
arguments
: (expression (','! expression)*)?
;
BOOLEAN : 'true' | 'false'
;
TYPE : 'integer' | 'boolean' | 'float' | 'string' | 'array' | 'void'
;
IDENTIFIER : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INTEGER : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')+
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' .* '"'
;
And here is the updated tree grammar (I altered expressions, and so on...):
options {
language = 'CSharp2';
//tokenVocab= token vocab needed
ASTLabelType=CommonTree; // what is Java type of nodes?
}
program : (function)* main_function
;
function: ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
;
main_function
: ^('function' 'main' TYPE declaration* statement*)
;
parameter
: ^('param' IDENTIFIER TYPE)
;
declaration
: ^('variable' TYPE IDENTIFIER+)
| ^('array' array TYPE )
;
statement
: block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement
;
call_statement
: call
;
return_statement
: ^('return' expression)
;
block : ^('begin' declaration* statement*)
| ^('{' declaration* statement*)
;
assignment
: ^(':=' IDENTIFIER expression )
| ^(':=' array expression)
;
array : ^(IDENTIFIER expression+)
;
if_statement
: ^('if' expression statement statement?)
;
switch_statement
: ^('switch' expression case_part+ statement?)
;
case_part
: ^('case' literal+ statement)
;
literal
: INTEGER | FLOAT | BOOLEAN | STRING
;
while_do_statement
: ^('while' expression statement)
;
for_statement
: ^('for' IDENTIFIER expression expression statement)
;
expression
: ^('or' expression expression)
| ^('and' expression expression)
| ^('=' expression expression)
| ^('/=' expression expression)
| ^('<' expression expression)
| ^('<=' expression expression)
| ^('>' expression expression)
| ^('>=' expression expression)
| ^('+' expression expression)
| ^('-' expression expression)
| ^(UNARY_MINUS expression)
| ^('not' expression)
| IDENTIFIER
| array
| literal
| ^(TYPE expression)
| call
;
call : ^(CALL IDENTIFIER arguments)
;
arguments
: (expression (expression)*)?
;
I successfully generated a tree graph with the DOTTreeGenerator and StringTemplate classes, so it seems that everything is working at the moment. But any suggestions (about bad habits or anything else in these grammars) are appreciated, since I don't have a lot of experience with ANTLR or language recognition.
See updates on http://vladimir-radojicic.blogspot.com
The only thing I was going to suggest, besides introducing imaginary tokens to make sure your tree grammar produces a "unique AST" and simplifying the expression in the tree-grammar, which you both already did (again: well done!), is that you shouldn't use literal tokens inside your parser grammar. Especially not when they can possibly be matched by other lexer rule(s). For example, all your reserved words (like for, while, end, etc.) can also be matched by the lexer rule IDENTIFIER. It's better to create explicit tokens inside the lexer (and put these rules before the IDENTIFIER rule!):
...
FOR : 'for';
WHILE : 'while';
END : 'end';
...
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
...
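The effect of rule order can be sketched in Python. The classify helper is hypothetical and is not ANTLR code. It assumes the usual tie-break in which, when two lexer rules match the same text, the rule listed first wins, and it models only FOR and IDENTIFIER.

```python
import re


def classify(word, rules):
    """Return the name of the first rule that matches the whole word."""
    for name, pattern in rules:
        if re.fullmatch(pattern, word):
            return name
    return None


IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*")
FOR = ("FOR", r"for")

keywords_first = [FOR, IDENTIFIER]    # keyword rule placed before IDENTIFIER
identifier_first = [IDENTIFIER, FOR]  # keyword rule placed after IDENTIFIER

print(classify("for", keywords_first))    # keyword rule gets the first chance
print(classify("for", identifier_first))  # IDENTIFIER shadows the keyword
print(classify("fort", keywords_first))   # only IDENTIFIER matches the whole word
```

With the keyword rule listed first, for is recognised as FOR while longer words such as fort still fall through to IDENTIFIER.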
Ideally, the tree grammar does not contain any quoted tokens. AFAIK, you can't import grammar X inside a grammar Y properly: the literal tokens inside grammar X are then not available in grammar Y. And when you split your combined grammar into a parser grammar and a lexer grammar, these literal tokens are not allowed. With small grammars like yours, these last remarks are of no concern (and you could leave your grammar "as is"), but remember them when you create larger grammars.
Best of luck!
EDIT
Imaginary tokens are not only handy when there's no real token that can be made the root of the tree. The way I look at imaginary tokens is that they make your tree "unique", so that the tree grammar can only "walk" your tree in one possible way. Take subtraction and unary minus, for example. If you hadn't created an imaginary token called UNARY_MINUS, but had simply done this:
unary_operation
: '-' primary -> ^('-' primary)
| 'not' primary -> ^('not' primary)
| primary
;
then you'd have something like this in your tree grammar:
expression
: ^('-' expression expression)
| ...
| ^('-' expression)
| ...
;
Now both subtraction and unary minus start with the same tokens, which the tree grammar does not like! It's easy to see with this - (minus) example, but there can be quite a few tricky cases (even with small grammars like yours!) that are not so obvious. So, always let the parser create "unique trees" while rewriting to ASTs.
Hope that clarifies it (a bit).
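As a rough illustration of why a unique root token helps, here is a small Python sketch of a tree walker over tuple-shaped ASTs. The tuple encoding and the evaluate function are assumptions made for this example, not ANTLR's tree API. Because unary minus has its own UNARY_MINUS root, the walker can pick the right case from the root token alone.

```python
def evaluate(node):
    """Evaluate an AST given as nested tuples: (root_token, child, ...)."""
    if isinstance(node, (int, float)):
        return node  # leaf: a numeric literal
    root, *children = node
    if root == "UNARY_MINUS":
        (operand,) = children  # exactly one child by construction
        return -evaluate(operand)
    if root == "-":
        left, right = children  # subtraction always has two children
        return evaluate(left) - evaluate(right)
    if root == "+":
        left, right = children
        return evaluate(left) + evaluate(right)
    raise ValueError(f"unknown root token: {root!r}")


# 5 - (-3), with the unary minus rewritten to the imaginary token.
tree = ("-", 5, ("UNARY_MINUS", 3))
result = evaluate(tree)
print(result)
```

If both operations shared the '-' root, the walker would have to inspect the number of children before it could decide which case applies.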
