I'm trying to create an expression parser using ANTLR.
The expression will go inside an if statement, so its root is a condition.
I have the following grammar, which "compiles" to parser/lexer files with no problems. However, the generated code itself has some errors: essentially two "empty" if statements,
i.e.
if (())
I'm not sure what I'm doing wrong; any help would be greatly appreciated.
Thanks.
Grammar .g file below:
grammar Expression;
options {
language=CSharp3;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
@parser::namespace { Antlr3 }
@lexer::namespace { Antlr3 }
public parse
: orcond EOF -> ^(ROOT orcond)
;
orcond
: andcond ('||' andcond)*
;
andcond
: condition ('&&' condition)*
;
condition
: exp (('<' | '>' | '==' | '!=' | '<=' | '>=')^ exp)?
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' parenthesisvalid ')' -> parenthesisvalid
;
parenthesisvalid
: fullobjectref
| orcond
;
fullobjectref
: objectref ('.' objectref)?
;
objectref
: objectname ('()' | '(' params ')' | '[' params ']')?
;
objectname
: (('a'..'z') | ('A'..'Z'))^ (('a'..'z') | ('A'..'Z') | ('0'..'9') | '_')*
;
params
: paramitem (',' paramitem)?
;
paramitem
: unaryExp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
Don't use the range operator, .., inside parser rules.
Remove the parser rule objectname and create the lexer rule:
Objectname
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;
Additionally, it is neater to use fragment rules. In that case your grammar would look something like this:
Objectname : LETTER (LETTER | DIGIT | '_')*;
Number: DIGIT+ ('.' DIGIT+)?;
fragment DIGIT : '0'..'9' ;
fragment LETTER : ('a'..'z' | 'A'..'Z');
I have the following grammar:
grammar Hello;
oclFile : ( 'package' packageName
oclExpressions
'endpackage'
)+;
packageName : pathName;
oclExpressions : ( constraint )*;
constraint : contextDeclaration ( Stereotype '#' number name? ':' oclExpression)+;
contextDeclaration : 'context' ( operationContext | classifierContext );
classifierContext : ( name ':' name ) | name;
operationContext : name '::' operationName '(' formalParameterList ')' ( ':' returnType )?;
Stereotype : ( 'pre' | 'post' | 'inv' );
operationName : name | '=' | '+' | '-' | '<' | '<=' | '>=' | '>' | '/' | '*' | '<>' | 'implies' | 'not' | 'or' | 'xor' | 'and';
formalParameterList : ( name ':' typeSpecifier (',' name ':' typeSpecifier )*)?;
typeSpecifier : simpleTypeSpecifier | collectionType;
collectionType : collectionKind '(' simpleTypeSpecifier ')';
oclExpression : ( letExpression )* expression;
returnType : typeSpecifier;
expression : logicalExpression;
letExpression : 'let' name ( '(' formalParameterList ')' )? ( ':' typeSpecifier )? '=' expression ';';
ifExpression : 'if' expression 'then' expression 'else' expression 'endif';
logicalExpression : relationalExpression ( logicalOperator relationalExpression)*;
relationalExpression : additiveExpression (relationalOperator additiveExpression)?;
additiveExpression : multiplicativeExpression ( addOperator multiplicativeExpression)*;
multiplicativeExpression : unaryExpression ( multiplyOperator unaryExpression)*;
unaryExpression : ( unaryOperator postfixExpression) | postfixExpression;
postfixExpression : primaryExpression ( ('.' | '->')propertyCall )*;
primaryExpression : literalCollection | literal | propertyCall | '(' expression ')' | ifExpression;
propertyCallParameters : '(' ( declarator )? ( actualParameterList )? ')';
literal : number | enumLiteral;
enumLiteral : name '::' name ( '::' name )*;
simpleTypeSpecifier : pathName;
literalCollection : collectionKind '{' ( collectionItem (',' collectionItem )*)? '}';
collectionItem : expression ('..' expression )?;
propertyCall : pathName ('#' number)? ( timeExpression )? ( qualifiers )? ( propertyCallParameters )?;
qualifiers : '[' actualParameterList ']';
declarator : name ( ',' name )* ( ':' simpleTypeSpecifier )? ( ';' name ':' typeSpecifier '=' expression )? '|';
pathName : name ( '::' name )*;
timeExpression : '#' 'pre';
actualParameterList : expression (',' expression)*;
logicalOperator : 'and' | 'or' | 'xor' | 'implies';
collectionKind : 'Set' | 'Bag' | 'Sequence' | 'Collection';
relationalOperator : '=' | '>' | '<' | '>=' | '<=' | '<>';
addOperator : '+' | '-';
multiplyOperator : '*' | '/';
unaryOperator : '-' | 'not';
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
DIGITS : '0'..'9' ;
name : (LOWERCASE | UPPERCASE | '_') ( LOWERCASE | UPPERCASE | DIGITS | '_' )* ;
number : DIGITS (DIGITS)* ( '.' DIGITS (DIGITS)* )?;
WS : [ \t\r\n]+ -> skip ;
and the following input:
package RobotsTestModel
context motor
inv#0:
self.power = 100
endpackage
but I'm getting only "mot" in the variable text:
public override bool VisitConstraint([NotNull] HelloParser.ConstraintContext context)
{
VisitContextDeclaration(context.contextDeclaration());
string text = context.contextDeclaration().classifierContext().GetText();
Strangely, it sometimes retrieves "motoo" when I change the input, though not consistently. It behaves as if it expects Stereotype instead of "or", the last part of "motor". Where should I look to fix this?
Your grammar produces the tokens m, o, t, and or from the input motor, because name is a parser rule, not a lexer rule, and so plays no part in token recognition. The rule name matches only the m, o, and t tokens; it does not expect the or token (the keyword literal 'or'), which is why you see these strange results.
You should make lexer rules for names and numbers instead of parser ones:
NAME : [a-zA-Z_]+ [a-zA-Z0-9_]* ;
NUMBER : [0-9]+ ('.' [0-9]+)? ;
So motor will be a single token.
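To see why the original lexer splits motor this way, here is a minimal Python sketch of longest-match tokenization. It is an illustration under stated assumptions, not ANTLR's actual algorithm: KEYWORDS lists only a subset of the literals in the grammar, and lex_original and lex_with_name_rule are hypothetical helper names.

```python
import re

# Subset of the keyword literals that appear in the parser rules (assumption:
# the remaining literals do not matter for this input).
KEYWORDS = ["package", "endpackage", "context", "pre", "post", "inv",
            "implies", "not", "or", "xor", "and"]

# Counterpart of a NAME lexer rule that matches a whole identifier.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def lex_original(text):
    """Longest match between keyword literals and one-character tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        # Every keyword that matches at this position is a candidate,
        # and so is the single character (LOWERCASE, UPPERCASE, DIGITS).
        candidates = [kw for kw in KEYWORDS if text.startswith(kw, pos)]
        candidates.append(text[pos])
        best = max(candidates, key=len)
        tokens.append(best)
        pos += len(best)
    return tokens


def lex_with_name_rule(text):
    """Same input, but a whole identifier is matched as one token."""
    tokens, pos = [], 0
    while pos < len(text):
        match = NAME.match(text, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
        else:
            tokens.append(text[pos])
            pos += 1
    return tokens


print(lex_original("motor"))        # ['m', 'o', 't', 'or']
print(lex_with_name_rule("motor"))  # ['motor']
```

With only one-character letter rules, a keyword embedded in a name (here or) wins over a single letter because it is the longer match.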
Using ANTLR 4.3 and this grammar:
http://www.harward.us/~nharward/antlr/OracleNetServicesV3.g
the following *Lexer.cs code for C# is generated:
private void WHITESPACE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 1: skip(); break;
}
}
private void NEWLINE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 2: skip(); break;
}
}
private void COMMENT_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 0: skip(); break;
}
}
But the method skip() in the runtime is defined as:
public virtual void Skip()
Which of course gives a compilation error.
The same skip() method is generated with Antlr 3.5.2 as well.
Is this a bug, or am I doing something wrong?
You can easily make this v4 compatible and language independent by using the lexer command
-> channel(HIDDEN)
Here is an updated grammar that implements this change:
configuration_file
: ( parameter )*
;
parameter
: keyword EQUALS ( value
| LEFT_PAREN value_list RIGHT_PAREN
| ( LEFT_PAREN parameter RIGHT_PAREN )+
)
;
keyword
: WORD
;
value
: WORD
| QUOTED_STRING
;
value_list
: value ( COMMA value )*
;
QUOTED_STRING
: SINGLE_QUOTE ~'\''* SINGLE_QUOTE
| DOUBLE_QUOTE ~'"'* DOUBLE_QUOTE
;
WORD
: ( 'A' .. 'Z'
| 'a' .. 'z'
| '0' .. '9'
| '<'
| '>'
| '/'
| '.'
| ':'
| ';'
| '-'
| '_'
| '$'
| '+'
| '*'
| '&'
| '!'
| '%'
| '?'
| '#'
| '\\' .
)+
;
LEFT_PAREN
: '('
;
RIGHT_PAREN
: ')'
;
EQUALS
: '='
;
COMMA
: ','
;
SINGLE_QUOTE
: '\''
;
DOUBLE_QUOTE
: '"'
;
COMMENT
: '#' ( ~( '\n' ) )* -> channel(HIDDEN)
;
WHITESPACE
: ( '\t'
| ' '
) -> channel(HIDDEN)
;
NEWLINE
: ( '\r' )? '\n' -> channel(HIDDEN)
;
As I wrote in my comment, it's because of the skip() call in the grammar file, which is Java-dependent.
So there is no bug in Antlr. :)
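For readers unfamiliar with channels, the following Python sketch illustrates the idea behind channel(HIDDEN): hidden tokens stay in the token stream but are not handed to the parser. This is not ANTLR code; Token, lex, parser_view, and the channel constants are hypothetical names chosen for the illustration.

```python
import re
from dataclasses import dataclass

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1


@dataclass
class Token:
    text: str
    channel: int = DEFAULT_CHANNEL


def lex(text):
    """Split text into whitespace and non-whitespace runs; hide the whitespace."""
    tokens = []
    for piece in re.findall(r"\s+|\S+", text):
        channel = HIDDEN_CHANNEL if piece.isspace() else DEFAULT_CHANNEL
        tokens.append(Token(piece, channel))
    return tokens


def parser_view(tokens):
    """Return the token texts a parser reading the default channel would see."""
    return [t.text for t in tokens if t.channel == DEFAULT_CHANNEL]


source = "NAME = value\n"
tokens = lex(source)
print(parser_view(tokens))                        # ['NAME', '=', 'value']
print("".join(t.text for t in tokens) == source)  # True
```

Because the hidden tokens are kept, the original text can still be reconstructed from the token stream, which is not possible once tokens have been dropped entirely.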
I have written a grammar in ANTLR 4 and wanted to test it by parsing an input program, but it threw a "no viable alternative" exception during parsing. Now I wonder whether the grammar is non-deterministic or just badly written for ANTLR 4, and how to fix it. Is there a helpful tool for optimizing and validating LL(*) grammars? I have found only some default tools, which either did not discover any error or did not validate the grammar correctly. ANTLR reports the error at line 1, quoting part of line 2, for two standalone commands, and I do not know whether the error is exactly there or could be anywhere deeper in the program.
The error message, caught at line 1 by my SyntaxError listener based on ANTLR's BaseErrorListener:
2015-05-02 14:31:45 [ERROR] [Templates\Java\DocBook\Plan Examples\One Document.PeKLXT]: Exception (Utils.ParsingException) was thrown... Message: Parsing failed... First Error at: no viable alternative at input '##import: "XDoc_Generator.dll";\t//XDocProject dll\n\n\n\n##registerVar: XDoc_Generator.XDocProject ' errors: 1
Caused by: at PeKLXT.NET.Engine.parse(String inputProgramFile) in c:\Users\petrk_000\Documents\Visual Studio 2012\Projects\Argutec_XDoc\PeKLXT.NET\Engine.cs:line 51
As you can see, ANTLR quoted only the first command and part of the second in the error message. Is the error in this part of the input, or could it be somewhere deeper in the program?
'##import: "XDoc_Generator.dll";\t//XDocProject dll
\n\n\n\n##registerVar: XDoc_Generator.XDocProject '
The complete code for these two commands:
##import: "XDoc_Generator.dll"; //XDocProject dll
##registerVar: XDoc_Generator.XDocProject #project = (XDocProject)(#this);
And here is the grammar for these rules. I have checked it for determinism, including with an online tool, but I have not found a mismatch:
PeKLXT.g4:
compileUnit
: (LINECOMMENT | MULTILINECOMMENT)* importCommand*(registerGlobalVariableCommand LINECOMMENT?)* outerBodyCommand* EOF #PeKLXTProgram
| (LINECOMMENT | MULTILINECOMMENT)* importCommand* ((MULTILINECOMMENT)? functionDeclaration)+ EOF #FunctionCodeFile
;
Function:
functionDeclaration
: '##function' ':' InnerIdentifier '(' parameters? ')' innerBody ';'?
;
OuterBody command (each command starts with a different keyword):
outerBodyCommand
: createCommand
| outerOpenCommand
| outerIfCommand
| outerLoopCommand
| setVariableCommand
;
Other used commands and expressions for this context:
importCommand
: '##import' ':' stringExpression ';' LINECOMMENT?
;
registerGlobalVariableCommand
: '##registerVar' ':' type emptyArrayDimensions? identifier ('=' assignmentExpression)? ';' //local variable starts with ##
| '##registerVar' ':' '$xdocument' '=' stringExpression ';'
| '##registerVar' ':' stat='$$$static' '=' dolarVariableExpression ';'
;
emptyArrayDimensions
: emptyIndexerUnit+
;
emptyIndexerUnit
: '[' (',')* ']'
;
assignmentExpression
: conditionExpression
;
String expressions:
stringExpression
: stringForm (AdditiveOp stringForm)*
;
stringForm
: unaryExpression
;
stringMethod
: StringMethodName '(' parameterValues? ')'
;
Unary expression (the basic expression unit; it is equivalent to a condition expression when there are no operators between two or more expressions):
unaryExpression
: strictlyUnaryExpression
| op='+' strictlyUnaryExpression
| op='-' strictlyUnaryExpression
| op='!' strictlyUnaryExpression
| '(' unaryExpression ')'
;
strictlyUnaryExpression
: literal('->' stringMethod)*
| objectCall('->' stringMethod)*
| castExpression ('->' stringMethod)*
| dolarVariableExpression ('->' stringMethod)*
;
literal
: StringLiteral
| IntegerLiteral
| DecimalFloatingPointLiteral
| BooleanLiteral
| NullLiteral
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
Dollar Variable Expression:
dolarVariableExpression
: dolarVariableContext
| '$$$tmp' ':' innerSelect
;
innerSelect
: '##innerSelect' ':' (first='first'|last='last')? '(' innerSelect ')' ('where' '(' conditionExpression ')')?
| '##innerSelect' ':' (first='first'|last='last')? '(' localDolarIdentifier ':' dolarVariableContext ')' ('where' '(' conditionExpression ')')?
;
Types and identifiers:
//types
type
: PrimitiveType
| typeIdentifier generics?
;
//string [,,] neco = new string [2, 2,4] {{{"5", "5", "5", "5"}, {"5", "5", "5", "5"}}, {{"5", "5", "5", "5"}, {"5", "5", "5", "5"}}};
emptyArrayDimensions
: emptyIndexerUnit+
;
generics
: '<' type (',' type)* '>'
;
//identifiers
//complexIdentifier
//: identifier ('.' InnerIdentifier)*
//;
identifier
: '#' InnerIdentifier #IdentifierGlobal
| localIdentifier #IdentifierLocal
| '###' InnerIdentifier #IdentifierStatic
;
localIdentifier
: '##' InnerIdentifier
;
typeIdentifier
: InnerIdentifier ('.' InnerIdentifier)*
;
dolarVariableContext
: dolarPartCall ('.' dolarPartCall)* (InnerIdentifier)?
//| '$$' Pointer ('.' dolarPartCall)* (InnerIdentifier)?
| localDolarIdentifier (indexerGet)? ('.' dolarPartCall)* (InnerIdentifier)?
;
localDolarIdentifier
: '$$' InnerIdentifier
;
dolarPartCall
: '$' InnerIdentifier (indexerGet)?
| '$' Pointer (indexerGet)?
;
fragment InnerIdentifier
: IdentifierStart IdentifierPart*
;
IdentifierStart : [_a-zA-Z]+;
IdentifierPart : [_a-zA-Z0-9]+;
Characters to skip:
WS : [ \t\r\n\u000C]+ -> skip
;
MULTILINECOMMENT
: '/*' .*? '*/'
;
LINECOMMENT
: '//' ~[\r\n]*
;
COMMENT
: '<!--' (.)*? '-->' -> skip
;
Thanks a lot for help!
I'm migrating my custom DSL from GoldParser to ANTLR4, but I'm stuck at the parsing step because it takes too long to finish. A source of 1000 lines is parsed in 34 seconds, versus the millisecond range I had in GoldParser.
This is the C# code I use for parsing:
var input = new AntlrInputStream(prg);
var lexer = new PCLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new PCParser(tokens);
var tree = parser.programma(); // root rule is "programma"
I suspect the problem is in the grammar, which has a lot of ambiguities. Indeed, that was the reason I decided to migrate from GoldParser: not being able to improve the grammar further, I realized it would be much easier to rewrite it in ANTLR4 and not care about ambiguities.
My question is: is there anything I can do to get parsing times on the order of milliseconds, or is it just normal that ANTLR4 is inherently slow? I'm new to ANTLR and I don't know what to expect.
Regarding the grammar, it's a sort of pseudo-C:
grammar PC;
fragment Number : [0-9] ;
fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment DoubleStringCharacterM : ~["] ;
fragment SingleStringCharacterM : ~['] ;
BlockComment : '/*' .*? '*/' -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
WhiteSpaces : [\t\u000B\u000C\u0020\u00A0]+ -> skip ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
Quote : '\'' ;
DoubleQuote : '"' ;
NullLiteral : 'null' ;
BoolLiteral : 'true' | 'false' ;
IntLiteral : (Number)+ ;
FloatLiteral : (Number)* '.' (Number)+ ;
StringLiteral : DoubleQuote DoubleStringCharacter* DoubleQuote ;
StringLiteralJs : Quote SingleStringCharacter* Quote ;
StringLiteralM : '#' DoubleQuote DoubleStringCharacter* DoubleQuote ;
StringLiteralJsM : '#' Quote SingleStringCharacter* Quote ;
Or_op : 'or' | '||' ;
And_op : 'and' | '&&' ;
Not_op : 'not' | '!' ;
Not_eq : '!=' | '<>' ;
programma : interfaccia? dichiarazione* ;
interfaccia : 'interfaccia' '{' oggettoInterfaccia* '}' ;
oggettoInterfaccia : Identifier Identifier '{' definizioneProprieta* '}' ;
definizioneProprieta : Identifier '=' valoreProprieta ';'
| oggettoInterfaccia;
valoreProprieta : BoolLiteral | IntLiteral | FloatLiteral | StringLiteral | StringLiteralM | Identifier ;
dichiarazione : dichiarazioneReference
| dichiarazioneUsing
| dichiarazioneClass
| dichiarazioneFunzione
| dichiarazioneVariabile
;
dichiarazioneReference : 'reference' StringLiteral ';' ;
dichiarazioneUsing : 'using' Identifier '=' StringLiteral ';' ;
dichiarazioneClass : 'class' Identifier ';' ;
dichiarazioneFunzione : Identifier Identifier '(' parametri ')' '{' stmList '}' ;
parametri : parametro (',' parametro)* ;
parametro : Identifier
| Identifier Identifier
;
dichiarazioneVariabile : Identifier listaVariabili ';' ;
listaVariabili : variabile (',' variabile)* ;
variabile : Identifier
| Identifier '=' exprOrArray
;
stmList : stm* ;
stm : blocco
| dichiarazioneVariabile
| etichetta
| istruzioneIf
| istruzioneWhile
| istruzioneFor
| istruzioneDo
| istruzioneGoto
| istruzioneBreak
| istruzioneContinue
| istruzioneReturn
| expr ';'
| assegnamento ';'
| ';'
| 'ConnectEvent' '(' Identifier ',' Identifier ',' Identifier ')' ';'
| istruzioneTry
;
blocco : '{' stmList '}' ;
istruzioneIf : 'if' '(' expr ')' stm ( 'else' stm )? ;
istruzioneFor : 'for' '(' stm condizioneFor ';' incrementoFor? ')' stm ;
condizioneFor : expr? ;
incrementoFor : expr
| assegnamento
;
istruzioneWhile : 'while' '(' expr ')' stm ;
istruzioneDo : 'do' stm 'while' '(' expr ')' ; // TODO: should ';' be added?
etichetta : Identifier ':' ;
istruzioneGoto : 'goto' Identifier ';' ;
istruzioneBreak : 'break' ';' ;
istruzioneContinue : 'continue' ';' ;
istruzioneReturn : 'return' exprOrArray ';' | 'return' ';' ;
istruzioneTry : 'try' blocco 'catch' '(' Identifier ')' blocco ;
assegnamento : Identifier '=' exprOrArray
| Identifier '[' expr ']' '=' exprOrArray
| Identifier '.' Identifier '=' exprOrArray
;
exprOrArray : expr
| '{' exprList '}'
;
exprList : exprOrArray ',' exprList
| exprOrArray
;
expr : expr '+=' expr
| expr '-=' expr
| expr '?' expr ':' expr
| expr Or_op expr
| expr And_op expr
| expr '==' expr
| expr Not_eq expr
| expr '<' expr
| expr '>' expr
| expr '<=' expr
| expr '>=' expr
| expr 'as' Identifier
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '%' expr
| expr Not_op expr
| '-' expr
| '+' expr
| '--' expr
| '++' expr
| expr '--'
| expr '++'
| expr '[' expr ']'
| callFun
| Identifier '.' Identifier '(' methodParams ')'
| Identifier '.' Identifier
| Identifier
| literal
| '(' expr ')'
;
methodParams : methodParam (',' methodParam)* ;
methodParam : exprOrArray ;
callFun : Identifier '(' methodParams ')'
| 'new' Identifier '(' methodParams ')'
;
literal : NullLiteral
| BoolLiteral
| IntLiteral
| FloatLiteral
| StringLiteral
| StringLiteralJs
| StringLiteralM
| StringLiteralJsM
;
If your grammar had ambiguities, the Gold Parser (my understanding: LALR(1)) would not parse source text correctly. [I assume you are ignoring the complaints it should produce about shift-reduce and reduce-reduce conflicts?] It would pick one of the parses. And, being LALR(1), it will do so in linear time, so it is no surprise that it is fast; this is a key utility of LALR(1) parsers.
Ambiguity in a grammar often (not always) means that there are parses you should have eliminated, but did not. If Gold is picking among parses, and some are wrong, there is no reason to believe you are getting a correct parse.
So, in fact, if you can get the wrong answer in milliseconds with Gold, why does it matter if ANTLR gets the wrong answer somewhat more slowly?
I suggest you remove the ambiguities. (As a starting place, your expression subgrammar looks highly ambiguous to me). I think ANTLR will "speed up".
Sorry for my bad English.
I wrote an ANTLR4 grammar for GDB/MI output commands based on this manual:
grammar GdbOutput;
output : out_of_band_record | result_record | terminator_record;
result_record : TOKEN? '^' RESULT_CLASS (',' result)*;
out_of_band_record : async_record
| stream_record;
async_record : exec_async_output
| status_async_output
| notify_async_output;
exec_async_output : TOKEN? '*' async_output;
status_async_output : TOKEN? '+' async_output;
notify_async_output : TOKEN? '=' async_output;
async_output : async_class (',' result)*;
RESULT_CLASS : 'done'
| 'running'
| 'connected'
| 'error'
| 'exit';
async_class : 'stopped'; //TODO
result : VARIABLE '=' value;
value : const
| tuple
| list;
const : c_string;
c_string : '"' STRING_LITERAL '"';
tuple : '{}'
| '{' result (',' result)* '}';
list : '[]'
| '[' value (',' value)* ']'
| '[' result (',' result)* ']';
stream_record : console_stream_output
| target_stream_output
| log_stream_output;
console_stream_output : '~' c_string;
target_stream_output : '#' c_string;
log_stream_output : '&' c_string;
terminator_record : '(gdb)';
VARIABLE : [a-z-]*;
STRING_LITERAL : (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))*;
TOKEN : [0-9]+;
I tried some output strings from GDB:
"~\"Reading symbols from C:\\src.exe...\"" - OK (out_of_band_record -> stream_record -> console_stream_output)
"(gdb)" -> OK (terminator_record)
"^done,bkpt={number=\"1\",type=\"breakpoint\",disp=\"keep\",enabled=\"y\",addr=\"0x00000000004014e4\",file=\"src.s\",fullname=\"C:\\src.s\",line=\"17\",thread-groups=[\"i1\"],times=\"0\",original-location=\"main\"}" - FAIL (with exception: line 1:0 no viable alternative at input '^done,bkpt={number=')
"^done,bkpt=\"1\"" - FAIL (line 1:0 no viable alternative at input '^done,bkpt=)
"^done,bkpt={}" - FAIL (line 1:0 no viable alternative at input '^done,bkpt={}')
Why didn't my parser recognize strings #3-5?
P.S.: I'm using the C# target for ANTLR v4.2.0 (prerelease) from NuGet.
For starters, let your STRING_LITERAL match the quotes too: don't match them in a parser rule. And let your VARIABLE rule match at least one character (change the * into a +).
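As a rough illustration of both suggestions, here is a small Python tokenizer sketch in which the string token includes its quotes and the variable pattern requires at least one character. It is not ANTLR output: TOKEN_SPEC, PUNCT, and tokenize are hypothetical names, and Python's regex alternation picks the first matching alternative rather than the longest match, so the order of the entries and the lookahead on RESULT_CLASS are assumptions made for this sketch.

```python
import re

# Hypothetical token table loosely mirroring the suggested lexer rules.
TOKEN_SPEC = [
    ("TOKEN", r"[0-9]+"),
    # The lookahead keeps a longer name that starts with a result class
    # (for example "errors") from being split.
    ("RESULT_CLASS", r"(?:done|running|connected|error|exit)(?![a-z-])"),
    ("VARIABLE", r"[a-z-]+"),                # '+': never matches empty input
    ("C_STRING", r'"(?:[^"\\\r\n]|\\.)*"'),  # the quotes are part of the token
    ("PUNCT", r"[\^,={}\[\]~&*+#]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (token type, text) pairs, failing loudly on unknown characters."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in tokenize('^done,bkpt="1"'):
    print(kind, text)
```

With the quotes inside the string token, the parser no longer has to match '"' on its own, so c_string can refer to a single token.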