ANTLR3 common values in 2 different domain values - c#

I need to define a language-parser for the following search criteria:
CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;
Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.
I decided to use ANTLR3 (v3.4) with C# output (CSharp3), and it worked smoothly until now. The problem is that parsing fails when I provide a value that belongs to both value sets (in this case '5'). For example, if I provide the following string
CRITERIA_1=5;
It returns the following error where the value node was supposed to be:
<unexpected: [#1,11:11='5',<27>,1:11], resync=5>
The grammar definition file is the following:
grammar ZeGrammar;
options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
k=3;
}
tokens
{
ROOT;
CRITERIA_1;
CRITERIA_2;
OR = 'OR';
AND = 'AND';
EOF = ';';
LPAREN = '(';
RPAREN = ')';
}
public
start
: expr EOF -> ^(ROOT expr)
;
expr
: subexpr ((AND|OR)^ subexpr)*
;
subexpr
: grouppedsubexpr
| 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
| 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
;
grouppedsubexpr
: LPAREN! expr RPAREN!
;
rangeval1_expr
: rangeval1_subexpr
| RANGE1_VALUES
;
rangeval1_subexpr
: LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
;
RANGE1_VALUES
: (('0'..'4')? ('0'..'9') | '5''0')
;
rangeval2_expr
: rangeval2_subexpr
| RANGE2_VALUES
;
rangeval2_subexpr
: LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
;
RANGE2_VALUES
: '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
;
And if I remove the value '5' from RANGE2_VALUES it works fine. Can anyone give me a hint about what I am doing wrong?

You must realize that the lexer does not produce tokens based on what the parser tries to match. So, in your case, the input "5" will always be tokenized as a RANGE1_VALUES and never as a RANGE2_VALUES because both RANGE1_VALUES and RANGE2_VALUES can match this input but RANGE1_VALUES comes first (so RANGE1_VALUES takes precedence over RANGE2_VALUES).
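The tie-break described above can be modelled in a few lines of Python. This is only a simplified sketch, not ANTLR's actual algorithm: the regular expressions are hand-written stand-ins for the two lexer rules from the question, and `tokenize` is a hypothetical helper that tries the rules in grammar order.

```python
import re

# Hypothetical stand-ins for the two lexer rules from the question,
# listed in the order they appear in the grammar.
LEXER_RULES = [
    ("RANGE1_VALUES", re.compile(r"50|[0-4]?[0-9]")),
    ("RANGE2_VALUES", re.compile(r"5|[aAbBcC]")),
]


def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in LEXER_RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"cannot tokenize {text[pos]!r} at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

In this model `tokenize("5")` can only ever yield a RANGE1_VALUES token, no matter which parser rule is waiting for it, which is the situation the error message in the question reports.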
A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:
D0_4
: '0'..'4'
;
D5
: '5'
;
D6_50
: '6'..'9' // 6-9
| '1'..'4' '0'..'9' // 10-49
| '50' // 50
;
A_B_C
: ('a'|'A')
| ('b'|'B')
| ('c'|'C')
;
and then introduce these new parser rules:
range1_values
: D0_4
| D5
| D6_50
;
range2_values
: A_B_C
| D5
;
and replace all RANGE1_VALUES and RANGE2_VALUES references in your parser rules with range1_values and range2_values respectively.
EDIT
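Instead of trying to solve this at the lexer level, you could simply match any integer value and check inside the parser rule whether the value is the correct one (or in the correct range) using a semantic predicate. The predicates below are written in Java syntax; for the CSharp3 target, int.Parse($INT.text) would be the equivalent call: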
range1_values
: INT {Integer.valueOf($INT.text) <= 50}?
;
range2_values
: A_B_C
| INT {Integer.valueOf($INT.text) == 5}?
;
INT
: '0'..'9'+
;
A_B_C
: 'a'..'c'
| 'A'..'C'
;

Related

antlr4 in dot net "mismatched input 'begin' expecting {';', '+', '-', '*', DIV, MOD}

I'm using antlr4 in C#.
Everything works fine except when I use a 'block'; then parsing breaks.
For example, this is my input code:
a:int;
a:=2;
if(a==2) begin
a:= a * 2;
a:=a + 5;
end
and this is my grammar:
grammar Our;
options{
language=CSharp;
TokenLabelType=CommonToken;
ASTLabelType=CommonTree;
}
statements : statement statements
|EOF;
statement :
expression SEMI
| ifstmt
| whilestmt
| forstmt
| readstmt SEMI
| writestmt SEMI
| vardef SEMI
| block
;
block : BEGIN statements END ;
expression : ID ASSIGN expression
| boolexp;
boolexp : relexp AND boolexp
| relexp OR boolexp
| relexp;
relexp : modexp EQUAL relexp
| modexp LE relexp
| modexp GE relexp
| modexp NOTEQUAL relexp
| modexp GT relexp
| modexp LT relexp
| modexp;
modexp : modexp MOD exp
//| exp DIV modexp
| exp;
exp : exp ADD term
| exp SUB term
| term;
term : term MUL factor
| term DIV factor
| factor POW term
| factor;
factor : LPAREN expression RPAREN
| LPAREN vartype RPAREN factor
| ID
| SUB factor
| ID LPAREN explist RPAREN
| ID LPAREN RPAREN
| ID LPAREN LPAREN NUM RPAREN RPAREN
| ID LPAREN LPAREN NUM COMMA NUM RPAREN RPAREN
| const;
explist : exp COMMA explist
|exp;
const : NUM
| BooleanLiteral
| STRING;
ifstmt : IF LPAREN boolexp RPAREN statement
| IF LPAREN boolexp RPAREN statement ELSE statement ;
whilestmt : WHILE LPAREN boolexp RPAREN statement ;
forstmt : FOR ID ASSIGN exp COLON exp statement;
readstmt : READ LPAREN idlist RPAREN ;
idlist : ID COMMA idlist
|ID;
writestmt : WRITE LPAREN explist RPAREN ;
vardef : idlist COLON vartype;
vartype : basictypes
| basictypes LPAREN NUM RPAREN
| basictypes LPAREN NUM COMMA NUM RPAREN ;
basictypes : INT
| FLOAT
| CHAR
| STRING
| BOOLEAN ;
BEGIN : 'begin';
END : 'end';
To : 'to';
NEXT : 'next';
REAL : 'real';
BOOLEAN : 'boolean';
CHAR : 'char';
DO : 'do';
DOUBLE : 'double';
ELSE : 'else';
FLOAT : 'float';
FOR : 'for';
FOREACH : 'foreach';
FUNCTION : 'function';
IF : 'if';
INT : 'int';
READ : 'read';
RETURN : 'return';
VOID : 'void';
WHILE : 'while';
WEND : 'wend';
WRITE : 'write';
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
ASSIGN : ':=';
GT : '>';
LT : '<';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : '&&'|'and';
OR : '||'|'or';
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/'|'div';
MOD : '%'|'mod';
ADD_ASSIGN : '+=';
SUB_ASSIGN : '-=';
MUL_ASSIGN : '*=';
DIV_ASSIGN : '/=';
POW : '^';
BooleanLiteral : 'true'|'false';
STRING : '\"'([a-zA-Z]|NUM)*'\"';
ID : ([a-z]|[A-Z])([a-z]|[A-z]|[0-9])*;
NUM : ('+'|'-')?[0-9]([0-9]*)('.'[0-9][0-9]*)?;
WS : [ \t\r\n\u000C]+ -> skip ;
COMMENT : '/*' .*? '*/' ;
LINE_COMMENT : '//' ~[\r\n]*;
When I run the parser I get the following error messages:
no viable alternative at input 'if(a==2)begina:=a*2;a:=a+5;end'
mismatched input 'begin' expecting {';', '+', '-', '*', DIV, MOD}
no viable alternative at input 'end'
thanks in advance.
The problem is your rule for a list of statements:
statements : statement statements | EOF ;
This rule has two options: a statement followed by another list of statements, or EOF. The only non-recursive option is the EOF, which becomes a problem when you use this in your rule for a block:
block : BEGIN statements END ;
You can never encounter EOF in the middle of a block, so when the parser reads the line before the end in your sample input, the next thing that it expects to read is another statement. The word end on its own isn't a valid statement, which is why it throws the error that you are seeing.
One possible fix is to make the recursive part of your statements rule optional:
statements : statement statements? | EOF ;
This will allow your sample input to parse successfully. In my opinion, a better option is to remove the recursion altogether:
statements : statement* | EOF ;
Finally, you can see that EOF is still one of the options for the statements rule. This doesn't make much sense when the rule is used as part of the rule for block, since you should never find an EOF in the middle of a block. What I would do is move it to a new top-level parser rule:
program : statements EOF ;
statements : statement* ;
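To see why this shape works, here is a minimal hand-written recursive-descent sketch of the same three rules in Python. It is an illustration only: the token names "STMT", "BEGIN", "END" and "EOF" are simplified assumptions (every simple statement is collapsed into one "STMT" token), and the token list is assumed to end with "EOF".

```python
def parse_program(tokens):
    """program : statements EOF ;"""
    pos = parse_statements(tokens, 0)
    if tokens[pos] != "EOF":
        raise SyntaxError(f"unexpected {tokens[pos]!r} at index {pos}")
    return True


def parse_statements(tokens, pos):
    """statements : statement* ;  Stops at the first token that cannot start a statement."""
    while tokens[pos] in ("STMT", "BEGIN"):
        pos = parse_statement(tokens, pos)
    return pos


def parse_statement(tokens, pos):
    """statement : STMT | block ;  where  block : BEGIN statements END ;"""
    if tokens[pos] == "STMT":
        return pos + 1
    # Otherwise the caller has already checked that this token is BEGIN.
    pos = parse_statements(tokens, pos + 1)
    if tokens[pos] != "END":
        raise SyntaxError(f"expected END, found {tokens[pos]!r} at index {pos}")
    return pos + 1
```

Because `parse_statements` simply stops when it sees a token that cannot start a statement, the same function serves both inside a block (where it stops at "END") and at the top level (where it stops at "EOF").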

Possible generation bug for C# in Antlr?

Using Antlr 4.3 and this grammar
http://www.harward.us/~nharward/antlr/OracleNetServicesV3.g
the following *Lexer.cs code is generated for C#:
private void WHITESPACE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 1: skip(); break;
}
}
private void NEWLINE_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 2: skip(); break;
}
}
private void COMMENT_action(RuleContext _localctx, int actionIndex) {
switch (actionIndex) {
case 0: skip(); break;
}
}
But the method skip() in the runtime is defined as:
public virtual void Skip()
Which of course gives a compilation error.
The same skip() method is generated with Antlr 3.5.2 as well.
Is this a bug, or am I doing something wrong?
You can easily make this v4 compatible and language independent by using the command
-> channel(HIDDEN)
Here is an updated grammar that implements this change:
configuration_file
: ( parameter )*
;
parameter
: keyword EQUALS ( value
| LEFT_PAREN value_list RIGHT_PAREN
| ( LEFT_PAREN parameter RIGHT_PAREN )+
)
;
keyword
: WORD
;
value
: WORD
| QUOTED_STRING
;
value_list
: value ( COMMA value )*
;
QUOTED_STRING
: SINGLE_QUOTE ~'\''* SINGLE_QUOTE
| DOUBLE_QUOTE ~'"'* DOUBLE_QUOTE
;
WORD
: ( 'A' .. 'Z'
| 'a' .. 'z'
| '0' .. '9'
| '<'
| '>'
| '/'
| '.'
| ':'
| ';'
| '-'
| '_'
| '$'
| '+'
| '*'
| '&'
| '!'
| '%'
| '?'
| '#'
| '\\' .
)+
;
LEFT_PAREN
: '('
;
RIGHT_PAREN
: ')'
;
EQUALS
: '='
;
COMMA
: ','
;
SINGLE_QUOTE
: '\''
;
DOUBLE_QUOTE
: '"'
;
COMMENT
: '#' ( ~( '\n' ) )* -> channel(HIDDEN)
;
WHITESPACE
: ( '\t'
| ' '
) -> channel(HIDDEN)
;
NEWLINE
: ( '\r' )? '\n' -> channel(HIDDEN)
;
As written in my comment, it's because of the skip() action in the grammar file, which is Java-dependent.
So there is no bug in Antlr. :)

ANTLR: no viable alternative at input

Sorry for my bad English.
I wrote an ANTLR4 grammar for GDB/MI output commands from this manual:
grammar GdbOutput;
output : out_of_band_record | result_record | terminator_record;
result_record : TOKEN? '^' RESULT_CLASS (',' result)*;
out_of_band_record : async_record
| stream_record;
async_record : exec_async_output
| status_async_output
| notify_async_output;
exec_async_output : TOKEN? '*' async_output;
status_async_output : TOKEN? '+' async_output;
notify_async_output : TOKEN? '=' async_output;
async_output : async_class (',' result)*;
RESULT_CLASS : 'done'
| 'running'
| 'connected'
| 'error'
| 'exit';
async_class : 'stopped'; //TODO
result : VARIABLE '=' value;
value : const
| tuple
| list;
const : c_string;
c_string : '"' STRING_LITERAL '"';
tuple : '{}'
| '{' result (',' result)* '}';
list : '[]'
| '[' value (',' value)* ']'
| '[' result (',' result)* ']';
stream_record : console_stream_output
| target_stream_output
| log_stream_output;
console_stream_output : '~' c_string;
target_stream_output : '#' c_string;
log_stream_output : '&' c_string;
terminator_record : '(gdb)';
VARIABLE : [a-z-]*;
STRING_LITERAL : (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))*;
TOKEN : [0-9]+;
I tried some output strings from GDB:
"~\"Reading symbols from C:\\src.exe...\"" - OK (out_of_band_record -> stream_record -> console_stream_output)
"(gdb)" -> OK (terminator_record)
"^done,bkpt={number=\"1\",type=\"breakpoint\",disp=\"keep\",enabled=\"y\",addr=\"0x00000000004014e4\",file=\"src.s\",fullname=\"C:\\src.s\",line=\"17\",thread-groups=[\"i1\"],times=\"0\",original-location=\"main\"}" - FAIL (with exception: line 1:0 no viable alternative at input '^done,bkpt={number=')
"^done,bkpt=\"1\"" - FAIL (line 1:0 no viable alternative at input '^done,bkpt=)
"^done,bkpt={}" - FAIL (line 1:0 no viable alternative at input '^done,bkpt={}')
Why doesn't my parser recognize strings #3-5?
P.S.: C# target for ANTLR v.4.2.0 prerelease from Nuget
For starters, let your STRING_LITERAL match the quotes too: don't match them in a parser rule. And let your VARIABLE rule match at least one character (change the * into a +).
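The reason these two changes matter is that the ANTLR4 lexer prefers the longest match, and the original STRING_LITERAL can match almost any run of characters. The Python sketch below is a rough model of that behaviour, not ANTLR itself: the regular expressions are hand translations of the lexer rules, "CARET" is a hypothetical name for the implicit '^' token, and `first_token` is a hypothetical helper.

```python
import re

# Hand-translated versions of the lexer rules from the question.
ORIGINAL = [
    ("CARET", r"\^"),
    ("RESULT_CLASS", r"done|running|connected|error|exit"),
    ("VARIABLE", r"[a-z-]*"),
    ("STRING_LITERAL", r'(?:[^"\\\r\n]|\\["\\])*'),
    ("TOKEN", r"[0-9]+"),
]

# Same rules with the two suggested changes applied:
# quotes are part of STRING_LITERAL, and VARIABLE needs at least one character.
FIXED = [
    ("CARET", r"\^"),
    ("RESULT_CLASS", r"done|running|connected|error|exit"),
    ("VARIABLE", r"[a-z-]+"),
    ("STRING_LITERAL", r'"(?:[^"\\\r\n]|\\["\\])*"'),
    ("TOKEN", r"[0-9]+"),
]


def first_token(rules, text):
    """Return (rule name, lexeme) for the longest match at the start of text."""
    best = None
    for name, pattern in rules:
        match = re.compile(pattern).match(text)
        # Strict '>' means an earlier rule wins when lengths are equal.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best
```

With the original rules, the first token of `^done,bkpt="1"` is a single STRING_LITERAL covering `^done,bkpt=`, which is the text quoted in the "no viable alternative" message. With the changed rules the first token is just the caret.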

ANTLR rule to skip method body

My task is to create an ANTLR grammar that analyses C# source files and generates a class hierarchy, which I will then use to generate a class diagram.
I wrote rules to parse namespaces, class declarations and method declarations. Now I have a problem with skipping method bodies. I don't need to parse them, because the bodies are irrelevant to my task.
I wrote a simple rule:
body:
'{' .* '}'
;
but it does not work properly when a method looks like this:
void foo()
{
...
{
...
}
...
}
The rule matches the first brace, which is fine, then it matches
...
{
...
as 'any' (.*), and then it takes the third brace as the closing brace, which is wrong, and the rule ends.
Could anybody help me write a proper rule for method bodies? As I said before, I don't want to parse them, only skip them.
UPDATE:
Here is the solution to my problem, strongly based on Adam12's answer:
body:
'{' ( ~('{' | '}') | body)* '}'
;
You have to use recursive rules that match parentheses pairs.
rule1 : '('
(
nestedParan
| (~')')*
)
')';
nestedParan : '('
(
nestedParan
| (~')')*
)
')';
This code assumes you are using the parser here so strings and comments are already excluded. ANTLR doesn't allow negation of multiple alternatives in parser rules so the code above relies on the fact that alternatives are tried in order. It should give a warning that alternatives 1 and 2 both match '(' and thus choose the first alternative, which is what we want.
You can handle the recursion of (nested) blocks in your lexer. The trick is to let your class definition also include the opening { so that not the entire contents of the class is gobbled up by this recursive lexer rule.
A quick demo that is without a doubt not complete, but is a decent start to "fuzzy parse/lex" a Java (or C# with some slight modifications) source file:
grammar T;
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));})* EOF
;
Skip
: (StringLiteral | CharLiteral | Comment) {skip();}
;
PackageDecl
: 'package' Spaces Ids {setText($Ids.text);}
;
ClassDecl
: 'class' Spaces Id Spaces? '{' {setText($Id.text);}
;
Method
: Id Spaces? ('(' {setText($Id.text);}
| /* no method after all! */ {skip();}
)
;
MethodOrStaticBlock
: Block {skip();}
;
Any
: . {skip();}
;
// fragments
fragment Spaces
: (' ' | '\t' | '\r' | '\n')+
;
fragment Ids
: Id ('.' Id)*
;
fragment Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
fragment Block
: '{' ( ~('{' | '}' | '"' | '\'' | '/')
| {input.LA(2) != '/'}?=> '/'
| StringLiteral
| CharLiteral
| Comment
| Block
)*
'}'
;
fragment Comment
: '/*' .* '*/'
| '//' ~('\r' | '\n')*
;
fragment CharLiteral
: '\'' ('\\\'' | ~('\\' | '\'' | '\r' | '\n'))+ '\''
;
fragment StringLiteral
: '"' ('\\"' | ~('\\' | '"' | '\r' | '\n'))* '"'
;
I ran the generated parser against the following Java source file:
/*
... package NO.PACKAGE; ...
*/
package foo.bar;
public final class Mu {
static String x;
static {
x = "class NotAClass!";
}
void m1() {
// {
while(true) {
double a = 2.0 / 2;
if(a == 1.0) { break; } // }
/* } */
}
}
static class Inner {
int m2 () {return 42; /*comment}*/ }
}
}
which produced the following output:
PackageDecl 'foo.bar'
ClassDecl 'Mu'
Method 'm1'
ClassDecl 'Inner'
Method 'm2'

ANTLR Grammar and generated code problems

I'm trying to create an expression parser using ANTLR.
The expression will go inside an if statement, so its root is a condition.
I have the following grammar, which "compiles" to parser/lexer files with no problems. However, the generated code itself has some errors, essentially two "empty" if statements,
i.e.
if (())
Not sure what I'm doing wrong, any help will be greatly appreciated.
Thanks.
Grammar .g file below:
grammar Expression;
options {
language=CSharp3;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
@parser::namespace { Antlr3 }
@lexer::namespace { Antlr3 }
public parse
: orcond EOF -> ^(ROOT orcond)
;
orcond
: andcond ('||' andcond)*
;
andcond
: condition ('&&' condition)*
;
condition
: exp (('<' | '>' | '==' | '!=' | '<=' | '>=')^ exp)?
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' parenthesisvalid ')' -> parenthesisvalid
;
parenthesisvalid
: fullobjectref
| orcond
;
fullobjectref
: objectref ('.' objectref)?
;
objectref
: objectname ('()' | '(' params ')' | '[' params ']')?
;
objectname
: (('a'..'z') | ('A'..'Z'))^ (('a'..'z') | ('A'..'Z') | ('0'..'9') | '_')*
;
params
: paramitem (',' paramitem)?
;
paramitem
: unaryExp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
Don't use the range operator, .., inside parser rules.
Remove the parser rule objectname and create the lexer rule:
Objectname
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;
Additionally, it is neater to use fragment rules. In that case your lexer rules would look something like this:
Objectname : LETTER (LETTER | DIGIT | '_')*;
Number: DIGIT+ ('.' DIGIT+)?;
fragment DIGIT : '0'..'9' ;
fragment LETTER : ('a'..'z' | 'A'..'Z');
