ANTLR4 Hexadecimal Parsing

I'm having trouble debugging an ANTLR grammar I'm writing for Game Boy assembly.
It mostly works, but for some reason it cannot handle 0x notation for hexadecimal in certain edge cases.
If my input string is "JR 0x10", ANTLR fails with a 'no viable alternative at input' error. As I understand it, this means that either I have no rule that matches the token stream or '0x' is not being tokenized the way I expect. If I use "JR $10" (one of the alternate notations I support), it works perfectly, even though '0x' and '$' are expressed in the same rule.
Here is my g4 file:
grammar GBASM;
eval : exp EOF;
exp : exp op | exp sys | op | sys;
sys : include | section | label | data;
op : monad | biad arg | triad arg SEPARATOR arg;
monad : NOP|RLCA|RRCA|STOP|RLA|RRA|DAA|CPL|SCF|CCF|HALT|RETI|DI|EI|RST|RET;
biad : INC|DEC|SUB|AND|XOR|OR|CP|POP|PUSH|RLC|RRC|RL|RR|SLA|SRA|SWAP|SRL|JP|JR;
triad : RET|JR|JP|CALL|LD|LDD|LDI|LDH|ADD|ADC|SBC|BIT|RES|SET;
arg : (register|value|negvalue|flag|offset|jump|memory);
memory : MEMSTART (register|value|jump) MEMEND;
offset : register Plus value | register negvalue;
register : A|B|C|D|E|F|H|L|AF|BC|DE|HL|SP|HLPLUS|HLMINUS;
flag : NZ | NC | Z | C;
data : DB db;
db : string_data | value | string_data SEPARATOR db | value SEPARATOR db;
include : INCLUDE string_data;
section : SECTION string_data SEPARATOR HOME '[' value ']';
string_data: STRINGLITERAL;
jump : LIMSTRING;
label : LIMSTRING ':';
Z : 'Z';
A : 'A';
B : 'B';
C : 'C';
D : 'D';
E : 'E';
F : 'F';
H : 'H';
L : 'L';
AF : 'AF';
BC : 'BC';
DE : 'DE';
HL : 'HL';
SP : 'SP';
NZ : 'NZ';
NC : 'NC';
value : HexInteger | Integer;
negvalue : (Neg Integer) | (Neg HexInteger);
Neg : '-';
Plus : '+';
HexInteger : (HexPrefix HexDigit+) | (HexDigit+ HexPostfix);
Integer : Digit+;
fragment Digit : ('0'..'9');
HLPLUS : 'HL+' | 'HLI';
HLMINUS : 'HL-' | 'HLD';
MEMSTART : '(';
MEMEND : ')';
LD : 'LD' | 'ld';
JR : 'JR' | 'jr';
JP : 'JP' | 'jp';
OR : 'OR' | 'or';
CP : 'CP' | 'cp';
RL : 'RL' | 'rl';
RR : 'RR' | 'rr';
DI : 'DI' | 'di';
EI : 'EI' | 'ei';
DB : 'DB';
LDD : 'LDD' | 'ldd';
LDI : 'LDI' | 'ldi';
ADD: 'ADD' | 'add';
ADC : 'ADC' | 'adc';
SBC : 'SBC' | 'sbc';
BIT : 'BIT' | 'bit';
RES : 'RES' | 'res';
SET : 'SET' | 'set';
RET: 'RET' | 'ret';
INC : 'INC' | 'inc';
DEC : 'DEC' | 'dec';
SUB : 'SUB' | 'sub';
AND : 'AND' | 'and';
XOR : 'XOR' | 'xor';
RLC : 'RLC' | 'rlc';
RRC : 'RRC' | 'rrc';
POP: 'POP' | 'pop';
SLA : 'SLA' | 'sla';
SRA : 'SRA' | 'sra';
SRL : 'SRL' | 'srl';
NOP : 'NOP' | 'nop';
RLA : 'RLA' | 'rla';
RRA : 'RRA' | 'rra';
DAA : 'DAA' | 'daa';
CPL : 'CPL' | 'cpl';
SCF : 'SCF' | 'scf';
CCF : 'CCF' | 'ccf';
LDH : 'LDH' | 'ldh';
RST : 'RST' | 'rst';
CALL : 'CALL' | 'call';
PUSH : 'PUSH' | 'push';
SWAP : 'SWAP' | 'swap';
RLCA : 'RLCA' | 'rlca';
RRCA : 'RRCA' | 'rrca';
STOP : 'STOP 0' | 'STOP' | 'stop 0' | 'stop';
HALT: 'HALT' | 'halt';
RETI: 'RETI' | 'reti';
HOME: 'HOME';
SECTION: 'SECTION';
INCLUDE: 'INCLUDE';
fragment HexPrefix : ('0x' | '$');
fragment HexPostfix : ('h' | 'H');
fragment HexDigit : ('0'..'9'|'a'..'f'|'A'..'F');
STRINGLITERAL : '"' ~["\r\n]* '"';
LIMSTRING : ('_'|'a'..'z'|'A'..'Z'|'0'..'9')+;
SEPARATOR : ',';
WS : (' '|'\t'|'\n'|'\r') ->channel(HIDDEN);
COMMENT : ';' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
In the failing case the parse appears to stop at 'op'; in the passing case it correctly drills down to 'value' and my parser picks up the information. Is there some quirk of ANTLR4 grammars that I'm missing?
I'm generating a C# parser, in case that's relevant.

It turns out the problem was the order of my hexadecimal rules.
The reason I didn't see anything change was that Visual Studio was looking at an old copy of my grammar (because Microsoft's file-path system is somewhat... alternative).
My modified grammar works perfectly.
Thanks!
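For readers hitting the same symptom, here is a rough illustration of why rule order can matter. ANTLR's lexer takes the rule that matches the longest run of input and, when several rules match the same length, the one defined first. The Python sketch below only imitates that selection with regular expressions that approximate HexInteger, Integer and LIMSTRING from the grammar above. It is not ANTLR itself, the helper names are invented for this illustration, and it does not claim to reproduce the exact change that was made to the grammar.

```python
import re

# Regex approximations of three lexer rules from the grammar (assumed equivalents).
HEX_INTEGER = re.compile(r"(?:0x|\$)[0-9a-fA-F]+|[0-9a-fA-F]+[hH]")
INTEGER = re.compile(r"[0-9]+")
LIMSTRING = re.compile(r"[_a-zA-Z0-9]+")


def longest_prefix(pattern, text, pos=0):
    """Length of the longest non-empty prefix of text[pos:] matched by pattern."""
    for end in range(len(text), pos, -1):
        if pattern.fullmatch(text, pos, end):
            return end - pos
    return 0


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match; on a tie the earlier rule wins."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        length = longest_prefix(pattern, text, pos)
        if length > best_len:  # strict '>' keeps the earlier rule on ties
            best_name, best_len = name, length
    return best_name, text[pos:pos + best_len]


# Two hypothetical orderings of the same rules.
hex_first = [("HexInteger", HEX_INTEGER), ("Integer", INTEGER), ("LIMSTRING", LIMSTRING)]
lim_first = [("LIMSTRING", LIMSTRING), ("HexInteger", HEX_INTEGER), ("Integer", INTEGER)]

print(next_token(hex_first, "0x10"))  # ('HexInteger', '0x10')
print(next_token(lim_first, "0x10"))  # ('LIMSTRING', '0x10')
print(next_token(lim_first, "$10"))   # ('HexInteger', '$10')
```

In this sketch $10 is unaffected by the ordering because '$' cannot start a LIMSTRING, whereas 0x10 is both a valid LIMSTRING and a valid HexInteger of the same length, so the earlier rule decides.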

Related

antlr getting only few symbols

I have the following grammar:
grammar Hello;
oclFile : ( 'package' packageName
oclExpressions
'endpackage'
)+;
packageName : pathName;
oclExpressions : ( constraint )*;
constraint : contextDeclaration ( Stereotype '#' number name? ':' oclExpression)+;
contextDeclaration : 'context' ( operationContext | classifierContext );
classifierContext : ( name ':' name ) | name;
operationContext : name '::' operationName '(' formalParameterList ')' ( ':' returnType )?;
Stereotype : ( 'pre' | 'post' | 'inv' );
operationName : name | '=' | '+' | '-' | '<' | '<=' | '>=' | '>' | '/' | '*' | '<>' | 'implies' | 'not' | 'or' | 'xor' | 'and';
formalParameterList : ( name ':' typeSpecifier (',' name ':' typeSpecifier )*)?;
typeSpecifier : simpleTypeSpecifier | collectionType;
collectionType : collectionKind '(' simpleTypeSpecifier ')';
oclExpression : ( letExpression )* expression;
returnType : typeSpecifier;
expression : logicalExpression;
letExpression : 'let' name ( '(' formalParameterList ')' )? ( ':' typeSpecifier )? '=' expression ';';
ifExpression : 'if' expression 'then' expression 'else' expression 'endif';
logicalExpression : relationalExpression ( logicalOperator relationalExpression)*;
relationalExpression : additiveExpression (relationalOperator additiveExpression)?;
additiveExpression : multiplicativeExpression ( addOperator multiplicativeExpression)*;
multiplicativeExpression : unaryExpression ( multiplyOperator unaryExpression)*;
unaryExpression : ( unaryOperator postfixExpression) | postfixExpression;
postfixExpression : primaryExpression ( ('.' | '->')propertyCall )*;
primaryExpression : literalCollection | literal | propertyCall | '(' expression ')' | ifExpression;
propertyCallParameters : '(' ( declarator )? ( actualParameterList )? ')';
literal : number | enumLiteral;
enumLiteral : name '::' name ( '::' name )*;
simpleTypeSpecifier : pathName;
literalCollection : collectionKind '{' ( collectionItem (',' collectionItem )*)? '}';
collectionItem : expression ('..' expression )?;
propertyCall : pathName ('#' number)? ( timeExpression )? ( qualifiers )? ( propertyCallParameters )?;
qualifiers : '[' actualParameterList ']';
declarator : name ( ',' name )* ( ':' simpleTypeSpecifier )? ( ';' name ':' typeSpecifier '=' expression )? '|';
pathName : name ( '::' name )*;
timeExpression : '#' 'pre';
actualParameterList : expression (',' expression)*;
logicalOperator : 'and' | 'or' | 'xor' | 'implies';
collectionKind : 'Set' | 'Bag' | 'Sequence' | 'Collection';
relationalOperator : '=' | '>' | '<' | '>=' | '<=' | '<>';
addOperator : '+' | '-';
multiplyOperator : '*' | '/';
unaryOperator : '-' | 'not';
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
DIGITS : '0'..'9' ;
name : (LOWERCASE | UPPERCASE | '_') ( LOWERCASE | UPPERCASE | DIGITS | '_' )* ;
number : DIGITS (DIGITS)* ( '.' DIGITS (DIGITS)* )?;
WS : [ \t\r\n]+ -> skip ;
and the following input:
package RobotsTestModel
context motor
inv#0:
self.power = 100
endpackage
but I only get "mot" in the variable text:
public override bool VisitConstraint([NotNull] HelloParser.ConstraintContext context)
{
VisitContextDeclaration(context.contextDeclaration());
string text = context.contextDeclaration().classifierContext().GetText();
Strangely, it sometimes retrieves "motoo" if I change the input, but not consistently. The output suggests that it expects Stereotype instead of "or", the last part of "motor". Where should I look to fix this?

Your grammar splits the input motor into the tokens m, o, t and or, because name is a parser rule rather than a lexer rule and therefore plays no part in token recognition. The name rule matches only the m, o and t tokens; it does not accept the or token, which is why you see these strange results.
You should define names and numbers as lexer rules instead of parser rules:
NAME : [a-zA-Z_]+ [a-zA-Z0-9_]* ;
NUMBER : [0-9]+ ('.' [0-9]+)? ;
With these rules, motor will be a single token.
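To make the difference concrete, here is a small Python simulation of the two token sets. It is only a sketch: the regular expressions stand in for the lexer rules, the 'or' keyword is modelled as the implicit token that ANTLR creates for a literal used in a parser rule, and the helper names are invented for this example.

```python
import re


def longest_prefix(pattern, text, pos):
    """Length of the longest non-empty prefix of text[pos:] matched by pattern."""
    for end in range(len(text), pos, -1):
        if pattern.fullmatch(text, pos, end):
            return end - pos
    return 0


def tokenize(rules, text):
    """Split text into tokens: longest match wins, earlier rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_len = 0
        for _name, pattern in rules:
            best_len = max(best_len, longest_prefix(pattern, text, pos))
        if best_len == 0:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(text[pos:pos + best_len])
        pos += best_len
    return tokens


# Original grammar: the 'or' literal plus one-character lexer rules.
original = [
    ("'or'", re.compile(r"or")),
    ("LOWERCASE", re.compile(r"[a-z]")),
    ("UPPERCASE", re.compile(r"[A-Z]")),
    ("DIGITS", re.compile(r"[0-9]")),
]

# Suggested grammar: whole names and numbers are lexer rules.
suggested = [
    ("'or'", re.compile(r"or")),
    ("NAME", re.compile(r"[a-zA-Z_]+[a-zA-Z0-9_]*")),
    ("NUMBER", re.compile(r"[0-9]+(?:\.[0-9]+)?")),
]

print(tokenize(original, "motor"))   # ['m', 'o', 't', 'or']
print(tokenize(suggested, "motor"))  # ['motor']
```

With the original rules the two-character keyword beats the one-character LOWERCASE match at the end of the word, which is where the stray or token comes from.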

antlr4 in dot net "mismatched input 'begin' expecting {';', '+', '-', '*', DIV, MOD}

I'm using ANTLR4 in C#.
Everything works fine, except that when I use a 'block' everything goes wrong.
For example, this is my input code:
a:int;
a:=2;
if(a==2) begin
a:= a * 2;
a:=a + 5;
end
and this is my grammar:
grammar Our;
options{
language=CSharp;
TokenLabelType=CommonToken;
ASTLabelType=CommonTree;
}
statements : statement statements
|EOF;
statement :
expression SEMI
| ifstmt
| whilestmt
| forstmt
| readstmt SEMI
| writestmt SEMI
| vardef SEMI
| block
;
block : BEGIN statements END ;
expression : ID ASSIGN expression
| boolexp;
boolexp : relexp AND boolexp
| relexp OR boolexp
| relexp;
relexp : modexp EQUAL relexp
| modexp LE relexp
| modexp GE relexp
| modexp NOTEQUAL relexp
| modexp GT relexp
| modexp LT relexp
| modexp;
modexp : modexp MOD exp
//| exp DIV modexp
| exp;
exp : exp ADD term
| exp SUB term
| term;
term : term MUL factor
| term DIV factor
| factor POW term
| factor;
factor : LPAREN expression RPAREN
| LPAREN vartype RPAREN factor
| ID
| SUB factor
| ID LPAREN explist RPAREN
| ID LPAREN RPAREN
| ID LPAREN LPAREN NUM RPAREN RPAREN
| ID LPAREN LPAREN NUM COMMA NUM RPAREN RPAREN
| const;
explist : exp COMMA explist
|exp;
const : NUM
| BooleanLiteral
| STRING;
ifstmt : IF LPAREN boolexp RPAREN statement
| IF LPAREN boolexp RPAREN statement ELSE statement ;
whilestmt : WHILE LPAREN boolexp RPAREN statement ;
forstmt : FOR ID ASSIGN exp COLON exp statement;
readstmt : READ LPAREN idlist RPAREN ;
idlist : ID COMMA idlist
|ID;
writestmt : WRITE LPAREN explist RPAREN ;
vardef : idlist COLON vartype;
vartype : basictypes
| basictypes LPAREN NUM RPAREN
| basictypes LPAREN NUM COMMA NUM RPAREN ;
basictypes : INT
| FLOAT
| CHAR
| STRING
| BOOLEAN ;
BEGIN : 'begin';
END : 'end';
To : 'to';
NEXT : 'next';
REAL : 'real';
BOOLEAN : 'boolean';
CHAR : 'char';
DO : 'do';
DOUBLE : 'double';
ELSE : 'else';
FLOAT : 'float';
FOR : 'for';
FOREACH : 'foreach';
FUNCTION : 'function';
IF : 'if';
INT : 'int';
READ : 'read';
RETURN : 'return';
VOID : 'void';
WHILE : 'while';
WEND : 'wend';
WRITE : 'write';
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
ASSIGN : ':=';
GT : '>';
LT : '<';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : '&&'|'and';
OR : '||'|'or';
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/'|'div';
MOD : '%'|'mod';
ADD_ASSIGN : '+=';
SUB_ASSIGN : '-=';
MUL_ASSIGN : '*=';
DIV_ASSIGN : '/=';
POW : '^';
BooleanLiteral : 'true'|'false';
STRING : '\"'([a-zA-Z]|NUM)*'\"';
ID : ([a-z]|[A-Z])([a-z]|[A-z]|[0-9])*;
NUM : ('+'|'-')?[0-9]([0-9]*)('.'[0-9][0-9]*)?;
WS : [ \t\r\n\u000C]+ -> skip ;
COMMENT : '/*' .*? '*/' ;
LINE_COMMENT : '//' ~[\r\n]*;
When I run the parser I get the following error messages:
no viable alternative at input 'if(a==2)begina:=a*2;a:=a+5;end'
mismatched input 'begin' expecting {';', '+', '-', '*', DIV, MOD}
no viable alternative at input 'end'
Thanks in advance.

The problem is your rule for a list of statements:
statements : statement statements | EOF ;
This rule has two options: a statement followed by another list of statements, or EOF. The only non-recursive option is the EOF, which becomes a problem when you use this in your rule for a block:
block : BEGIN statements END ;
You can never encounter EOF in the middle of a block, so when the parser reads the line before the end in your sample input, the next thing that it expects to read is another statement. The word end on its own isn't a valid statement, which is why it throws the error that you are seeing.
One possible fix is to make the recursive part of your statements rule optional:
statements : statement statements? | EOF ;
This will allow your sample input to parse successfully. In my opinion, a better option is to remove the recursion altogether:
statements : statement* | EOF ;
Finally, you can see that the EOF is still one of the options for the statements rule. This doesn't make much sense when you use this rule as part of the rule for block, since you shouldn't ever find an EOF in the middle of a block. What I would do is move it to a new top-level parser rule:
program : statements EOF ;
statements : statement* ;
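The shape of that final grammar can be sketched as a tiny hand-written recursive-descent recogniser. This is an illustration only, not what ANTLR generates: statements are reduced to a single hypothetical 'stmt' token, and the function names are made up for the example.

```python
def parse_program(tokens):
    """program : statements EOF ;  (tokens is a list of names without the EOF marker)"""
    toks = list(tokens) + ["EOF"]
    pos = parse_statements(toks, 0)
    if toks[pos] != "EOF":
        raise SyntaxError(f"unexpected {toks[pos]!r} at index {pos}")
    return True


def parse_statements(toks, pos):
    """statements : statement* ;  stops at the first token that cannot start a statement."""
    while toks[pos] in ("stmt", "begin"):
        pos = parse_statement(toks, pos)
    return pos


def parse_statement(toks, pos):
    """statement : 'stmt' | block ;  block : 'begin' statements 'end' ;"""
    if toks[pos] == "stmt":
        return pos + 1
    pos = parse_statements(toks, pos + 1)  # step over 'begin'
    if toks[pos] != "end":
        raise SyntaxError(f"expected 'end' but found {toks[pos]!r} at index {pos}")
    return pos + 1


print(parse_program(["stmt", "stmt", "begin", "stmt", "stmt", "end"]))  # True
```

Because the loop in parse_statements simply stops when it sees 'end', the block rule gets the chance to consume it, and EOF is only ever checked once at the top level.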

ANTLR: no viable alternative at input

Sorry for my bad English.
I wrote an ANTLR4 grammar for GDB/MI output records, based on this manual:
grammar GdbOutput;
output : out_of_band_record | result_record | terminator_record;
result_record : TOKEN? '^' RESULT_CLASS (',' result)*;
out_of_band_record : async_record
| stream_record;
async_record : exec_async_output
| status_async_output
| notify_async_output;
exec_async_output : TOKEN? '*' async_output;
status_async_output : TOKEN? '+' async_output;
notify_async_output : TOKEN? '=' async_output;
async_output : async_class (',' result)*;
RESULT_CLASS : 'done'
| 'running'
| 'connected'
| 'error'
| 'exit';
async_class : 'stopped'; //TODO
result : VARIABLE '=' value;
value : const
| tuple
| list;
const : c_string;
c_string : '"' STRING_LITERAL '"';
tuple : '{}'
| '{' result (',' result)* '}';
list : '[]'
| '[' value (',' value)* ']'
| '[' result (',' result)* ']';
stream_record : console_stream_output
| target_stream_output
| log_stream_output;
console_stream_output : '~' c_string;
target_stream_output : '@' c_string;
log_stream_output : '&' c_string;
terminator_record : '(gdb)';
VARIABLE : [a-z-]*;
STRING_LITERAL : (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))*;
TOKEN : [0-9]+;
I tried some output strings from GDB:
"~\"Reading symbols from C:\\src.exe...\"" - OK (out_of_band_record -> stream_record -> console_stream_output)
"(gdb)" -> OK (terminator_record)
"^done,bkpt={number=\"1\",type=\"breakpoint\",disp=\"keep\",enabled=\"y\",addr=\"0x00000000004014e4\",file=\"src.s\",fullname=\"C:\\src.s\",line=\"17\",thread-groups=[\"i1\"],times=\"0\",original-location=\"main\"}" - FAIL (with exception: line 1:0 no viable alternative at input '^done,bkpt={number=')
"^done,bkpt=\"1\"" - FAIL (line 1:0 no viable alternative at input '^done,bkpt=)
"^done,bkpt={}" - FAIL (line 1:0 no viable alternative at input '^done,bkpt={}')
Why didn't my parser recognize strings #3-5?
P.S.: I'm using the C# target for ANTLR v4.2.0 (prerelease) from NuGet.

For starters, let your STRING_LITERAL match the quotes too: don't match them in a parser rule. And let your VARIABLE rule match at least one character (change the * into a +).
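A rough way to see why the quotes matter: the lexer prefers the longest match, and a STRING_LITERAL without quotes can swallow almost any run of characters. The Python sketch below imitates that choice for the first token of a line. The regular expressions are approximations of the rules in the question, literals such as '^' are modelled as implicit tokens that come before the named rules, and the helper names are invented for the illustration.

```python
import re

STRING_BODY = r'(?:[^"\\\r\n]|\\["\\])*'


def longest_prefix(pattern, text, pos=0):
    """Length of the longest non-empty prefix of text[pos:] matched by pattern."""
    for end in range(len(text), pos, -1):
        if pattern.fullmatch(text, pos, end):
            return end - pos
    return 0


def first_token(rules, text):
    """Longest match wins; the earlier rule wins ties."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        length = longest_prefix(pattern, text)
        if length > best_len:
            best_name, best_len = name, length
    return best_name, text[:best_len]


def make_rules(string_pattern):
    return [
        ("'^'", re.compile(r"\^")),
        ("'~'", re.compile(r"~")),
        ("RESULT_CLASS", re.compile(r"done|running|connected|error|exit")),
        ("VARIABLE", re.compile(r"[a-z-]+")),
        ("STRING_LITERAL", re.compile(string_pattern)),
    ]


as_posted = make_rules(STRING_BODY)                # quotes matched in the parser
with_quotes = make_rules('"' + STRING_BODY + '"')  # quotes matched in the lexer

print(first_token(as_posted, '^done,bkpt={}'))    # ('STRING_LITERAL', '^done,bkpt={}')
print(first_token(with_quotes, '^done,bkpt={}'))  # ("'^'", '^')
print(first_token(as_posted, '~"Reading"'))       # ("'~'", '~')
```

In this simulation the whole of ^done,bkpt={} becomes one STRING_LITERAL under the posted rules, so the parser never sees a '^' token, while the ~"..." record still works because the quote stops the unquoted rule after one character and the '~' literal wins the tie.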

ANTLR Grammar and generated code problems

I'm trying to create an expression parser using ANTLR.
The expression will go inside an if statement, so its root is a condition.
I have the following grammar, which "compiles" to parser/lexer files with no problems. However, the generated code itself has some errors: essentially two "empty" if statements,
i.e.
if (())
I'm not sure what I'm doing wrong; any help will be greatly appreciated.
Thanks.
Grammar .g file below:
grammar Expression;
options {
language=CSharp3;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
@parser::namespace { Antlr3 }
@lexer::namespace { Antlr3 }
public parse
: orcond EOF -> ^(ROOT orcond)
;
orcond
: andcond ('||' andcond)*
;
andcond
: condition ('&&' condition)*
;
condition
: exp (('<' | '>' | '==' | '!=' | '<=' | '>=')^ exp)?
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' parenthesisvalid ')' -> parenthesisvalid
;
parenthesisvalid
: fullobjectref
| orcond
;
fullobjectref
: objectref ('.' objectref)?
;
objectref
: objectname ('()' | '(' params ')' | '[' params ']')?
;
objectname
: (('a'..'z') | ('A'..'Z'))^ (('a'..'z') | ('A'..'Z') | ('0'..'9') | '_')*
;
params
: paramitem (',' paramitem)?
;
paramitem
: unaryExp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;

Don't use the range operator, .., inside parser rules.
Remove the parser rule objectname and create the lexer rule:
Objectname
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
;

Additionally, it is neater to use fragment rules. In that case your code would look something like this:
Objectname : LETTER (LETTER | DIGIT | '_')*;
Number: DIGIT+ ('.' DIGIT+)?;
fragment DIGIT : '0'..'9' ;
fragment LETTER : ('a'..'z' | 'A'..'Z');

ANTLR3 common values in 2 different domain values

I need to define a language-parser for the following search criteria:
CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;
Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.
I decided to use ANTLR3 (v3.4) with C# output (CSharp3), and it worked pretty smoothly until now. The problem is that it fails to parse the string when I provide a value that belongs to both data sets (i.e. '5' in this case). For example, if I provide the following string
CRITERIA_1=5;
It returns the following error where the value node was supposed to be:
<unexpected: [@1,11:11='5',<27>,1:11], resync=5>
The grammar definition file is the following:
grammar ZeGrammar;
options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
k=3;
}
tokens
{
ROOT;
CRITERIA_1;
CRITERIA_2;
OR = 'OR';
AND = 'AND';
EOF = ';';
LPAREN = '(';
RPAREN = ')';
}
public
start
: expr EOF -> ^(ROOT expr)
;
expr
: subexpr ((AND|OR)^ subexpr)*
;
subexpr
: grouppedsubexpr
| 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
| 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
;
grouppedsubexpr
: LPAREN! expr RPAREN!
;
rangeval1_expr
: rangeval1_subexpr
| RANGE1_VALUES
;
rangeval1_subexpr
: LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
;
RANGE1_VALUES
: (('0'..'4')? ('0'..'9') | '5''0')
;
rangeval2_expr
: rangeval2_subexpr
| RANGE2_VALUES
;
rangeval2_subexpr
: LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
;
RANGE2_VALUES
: '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
;
If I remove the value '5' from RANGE2_VALUES, it works fine. Can anyone give me a hint as to what I am doing wrong?

You must realize that the lexer does not produce tokens based on what the parser tries to match. So, in your case, the input "5" will always be tokenized as a RANGE1_VALUES and never as a RANGE2_VALUES because both RANGE1_VALUES and RANGE2_VALUES can match this input but RANGE1_VALUES comes first (so RANGE1_VALUES takes precedence over RANGE2_VALUES).
A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:
D0_4
: '0'..'4'
;
D5
: '5'
;
D6_50
: '6'..'9' // 6-9
| '1'..'4' '0'..'9' // 10-49
| '50' // 50
;
A_B_C
: ('a'|'A')
| ('b'|'B')
| ('c'|'C')
;
and then introduce these new parser rules:
range1_values
: D0_4
| D5
| D6_50
;
range2_values
: A_B_C
| D5
;
and replace all references to RANGE1_VALUES and RANGE2_VALUES in your parser rules with range1_values and range2_values respectively.
EDIT
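Instead of trying to solve this at the lexer level, you could simply match any integer value and use a semantic predicate inside the parser rule to check that the value is the correct one (or within the correct range). The predicates below are written as Java expressions; with the CSharp3 target they would need to be rewritten in C#: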
range1_values
: INT {Integer.valueOf($INT.text) <= 50}?
;
range2_values
: A_B_C
| INT {Integer.valueOf($INT.text) == 5}?
;
INT
: '0'..'9'+
;
A_B_C
: 'a'..'c'
| 'A'..'C'
;
