ANTLR4 Mismatch in simple grammar - c#

I am using the latest version of Antlr (4.3) to parse this simple source file. I'm using the Visual Studio add-in, but that should not have anything to do with my problem.
Source file:
OBJECT Codeunit 80 Sales-Post
{
OBJECT-PROPERTIES
{
Date=11/12/10;
Time=12:00:00;
Version List=NAVW16.00.10,NAVBE6.00.01;
}
}
This should be pretty straightforward to parse, but I keep getting 2 errors while parsing:
line 1:16 mismatched input '80' expecting DOCUMENT_ID
line 5:4 mismatched input 'Date' expecting {DOCUMENT_PROPERTY_ID, '}'}
Complete grammar:
grammar Cal;
/*
* Parser Rules
*/
document
: document_header OPEN_BRACE document_content CLOSE_BRACE
;
document_header
: OBJECT_DEFINITION DOCUMENT_TYPE DOCUMENT_ID DOCUMENT_NAME
;
document_content
: document_properties
;
document_properties
: OBJECT_PROPERTIES OPEN_BRACE document_property* CLOSE_BRACE
;
document_property
: DOCUMENT_PROPERTY_ID EQ DOCUMENT_PROPERTY_VALUE LINE_TERM
;
/*
* Lexer Rules
*/
OBJECT_PROPERTIES
: 'OBJECT-PROPERTIES'
;
OBJECT_DEFINITION
: 'OBJECT'
;
DOCUMENT_TYPE
: 'Codeunit'
| 'Table'
;
DOCUMENT_PROPERTY_VALUE
: ([0-9a-zA-Z]|'_'|'-'|'.'|'/'|','|':')+
;
DOCUMENT_PROPERTY_ID
: 'Date'
| 'Time'
| 'Version List'
;
DOCUMENT_ID
: [0-9]+
;
DOCUMENT_NAME
: ID
;
OPEN_BRACE
: '{'
;
CLOSE_BRACE
: '}'
;
LINE_TERM
: ';'
;
EQ
: '='
;
ID
: ([a-zA-Z]|'_'|'-')+
;
INT
: [0-9]+
;
WS
: [ \t]+ -> channel(HIDDEN)
;
NEWLINE
:'\r'? '\n' -> channel(HIDDEN)
;
This is the output of the token stream (tokens are surrounded with '<>':
<OBJECT> < > <Codeunit> < > <80> < > <Sales-Post> <
> <{> <
> < > <OBJECT-PROPERTIES> <
> < > <{> <
> < > <Date> <=> <11/12/10> <;> <
> < > <Time> <=> <12:00:00> <;> <
> < > <Version List> <=> <NAVW16.00.10,NAVBE6.00.01> <;> <
> < > <}> <
> <}> <<EOF>

In ANTLR, when two lexer rules can match the same token (of the same length), the rule that appears first wins.
80 can be matched by DOCUMENT_ID, but also by DOCUMENT_PROPERTY_VALUE and INT, so just reorder these rules here.
You have the same problem here with DOCUMENT_PROPERTY_ID which is below DOCUMENT_PROPERTY_VALUE (both can match Date).
I suggest you put DOCUMENT_PROPERTY_VALUE just above WS: most specific rules (ie keywords) go first, and broader rules last.
You also have to get rid of DOCUMENT_ID or INT, as they have the same definition. One of them will never match. You don't seem to use INT in the parser, so just remove the rule.

Related

Can I use Antlr with damaged/incomplete input and if so - how?

Can rules/parser/lexer be set up so as to accept input that conforms to the expected structure, but the static (predefined) tokens are not written in full?
Example:
I have an ANTLR4 grammar (C# target) that I use to parse some input and use it to run specific methods of my application.
(made-up):
grammar:
setWage
: SETWAGE userId=STRING value=NUMBER
;
SETWAGE
: 'setWage'
;
input:
setWage john.doe 2000
A listener that walks the parse tree in method for setWage rule (after getting text from labeled tokens) would call for example:
SalaryManager.SetWage(User.GetById("john.doe"), 2000);
My question: can Antlr (or the grammar) be set up so as to allow for example for such input:
setW john.doe 2000
assuming that there are no rules for e.g. "setWater" or "setWindow", or assuming that there are and I'm fine with Antlr choosing one of those by itself (albeit, consistently the same one).
Please note that this question is mostly academical and I'm not looking for a better way to achieve that input->action linking.
You probably know this already, but you can elaborate the set of possible input matches
SETWAGE : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
or
SETWAGE : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
Not sure if the latter satisfies your requirement that "the static (predefined) tokens are not written in full".
Hard-coding the "synonyms" could be tedious, but how many do you need?
Here's an example I wrote to validate the approach. (Java target, but that shouldn't matter)
actions.g4
grammar actions ;
actions : action+;
action : setWage | deductSum ;
setWage : SETWAGEOP userId=SYMBOL value=NUMBER ;
deductSum : DEDUCTSUMOP userId=SYMBOL value=NUMBER ;
//SETWAGEOP : 'setW' | 'setWa' | 'setWag' | 'setWage' ;
SETWAGEOP : 'set' ('W' ('a' ('g' ('e')? )? )? ) ;
DEDUCTSUMOP : 'deduct' ('S' ('u' ('m')? )? ) ;
WS : [ \t\n\r]+ -> channel(HIDDEN) ;
SYMBOL : [a-zA-Z][a-zA-Z0-9\.]* ;
NUMBER : [0-9]+ ;
testinput
setW john.doe 2000
deductS john.doe 50
setWag joe.doe.III 2002
deductSu joe.doe 40
setWage jane.doe 2004
deductSum john.doe 50
Transcript:
$ antlr4 actions.g4 ; javac actions*.java ; grun actions actions -tree < testinput
(actions (action (setWage setW john.doe 2000)) (action (deductSum deductS john.doe 50)) (action (setWage setWag joe.doe.III 2002)) (action (deductSum deductSu joe.doe 40)) (action (setWage setWage jane.doe 2004)) (action (deductSum deductSum john.doe 50)))

Using ANTLR Parser and Lexer Separatly

I used ANTLR version 4 for creating compiler.First Phase was the Lexer part. I created "CompilerLexer.g4" file and putted lexer rules in it.It works fine.
CompilerLexer.g4:
lexer grammar CompilerLexer;
INT : 'int' ; //1
FLOAT : 'float' ; //2
BEGIN : 'begin' ; //3
END : 'end' ; //4
To : 'to' ; //5
NEXT : 'next' ; //6
REAL : 'real' ; //7
BOOLEAN : 'bool' ; //8
.
.
.
NOTEQUAL : '!=' ; //46
AND : '&&' ; //47
OR : '||' ; //48
POW : '^' ; //49
ID : [a-zA-Z]+ ; //50
WS
: ' ' -> channel(HIDDEN) //50
;
Now it is time for phase 2 which is the parser.I created "CompilerParser.g4" file and putted grammars in it but have dozens warning and errors.
CompilerParser.g4:
parser grammar CompilerParser;
options { tokenVocab = CompilerLexer; }
STATEMENT : EXPRESSION SEMIC
| IFSTMT
| WHILESTMT
| FORSTMT
| READSTMT SEMIC
| WRITESTMT SEMIC
| VARDEF SEMIC
| BLOCK
;
BLOCK : BEGIN STATEMENTS END
;
STATEMENTS : STATEMENT STATEMENTS*
;
EXPRESSION : ID ASSIGN EXPRESSION
| BOOLEXP
;
RELEXP : MODEXP (GT | LT | EQUAL | NOTEQUAL | LE | GE | AND | OR) RELEXP
| MODEXP
;
.
.
.
VARDEF : (ID COMA)* ID COLON VARTYPE
;
VARTYPE : INT
| FLOAT
| CHAR
| STRING
;
compileUnit
: EOF
;
Warning and errors:
implicit definition of token 'BLOCK' in parser
implicit definition of token 'BOOLEXP' in parser
implicit definition of token 'EXP' in parser
implicit definition of token 'EXPLIST' in parser
lexer rule 'BLOCK' not allowed in parser
lexer rule 'EXP' not allowed in parser
lexer rule 'EXPLIST' not allowed in parser
lexer rule 'EXPRESSION' not allowed in parser
Have dozens of these warning and errors. What is the cause?
General Questions: What is difference between using combined grammar and using lexer and parser separately? How should join separate grammar and lexer files?
Lexer rules start with a capital letter, and parser rules start with a lowercase letter. In a parser grammar, you can't define tokens. And since ANTLR thinks all your upper-cased rules lexer rules, it produces theses errors/warning.
EDIT
user2998131 wrote:
General Questions: What is difference between using combined grammar and using lexer and parser separately?
Separating the lexer and parser rules will keeps things organized. Also, when creating separate lexer and parser grammars, you can't (accidentally) put literal tokens inside your parser grammar but will need to define all tokens in your lexer grammar. This will make it apparent which lexer rules get matched before others, and you can't make any typo's inside recurring literal tokens:
grammar P;
r1 : 'foo' r2;
r2 : r3 'foo '; // added an accidental space after 'foo'
But when you have a parser grammar, you can't make that mistake. You will have to use the lexer rule that matches 'foo':
parser grammar P
options { tokenVocab=L; }
r1 : FOO r2;
r2 : r3 FOO;
lexer grammar L;
FOO : 'foo';
user2998131 wrote:
How should join separate grammar and lexer files?
Just like you do in your parser grammar: you point to the proper tokenVocab inside the options { ... } block.
Note that you can also import grammars, which is something different: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports

Regex: Matching individual characters only (no characters containted in strings)

Aim: To split a string Regex.Split(...) based on a pattern of individual characters, leaving the character matched at the beginning of a split list.
Problem: One of the characters can appear in other parts of the string I don't wish to split on and I'm getting more list items than intended.
Example of string to split: T 2 TBS PO And > Qd PRN MIX X A 3 TB \ A 4 TB Xmon UG
Outcome desired:
T 2 TBS PO And
> Qd PRN MIX
X A 3 TB
\ A 4 TB Xmon UG
Pattern: (?=[#\+X\\>])
This works for everything except the X. Instead of the desired outcome, I'm getting it split in undesired places.
Current outcome:
T 2 TBS PO And
> Qd PRN MI
X
X A 3 TB
\ A 4 TB X
mon UG
Basically, I need it to not split on a string of characters only when it's on it's own.
Thanks in advance for your help
UPDATE: Oops! I seemed to have forgot to mention that that the centre of the pattern, the characters to split by, have been pulled from a table and technically, I don't know there's the X there beforehand (they may also change.)
For this reason, Jonny 5/Jerry's suggestion seems the most viable to me. I'll test when I get into work.
You could put some \s in there to make sure the characters you are matching are alone:
(?<=\s)(?=[#\+\\>X]\s)
(?<=\s) makes sure the character is preceded by a space, and the space that follows makes sure the character is followed by a space.
Note: where 'space' is mentioned above, it actually means whitespace, tab, newline, carriage return.
Just split your regex into two and piped them:
(?=[#\+\\>])|(?=\bX\b)
(?=[#\+\\>]) checking for your regular characters.
(?=\bX\b) is checking for alone X
Why not roll your own instead of using a regular expression:
public IEnumerable<string> CustomSplit( string source )
{
StringBuilder buf = new StringBuilder();
for ( int i = 0 ; i < source.Length ; ++i )
{
char curr = source[i] ;
char next = i+1 < source.Length ? source[i+1] : ' ' ;
bool isDelimiter = curr == '#'
| curr == '+'
| curr == '\\'
| curr == '>'
| ( curr == 'X' && char.IsWhiteSpace(next) )
;
if ( isDelimiter )
{
if ( buf.Length > 0 ) yield return buf.ToString() ;
buf.Length = 0 ;
}
buf.Append(curr) ;
}
// return the last element, if there is one.
if ( buf.Length > 0 ) yield return buf.ToString() ;
}

ANTLR "[Token name] is already defined" errors when building in visual studio 2012

I have a very simple ANTLR parser building in Visual Studio 2012. It works. But when it builds the grammar file, it emits a warning for every token, saying that the token is already defined. What could be causing this?
Here is the grammar file SimpleCalc.g4:
grammar SimpleCalc;
options {
language=CSharp2;
}
tokens {
PLUS,
MINUS,
TIMES,
DIV
}
#members {
}
expr : term ( (PLUS|MINUS) term )* ;
term : factor ( ( TIMES|DIV ) factor )* ;
factor : NUMBER ;
DIV : '/';
PLUS : '+';
TIMES: '*';
MINUS: '-';
NUMBER : (DIGIT)+ {System.Console.WriteLine("Found number"); };
WHITESPACE: ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ -> skip ;
fragment DIGIT : '0'..'9';
And here are the warnings:
[path]\SimpleCalc.g4(8,3): warning AC0108: token name 'PLUS' is already defined
[path]\SimpleCalc.g4(9,3): warning AC0108: token name 'MINUS' is already defined
[path]\SimpleCalc.g4(10,3): warning AC0108: token name 'TIMES' is already defined
[path]\SimpleCalc.g4(11,3): warning AC0108: token name 'DIV' is already defined
I would get rid of the unnecessary tokens {...} block.

Using ANTLR 3.3?

I'm trying to get started with ANTLR and C# but I'm finding it extraordinarily difficult due to the lack of documentation/tutorials. I've found a couple half-hearted tutorials for older versions, but it seems there have been some major changes to the API since.
Can anyone give me a simple example of how to create a grammar and use it in a short program?
I've finally managed to get my grammar file compiling into a lexer and parser, and I can get those compiled and running in Visual Studio (after having to recompile the ANTLR source because the C# binaries seem to be out of date too! -- not to mention the source doesn't compile without some fixes), but I still have no idea what to do with my parser/lexer classes. Supposedly it can produce an AST given some input...and then I should be able to do something fancy with that.
Let's say you want to parse simple expressions consisting of the following tokens:
- subtraction (also unary);
+ addition;
* multiplication;
/ division;
(...) grouping (sub) expressions;
integer and decimal numbers.
An ANTLR grammar could look like this:
grammar Expression;
options {
language=CSharp2;
}
parse
: exp EOF
;
exp
: addExp
;
addExp
: mulExp (('+' | '-') mulExp)*
;
mulExp
: unaryExp (('*' | '/') unaryExp)*
;
unaryExp
: '-' atom
| atom
;
atom
: Number
| '(' exp ')'
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Now to create a proper AST, you add output=AST; in your options { ... } section, and you mix some "tree operators" in your grammar defining which tokens should be the root of a tree. There are two ways to do this:
add ^ and ! after your tokens. The ^ causes the token to become a root and the ! excludes the token from the ast;
by using "rewrite rules": ... -> ^(Root Child Child ...).
Take the rule foo for example:
foo
: TokenA TokenB TokenC TokenD
;
and let's say you want TokenB to become the root and TokenA and TokenC to become its children, and you want to exclude TokenD from the tree. Here's how to do that using option 1:
foo
: TokenA TokenB^ TokenC TokenD!
;
and here's how to do that using option 2:
foo
: TokenA TokenB TokenC TokenD -> ^(TokenB TokenA TokenC)
;
So, here's the grammar with the tree operators in it:
grammar Expression;
options {
language=CSharp2;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
#parser::namespace { Demo.Antlr }
#lexer::namespace { Demo.Antlr }
parse
: exp EOF -> ^(ROOT exp)
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
I also added a Space rule to ignore any white spaces in the source file and added some extra tokens and namespaces for the lexer and parser. Note that the order is important (options { ... } first, then tokens { ... } and finally the #... {}-namespace declarations).
That's it.
Now generate a lexer and parser from your grammar file:
java -cp antlr-3.2.jar org.antlr.Tool Expression.g
and put the .cs files in your project together with the C# runtime DLL's.
You can test it using the following class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Preorder(ITree Tree, int Depth)
{
if(Tree == null)
{
return;
}
for (int i = 0; i < Depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
Preorder(Tree.GetChild(0), Depth + 1);
Preorder(Tree.GetChild(1), Depth + 1);
}
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("(12.5 + 56 / -7) * 0.5");
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
ExpressionParser.parse_return ParseReturn = Parser.parse();
CommonTree Tree = (CommonTree)ParseReturn.Tree;
Preorder(Tree, 0);
}
}
}
which produces the following output:
ROOT
*
+
12.5
/
56
UNARY_MIN
7
0.5
which corresponds to the following AST:
(diagram created using graph.gafol.net)
Note that ANTLR 3.3 has just been released and the CSharp target is "in beta". That's why I used ANTLR 3.2 in my example.
In case of rather simple languages (like my example above), you could also evaluate the result on the fly without creating an AST. You can do that by embedding plain C# code inside your grammar file, and letting your parser rules return a specific value.
Here's an example:
grammar Expression;
options {
language=CSharp2;
}
#parser::namespace { Demo.Antlr }
#lexer::namespace { Demo.Antlr }
parse returns [double value]
: exp EOF {$value = $exp.value;}
;
exp returns [double value]
: addExp {$value = $addExp.value;}
;
addExp returns [double value]
: a=mulExp {$value = $a.value;}
( '+' b=mulExp {$value += $b.value;}
| '-' b=mulExp {$value -= $b.value;}
)*
;
mulExp returns [double value]
: a=unaryExp {$value = $a.value;}
( '*' b=unaryExp {$value *= $b.value;}
| '/' b=unaryExp {$value /= $b.value;}
)*
;
unaryExp returns [double value]
: '-' atom {$value = -1.0 * $atom.value;}
| atom {$value = $atom.value;}
;
atom returns [double value]
: Number {$value = Double.Parse($Number.Text, CultureInfo.InvariantCulture);}
| '(' exp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
which can be tested with the class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Main (string[] args)
{
string expression = "(12.5 + 56 / -7) * 0.5";
ANTLRStringStream Input = new ANTLRStringStream(expression);
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
Console.WriteLine(expression + " = " + Parser.parse());
}
}
}
and produces the following output:
(12.5 + 56 / -7) * 0.5 = 2.25
EDIT
In the comments, Ralph wrote:
Tip for those using Visual Studio: you can put something like java -cp "$(ProjectDir)antlr-3.2.jar" org.antlr.Tool "$(ProjectDir)Expression.g" in the pre-build events, then you can just modify your grammar and run the project without having to worry about rebuilding the lexer/parser.
Have you looked at Irony.net? It's aimed at .Net and therefore works really well, has proper tooling, proper examples and just works. The only problem is that it is still a bit 'alpha-ish' so documentation and versions seem to change a bit, but if you just stick with a version, you can do nifty things.
p.s. sorry for the bad answer where you ask a problem about X and someone suggests something different using Y ;^)
My personal experience is that before learning ANTLR on C#/.NET, you should spare enough time to learn ANTLR on Java. That gives you knowledge on all the building blocks and later you can apply on C#/.NET.
I wrote a few blog posts recently,
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-i/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-ii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iv/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-v/
The assumption is that you are familiar with ANTLR on Java and is ready to migrate your grammar file to C#/.NET.
There is a great article on how to use antlr and C# together here:
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx
it's a "how it was done" article by the creator of NCalc which is a mathematical expression evaluator for C# - http://ncalc.codeplex.com
You can also download the grammar for NCalc here:
http://ncalc.codeplex.com/SourceControl/changeset/view/914d819f2865#Grammar%2fNCalc.g
example of how NCalc works:
Expression e = new Expression("Round(Pow(Pi, 2) + Pow([Pi2], 2) + X, 2)");
e.Parameters["Pi2"] = new Expression("Pi * Pi");
e.Parameters["X"] = 10;
e.EvaluateParameter += delegate(string name, ParameterArgs args)
{
if (name == "Pi")
args.Result = 3.14;
};
Debug.Assert(117.07 == e.Evaluate());
hope its helpful

Categories

Resources