Using ANTLR Parser and Lexer Separately - c#

I am using ANTLR version 4 to create a compiler. The first phase was the lexer. I created a "CompilerLexer.g4" file and put the lexer rules in it. It works fine.
CompilerLexer.g4:
lexer grammar CompilerLexer;
INT : 'int' ; //1
FLOAT : 'float' ; //2
BEGIN : 'begin' ; //3
END : 'end' ; //4
To : 'to' ; //5
NEXT : 'next' ; //6
REAL : 'real' ; //7
BOOLEAN : 'bool' ; //8
.
.
.
NOTEQUAL : '!=' ; //46
AND : '&&' ; //47
OR : '||' ; //48
POW : '^' ; //49
ID : [a-zA-Z]+ ; //50
WS
: ' ' -> channel(HIDDEN) //50
;
Now it is time for phase 2, the parser. I created a "CompilerParser.g4" file and put the grammar rules in it, but I get dozens of warnings and errors.
CompilerParser.g4:
parser grammar CompilerParser;
options { tokenVocab = CompilerLexer; }
STATEMENT : EXPRESSION SEMIC
| IFSTMT
| WHILESTMT
| FORSTMT
| READSTMT SEMIC
| WRITESTMT SEMIC
| VARDEF SEMIC
| BLOCK
;
BLOCK : BEGIN STATEMENTS END
;
STATEMENTS : STATEMENT STATEMENTS*
;
EXPRESSION : ID ASSIGN EXPRESSION
| BOOLEXP
;
RELEXP : MODEXP (GT | LT | EQUAL | NOTEQUAL | LE | GE | AND | OR) RELEXP
| MODEXP
;
.
.
.
VARDEF : (ID COMA)* ID COLON VARTYPE
;
VARTYPE : INT
| FLOAT
| CHAR
| STRING
;
compileUnit
: EOF
;
Warnings and errors:
implicit definition of token 'BLOCK' in parser
implicit definition of token 'BOOLEXP' in parser
implicit definition of token 'EXP' in parser
implicit definition of token 'EXPLIST' in parser
lexer rule 'BLOCK' not allowed in parser
lexer rule 'EXP' not allowed in parser
lexer rule 'EXPLIST' not allowed in parser
lexer rule 'EXPRESSION' not allowed in parser
I get dozens of these warnings and errors. What is the cause?
General questions: What is the difference between using a combined grammar and using a separate lexer and parser? How should I join the separate parser and lexer grammar files?

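Lexer rules start with an uppercase letter, and parser rules start with a lowercase letter. You can't define tokens in a parser grammar. Because ANTLR treats all of your uppercase rules as lexer rules, it produces these errors and warnings. Rename the rules in CompilerParser.g4 so that they start with a lowercase letter (for example, statement instead of STATEMENT).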
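To make the naming convention concrete, here is a minimal Python sketch. It is not part of ANTLR; `rule_kind` is a hypothetical helper that only mimics the first-letter check described above, applied to a few rule names from the question.

```python
def rule_kind(name: str) -> str:
    """Classify a grammar rule name by the case of its first letter."""
    if not name:
        raise ValueError("rule name must not be empty")
    return "lexer" if name[0].isupper() else "parser"


# Rule names taken from the question's CompilerParser.g4.
for name in ["STATEMENT", "BLOCK", "EXPRESSION", "compileUnit"]:
    print(f"{name}: {rule_kind(name)} rule")

# Renaming a rule so it starts with a lowercase letter makes it a parser rule.
print(f"statement: {rule_kind('statement')} rule")
```

Only compileUnit is classified as a parser rule, which is why every other rule in the parser grammar is reported as "lexer rule ... not allowed in parser".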
EDIT
user2998131 wrote:
General questions: What is the difference between using a combined grammar and using a separate lexer and parser?
Separating the lexer and parser rules keeps things organized. Also, when you create separate lexer and parser grammars, you can't (accidentally) put literal tokens inside your parser grammar; you need to define all tokens in your lexer grammar. This makes it apparent which lexer rules get matched before others, and you can't make typos in recurring literal tokens. In a combined grammar, such a typo is easy to make:
grammar P;
r1 : 'foo' r2;
r2 : r3 'foo '; // added an accidental space after 'foo'
But when you have a parser grammar, you can't make that mistake. You will have to use the lexer rule that matches 'foo':
parser grammar P;
options { tokenVocab=L; }
r1 : FOO r2;
r2 : r3 FOO;
lexer grammar L;
FOO : 'foo';
user2998131 wrote:
How should I join the separate parser and lexer grammar files?
Just like you do in your parser grammar: you point to the proper tokenVocab inside the options { ... } block.
Note that you can also import grammars, which is something different: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports

Related

ANTLR4 Nested modes?

I'm attempting to parse the following string:
<<! variable, my_variable, A description of my variable !>>
From the reading I've been doing here, I believe I need to use modes to distinguish between the lexers for the literal string 'variable', the variable name (my_variable), and the variable description.
The problem I'm having is that I'm not sure how to structure this. Is it possible to nest modes? Is there a better/smarter way to organize my lexer rules?
lexer grammar VariableLexer;
variableMarkdown : DELIMITER_OPEN SPACE VARIABLE COMMA SPACE variable_name COMMA SPACE description SPACE DELIMITER_CLOSE;
description : WORDS ;
variable_name : ID ;
DELIMITER_OPEN : '<<!' ;
DELIMITER_CLOSE : '!>>';
COMMA : ',' ;
SPACE : ' ' ;
VARIABLE : 'variable' -> pushMode(VariableName);
mode VariableName;
ID : LOWERCASE ( LOWERCASE | NUMBER | UNDERSCORE )* -> pushMode(VariableDescription) ;
mode VariableDescription;
WORDS : ( UPPERCASE | LOWERCASE | NUMBER | SPACE )+ -> popMode;
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment UNDERSCORE : '_' ;
fragment NUMBER : '0'..'9' ;
First, you can't have parser rules in a lexer grammar: parser rules start with a lowercase letter, lexer rules with an uppercase one.
I'd do it like this (the syntax may not be exactly right, but you'll get the idea):
//default mode is implicitly defined by (or in) ANTLR4
VARIABLE : 'variable' (' ')* ',' -> mode(mode_VariableName);
...
mode mode_VariableName;
//define token with anything ending with comma, many ways to do this...
fragment VAR_NAME_FRAG: [a-zA-Z_0-9];
VARIABLE_NAME: VAR_NAME_FRAG VAR_NAME_FRAG* (' ')* ',' -> mode(mode_varDesc);
mode mode_varDesc;
//similar again for variable description
VAR_DESC: ~[!]+ ; //sketch only: should more or less match anything except the closing delimiter
END_VAR: '!>>' -> mode(DEFAULT_MODE);
Basically in this way you are jumping to modes you need instead of pushing and popping.
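The mode-jumping idea can be illustrated outside ANTLR. The following is a rough Python sketch, not ANTLR output: the `MODES` table, the regular expressions and the `tokenize` function are all assumptions made for illustration. Each rule names the mode to jump to after it matches, just like `-> mode(...)` above.

```python
import re

# Each mode maps to an ordered list of (token name, pattern, next mode).
# A next mode of None means "stay in the current mode".
MODES = {
    "DEFAULT": [
        ("DELIMITER_OPEN", r"<<!\s*", None),
        ("VARIABLE", r"variable\s*,\s*", "VARIABLE_NAME"),
    ],
    "VARIABLE_NAME": [
        ("VARIABLE_NAME", r"[a-z][a-z0-9_]*\s*,\s*", "VARIABLE_DESC"),
    ],
    "VARIABLE_DESC": [
        ("END_VAR", r"!>>", "DEFAULT"),
        ("VAR_DESC", r"[^!]+", None),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching modes as rules match."""
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        for name, pattern, next_mode in MODES[mode]:
            match = re.compile(pattern).match(text, pos)
            if match:
                # Drop the separators (spaces, commas) the rule consumed.
                tokens.append((name, match.group().strip(" ,")))
                pos = match.end()
                mode = next_mode or mode
                break
        else:
            raise ValueError(f"no rule in mode {mode} matches at position {pos}")
    return tokens


for token in tokenize("<<! variable, my_variable, A description of my variable !>>"):
    print(token)
```

Like the grammar sketch, this stops the description at the first '!', so a description that contains '!' would need a more careful rule.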

ANTLR rule to skip method body

My task is to create an ANTLR grammar that analyses C# source files and generates a class hierarchy, which I will then use to generate a class diagram.
I wrote rules to parse namespaces, class declarations and method declarations. Now I have a problem with skipping method bodies. I don't need to parse them, because the bodies are useless for my task.
I wrote a simple rule:
body:
'{' .* '}'
;
but it does not work properly when a method looks like this:
void foo()
{
...
{
...
}
...
}
The rule matches the first brace, which is OK. Then it matches
...
{
...
as 'any' (.*) and takes the third brace as the final brace, which is not OK, and the rule ends.
Could anybody help me write a proper rule for method bodies? As I said, I don't want to parse them, only skip them.
UPDATE:
Here is the solution to my problem, closely based on Adam12's answer:
body:
'{' ( ~('{' | '}') | body)* '}'
;
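The recursion in this rule can be traced with a small stand-alone sketch. This is a hypothetical Python helper (`skip_body` is not part of ANTLR) that follows the same three alternatives: a nested body, the closing brace, or any other character.

```python
def skip_body(text: str, start: int) -> int:
    """Return the index just past the '}' that closes the '{' at text[start].

    Mirrors the rule:  body : '{' ( ~('{' | '}') | body )* '}' ;
    """
    if text[start] != "{":
        raise ValueError("body must start with '{'")
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == "{":
            pos = skip_body(text, pos)  # nested body: recurse
        elif ch == "}":
            return pos + 1              # closing brace of this body
        else:
            pos += 1                    # any character other than a brace
    raise ValueError("unbalanced braces")


source = "void foo() { a; { b; } c; } int x;"
start = source.index("{")
end = skip_body(source, start)
print(source[start:end])
```

Like the grammar rule, this sketch ignores braces inside string literals and comments, so it only works once those have been dealt with separately.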
You have to use recursive rules that match parentheses pairs.
rule1 : '('
(
nestedParan
| (~')')*
)
')';
nestedParan : '('
(
nestedParan
| (~')')*
)
')';
This code assumes you are using the parser here so strings and comments are already excluded. ANTLR doesn't allow negation of multiple alternatives in parser rules so the code above relies on the fact that alternatives are tried in order. It should give a warning that alternatives 1 and 2 both match '(' and thus choose the first alternative, which is what we want.
You can handle the recursion of (nested) blocks in your lexer. The trick is to let your class definition also include the opening { so that not the entire contents of the class is gobbled up by this recursive lexer rule.
A quick demo that is without a doubt not complete, but is a decent start to "fuzzy parse/lex" a Java (or C# with some slight modifications) source file:
grammar T;
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));})* EOF
;
Skip
: (StringLiteral | CharLiteral | Comment) {skip();}
;
PackageDecl
: 'package' Spaces Ids {setText($Ids.text);}
;
ClassDecl
: 'class' Spaces Id Spaces? '{' {setText($Id.text);}
;
Method
: Id Spaces? ('(' {setText($Id.text);}
| /* no method after all! */ {skip();}
)
;
MethodOrStaticBlock
: Block {skip();}
;
Any
: . {skip();}
;
// fragments
fragment Spaces
: (' ' | '\t' | '\r' | '\n')+
;
fragment Ids
: Id ('.' Id)*
;
fragment Id
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
fragment Block
: '{' ( ~('{' | '}' | '"' | '\'' | '/')
| {input.LA(2) != '/'}?=> '/'
| StringLiteral
| CharLiteral
| Comment
| Block
)*
'}'
;
fragment Comment
: '/*' .* '*/'
| '//' ~('\r' | '\n')*
;
fragment CharLiteral
: '\'' ('\\\'' | ~('\\' | '\'' | '\r' | '\n'))+ '\''
;
fragment StringLiteral
: '"' ('\\"' | ~('\\' | '"' | '\r' | '\n'))* '"'
;
I ran the generated parser against the following Java source file:
/*
... package NO.PACKAGE; ...
*/
package foo.bar;
public final class Mu {
static String x;
static {
x = "class NotAClass!";
}
void m1() {
// {
while(true) {
double a = 2.0 / 2;
if(a == 1.0) { break; } // }
/* } */
}
}
static class Inner {
int m2 () {return 42; /*comment}*/ }
}
}
which produced the following output:
PackageDecl 'foo.bar'
ClassDecl 'Mu'
Method 'm1'
ClassDecl 'Inner'
Method 'm2'

ANTLR3 common values in 2 different domain values

I need to define a language-parser for the following search criteria:
CRITERIA_1=<values-set-#1> AND/OR CRITERIA_2=<values-set-#2>;
Where <values-set-#1> can have values from 1-50 and <values-set-#2> can be from the following set (5, A, B, C) - case is not important here.
I decided to use ANTLR3 (v3.4) with C# output (CSharp3), and it worked smoothly until now. The problem is that it fails to parse the string when I provide a value that belongs to both data sets (i.e. '5' in this case). For example, if I provide the following string
CRITERIA_1=5;
It returns the following error where the value node was supposed to be:
<unexpected: [#1,11:11='5',<27>,1:11], resync=5>
The grammar definition file is the following:
grammar ZeGrammar;
options {
language=CSharp3;
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
k=3;
}
tokens
{
ROOT;
CRITERIA_1;
CRITERIA_2;
OR = 'OR';
AND = 'AND';
EOF = ';';
LPAREN = '(';
RPAREN = ')';
}
public
start
: expr EOF -> ^(ROOT expr)
;
expr
: subexpr ((AND|OR)^ subexpr)*
;
subexpr
: grouppedsubexpr
| 'CRITERIA_1=' rangeval1_expr -> ^(CRITERIA_1 rangeval1_expr)
| 'CRITERIA_2=' rangeval2_expr -> ^(CRITERIA_2 rangeval2_expr)
;
grouppedsubexpr
: LPAREN! expr RPAREN!
;
rangeval1_expr
: rangeval1_subexpr
| RANGE1_VALUES
;
rangeval1_subexpr
: LPAREN! rangeval1_expr (OR^ rangeval1_expr)* RPAREN!
;
RANGE1_VALUES
: (('0'..'4')? ('0'..'9') | '5''0')
;
rangeval2_expr
: rangeval2_subexpr
| RANGE2_VALUES
;
rangeval2_subexpr
: LPAREN! rangeval2_expr (OR^ rangeval2_expr)* RPAREN!
;
RANGE2_VALUES
: '5' | ('a'|'A') | ('b'|'B') | ('c'|'C')
;
If I remove the value '5' from RANGE2_VALUES, it works fine. Can anyone give me a hint about what I am doing wrong?
You must realize that the lexer does not produce tokens based on what the parser tries to match. So, in your case, the input "5" will always be tokenized as a RANGE1_VALUES and never as a RANGE2_VALUES because both RANGE1_VALUES and RANGE2_VALUES can match this input but RANGE1_VALUES comes first (so RANGE1_VALUES takes precedence over RANGE2_VALUES).
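The tie-breaking behaviour can be reproduced with a small simulation. This is a simplified Python model, not ANTLR's actual algorithm: the regular expressions approximate the two lexer rules from the question, and `next_token` is a hypothetical helper that picks the longest match and, on a tie, the rule listed first.

```python
import re

# Lexer rules in the order they appear in the question's grammar.
LEXER_RULES = [
    ("RANGE1_VALUES", re.compile(r"50|[0-4]?[0-9]")),
    ("RANGE2_VALUES", re.compile(r"5|[aAbBcC]")),
]


def next_token(text, pos=0):
    """Pick the longest match; on a tie, the rule listed first wins."""
    best = None
    for name, pattern in LEXER_RULES:
        match = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so an equally
        # long match from a later rule never overrides an earlier rule.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


for sample in ["5", "a", "50"]:
    print(sample, "->", next_token(sample))
```

The parser has no say in this choice: whatever rule it is currently trying to match, the input "5" reaches it as a RANGE1_VALUES token.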
A possible fix would be to remove both RANGE1_VALUES and RANGE2_VALUES rules and replace them with the following lexer rules:
D0_4
: '0'..'4'
;
D5
: '5'
;
D6_50
: '6'..'9' // 6-9
| '1'..'4' '0'..'9' // 10-49
| '50' // 50
;
A_B_C
: ('a'|'A')
| ('b'|'B')
| ('c'|'C')
;
and then introduce these new parser rules:
range1_values
: D0_4
| D5
| D6_50
;
range2_values
: A_B_C
| D5
;
and replace all RANGE1_VALUES and RANGE2_VALUES references in your parser rules with range1_values and range2_values respectively.
EDIT
Instead of trying to solve this at the lexer-level, you might simply match any integer value and check inside the parser rule if the value is the correct one (or correct range) using a semantic predicate:
range1_values
: INT {Integer.valueOf($INT.text) <= 50}?
;
range2_values
: A_B_C
| INT {Integer.valueOf($INT.text) == 5}?
;
INT
: '0'..'9'+
;
A_B_C
: 'a'..'c'
| 'A'..'C'
;
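The checks performed by these two predicates can be written out as plain functions. This is only a Python sketch of the same conditions; `is_range1_value` and `is_range2_value` are hypothetical names, and the token text is assumed to arrive as a string.

```python
def is_range1_value(text: str) -> bool:
    """Mirror of the range1_values predicate: an integer no larger than 50."""
    return text.isascii() and text.isdigit() and int(text) <= 50


def is_range2_value(text: str) -> bool:
    """Mirror of range2_values: a letter a-c (either case) or the integer 5."""
    if text in {"a", "b", "c", "A", "B", "C"}:
        return True
    return text.isascii() and text.isdigit() and int(text) == 5


for sample in ["5", "50", "51", "B", "d"]:
    print(sample, is_range1_value(sample), is_range2_value(sample))
```

With this approach '5' is a single INT token, and whether it is acceptable is decided by the rule that is being matched rather than by the lexer.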

ANTLR parser and tree grammars for one simple language

Edit:
Here are the updated tree and parser grammars:
parser grammar:
options {
language = CSharp2;
output=AST;
}
tokens {
UNARY_MINUS;
CALL;
}
program : (function)* main_function
;
function: 'function' IDENTIFIER '(' (parameter (',' parameter)*)? ')' 'returns' TYPE declaration* statement* 'end' 'function'
-> ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
;
main_function
: 'function' 'main' '(' ')' 'returns' TYPE declaration* statement* 'end' 'function'
-> ^('function' 'main' TYPE declaration* statement*)
;
parameter
: 'param' IDENTIFIER ':' TYPE
-> ^('param' IDENTIFIER TYPE)
;
declaration
: 'variable' IDENTIFIER ( ',' IDENTIFIER)* ':' TYPE ';'
-> ^('variable' TYPE IDENTIFIER+ )
| 'array' array ':' TYPE ';'
-> ^('array' array TYPE)
;
statement
: ';'! | block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement
;
call_statement
: call ';'!
;
return_statement
: 'return' expression ';'
-> ^('return' expression)
;
block : 'begin' declaration* statement* 'end'
-> ^('begin' declaration* statement*)
| '{' declaration* statement* '}'
-> ^('{' declaration* statement*)
;
assignment
: IDENTIFIER ':=' expression ';'
-> ^(':=' IDENTIFIER expression )
| array ':=' expression ';'
-> ^(':=' array expression)
;
array : IDENTIFIER '[' expression (',' expression)* ']'
-> ^(IDENTIFIER expression+)
;
if_statement
: 'if' '(' expression ')' 'then' statement ('else' statement)? 'end' 'if'
-> ^('if' expression statement statement?)
;
switch_statement
: 'switch' '(' expression ')' case_part+ ('default' ':' statement)? 'end' 'switch'
-> ^('switch' expression case_part+ statement?)
;
case_part
: 'case' literal (',' literal)* ':' statement
-> ^('case' literal+ statement)
;
literal
: INTEGER | FLOAT | BOOLEAN | STRING
;
while_do_statement
: 'while' '(' expression ')' 'do' statement 'end' 'while'
-> ^('while' expression statement)
;
for_statement
: 'for' '(' IDENTIFIER ':=' expression 'to' expression ')' 'do' statement 'end' 'for'
-> ^('for' IDENTIFIER expression expression statement)
;
expression
: conjuction ( 'or'^ conjuction)*
;
conjuction
: equality ('and'^ equality)*
;
equality: relation (('=' | '/=')^ relation)?
;
relation: addition (('<' | '<=' | '>' | '>=')^ addition)?
;
addition: multiplication (('+' | '-')^ multiplication)*
;
multiplication
: unary_operation (('*' | '/' | '%')^ unary_operation)*
;
unary_operation
: '-' primary
-> ^(UNARY_MINUS primary)
| 'not' primary
-> ^('not' primary)
| primary
;
primary : IDENTIFIER
| array
| literal
| '('! expression ')'!
| '(' TYPE ')' '(' expression ')'
-> ^(TYPE expression)
| call
;
call : IDENTIFIER '(' arguments ')'
-> ^(CALL IDENTIFIER arguments)
;
arguments
: (expression (','! expression)*)?
;
BOOLEAN : 'true' | 'false'
;
TYPE : 'integer' | 'boolean' | 'float' | 'string' | 'array' | 'void'
;
IDENTIFIER : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INTEGER : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')+
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' .* '"'
;
And here is the updated tree grammar (I altered expressions, and so on...):
options {
language = 'CSharp2';
//tokenVocab= token vocab needed
ASTLabelType=CommonTree; // what is Java type of nodes?
}
program : (function)* main_function
;
function: ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
;
main_function
: ^('function' 'main' TYPE declaration* statement*)
;
parameter
: ^('param' IDENTIFIER TYPE)
;
declaration
: ^('variable' TYPE IDENTIFIER+)
| ^('array' array TYPE )
;
statement
: block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement
;
call_statement
: call
;
return_statement
: ^('return' expression)
;
block : ^('begin' declaration* statement*)
| ^('{' declaration* statement*)
;
assignment
: ^(':=' IDENTIFIER expression )
| ^(':=' array expression)
;
array : ^(IDENTIFIER expression+)
;
if_statement
: ^('if' expression statement statement?)
;
switch_statement
: ^('switch' expression case_part+ statement?)
;
case_part
: ^('case' literal+ statement)
;
literal
: INTEGER | FLOAT | BOOLEAN | STRING
;
while_do_statement
: ^('while' expression statement)
;
for_statement
: ^('for' IDENTIFIER expression expression statement)
;
expression
: ^('or' expression expression)
| ^('and' expression expression)
| ^('=' expression expression)
| ^('/=' expression expression)
| ^('<' expression expression)
| ^('<=' expression expression)
| ^('>' expression expression)
| ^('>=' expression expression)
| ^('+' expression expression)
| ^('-' expression expression)
| ^(UNARY_MINUS expression)
| ^('not' expression)
| IDENTIFIER
| array
| literal
| ^(TYPE expression)
| call
;
call : ^(CALL IDENTIFIER arguments)
;
arguments
: (expression (expression)*)?
;
I successfully generated a tree graph with the DOTTreeGenerator and StringTemplate classes, so it seems that everything is working at the moment. But any suggestions (about bad habits or anything else in these grammars) are appreciated, since I don't have a lot of experience with ANTLR or language recognition.
See updates on http://vladimir-radojicic.blogspot.com
The only thing I was going to suggest, besides introducing imaginary tokens to make sure your tree grammar produces a "unique AST" and simplifying the expression in the tree-grammar, which you both already did (again: well done!), is that you shouldn't use literal tokens inside your parser grammar. Especially not when they can possibly be matched by other lexer rule(s). For example, all your reserved words (like for, while, end, etc.) can also be matched by the lexer rule IDENTIFIER. It's better to create explicit tokens inside the lexer (and put these rules before the IDENTIFIER rule!):
...
FOR : 'for';
WHILE : 'while';
END : 'end';
...
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
...
Ideally, the tree grammar does not contain any quoted tokens. AFAIK, you can't import grammar X inside a grammar Y properly: the literal tokens inside grammar X are then not available in grammar Y. And when you split your combined grammar into a parser grammar and a lexer grammar, these literal tokens are not allowed. With small grammars like yours, these last remarks are of no concern (and you could leave your grammar "as is"), but remember them when you create larger grammars.
Best of luck!
EDIT
Imaginary tokens are not only handy when there's no real token that can be made the root of the tree. The way I look at imaginary tokens is that they make your tree "unique", so that the tree grammar can only "walk" your tree in one possible way. Take subtraction and unary minus for example. If you hadn't created an imaginary token called UNARY_MINUS, but had simply done this:
unary_operation
: '-' primary -> ^('-' primary)
| 'not' primary -> ^('not' primary)
| primary
;
then you'd have something like this in your tree grammar:
expression
: ^('-' expression expression)
| ...
| ^('-' expression)
| ...
;
Now both subtraction and unary minus start with the same tokens, which the tree grammar does not like! It's easy to see with this - (minus) example, but there can be quite some tricky cases (even with small grammars like yours!) that are not so obvious. So, always let the parser create "unique trees" while rewriting to AST's.
Hope that clarifies it (a bit).
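As a rough illustration of why unique roots help, here is a Python sketch of a tree walker. Trees are modelled as tuples of (root, children...) and `evaluate` is a hypothetical function, not ANTLR-generated code. Because UNARY_MINUS and '-' are different roots, the walker can decide what to do from the root alone, without counting children.

```python
def evaluate(node):
    """Walk a tuple-based AST; the root token alone selects the operation."""
    if isinstance(node, (int, float)):
        return node
    root, *children = node
    if root == "UNARY_MINUS":
        (operand,) = children
        return -evaluate(operand)
    if root == "-":
        left, right = children
        return evaluate(left) - evaluate(right)
    if root == "+":
        left, right = children
        return evaluate(left) + evaluate(right)
    raise ValueError(f"unknown root token: {root!r}")


# 10 - (-3), with the unary minus rewritten to the imaginary token.
tree = ("-", 10, ("UNARY_MINUS", 3))
print(evaluate(tree))
```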

Using ANTLR 3.3?

I'm trying to get started with ANTLR and C# but I'm finding it extraordinarily difficult due to the lack of documentation/tutorials. I've found a couple half-hearted tutorials for older versions, but it seems there have been some major changes to the API since.
Can anyone give me a simple example of how to create a grammar and use it in a short program?
I've finally managed to get my grammar file compiling into a lexer and parser, and I can get those compiled and running in Visual Studio (after having to recompile the ANTLR source because the C# binaries seem to be out of date too! -- not to mention the source doesn't compile without some fixes), but I still have no idea what to do with my parser/lexer classes. Supposedly it can produce an AST given some input...and then I should be able to do something fancy with that.
Let's say you want to parse simple expressions consisting of the following tokens:
- subtraction (also unary);
+ addition;
* multiplication;
/ division;
(...) grouping (sub) expressions;
integer and decimal numbers.
An ANTLR grammar could look like this:
grammar Expression;
options {
language=CSharp2;
}
parse
: exp EOF
;
exp
: addExp
;
addExp
: mulExp (('+' | '-') mulExp)*
;
mulExp
: unaryExp (('*' | '/') unaryExp)*
;
unaryExp
: '-' atom
| atom
;
atom
: Number
| '(' exp ')'
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Now to create a proper AST, you add output=AST; in your options { ... } section, and you mix some "tree operators" in your grammar defining which tokens should be the root of a tree. There are two ways to do this:
add ^ and ! after your tokens. The ^ causes the token to become a root and the ! excludes the token from the ast;
by using "rewrite rules": ... -> ^(Root Child Child ...).
Take the rule foo for example:
foo
: TokenA TokenB TokenC TokenD
;
and let's say you want TokenB to become the root and TokenA and TokenC to become its children, and you want to exclude TokenD from the tree. Here's how to do that using option 1:
foo
: TokenA TokenB^ TokenC TokenD!
;
and here's how to do that using option 2:
foo
: TokenA TokenB TokenC TokenD -> ^(TokenB TokenA TokenC)
;
So, here's the grammar with the tree operators in it:
grammar Expression;
options {
language=CSharp2;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
@parser::namespace { Demo.Antlr }
@lexer::namespace { Demo.Antlr }
parse
: exp EOF -> ^(ROOT exp)
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
I also added a Space rule to ignore any white space in the source file and added some extra tokens and namespaces for the lexer and parser. Note that the order is important (options { ... } first, then tokens { ... } and finally the @... {}-namespace declarations).
That's it.
Now generate a lexer and parser from your grammar file:
java -cp antlr-3.2.jar org.antlr.Tool Expression.g
and put the .cs files in your project together with the C# runtime DLL's.
You can test it using the following class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Preorder(ITree Tree, int Depth)
{
if(Tree == null)
{
return;
}
for (int i = 0; i < Depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
Preorder(Tree.GetChild(0), Depth + 1);
Preorder(Tree.GetChild(1), Depth + 1);
}
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("(12.5 + 56 / -7) * 0.5");
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
ExpressionParser.parse_return ParseReturn = Parser.parse();
CommonTree Tree = (CommonTree)ParseReturn.Tree;
Preorder(Tree, 0);
}
}
}
which produces the following output:
ROOT
 *
  +
   12.5
   /
    56
    UNARY_MIN
     7
  0.5
which corresponds to the following AST:
(diagram created using graph.gafol.net)
Note that ANTLR 3.3 has just been released and the CSharp target is "in beta". That's why I used ANTLR 3.2 in my example.
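The Preorder method above visits only the first two children of each node, which is enough for this binary expression tree. A more general walk visits every child. Here is a minimal Python sketch of that idea; the (label, children) tuples are an assumed stand-in for the ANTLR tree classes, and the tree is typed in by hand to match the AST shown above.

```python
def preorder(node, depth=0):
    """Yield one indented line per node, visiting every child in order."""
    label, children = node
    yield " " * depth + label
    for child in children:
        yield from preorder(child, depth + 1)


# Hand-built copy of the AST for "(12.5 + 56 / -7) * 0.5".
tree = ("ROOT", [
    ("*", [
        ("+", [
            ("12.5", []),
            ("/", [("56", []), ("UNARY_MIN", [("7", [])])]),
        ]),
        ("0.5", []),
    ]),
])

print("\n".join(preorder(tree)))
```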
In case of rather simple languages (like my example above), you could also evaluate the result on the fly without creating an AST. You can do that by embedding plain C# code inside your grammar file, and letting your parser rules return a specific value.
Here's an example:
grammar Expression;
options {
language=CSharp2;
}
@parser::namespace { Demo.Antlr }
@lexer::namespace { Demo.Antlr }
parse returns [double value]
: exp EOF {$value = $exp.value;}
;
exp returns [double value]
: addExp {$value = $addExp.value;}
;
addExp returns [double value]
: a=mulExp {$value = $a.value;}
( '+' b=mulExp {$value += $b.value;}
| '-' b=mulExp {$value -= $b.value;}
)*
;
mulExp returns [double value]
: a=unaryExp {$value = $a.value;}
( '*' b=unaryExp {$value *= $b.value;}
| '/' b=unaryExp {$value /= $b.value;}
)*
;
unaryExp returns [double value]
: '-' atom {$value = -1.0 * $atom.value;}
| atom {$value = $atom.value;}
;
atom returns [double value]
: Number {$value = Double.Parse($Number.Text, CultureInfo.InvariantCulture);}
| '(' exp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
which can be tested with the class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Main (string[] args)
{
string expression = "(12.5 + 56 / -7) * 0.5";
ANTLRStringStream Input = new ANTLRStringStream(expression);
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
Console.WriteLine(expression + " = " + Parser.parse());
}
}
}
and produces the following output:
(12.5 + 56 / -7) * 0.5 = 2.25
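To see what the generated parser does with these embedded actions, it can help to write the same rules by hand. The following Python sketch is a hand-written recursive-descent evaluator, not ANTLR output; the `Evaluator` class and `tokenize` function are assumptions for illustration. There is one method per parser rule, and each method returns the value that the corresponding rule assigns to $value.

```python
import re

# A number (Number rule) or a single operator/parenthesis; leading spaces skipped.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(source):
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Evaluator:
    def __init__(self, source):
        self.tokens = tokenize(source)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.pos += 1
        return token

    def parse(self):      # parse : exp EOF
        value = self.add_exp()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return value

    def add_exp(self):    # addExp : mulExp (('+' | '-') mulExp)*
        value = self.mul_exp()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.mul_exp()
            else:
                value -= self.mul_exp()
        return value

    def mul_exp(self):    # mulExp : unaryExp (('*' | '/') unaryExp)*
        value = self.unary_exp()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.unary_exp()
            else:
                value /= self.unary_exp()
        return value

    def unary_exp(self):  # unaryExp : '-' atom | atom
        if self.peek() == "-":
            self.take()
            return -self.atom()
        return self.atom()

    def atom(self):       # atom : Number | '(' exp ')'
        if self.peek() == "(":
            self.take("(")
            value = self.add_exp()
            self.take(")")
            return value
        token = self.take()
        try:
            return float(token)
        except ValueError:
            raise SyntaxError(f"expected a number, found {token!r}") from None


expression = "(12.5 + 56 / -7) * 0.5"
print(expression, "=", Evaluator(expression).parse())
```

The nesting of the methods encodes operator precedence in the same way the grammar does: add_exp calls mul_exp, which calls unary_exp, so multiplication and division bind more tightly than addition and subtraction.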
EDIT
In the comments, Ralph wrote:
Tip for those using Visual Studio: you can put something like java -cp "$(ProjectDir)antlr-3.2.jar" org.antlr.Tool "$(ProjectDir)Expression.g" in the pre-build events, then you can just modify your grammar and run the project without having to worry about rebuilding the lexer/parser.
Have you looked at Irony.net? It's aimed at .Net and therefore works really well, has proper tooling, proper examples and just works. The only problem is that it is still a bit 'alpha-ish' so documentation and versions seem to change a bit, but if you just stick with a version, you can do nifty things.
p.s. sorry for the bad answer where you ask a problem about X and someone suggests something different using Y ;^)
My personal experience is that before learning ANTLR on C#/.NET, you should spare enough time to learn ANTLR on Java. That gives you knowledge on all the building blocks and later you can apply on C#/.NET.
I wrote a few blog posts recently,
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-i/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-ii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iv/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-v/
The assumption is that you are familiar with ANTLR on Java and are ready to migrate your grammar file to C#/.NET.
There is a great article on how to use antlr and C# together here:
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx
it's a "how it was done" article by the creator of NCalc which is a mathematical expression evaluator for C# - http://ncalc.codeplex.com
You can also download the grammar for NCalc here:
http://ncalc.codeplex.com/SourceControl/changeset/view/914d819f2865#Grammar%2fNCalc.g
example of how NCalc works:
Expression e = new Expression("Round(Pow(Pi, 2) + Pow([Pi2], 2) + X, 2)");
e.Parameters["Pi2"] = new Expression("Pi * Pi");
e.Parameters["X"] = 10;
e.EvaluateParameter += delegate(string name, ParameterArgs args)
{
if (name == "Pi")
args.Result = 3.14;
};
Debug.Assert(117.07 == e.Evaluate());
Hope it's helpful.
