I'm using Irony.net for generating a parse tree out of the source. Essentially I'm using ExpressionEvaluatorGrammer like grammer for binary expressions (arithmetic, relational and logical/conditional). I want to convert the resultant parse tree into Linq expression by traversing it. However, the tree does not seem to have a formation directly convertable to linq conditional expression. Hypothetical example of such an expression:
1 == 1 && 4 - 1 == 3
generates (pseudo xml tree for brevity):
<binary>
<binary>
<binary>
<literal>1</literal>
<op>==</op>
<literal>1</literal>
</binary>
<op>&&</op>
<binary>
<literal>4</literal>
<op>-</op>
<literal>1</literal>
</binary>
</binary>
<op>==</op>
<literal>3</literal>
</binary>
In the tree above, the arithmetic expression (4 - 1) becomes the right expression to the && logical operation as the parent node closes after it. In the ideal world, it should have been a left expression of the nodes representing "== 3".
How do you traverse such a tree to generate a proper and operation? Or, is there a way to generate the tree in the form I desire?
Edit: here's the grammer (partial) definition. I have taken it from ExpressionEvaluatorGrammer that comes with Irony.interpreter.
RegisterOperators(15, "&", "&&", "|", "||");
RegisterOperators(20, "==", "<", "<=", ">", ">=", "!=");
RegisterOperators(30, "+", "-");
RegisterOperators(40, "*", "/");
Expr.Rule = Term
Term.Rule = number | ParExpr | stringLit | FunctionCall | identifier | MemberAccess | IndexedAccess;
ParExpr.Rule = "(" + Expr + ")";
BinExpr.Rule = Expr + BinOp + Expr;
BinOp.Rule = ToTerm("+") | "-" | "*" | "/" | "**" | "==" | "<" | "<=" | ">" | ">=" | "!=" | "&&" | "||" | "&" | "|";
You cannot fix this by traversing the tree in a magical/special way. Your parser is incorrect! Probably, it is just misconfigured. You absolutely need to get the correct tree from it in order to process it further.
Probably you have wrong operator precedence rules in it. It looks like it, at least. Try adding parenthesis to see if it fixed up the tree.
Assuming the operator precedence is correct, you should walk the tree recursively using the Visitor Pattern, returning an Expression at each level:
XName xBinary = "binary";
XName xLiteral = "literal";
Expression Visit(XElement elt)
{
if (elt.Name == xBinary)
{
return VisitBinary(elt);
}
else if (elt.Name == xLiteral)
{
return VisitLiteral(elt);
} // ...
throw new NotSupportedException();
}
Now that you have the Visit structure, you simply write each specific visitor to use your main Visit:
Expression VisitLiteral(XElement elt)
{
Debug.Assert(elt.Name == xLiteral);
return Expression.Constant((int)elt);
}
Expression VisitBinary(XElement elt)
{
Debug.Assert(elt.Name == xBinary);
Debug.Assert(elt.Elements().Count() >= 3);
var lhs = elt.Elements().ElementAt(0);
var op = elt.Elements().ElementAt(1);
var rhs = elt.Elements().ElementAt(2);
switch((string)op)
{
case "+":
// by chaining LHS and RHS to Visit we allow the tree to be constructed
// properly as Visit performs the per-element dispatch
return Expression.Add(Visit(lhs), Visit(rhs));
case "&&":
return Expression.AndAlso(Visit(lhs), Visit(rhs));
default:
throw new NotSupportedException();
}
}
Related
I'm new to C# so expect some mistakes ahead. Any help / guidance would be greatly appreciated.
I want to limit the accepted inputs for a string to just:
a-z
A-Z
hyphen
Period
If the character is a letter, a hyphen, or period, it's to be accepted. Anything else will return an error.
The code I have so far is
string foo = "Hello!";
foreach (char c in foo)
{
/* Is there a similar way
To do this in C# as
I am basing the following
Off of my Python 3 knowledge
*/
if (c.IsLetter == true) // *Q: Can I cut out the == true part ?*
{
// Do what I want with letters
}
else if (c.IsDigit == true)
{
// Do what I want with numbers
}
else if (c.Isletter == "-") // Hyphen | If there's an 'or', include period as well
{
// Do what I want with symbols
}
}
I know that's a pretty poor set of code.
I had a thought whilst writing this:
Is it possible to create a list of the allowed characters and check the variable against that?
Something like:
foreach (char c in foo)
{
if (c != list)
{
// Unaccepted message here
}
else if (c == list)
{
// Accepted
}
}
Thanks in advance!
Easily accomplished with a Regex:
using System.Text.RegularExpressions;
var isOk = Regex.IsMatch(foo, #"^[A-Za-z0-9\-\.]+$");
Rundown:
match from the start
| set of possible matches
| |
|+-------------+
|| |any number of matches is ok
|| ||match until the end of the string
|| |||
vv vvv
^[A-Za-z0-9\-\.]+$
^ ^ ^ ^ ^
| | | | |
| | | | match dot
| | | match hyphen
| | match 0 to 9
| match a-z (lowercase)
match A-Z (uppercase)
You can do this in a single line with regular expressions:
Regex.IsMatch(myInput, #"^[a-zA-Z0-9\.\-]*$")
^ -> match start of input
[a-zA-Z0-9\.\-] -> match any of a-z , A-Z , 0-9, . or -
* -> 0 or more times (you may prefer + which is 1 or more times)
$ -> match the end of input
You can use Regex.IsMatch function and specify your regular expression.
Or define manually chars what you need. Something like this:
string foo = "Hello!";
char[] availableSymbols = {'-', ',', '!'};
char[] availableLetters = {'A', 'a', 'H'}; //etc.
char[] availableNumbers = {'1', '2', '3'}; //etc
foreach (char c in foo)
{
if (availableLetters.Contains(c))
{
// Do what I want with letters
}
else if (availableNumbers.Contains(c))
{
// Do what I want with numbers
}
else if (availableSymbols.Contains(c))
{
// Do what I want with symbols
}
}
Possible solution
You can use the CharUnicodeInfo.GetUnicodeCategory(char) method. It returns the UnicodeCategory of a character. The following unicode categories might be what you're look for:
UnicodeCategory.DecimalDigitNumber
UnicodeCategory.LowercaseLetter and UnicodeCategory.UppercaseLetter
An example:
string foo = "Hello!";
foreach (char c in foo)
{
UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
if (cat == UnicodeCategory.LowercaseLetter || cat == UnicodeCategory.UppercaseLetter)
{
// Do what I want with letters
}
else if (cat == UnicodeCategory.DecimalDigitNumber)
{
// Do what I want with numbers
}
else if (c == '-' || c == '.')
{
// Do what I want with symbols
}
}
Answers to your other questions
Can I cut out the == true part?:
Yes, you can cut the == true part, it is not required in C#
If there's an 'or', include period as well.:
To create or expressions use the 'barbar' (||) operator as i've done in the above example.
Whenever you have some kind of collection of similar things, an array, a list, a string of characters, whatever, you'll see at the definition of the collection that it implements IEnumerable
public class String : ..., IEnumerable,
here T is a char. It means that you can ask the class: "give me your first T", "give me your next T", "give me your next T" and so on until there are no more elements.
This is the basis for all Linq. Ling has about 40 functions that act upon sequences. And if you need to do something with a sequence of the same kind of items, consider using LINQ.
The functions in LINQ can be found in class Enumerable. One of the function is Contains. You can use it to find out if a sequence contains a character.
char[] allowedChars = "abcdefgh....XYZ.-".ToCharArray();
Now you have a sequence of allowed characters. Suppose you have a character x and want to know if x is allowed:
char x = ...;
bool xIsAllowed = allowedChars.Contains(x);
Now Suppose you don't have one character x, but a complete string and you want only the characters in this string that are allowed:
string str = ...
var allowedInStr = str
.Where(characterInString => allowedChars.Contains(characterInString));
If you are going to do a lot with sequences of things, consider spending some time to familiarize yourself with LINQ:
Linq explained
You can use Regex.IsMatch with "^[a-zA-Z_.]*$" to check for valid characters.
string foo = "Hello!";
if (!Regex.IsMatch(foo, "^[a-zA-Z_\.]*$"))
{
throw new ArgumentException("Exception description here")
}
Other than that you can create a list of chars and use string.Contains method to check if it is ok.
string validChars = "abcABC./";
foreach (char c in foo)
{
if (!validChars.Contains(c))
{
// Throw exception
}
}
Also, you don't need to check for == true/false in if line. Both expressions are equal below
if (boolvariable) { /* do something */ }
if (boolvariable == true) { /* do something */ }
This is driving me up the wall now. The ternary formatting options in ReSharper -> Options -> C# do not cover indentation, just spacing of '?' and ':' characters, and line chopping.
What I want is:
var x = expr1
? expr2
: expr3;
But what I get is:
var x = expr1
? expr2
: expr3;
If the ternary operator formatting was offering no assistance, I thought that the Chained binary expressions may help, but no. That is set as follows.
var a = someOperand + operand2
+ operand3
+ operand4;
Any ideas?
Try enabling ReSharper | Options | Code Editing | C# | Formatting Style | Other | Align Multiline Constructs | Expression
I am using Irony to create a parser for a scripting language, but I've come across a little problem: how do I translate an EBNF expression like this in Irony?
'(' [ Ident { ',' Ident } ] ')'
I already tried some tricks like
Chunk.Rule = (Ident | Ident + "," + Chunk);
CallArgs.Rule = '(' + Chunk + ')' | '(' + ')';
But it's ugly and I'm not even sure if that works the way it should (haven't tried it yet...). Has anyone any suggestions?
EDIT:
I found out these helper methods (MakeStarList, MakePlusList) but couldn't find out how to use them, because of the complete lack of documentation of Irony... Has anyone any clue?
// Declare the non-terminals
var Ident = new NonTerminal("Ident");
var IdentList = new NonTerminal("Term");
// Rules
IdentList.Rule = ToTerm("(") + MakePlusRule(IdentList, ",", Ident) + ")";
Ident.Rule = // specify whatever Ident is (I assume you mean an identifier of some kind).
You can use the MakePlusRule helper method to define a one-or-many occurrence of some terminal. The MakePlusRule is basically just present your terminals as standard recursive list-idiom:
Ident | IdentList + "," + Ident
It also marks the terminal as representing a list, which will tell the parser to unfold the list-tree as a convenient list of child nodes.
The ANTLR website describes two approaches to implementing "include" directives. The first approach is to recognize the directive in the lexer and include the file lexically (by pushing the CharStream onto a stack and replacing it with one that reads the new file); the second is to recognize the directive in the parser, launch a sub-parser to parse the new file, and splice in the AST generated by the sub-parser. Neither of these are quite what I need.
In the language I'm parsing, recognizing the directive in the lexer is impractical for a few reasons:
There is no self-contained character pattern that always means "this is an include directive". For example, Include "foo"; at top level is an include directive, but in Array bar --> Include "foo"; or Constant Include "foo"; the word Include is an identifier.
The name of the file to include may be given as a string or as a constant identifier, and such constants can be defined with arbitrarily complex expressions.
So I want to trigger the inclusion from the parser. But to perform the inclusion, I can't launch a sub-parser and splice the AST together; I have to splice the tokens. It's legal for a block to begin with { in the main file and be terminated by } in the included file. A file included inside a function can even close the function definition and start a new one.
It seems like I'll need something like the first approach but at the level of TokenStreams instead of CharStreams. Is that a viable approach? How much state would I need to keep on the stack, and how would I make the parser switch back to the original token stream instead of terminating when it hits EOF? Or is there a better way to handle this?
==========
Here's an example of the language, demonstrating that blocks opened in the main file can be closed in the included file (and vice versa). Note that the # before Include is required when the directive is inside a function, but optional outside.
main.inf:
[ Main;
print "This is Main!";
if (0) {
#include "other.h";
print "This is OtherFunction!";
];
other.h:
} ! end if
]; ! end Main
[ OtherFunction;
A possibility is for each Include statement to let your parser create a new instance of your lexer and insert these new tokens the lexer creates at the index the parser is currently at (see the insertTokens(...) method in the parser's #members block.).
Here's a quick demo:
Inform6.g
grammar Inform6;
options {
output=AST;
}
tokens {
STATS;
F_DECL;
F_CALL;
EXPRS;
}
#parser::header {
import java.util.Map;
import java.util.HashMap;
}
#parser::members {
private Map<String, String> memory = new HashMap<String, String>();
private void putInMemory(String key, String str) {
String value;
if(str.startsWith("\"")) {
value = str.substring(1, str.length() - 1);
}
else {
value = memory.get(str);
}
memory.put(key, value);
}
private void insertTokens(String fileName) {
// possibly strip quotes from `fileName` in case it's a Str-token
try {
CommonTokenStream thatStream = new CommonTokenStream(new Inform6Lexer(new ANTLRFileStream(fileName)));
thatStream.fill();
List extraTokens = thatStream.getTokens();
extraTokens.remove(extraTokens.size() - 1); // remove EOF
CommonTokenStream thisStream = (CommonTokenStream)this.getTokenStream();
thisStream.getTokens().addAll(thisStream.index(), extraTokens);
} catch(Exception e) {
e.printStackTrace();
}
}
}
parse
: stats EOF -> stats
;
stats
: stat* -> ^(STATS stat*)
;
stat
: function_decl
| function_call
| include
| constant
| if_stat
;
if_stat
: If '(' expr ')' '{' stats '}' -> ^(If expr stats)
;
function_decl
: '[' id ';' stats ']' ';' -> ^(F_DECL id stats)
;
function_call
: Id exprs ';' -> ^(F_CALL Id exprs)
;
include
: Include Str ';' {insertTokens($Str.text);} -> /* omit statement from AST */
| Include id ';' {insertTokens(memory.get($id.text));} -> /* omit statement from AST */
;
constant
: Constant id expr ';' {putInMemory($id.text, $expr.text);} -> ^(Constant id expr)
;
exprs
: expr (',' expr)* -> ^(EXPRS expr+)
;
expr
: add_expr
;
add_expr
: mult_expr (('+' | '-')^ mult_expr)*
;
mult_expr
: atom (('*' | '/')^ atom)*
;
atom
: id
| Num
| Str
| '(' expr ')' -> expr
;
id
: Id
| Include
;
Comment : '!' ~('\r' | '\n')* {skip();};
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If : 'if';
Include : 'Include';
Constant : 'Constant';
Id : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')+;
Str : '"' ~'"'* '"';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
main.inf
Constant IMPORT "other.h";
[ Main;
print "This is Main!";
if (0) {
Include IMPORT;
print "This is OtherFunction!";
];
other.h
} ! end if
]; ! end Main
[ OtherFunction;
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// create lexer & parser
Inform6Lexer lexer = new Inform6Lexer(new ANTLRFileStream("main.inf"));
Inform6Parser parser = new Inform6Parser(new CommonTokenStream(lexer));
// print the AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT((CommonTree)parser.parse().getTree());
System.out.println(st);
}
}
To run the demo, do the following on the command line:
java -cp antlr-3.3.jar org.antlr.Tool Inform6.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
The output you'll see corresponds to the following AST:
I'm trying to get started with ANTLR and C# but I'm finding it extraordinarily difficult due to the lack of documentation/tutorials. I've found a couple half-hearted tutorials for older versions, but it seems there have been some major changes to the API since.
Can anyone give me a simple example of how to create a grammar and use it in a short program?
I've finally managed to get my grammar file compiling into a lexer and parser, and I can get those compiled and running in Visual Studio (after having to recompile the ANTLR source because the C# binaries seem to be out of date too! -- not to mention the source doesn't compile without some fixes), but I still have no idea what to do with my parser/lexer classes. Supposedly it can produce an AST given some input...and then I should be able to do something fancy with that.
Let's say you want to parse simple expressions consisting of the following tokens:
- subtraction (also unary);
+ addition;
* multiplication;
/ division;
(...) grouping (sub) expressions;
integer and decimal numbers.
An ANTLR grammar could look like this:
grammar Expression;
options {
language=CSharp2;
}
parse
: exp EOF
;
exp
: addExp
;
addExp
: mulExp (('+' | '-') mulExp)*
;
mulExp
: unaryExp (('*' | '/') unaryExp)*
;
unaryExp
: '-' atom
| atom
;
atom
: Number
| '(' exp ')'
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Now to create a proper AST, you add output=AST; in your options { ... } section, and you mix some "tree operators" in your grammar defining which tokens should be the root of a tree. There are two ways to do this:
add ^ and ! after your tokens. The ^ causes the token to become a root and the ! excludes the token from the ast;
by using "rewrite rules": ... -> ^(Root Child Child ...).
Take the rule foo for example:
foo
: TokenA TokenB TokenC TokenD
;
and let's say you want TokenB to become the root and TokenA and TokenC to become its children, and you want to exclude TokenD from the tree. Here's how to do that using option 1:
foo
: TokenA TokenB^ TokenC TokenD!
;
and here's how to do that using option 2:
foo
: TokenA TokenB TokenC TokenD -> ^(TokenB TokenA TokenC)
;
So, here's the grammar with the tree operators in it:
grammar Expression;
options {
language=CSharp2;
output=AST;
}
tokens {
ROOT;
UNARY_MIN;
}
#parser::namespace { Demo.Antlr }
#lexer::namespace { Demo.Antlr }
parse
: exp EOF -> ^(ROOT exp)
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
I also added a Space rule to ignore any white spaces in the source file and added some extra tokens and namespaces for the lexer and parser. Note that the order is important (options { ... } first, then tokens { ... } and finally the #... {}-namespace declarations).
That's it.
Now generate a lexer and parser from your grammar file:
java -cp antlr-3.2.jar org.antlr.Tool Expression.g
and put the .cs files in your project together with the C# runtime DLL's.
You can test it using the following class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Preorder(ITree Tree, int Depth)
{
if(Tree == null)
{
return;
}
for (int i = 0; i < Depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
Preorder(Tree.GetChild(0), Depth + 1);
Preorder(Tree.GetChild(1), Depth + 1);
}
public static void Main (string[] args)
{
ANTLRStringStream Input = new ANTLRStringStream("(12.5 + 56 / -7) * 0.5");
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
ExpressionParser.parse_return ParseReturn = Parser.parse();
CommonTree Tree = (CommonTree)ParseReturn.Tree;
Preorder(Tree, 0);
}
}
}
which produces the following output:
ROOT
*
+
12.5
/
56
UNARY_MIN
7
0.5
which corresponds to the following AST:
(diagram created using graph.gafol.net)
Note that ANTLR 3.3 has just been released and the CSharp target is "in beta". That's why I used ANTLR 3.2 in my example.
In case of rather simple languages (like my example above), you could also evaluate the result on the fly without creating an AST. You can do that by embedding plain C# code inside your grammar file, and letting your parser rules return a specific value.
Here's an example:
grammar Expression;
options {
language=CSharp2;
}
#parser::namespace { Demo.Antlr }
#lexer::namespace { Demo.Antlr }
parse returns [double value]
: exp EOF {$value = $exp.value;}
;
exp returns [double value]
: addExp {$value = $addExp.value;}
;
addExp returns [double value]
: a=mulExp {$value = $a.value;}
( '+' b=mulExp {$value += $b.value;}
| '-' b=mulExp {$value -= $b.value;}
)*
;
mulExp returns [double value]
: a=unaryExp {$value = $a.value;}
( '*' b=unaryExp {$value *= $b.value;}
| '/' b=unaryExp {$value /= $b.value;}
)*
;
unaryExp returns [double value]
: '-' atom {$value = -1.0 * $atom.value;}
| atom {$value = $atom.value;}
;
atom returns [double value]
: Number {$value = Double.Parse($Number.Text, CultureInfo.InvariantCulture);}
| '(' exp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
which can be tested with the class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
class MainClass
{
public static void Main (string[] args)
{
string expression = "(12.5 + 56 / -7) * 0.5";
ANTLRStringStream Input = new ANTLRStringStream(expression);
ExpressionLexer Lexer = new ExpressionLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
ExpressionParser Parser = new ExpressionParser(Tokens);
Console.WriteLine(expression + " = " + Parser.parse());
}
}
}
and produces the following output:
(12.5 + 56 / -7) * 0.5 = 2.25
EDIT
In the comments, Ralph wrote:
Tip for those using Visual Studio: you can put something like java -cp "$(ProjectDir)antlr-3.2.jar" org.antlr.Tool "$(ProjectDir)Expression.g" in the pre-build events, then you can just modify your grammar and run the project without having to worry about rebuilding the lexer/parser.
Have you looked at Irony.net? It's aimed at .Net and therefore works really well, has proper tooling, proper examples and just works. The only problem is that it is still a bit 'alpha-ish' so documentation and versions seem to change a bit, but if you just stick with a version, you can do nifty things.
p.s. sorry for the bad answer where you ask a problem about X and someone suggests something different using Y ;^)
My personal experience is that before learning ANTLR on C#/.NET, you should spare enough time to learn ANTLR on Java. That gives you knowledge on all the building blocks and later you can apply on C#/.NET.
I wrote a few blog posts recently,
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-i/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-ii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iv/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-v/
The assumption is that you are familiar with ANTLR on Java and is ready to migrate your grammar file to C#/.NET.
There is a great article on how to use antlr and C# together here:
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx
it's a "how it was done" article by the creator of NCalc which is a mathematical expression evaluator for C# - http://ncalc.codeplex.com
You can also download the grammar for NCalc here:
http://ncalc.codeplex.com/SourceControl/changeset/view/914d819f2865#Grammar%2fNCalc.g
example of how NCalc works:
Expression e = new Expression("Round(Pow(Pi, 2) + Pow([Pi2], 2) + X, 2)");
e.Parameters["Pi2"] = new Expression("Pi * Pi");
e.Parameters["X"] = 10;
e.EvaluateParameter += delegate(string name, ParameterArgs args)
{
if (name == "Pi")
args.Result = 3.14;
};
Debug.Assert(117.07 == e.Evaluate());
hope its helpful