Mismatched input when lexing and parsing with modes (C#)

I'm having an ANTLR4 problem with mismatched input but can't solve it. I've found a lot of questions dealing with that, and they usually revolve around the lexer matching the input as a different token, but I don't see that happening in my case.
I've got this lexer grammar:
FieldStart : '[' Definition ']' -> pushMode(INFIELD) ;
Definition : 'Element';
mode INFIELD;
FieldEnd : '[end]' -> popMode ;
ContentValue : ~[[]* ;
Which then runs on the following parser:
field : FieldStart ContentValue FieldEnd #Field_Found;
I simplified it to zoom in to the problem, but here's the point where I can't get any further.
I'm running that on the following input:
[Element]Va-lu*e[end]
and I get this output:
Type : 001 | FieldStart | [Element]
Type : 004 | ContentValue | Va-lu*e
Type : 003 | FieldEnd | [end]
Type : -001 | EOF | <EOF>
([] [Element] Va-lu*e [end])
I generated the output with C#, doing the following (shortened):
string tokens = "";
foreach (IToken CurrToken in TokenStream.GetTokens())
{
    if (CurrToken.Type == -1)
    {
        tokens += "Type : " + CurrToken.Type.ToString("000") + " | " + "EOF" + " | " + CurrToken.Text + "\n";
    }
    else
    {
        tokens += "Type : " + CurrToken.Type.ToString("000") + " | " + Lexer.RuleNames[CurrToken.Type - 1] + " | " + CurrToken.Text + "\n";
    }
}
tokens += "\n\n" + ParseTree.ToStringTree();
Upon parsing this via
IParseTree ParseTree = Parser.field();
I am presented this error:
"mismatched input 'Va-lu*e' expecting ContentValue"
I just can't find the error; can you help me here?
I assume it's got something to do with the lexer modes, but from what I've read, the parser doesn't care (or know) about the modes.
Thanks!

Modes are not available in a combined grammar. Split your grammar and it should work.
Also, always check the error messages:
error(120): ../Field.g4:14:5: lexical modes are only allowed in lexer grammars
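For example, the rules from the question could be split like this (a sketch; file and grammar names are placeholders, and ContentValue uses + instead of * so it can never match a zero-length token):

```
lexer grammar FieldLexer;

FieldStart : '[' Definition ']' -> pushMode(INFIELD) ;
Definition : 'Element' ;

mode INFIELD;

FieldEnd     : '[end]' -> popMode ;
ContentValue : ~[[]+ ;
```

with a parser grammar that imports the lexer's tokens via tokenVocab:

```
parser grammar FieldParser;

options { tokenVocab = FieldLexer; }

field : FieldStart ContentValue FieldEnd # Field_Found ;
```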

I think I have now figured out how to solve my problem: there seems to be a required configuration when working with a split lexer/parser grammar structure AND using lexer modes in Visual Studio (tested 2012 and 2013) with the ANTLR4 NuGet release:
I had to include
options { tokenVocab = GRAMMAR_NAME_Lexer; }
in my parser grammar at the beginning.
Otherwise, the lexer created the tokens and modes as expected, but the parser would not recognize lexer tokens that are in any mode other than the default mode.
I have also experienced that the "popMode" lexer command sometimes causes my TokenStream to throw an invalid state exception; I could work around that by using "mode(DEFAULT_MODE)" instead of "popMode".
I hope this helps somebody, but I'd still like it if someone who understands ANTLR could offer some additional clarification, since I just "solved" it by toying around until it worked.


Antlr Lexer Parse error

Antlr version: antlr-dotnet-tool-3.5.0.2
TestGrammar.g
lexer grammar TestGrammar;
options
{
  language=CSharp3;
  backtrack=true;
}
DOT
: '.'
;
NUMBER
: ( '0'..'9' )+ ('.' ( '0'..'9' )+)?
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
;
C# code:
var lexer = new TestGrammar(new ANTLRStringStream("2..3"));
while (true)
{
    var token = lexer.NextToken();
    Console.WriteLine(token);
    if (token.Type == -1)
        break;
}
Result:
[#-1,3:3='3',<5>,1:3]
[#-1,4:4='<EOF>',<-1>,1:4]
So, I test this grammar with input
2..3
I expect that the result will be the following:
NUMBER["2"] DOT["."] DOT["."] NUMBER["3"]
So, what am I doing wrong? Thank you!
Testing with the Java target:
TestGrammar lexer = new TestGrammar(new ANTLRInputStream("2..3"));
for (Token t : lexer.getAllTokens()) {
    System.out.printf("%s -> %s\n", t.getText(), TestGrammar.VOCABULARY.getSymbolicName(t.getType()));
}
produces the following output:
2 -> NUMBER
. -> DOT
. -> DOT
3 -> NUMBER
So, the grammar is correct. I doubt that the C# runtime would produce anything other than what I posted (with the C# equivalent test class). If it still doesn't work, please edit your question and add some code that demonstrates how to reproduce the error(s) you get.
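To see why "2..3" should tokenize as NUMBER DOT DOT NUMBER, the lexer's maximal-munch behaviour can be sketched without ANTLR at all. The following hand-rolled lexer is mine, not ANTLR's: it tentatively matches the fractional part of NUMBER and backs off when the dot is not followed by a digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled sketch of the lexing the grammar describes: NUMBER is digits
// optionally followed by '.' digits; on input "2..3" the lexer must back off
// after "2." because the next character is another '.', not a digit.
public class MiniLexer {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                // Tentatively match the fractional part; back off if no digit follows the dot.
                if (i + 1 < input.length() && input.charAt(i) == '.'
                        && Character.isDigit(input.charAt(i + 1))) {
                    i++; // consume '.'
                    while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                }
                tokens.add("NUMBER[\"" + input.substring(start, i) + "\"]");
            } else if (c == '.') {
                tokens.add("DOT[\".\"]");
                i++;
            } else {
                i++; // skip whitespace and anything else
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2..3")); // prints [NUMBER["2"], DOT["."], DOT["."], NUMBER["3"]]
    }
}
```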

Translating EBNF into Irony

I am using Irony to create a parser for a scripting language, but I've come across a little problem: how do I translate an EBNF expression like this into Irony?
'(' [ Ident { ',' Ident } ] ')'
I already tried some tricks like
Chunk.Rule = (Ident | Ident + "," + Chunk);
CallArgs.Rule = '(' + Chunk + ')' | '(' + ')';
But it's ugly and I'm not even sure it works the way it should (I haven't tried it yet...). Does anyone have any suggestions?
EDIT:
I found these helper methods (MakeStarRule, MakePlusRule) but couldn't figure out how to use them, because of the complete lack of documentation for Irony... Does anyone have a clue?
// Declare the non-terminals
var Ident = new NonTerminal("Ident");
var IdentList = new NonTerminal("IdentList");
var CallArgs = new NonTerminal("CallArgs");

// Rules
IdentList.Rule = MakePlusRule(IdentList, ToTerm(","), Ident);
CallArgs.Rule = ToTerm("(") + ")" | ToTerm("(") + IdentList + ")";
Ident.Rule = // specify whatever Ident is (I assume you mean an identifier of some kind).
You can use the MakePlusRule helper method to define a one-or-more occurrence of some term. MakePlusRule basically just presents your terms as the standard recursive list idiom:
Ident | IdentList + "," + Ident
It also marks the non-terminal as representing a list, which tells the parser to unfold the nested list tree into a convenient flat list of child nodes.
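To make concrete what '(' [ Ident { ',' Ident } ] ')' accepts, here is a small hand-written sketch in plain Java (not Irony; all names are mine): an optional comma-separated identifier list in parentheses, the same shape the recursive list idiom above captures.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the EBNF '(' [ Ident { ',' Ident } ] ')':
// parentheses around an optional, comma-separated identifier list.
public class ArgListParser {
    public static List<String> parse(String s) {
        String body = s.trim();
        if (!body.startsWith("(") || !body.endsWith(")"))
            throw new IllegalArgumentException("expected parentheses");
        String inner = body.substring(1, body.length() - 1).trim();
        List<String> idents = new ArrayList<>();
        if (inner.isEmpty()) return idents;          // the [ ... ] optional part
        for (String part : inner.split(",")) {       // the { ',' Ident } repetition
            String ident = part.trim();
            if (!ident.matches("[A-Za-z_][A-Za-z0-9_]*"))
                throw new IllegalArgumentException("bad identifier: " + ident);
            idents.add(ident);
        }
        return idents;
    }

    public static void main(String[] args) {
        System.out.println(parse("(a, b, c)")); // prints [a, b, c]
        System.out.println(parse("()"));        // prints []
    }
}
```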

How can my ANTLR parser (not lexer) trigger a lexical "include" (not AST splice)?

The ANTLR website describes two approaches to implementing "include" directives. The first approach is to recognize the directive in the lexer and include the file lexically (by pushing the CharStream onto a stack and replacing it with one that reads the new file); the second is to recognize the directive in the parser, launch a sub-parser to parse the new file, and splice in the AST generated by the sub-parser. Neither of these are quite what I need.
In the language I'm parsing, recognizing the directive in the lexer is impractical for a few reasons:
There is no self-contained character pattern that always means "this is an include directive". For example, Include "foo"; at top level is an include directive, but in Array bar --> Include "foo"; or Constant Include "foo"; the word Include is an identifier.
The name of the file to include may be given as a string or as a constant identifier, and such constants can be defined with arbitrarily complex expressions.
So I want to trigger the inclusion from the parser. But to perform the inclusion, I can't launch a sub-parser and splice the AST together; I have to splice the tokens. It's legal for a block to begin with { in the main file and be terminated by } in the included file. A file included inside a function can even close the function definition and start a new one.
It seems like I'll need something like the first approach but at the level of TokenStreams instead of CharStreams. Is that a viable approach? How much state would I need to keep on the stack, and how would I make the parser switch back to the original token stream instead of terminating when it hits EOF? Or is there a better way to handle this?
==========
Here's an example of the language, demonstrating that blocks opened in the main file can be closed in the included file (and vice versa). Note that the # before Include is required when the directive is inside a function, but optional outside.
main.inf:
[ Main;
print "This is Main!";
if (0) {
#include "other.h";
print "This is OtherFunction!";
];
other.h:
} ! end if
]; ! end Main
[ OtherFunction;
One possibility is to let your parser, for each Include statement, create a new instance of your lexer and insert the tokens that new lexer creates at the index the parser is currently at (see the insertTokens(...) method in the parser's @parser::members block).
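Stripped of the ANTLR runtime, the essence of that splice is just inserting one token list into another at the parser's current index. A plain-list sketch (token strings and names are mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-list sketch of the splice insertTokens(...) performs: take the
// included file's tokens (minus its EOF) and insert them at the current
// index of the main token stream, so the parser sees them next.
public class TokenSplice {
    public static List<String> splice(List<String> main, int index, List<String> included) {
        List<String> spliced = new ArrayList<>(main);
        List<String> extra = new ArrayList<>(included);
        extra.remove(extra.size() - 1);   // drop the included stream's EOF
        spliced.addAll(index, extra);
        return spliced;
    }

    public static void main(String[] args) {
        List<String> main = Arrays.asList("if", "(", "0", ")", "{", "EOF");
        List<String> other = Arrays.asList("}", "]", ";", "EOF");
        // insert other.h's tokens just before main's EOF
        System.out.println(splice(main, 5, other)); // prints [if, (, 0, ), {, }, ], ;, EOF]
    }
}
```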
Here's a quick demo:
Inform6.g
grammar Inform6;
options {
  output=AST;
}

tokens {
  STATS;
  F_DECL;
  F_CALL;
  EXPRS;
}

@parser::header {
  import java.util.Map;
  import java.util.HashMap;
}

@parser::members {
  private Map<String, String> memory = new HashMap<String, String>();

  private void putInMemory(String key, String str) {
    String value;
    if (str.startsWith("\"")) {
      value = str.substring(1, str.length() - 1);
    }
    else {
      value = memory.get(str);
    }
    memory.put(key, value);
  }

  private void insertTokens(String fileName) {
    // possibly strip quotes from `fileName` in case it's a Str-token
    try {
      CommonTokenStream thatStream = new CommonTokenStream(new Inform6Lexer(new ANTLRFileStream(fileName)));
      thatStream.fill();
      List extraTokens = thatStream.getTokens();
      extraTokens.remove(extraTokens.size() - 1); // remove EOF
      CommonTokenStream thisStream = (CommonTokenStream)this.getTokenStream();
      thisStream.getTokens().addAll(thisStream.index(), extraTokens);
    } catch(Exception e) {
      e.printStackTrace();
    }
  }
}
parse
: stats EOF -> stats
;
stats
: stat* -> ^(STATS stat*)
;
stat
: function_decl
| function_call
| include
| constant
| if_stat
;
if_stat
: If '(' expr ')' '{' stats '}' -> ^(If expr stats)
;
function_decl
: '[' id ';' stats ']' ';' -> ^(F_DECL id stats)
;
function_call
: Id exprs ';' -> ^(F_CALL Id exprs)
;
include
: Include Str ';' {insertTokens($Str.text);} -> /* omit statement from AST */
| Include id ';' {insertTokens(memory.get($id.text));} -> /* omit statement from AST */
;
constant
: Constant id expr ';' {putInMemory($id.text, $expr.text);} -> ^(Constant id expr)
;
exprs
: expr (',' expr)* -> ^(EXPRS expr+)
;
expr
: add_expr
;
add_expr
: mult_expr (('+' | '-')^ mult_expr)*
;
mult_expr
: atom (('*' | '/')^ atom)*
;
atom
: id
| Num
| Str
| '(' expr ')' -> expr
;
id
: Id
| Include
;
Comment : '!' ~('\r' | '\n')* {skip();};
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If : 'if';
Include : 'Include';
Constant : 'Constant';
Id : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')+;
Str : '"' ~'"'* '"';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
main.inf
Constant IMPORT "other.h";
[ Main;
print "This is Main!";
if (0) {
Include IMPORT;
print "This is OtherFunction!";
];
other.h
} ! end if
]; ! end Main
[ OtherFunction;
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // create lexer & parser
        Inform6Lexer lexer = new Inform6Lexer(new ANTLRFileStream("main.inf"));
        Inform6Parser parser = new Inform6Parser(new CommonTokenStream(lexer));
        // print the AST
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT((CommonTree)parser.parse().getTree());
        System.out.println(st);
    }
}
To run the demo, do the following on the command line:
java -cp antlr-3.3.jar org.antlr.Tool Inform6.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
The output you'll see corresponds to the following AST:

Using ANTLR 3.3?

I'm trying to get started with ANTLR and C# but I'm finding it extraordinarily difficult due to the lack of documentation/tutorials. I've found a couple half-hearted tutorials for older versions, but it seems there have been some major changes to the API since.
Can anyone give me a simple example of how to create a grammar and use it in a short program?
I've finally managed to get my grammar file compiling into a lexer and parser, and I can get those compiled and running in Visual Studio (after having to recompile the ANTLR source, because the C# binaries seem to be out of date too, and the source doesn't compile without some fixes). But I still have no idea what to do with my parser/lexer classes. Supposedly they can produce an AST given some input... and then I should be able to do something fancy with that.
Let's say you want to parse simple expressions consisting of the following tokens:
- subtraction (also unary);
+ addition;
* multiplication;
/ division;
(...) grouping (sub) expressions;
integer and decimal numbers.
An ANTLR grammar could look like this:
grammar Expression;
options {
  language=CSharp2;
}
parse
: exp EOF
;
exp
: addExp
;
addExp
: mulExp (('+' | '-') mulExp)*
;
mulExp
: unaryExp (('*' | '/') unaryExp)*
;
unaryExp
: '-' atom
| atom
;
atom
: Number
| '(' exp ')'
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Now to create a proper AST, you add output=AST; in your options { ... } section, and you mix some "tree operators" in your grammar defining which tokens should be the root of a tree. There are two ways to do this:
add ^ and ! after your tokens. The ^ causes the token to become a root and the ! excludes the token from the AST;
by using "rewrite rules": ... -> ^(Root Child Child ...).
Take the rule foo for example:
foo
: TokenA TokenB TokenC TokenD
;
and let's say you want TokenB to become the root and TokenA and TokenC to become its children, and you want to exclude TokenD from the tree. Here's how to do that using option 1:
foo
: TokenA TokenB^ TokenC TokenD!
;
and here's how to do that using option 2:
foo
: TokenA TokenB TokenC TokenD -> ^(TokenB TokenA TokenC)
;
So, here's the grammar with the tree operators in it:
grammar Expression;
options {
  language=CSharp2;
  output=AST;
}

tokens {
  ROOT;
  UNARY_MIN;
}

@parser::namespace { Demo.Antlr }
@lexer::namespace { Demo.Antlr }
parse
: exp EOF -> ^(ROOT exp)
;
exp
: addExp
;
addExp
: mulExp (('+' | '-')^ mulExp)*
;
mulExp
: unaryExp (('*' | '/')^ unaryExp)*
;
unaryExp
: '-' atom -> ^(UNARY_MIN atom)
| atom
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
I also added a Space rule to ignore any whitespace in the source file, and added some extra tokens and namespaces for the lexer and parser. Note that the order is important (options { ... } first, then tokens { ... } and finally the @... {}-namespace declarations).
That's it.
Now generate a lexer and parser from your grammar file:
java -cp antlr-3.2.jar org.antlr.Tool Expression.g
and put the .cs files in your project together with the C# runtime DLL's.
You can test it using the following class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
    class MainClass
    {
        public static void Preorder(ITree Tree, int Depth)
        {
            if (Tree == null)
            {
                return;
            }
            for (int i = 0; i < Depth; i++)
            {
                Console.Write(" ");
            }
            Console.WriteLine(Tree);
            Preorder(Tree.GetChild(0), Depth + 1);
            Preorder(Tree.GetChild(1), Depth + 1);
        }

        public static void Main(string[] args)
        {
            ANTLRStringStream Input = new ANTLRStringStream("(12.5 + 56 / -7) * 0.5");
            ExpressionLexer Lexer = new ExpressionLexer(Input);
            CommonTokenStream Tokens = new CommonTokenStream(Lexer);
            ExpressionParser Parser = new ExpressionParser(Tokens);
            ExpressionParser.parse_return ParseReturn = Parser.parse();
            CommonTree Tree = (CommonTree)ParseReturn.Tree;
            Preorder(Tree, 0);
        }
    }
}
which produces the following output:
ROOT
 *
  +
   12.5
   /
    56
    UNARY_MIN
     7
  0.5
which corresponds to the following AST:
(diagram created using graph.gafol.net)
Note that ANTLR 3.3 has just been released and the CSharp target is "in beta". That's why I used ANTLR 3.2 in my example.
In case of rather simple languages (like my example above), you could also evaluate the result on the fly without creating an AST. You can do that by embedding plain C# code inside your grammar file, and letting your parser rules return a specific value.
Here's an example:
grammar Expression;
options {
  language=CSharp2;
}

@parser::header { using System.Globalization; } // for CultureInfo, used in the atom rule
@parser::namespace { Demo.Antlr }
@lexer::namespace { Demo.Antlr }
parse returns [double value]
: exp EOF {$value = $exp.value;}
;
exp returns [double value]
: addExp {$value = $addExp.value;}
;
addExp returns [double value]
: a=mulExp {$value = $a.value;}
( '+' b=mulExp {$value += $b.value;}
| '-' b=mulExp {$value -= $b.value;}
)*
;
mulExp returns [double value]
: a=unaryExp {$value = $a.value;}
( '*' b=unaryExp {$value *= $b.value;}
| '/' b=unaryExp {$value /= $b.value;}
)*
;
unaryExp returns [double value]
: '-' atom {$value = -1.0 * $atom.value;}
| atom {$value = $atom.value;}
;
atom returns [double value]
: Number {$value = Double.Parse($Number.Text, CultureInfo.InvariantCulture);}
| '(' exp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){Skip();}
;
which can be tested with the class:
using System;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
namespace Demo.Antlr
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string expression = "(12.5 + 56 / -7) * 0.5";
            ANTLRStringStream Input = new ANTLRStringStream(expression);
            ExpressionLexer Lexer = new ExpressionLexer(Input);
            CommonTokenStream Tokens = new CommonTokenStream(Lexer);
            ExpressionParser Parser = new ExpressionParser(Tokens);
            Console.WriteLine(expression + " = " + Parser.parse());
        }
    }
}
and produces the following output:
(12.5 + 56 / -7) * 0.5 = 2.25
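The same evaluation the embedded actions perform can be sketched as a stand-alone recursive-descent evaluator. This is plain Java, not the generated parser; the method structure mirrors the grammar's addExp/mulExp/unaryExp/atom rules:

```java
// Hand-written recursive-descent evaluator mirroring the grammar above:
// addExp handles +/-, mulExp handles */÷, unaryExp handles unary minus,
// atom handles numbers and parenthesized sub-expressions.
public class ExprEval {
    private final String s;
    private int i;

    private ExprEval(String input) { this.s = input.replace(" ", ""); }

    public static double eval(String input) { return new ExprEval(input).addExp(); }

    private double addExp() {
        double v = mulExp();
        while (i < s.length() && (s.charAt(i) == '+' || s.charAt(i) == '-'))
            v = s.charAt(i++) == '+' ? v + mulExp() : v - mulExp();
        return v;
    }

    private double mulExp() {
        double v = unaryExp();
        while (i < s.length() && (s.charAt(i) == '*' || s.charAt(i) == '/'))
            v = s.charAt(i++) == '*' ? v * unaryExp() : v / unaryExp();
        return v;
    }

    private double unaryExp() {
        if (i < s.length() && s.charAt(i) == '-') { i++; return -unaryExp(); }
        return atom();
    }

    private double atom() {
        if (s.charAt(i) == '(') {
            i++;                       // consume '('
            double v = addExp();
            i++;                       // consume ')'
            return v;
        }
        int start = i;
        while (i < s.length() && (Character.isDigit(s.charAt(i)) || s.charAt(i) == '.')) i++;
        return Double.parseDouble(s.substring(start, i));
    }

    public static void main(String[] args) {
        System.out.println(eval("(12.5 + 56 / -7) * 0.5")); // prints 2.25
    }
}
```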
EDIT
In the comments, Ralph wrote:
Tip for those using Visual Studio: you can put something like java -cp "$(ProjectDir)antlr-3.2.jar" org.antlr.Tool "$(ProjectDir)Expression.g" in the pre-build events, then you can just modify your grammar and run the project without having to worry about rebuilding the lexer/parser.
Have you looked at Irony.net? It's aimed at .Net and therefore works really well, has proper tooling, proper examples and just works. The only problem is that it is still a bit 'alpha-ish' so documentation and versions seem to change a bit, but if you just stick with a version, you can do nifty things.
p.s. sorry for the bad answer where you ask a problem about X and someone suggests something different using Y ;^)
My personal experience is that before learning ANTLR on C#/.NET, you should spend enough time learning ANTLR on Java. That gives you knowledge of all the building blocks, which you can later apply to C#/.NET.
I wrote a few blog posts recently,
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-i/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-ii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iii/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-iv/
http://www.lextm.com/index.php/2012/07/how-to-use-antlr-on-net-part-v/
The assumption is that you are familiar with ANTLR on Java and are ready to migrate your grammar file to C#/.NET.
There is a great article on how to use ANTLR and C# together here:
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx
It's a "how it was done" article by the creator of NCalc, which is a mathematical expression evaluator for C#: http://ncalc.codeplex.com
You can also download the grammar for NCalc here:
http://ncalc.codeplex.com/SourceControl/changeset/view/914d819f2865#Grammar%2fNCalc.g
An example of how NCalc works:
Expression e = new Expression("Round(Pow(Pi, 2) + Pow([Pi2], 2) + X, 2)");
e.Parameters["Pi2"] = new Expression("Pi * Pi");
e.Parameters["X"] = 10;
e.EvaluateParameter += delegate(string name, ParameterArgs args)
{
    if (name == "Pi")
        args.Result = 3.14;
};
Debug.Assert(117.07 == e.Evaluate());
Hope it's helpful!

Can you improve this 'lines of code algorithm' in F#?

I've written a little script to iterate across files in folders to count lines of code.
The heart of the script is this function to count lines of whitespace, comments, and code. (Note that for the moment it is tailored to C# and doesn't know about multi-line comments).
It just doesn't look very nice to me - has anyone got a cleaner version?
// from a list of strings, return a tuple with counts of (whitespace, comments, code)
let loc (arr:List<string>) =
    let innerloc (whitesp, comment, code) (l:string) =
        let s = l.Trim([|' ';'\t'|]) // trim leading/trailing whitespace
        match s with
        | "" -> (whitesp + 1, comment, code)                        // blank lines
        | "{" -> (whitesp + 1, comment, code)                       // opening blocks
        | "}" -> (whitesp + 1, comment, code)                       // closing blocks
        | _ when s.StartsWith("#") -> (whitesp + 1, comment, code)  // regions
        | _ when s.StartsWith("//") -> (whitesp, comment + 1, code) // comments
        | _ -> (whitesp, comment, code + 1)
    List.fold_left innerloc (0,0,0) arr
I think what you have is fine, but here's some variety to mix it up. (This solution repeats your problem of ignoring trailing whitespace.)
type Line =
    | Whitespace = 0
    | Comment = 1
    | Code = 2

let Classify (l:string) =
    let s = l.TrimStart([|' ';'\t'|])
    match s with
    | "" | "{" | "}" -> Line.Whitespace
    | _ when s.StartsWith("#") -> Line.Whitespace
    | _ when s.StartsWith("//") -> Line.Comment
    | _ -> Line.Code

let Loc (arr:list<_>) =
    let sums = Array.create 3 0
    arr
    |> List.iter (fun line ->
        let i = Classify line |> int
        sums.[i] <- sums.[i] + 1)
    sums
"Classify" as a separate entity might be useful in another context.
A better site for this might be refactormycode - it's tailored exactly for these questions.
Can't see much wrong with that other than the fact you will count a single brace with trailing spaces as code instead of whitespace.
