Sprache: left recursion in grammar - C#

I am developing a parser for an SQL-like language and I am having trouble writing some of the grammar rules, such as expression IS NULL and expression IN (expression1, expression2, ...), with correct precedence between logical and mathematical operators.
I uploaded a test project to GitHub (https://github.com/anpv/SpracheTest/), but this variant is not good.
I tried to use the following rules:
private static readonly Parser<AstNode> InOperator =
    from expr in Parse.Ref(() => Expression)
    from inKeyword in Parse.IgnoreCase("in").Token()
    from values in Parse
        .Ref(() => Expression)
        .DelimitedBy(Comma)
        .Contained(OpenParenthesis, CloseParenthesis)
    select new InOperator(expr, values);

private static readonly Parser<AstNode> IsNullOperator =
    from expr in Parse.Ref(() => Expression)
    from isNullKeyword in Parse
        .IgnoreCase("is")
        .Then(_ => Parse.WhiteSpace.AtLeastOnce())
        .Then(_ => Parse.IgnoreCase("null"))
    select new IsNullOperator(expr);

private static readonly Parser<AstNode> Equality =
    Parse.ChainOperator(Eq, IsNullOperator.Or(InOperator).Or(Additive), MakeBinary);
which throws a ParseException for code like ScriptParser.ParseExpression("1 is null") or ScriptParser.ParseExpression("1 in (1, 2, 3)"): "Parsing failure: Left recursion in the grammar.".
How can I look ahead for Expression, or is there another way to solve this problem?

The answer, unfortunately, is that Sprache cannot parse a left-recursive grammar. While researching this question (which is also how I found yours), I stumbled on comments in the Sprache source code noting that buggy support for left-recursive grammars had been removed - see the source code.
To deal with this problem you need to reorganize how you do your parsing. This is a common problem to deal with when writing a simple expression parser, for example, and searching the web turns up lots of discussion of how to remove left recursion from a grammar, in particular for expressions.
In your case, I expect you'll need to do something like:
term       := everything simple in an expression (like "1", "2", "3", etc.)
expression := term [ IN ( expression* ) | IS NULL | "+" expression | "-" expression | etc. ]
or similar - basically, you have to unwind the recursion yourself. Doing that fixed my own issues with expressions. I suspect any basic compiler book has a section on how to "normalize" a grammar.
It makes building whatever object you return from the parser a bit more of a pain: in the select clause, instead of "select new Expression(arg1, arg2)", I changed it to a function call, and the function decides on the specific object to return depending on what the arguments were. A sketch of the unwound rule follows.
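For illustration, here is a minimal sketch of what the unwound rule might look like in Sprache. This is only a sketch under assumptions, not the asker's actual code: Term, Expression, Comma, OpenParenthesis, CloseParenthesis, AstNode, InOperator and IsNullOperator are the names from the question's project, and Term is presumed to parse anything that does not itself start with Expression:

private static readonly Parser<AstNode> Postfix =
    from expr in Parse.Ref(() => Term)  // a simple operand, never a full Expression
    from tail in
        // optional "IS NULL" postfix ...
        (from isKw in Parse.IgnoreCase("is").Token()
         from nullKw in Parse.IgnoreCase("null").Token()
         select (Func<AstNode, AstNode>)(e => new IsNullOperator(e)))
        // ... or optional "IN (expr, expr, ...)" postfix
        .Or(from inKw in Parse.IgnoreCase("in").Token()
            from values in Parse.Ref(() => Expression)
                                .DelimitedBy(Comma)
                                .Contained(OpenParenthesis, CloseParenthesis)
            select (Func<AstNode, AstNode>)(e => new InOperator(e, values)))
        .Optional()
    // apply the postfix if one was present, otherwise keep the plain term
    select tail.IsDefined ? tail.Get()(expr) : expr;

Because the rule consumes a Term before it ever looks for "is" or "in", the parser never re-enters Expression at the same input position, which is exactly what the left-recursion failure guards against.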

Related

Matching and replacing function expressions

I need to do some very light parsing of C# (actually transpiled Razor code) to replace a list of function calls with textual replacements.
If given a set containing {"Foo.myFunc" : "\"def\"" } it should replace this code:
var res = "abc" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));
with this:
var res = "abc" + "def"
I don't care about the nested expressions.
This seems fairly trivial and I think I should be able to avoid building an entire C# parser using something like this for every member of the mapping set:
Find expression start (e.g. Foo.myFunc).
Push()/Pop() parentheses on a Stack until Count == 0.
Mark this as expression stop.
Replace everything from expression start until expression stop.
But maybe I don't need to. Is there a (possibly built-in) .NET library that can do this for me? Counting is not possible in the family of languages that regular expressions belong to, but maybe the extended regex syntax in C# can handle this somehow using back references?
edit:
As the comments to this answer demonstrate, simply counting brackets will not be sufficient in general, as something like trollMe("(") will throw off those algorithms. Only true parsing would then suffice, I guess.
The trick for a normal string will be:
(?>"(\\"|[^"])*")
A verbatim string:
(?>#"(""|[^"])*")
Maybe this can help, but I'm not sure that this will work in all cases:
<func>(?=\()((?>/\*.*?\*/)|(?>@"(""|[^"])*")|(?>"(\\"|[^"])*")|\r?\n|[^()"]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))
Replace <func> with your function name.
Needless to say, trollMe("\"(", "((", @"abc""de((f") works as expected.
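For completeness, here is a hedged usage sketch; the function name and replacement value are the ones from the question, and the pattern is the one above with the quotes doubled for a C# verbatim string:

using System;
using System.Text.RegularExpressions;

// Escape the function name so '.' is matched literally, then append the
// balancing-group pattern: (?<open>\() pushes on '(' and (?<-open>\)) pops
// on ')', while (?(open)(?!)) fails the match until the stack is empty.
var func = Regex.Escape("Foo.myFunc");
var pattern = func +
    @"(?=\()((?>/\*.*?\*/)|(?>@""(""""|[^""])*"")|(?>""(\\""|[^""])*"")|\r?\n|[^()""]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))";

var input = @"var res = ""abc"" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));";
var output = Regex.Replace(input, pattern, "\"def\"");
// output: var res = "abc" + "def";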

Build a simple lambda with dynamic expressions

I'm trying to understand how dynamic expressions work. So for learning purposes I'd like to do the following:
I have an object which I can currently access with a Linq statement that uses a lambda expression:
someObj.IncludeStory(x => x.News);
What I'd like to do is replace the lambda x => x.News with a string, for example:
string myLambda = "x => x.News";
someObj.IncludeStory(myLambda);
Obviously you can't do it like that, but as far as I understand you can achieve somewhat the same with dynamic expressions(?).
I've been looking at the System.Linq.Dynamic source code to get an idea of how this should work, but that only confuses me more. I think that library is far too complex for what I want. I don't need sorting, grouping and all that fancy stuff.
Basically my questions are:
Can I use dynamic expressions to generate a lambda like this dynamically: x => x.News?
If so, then how would I do this with a Dynamic Expression?
I find it hard to get started with this.
What I've tried is something like:
var expression = @"IncludeStory(x => x.News)";
var p = Expression.Parameter(someObj.GetType(), "News");
var e = myAlias.DynamicExpression.ParseLambda(new[] { p }, null, expression);
var result1 = e.Compile().DynamicInvoke(someObj);
You can use DynamicExpression.ParseLambda to convert a string into an expression tree. For more detail, see the sample project that ships with VS2010 under C:\Program Files (x86)\Microsoft Visual Studio 10.0\Samples\1033 -> CSharpSamples -> LinqSamples -> DynamicQuery (I think it is also part of the installation of later versions).
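If the goal is just to understand how such a lambda is assembled, it can also be built by hand with the Expression factory methods, without any extra library. A minimal sketch, where Story, its News property, and someStory are placeholders for your actual types and objects:

using System;
using System.Linq.Expressions;

// the "x" in x => ...
var x = Expression.Parameter(typeof(Story), "x");
// x.News - the property is resolved by name, so "News" could come from a string
var body = Expression.Property(x, "News");
// x => x.News, typed as Expression<Func<Story, News>>
var lambda = Expression.Lambda<Func<Story, News>>(body, x);
// compile and invoke it, or hand the tree to an API that accepts expressions
var news = lambda.Compile()(someStory);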

Pattern matching with extraction of found substrings to variables

I have some strings in a defined format like Foo.<Whatever>.$(Something) and I would like to split them into parts and have each part automatically assigned to a variable.
I once wrote something resembling the bash/shell pipe command option '<' with C# classes and operator overloading. The usage was something like
ParseExpression ex = pex("item1") > ".<" > pex("item2") > ">.$(" > pex("item3") > ")";
ParseResult r = new ParseResult(ex, "Foo.<Whatever>.$(Something)");
ParseResult then had a Dictionary with the keys item1 through item3 set to the strings found in the given string. The method pex generated an object that could be used with the > operator, eventually yielding a chain of ParseExpressionParts which constitutes a ParseExpression.
I don't have the code at hand at the moment, and before I start coding it from scratch again, I thought I'd better ask whether someone has already done and published something like this.
The parse expressions remind me of parser combinators like Parsec and FParsec (for F#). How complex is the syntax going to be? As it is, it could be handled by a regex with groups.
If you want to create a more complex grammar using parser combinators, you can use FParsec, one of the better-known parser combinator libraries, targeting F#. In general, functional languages like F# are used a lot in such situations. CSharp-monad is a parser combinator targeting C#, though the project isn't very active.
You can also use a full-blown parser generator like ANTLR 4. ANTLR is used by ASP.NET MVC to parse Razor syntax views. ANTLR 4 creates a parse tree and allows you to process it with either a Visitor or a Listener, similar to DOM versus SAX processing. A Listener calls your code as soon as an element is encountered (e.g. the opening <, the content, etc.), while the Visitor works on the finished tree.
The Visual Studio extension for ANTLR will generate both the parser classes as well as base Visitor and Listener classes for your grammar. The NetBeans-based ANTLRWorks IDE makes creating and testing grammars very easy.
A rough grammar for your example would be:
format: tag '.' '<' category '>' '.' '$' '(' value ')';
tag : ID;
category : ID;
value : ID;
ID : [a-zA-Z0-9]+;
Or you could define keywords like FOO : 'FOO' that have special meaning for your grammar. A Visitor or Listener could then handle the tag, e.g. to format a string, execute an operation on the values, etc.
There are no hard and fast rules. Personally, I use regular expressions for simpler cases, e.g. processing relatively simple log files, and ANTLR for more complex cases like screen-scraping mainframe data. I haven't looked into parser combinators as I never had the time to get comfortable with F#. They would be really handy, though, for handling some messed-up log4net log files.
I started with Heinzi's suggestion and eventually came up with the following code:
const string tokenPrefix = "px";
const string tokenSuffix = "sx";
const string tokenVar = "var";

string r = string.Format(@"(?<{0}>.*)\$\((?<{1}>.*)\)(?<{2}>.*)",
    tokenPrefix, tokenVar, tokenSuffix);

Regex regex = new Regex(r);
Match match = regex.Match("Foo$(Something)Else");
if (match.Success)
{
    string prefix = match.Groups[tokenPrefix].Value;   // = "Foo"
    string variable = match.Groups[tokenVar].Value;    // = "Something"
    string suffix = match.Groups[tokenSuffix].Value;   // = "Else"
}
After talking to a colleague about this, I was told to consider using the C# parser construction library "Sprache" (which sits somewhere between regexes and ANTLR-like toolsets) once my pattern usage increases and I want better maintainability.

How to customize Lucene.NET to search for words with symbols without case-sensitivity (e.g. "C#" or ".net")?

The standard analyzer does not work: from what I can understand, it changes these into searches for c and net.
The WhitespaceAnalyzer would work, but it's case-sensitive.
The general rule is that search should work like Google, so I'm hoping this is a configuration thing (considering .net and c# have been out there for a while) or that there's a workaround.
Per the suggestions below, I tried the custom WhitespaceAnalyzer, but keywords separated by a comma with no space are not handled correctly, e.g.
java,.net,c#,oracle
will not be returned when searching, which is incorrect.
I came across PatternAnalyzer which is used to split the tokens but can't figure out how to use it in this scenario.
I'm using Lucene.Net 3.0.3 and .NET 4.0
Write your own custom analyzer class similar to SynonymAnalyzer in Lucene.Net – Custom Synonym Analyzer. Your override of TokenStream could solve this by pipelining the stream using WhitespaceTokenizer and LowerCaseFilter.
Remember that your indexer and searcher need to use the same analyzer.
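As a minimal sketch of what that could look like against Lucene.Net 3.0.3 (the class name here is invented):

using System.IO;
using Lucene.Net.Analysis;

public class LowerCaseWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Split only on whitespace so "c#" and ".net" survive as whole tokens,
        // then lower-case them so matching is case-insensitive.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

Pass an instance of it to both your IndexWriter and your QueryParser so that index terms and query terms are normalized identically.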
Update: Handling multiple comma-delimited keywords
If you only need to handle unspaced comma-delimited keywords for searching, not indexing, then you could convert the search expression expr as below.
expr = expr.Replace(',', ' ');
Then pass expr to the QueryParser. If you want to support other delimiters like ';' you could do it like this:
var terms = expr.Split(new char[] { ',', ';'} );
expr = String.Join(" ", terms);
But you also need to check for a phrase expression like "sybase,c#,.net,oracle" (where the expression includes the quote " characters), which should not be converted (the user is looking for an exact match):
expr = expr.Trim();
if (!(expr.StartsWith("\"") && expr.EndsWith("\"")))
{
    expr = expr.Replace(',', ' ');
}
The expression might include both a phrase and some keywords, like this:
"sybase,c#,.net,oracle" server,c#,.net,sybase
Then you need to parse and translate the search expression to this:
"sybase,c#,.net,oracle" server c# .net sybase
If you also need to handle unspaced comma-delimited keywords for indexing, then you need to parse the text for them and store them in a distinct field, e.g. Keywords (which must be associated with your custom analyzer). Then your search handler needs to convert a search expression like this:
server,c#,.net,sybase
to this:
Keywords:server Keywords:c# Keywords:.net Keywords:sybase
or more simply:
Keywords:(server, c#, .net, sybase)
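Once the commas have been normalized to spaces, that grouped form can be handed to the query parser together with the custom analyzer; a hedged sketch, where LowerCaseWhitespaceAnalyzer is the class sketched earlier:

using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

var parser = new QueryParser(Version.LUCENE_30, "Keywords", new LowerCaseWhitespaceAnalyzer());
// The whitespace-based analyzer keeps c# and .net intact as query terms.
Query query = parser.Parse("Keywords:(server c# .net sybase)");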
Use the WhitespaceTokenizer and chain it with a LowerCaseFilter.
Use the same chain at search and index time: by converting everything to lower case, you make the search case-insensitive.
According to your problem description, that should work and be simple to implement.
For others who might be looking for an answer as well: the final solution turned out to be a custom TokenFilter, and a custom Analyzer using that token filter along with WhitespaceTokenizer, LowerCaseFilter, etc. - all in all, about 30 lines of code. I will create a blog post and post the link here when I do (I have to create a blog first!).

Detect parentheses in BinaryExpression

I am building an expression analyser from which I would like to generate database query code. I've gotten quite far, but am stuck parsing BinaryExpressions accurately. It's quite easy to break them up into Left and Right, but I need to detect parentheses and generate my code accordingly, and I cannot see how to do this.
An example [please ignore the flawed logic :)]:
a => a.Line2 != "1" && (a.Line2 == "a" || a.Line2 != "b") && !a.Line1.EndsWith("a")
I need to detect the 'set' in the middle and preserve its grouping, but during parsing I cannot see any difference between that expression and a normal BinaryExpression (I would hate to check the string representation for parentheses).
Any help would be appreciated.
(I should probably mention that I'm using C#)
--Edit--
I failed to mention that I'm using the standard .Net Expression classes to build the expressions (System.Linq.Expressions namespace)
--Edit2--
Ok I'm not parsing text into code, I'm parsing code into text. So my Parser class has a method like this:
void FilterWith<T>(Expression<Func<T, bool>> filterExpression);
which allows you to write code like this:
FilterWith<Customer>(c => c.Name == "asd" && c.Surname == "qwe");
which is quite easy to parse using the standard .NET classes. My challenge is parsing this expression:
FilterWith<Customer>(c => c.Name == "asd" && (c.Surname == "qwe" && c.Status == 1) && !c.Disabled)
My challenge is to keep the expressions between parentheses together as a single set. The .NET classes correctly split the parenthesized parts from the others, but give no indication that they form a set due to the parentheses.
I haven't used Expression myself, but if it works anything like any other AST, then the problem is easier to solve than you make it out to be. As another commenter pointed out, you could just put parentheses around all of your binary expressions, and then you won't have to worry about order-of-operations issues.
Alternatively, you could check whether the expression you are generating has lower precedence than the containing expression and, if so, put parentheses around it. So if you have a tree like [* 4 [+ 5 6]] (where tree nodes are represented recursively as [node left-subtree right-subtree]), you would know when writing out the [+ 5 6] subtree that it was contained inside a * operation, which is higher precedence than a + operation and thus requires that any of its immediate subtrees be placed in parentheses. The pseudo-code could be something like this:
function parseBinary(node) {
    if (node.left.operator.precedence < node.operator.precedence)
        write "(" + parseBinary(node.left) + ")"
    else
        write parseBinary(node.left)
    write node.operator
    // and now do the same thing for node.right as you did for node.left above
}
You'll need to have a table of precedence for the various operators, and a way to get at the operator itself to find out what it is and thence what its precedence is. However, I imagine you can figure that part out.
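Translated to the System.Linq.Expressions world, the same idea might look like the sketch below. It is only a sketch under assumptions: the precedence table is abbreviated, and only a few binary node types are handled; everything else falls back to Expression.ToString().

using System;
using System.Linq.Expressions;

static class ExpressionPrinter
{
    // Smaller number = binds more loosely; a child that binds more loosely
    // than its parent needs parentheses around it.
    static int Precedence(ExpressionType t)
    {
        switch (t)
        {
            case ExpressionType.OrElse: return 1;
            case ExpressionType.AndAlso: return 2;
            case ExpressionType.Equal:
            case ExpressionType.NotEqual: return 3;
            default: return 4; // constants, members, calls, ...
        }
    }

    public static string Print(Expression e, int parent = 0)
    {
        var b = e as BinaryExpression;
        if (b == null)
            return e.ToString(); // leaves: members, constants, calls

        int p = Precedence(b.NodeType);
        string op = b.NodeType == ExpressionType.AndAlso ? "&&"
                  : b.NodeType == ExpressionType.OrElse  ? "||"
                  : b.NodeType == ExpressionType.Equal   ? "==" : "!=";
        string s = Print(b.Left, p) + " " + op + " " + Print(b.Right, p);
        // re-insert the parentheses that the tree shape implies
        return p < parent ? "(" + s + ")" : s;
    }
}

For c => c.Name == "asd" && (c.Surname == "qwe" || c.Status == 1), the OrElse node sits below the AndAlso node, so Print re-emits the parentheses even though the tree itself never stores them.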
When building an expression analyzer, you first need a parser, and for that you need a tokenizer.
A tokenizer is a piece of code that, reading an expression, generates tokens (which can be valid or invalid) for a given syntax.
Your parser, using the tokenizer, then reads the expression in the established order (left-to-right, right-to-left, top-to-bottom, whatever you choose) and creates a tree that maps the expression.
Finally, the analyzer interprets the tree, giving the expression its definitive meaning.
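As a toy illustration of that pipeline, here is a hedged sketch of the tokenizer step only; the token set is invented for the example, and a parser would consume these tokens to build the tree:

using System;
using System.Collections.Generic;

enum TokenKind { Identifier, AndAlso, OrElse }

struct Token
{
    public TokenKind Kind;
    public string Text;
}

static class Tokenizer
{
    // Reads the expression left to right and emits one token per
    // whitespace-separated piece; validity checks would go here too.
    public static IEnumerable<Token> Tokenize(string input)
    {
        foreach (var part in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var kind = part == "&&" ? TokenKind.AndAlso
                     : part == "||" ? TokenKind.OrElse
                     : TokenKind.Identifier;
            yield return new Token { Kind = kind, Text = part };
        }
    }
}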
