I'm working on a custom mathematical expression calculator, but I'm having trouble parsing nested conditional expressions like this one:
IIF("M"="M",(IIF(100 < 50,(IIF(2 > 0.45,2,1)),(IIF(2 > 0.45,4,3)))),(IIF(100 < 46,(IIF(2 > 0.45,2,1)),(IIF(2 >0.45,4,3)))))
What I'd like to do is to split the IIF function by commas in order to get its parameters:
Dim condition = "M"="M"
Dim truePart = (IIF(100 < 50,(IIF(2 > 0.45,2,1)),(IIF(2 >0.45,4,3))))
Dim falsePart = (IIF(100 < 46,(IIF(2 > 0.45,2,1)),(IIF(2 >0.45,4,3))))
At the moment I'm using a regex to parse a single IIF function by capturing what is inside the parentheses and then splitting it by commas:
\((.*?)\)
Obviously that doesn't work with such an expression, since it stops at the first closing parenthesis, so I thought about using this to capture all the remaining characters:
\((.*?)\).*
But now I'm not sure how to split it, since using commas is not an option anymore.
The answer from theory is that regular expressions cannot do what you ask because they "cannot count", and matching nested parentheses requires counting.
In practice, however, .NET regular expressions are not strictly regular expressions; they behave like stack machines. A named group such as (?<Group>.*) pushes a capture onto that group's stack, and (?<-Group>) pops one off. With a conditional such as (?(Group)...) you can also test whether the stack is empty.
Out of curiosity, I gave it a try and I believe that
[\(,]([^\(\)]|(?<Par>\()|(?<-Par>\)))*(?(Par)---|[,\)])
should work, where --- is used as an escape sequence. If you understand that "regular expression" right away, then I think you are good to go. Otherwise, I would recommend writing a parser by hand, or you will no longer understand your code five minutes after you have tested it.
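A hand-written alternative could look roughly like the sketch below. It is not the answerer's code: the class and method names are made up for illustration, and it assumes the input ends with the closing parenthesis that matches the first opening one and that string literals use plain double quotes without escapes.

```csharp
using System;
using System.Collections.Generic;

public static class IifSplitter
{
    // Splits the arguments of a call such as IIF(a,b,c) on top-level commas only.
    public static List<string> SplitArguments(string call)
    {
        int open = call.IndexOf('(');
        if (open < 0 || !call.EndsWith(")"))
            throw new FormatException("Expected something like NAME(arg1,arg2,...)");

        var args = new List<string>();
        int depth = 0;          // nesting level relative to the outer call
        bool inString = false;  // true while inside a "..." literal
        int start = open + 1;   // start of the argument being collected

        // Stop before the final ')' because it closes the outer call.
        for (int i = open + 1; i < call.Length - 1; i++)
        {
            char ch = call[i];
            if (ch == '"') inString = !inString;
            else if (inString) continue;
            else if (ch == '(') depth++;
            else if (ch == ')') depth--;
            else if (ch == ',' && depth == 0)
            {
                args.Add(call.Substring(start, i - start).Trim());
                start = i + 1;
            }
        }
        args.Add(call.Substring(start, call.Length - 1 - start).Trim());
        return args;
    }
}
```

Applied to the expression from the question, this yields three parts: the condition "M"="M" and the two parenthesized IIF branches, each of which can be stripped of its outer parentheses and fed back into the same method.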
I need to do some very light parsing of C# (actually transpiled Razor code) to replace a list of function calls with textual replacements.
If given a set containing {"Foo.myFunc" : "\"def\"" } it should replace this code:
var res = "abc" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));
with this:
var res = "abc" + "def"
I don't care about the nested expressions.
This seems fairly trivial and I think I should be able to avoid building an entire C# parser using something like this for every member of the mapping set:
find expression start (e.g. Foo.myFunc)
Push()/Pop() parentheses on a Stack until Count == 0.
Mark this as expression stop
replace everything from expression start until expression stop
But maybe I don't need to ... Is there a (possibly built-in) .NET library that can do this for me? Counting is not possible in the family of languages that RE is in, but maybe the extended regex syntax in C# can handle this somehow using back references?
edit:
As the comments to this answer demonstrate, simply counting brackets will not be sufficient in general, as something like trollMe("(") will throw off those algorithms. Only true parsing would then suffice, I guess (?).
The trick for a normal string will be:
(?>"(\\"|[^"])*")
A verbatim string:
(?>@"(""|[^"])*")
Maybe this can help, but I'm not sure that this will work in all cases:
<func>(?=\()((?>/\*.*?\*/)|(?>@"(""|[^"])*")|(?>"(\\"|[^"])*")|\r?\n|[^()"]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))
Replace <func> with your function name.
Needless to say, trollMe("\"(", "((", @"abc""de((f") works as expected.
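For illustration, the pattern could be wired into a replacement helper roughly as follows. The class and method names are hypothetical, and the sketch assumes the function name should be matched literally, which is why it goes through Regex.Escape.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CallReplacer
{
    // The pattern from the answer without the <func> placeholder,
    // written as a verbatim string (every " is doubled).
    const string BalancedCall =
        @"(?=\()((?>/\*.*?\*/)|(?>@""(""""|[^""])*"")|(?>""(\\""|[^""])*"")|\r?\n|[^()""]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))";

    public static string ReplaceCalls(string source, string functionName, string replacement)
    {
        string pattern = Regex.Escape(functionName) + BalancedCall;
        // A MatchEvaluator inserts the replacement literally, so a '$' in it
        // is not interpreted as a substitution.
        return Regex.Replace(source, pattern, m => replacement);
    }
}
```

Calling ReplaceCalls on the line from the question with "Foo.myFunc" and "\"def\"" should replace the whole call, nested arguments included, and leave the trailing semicolon in place.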
I have a 2 part question:
How can I get the regex expression of an XSD facet and then use it to determine if a string matches the restriction? In my mind, this is "How do I convert XML Schema regex to .NET Regex", but I'm open for suggestions if you have another way for me to do it other than converting the expression.
If the test (#1) fails, how can I use the XSD pattern regex to automatically create a string which does satisfy the constraint?
XmlSchemaDatatype.ParseValue is your answer. If the associated simple type has more facets and you only want to validate against the pattern one(s), find the pattern facet(s) in XmlSchemaSimpleTypeRestriction.Facets, then create a new XmlSchemaSimpleType whose Content is a new XmlSchemaSimpleTypeRestriction containing new pattern facet(s) with the values you collected. Then invoke XmlSchemaDatatype.ParseValue using this newly created simple type.
I would advise against your suggestion in the comment, since the regex "dialects" are different.
I am not aware of such a thing, available for free or otherwise. I am sure it can be done but I never found something that would actually work, when I needed it myself. If you do find one, please share.
It is not too difficult to convert an XML Schema regex to a .NET regex.
Basically, you need to replace a few patterns such as \c and \D with their .NET alternatives, such as \p{_xmlC} and \P{_xmlD}.
You also need to wrap the expression in ^ and $ markers.
.NET implements this in method Preprocess in https://github.com/Microsoft/referencesource/blob/master/System.Xml/System/Xml/Schema/FacetChecker.cs
If you decide to copy-paste the implementation, be careful, though.
You need to replace the loop
for (int position = 0; position < length - 2; position ++)
with
for (int position = 0; position < length - 1; position ++)
because for optimization reasons Preprocess assumes the input expression is enclosed in parentheses.
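The anchoring step matters because an XSD pattern must match the whole value, while an unanchored .NET regex matches any substring. The sketch below shows only that step; it is an assumption-laden illustration, not the framework's Preprocess method, and it is only meaningful for patterns that use syntax common to both dialects. The class and method names are made up.

```csharp
using System.Text.RegularExpressions;

public static class XsdPatternCheck
{
    // Wraps the pattern so it has to match the entire value.
    // The non-capturing group keeps a top-level alternation such as "cat|dog"
    // from being anchored on one side only.
    public static bool MatchesWhole(string xsdPattern, string value)
    {
        return Regex.IsMatch(value, "^(?:" + xsdPattern + ")$");
    }
}
```

For example, Regex.IsMatch("xxAB123yy", "[A-Z]{2}[0-9]{3}") is true because a substring matches, whereas MatchesWhole("[A-Z]{2}[0-9]{3}", "xxAB123yy") is false and MatchesWhole("[A-Z]{2}[0-9]{3}", "AB123") is true.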
Possible Duplicate:
Using C# regular expressions to remove HTML tags
I'm trying to write code that will return only the content of an HTML file. The best way I've figured out is either to eliminate all elements within <...> brackets or to make a list of all the text in between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.
Here's the code I've tried
Regex reg = new Regex(@"<.*>");
file = reg.Replace(file, "");
This works as long as there is only one <...> before a block of text. With any file that has two or more of those elements in sequence, like <...><...>, it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?
Regex quantifiers are greedy by default (they match the longest string they can find). Depending on the language you are using, look for the +? or *? operators, which try the shortest match instead. Otherwise you must build another regex.
Well, the unexpected behavior you're getting is because your regular expression is greedy.
If you change your regex to
Regex reg = new Regex(@"<.*?>");
file = reg.Replace(file, "");
you'll get what you expect.
Also, know that Regex doesn't handle nesting, which HTML has a lot of, and I'd avoid using Regex to parse HTML unless you're trying to match a very specific thing in a specifically formed piece of HTML.
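The difference between the two patterns can be seen side by side in a small sketch (the class and method names are made up for illustration):

```csharp
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Greedy: .* runs to the last '>' on the line, swallowing the text between tags.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, @"<.*>", "");
    }

    // Lazy: .*? stops at the first '>' after each '<'.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, @"<.*?>", "");
    }
}
```

For the input <p>Hello</p><b>World</b>, StripGreedy returns an empty string because the single match spans the entire input, while StripLazy returns HelloWorld.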
I am building an expression analyser from which I would like to generate database query code. I've gotten quite far but am stuck parsing BinaryExpressions accurately. It's quite easy to break them up into Left and Right, but I need to detect parentheses and generate my code accordingly, and I cannot see how to do this.
An example [please ignore the flawed logic :)]:
a => a.Line2 != "1" && (a.Line2 == "a" || a.Line2 != "b") && !a.Line1.EndsWith("a")
I need to detect the 'set' in the middle and preserve its grouping, but I cannot see any difference between this expression and a normal BinaryExpression during parsing (I would hate to check the string representation for parentheses).
Any help would be appreciated.
(I should probably mention that I'm using C#)
--Edit--
I failed to mention that I'm using the standard .Net Expression classes to build the expressions (System.Linq.Expressions namespace)
--Edit2--
Ok I'm not parsing text into code, I'm parsing code into text. So my Parser class has a method like this:
void FilterWith<T>(Expression<Func<T, bool>> filterExpression);
which allows you to write code like this:
FilterWith<Customer>(c => c.Name =="asd" && c.Surname == "qwe");
which is quite easy to parse using the standard .Net classes. The harder case is this expression:
FilterWith<Customer>(c => c.Name == "asd" && (c.Surname == "qwe" && c.Status == 1) && !c.Disabled)
My challenge is to keep the expressions between parentheses as a single set. The .Net classes correctly split the parenthesized parts from the others but give no indication that they form a set because of the parentheses.
I haven't used Expression myself, but if it works anything like any other AST, then the problem is easier to solve than you make it out to be. As another commenter pointed out, just put parentheses around all of your binary expressions and then you won't have to worry about order-of-operations issues.
Alternatively, you could check whether the expression you are generating has lower precedence than the containing expression and, if so, put parentheses around it. So if you have a tree like [* 4 [+ 5 6]] (where tree nodes are represented recursively as [node left-subtree right-subtree]), you would know when writing out the [+ 5 6] subtree that it is contained inside a * operation, which has higher precedence than + and thus requires that any lower-precedence immediate subtree be placed in parentheses. The pseudo-code could be something like this:
function parseBinary(node) {
    // parenthesize a child only when it binds less tightly than this node
    if (node.left.operator.precedence < node.operator.precedence)
        result = "(" + parseBinary(node.left) + ")"
    else
        result = parseBinary(node.left)
    result = result + " " + node.operator
    // and now do the same thing for node.right as you did for node.left above
    return result
}
You'll need to have a table of precedence for the various operators, and a way to get at the operator itself to find out what it is and thence what its precedence is. However, I imagine you can figure that part out.
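As a rough illustration of that idea on System.Linq.Expressions trees, the sketch below writes a filter as SQL-like text. Everything in it is an assumption made for the example: the Customer class, the precedence values, the output syntax, and the small set of supported node types. It also parenthesizes a right-hand child of equal precedence, which is one way to keep a group such as a && (b && c) visible in the output.

```csharp
using System;
using System.Globalization;
using System.Linq.Expressions;

// Hypothetical entity type, modelled on the question's examples.
public class Customer
{
    public string Name { get; set; }
    public string Surname { get; set; }
    public int Status { get; set; }
    public bool Disabled { get; set; }
}

public static class FilterWriter
{
    const int AtomPrecedence = 4; // members, constants, unary operators

    // Assumed precedence table: a higher number binds more tightly.
    static int Precedence(ExpressionType type)
    {
        switch (type)
        {
            case ExpressionType.OrElse: return 1;
            case ExpressionType.AndAlso: return 2;
            case ExpressionType.Equal:
            case ExpressionType.NotEqual: return 3;
            default: return AtomPrecedence;
        }
    }

    static string Symbol(ExpressionType type)
    {
        switch (type)
        {
            case ExpressionType.OrElse: return "OR";
            case ExpressionType.AndAlso: return "AND";
            case ExpressionType.Equal: return "=";
            case ExpressionType.NotEqual: return "<>";
            default: throw new NotSupportedException(type.ToString());
        }
    }

    public static string Write<T>(Expression<Func<T, bool>> filter)
    {
        return Render(filter.Body);
    }

    static string Wrap(Expression child, bool parenthesize)
    {
        string text = Render(child);
        return parenthesize ? "(" + text + ")" : text;
    }

    static string Render(Expression node)
    {
        switch (node)
        {
            case BinaryExpression b:
            {
                int p = Precedence(b.NodeType);
                // Left child: parenthesize only if it binds more loosely.
                string left = Wrap(b.Left, Precedence(b.Left.NodeType) < p);
                // Right child: also parenthesize on equal precedence, because the
                // C# compiler only nests to the right when the source had parentheses.
                string right = Wrap(b.Right, Precedence(b.Right.NodeType) <= p);
                return left + " " + Symbol(b.NodeType) + " " + right;
            }
            case UnaryExpression u when u.NodeType == ExpressionType.Not:
                return "NOT " + Wrap(u.Operand, Precedence(u.Operand.NodeType) < AtomPrecedence);
            case MemberExpression m:
                return m.Member.Name;
            case ConstantExpression c:
                return c.Value is string s
                    ? "'" + s + "'"
                    : Convert.ToString(c.Value, CultureInfo.InvariantCulture);
            default:
                throw new NotSupportedException(node.NodeType.ToString());
        }
    }
}
```

With this sketch, FilterWriter.Write<Customer>(c => c.Name == "asd" && (c.Surname == "qwe" && c.Status == 1) && !c.Disabled) produces Name = 'asd' AND (Surname = 'qwe' AND Status = 1) AND NOT Disabled. A real query generator would also need parameterization and proper escaping of values.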
When building an expression analyzer, you first need a parser, and for that you need a tokenizer.
A tokenizer is a piece of code that reads an expression and generates tokens (which can be valid or invalid) for a given syntax.
So your parser, using the tokenizer, reads the expression in the established order (left-to right, right-to-left, top-to-bottom, whatever you choose) and creates a tree that maps the expression.
Then the analyzer interprets the tree into an expression, giving its definitive meaning.
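To make the first stage concrete, here is a minimal tokenizer sketch for text expressions similar to the ones in the question. The token kinds and the supported operators are assumptions chosen for the example, and member access such as a.Line2 is treated as a single identifier to keep the code short.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Identifier, Number, String, Operator, OpenParen, CloseParen }

public static class Tokenizer
{
    public static List<(TokenKind Kind, string Text)> Tokenize(string input)
    {
        var tokens = new List<(TokenKind Kind, string Text)>();
        int i = 0;
        while (i < input.Length)
        {
            char ch = input[i];
            if (char.IsWhiteSpace(ch)) { i++; continue; }

            int start = i;
            TokenKind kind;
            if (char.IsLetter(ch) || ch == '_')
            {
                // Simplification: dots are kept inside the identifier.
                while (i < input.Length && (char.IsLetterOrDigit(input[i]) || input[i] == '_' || input[i] == '.')) i++;
                kind = TokenKind.Identifier;
            }
            else if (char.IsDigit(ch))
            {
                while (i < input.Length && (char.IsDigit(input[i]) || input[i] == '.')) i++;
                kind = TokenKind.Number;
            }
            else if (ch == '"')
            {
                i++; // skip the opening quote
                while (i < input.Length && input[i] != '"') i++;
                if (i == input.Length) throw new FormatException("Unterminated string literal");
                i++; // include the closing quote
                kind = TokenKind.String;
            }
            else if (ch == '(') { i++; kind = TokenKind.OpenParen; }
            else if (ch == ')') { i++; kind = TokenKind.CloseParen; }
            else if ("=!<>&|".IndexOf(ch) >= 0)
            {
                i++;
                // Absorb the second character of ==, !=, <=, >=, && and ||.
                if (i < input.Length && (input[i] == '=' || (input[i] == ch && (ch == '&' || ch == '|')))) i++;
                kind = TokenKind.Operator;
            }
            else
            {
                throw new FormatException("Unexpected character '" + ch + "' at position " + i);
            }
            tokens.Add((kind, input.Substring(start, i - start)));
        }
        return tokens;
    }
}
```

A parser would then consume this token list and build the tree, using the OpenParen and CloseParen tokens to decide grouping.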
I am using the following regular expression (from http://www.simple-talk.com/dotnet/asp.net/regular-expression-based-token-replacement-in-asp.net/)
(?<functionName>[^\$]*?)\((?:(?<params>.*?)(?:,|(?=\))))*?\)
It works fine, except when I want to include brackets within the parameters, such as "<b>hello<b> renderHTML(""GetData(12)"") ".
I want "GetData(12)", but instead I get "GetData(12".
Is there a way to ignore any matches if they are wrapped in double quotes?
There are ways to ignore the parens inside of quotes, but this will not solve your problem. Function calls in C# cannot be matched with a regular expression. Regular expressions cannot match nested structures, such as the way both parens and < appear inside of a function call. To match these you need to use a grammar of sorts.
A while back I wrote a blog post which goes into a bit more detail about this problem:
http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
I don't mean to be avoiding the answer here. But any answer to this question will just be broken by a slightly more complex (or sometimes even simpler) function call.
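For completeness, one way to ignore parentheses inside double quotes is to let the parameter group consume whole quoted strings as a unit. The pattern below is a sketch written for this example, not the pattern from the linked article, and it has exactly the limitation described above: it only handles a call whose parameters contain no unquoted parentheses, so for nested calls it finds the innermost one.

```csharp
using System.Text.RegularExpressions;

public static class QuotedParamMatcher
{
    // params is a run of either complete "..." strings
    // or single characters that are not parentheses or quotes.
    static readonly Regex Call = new Regex(
        @"(?<functionName>\w+)\((?<params>(?:""[^""]*""|[^()""])*)\)");

    // Returns the raw parameter text of the first call found, or null if there is none.
    public static string FirstParams(string input)
    {
        Match m = Call.Match(input);
        return m.Success ? m.Groups["params"].Value : null;
    }
}
```

For the text <b>hello<b> renderHTML("GetData(12)") the params group captures "GetData(12)" including its quotes, while for an unquoted nested call such as f(g(1)) the first match is g(1) rather than the outer call.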