How does Visual Studio syntax-highlight strings in the Regex constructor? - c#

Hi fellow programmers and nerds!
When creating regular expressions in Visual Studio, the IDE will highlight the string if it's preceded by a verbatim identifier (for example, @"Some string"). It looks something like this:
(Notice the way the string is highlighted). Most of you will have seen this by now, I'm sure.
My problem: I am using a package acquired from NuGet which deals with regular expressions. It has a function which takes a regular expression string, but that function doesn't get the syntax highlighting.
As you can see, this makes reading the regex string a pain. It's not all that important, but it would make a difference to have that visually helpful highlighting to reduce the time and effort one's brain spends deciphering the expression, especially in a case like mine where there will be quite a lot of these expressions.
The question
So what I'm wanting to know is, is there a way to make a function highlight the string this way*, or is it just something that's hardwired into the IDE for the specific case of the Regex c-tor? Is there some sort of annotation which can be tacked onto the function to achieve this with minimal effort, or would it be necessary to use some sort of extension?
*I have wrapped the call to AddStyle() into one of my own functions anyway, and the string will be passed as a parameter, so if any modifications need to be made to achieve the syntax-highlight, they can be made to my function. Therefore the fact that the AddStyle() function is from an external library should be irrelevant.
If it's a lot of work then it's not worth my time, somebody else is welcome to develop an extension to solve this, but if there is a way...
Important distinction
Please bear in mind I am talking about Visual Studio, NOT Visual Studio Code.
Also, if there is a way to pull the original expression string from the Regex, I might do it that way, since performance isn't a huge concern here as this is a once-on-startup thing, however I would prefer not to do it that way. I don't actually need the Regex object.

According to https://devblogs.microsoft.com/dotnet/visual-studio-2019-net-productivity/#regex-language-support and https://www.meziantou.net/visual-studio-tips-and-tricks-regex-editing.htm you can mark the string with a special comment to get syntax highlighting:
// language=regex
var str = @"[A-Z]\d+";
or
MyMethod(/* language=regex */ @"[A-Z]\d+");
(the comment may contain more than just this language=regex part)
The first linked blog talks about a preview, but this feature is also present in the final product.

.NET 7 introduces the new [StringSyntax(...)] attribute, which is used in .NET 7 on more than 350 string, string[], and ReadOnlySpan<char> parameters, properties, and fields to highlight to an interested tool what kind of syntax is expected to be passed or set.
https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/?WT_mc_id=dotnet-35129-website&hmsr=joyk.com&utm_source=joyk.com&utm_medium=referral
So for a method argument you should just use:
void MyMethod([StringSyntax(StringSyntaxAttribute.Regex)] string regex);
Here is a video demonstrating the feature: https://youtu.be/Y2YOaqSAJAQ
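Applied to the question, the attribute would go on the asker's own wrapper around the library call. The following is a minimal sketch, assuming .NET 7 or later; AddStyleChecked is a hypothetical name, and the call into the NuGet library is only indicated by a comment because its signature is not shown in the question.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

// Hypothetical wrapper (not part of any library). The attribute is only a hint
// for tooling such as Visual Studio; it does not change runtime behaviour.
static string AddStyleChecked([StringSyntax(StringSyntaxAttribute.Regex)] string pattern)
{
    // Constructing a Regex validates the pattern early
    // (an ArgumentException is thrown if it is malformed).
    _ = new Regex(pattern);

    // The real wrapper would forward the pattern to the library here.
    return pattern;
}

// The literal passed here is what the editor should now colourise as a regex.
Console.WriteLine(AddStyleChecked(@"[A-Z]\d+"));
```

Because the hint lives on the parameter, every call site gets the highlighting without needing a // language=regex comment.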

Related

Matching and replacing function expressions

I need to do some very light parsing of C# (actually transpiled Razor code) to replace a list of function calls with textual replacements.
If given a set containing {"Foo.myFunc" : "\"def\"" } it should replace this code:
var res = "abc" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));
with this:
var res = "abc" + "def";
I don't care about the nested expressions.
This seems fairly trivial and I think I should be able to avoid building an entire C# parser using something like this for every member of the mapping set:
find expression start (e.g. Foo.myFunc)
Push()/Pop() parentheses on a Stack until Count == 0.
Mark this as expression stop
replace everything from expression start until expression stop
But maybe I don't need to ... Is there a (possibly built-in) .NET library that can do this for me? Counting is not possible in the family of languages that RE is in, but maybe the extended regex syntax in C# can handle this somehow using back references?
edit:
As the comments to this answer demonstrate, simply counting brackets will not be sufficient in general, as something like trollMe("(") will throw off those algorithms. Only true parsing would then suffice, I guess (?).
The trick for a normal string will be:
(?>"(\\"|[^"])*")
A verbatim string:
(?>@"(""|[^"])*")
Maybe this can help, but I'm not sure that this will work in all cases:
<func>(?=\()((?>/\*.*?\*/)|(?>@"(""|[^"])*")|(?>"(\\"|[^"])*")|\r?\n|[^()"]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))
Replace <func> with your function name.
Needless to say, trollMe("\"(", "((", @"abc""de((f") works as expected.
DEMO
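As an alternative to the balancing-group regex, the counting approach described in the question can be written by hand, as long as string literals are skipped while counting. This is a rough sketch, not a C# parser: FindCallEnd and ReplaceCalls are hypothetical helper names, comments and char literals are not handled, and the function name is found with a plain substring search (so it would also match inside a longer identifier or inside a string).

```csharp
using System;
using System.Text;

// Returns the index just past the ')' that closes the '(' at openIndex,
// or -1 if the parentheses never balance. String literals are skipped.
static int FindCallEnd(string code, int openIndex)
{
    int depth = 0;
    for (int i = openIndex; i < code.Length; i++)
    {
        char c = code[i];
        if (c == '@' && i + 1 < code.Length && code[i + 1] == '"')
        {
            // Verbatim string: "" is an escaped quote, a single " ends it.
            i += 2;
            while (i < code.Length)
            {
                if (code[i] == '"')
                {
                    if (i + 1 < code.Length && code[i + 1] == '"') i++; // skip the escaped pair
                    else break;                                        // closing quote
                }
                i++;
            }
        }
        else if (c == '"')
        {
            // Regular string: a backslash escapes the next character.
            i++;
            while (i < code.Length && code[i] != '"')
            {
                if (code[i] == '\\') i++;
                i++;
            }
        }
        else if (c == '(') depth++;
        else if (c == ')' && --depth == 0) return i + 1;
    }
    return -1;
}

// Replaces every call "funcName(...)" in code with the given replacement text.
static string ReplaceCalls(string code, string funcName, string replacement)
{
    var result = new StringBuilder();
    int pos = 0;
    while (true)
    {
        int start = code.IndexOf(funcName + "(", pos, StringComparison.Ordinal);
        if (start < 0) break;
        int end = FindCallEnd(code, start + funcName.Length);
        if (end < 0) break; // unbalanced input: leave the rest untouched
        result.Append(code, pos, start - pos).Append(replacement);
        pos = end;
    }
    return result.Append(code, pos, code.Length - pos).ToString();
}

string input = "var res = \"abc\" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));";
Console.WriteLine(ReplaceCalls(input, "Foo.myFunc", "\"def\""));
// prints: var res = "abc" + "def";
```

For anything beyond light rewriting, a real parser (for example Roslyn's syntax trees) is the more robust choice, as the edit to the question already suspects.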

Parse string into a LINQ query

What method would be considered best practice for parsing a LINQ string into a query?
Or in other words, what approach makes the most sense to convert:
string query = @"from element in source
where element.Property == ""param""
select element";
into
IEnumerable<Element> result = from element in source
where element.Property == "param"
select element;
assuming that source refers to an IEnumerable<Element> or IQueryable<Element> in the local scope.
Starting with .NET 4.6 you can use CSharpScript to parse Linq. Assuming the expression you want to parse is in string variable "query", this will do it:
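// Requires the Microsoft.CodeAnalysis.CSharp.Scripting NuGet package.
using System.Collections;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

string query = @"from element in source where element.Property == ""param"" select element";
IEnumerable result = null;
try
{
    var scriptOptions = ScriptOptions.Default
        .WithReferences(typeof(System.Linq.Enumerable).Assembly)
        .WithImports("System.Linq");
    // The public members of 'global' (here: source) are visible to the script.
    result = await CSharpScript.EvaluateAsync<IEnumerable>(
        query,
        scriptOptions,
        globals: global);
}
catch (CompilationErrorException ex)
{
    // The query did not compile; ex.Diagnostics lists the errors.
}
Don't forget to pass the (data) source you want to query through the globals object, so that the script has access to it.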
It requires some text parsing and heavy use of System.Linq.Expressions. I've done some toying with this here and here. The code in the second article is somewhat updated from the first but still rough in spots. I've continued to mess round with this on occasion and have a somewhat cleaner version that I've been meaning to post if you have any interest. I've got it pretty close to supporting a good subset of ANSI SQL 89.
You're going to need a C# language parser (at least v3.5, possibly v4.0, depending on what C# language features you wish to support in LINQ). You'll take those parser results and feed it directly into an Expression tree using a visitor pattern. I'm not sure yet but I'm willing to bet you'll also need some form of type analysis to fully generate the Expression nodes.
I'm looking for the same thing as you, but I don't really need it that badly so I haven't searched hard nor written any code along these lines.
I have written something that takes user string input and compiles it to a dynamic assembly using the Microsoft.CSharp.CSharpCodeProvider compiler provider class. If you just want to take strings of code and execute the result, this should suit you fine.
Here's the description of the console tool I wrote, LinqFilter:
http://bittwiddlers.org/?p=141
Here's the source repository. LinqFilter/Program.cs demonstrates how to use the compiler to compile the LINQ expression:
http://bittwiddlers.org/viewsvn/trunk/public/LinqFilter/?root=WellDunne
This might work for you: C# eval equivalent?
This may or may not help you, but check out LINQ Dynamic Query Library.
While this doesn't specifically give an example to answer your question I would have thought the best practice would generally be to build an expression tree from the string.
In this question I asked how to filter a linq query with a string which shows you building a portion of an expression tree. This concept however can be extended to build an entire expression tree representing your string.
See this article from Microsoft.
There are probably other better posts out there as well. Additionally I think something like RavenDB does this already in its code base for defining indexes.
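To make the expression-tree suggestion concrete, here is a small sketch of the last step only: turning an already parsed property name and value into a predicate equivalent to element => element.Property == value. BuildEquals is a hypothetical helper name, and extracting the property name and value from the query string (the actual parsing) is assumed to have happened elsewhere.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Builds the lambda: element => element.<propertyName> == value
static Expression<Func<T, bool>> BuildEquals<T>(string propertyName, object value)
{
    ParameterExpression element = Expression.Parameter(typeof(T), "element");
    MemberExpression property = Expression.Property(element, propertyName);
    ConstantExpression constant = Expression.Constant(value, property.Type);
    return Expression.Lambda<Func<T, bool>>(Expression.Equal(property, constant), element);
}

// Usage with a built-in type, to keep the sketch self-contained:
// keep the strings whose Length property equals 3.
var source = new[] { "foo", "quux", "bar" };
var filtered = source.AsQueryable().Where(BuildEquals<string>("Length", 3)).ToArray();
Console.WriteLine(string.Join(", ", filtered)); // prints: foo, bar
```

Because the result is an Expression<Func<T, bool>> rather than a compiled delegate, it can be handed to an IQueryable provider as well as to in-memory collections.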

Why is my regex so much slower compiled than interpreted?

I have a large and complex C# regex that runs OK when interpreted, but is a bit slow. I'm trying to speed this up by setting RegexOptions.Compiled, and this seems to take about 30 seconds for the first time and instantly after that. I'm trying to negate this by compiling the regex to an assembly first, so my app can be as fast as possible.
My problem is when the compiling delay takes place, whether it's compiled in the app:
Regex myComplexRegex = new Regex(regexText, RegexOptions.Compiled);
MatchCollection matches = myComplexRegex.Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
or using Regex.CompileToAssembly in advance:
MatchCollection matches = new CompiledAssembly.ComplexRegex().Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
This is making compiling to an assembly basically useless, as I still get the delay on the first foreach call. What I want is for all the compiling delay to be done at compile time instead (at the Regex.CompileToAssembly call), and not at runtime. Where am I going wrong ?
(The code I'm using to compile to an assembly is similar to http://www.dijksterhuis.org/regular-expressions-advanced/ , if that's relevant ).
Edit:
Should I be using new when calling the compiled assembly in new CompiledAssembly.ComplexRegex().Matches(searchText); ? It gives an "object reference required" error without it, though.
Update 2
Thanks for the answers/comments. The regex that I'm using is pretty long but basically straightforward, a list of thousands of words each separated by |. I can't see it'd be a backtracking problem really. The subject string can be just one letter long, and it can still cause the compilation delay. For a RegexOptions.Compiled regex, it'll take over 10 seconds to execute when the regex contains 5000 words. For comparison, the non-compiled version of the regex can take 30,000+ words and still execute just about instantly.
After doing a lot of testing on this, what I think I've found out is:
Don't use RegexOptions.Compiled when your regex has many alternatives - it can be extremely slow to compile.
.NET will use lazy evaluation for regex when possible, and as far as I can see this extends (at least to some extent) to regex compilation too. A regex will be fully compiled only when it has to be, and there seems to be no way of forcing compilation ahead of time.
Regex.CompileToAssembly would be much more useful if the regexes could be forced to be fully compiled, it seems to be verging on being pointless as it is.
Please correct me if I'm wrong or missing something!
When using RegexOptions.Compiled, you should make sure to re-use the Regex object. It doesn't seem like you are doing this.
RegexOptions.Compiled is a trade-off. The initial construction of the Regex will be slower, because code is compiled on-the-fly, but each match should be faster. If your regular expression changes at run-time, there will probably be no benefit from using RegexOptions.Compiled, although it might depend on the actual expression involved.
Update, per the comments
If your actual code looks like the one you have posted, you are not taking any advantage of CompileToAssembly, as you are creating new, on-the-fly compiled instances of Regex each time that piece of code runs. In order to take advantage of CompileToAssembly, you will need to compile the Regex first, then take the generated assembly and reference it in your project. You should then instantiate the generated, strongly-typed Regex types.
In the example you link to, he has a regular expression named FindTCPIP, which gets compiled into a type named FindTCPIP. When this needs to be used, one should create a new instance of this specific type, such as:
TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();
Try using Regex.CompileToAssembly, then link to the assembly so that you can construct the Regex objects. RegexOptions.Compiled is a runtime option, the regex would still get re-compiled every time you run the application.
A very probable cause of a slow regex is that it backtracks too much. This is solved by rewriting the regex so that backtracking is eliminated or kept to a minimum.
Can you post the regex and a sample input where it is slow?
Personally I haven't had the need to compile a regex, although it's interesting to see some actual numbers about performance if you have taken this path.
To force initialization you can call Match against an empty string. On top of that you can use ngen to create a native image of the expression to speed up the process even further. But probably most importantly, it's essentially just as fast to throw 30,000 string.IndexOf, string.Contains or Regex.Match calls against a given text as it is to compile one gigantic expression to match against a single text, since that requires a lot less compilation, JITting, etc., as the state machine is a lot simpler.
Another thing you could consider is to tokenize the text and intersect it with the list of words you're after.
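The tokenize-and-intersect idea could look roughly like the sketch below. FindWords is a hypothetical helper name and the separator list is an assumption; note that this only finds whole words, whereas an unanchored alternation regex would also match inside longer words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns the tokens of text that also occur in wordList (case-insensitive).
static HashSet<string> FindWords(string text, IEnumerable<string> wordList)
{
    // Assumed separator set; adjust to the kind of text being searched.
    var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };
    var tokens = new HashSet<string>(
        text.Split(separators, StringSplitOptions.RemoveEmptyEntries),
        StringComparer.OrdinalIgnoreCase);

    // Keep only the tokens that are in the word list.
    tokens.IntersectWith(wordList);
    return tokens;
}

var found = FindWords("The quick brown fox.", new[] { "fox", "dog", "QUICK" });
Console.WriteLine(found.Count); // prints: 2
```

There is nothing to compile here, so there is no start-up delay regardless of how many words the list contains.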
After extensive testing of my own, I can confirm the suspicions of mikel are essentially correct. Even when using Regex.CompileToAssembly() and statically linking the resultant DLL into the application, there is a substantial initial delay on the first practical matching call (at least for patterns involving many ORed alternatives). Moreover, the initial delay on the first matching call depends on what text you match against. For example, matching against an empty string or some other arbitrary text will cause less of an initial delay, but you will still get additional delays later on when actual positive matches are first encountered in new text. The only way to fully guarantee future matches will all be lightning fast is to initially force a positive match at runtime with text that does indeed match. Of course this gives the maximum initial delay possible (in exchange for all future matches being lightning fast).
I dug deeper in order to understand this better. For each regex compiled into the assembly, a triplet of classes is written with the following naming template: {RegexName, RegexNameFactoryN, RegexNameRunnerN}. A reference to the RegexNameFactoryN class is instantiated at the time of the RegexName ctor, but the RegexNameRunnerN class is not. See the private factory and runnerref fields in the base Regex class. runnerref is a cached weak reference to a RegexNameRunnerN object. After various experiments with reflection, I can confirm that the ctors of all 3 of these compiled classes are fast and the RegexNameFactoryN.CreateInstance() function (which returns the initial RegexNameRunnerN reference) is also fast. The initial delay occurs somewhere within RegexRunner.Scan(), or its call tree, and is thus likely outside the reach of the compiled MSIL generated by Regex.CompileToAssembly() since this call tree involves numerous non-abstract functions. This is very unfortunate and means the C# Regex compilation process performance benefits only extend so far: at runtime there will always be some substantial delay the first time a positive match is encountered (at least for this class of many-ORed patterns).
I theorize that this has to do with how the Nondeterministic Finite Automaton (NFA) engine performs some of its own internal caching/instantiations at runtime as the pattern is processed.
jessehouwing's suggestion of ngen is interesting and could possibly improve performance. I have not tested it.

Regular expression in C# , is this possible?

I have never used regular expressions before and plan to use them to solve my problem, but I'm not quite sure whether they can help me.
I have a situation where I need to store a rule or formula for building string values, like the following examples, in a database field, and then retrieve this rule and build the string value.
FacilityCode + Left(ModelNO,2)
Right(PO,3) + Left(Serial,2)
Is this achievable using .NET regular expressions? Are there any good tutorials or simple examples of this problem?
Regexp : http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
But it doesn't seem to fit :)
It might be better to code some random string generator. Regex is for searching data not creating data.
The thing to remember about regex is that it is like an aircraft carrier; it does one thing very very well, it does not do other jobs very well at all.
An aircraft carrier moves planes very well on the ocean; it does not make a cheese sandwich well AT ALL!!
That is to say, if you use regex when you shouldn't you will almost certainly use far more processing power than if you used another tool for that job. Html parsing comes to mind.
Regex is provided as part of System.Text.RegularExpressions, but you can't rely exclusively on it. It'll let you search existing strings, but you'll need to implement your own logic for building new strings based on what you find in the existing data.
Also, keep in mind that System.Text.RegularExpressions works differently from regexp in Perl and other implementations. For example, it doesn't recognize POSIX character class definitions.
Since you're new to regex, you might want to check out the "Regular Expressions User Guide" on zytrax.com. It's not as comprehensive as an O'Reilly manual, but it'll do as a start.
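One way to combine both points is to use a regex only to recognise the pieces of a stored rule and ordinary code to build the string. The sketch below assumes a deliberately tiny rule format taken from the two examples in the question (terms joined by +, each term either a field name or Left/Right(field, n)); BuildValue and the sample field values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Evaluates a rule such as "FacilityCode + Left(ModelNO,2)" against field values.
static string BuildValue(string rule, IReadOnlyDictionary<string, string> fields)
{
    var result = new StringBuilder();
    foreach (string rawTerm in rule.Split('+'))
    {
        string term = rawTerm.Trim();

        // The regex only recognises the shape Left(field, n) or Right(field, n).
        Match call = Regex.Match(term, @"^(Left|Right)\(\s*(\w+)\s*,\s*(\d+)\s*\)$");
        if (!call.Success)
        {
            result.Append(fields[term]); // plain field name
            continue;
        }

        string value = fields[call.Groups[2].Value];
        int count = Math.Min(int.Parse(call.Groups[3].Value), value.Length);
        result.Append(call.Groups[1].Value == "Left"
            ? value.Substring(0, count)
            : value.Substring(value.Length - count));
    }
    return result.ToString();
}

var fields = new Dictionary<string, string>
{
    ["FacilityCode"] = "NYC", ["ModelNO"] = "AB1234", ["PO"] = "PO98765", ["Serial"] = "XY0001"
};
Console.WriteLine(BuildValue("FacilityCode + Left(ModelNO,2)", fields)); // prints: NYCAB
Console.WriteLine(BuildValue("Right(PO,3) + Left(Serial,2)", fields));   // prints: 765XY
```

An unknown field name throws a KeyNotFoundException here; a real implementation would want friendlier error reporting for rules stored in a database.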

c# parse a string that contains conditions , key=value

I am given a string that contains several different combinations of data.
For example:
string data = "(age=20&gender=male) or (city=newyork)";
string data1 = "(job=engineer&gender=female)";
string data2 = "(foo =1 or foo = 2) & (bar =1)";
I need to parse this string, create a structure out of it, and evaluate it as a condition against another object, e.g. if the object has these properties, then do something, else skip, etc.
What are the best practices to do this?
Should I use a parser such as ANTLR and generate tokens out of the string, etc.?
Reminder: there are several combinations of how this string is created, but it's all and/or.
Something like ANTLR is probably overkill for this.
A simple implementation of the shunting-yard algorithm would probably do the trick quite nicely.
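A sketch of that idea for the strings in the question follows. It uses the two-stack variant of shunting-yard that evaluates operands as it goes instead of emitting postfix output. Several things are assumptions rather than facts from the question: the object is modelled as a dictionary of property names to values, & and "and" are treated as the same operator, "and" binds tighter than "or", and malformed input is not handled (it would throw from the stack operations).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static bool Evaluate(string expression, IReadOnlyDictionary<string, string> obj)
{
    // Tokens: parentheses, operators (&, and, or), or a key=value condition.
    var tokenPattern = new Regex(
        @"\(|\)|&|\b(?:and|or)\b|(?<key>\w+)\s*=\s*(?<value>\w+)",
        RegexOptions.IgnoreCase);
    var output = new Stack<bool>();      // operand stack (conditions already evaluated)
    var operators = new Stack<string>(); // operator stack

    void Apply(string op)
    {
        bool right = output.Pop(), left = output.Pop();
        output.Push(op == "or" ? left || right : left && right);
    }
    int Precedence(string op) => op == "or" ? 1 : 2;

    foreach (Match m in tokenPattern.Matches(expression))
    {
        string token = m.Value.ToLowerInvariant();
        if (m.Groups["key"].Success)
        {
            // A condition holds if the object has the key with exactly that value.
            bool holds = obj.TryGetValue(m.Groups["key"].Value, out string actual)
                         && actual == m.Groups["value"].Value;
            output.Push(holds);
        }
        else if (token == "(") operators.Push(token);
        else if (token == ")")
        {
            while (operators.Peek() != "(") Apply(operators.Pop());
            operators.Pop(); // discard the "("
        }
        else
        {
            string op = token == "or" ? "or" : "and"; // "&" and "and" are treated alike
            while (operators.Count > 0 && operators.Peek() != "("
                   && Precedence(operators.Peek()) >= Precedence(op))
                Apply(operators.Pop());
            operators.Push(op);
        }
    }
    while (operators.Count > 0) Apply(operators.Pop());
    return output.Pop();
}

var person = new Dictionary<string, string> { ["age"] = "20", ["gender"] = "male", ["city"] = "boston" };
Console.WriteLine(Evaluate("(age=20&gender=male) or (city=newyork)", person)); // prints: True
Console.WriteLine(Evaluate("(job=engineer&gender=female)", person));           // prints: False
```

If the conditions need to be stored or inspected rather than evaluated immediately, the same loop could push small tree nodes onto the operand stack instead of booleans.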
Using regular expressions may work if the example is very simple, but it will more likely lead to code that is impossible to maintain. Using some other approach to parsing seems like a good idea.
I would take a look at NCalc - it is mainly focused on parsing mathematical expressions, but it seems to be quite customizable (you can specify your functions and constants), so it may work in your scenario as well.
If this is too complex for your purpose, you can use any "parser generator" for C#. Using ANTLR is one great option - here is an example that shows how to start writing something like your example Five minute introduction to ANTLR
You could also try using F#, which is a great language for this kind of problem. See for example FsLex Sample by Chris Smith, which shows a simple mathematical evaluator - processing the parsed expression in F# would be a lot easier than in C#. In F#, you could also use FParsec, which is very lightweight, but may be a bit difficult to follow if you're not used to F#.
I suggest you have a look at regular expressions: http://www.codeproject.com/KB/dotnet/regextutorial.aspx
Antlr is a great tool, but you can probably do this with regular expressions. One of the nice things about the .NET regex engine is support for nested constructs. See
http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/
and this SO post.
Seems like you might want to use Regular Expressions to do this.
Read up a little bit on Regular Expressions in .NET. Here are some good articles:
http://msdn.microsoft.com/en-us/library/hs600312.aspx
http://www.regular-expressions.info/dotnet.html
When it comes time to write/test your regular expression, I would highly recommend using RegExLib.com's regex tester.
