Pattern matching with extraction of found substrings to variables - c#

I have some strings of some defined format like Foo.<Whatever>.$(Something) and I would like to split them in parts and have each part automatically assigned to a variable.
I once wrote something resembling the bash/shell pipe command option '<' with C# classes and operator overloading. The usage was something like
ParseExpression ex = pex("item1") > ".<" > pex("item2") > ">.$(" > pex("item3") > ")";
ParseResult r = new ParseResult(ex, "Foo.<Whatever>.$(Something)");
ParseResult then had a Dictionary with the keys item1 through item3 set to the strings found in the given string. The method pex generated some object that could be used with the > operator, eventually having a chain of ParseExpressionParts which constitute a ParseExpression.
I don't have the code at hand at the moment, and before I start coding it from scratch again I thought I had better ask whether someone has already done and published it.
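For illustration, the chained-operator idea described above could be rebuilt roughly as follows. This is only a sketch, not the lost original: it assumes each part can be turned into a regex fragment, and the names ParseExpression, Pex and Parse are hypothetical. C# requires < to be overloaded whenever > is, so unused counterparts are declared.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical reconstruction: every expression is just a regex fragment.
public sealed class ParseExpression
{
    private readonly string pattern;

    private ParseExpression(string pattern) { this.pattern = pattern; }

    // Starts a chain with a named, non-greedy capture.
    public static ParseExpression Pex(string name)
    {
        return new ParseExpression("(?<" + name + ">.+?)");
    }

    // expression > "literal": append the literal, escaped for regex use.
    public static ParseExpression operator >(ParseExpression left, string literal)
    {
        return new ParseExpression(left.pattern + Regex.Escape(literal));
    }

    // expression > expression: concatenate both fragments.
    public static ParseExpression operator >(ParseExpression left, ParseExpression right)
    {
        return new ParseExpression(left.pattern + right.pattern);
    }

    // Required counterparts of the > overloads; not meant to be used.
    public static ParseExpression operator <(ParseExpression left, string literal)
    {
        throw new NotSupportedException();
    }

    public static ParseExpression operator <(ParseExpression left, ParseExpression right)
    {
        throw new NotSupportedException();
    }

    // Returns the captured parts by name, or null when the input does not match.
    public Dictionary<string, string> Parse(string input)
    {
        var regex = new Regex("^" + pattern + "$");
        Match match = regex.Match(input);
        if (!match.Success) return null;

        var result = new Dictionary<string, string>();
        foreach (string name in regex.GetGroupNames())
        {
            // Skip the implicit group "0", which holds the whole match.
            if (name != "0") result[name] = match.Groups[name].Value;
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        ParseExpression ex = ParseExpression.Pex("item1") > ".<"
            > ParseExpression.Pex("item2") > ">.$("
            > ParseExpression.Pex("item3") > ")";
        Dictionary<string, string> parts = ex.Parse("Foo.<Whatever>.$(Something)");
        Console.WriteLine(parts["item1"]); // Foo
        Console.WriteLine(parts["item2"]); // Whatever
        Console.WriteLine(parts["item3"]); // Something
    }
}
```

Because the captures are non-greedy, each part ends at the first occurrence of the literal that follows it, so literals that also occur inside a part would need a stricter fragment.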

The parse expressions remind me of parser combinators like Parsec and FParsec (for F#). How complex is the syntax going to be? As it is, it could be handled by a regex with groups.
If you want to create a more complex grammar using parser combinators you can use FParsec, one of the better known parser combinators, targeting F#. In general, functional languages like F# are used a lot in such situations. CSharp-monad is a parser combinator targeting C#. The project isn't very active though.
You can also use a full-blown parser generator like ANTLR 4. ANTLR is used by ASP.NET MVC to parse Razor syntax views. ANTLR 4 creates a parse tree and allows you to process it with either a Visitor or a Listener, which are similar to DOM and SAX processing respectively. A Listener calls your code as soon as an element is encountered (e.g. the opening <, the content, etc.), while a Visitor works on the finished tree.
The Visual Studio extension for ANTLR will generate both the parser classes as well as base Visitor and Listener classes for your grammar. The NetBeans-based ANTLRWorks IDE makes creating and testing grammars very easy.
A rough grammar for your example would be :
format: tag '.' '<' category '>' '.' '$' '(' value ')';
tag : ID;
category : ID;
value : ID;
ID : [A-Za-z0-9]+;
Or you could define keywords like FOO : 'FOO' that have special meaning for your grammar. A visitor or listener could handle the tag eg to format a string, execute an operation on the values etc.
There are no hard and fast rules. Personally, I use regular expressions for simpler cases, e.g. processing relatively simple log files, and ANTLR for more complex cases like screen-scraping mainframe data. I haven't looked into parser combinators as I never had the time to get comfortable with F#. They would be really handy, though, for handling some messed-up log4net log files.

I started with Heinzi's suggestion and eventually came up with the following code:
const string tokenPrefix = "px";
const string tokenSuffix = "sx";
const string tokenVar = "var";
string r = string.Format(@"(?<{0}>.*)\$\((?<{1}>.*)\)(?<{2}>.*)",
    tokenPrefix, tokenVar, tokenSuffix);
Regex regex = new Regex(r);
Match match = regex.Match("Foo$(Something)Else");
if (match.Success)
{
    string prefix = match.Groups[tokenPrefix].Value; // = "Foo"
    string suffix = match.Groups[tokenSuffix].Value; // = "Else"
    string variable = match.Groups[tokenVar].Value; // = "Something"
}
After talking to a colleague about this, I was told to consider using the C# parser construction library named "Sprache" (which sits somewhere between regex and ANTLR-like toolsets) once my pattern usage increases and I want better maintainability.

Related

regex for matching programming language like block [duplicate]

This question already has answers here:
Regular expression to match balanced parentheses
I'm trying to make a small scripting language using C#.
I'm currently writing the block parser and I'm stuck on making a regex for blocks.
Blocks can contain any number of nested sub-blocks.
This is what I need to catch:
{
naber();
}
{
int x = 5;
x = 2;
if (x == 5) {
x = 5;
}
}
I tried this, but it does not work:
\{[^{}]*|(\{[^\{\}]\})*\}
This is my first post, so please have mercy on me.
Regex will not help you with this. If you are designing a scripting language, possibly to be executed, that has blocks and sub-blocks, you need a context-free grammar, as opposed to a regular grammar, which is what regular expressions can express.
To interpret a context-free language you need the following steps (simplified):
Convert the code string to a list of tokens/symbols. This process is done by a component usually called Lexer.
Convert the tokens into a structured tree (AST - Abstract Syntax Tree) based on grammar rules (things like operator precedence, nested code blocks, etc). This is done by a component usually called Parser.
From here several options arise: you can translate the AST into native code or intermediate code (like bytecode), transpile it into another language, or run it directly in memory, which is the simplest approach and probably what you want/need.
These should already be plenty of concepts to search for, but all of this can be achieved easily with tools like ANTLR. There might be alternatives to ANTLR obviously, I just don’t recall any just now.
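As a rough illustration of the lexer and parser steps, here is a minimal hand-written sketch for the nested-block case only. It assumes braces never appear inside string literals or comments, and the names Block, BlockParser, Tokenize and Parse are made up for this example.

```csharp
using System;
using System.Collections.Generic;

// A block holds its own code text plus any nested blocks, in source order.
public sealed class Block
{
    // Each item is either a string (code text) or a nested Block.
    public readonly List<object> Items = new List<object>();
}

public static class BlockParser
{
    // Lexer: splits the source into "{", "}" and the text runs between them.
    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        int start = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i] != '{' && source[i] != '}') continue;
            if (i > start) tokens.Add(source.Substring(start, i - start));
            tokens.Add(source[i].ToString());
            start = i + 1;
        }
        if (start < source.Length) tokens.Add(source.Substring(start));
        return tokens;
    }

    // Parser entry point: returns a root block containing the top-level items.
    public static Block Parse(string source)
    {
        int position = 0;
        return ParseBlock(Tokenize(source), ref position, false);
    }

    // Recursive descent: every "{" starts a nested call, every "}" ends one.
    private static Block ParseBlock(List<string> tokens, ref int position, bool nested)
    {
        var block = new Block();
        while (position < tokens.Count)
        {
            string token = tokens[position++];
            if (token == "{")
            {
                block.Items.Add(ParseBlock(tokens, ref position, true));
            }
            else if (token == "}")
            {
                if (!nested) throw new FormatException("Unexpected '}'");
                return block;
            }
            else if (token.Trim().Length > 0)
            {
                block.Items.Add(token.Trim());
            }
        }
        if (nested) throw new FormatException("Missing '}'");
        return block;
    }
}

public static class Program
{
    public static void Main()
    {
        Block root = BlockParser.Parse("{ naber(); } { int x = 5; if (x == 5) { x = 5; } }");
        Console.WriteLine(root.Items.Count);                    // 2 top-level blocks
        Console.WriteLine(((Block)root.Items[1]).Items.Count);  // 2: code text and one nested block
    }
}
```

A real language would tokenize the statements as well; here the text between braces is kept as opaque strings so that only the nesting is shown.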
I agree with those saying that regex isn't what you should use for parsing code.
That said, some regex engines can match balanced characters and so extract the code in a block.
This might work for you: {((?>[^{}]+|(?R))*)}. It requires a regex engine that supports recursive patterns such as (?R) (PCRE does; .NET's Regex does not).
More about it here: Match balanced curly braces
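Since the question is about C#, it may help to note that .NET offers balancing groups, which can play a similar role to recursion for this case. The following is a sketch under the assumption that braces never occur inside strings or comments; the names BalancedBraces, FindBlocks and the group name depth are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BalancedBraces
{
    // "{" pushes onto the "depth" stack, "}" pops it; the conditional
    // (?(depth)(?!)) rejects the match if anything is left on the stack.
    private static readonly Regex BlockRegex = new Regex(
        @"\{(?>[^{}]+|\{(?<depth>)|\}(?<-depth>))*(?(depth)(?!))\}");

    // Returns the outermost {...} blocks found in the source text.
    public static List<string> FindBlocks(string source)
    {
        var blocks = new List<string>();
        foreach (Match match in BlockRegex.Matches(source))
        {
            blocks.Add(match.Value);
        }
        return blocks;
    }
}

public static class Program
{
    public static void Main()
    {
        string source = "{ naber(); } { int x = 5; if (x == 5) { x = 5; } }";
        foreach (string block in BalancedBraces.FindBlocks(source))
        {
            Console.WriteLine(block);
        }
        // Prints two lines: the first block, then the second block
        // including its nested if-block.
    }
}
```

This only finds where balanced blocks start and end; building a tree of sub-blocks still needs a parser or a second pass over each match.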

Child regex pattern in variable [duplicate]

Say I have a regex matching a hexadecimal 32 bit number:
([0-9a-fA-F]{1,8})
When I construct a regex where I need to match this multiple times, e.g.
(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})
Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?
I'd imagine something like (warning, invented syntax!)
(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})
where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.
Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.
RegEx Subroutines
When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.
Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET Framework is lacking, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).
Syntax
Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.
Standard RegEx
Any: [abc][abc][abc]
Subroutine, by Name
Perl:     (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby:   (?<name>[abc])\g<name>\g<name>
Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby:          ([abc])\g<1>\g<1>
Subroutine, by Relative Position
Perl:     ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby:   ([abc])\g<-1>\g<-1>
Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)
Examples
Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)
Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))
And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
More Info
http://regular-expressions.info/subroutine.html
http://regex101.com/
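For .NET, where subroutine calls are not available, a comparable effect for the IPv4 example can be had by composing the pattern from a string variable. This is a sketch rather than a drop-in replacement: the names Ipv4, Octet and IsValid are invented here, and the ^ and $ anchors are an addition so that the whole input has to be an address.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Ipv4
{
    // The sub-expression is written once, as a plain string.
    private const string Octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])";

    // Reused four times via string interpolation (C# 6 or later).
    private static readonly Regex Pattern =
        new Regex($@"^{Octet}\.{Octet}\.{Octet}\.{Octet}$");

    public static bool IsValid(string text)
    {
        return Pattern.IsMatch(text);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(Ipv4.IsValid("192.168.0.1"));  // True
        Console.WriteLine(Ipv4.IsValid("256.1.1.1"));    // False
    }
}
```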
Why not do something like this, not really shorter but a bit more maintainable.
String.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-fA-F]{1,8}");
If you want more self-documenting code, I would assign the number regex string to a properly named const variable.
.NET regex does not support pattern recursion. While you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)) in Ruby and PHP/PCRE (where hex is a "technical" named capturing group whose name should not occur in the main pattern), in .NET you can simply define the block(s) as separate variables and then use them to build the pattern dynamically.
Starting with C#6, you may use an interpolated string literal that looks very much like a PCRE/Onigmo subpattern recursion, but is actually cleaner and has no potential bottleneck when the group is named identically to the "technical" capturing group:
C# demo:
using System;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        var block = "[0-9a-fA-F]{1,8}";
        var pattern = $@"(?<from>{block})\s*:\s*(?<to>{block})";
        Console.WriteLine(Regex.IsMatch("12345678 :87654321", pattern));
    }
}
The $@"..." is a verbatim interpolated string literal, where escape sequences are treated as combinations of a literal backslash and a char after it. Make sure to define literal { with {{ and } with }} (e.g. $@"(?:{block}){{5}}" to repeat a block 5 times).
For older C# versions, use string.Format:
var pattern = string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", block);
as is suggested in Mattias's answer.
If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?
string f = @"fc\d+/";
string e = @"\d+";
Regex regexObj = new Regex(f+e);
Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.
e.g.
/\b([a-z])\w+\1\b/
Will only match text and spaces in the following text:
This is a sample text which is not the title since it does not end with 2 spaces.
There is no such predefined class. I think you can simplify it using ignore-case option, e.g.:
(?i)(?<from>[0-9a-z]{1,8})\s*:\s*(?<to>[0-9a-z]{1,8})
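To refer back to a named capture group, use this syntax: \k<name> or \k'name'. Note that this is a backreference: it matches the same text that the group captured, not the group's pattern again.
So the following only matches when both numbers are identical:
(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>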
More info: http://www.regular-expressions.info/named.html
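A quick check in .NET shows how \k<from> behaves in practice. The class name BackreferenceDemo and the ^ and $ anchors are additions for this sketch; the rest is the pattern from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // \k<from> must match exactly the text that the "from" group captured.
    public const string Pattern = @"^(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>$";

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("1A2B : 1A2B", Pattern)); // True: same text twice
        Console.WriteLine(Regex.IsMatch("1A2B : FFFF", Pattern)); // False: a different hex number
    }
}
```

So a backreference is useful for "the same value again", while reusing the sub-pattern for a different value needs string composition in .NET.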

Matching and replacing function expressions

I need to do some very light parsing of C# (actually transpiled Razor code) to replace a list of function calls with textual replacements.
If given a set containing {"Foo.myFunc" : "\"def\"" } it should replace this code:
var res = "abc" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));
with this:
var res = "abc" + "def";
I don't care about the nested expressions.
This seems fairly trivial and I think I should be able to avoid building an entire C# parser using something like this for every member of the mapping set:
find expression start (e.g. Foo.myFunc)
Push()/Pop() parentheses on a Stack until Count == 0.
Mark this as expression stop
replace everything from expression start until expression stop
But maybe I don't need to ... Is there a (possibly built-in) .NET library that can do this for me? Counting is not possible in the family of languages that RE is in, but maybe the extended regex syntax in C# can handle this somehow using back references?
edit:
As the comments to this answer demonstrate, simply counting brackets will not be sufficient in general, as something like trollMe("(") will throw off those algorithms. Only true parsing would then suffice, I guess (?).
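The counting steps listed above could be sketched as follows, with one extension: regular double-quoted string literals are skipped so that a call like trollMe("(") does not upset the count. This is an assumption-laden sketch, not a C# parser: verbatim strings, char literals and comments are not handled, there is no identifier-boundary check, and the names CallReplacer, FindCallEnd and Replace are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CallReplacer
{
    // Returns the index just past the ')' that closes the '(' at openIndex,
    // or -1 if the parentheses never balance.
    private static int FindCallEnd(string code, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < code.Length; i++)
        {
            char c = code[i];
            if (c == '"')
            {
                // Skip a regular string literal, honouring backslash escapes.
                for (i++; i < code.Length && code[i] != '"'; i++)
                {
                    if (code[i] == '\\') i++;
                }
            }
            else if (c == '(')
            {
                depth++;
            }
            else if (c == ')' && --depth == 0)
            {
                return i + 1;
            }
        }
        return -1;
    }

    // Replaces every call to a mapped function, arguments included, with its text.
    public static string Replace(string code, IDictionary<string, string> map)
    {
        foreach (KeyValuePair<string, string> entry in map)
        {
            var result = new StringBuilder();
            int position = 0;
            while (true)
            {
                int start = code.IndexOf(entry.Key + "(", position, StringComparison.Ordinal);
                int end = start < 0 ? -1 : FindCallEnd(code, start + entry.Key.Length);
                if (end < 0) break;
                // Copy the text before the call, then the replacement.
                result.Append(code, position, start - position).Append(entry.Value);
                position = end;
            }
            code = result.Append(code, position, code.Length - position).ToString();
        }
        return code;
    }
}

public static class Program
{
    public static void Main()
    {
        var map = new Dictionary<string, string> { { "Foo.myFunc", "\"def\"" } };
        string code = "var res = \"abc\" + Foo.myFunc(foo, Bar.otherFunc( Baz.funk()));";
        Console.WriteLine(CallReplacer.Replace(code, map)); // var res = "abc" + "def";
    }
}
```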
The trick for a normal string will be:
(?>"(\\"|[^"])*")
A verbatim string:
(?>@"(""|[^"])*")
Maybe this can help, but I'm not sure that this will work in all cases:
<func>(?=\()((?>/\*.*?\*/)|(?>@"(""|[^"])*")|(?>"(\\"|[^"])*")|\r?\n|[^()"]|(?<open>\()|(?<-open>\)))+?(?(open)(?!))
Replace <func> with your function name.
Needless to say, trollMe("\"(", "((", @"abc""de((f") works as expected.

How to customize Lucene.NET to search for words with symbols without case-sensitivity (e.g. "C#" or ".net")?

The standard analyzer does not work. From what I can understand, it changes this to a search for c and net
The WhitespaceAnalyzer would work but it's case sensitive.
The general rule is search should work like Google so hoping it's a configuration thing considering .net, c# have been out there for a while or there's a workaround for this.
Per the suggestions below, I tried the custom WhitespaceAnalyzer, but keywords that are separated by a comma with no space are then not handled correctly, e.g.
java,.net,c#,oracle
will not be returned while searching which would be incorrect.
I came across PatternAnalyzer which is used to split the tokens but can't figure out how to use it in this scenario.
I'm using Lucene.Net 3.0.3 and .NET 4.0
Write your own custom analyzer class similar to SynonymAnalyzer in Lucene.Net – Custom Synonym Analyzer. Your override of TokenStream could solve this by pipelining the stream using WhitespaceTokenizer and LowerCaseFilter.
Remember that your indexer and searcher need to use the same analyzer.
Update: Handling multiple comma-delimited keywords
If you only need to handle unspaced comma-delimited keywords for searching, not indexing then you could convert the search expression expr as below.
expr = expr.Replace(',', ' ');
Then pass expr to the QueryParser. If you want to support other delimiters like ';' you could do it like this:
var terms = expr.Split(new char[] { ',', ';'} );
expr = String.Join(" ", terms);
But you also need to check for a phrase expression like "sybase,c#,.net,oracle" (expression includes the quote " chars) which should not be converted (the user is looking for an exact match):
expr = expr.Trim();
if (!(expr.StartsWith("\"") && expr.EndsWith("\"")))
{
expr = expr.Replace(',', ' ');
}
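The check above only covers an expression that is one quoted phrase as a whole. As a rough sketch, the same idea can be applied character by character, so that delimiters are converted only outside quotes and an expression mixing a phrase with loose keywords is handled too. The names SearchExpression and Normalize are made up, and an unbalanced quote is assumed to start a phrase that runs to the end.

```csharp
using System;
using System.Text;

public static class SearchExpression
{
    // Replaces ',' and ';' with spaces, except inside "quoted phrases".
    public static string Normalize(string expr)
    {
        var result = new StringBuilder();
        bool inPhrase = false;
        foreach (char c in expr.Trim())
        {
            if (c == '"') inPhrase = !inPhrase;
            bool isDelimiter = c == ',' || c == ';';
            result.Append(!inPhrase && isDelimiter ? ' ' : c);
        }
        return result.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        string expr = "\"sybase,c#,.net,oracle\" server,c#,.net,sybase";
        Console.WriteLine(SearchExpression.Normalize(expr));
        // "sybase,c#,.net,oracle" server c# .net sybase
    }
}
```

The normalized string can then be passed to the QueryParser as before.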
The expression might include both a phrase and some keywords, like this:
"sybase,c#,.net,oracle" server,c#,.net,sybase
Then you need to parse and translate the search expression to this:
"sybase,c#,.net,oracle" server c# .net sybase
If you also need to handle unspaced comma-delimited keywords for indexing then you need to parse the text for unspaced comma-delimited keywords and store them in a distinct field eg. Keywords (which must be associated with your custom analyzer). Then your search handler needs to convert a search expression like this:
server,c#,.net,sybase
to this:
Keywords:server Keywords:c# Keywords:.net Keywords:sybase
or more simply:
Keywords:(server c# .net sybase)
Use the WhitespaceAnalyzer and chain it with a LowerCaseFilter.
Use the same chain at search and index time. By converting everything to lower case, you actually make it case insensitive.
According to your problem description, that should work and be simple to implement.
For others who might be looking for an answer as well:
the final answer turned out to be to create a custom TokenFilter and a custom Analyzer using
that token filter along with WhitespaceTokenizer, LowerCaseFilter, etc., about 30 lines of code in all. I will create a blog post and post the link here when I do; I have to create a blog first!

Is there a way to create a string that matches a given C# regex?

My application has a feature that parses text using a regular expression to extract special values. I find myself also needing to create strings that follow the same format. Is there a way to use the already defined regular expression to create those strings?
For example, assume my regex looks something like this:
public static Regex MyRegex = new Regex( @"sometext_(?<group1>\d*)" );
I'd like to be able to use MyRegex to create a new string, something like:
var created = MyRegex.ToString( new Dictionary<string, string>() {{ "group1", "data1" }});
Such that created would then have the value "sometext_data1".
Update: Judging from some of the answers below, I didn't make myself clear enough. I don't want to generate random strings matching the criteria, I want to be able to create specific strings matching the criteria. In the example above, I provided "data1" to fill "group1". Basically, I have a regex that I want to use in a manner similar to format strings instead of also defining a separate format string.
You'll need a tool called Rex. Well you don't 'need' it, but it's what I use :-)
http://research.microsoft.com/en-us/projects/rex/
You can (although not ideal), add the exe as a reference to your project and utilize the classes that have been made public.
It works quite well.
Native RegEx cannot do something like this.
If you really wanted to generate strings that matched a set of criteria, you could investigate definite clause grammars (DCG) instead of a regular expression. A logic programming language such as Prolog should be able to generate strings that matched the grammatical rules you defined.
From Wikipedia:
A basic example of DCGs helps to illustrate what they are and what they look like.
sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, noun.
verb_phrase --> verb, noun_phrase.
det --> [the].
det --> [a].
noun --> [cat].
noun --> [bat].
verb --> [eats].
This generates sentences such as "the cat eats the bat", "a bat eats the cat". One can generate all of the valid expressions in the language generated by this grammar at a Prolog interpreter...
From your question, it sounds like this isn't really what you want to do. My advice would be to simply create a class that held your Dictionary<String, String> object and had a custom ToString() method that returned data in the appropriate format. This would be much easier ;-)
e.g.:
public class SpecialObject
{
public Dictionary<string, string> SpecialDictionary { get; set; }
public override string ToString()
{
return "sometext_group1data1"; // or whatever you want
}
}
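If the regex is as simple as the one in the question, a limited form of "regex as format string" can be sketched by substituting values for the named groups in the pattern text itself. This is a sketch under strong assumptions: the pattern consists of literal text plus flat named groups (no nesting, no other metacharacters or escapes outside the groups), and the names RegexTemplate and Fill are hypothetical. The result is checked against the regex, so a value that the group would not accept is rejected.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTemplate
{
    // Finds flat named groups such as (?<group1>\d*) in a pattern string.
    private static readonly Regex NamedGroup = new Regex(@"\(\?<(\w+)>[^()]*\)");

    // Builds a string from the pattern by replacing each named group with its value.
    public static string Fill(Regex regex, IDictionary<string, string> values)
    {
        // Regex.ToString() returns the pattern the regex was created with.
        string text = NamedGroup.Replace(regex.ToString(), m => values[m.Groups[1].Value]);

        // Verify that the generated text really satisfies the whole pattern.
        if (!Regex.IsMatch(text, "^(?:" + regex + ")$"))
        {
            throw new ArgumentException("Values do not fit the pattern: " + text);
        }
        return text;
    }
}

public static class Program
{
    public static void Main()
    {
        var myRegex = new Regex(@"sometext_(?<group1>\d*)");
        var values = new Dictionary<string, string> { { "group1", "42" } };
        Console.WriteLine(RegexTemplate.Fill(myRegex, values)); // sometext_42
    }
}
```

Note that the value "data1" from the question would be rejected here, because the group is declared as \d* and "data1" contains letters.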
You might want to check to see if Pex can figure it out. Create a method that takes a string and returns whether that Regex matches it. Pex might just be smart enough to find inputs for your method that will test various aspects of the expression. (It might even help you catch some corner cases you hadn't considered.)
Other than that, no. You're asking a system (regex) to do something it was totally not built to do.
