Why is my .NET regex not working correctly? - c#

I have a text file which is in the format:
key1:val1,
key2:val2,
key3:val3
and I am trying to parse the key/value pairs out with a regex. Here is the regex code I am using with the same example:
string input = @"key1:val1,
key2:val2,
key3:val3";
var r = new Regex(@"^(?<name>\w+):(?<value>\w+),?$", RegexOptions.Multiline | RegexOptions.ExplicitCapture);
foreach (Match m in r.Matches(input))
{
    Console.WriteLine(m.Groups["name"].Value);
    Console.WriteLine(m.Groups["value"].Value);
}
When I loop through r.Matches, sometimes certain key/value pairs don't appear, and it seems to be the ones with the comma at the end of the line - but I should be taking that into account with the ,?. What am I missing here?

This might be a good situation for String.Split rather than a regex:
foreach (string pair in input.Split(new Char[] { ',' }))
{
    // Trim removes the line break that follows each comma
    string[] items = pair.Trim().Split(new Char[] { ':' });
    Console.WriteLine(items[0]);
    Console.WriteLine(items[1]);
}

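The problem is that your regular expression does not account for the line break at the end of the first two lines. In Multiline mode, $ matches only before \n (or at the very end of the string), so if the text uses Windows line endings, the \r sits between the comma and the $.
Try changing it to
@"^(?<name>\w+):(?<value>\w+),?(\n|\r|\r\n)?$"
and it should work.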
By the way, I love regular expressions, but given the problem you are trying to solve, go for the string.Split solution. It will be much easier to read...
EDIT: after reading your comment, where you say that this is a simplified version of your problem, maybe you could simplify the expression by adding some "tolerance" for spaces / newlines at the end of the match with
@"^(?<name>\w+):(?<value>\w+),?\s*$"
Also, when you play with regular expressions, test them with a tool like Expresso, it saves a lot of time.
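A minimal sketch of the same idea, assuming the input may use either \r\n or \n line endings. The helper name ParsePairs is made up for this illustration, and the optional \r? before $ is just one more way of absorbing the carriage return. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the question): returns the key/value pairs found in the input.
static List<KeyValuePair<string, string>> ParsePairs(string input)
{
    // \r? absorbs the carriage return of a Windows line ending, because in
    // Multiline mode $ matches only before \n or at the very end of the string.
    var r = new Regex(@"^(?<name>\w+):(?<value>\w+),?\r?$", RegexOptions.Multiline);
    var pairs = new List<KeyValuePair<string, string>>();
    foreach (Match m in r.Matches(input))
    {
        pairs.Add(new KeyValuePair<string, string>(m.Groups["name"].Value, m.Groups["value"].Value));
    }
    return pairs;
}

// Usage: the example from the question, written with explicit Windows line endings.
foreach (var pair in ParsePairs("key1:val1,\r\nkey2:val2,\r\nkey3:val3"))
{
    Console.WriteLine(pair.Key + " = " + pair.Value);
}
```

With the sample input above, all three pairs should be found regardless of which line-ending style is used.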

Get rid of the RegexOptions.Multiline option.

Related

RegEx for matching special chars no spaces or newlines

I have a string and want to use regex to match all the chars, but no spaces.
I tried to replace all the spaces with nothing, using:
Regex.Replace(seller, @"[A-z](.+)", m => m.Groups[1].Value);
//rating
var betyg = Regex.Replace(seller, @"[A-z](.+)", m => m.Groups[1].Value);
I expect the output of
"Iris-presenter | 5"
but, the output is
"Iris-presenter"
as seen in this demo.
The string is:
<spaces>Iris-presenter
<spaces>|
<spaces>5
Great question! I'm not quite sure if this is what you are looking for. This expression, however, matches your input string:
^((?!\s|\n).)*
Edit
Based on revo's advice, the expression can be much simplified, because
^((?!\s|\n).)* is equal to ^((?!\s).)* and both are equal to ^\S*.
I used (\s(.*?)) for it to work. This removes all spaces and newlines.
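If the goal is the expected output "Iris-presenter | 5" with single spaces kept around the bar, one option is to collapse every run of whitespace into one space and trim the ends. This is only a sketch: the helper name CollapseWhitespace is made up, and the sample string assumes that <spaces> in the question stands for a run of ordinary spaces.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces each run of whitespace (spaces, tabs, line breaks)
// with a single space, then trims the ends.
static string CollapseWhitespace(string text)
{
    return Regex.Replace(text, @"\s+", " ").Trim();
}

// Usage: an assumed reconstruction of the string from the question.
string seller = "   Iris-presenter\n   |\n   5";
Console.WriteLine(CollapseWhitespace(seller)); // Iris-presenter | 5
```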

REGEX finding strings within a string

I seem to write one regular expression a year and always end up asking for help.
Here's a string (it's a search string from Solr) and I want to select every instance of the search word.
Here's the input:-
http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A
I need to select any text between every '%3A' and '+OR' as well as the final '%3Atheory))' - in this case the word 'theory', but it will be a different word every time - the only known thing is that it'll be alphabetic text between the '%3A' and the '+OR'. And it needs to stop at the '+AND+'.
I've got as far as /%3A(.*?)[+OR]/g - it's a start I guess...
It doesn't find '%3Atheory))' and it doesn't stop at '+AND+'
I'm struggling with 'find this' OR 'find that' as well as stopping at a string.
Can anyone offer some guidance?
If you're using C# it might be better to split this into two operations using String.Split and Regex.Matches, like so:
string input = @"http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A";
Regex regex = new Regex(@"%3A(.*?)(?:\+OR|\)\))");
var splitted = input.Split(new[] { "AND" }, StringSplitOptions.None);
var matches = regex.Matches(splitted.First()); // First() needs using System.Linq
foreach (Match m in matches)
{
    // Or whatever you like to do with your matches
    Console.WriteLine(m.Groups[1].Value);
}
Regex.Split has an option to keep the separating strings. So for the text given in the question, code like that below will split it into pieces:
string[] pieces = Regex.Split(theInputText, "(%3A.*?\\+(?:AND|OR))");
foreach (string ss in pieces)
{
Console.WriteLine(ss);
}
Here is a small section of the output:
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND
+-(virtualPath
%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR
+virtualPath
Having split the string into pieces it should be a simple matter to screen for the array elements with the correct starting and ending characters, also to find the last %3Atheory... entry.
Note: The question discusses +OR and +AND+ but all the +ORs are followed with a + so it may be better to include a final + in the expression, as ...OR)\\+).
Note: The inner brackets in the regular expression are non-capturing, i.e. (?: ). If they were capturing brackets then the AND and OR captures would be included in the output array.
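Both approaches can also be combined into one small routine: cut the string at the first '+AND+', then capture the letters that follow each '%3A'. The sketch below is an illustration under assumptions, not either answerer's code: the helper name ExtractTerms and the shortened sample query are made up, and the search word is assumed to consist of ASCII letters only, as the question states.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every search word that follows %3A before the first +AND+.
static List<string> ExtractTerms(string query)
{
    // Keep only the part before the first "+AND+" (the whole string if it is absent).
    int cut = query.IndexOf("+AND+", StringComparison.Ordinal);
    string head = cut >= 0 ? query.Substring(0, cut) : query;

    // Letters after %3A, accepted only when followed by "+OR" or "))".
    // The lookahead checks the terminator without making it part of the match.
    var terms = new List<string>();
    foreach (Match m in Regex.Matches(head, @"%3A([A-Za-z]+)(?=\+OR|\)\))"))
    {
        terms.Add(m.Groups[1].Value);
    }
    return terms;
}

// Usage: a shortened, made-up query in the same shape as the one in the question.
string sample = "q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+(backfields%3Atheory))+AND+((virtualPath%3A%22x%22))";
Console.WriteLine(string.Join(",", ExtractTerms(sample))); // theory,theory,theory
```

Cutting at '+AND+' rather than 'AND' avoids splitting on a search word that happens to contain those letters.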

Regular Expression for a JSON-type String

I am having trouble splitting a String using a regular expression
"[{'name':'abc','surname':'def'},{'name':'ghi','surname':'jkl'},{'name':'asdf','surname':'asdf'}]"
Now I'd like to split this to
"{'name':'abc','surname':'def'}" and "{'name':'ghi','surname':'jkl'}"
Later on I will deserialize both Strings and work with the values. I must admit that I've worked way too little with regular expressions and would love if someone could help me. I want to split by those square brackets as well as by the middle comma. I was either splitting by ALL commas or not splitting at all.
Kind regards
This Regex will do that:
({.*?})
and here is a Regex 101 to prove it.
To use it you might do something like this:
var matches = Regex.Matches(input, pattern);
// matches holds one Match per {...} object; Regex.Match would return only the first
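A minimal sketch of how this could look in full, using the string from the question. The helper name SplitObjects is made up for illustration, and the lazy pattern only works while the objects are flat: a nested { } or a brace inside a value would break it, at which point a real JSON parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns each flat {...} object as its own string.
static List<string> SplitObjects(string input)
{
    var objects = new List<string>();
    // Lazy .*? stops at the first closing brace, so each match is one object.
    foreach (Match m in Regex.Matches(input, @"\{.*?\}"))
    {
        objects.Add(m.Value);
    }
    return objects;
}

// Usage: the string from the question.
string json = "[{'name':'abc','surname':'def'},{'name':'ghi','surname':'jkl'},{'name':'asdf','surname':'asdf'}]";
foreach (string obj in SplitObjects(json))
{
    Console.WriteLine(obj);
}
```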

Escaping \x from strings

Well, I got this little method:
static string escapeString(string str) {
    string s = str.Replace(@"\r", "\r").Replace(@"\n", "\n").Replace(@"\t", "\t");
    Regex regex = new Regex(@"\\x(..)");
    var matches = regex.Matches(s);
    foreach (Match match in matches) {
        s = s.Replace(match.Value, ((char)Convert.ToByte(match.Value.Replace(@"\x", ""), 16)).ToString());
    }
    return s;
}
It replaces escape sequences such as "\x65" in a string that I get from args[0].
But my problem is: "\\x65" will be replaced too, so I get "\e". I tried to figure out a regex which would check whether there is more than one backslash, but I had no luck.
Can somebody give me a hint?
You can continue to hack regexes together with things like "\s|\w\\x(..)" to rule out the \\x65 case. That will be brittle, though, since there is no guarantee that your \x65 sequence always has a space or word character in front of it; it could be at the beginning of the file. Also, your regex will match \xTT, which obviously isn't a valid hex escape. Consider replacing the '.' with a character class, as in "\\x([0-9a-fA-F]{2})".
If this was a school project, I would do something like the following. You can replace all occurrences of "\\" with another unlikely sequence, like "_#!!#!!#_", run the regex and replacements, and then replace every occurrence of the unlikely sequence with "\". For example:
String s = inputString.Replace(@"\\", @"_#!!#!!#_");
// do all of the regex, replacements, etc here
String output = s.Replace(@"_#!!#!!#_", @"\");
However, you shouldn't do this in production code because if your input stream ever has the magic sequence then you will get extra backslashes.
It's obvious that you are writing some kind of interpolator. I feel obligated to recommend looking into something more robust, like lexers that use regexes to form Finite State Machines. Wiki has some great articles on this topic, and I'm a big fan of ANTLR. It may be overengineering now, but if you keep running into these special cases, consider solving your problem in a more general way.
Start reading here for the theory: http://en.wikipedia.org/wiki/Lexical_analysis
Use a negative look-behind:
Regex regex = new Regex(@"(?<!([^\\]|^)\\)\\x(..)");
This asserts that the previous character is not a solo backslash, but without consuming the previous character (look-arounds are zero-width).
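Another way to sidestep the double-backslash problem is to handle every escape in one left-to-right pass, so that "\\" is consumed as a unit before the x65 after it can be looked at. The sketch below assumes that "\\" should turn into a single backslash, which the question does not state explicitly; the method name Unescape is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical single-pass replacement for escapeString.
static string Unescape(string str)
{
    // One alternation covers \\, \r, \n, \t and \xHH. Because Regex.Replace scans
    // left to right and never rescans replaced text, "\\x65" is read as an escaped
    // backslash followed by the plain characters x65.
    return Regex.Replace(str, @"\\(\\|[rnt]|x[0-9A-Fa-f]{2})", m =>
    {
        string body = m.Groups[1].Value;
        switch (body[0])
        {
            case '\\': return "\\";
            case 'r': return "\r";
            case 'n': return "\n";
            case 't': return "\t";
            default: // 'x' followed by exactly two hex digits
                return ((char)Convert.ToInt32(body.Substring(1), 16)).ToString();
        }
    });
}

// Usage
Console.WriteLine(Unescape(@"\x65"));  // e
Console.WriteLine(Unescape(@"\\x65")); // \x65
```

Requiring two hex digits also means that input such as \xTT is left untouched instead of throwing during conversion.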

C# multiple string match

I need a C# string search algorithm which can match multiple occurrences of a pattern. For example, if the pattern is 'AA' and the string is 'BAAABBB', Regex produces the match result Index = 1, but I need the result Index = 1,2. Can I force Regex to give such a result?
Use a lookahead pattern:-
"A(?=A)"
This finds any A that is followed by another A without consuming the following A. Hence AAA will match this pattern twice.
To summarize all previous comments:
Dim rx As Regex = New Regex("(?=AA)")
Dim mc As MatchCollection = rx.Matches("BAAABBB")
This will produce the result you are requesting.
EDIT:
Here is the C# version (working with VB.NET today so I accidentally continued with VB.NET).
Regex rx = new Regex("(?=AA)");
MatchCollection mc = rx.Matches("BAAABBB");
Any regular expression can return all of its matches as a MatchCollection:
Regex.Matches()
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchcollection.aspx
Try this:
System.Text.RegularExpressions.MatchCollection matchCol;
System.Text.RegularExpressions.Regex regX = new System.Text.RegularExpressions.Regex("(?=AA)");
string index="",str="BAAABBB";
matchCol = regX.Matches(str);
foreach (System.Text.RegularExpressions.Match mat in matchCol)
{
index = index + mat.Index + ",";
}
The contents of index are what you are looking for with the last comma removed.
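The same lookahead idea can be wrapped in a small routine that returns the indexes as a list, which avoids the trailing comma altogether. This is a sketch, not the answerer's code: the helper name OverlappingIndexes is made up, and the search text is assumed to be a literal string, hence the call to Regex.Escape.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: start index of every (possibly overlapping) occurrence of a literal.
static List<int> OverlappingIndexes(string text, string literal)
{
    // The lookahead is zero-width, so each match consumes nothing and the engine
    // moves on one character at a time, which is what allows overlapping hits.
    var rx = new Regex("(?=" + Regex.Escape(literal) + ")");
    var indexes = new List<int>();
    foreach (Match m in rx.Matches(text))
    {
        indexes.Add(m.Index);
    }
    return indexes;
}

// Usage: the example from the question.
Console.WriteLine(string.Join(",", OverlappingIndexes("BAAABBB", "AA"))); // 1,2
```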
Are you really looking for substrings that are only two characters long? If so, searching a 20-million character string is going to be slow no matter what regex you use (or any non-regex technique, for that matter). If the search string is longer, the regex engine can employ a search algorithm like Boyer-Moore or Knuth-Morris-Pratt to speed up the search--the longer the better, in fact.
By the way, the kind of search you're talking about is called overlapping matches; I'll add that to the tags.
