C# multiple string match

I need a C# string search algorithm that can match multiple occurrences of a pattern. For example, if the pattern is 'AA' and the string is 'BAAABBB', Regex produces a single match at Index = 1, but I need the result Index = 1,2. Can I force Regex to give such a result?

Use a lookahead pattern:-
"A(?=A)"
This finds any A that is followed by another A without consuming the following A. Hence AAA will match this pattern twice.

To summarize all previous comments:
Dim rx As Regex = New Regex("(?=AA)")
Dim mc As MatchCollection = rx.Matches("BAAABBB")
This will produce the result you are requesting.
EDIT:
Here is the C# version (working with VB.NET today so I accidentally continued with VB.NET).
Regex rx = new Regex("(?=AA)");
MatchCollection mc = rx.Matches("BAAABBB");

Any regular expression can return all of its matches as a MatchCollection:

Regex.Matches()
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchcollection.aspx

Try this:
System.Text.RegularExpressions.MatchCollection matchCol;
System.Text.RegularExpressions.Regex regX = new System.Text.RegularExpressions.Regex("(?=AA)");
string index="",str="BAAABBB";
matchCol = regX.Matches(str);
foreach (System.Text.RegularExpressions.Match mat in matchCol)
{
index = index + mat.Index + ",";
}
After the loop, index holds "1,2,". Remove the trailing comma and you have the result you are looking for.

Are you really looking for substrings that are only two characters long? If so, searching a 20-million character string is going to be slow no matter what regex you use (or any non-regex technique, for that matter). If the search string is longer, the regex engine can employ a search algorithm like Boyer-Moore or Knuth-Morris-Pratt to speed up the search--the longer the better, in fact.
By the way, the kind of search you're talking about is called overlapping matches; I'll add that to the tags.
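As a point of comparison with the lookahead answers, here is a minimal sketch of a non-regex way to collect overlapping matches. The class and method names (OverlapDemo, FindOverlapping) are made up for illustration; the only library call it relies on is String.IndexOf with a start index. The idea is to resume the search one character after each hit instead of after the whole match.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapDemo
{
    // Returns the start index of every occurrence of pattern in text,
    // including occurrences that overlap each other.
    public static List<int> FindOverlapping(string text, string pattern)
    {
        var indexes = new List<int>();
        if (string.IsNullOrEmpty(pattern)) return indexes;

        int pos = text.IndexOf(pattern, StringComparison.Ordinal);
        while (pos >= 0)
        {
            indexes.Add(pos);
            // Resume one character after the last hit, not after the whole match,
            // so that overlapping occurrences are found as well.
            pos = text.IndexOf(pattern, pos + 1, StringComparison.Ordinal);
        }
        return indexes;
    }

    public static void Main()
    {
        // Prints 1,2 for the example from the question.
        Console.WriteLine(string.Join(",", FindOverlapping("BAAABBB", "AA")));
    }
}
```

Whether this is faster than the regex on a very large input would need to be measured; it is shown only because it makes the "advance by one" idea explicit.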

Related

REGEX finding strings within a string

I seem to write one regular expression a year and always end up asking for help.
Here's a string (it's a search string from Solr) and I want to select every instance of the search word.
Here's the input:-
http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A
I need to select any text between every '%3A' and '+OR', as well as the final '%3Atheory))'. In this case the word is 'theory', but it will be a different word every time. The only known thing is that it will be alphabetic text between the '%3A' and the '+OR'. The search also needs to stop at the '+AND+'.
I've got as far as /%3A(.*?)[+OR]/g - it's a start I guess...
It doesn't find '%3Atheory))' and it doesn't stop at '+AND+'
I'm struggling with 'find this' OR 'find that' as well as stopping at a string.
Can anyone offer some guidance?
If you're using C#, it might be better to split this into two operations, using String.Split and then Regex.Matches, like so:
string input = @"http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A";
Regex regex = new Regex(@"%3A(.*?)(?:\+OR|\)\))");
var splitted = input.Split(new[] { "AND" }, StringSplitOptions.None);
var matches = regex.Matches(splitted.First());
foreach (Match m in matches)
{
// Or whatever you like to do with your matches
Console.WriteLine(m.Groups[1].Value);
}
Regex.Split keeps the separating strings in its output when the pattern wraps them in capturing parentheses. So for the text given in the question, code like that below will split it into pieces:
string[] pieces = Regex.Split(theInputText, "(%3A.*?\\+(?:AND|OR))");
foreach (string ss in pieces)
{
Console.WriteLine(ss);
}
Here is a small section of the output:
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND
+-(virtualPath
%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR
+virtualPath
Having split the string into pieces it should be a simple matter to screen for the array elements with the correct starting and ending characters, also to find the last %3Atheory... entry.
Note: The question discusses +OR and +AND+ but all the +ORs are followed with a + so it may be better to include a final + in the expression, as ...OR)\\+).
Note: The inner brackets in the regular expression are non capturing, ie (?: ). If they were capturing brackets then the AND and OR captures would be included in the output array.
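Putting the two ideas from the answers together (cut the string at the first '+AND+', then match inside the first part), a minimal sketch could look like this. It assumes, as the question states, that the search word is purely alphabetic. The names SolrTermDemo and ExtractTerms and the shortened sample query are invented for illustration. A lookahead is used so that the '+OR' or '))' terminator is checked but not consumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SolrTermDemo
{
    // Collects the alphabetic word after each %3A, looking only at the
    // part of the query that precedes the first "+AND+".
    public static List<string> ExtractTerms(string query)
    {
        int cut = query.IndexOf("+AND+", StringComparison.Ordinal);
        string head = cut >= 0 ? query.Substring(0, cut) : query;

        var terms = new List<string>();
        // The word must be followed by "+OR" or "))"; the lookahead does not consume it.
        foreach (Match m in Regex.Matches(head, @"%3A([A-Za-z]+)(?=\+OR|\)\))"))
        {
            terms.Add(m.Groups[1].Value);
        }
        return terms;
    }

    public static void Main()
    {
        // Shortened, hypothetical version of the query from the question.
        string query = "q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+(backfields%3Atheory))"
                     + "+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A";
        foreach (string term in ExtractTerms(query))
        {
            Console.WriteLine(term);
        }
    }
}
```

Cutting the string first keeps the regular expression simple: it never has to express "stop at +AND+" itself.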

Regex to test for int pattern

Fairly simple, I assume. I need to build a regex pattern to match and pull out the pattern int,int,int from a string. I would like it to include negative ints too (I then want to substitute a computed value into this string).
e.g.
1,2,3
-1,3,5
100,-2,-3
etc
Regex regex = new Regex(@"\d,\d,\d");
However, I don't think this takes negatives into account?
An example string might be:
value={2,3,4},value2=test,value3={-13,0,0},anothervalue=234,nextvalue={0,0,2}
According to the information you have provided, to include negative numbers, change your regex as below:
Regex regex = new Regex(@"\-?\d,\-?\d,\-?\d");
To allow numbers with more than one digit:
Regex regex = new Regex(@"\-?\d+,\-?\d+,\-?\d+");
Here is another option in addition to Waqar's pattern.
-?\d[0-9]*,-?\d[0-9]*,-?\d[0-9]*
\d matches only a single digit; use \d+ to allow one or more:
Regex regex = new Regex(@"-?\d+,-?\d+,-?\d+");
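To show the pattern from the answers applied to the example string in the question, here is a small sketch that pulls out each triple and parses it into integers. The names IntTripleDemo and ExtractTriples are hypothetical, and capturing groups were added around each number so the values can be read individually.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class IntTripleDemo
{
    // Finds every int,int,int triple (negatives allowed) and parses it.
    public static List<int[]> ExtractTriples(string text)
    {
        var triples = new List<int[]>();
        foreach (Match m in Regex.Matches(text, @"(-?\d+),(-?\d+),(-?\d+)"))
        {
            triples.Add(new[]
            {
                int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture),
                int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture),
                int.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture)
            });
        }
        return triples;
    }

    public static void Main()
    {
        string input = "value={2,3,4},value2=test,value3={-13,0,0},anothervalue=234,nextvalue={0,0,2}";
        foreach (int[] t in ExtractTriples(input))
        {
            Console.WriteLine(string.Join(",", t));
        }
    }
}
```

To substitute a computed value back into the string, the same pattern could be passed to Regex.Replace together with a MatchEvaluator.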

Why is my .NET regex not working correctly?

I have a text file which is in the format:
key1:val1,
key2:val2,
key3:val3
and I am trying to parse the key/value pairs out with a regex. Here is the regex code I am using with the same example:
string input = @"key1:val1,
key2:val2,
key3:val3";
var r = new Regex(@"^(?<name>\w+):(?<value>\w+),?$", RegexOptions.Multiline | RegexOptions.ExplicitCapture);
foreach (Match m in r.Matches(input))
{
Console.WriteLine(m.Groups["name"].Value);
Console.WriteLine(m.Groups["value"].Value);
}
When I loop through r.Matches, sometimes certain key/value pairs don't appear, and it seems to be the ones with the comma at the end of the line - but I should be taking that into account with the ,?. What am I missing here?
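This might be a good situation for String.Split rather than a regex:
foreach (string pair in input.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
{
    // Trim the line break that is left at the start of each pair after splitting on commas.
    string[] items = pair.Trim().Split(':');
    Console.WriteLine(items[0]);
    Console.WriteLine(items[1]);
}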
The problem is that your regular expression is not matching the newline in the first two lines.
Try changing it to
@"^(?<name>\w+):(?<value>\w+),?(\n|\r|\r\n)?$"
and it should work.
By the way, I love regular expressions, but given the problem you are trying to solve, go for the string.Split solution. It will be much easier to read...
EDIT: After reading your comment, where you say that this is a simplified version of your problem, maybe you could simplify the expression by adding some "tolerance" for spaces and newlines at the end of the match with
@"^(?<name>\w+):(?<value>\w+),?\s*$"
Also, when you play with regular expressions, test them with a tool like Expresso, it saves a lot of time.
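One likely explanation for the missing pairs, assuming the input uses Windows line endings (\r\n): with RegexOptions.Multiline, $ in .NET matches only before \n, so the \r that sits between the comma and the \n makes the original pattern fail on every line except the last. The sketch below is a variation on the patterns above that allows an optional \r before $. The names KeyValueDemo and Parse are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    // Parses "key:value," lines; \r? tolerates Windows (\r\n) line endings,
    // because $ in Multiline mode only matches before \n.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var r = new Regex(@"^(?<name>\w+):(?<value>\w+),?\r?$",
                          RegexOptions.Multiline | RegexOptions.ExplicitCapture);
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in r.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        // Explicit \r\n so the example does not depend on how this file is saved.
        string input = "key1:val1,\r\nkey2:val2,\r\nkey3:val3";
        foreach (var pair in Parse(input))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

The same method also handles input that uses plain \n line endings, since the \r is optional.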
Get rid of the RegexOptions.Multiline option.

How can I group multiple e-mail addresses and user names using a regular expression

I have the following text that I am trying to parse:
"user1@emailaddy1.com" <user1@emailaddy1.com>, "Jane Doe" <jane.doe@addyB.org>,
"joe@company.net" <joe@company.net>
I am using the following code to try and split up the string:
Dim groups As GroupCollection
Dim matches As MatchCollection
Dim regexp1 As New Regex("""(.*)"" <(.*)>")
matches = regexp1.Matches(toNode.InnerText)
For Each match As Match In matches
groups = match.Groups
message.CompanyName = groups(1).Value
message.CompanyEmail = groups(2).Value
Next
But this regular expression is greedy and is grabbing the entire string up to the last quote after "joe#company.net". I'm having a hard time putting together an expression that will group this string into the two groups I'm looking for: Name (in the quotes) and E-Mail (in the angle brackets). Does anybody have any advice or suggestions for altering the regexp to get what I need?
Rather than rolling your own regular expression, I would do this:
string[] addresses = toNode.InnerText.Split(',');
foreach (string textAddress in addresses)
{
    // MailAddress (System.Net.Mail) parses the "display name" <address> form.
    MailAddress address = new MailAddress(textAddress.Trim());
    message.CompanyName = address.DisplayName;
    message.CompanyEmail = address.Address;
}
Your regular expression may work for the few test cases that you have shown, but using the MailAddress class will probably be much more reliable in the long run.
How about """([^""]*)"" <([^>]*)>" for the regex? I.e. make explicit that the matched part won't include a quote or a closing angle bracket. You may also want to use a more restrictive character range instead.
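For readers working in C# rather than VB.NET, the negated-character-class pattern above could be applied roughly as follows. This is only a sketch: the class and method names (AddressListDemo, ParsePairs) are made up, and it returns name/e-mail pairs instead of filling the message object from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AddressListDemo
{
    // Extracts ("name", e-mail) pairs from text shaped like: "Name" <address>, ...
    public static List<KeyValuePair<string, string>> ParsePairs(string text)
    {
        // [^"]* cannot run past the closing quote and [^>]* cannot run past
        // the closing angle bracket, so each match stays within one entry.
        var rx = new Regex(@"""([^""]*)"" <([^>]*)>");
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in rx.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string text = "\"user1@emailaddy1.com\" <user1@emailaddy1.com>, "
                    + "\"Jane Doe\" <jane.doe@addyB.org>,\n"
                    + "\"joe@company.net\" <joe@company.net>";
        foreach (var pair in ParsePairs(text))
        {
            Console.WriteLine(pair.Key + " | " + pair.Value);
        }
    }
}
```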
Not sure what regexp engine ASP.NET is running, but try the non-greedy variant by adding a ? in the regex.
Example regex
""(.*?)"" <(.*?)>
You need to specify that you want the minimal matched expression.
You can also replace (.*) pattern by more precise ones:
For example you could exclude the comma and the space...
It is usually better to avoid .* in a regular expression, because it can hurt performance.
For example, for the email you can use a pattern like [\w-]+@([\w-]+\.)+[\w-]+ or a more complex one.
You can find some good patterns at http://regexlib.com/

Split string into sentences using regular expression

I need to match a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression to break the string into a group after every fourth period. Something like:
string regex = @"(.*.\s){4}";
System.Text.RegularExpressions.Regex exp = new System.Text.RegularExpressions.Regex(regex);
string result = exp.Replace(toTest, ".\n");
doesn't work because it will replace the text before the periods, not just the periods themselves. How can I count just the periods and replace them with a period and new line character?
In a regex, . means "any character",
so the .*. in your regex matches any run of one or more characters, not just a sentence (it is equivalent to .+).
You were probably looking for [^.]*[.], that is, a series of characters that are not "."s followed by a ".".
Try defining the method
private string AppendNewLineToMatch(Match match) {
return match.Value + Environment.NewLine;
}
and using
string result = exp.Replace(toTest, AppendNewLineToMatch);
This should call the method for each match, and replace it with that method's result. The method's result would be the matching text and a newline.
EDIT: Also, I agree with Oliver. The correct regex definition should be:
string regex = @"([^.]*[.]\s*){4}";
Another edit: Fixed the regex, hopefully I got it right this time.
Are you forced to do this via regex? Wouldn't it be easier to just split the string then process the array?
I'm not sure if configurator's answer got mangled by the editor or what, but it doesn't work.
The correct pattern is
string regex = @"([^.]*[.]){4}\s*";
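As a sketch of how a pattern of this shape could be used with Regex.Replace: the version below adds one capturing group around the four sentences, so the replacement can put them back and swap the trailing whitespace for a newline. The capturing group, the "$1\n" replacement, and the names SentenceGroupDemo and GroupByFour are additions for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceGroupDemo
{
    // Inserts a line break after every fourth period.
    public static string GroupByFour(string text)
    {
        // Group 1 holds four "text up to and including a period" runs;
        // the whitespace after the fourth period is matched but not kept.
        return Regex.Replace(text, @"((?:[^.]*[.]){4})\s*", "$1\n");
    }

    public static void Main()
    {
        string toTest = "one. two. three. four. five. six. seven. eight. nine. ten. eleven";
        Console.WriteLine(GroupByFour(toTest));
    }
}
```

A trailing run with fewer than four periods ("nine. ten. eleven" here) is not matched and is therefore left unchanged.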
Search expression: @"(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)"
Replace expression: "$1 $2 $3 $4.\n"
I ran this expression in RegexBuddy with .NET regex selected, and the output is:
one two three four.
five six seven eight.
nine. ten. eleven
I tried with a @"(?:([^.]+?).\s){4}" type of arrangement, but the capturing will only capture the last occurrence (i.e. word), so when it comes to replacing, you will lose three words out of 4. Please someone correct me if I am wrong.
In this case it would seem that regex is a bit of overkill. I would recommend using String.split and then breaking up the resulting array of strings. It should be far simpler and far more reliable than trying to make a regex do what you're trying to do.
Something like this might be a bit easier to read and debug.
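// Java; needs java.util.ArrayList, java.util.Arrays and java.util.List.
String s = "one. two. three. four. five. six. seven. eight. nine. ten. eleven";
// split() takes a regex, so the period must be escaped; "." alone matches any character.
String[] splitString = s.split("\\.\\s*");
List<String> li = new ArrayList<>();
for (int i = 0; i < splitString.length; i += 4) {
    // Join up to four sentences; the last group may hold fewer than four.
    int end = Math.min(i + 4, splitString.length);
    String st = String.join(". ", Arrays.copyOfRange(splitString, i, end)) + ".";
    li.add(st);
}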
