Regex: I want this AND that AND that... in any order - c#

I'm not even sure if this is possible or not, but here's what I'd like.
String: "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870"
I have a text box where I type in the search parameters, and they are space delimited. Because of this, I want to return a match if string1 is in the string and then string2 is in the string, OR string2 is in the string and then string1 is in the string. I don't care what order the strings are in, but they ALL (there will sometimes be more than 2) have to be in the string.
So for instance, in the provided string I would want:
"FEB Low"
or
"Low FEB"
...to return as a match.
I'm REALLY new to regex. I only read some tutorials on here, but that was a while ago and I need to get this done today. Monday I start a new project which is much more important, and I can't be distracted with this issue. Is there any way to do this with regular expressions, or do I have to iterate through each part of the search filter and permute the order? Any and all help is extremely appreciated. Thanks.
UPDATE:
The reason I don't want to iterate through a loop and am looking for the best performance wise is because unfortunately, the dataTable I'm using calls this function on every key press, and I don't want it to bog down.
UPDATE:
Thank you everyone for your help, it was much appreciated.
CODE UPDATE:
Ultimately, this is what I went with.
string sSearch = nvc["sSearch"].ToString().Replace(" ", ")(?=.*");
if (sSearch != null && sSearch != "")
{
Regex r = new Regex("^(?=.*" + sSearch + ").*$", RegexOptions.IgnoreCase);
_AdminList = _AdminList.Where<IPB>(
delegate(IPB ipb)
{
//Concatenated all elements of IPB into a string
bool returnValue = r.IsMatch(strTest); //strTest is the concatenated string
return returnValue;
}).ToList<IPB>();
}
}
The IPB class has X number of elements and in no one table throughout the site I'm working on are the columns in the same order. Therefore, I needed to any order search and I didn't want to have to write a lot of code to do it. There were other good ideas in here, but I know my boss really likes Regex (preaches them) and therefore I thought it'd be best if I went with that for now. If for whatever reason the site's performance slips (intranet site) then I'll try another way. Thanks everyone.

You can use (?=…) positive lookahead; it asserts that a given pattern can be matched. You'd anchor at the beginning of the string, and one by one, in any order, look for a match of each of your patterns.
It'll look something like this:
^(?=.*one)(?=.*two)(?=.*three).*$
This will match a string that contains "one", "two", "three", in any order (as seen on rubular.com).
Depending on the context, you may want to anchor on \A and \Z, and use single-line mode so the dot matches everything.
This is not the most efficient solution to the problem. The best solution would be to parse out the words in your input and put them into an efficient set representation, etc.
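For a space-delimited search box, the lookahead pattern could be assembled roughly like this. It is only a sketch: the names AllTermsSearch, BuildPattern and ContainsAllTerms are made up for illustration. Each term goes through Regex.Escape so that characters such as ( or . typed by the user are matched literally, and RegexOptions.Singleline lets the dot cross newlines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AllTermsSearch
{
    // Builds ^(?=.*term1)(?=.*term2)... from a space-delimited search string.
    public static string BuildPattern(string search)
    {
        var terms = search.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // Regex.Escape keeps user input such as "1-9(" from being read as regex syntax.
        return "^" + string.Concat(terms.Select(t => "(?=.*" + Regex.Escape(t) + ")"));
    }

    // True when every term occurs somewhere in source, in any order.
    public static bool ContainsAllTerms(string source, string search)
    {
        return Regex.IsMatch(source, BuildPattern(search),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With the string from the question, AllTermsSearch.ContainsAllTerms(text, "Low FEB") and AllTermsSearch.ContainsAllTerms(text, "FEB Low") should both return true.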
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
More practical example: password validation
Let's say that we want our password to:
Contain between 8 and 15 characters
Must contain an uppercase letter
Must contain a lowercase letter
Must contain a digit
Must contain one of special symbols
Then we can write a regex like this:
^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$
 \__________/\_________/\_________/\_________/\______________/
    length      upper      lower      digit        symbol
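In C#, the password pattern above might be wrapped like this. The class and method names (PasswordCheck, IsValid) are only illustrative, and the symbol set is the one assumed in the pattern.

```csharp
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // One lookahead per rule; every lookahead is evaluated from the start of the string.
    private static readonly Regex Rule = new Regex(
        @"^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$");

    public static bool IsValid(string password)
    {
        return Rule.IsMatch(password);
    }
}
```

For example, PasswordCheck.IsValid("Abcdef1!") passes all five rules, while "abcdef1!" fails because it has no uppercase letter.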

Why not just do a simple check for the text since order doesn't matter?
string test = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
test = test.ToUpper();
bool match = ((test.IndexOf("FEB") >= 0) && (test.IndexOf("LOW") >= 0));
Do you need it to use regex?

I think the most expedient thing for today will be to string.Split(' ') the search terms and then iterate over the results confirming that sourceString.Contains(searchTerm)
var source = @"NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870".ToLowerInvariant();
var search = "FEB Low";
var terms = search.Split(' ');
bool all_match = !terms.Any(term => !(source.Contains(term.ToLowerInvariant())));
Notice that we use Any() to set up a short-circuit, so if the first term fails to match, we skip checking the second, third, and so forth.
This is not a great use case for RegEx. The string manipulation necessary to take an arbitrary number of search strings and convert that into a pattern almost certainly negates the performance benefit of matching the pattern with the RegEx engine, though this may vary depending on what you're matching against.
You've indicated in some comments that you want to avoid a loop, but RegEx is not a one-pass solution. It is not hard to create horrifically non-performant searches that loop and step character by character, such as the infamous catastrophic backtracking, where a very simple match takes thousands of steps to return false.
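The same split-and-check idea can also be written with All() and a case-insensitive IndexOf, which avoids lowercasing the source string on every key press. This is a sketch only; TermFilter and ContainsAll are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TermFilter
{
    // True when every space-delimited term in search occurs in source, ignoring case.
    public static bool ContainsAll(string source, string search)
    {
        return search
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // All() stops at the first term that is missing.
            .All(term => source.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

RemoveEmptyEntries is there so that a trailing space in the text box (the user is still typing) does not produce an empty term.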

The answer by @polygenelubricants is both complete and perfect, but I had a case where I wanted to match a date and something else (e.g. a 10-digit number). There the lookahead does not match, and I could not do it with lookaheads alone, so I used named groups:
(?:.*(?P<1>[0-9]{10}).*(?P<2>2[0-9]{3}-(?:0?[0-9]|1[0-2])-(?:[0-2]?[0-9]|3[0-1])).*)+
and this way the number is always group 1 and the date is always group 2. Of course it has a few flaws but it was very useful for me and I just thought I should share it! ( take a look https://www.debuggex.com/r/YULCcpn8XtysHfmE )

var text = @"NS306Low FEBRUARY 2FEB0078/9/201013B1-9-1Low31 AUGUST 19870";
var matches = Regex.Matches(text, @"(FEB)|(Low)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
output:
Low
FEB
FEB
Low
should get you started

You don't have to test each permutation, just split your search into multiple parts "FEB" and "Low" and make sure each part matches. That will be far easier than trying to come up with a regex which matches the whole thing in one go (which I'm sure is theoretically possible, but probably not practical in reality).

Use string.Split(). It will return an array of substrings that are delimited by a specified string/char. The code will look something like this.
int maximumSize = 100;
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string[] individualString = myString.Split(new[] { ' ' }, maximumSize);
For more information
http://msdn.microsoft.com/en-us/library/system.string.split.aspx
Edit:
If you really wanted to use regular expressions, this pattern will work.
[^ ]+
And you will just use Regex.Matches();
The code will be something like this:
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string pattern = "[^ ]+";
Regex rgx = new Regex(pattern);
foreach (Match match in rgx.Matches(myString))
{
//do stuff with match.Value
}

Related

REGEX finding strings within a string

I seem to write one Reg expression a year and always end up asking for help.
Here's a string (it's a search string from Solr) and I want to select every instance of the search word.
Here's the input:-
http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A
I need to select any text between every '%3A' and '+OR' as well as the final '%3Atheory))' - in this case the word 'theory', but it will be a different word every time. The only known thing is that it'll be any alpha text between the '%3A' and the '+OR'. And it needs to stop at the '+AND+'.
I've got as far as /%3A(.*?)[+OR]/g - it's a start I guess...
It doesn't find '%3Atheory))' and it doesn't stop at '+AND+'
I'm struggling with 'find this' OR 'find that' as well as stopping at a string.
Can anyone offer some guidance?
If you're using c# it might be better to split in two operations using String.Split and the Regex.Matches like so:
string input = @"http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A";
Regex regex = new Regex(@"%3A(.*?)(?:\+OR|\)\))");
var splitted = input.Split(new[] { "AND" }, StringSplitOptions.None);
var matches = regex.Matches(splitted.First());
foreach (Match m in matches)
{
// Or whatever you like to do with your matches
Console.WriteLine(m.Groups[1].Value);
}
Regex.Split has an option to keep the separating strings. So for the text given in the question, code like that below will split it into pieces:
string[] pieces = Regex.Split(theInputText, "(%3A.*?\\+(?:AND|OR))");
foreach (string ss in pieces)
{
Console.WriteLine(ss);
}
Here is a small section of the output:
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND
+-(virtualPath
%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR
+virtualPath
Having split the string into pieces it should be a simple matter to screen for the array elements with the correct starting and ending characters, also to find the last %3Atheory... entry.
Note: The question discusses +OR and +AND+ but all the +ORs are followed with a + so it may be better to include a final + in the expression, as ...OR)\\+).
Note: The inner brackets in the regular expression are non-capturing, i.e. (?: ). If they were capturing brackets then the AND and OR captures would be included in the output array.
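As a rough sketch of the "cut at +AND+, then match" idea from this thread: the helper below keeps everything before the first "+AND+" and then collects the letters after each %3A, provided "+OR" or a closing bracket follows. The names SolrTerms and Extract are hypothetical, and the pattern assumes the search word is plain ASCII letters, as described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SolrTerms
{
    public static List<string> Extract(string query)
    {
        // Keep only the part before the first "+AND+".
        int cut = query.IndexOf("+AND+", StringComparison.Ordinal);
        string head = cut >= 0 ? query.Substring(0, cut) : query;

        // Letters after "%3A", as long as "+OR" or ")" comes next.
        return Regex.Matches(head, @"%3A([A-Za-z]+)(?=\+OR|\))")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

The lookahead (?=...) checks what follows the word without consuming it, so the delimiter never ends up in the captured value.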

Regex C#. Match specific substring and return that substring only

I've been pulling my hair out over this, and I think I am quite close to actually getting it to work but I just can't seem to.
I've been attempting to pull out a specific substring from a string using Regex. This substring must match a certain set of strings/digits. Then after this, it should be returned to another function which will pull out the number only.
Here's some examples of strings that are present
"(DEV #198) I am a dev testing 23 different things."
"(dev #9540) I am a dev testing different other things."
"(FQ #1140) I am a dev testing different things."
"(fq #910) I am a dev testing different other things."
In the end I would like to have 198, 9540, 1140 or 910 as the final variable depending on the input.
Here is my Regex so far. I think it's close, but I need some help. (note the double backslashes for C#).
^(?=.*?\\b(dev|DEV|fq|FQ)\\b)(?=.*?\\b[0-9]{3,4}\\b).*$
Here is the fragment of code I'm using also.
string caseNumber = cpTicket.Desc;
string regexPattern = "^(?=.*?\\b(dev|DEV|fq|FQ)\\b)(?=.*?\\b[0-9]{3,4}\\b).*$";
caseNumber = Regex.Match(caseNumber, regexPattern).ToString();
That's as far as I have got with this. If you can help, I will be so grateful :D
Try the following expression:
(?<=(?:dev|fq) #)\d+
(?<=): This is a look behind to make sure the digits are preceded with what is inside the look behind.
(?:dev|fq): This matches either dev or fq.
" #": Matches a space followed by a hash (#).
\d+: Matches one or more digits.
Turn case insensitivity mode on so dev and fq could match in all cases.
You can turn case insensitivity from within the regular expression itself like this:
(?i)(?<=(?:dev|fq) #)\d+
In C#:
var input = "(DEV #198) I am a dev testing 23 different things.";
var matches = Regex.Matches(input, @"(?i)(?<=(?:dev|fq) #)\d+");
Notice how you can use the @ before the string to create what is called a verbatim string so you don't have to double escape everything.
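To go all the way to a number, the lookbehind pattern can be combined with int.TryParse. This is a sketch under the assumption that a missing tag should yield null; CaseNumbers and Parse are made-up names, and RegexOptions.IgnoreCase is used here in place of the inline (?i).

```csharp
using System.Text.RegularExpressions;

public static class CaseNumbers
{
    // Digits that directly follow "dev #" or "fq #" (any casing).
    private static readonly Regex Tag =
        new Regex(@"(?<=(?:dev|fq) #)\d+", RegexOptions.IgnoreCase);

    // Returns the case number, or null when the description carries no tag.
    public static int? Parse(string description)
    {
        Match m = Tag.Match(description);
        int value;
        return m.Success && int.TryParse(m.Value, out value) ? value : (int?)null;
    }
}
```

The stray "23" in the first sample string is ignored because it is not preceded by "dev #" or "fq #".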
I think your pattern can simply be:
"\(.*?\#(\d*)\)"
This will match the group in brackets and return the number.
var m = Regex.Match(input, @"\(.*?\#(\d*)\)");
if (m.Success)
{
string matchedNumber = m.Groups[1].Value;
}

Check Formatting of a String

This has probably been answered somewhere before, but there are millions of unrelated posts about string formatting, so I couldn't find it.
Take the following string:
24:Something(true;false;true)[0,1,0]
I want to be able to do two things in this case. I need to check whether or not all the following conditions are true:
There is only one : (achieved using Split(), which I needed to use anyway to separate the two parts).
The integer before the : is a 1-3 digit int (simple int.Parse logic).
The () exists, and that the "Something", in this case any string less than 10 characters, is there
The [] exists and has at least 1 integer in it. Also, make sure the elements in the [] are integers separated by ,
How can I best do this?
EDIT: I have crossed out what I've achieved so far.
A regular expression is the quickest way. Depending on the complexity it may also be the most computationally expensive.
This seems to do what you need (I'm not that good so there might be better ways to do this):
^\d{1,3}:\w{1,9}\((true|false)(;true|;false)*\)\[\d(,[\d])*\]$
Explanation
\d{1,3}
1 to 3 digits
:
followed by a colon
\w{1,9}
followed by a 1-9 character alpha-numeric string,
\((true|false)(;true|;false)*\)
followed by parenthesis containing "true" or "false" followed by any number of ";true" or ";false",
\[\d(,[\d])*\]
followed by a set of square brackets containing a digit, followed by any number of comma+digit.
The ^ and $ at the beginning and end of the string indicate the start and end of the string which is important since we're trying to verify the entire string matches the format.
Code Sample
var input = "24:Something(true;false;true)[0,1,0]";
var regex = new System.Text.RegularExpressions.Regex(@"^\d{1,3}:.{1,9}\(.*\)\[\d(,[\d])*\]$");
bool isFormattedCorrectly = regex.IsMatch(input);
Credit: @Ian Nelson
This is one of those cases where your only sensible option is to use a Regular Expression.
My hasty attempt is something like:
var input = "24:Something(true;false;true)[0,1,0]";
var regex = new System.Text.RegularExpressions.Regex(@"^\d{1,3}:.{1,9}\(.*\)\[\d(,[\d])*\]$");
System.Diagnostics.Debug.Assert(regex.IsMatch(input));
This online RegEx tester should help refine the expression.
I think, the best way is to use regular expressions like this:
string s = "24:Something(true;false;true)[0,1,0]";
Regex pattern = new Regex(@"^\d{1,3}:[a-zA-Z]{1,9}\((true|false)(;true|;false)*\)\[\d(,\d)*\]$");
if (pattern.IsMatch(s))
{
// s is valid
}
If you want anything inside (), you can use following regex:
@"^\d{1,3}:[a-zA-Z]{1,9}\([^:\(]*\)\[\d(,\d)*\]$"
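The patterns in this thread only allow single digits inside the []. If the elements can be multi-digit integers, a variant with \d+ might look like the sketch below. It assumes the name is 1 to 9 ASCII letters and that the () holds only true/false values; FormatCheck and IsValid are illustrative names.

```csharp
using System.Text.RegularExpressions;

public static class FormatCheck
{
    // id : name ( bool;bool;... ) [ int,int,... ]
    private static readonly Regex Format = new Regex(
        @"^\d{1,3}:[A-Za-z]{1,9}\((?:true|false)(?:;(?:true|false))*\)\[\d+(?:,\d+)*\]$");

    public static bool IsValid(string input)
    {
        return Format.IsMatch(input);
    }
}
```

The groups are written as (?: ) because nothing needs to be captured here; the check is a plain yes/no.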

Regex which ensures no character is repeated

I need to ensure that an input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
eg. ABCA is not valid because 'A' is being repeated.
For the upper case thing, [A-Z] should be fine.
But i am lost at how to ensure no repeating characters.
Can someone suggest some method using regular expressions ?
You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which you have seen.
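The "keep a record of which you have seen" approach could be sketched with a HashSet<char>. The names UniqueUppercase and IsValid are hypothetical, and rejecting the empty string is an assumption (the regex above accepts it).

```csharp
using System.Collections.Generic;

public static class UniqueUppercase
{
    public static bool IsValid(string s)
    {
        var seen = new HashSet<char>();
        foreach (char c in s)
        {
            if (c < 'A' || c > 'Z') return false; // not an ASCII uppercase letter
            if (!seen.Add(c)) return false;       // Add returns false for a repeat
        }
        return s.Length > 0; // assumption: an empty string is not valid
    }
}
```

This makes a single pass over the string and stops at the first offending character.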
This cannot be done via regular expressions, because they are context-free. You need at least a context-sensitive grammar, so the only way to achieve this is by writing the function by hand.
See formal grammar for background theory.
Why not check for a character which is repeated or not in uppercase instead ? With something like ([A-Z])?.*?([^A-Z]|\1)
Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.
This isn't regex, and would be slow, but You could create an array of the contents of the string, and then iterate through the array comparing n to n++
=Waldo
It can be done using what is called a backreference.
I am a Java programmer so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
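The regular expression is ([A-Z]).*\\1, which translates to: anything between 'A' and 'Z' as group 1 ('([A-Z])'), then anything else (.*), then group 1 again.
In a C# pattern the backreference is also written \1 ($1 is only used in replacement strings).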
Hope this helps.

Split string into sentences using regular expression

I need to match a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression to break the string into a group after every fourth period. Something like:
string regex = @"(.*.\s){4}";
System.Text.RegularExpressions.Regex exp = new System.Text.RegularExpressions.Regex(regex);
string result = exp.Replace(toTest, ".\n");
doesn't work because it will replace the text before the periods, not just the periods themselves. How can I count just the periods and replace them with a period and new line character?
. in a regex means "any character"
so in your regex, you have used .*. which will match a word (this is equivalent to .+)
You were probably looking for [^.]*[.] - a series of characters that are not "."s followed by a ".".
Try defining the method
private string AppendNewLineToMatch(Match match) {
return match.Value + Environment.NewLine;
}
and using
string result = exp.Replace(toTest, AppendNewLineToMatch);
This should call the method for each match, and replace it with that method's result. The method's result would be the matching text and a newline.
EDIT: Also, I agree with Oliver. The correct regex definition should be:
string regex = @"([^.]*[.]\s*){4}";
Another edit: Fixed the regex, hopefully I got it right this time.
Are you forced to do this via regex? Wouldn't it be easier to just split the string then process the array?
I'm not sure if configurator's answer got mangled by the editor or what, but it doesn't work.
The Correct pattern is
string regex = @"([^.]*[.]){4}\s*";
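Putting this pattern together with a replacement callback, a complete version might look like the sketch below. SentenceGrouper and Group are made-up names, the groups are non-capturing because only the whole match is used, and "\n" stands in for Environment.NewLine to keep the output the same on every platform.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceGrouper
{
    // Puts a newline after every fourth sentence (a run of text ending in a period).
    public static string Group(string text)
    {
        return Regex.Replace(text, @"(?:[^.]*\.){4}\s*",
            // Drop the whitespace after the fourth period, then break the line.
            m => m.Value.TrimEnd() + "\n");
    }
}
```

A trailing group with fewer than four periods (here "nine. ten. eleven") does not match and is left untouched.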
Search expression: @"(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)"
Replace expression: "$1 $2 $3 $4.\n"
I ran this expression in RegexBuddy with .NET regex selected, and the output is:
one two three four.
five six seven eight.
nine. ten. eleven
I tried with a @"(?:([^.]+?).\s){4}" type of arrangement, but the capturing will only capture the last occurrence (i.e. word), so when it comes to replacing, you will lose three words out of 4. Please someone correct me if I am wrong.
In this case it would seem that regex is a bit of overkill. I would recommend using String.split and then breaking up the resulting array of strings. It should be far simpler and far more reliable than trying to make a regex do what you're trying to do.
Something like this might be a bit easier to read and debug.
// Requires: import java.util.*;
String s = "one. two. three. four. five. six. seven. eight. nine. ten. eleven";
// split() takes a regex, so the period must be escaped
String[] splitString = s.split("\\.\\s*");
List<String> li = new ArrayList<>();
for (int i = 0; i < splitString.length; i += 4) {
    // The last group may hold fewer than four sentences
    int end = Math.min(i + 4, splitString.length);
    li.add(String.join(". ", Arrays.copyOfRange(splitString, i, end)) + ".");
}
