REGEX finding strings within a string - c#

I seem to write one Reg expression a year and always end up asking for help.
Here's a string (it's a search string from Solr) and I want to select every instance of the search word.
Here's the input:-
http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A
I need to select any text between every '%3A' and '+OR' as well as the final '%3Atheory))' - in this case the word 'theory' but it will be a different word every time - the only known thing is it'll be any alpha text between the '%3A' and the '+OR'. And it need to stop at the '+AND+'
I've got as far as /%3A(.*?)[+OR]/g - it's a start I guess...
It doesn't find '%3Atheory))' and it doesn't stop at '+AND+'
I'm struggling with 'find this' OR 'find that' as well as stopping at a string.
anyone offer some guidance?

If you're using c# it might be better to split in two operations using String.Split and the Regex.Matches like so:
string input = #"http://server:8080/solr/app/select?q=(title_st_en%3Atheory+OR+title_st_ar%3Atheory+OR+title_st_da%3Atheory+OR+title_st_fr%3Atheory+OR+title_st_de%3Atheory+OR+title_st_it%3Atheory+OR+title_st_no%3Atheory+OR+title_st_sv%3Atheory+OR+title_st_ru%3Atheory+OR+title_st_es%3Atheory+OR+title_st_bg%3Atheory+OR+title_st_cs%3Atheory+OR+title_st_tr%3Atheory+OR+title_st_nl%3Atheory+OR+title_st_zh-cn%3Atheory+OR+title_st_zh-tw%3Atheory+OR+title_st_hr%3Atheory+OR+title_st_et%3Atheory+OR+title_st_he%3Atheory+OR+title_st_hu%3Atheory+OR+title_st_ja%3Atheory+OR+title_st_ko%3Atheory+OR+title_st_pl%3Atheory+OR+title_st_ro%3Atheory+OR+title_st_th%3Atheory+OR+title_st_vi%3Atheory+OR+content_stemming_en%3Atheory+OR+content_stemming_no%3Atheory+OR+(backfields%3Atheory))+AND+(((virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND+-(virtualPath%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSF%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFMAG%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NDSFRA%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_NM%5C%5CL%22+OR+virtualPath%3A%22%5C%5CSERVER%5C%5CP_INTERNAL%5C%5CL%22+OR+virtualPath%3A";
Regex regex = new Regex(#"%3A(.*?)(?:\+OR|\)\))");
var splitted = input.Split(new[] { "AND" }, StringSplitOptions.None);
var matches = regex.Matches(splitted.First());
foreach (Match m in matches)
{
// Or whatever you like to do with your matches
Console.WriteLine(m.Groups[1].Value);
}

Regex.Split has an option to keep the separating strings. So for the text given in the question, code like that below will split it into pieces:
string[] pieces = Regex.Split(theInputText, "(%3A.*?\\+(?:AND|OR))");
foreach (string ss in pieces)
{
Console.WriteLine(ss);
}
Here is a small section of the output:
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_SYSTEM%22+OR
+virtualPath
%3A%22%5C%5CSERVER%5C%5CP_!CONTACTS%22)+AND
+-(virtualPath
%3A%22%5C%5CSERVER%5C%5CU_TEST%5C%5CL%22+OR
+virtualPath
Having split the string into pieces it should be a simple matter to screen for the array elements with the correct starting and ending characters, also to find the last %3Atheory... entry.
Note: The question discusses +OR and +AND+ but all the +ORs are followed with a + so it may be better to include a final + in the expression, as ...OR)\\+).
Note: The inner brackets in the regular expression are non capturing, ie (?: ). If they were capturing brackets then the AND and OR captures would be included in the output array.

Related

Why is my .NET regex not working correctly?

I have a text file which is in the format:
key1:val1,
key2:val2,
key3:val3
and I am trying to parse the key/value pairs out with a regex. Here is the regex code I am using with the same example:
string input = #"key1:val1,
key2:val2,
key3:val3";
var r = new Regex(#"^(?<name>\w+):(?<value>\w+),?$", RegexOptions.Multiline | RegexOptions.ExplicitCapture);
foreach (Match m in r.Matches(input))
{
Console.WriteLine(m.Groups["name"].Value);
Console.WriteLine(m.Groups["value"].Value);
}
When I loop through r.Matches, sometimes certain key/value pairs don't appear, and it seems to be the ones with the comma at the end of the line - but I should be taking that into account with the ,?. What am I missing here?
this might be a good situation for String.Split rather than a regex:
foreach(string pair in input.Split(new Char [] {','}))
{
string [] items = pair.Split(new Char [] {':'});
Console.WriteLine(items[0]);
Console.WriteLine(items[1]);
}
The problem is that your regular expression is not matching the newline in the first two lines.
Try changing it to
#"^(?<name>\w+):(?<value>\w+),?(\n|\r|\r\n)?$"
and it should work.
By the way, I love regular expressions, but given the problem you are trying to solve, go for the string.Split solution. It will be much easier to read...
EDIT: after reading your comment, where you say that this is a simplified version of your problem, then maybe you could simplify the expression by adding some "tolerance" for spaces / newline at the end of the match with
#"^(?<name>\w+):(?<value>\w+),?\s*$"
Also, when you play with regular expressions, test them with a tool like Expresso, it saves a lot of time.
Get rid of the RegexOptions.Multiline option.

Regex: I want this AND that AND that... in any order

I'm not even sure if this is possible or not, but here's what I'd like.
String: "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870"
I have a text box where I type in the search parameters and they are space delimited. Because of this, I want to return a match is string1 is in the string and then string2 is in the string, OR string2 is in the string and then string1 is in the string. I don't care what order the strings are in, but they ALL (will somethings me more than 2) have to be in the string.
So for instance, in the provided string I would want:
"FEB Low"
or
"Low FEB"
...to return as a match.
I'm REALLY new to regex, only read some tutorials on here but that was a while ago and I need to get this done today. Monday I start a new project which is much more important and can't be distracted with this issue. Is there anyway to do this with regular expressions, or do I have to iterate through each part of the search filter and permutate the order? Any and all help is extremely appreciated. Thanks.
UPDATE:
The reason I don't want to iterate through a loop and am looking for the best performance wise is because unfortunately, the dataTable I'm using calls this function on every key press, and I don't want it to bog down.
UPDATE:
Thank you everyone for your help, it was much appreciated.
CODE UPDATE:
Ultimately, this is what I went with.
string sSearch = nvc["sSearch"].ToString().Replace(" ", ")(?=.*");
if (sSearch != null && sSearch != "")
{
Regex r = new Regex("^(?=.*" + sSearch + ").*$", RegexOptions.IgnoreCase);
_AdminList = _AdminList.Where<IPB>(
delegate(IPB ipb)
{
//Concatenated all elements of IPB into a string
bool returnValue = r.IsMatch(strTest); //strTest is the concatenated string
return returnValue;
}).ToList<IPB>();
}
}
The IPB class has X number of elements and in no one table throughout the site I'm working on are the columns in the same order. Therefore, I needed to any order search and I didn't want to have to write a lot of code to do it. There were other good ideas in here, but I know my boss really likes Regex (preaches them) and therefore I thought it'd be best if I went with that for now. If for whatever reason the site's performance slips (intranet site) then I'll try another way. Thanks everyone.
You can use (?=…) positive lookahead; it asserts that a given pattern can be matched. You'd anchor at the beginning of the string, and one by one, in any order, look for a match of each of your patterns.
It'll look something like this:
^(?=.*one)(?=.*two)(?=.*three).*$
This will match a string that contains "one", "two", "three", in any order (as seen on rubular.com).
Depending on the context, you may want to anchor on \A and \Z, and use single-line mode so the dot matches everything.
This is not the most efficient solution to the problem. The best solution would be to parse out the words in your input and putting it into an efficient set representation, etc.
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
More practical example: password validation
Let's say that we want our password to:
Contain between 8 and 15 characters
Must contain an uppercase letter
Must contain a lowercase letter
Must contain a digit
Must contain one of special symbols
Then we can write a regex like this:
^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!##$%^&*]).*$
\__________/\_________/\_________/\_________/\______________/
length upper lower digit symbol
Why not just do a simple check for the text since order doesn't matter?
string test = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
test = test.ToUpper();
bool match = ((test.IndexOf("FEB") >= 0) && (test.IndexOf("LOW") >= 0));
Do you need it to use regex?
I think the most expedient thing for today will be to string.Split(' ') the search terms and then iterate over the results confirming that sourceString.Contains(searchTerm)
var source = #"NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870".ToLowerInvariant();
var search = "FEB Low";
var terms = search.Split(' ');
bool all_match = !terms.Any(term => !(source.Contains(term.ToLowerInvariant())));
Notice that we use Any() to set up a short-circuit, so if the first term fails to match, we skip checking the second, third, and so forth.
This is not a great use case for RegEx. The string manipulation necessary to take an arbitrary number of search strings and convert that into a pattern almost certainly negates the performance benefit of matching the pattern with the RegEx engine, though this may vary depending on what you're matching against.
You've indicated in some comments that you want to avoid a loop, but RegEx is not a one-pass solution. It is not hard to create horrifically non-performant searches that loop and step character by character, such as the infamous catastrophic backtracking, where a very simple match takes thousands of steps to return false.
The answer by #polygenelubricants is both complete and perfect but I had a case where I wanted to match a date and something else e.g. a 10-digit number so the lookahead does not match and I cannot do it with just lookaheads so I used named groups:
(?:.*(?P<1>[0-9]{10}).*(?P<2>2[0-9]{3}-(?:0?[0-9]|1[0-2])-(?:[0-2]?[0-9]|3[0-1])).*)+
and this way the number is always group 1 and the date is always group 2. Of course it has a few flaws but it was very useful for me and I just thought I should share it! ( take a look https://www.debuggex.com/r/YULCcpn8XtysHfmE )
var text = #"NS306Low FEBRUARY 2FEB0078/9/201013B1-9-1Low31 AUGUST 19870";
var matches = Regex.Matches(text, #"(FEB)|(Low)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
output:
Low
FEB
FEB
Low
should get you started
You don't have to test each permutation, just split your search into multiple parts "FEB" and "Low" and make sure each part matches. That will be far easier than trying to come up with a regex which matches the whole thing in one go (which I'm sure is theoretically possible, but probably not practical in reality).
Use string.Split(). It will return an array of subtrings thata re delimited by a specified string/char. The code will look something like this.
int maximumSize = 100;
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string[] individualString = myString.Split(' ', maximumSize);
For more information
http://msdn.microsoft.com/en-us/library/system.string.split.aspx
Edit:
If you really wanted to use Regular Expressions this pattern will work.
[^ ]*
And you will just use Regex.Matches();
The code will be something like this:
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string pattern = "[^ ]*";
Regex rgx = new Regex(pattern);
foreach(Match match in reg.Matches(s))
{
//do stuff with match.value
}

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = #"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parsor to split into tokens. Use regex to find a string patterns
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and also alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitely state in the expression. So ((\w+)+\b) could theoretically match all words one-by-one. The outer group will correctly match the whole text, and also the inner group will match the words separately correctly, but you will only be able to reference the last match.
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, then it matches until a non-quote is found. Otherwise looks at word characters. Your results are in groups named "quot" and "words".
You'll have hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, #"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky - .net returns a collection of Matchs. Each Match has several Groups - the first Group has the whole string ('hello world'), but the rest have sub-matches (',hello world,'). Also, you get many empty unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
from g in match.Groups.Cast<Group>().Skip(1)
where g.Success
select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, #"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the brackets in the pattern to make sure Regex.Split returns those too)
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds 1 or more single quotes at the beginning and end of a string. It then finds 1 or more characters in the a-z set (if you don't set it to be case insensitive it will only find lower case characters). It groups these so that group 1 has the ', group 2 (or more) has the words which are split by anything that is not a character a - z and the last group has the single quote if it exists.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurance of a ",". What i want to know is say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their statemachine implementation. Doesn't it work in the same way.
If a string starts with Value_name followed by one or more whitespaces. Go to Next State. In That State read a word until a "," comes. Then do it again! And each word will be grouped!
Am i wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modified (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In you trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a strings, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(#"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes, first grab all the fields seperated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoreticly grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);

Split string into sentences using regular expression

I need to match a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression to break the string into a group after every fourth period. Something like:
string regex = #"(.*.\s){4}";
System.Text.RegularExpressions.Regex exp = new System.Text.RegularExpressions.Regex(regex);
string result = exp.Replace(toTest, ".\n");
doesn't work because it will replace the text before the periods, not just the periods themselves. How can I count just the periods and replace them with a period and new line character?
. in a regex means "any character"
so in your regex, you have used .*. which will match a word (this is equivalent to .+)
You were probably looking for [^.]\*[.] - a series of characters that are not "."s followed by a ".".
Try defining the method
private string AppendNewLineToMatch(Match match) {
return match.Value + Environment.NewLine;
}
and using
string result = exp.Replace(toTest, AppendNewLineToMatch);
This should call the method for each match, and replace it with that method's result. The method's result would be the matching text and a newline.
EDIT: Also, I agree with Oliver. The correct regex definition should be:
string regex = #"([^.]*[.]\s*){4}";
Another edit: Fixed the regex, hopefully I got it right this time.
Are you forced to do this via regex? Wouldn't it be easier to just split the string then process the array?
I'm not sure if configurator's answer got mangled by the editor or what, but it doesn't work.
The Correct pattern is
string regex = #"([^.]*[.]){4}\s*";
Search expression: #"(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)"
Replace expression: "$1 $2 $3 $4.\n"
I've ran this expression in RegexBuddy with .NET regex selected, and the output is:
one two three four.
five six seven eight.
nine. ten. eleven
I tried with a #"(?:([^.]+?).\s){4}" type of arrangement, but the capturing will only capture the last occurrence (i.e. word), so when it comes to replacing, you will lose three words out of 4. Please someone correct me if I am wrong.
In this case it would seem that regex is a bit of overkill. I would recommend using String.split and then breaking up the resulting array of strings. It should be far simpler and far more reliable than trying to make a regex do what you're trying to do.
Something like this might be a bit easier to read and debug.
String s = "one. two. three. four. five. six. seven. eight. nine. ten. eleven"
String[] splitString = s.split(".")
List li = new ArrayList(splitString.length/2)
for(int i=0;i<splitString.length;i+=4) {
st = splitString[i]+"."
st += splitString[i+1]+"."
st += splitString[i+2]+"."
st += splitString[i+3]+"."
li.add(st)
}

Categories

Resources