Regex to match multiple strings - c#

I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:
\b(good|great)\w*\b
Can anyone point me in the right direction?
Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).
Edit 2: Here is some sample text: "This is a really great story."
I might want to match "this" or "really", or I might want to match "eall" or "reat".

If you can guarantee that there are no reserved regex characters in your word list (or if you escape them, for example with Regex.Escape), you could just use this code to turn a big word list into @"(a|big|word|list)". There's nothing wrong with the | operator as you're using it, as long as the () surround it. It sounds like the \w* and \b parts of your pattern are what interfere with your matches.
String[] pattern_list = { "good", "great" }; // or any other word list
String regex = String.Format("({0})", String.Join("|", pattern_list));
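As a minimal sketch of the approach above, the following builds the alternation from a list of fragments and escapes each one first. The class and method names (MultiMatch, BuildPattern, FindAll) are made up for illustration, and the case-insensitive option is an assumption based on the question wanting "this" to match "This".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MultiMatch
{
    // Builds "(a|b|c)" from the fragments, escaping regex metacharacters in each.
    public static string BuildPattern(params string[] fragments)
    {
        return "(" + string.Join("|", fragments.Select(f => Regex.Escape(f))) + ")";
    }

    // Returns every occurrence of any fragment, in order of appearance.
    public static string[] FindAll(string text, params string[] fragments)
    {
        return Regex.Matches(text, BuildPattern(fragments), RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "This is a really great story.";
        Console.WriteLine(string.Join(", ", FindAll(text, "eall", "reat")));   // eall, reat
        Console.WriteLine(string.Join(", ", FindAll(text, "this", "really"))); // This, really
    }
}
```

Because there is no \b or \w* in the generated pattern, fragments match anywhere, including inside longer words.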

(good)*(great)*
after your edit:
\b(g*o*o*d*)*(g*r*e*a*t*)*\b

I think you are asking for something you don't really mean.
If you want to search for any part of a word, you are literally searching for letters.
E.g. searching for {Jack, Jim} in "John and Shelly are cool"
means searching for all the letters in the names: {J, a, c, k, i, m}
*J*ohn *a*nd Shelly *a*re
and for that you don't need regex :)
In my opinion,
a suffix tree can help you with that:
http://en.wikipedia.org/wiki/Suffix_tree#Functionality
Enjoy.

I'm not sure I understand the problem correctly:
If you want to match "great" or "reat" you can express this by a pattern like:
"g?reat"
This simply says that the "reat"-part must exist and the "g" is optional.
This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.
If you have the two words "great" and "good" and you want to match them both with an optional "g", you can write it like this:
(g?reat|g?ood)
And if you want to include a word boundary:
\b(g?reat|g?ood)
You should be aware that this would not match anything like "breat": the "reat" is there, but the "r" is not at a word boundary because of the "b".
So if you want to match whole words that contain a substring like "reat" or "ood", then you should try:
"\b\w*?(reat|ood)\w*\b"
This reads:
1. Beginning at a word boundary, match any number of word characters, but don't be greedy.
2. Match "reat" or "ood"; this ensures that only words containing one of them are matched.
3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.
This will match:
"goodness", "good", "ood" (if a complete word)
It can be read as: Give me all complete words that contain "ood" or "reat".
Is that what you are looking for?
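A small sketch of the "whole words containing a fragment" idea follows. It uses \w* after the fragment so that words ending in the fragment (such as "good") are matched too. The class name ContainingWords and the sample sentence are invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContainingWords
{
    // Whole words that contain "reat" or "ood" anywhere inside them.
    private static readonly Regex Pattern = new Regex(@"\b\w*?(reat|ood)\w*\b");

    public static string[] Find(string text)
    {
        return Pattern.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string text = "Goodness, that food was great; the wood creates heat.";
        // prints: Goodness, food, great, wood, creates
        Console.WriteLine(string.Join(", ", Find(text)));
    }
}
```

Note that "heat" is skipped because it contains "eat" but not "reat".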

I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following PHP code to generate a regex for a given word, although the resulting pattern has the potential to become very long and slow:
function wordPermutations( $word, $minLength = 2 )
{
    $perms = array( );
    for ($start = 0; $start < strlen( $word ); $start++)
    {
        for ($end = strlen( $word ); $end > $start; $end--)
        {
            $perm = substr( $word, $start, ($end - $start));
            if (strlen( $perm ) >= $minLength)
            {
                $perms[] = $perm;
            }
        }
    }
    return $perms;
}
Test Code:
$perms = wordPermutations( 'great', 3 ); // get all permutations of "great" that are 3 or more chars in length
var_dump( $perms );
echo ( '/\b('.implode( '|', $perms ).')\b/' );
Example Output:
array
0 => string 'great' (length=5)
1 => string 'grea' (length=4)
2 => string 'gre' (length=3)
3 => string 'reat' (length=4)
4 => string 'rea' (length=3)
5 => string 'eat' (length=3)
/\b(great|grea|gre|reat|rea|eat)\b/

Just check the boolean that Regex.IsMatch() returns.
if (Regex.IsMatch(line, "condition") && Regex.IsMatch(line, "condition2"))
If both calls return true, the line matches both patterns.

Related

Regex and proper capture using .matches .Concat in C#

I have the following regex:
@"{thing:(?:((\w)\2*)([^}]*?))+}"
I'm using it to find matches within a string:
MatchCollection matches = regex.Matches(input);
IEnumerable<string> formatTokens = matches[0].Groups[3].Captures
    .OfType<Capture>()
    .Where(i => i.Length > 0)
    .Select(i => i.Value)
    .Concat(matches[0].Groups[1].Captures.OfType<Capture>().Select(i => i.Value));
This used to yield the results I wanted; however, my goal has since changed. This is the desired behavior now:
Suppose the string entered is 'stuff/{thing:aa/bb/cccc}{thing:cccc}'
I want formatTokens to be:
formatTokens[0] == "aa/bb/cccc"
formatTokens[1] == "cccc"
Right now, this is what I get:
formatTokens[0] == "/"
formatTokens[1] == "/"
formatTokens[2] == "cccc"
formatTokens[3] == "bb"
formatTokens[4] == "aa"
Note especially that "cccc" does not appear twice even though it was entered twice.
I think the problems are 1) the recapture in the regex and 2) the concat configuration (which is from when I wanted everything separated), but so far I haven't been able to find a combination that yields what I want. Can someone shed some light on the proper regex/concat combination to yield the desired results above?
You may use
Regex.Matches(s, @"{thing:([^}]*)}")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value)
    .ToList()
Details
{thing: - a literal {thing: substring
([^}]*) - Capturing group #1 (when a match is obtained, its value can be accessed via match.Groups[1].Value): 0+ chars other than }
} - a } char.
This way, you find multiple matches and only collect Group 1 values in the resulting list/array.
Mod update
I'm not sure why you settled for Stringnuts regex because it matches
anything inside braces {}.
The meek on SO will not get the satisfaction of deep knowledge,
so that may be your real problem.
Let's analyze your regex.
{thing:
(?:
( # (1 start)
( \w ) # (2)
\2*
) # (1 end)
( [^}]*? ) # (3)
)+
}
This reduces to this
{thing:
(?: \w [^}]*? )+
}
The only constraint is that right after {thing: there must be a word character.
After that there can be anything else, because the clause [^}]*? accepts
any character other than }.
Because that clause is lazy, the surrounding cluster (?: )+ starts a new iteration whenever another word character follows, which is why the capture collections can hold several entries per match.
So, basically, it does almost nothing except enforce the single word-character requirement.
Your regex can be used as is to get convoluted matches,
and because you've captured all the parts in Capture Collections,
with each match you can piece that together using the code below.
I would try to understand regex a little better, before you go on to other stuff since it is likely much more important than the
language tricks used to extract data.
Here is how you would piece it all together using your unaltered regex.
Regex regex = new Regex(@"{thing:(?:((\w)\2*)([^}]*?))+}");
string str = "stuff/{thing:aa/bb/cccc}{thing:cccc}";
foreach (Match match in regex.Matches(str))
{
    // Groups 1 and 3 capture once per iteration of the cluster,
    // so interleaving them rebuilds the text between the braces.
    CaptureCollection cc1 = match.Groups[1].Captures;
    CaptureCollection cc3 = match.Groups[3].Captures;
    string token = "";
    for (int i = 0; i < cc1.Count; i++)
        token += cc1[i].Value + cc3[i].Value;
    Console.WriteLine("{0}", token);
}
Output
aa/bb/cccc
cccc
Note that your regex will match almost anything inside
the braces as long as the first character is a word character.
For example, it matches {thing:Z,,,*()(((asgassgasg,asgfasgafg\/\=99.239 }
You may want to think about the requirements of what actually is allowed
inside the braces.
Good Luck!
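To see how the capture collections behave, here is a small sketch that lists what a given group captured across the iterations of the cluster in the first match. The class name CaptureDemo and the helper Captures are hypothetical; the pattern is the one from the question, unchanged.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    private static readonly Regex Pattern =
        new Regex(@"{thing:(?:((\w)\2*)([^}]*?))+}");

    // Values captured by one group across all iterations within the first match.
    // Returns an empty array when the input does not match.
    public static string[] Captures(string input, int group)
    {
        Match match = Pattern.Match(input);
        return match.Groups[group].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "{thing:aa/bb/cccc}";
        // Group 1 holds the runs of repeated word characters.
        Console.WriteLine(string.Join("|", Captures(input, 1)));
        // Group 3 holds whatever was skipped after each run.
        Console.WriteLine(string.Join("|", Captures(input, 3)));
    }
}
```

For this input, group 1 should hold one entry per run ("aa", "bb", "cccc"), and pairing each with the corresponding group 3 entry rebuilds "aa/bb/cccc".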

Regex to get values from a string using C#

I posted this earlier but did not give clear information on what I was trying to achieve.
I am trying to get values from a string using Regex in C#. I don't understand why I can get some values but not others using a similar approach.
Please find the code snippet below.
Kindly let me know what i am missing.
Thanks in advance.
string text = "0*MAO-001*20160409*20160408*Encounter Data Duplicates Report * *ENC000200800400120160407*PRO*PROD*";
// to get the value 20160409 from the above text
// this code works fine
Regex pattern = new Regex(@"([0][*]MAO[-][0][0][1].*?[*](?<Value>\d+)[*])");
Match match = pattern.Match(text);
string Value = match.Groups["Value"].Value.ToString();
// to get the value ENC000200800400120160407 from the above text
// this does not work and gives me nothing
Regex pattern2 = new Regex(@"([0][*]MAO[-][0][0][1].*?[*].*?[*].*?[*].*?[*].*?[*](?<Value2>\d+)[*])");
Match match2 = pattern.Match(text);
string Value2 = match.Groups["Value2"].Value.ToString();
It looks like your file is '*'-delimited.
You can use one single regex to catch all the values.
Try using
((?<values>[^\*]+)\*)
as your pattern.
Each field will then be captured by the values group, one field per match.
----Update add c# code-----
string text = "0*MAO-001*20160409*20160408*Encounter Data Duplicates Report * *ENC000200800400120160407*PRO*PROD*";
Regex pattern = new Regex(@"(?<values>[^\*]+)\*");
var matches = pattern.Matches(text);
string Value = matches[2].Groups["values"].Value;  // 20160409 (third field)
string Value2 = matches[6].Groups["values"].Value; // ENC000200800400120160407 (seventh field)
You need to use this for the second regex:
([0][*]MAO[-][0][0][1].*?[*].*?[*].*?[*].*?[*].*?[*](?<Value2>\w+)[*])
\w matches any character from the set [A-Za-z0-9_]. You were using \d, which matches only digits [0-9], but the field you want also contains letters.
C# Code
In your second try at using the regex, you are matching with pattern and not pattern2.
Match match2 = pattern.Match(text);
string Value2 = match.Groups["Value2"].Value.ToString();
You are also using the Groups from match and not match2.
This is why it is important to give your variables names that say what they represent. Yes, it may be a "pattern", but what does that pattern represent? Vaguely named variables lead to mistakes like these.
You almost got it, but the field you're looking for contains letters and digits.
This is your second regex kind of fixed.
([0][*]MAO[-][0][0][1].*?[*](?:.*?[*]){4}(?<Value2>.*?)[*])
( # (1 start)
[0] [*] MAO [-] [0] [0] [1] .*? [*]
(?: .*? [*] ){4}
(?<Value2> .*? ) # (2)
[*]
) # (1 end)
To make it a little less busy, this might be better
(0\*MAO-001.*?\*(?:[^*]*\*){4}(?<Value2>[^*]*)\*)
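As a rough sketch, the simplified pattern can be wrapped in a helper that returns the named group or null when the record does not match. FieldExtractor and ExtractValue2 are illustrative names, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldExtractor
{
    // Skips the record prefix plus four '*'-terminated fields, then captures the next field.
    private static readonly Regex Pattern =
        new Regex(@"0\*MAO-001.*?\*(?:[^*]*\*){4}(?<Value2>[^*]*)\*");

    // Returns the captured field, or null if the text does not match.
    public static string ExtractValue2(string text)
    {
        Match match = Pattern.Match(text);
        return match.Success ? match.Groups["Value2"].Value : null;
    }

    public static void Main()
    {
        string text = "0*MAO-001*20160409*20160408*Encounter Data Duplicates Report * *ENC000200800400120160407*PRO*PROD*";
        Console.WriteLine(ExtractValue2(text)); // ENC000200800400120160407
    }
}
```

Checking match.Success before reading the group avoids silently getting an empty string when the record has fewer fields than expected.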

Regular Expression For Alphanumeric String With At Least One Alphabet Or Atleast One Numeric In The String

To test an alphanumeric string we usually use the regular expression "^[a-zA-Z0-9_]*$" (or, preferably, "^\w+$" in C#). But this regex also accepts digit-only or letter-only strings, like "12345678" or "asdfgth".
I need a regex that accepts only alphanumeric strings with at least one letter and at least one digit. That is, "ar56ji" should be accepted, but not the strings mentioned above.
Thanks in advance.
This should do it:
if (Regex.IsMatch(subjectString, @"
# Match string having one letter and one digit (min).
\A # Anchor to start of string.
(?=[^0-9]*[0-9]) # at least one number and
(?=[^A-Za-z]*[A-Za-z]) # at least one letter.
\w+ # Match string of alphanums.
\Z # Anchor to end of string.
",
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
EDIT 2012-08-28 Improved efficiency of lookaheads by changing the lazy dot stars to specific greedy char classes.
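The same lookahead pattern can be tried on a few inputs in compact form; this is only a sketch, and the names AlphanumericCheck and HasLetterAndDigit are invented for it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlphanumericCheck
{
    // Two lookaheads assert "contains a digit" and "contains a letter";
    // \w+ then requires the whole string to consist of word characters.
    private static readonly Regex Pattern = new Regex(
        @"\A(?=[^0-9]*[0-9])(?=[^A-Za-z]*[A-Za-z])\w+\Z");

    public static bool HasLetterAndDigit(string s)
    {
        return Pattern.IsMatch(s);
    }

    public static void Main()
    {
        foreach (string s in new[] { "ar56ji", "12345678", "asdfgth", "ar 56" })
            Console.WriteLine("{0}: {1}", s, HasLetterAndDigit(s));
        // ar56ji: True, 12345678: False, asdfgth: False, ar 56: False
    }
}
```

Keep in mind that \w also accepts the underscore, so "ab_1" passes this check.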
Try this out:
"^\w*(?=\w*\d)(?=\w*[a-zA-Z])\w*$"
There is a good article about it here:
http://nilangshah.wordpress.com/2007/06/26/password-validation-via-regular-expression/
This should work:
"^[a-zA-Z0-9_]*([a-zA-Z][0-9]|[0-9][a-zA-Z])[a-zA-Z0-9_]*$"
This will match:
<zero-or-more-stuff>
EITHER <letter-followed-by-digit> OR <digit-followed-by-letter>
<zero-or-more-stuff>
By ensuring you have either a digit followed by letter or a letter followed by digit, you are enforcing the requirement to have at least one digit and at least one letter. Note that I've left out the _ above, because it wasn't clear whether you would accept that as a letter, a digit, or neither.
Try this one: ^(?:[a-zA-Z]+[0-9][a-zA-Z0-9]*|[0-9]+[a-zA-Z][a-zA-Z0-9]*)$
Simple is better. If you had a hard time writing it originally, you (or some other poor sap) are going to have a hard time maintaining or modifying it. (And I think that I see some possible holes in the approaches listed above.)
using System.Text.RegularExpressions;
bool IsGoodPassword(string pwd)
{
    int minPwdLen = 8;
    int maxPwdLen = 12;
    string allowedCharsPattern = "^[a-z0-9]*$";
    // Does it pass the test for containing only allowed chars?
    bool allowableChars = Regex.IsMatch(pwd, allowedCharsPattern, RegexOptions.IgnoreCase);
    // Does it contain at least one digit and one letter?
    bool oneLetterOneNumber = Regex.IsMatch(pwd, "[0-9]") && Regex.IsMatch(pwd, "[a-z]", RegexOptions.IgnoreCase);
    // Does it pass length requirements?
    bool goodLength = pwd.Length >= minPwdLen && pwd.Length <= maxPwdLen;
    return allowableChars && oneLetterOneNumber && goodLength;
}

How to get all words of a string in c#?

I have a paragraph in a single string and I'd like to get all the words in that paragraph.
My problem is that I don't want words to keep trailing punctuation marks such as (',', '.', ''', '"', ';', ':', '!', '?') or whitespace such as \n and \t.
I also don't want suffixes like 's and 'm; for example, world's should return only world.
In the example
he said. "My dog's bone, toy, are missing!"
the list should be: he said my dog bone toy are missing
Expanding on Shan's answer, I would consider something like this as a starting point:
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
Why include the ' character? Because this will prevent words like "we're" from being split into two words. After capturing it, you can manually strip out the suffix yourself (whereas otherwise, you couldn't recognize that re is not a word and ignore it).
So:
static string[] GetWords(string input)
{
    MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
    var words = from m in matches.Cast<Match>()
                where !string.IsNullOrEmpty(m.Value)
                select TrimSuffix(m.Value);
    return words.ToArray();
}
static string TrimSuffix(string word)
{
    int apostropheLocation = word.IndexOf('\'');
    if (apostropheLocation != -1)
    {
        word = word.Substring(0, apostropheLocation);
    }
    return word;
}
Example input:
he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?
Example output:
[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]
One limitation of this approach is that it will not handle acronyms well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out if it's a full stop afterwards (i.e., by checking that it's the only period in the word as well as the last character).
Hope this is helpful for you:
// "'s" is listed before "'" so that Split prefers the longer separator at the same position.
string[] separators = new string[] {"'s", ",", ".", "!", "'", " "};
string text = "My dog's bone, toy, are missing!";
foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    Console.WriteLine(word);
See "Regex word boundary expressions" and "What is the most efficient way to count all of the words in a richtextbox?". The moral of the story is that there are many ways to approach the problem, but regular expressions are probably the way to go for simplicity.
Split on whitespace, then trim anything that isn't a letter from the resulting strings.
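The split-and-trim idea could look roughly like the sketch below. WordSplitter, GetWords and Clean are hypothetical names, and cutting each word at its first apostrophe is an added assumption to drop suffixes such as 's, which plain trimming would leave in place.

```csharp
using System;
using System.Linq;

public static class WordSplitter
{
    // Trims non-letters from both ends, then drops anything from the first apostrophe on.
    private static string Clean(string token)
    {
        int start = 0, end = token.Length;
        while (start < end && !char.IsLetter(token[start])) start++;
        while (end > start && !char.IsLetter(token[end - 1])) end--;
        string word = token.Substring(start, end - start);
        int apostrophe = word.IndexOf('\'');
        return apostrophe >= 0 ? word.Substring(0, apostrophe) : word;
    }

    public static string[] GetWords(string text)
    {
        // A null separator array makes Split break on any whitespace (spaces, \n, \t).
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Select(t => Clean(t))
                   .Where(w => w.Length > 0)
                   .ToArray();
    }

    public static void Main()
    {
        string text = "he said. \"My dog's bone, toy, are missing!\"";
        // prints: he said My dog bone toy are missing
        Console.WriteLine(string.Join(" ", GetWords(text)));
    }
}
```

Tokens that consist only of punctuation come out empty after trimming and are filtered away.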
Here's a looping replace method... not fast, but a way to solve it...
string result = "string to cut ' stuff. ! out of";
".',!#".ToCharArray().ToList().ForEach(a => result = result.Replace(a.ToString(),""));
This assumes you want to place it back in the original string, not a new string or a list.

RegEx Problem using .NET

I have a little problem with a regex pattern in C#. Here are the rules:
input: 1234567
expected output: 123/1234567
Rules:
Get the first three digits of the input. //123
Add /
Append the original input. //123/1234567
The expected output should look like this: 123/1234567
Here's my regex pattern:
Regex rx = new Regex(@"((\w{1,3})(\w{1,7}))");
but the output is incorrect: 123/4567
I think this is what you're looking for:
string s = @"1234567";
s = Regex.Replace(s, @"(\w{3})(\w+)", @"$1/$1$2");
Instead of trying to match part of the string, then match the whole string, just match the whole thing in two capture groups and reuse the first one.
It's not clear why you need a RegEx for this. Why not just do:
string x = "1234567";
string result = x.Substring(0, 3) + "/" + x;
Another option is:
string s = Regex.Replace("1234567", @"^\w{3}", "$&/$&");
That would match 123 and replace it with 123/123, leaving the tail 4567 in place.
^\w{3} - Matches the first 3 characters.
$& - Inserts the whole match.
You could also use @"^(\w{3})", "$1/$1" if you are more comfortable with it; it is better known.
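A runnable sketch of the capture-group variant follows. PrefixRepeater and Repeat are invented names; input shorter than three word characters is simply returned unchanged because the pattern does not match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixRepeater
{
    // Replaces the first three word characters with "<prefix>/<prefix>",
    // which has the effect of prepending "<prefix>/" to the input.
    public static string Repeat(string input)
    {
        return Regex.Replace(input, @"^(\w{3})", "$1/$1");
    }

    public static void Main()
    {
        Console.WriteLine(Repeat("1234567")); // 123/1234567
        Console.WriteLine(Repeat("12"));      // 12 (no match, unchanged)
    }
}
```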
Use positive look-ahead assertions, as they don't 'consume' characters in the input, while still capturing into groups:
Regex rx = new Regex(@"(?=(?'group1'\w{1,3}))(?=(?'group2'\w{1,7}))");
group1 should be 123, group2 should be 1234567.
