How to do this Regex in C#? - c#

I've been trying to do this for quite some time but for some reason never got it right.
There will be texts like these:
12325 NHGKF
34523 KGJ
29302 MMKSEIE
49504EFDF
The rule is that there will be a number of EXACTLY 5 digits (no more, no less), followed by 1 SPACE (or no space at all), and then some text, as shown above. I would like to MATCH these with a regex pattern and extract THE NUMBER, THE SPACE and THE TEXT.
Is this possible? Thank you very much!

From your wording, you seem to need each component of the input text on a successful match, so here's a pattern with the named groups number, space and text, which you can read easily if the regex matches:
(?<number>\d{5})(?<space>\s?)(?<text>\w+)
On the returned Match, if Success==true then you can do:
string number = match.Groups["number"].Value;
string text = match.Groups["text"].Value;
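// Groups["space"] is never null, and \s? can match nothing, so check the captured length instead
bool hadSpace = match.Groups["space"].Length > 0;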

The expression is relatively simple:
^([0-9]{5}) ?([A-Z]+)$
That is, 5 digits, an optional space, and one or more upper-case letters. The anchors at both ends ensure that the entire input is matched.
The parentheses around the digits pattern and the letters pattern designate capturing groups one and two. Access them to get the number and the word.
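A minimal sketch of how the anchored pattern above might be used, assuming each input is checked on its own. The names LineParser and TryParse are illustrative and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineParser
{
    // Anchored pattern from the answer: 5 digits, an optional single space, upper-case letters.
    private static readonly Regex Pattern = new Regex(@"^([0-9]{5}) ?([A-Z]+)$");

    // Returns true and fills number/text only when the whole input matches.
    public static bool TryParse(string input, out string number, out string text)
    {
        Match m = Pattern.Match(input);
        number = m.Success ? m.Groups[1].Value : null; // capturing group one: the digits
        text = m.Success ? m.Groups[2].Value : null;   // capturing group two: the letters
        return m.Success;
    }

    public static void Main()
    {
        foreach (string line in new[] { "12325 NHGKF", "49504EFDF", "123456 ABC" })
        {
            if (TryParse(line, out string number, out string text))
                Console.WriteLine($"{line} -> number={number}, text={text}");
            else
                Console.WriteLine($"{line} -> no match");
        }
    }
}
```

The third sample has six digits, so the anchors cause the match to fail rather than silently picking up the first five.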

string test = "12345 SOMETEXT";
string[] result = Regex.Split(test, @"(\d{5})\s*(\w+)");

You could use the Split method:
public class Program
{
static void Main()
{
var values = new[]
{
"12325 NHGKF",
"34523 KGJ",
"29302 MMKSEIE",
"49504EFDF"
};
foreach (var value in values)
{
var tokens = Regex.Split(value, @"(\d{5})\s*(\w+)");
Console.WriteLine("key: {0}, value: {1}", tokens[1], tokens[2]);
}
}
}

Related

How to check if a string contains a word and ignore special characters?

I need to check if a sentence contains any of the word from a string array but while checking it should ignore special characters like comma. But the result should have original sentence.
For example, I have a sentence "Tesla car price is $ 250,000."
In my word array I have wrdList = new string[]{ "250000", "Apple", "40.00"};
I have written the below line of code, but it is not returning the result because 250,000 and 250000 are not matching.
List<string> res = row.ItemArray.Where(itmArr => wrdList.Any(wrd => itmArr.ToString().ToLower().Contains(wrd.ToString()))).OfType<string>().ToList();
And one important thing is, I need to get original sentence if it matches with string array.
For example, result should be "Tesla car price is $ 250,000."
not like "Tesla car price is $ 250000."
How about Replace(",", "")?
itmArr.ToString().ToLower().Replace(",", "").Contains(wrd.ToString())
Side note: .ToLower() isn't required here since digits have no case, and a string doesn't need .ToString(),
so the result could also be
itmArr.Replace(",", "").Contains(wrd)
https://dotnetfiddle.net/A2zN0d
Update
Since the , could be a different character depending on the culture, you can also use
System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat.NumberGroupSeparator
instead
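As a rough sketch of the culture-based variant, assuming the sentence and the words are plain strings. The helper name ContainsWord is hypothetical, and the invariant culture is passed in only so that the separator is predictably a comma.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class SeparatorDemo
{
    // Hypothetical helper: true when the sentence contains any of the words
    // once the culture's number group separator has been removed.
    public static bool ContainsWord(string sentence, string[] words, CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        string normalized = sentence.Replace(separator, "");
        return words.Any(w => normalized.Contains(w));
    }

    public static void Main()
    {
        string sentence = "Tesla car price is $ 250,000.";
        string[] words = { "250000", "Apple", "40.00" };

        // The invariant culture uses "," as its group separator.
        if (ContainsWord(sentence, words, CultureInfo.InvariantCulture))
            Console.WriteLine(sentence); // the original sentence is printed unchanged
    }
}
```

Only the temporary copy is normalized, so the sentence that gets returned or printed keeps its original punctuation.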
The first option to consider for most text matching problems is to use regular expressions. This will work for your problem. The core part of the solution is to construct an appropriate regular expression to match what you need to match.
You have a list of words, but I'll focus on just one word. Your requirements specify that you want to match on a "word". So to start with, you can use the "word boundary" pattern \b. To match the word "250000", the regular expression would be \b250000\b.
Your requirements also specify that the word can "contain" characters that are "special". For it to work correctly, you need to be clear what it means to "contain" and which characters are "special".
For the "contain" requirement, I'll assume you mean that the special character can be between any two characters in the word, but not the first or last character. So for the word "250000", any of the question marks in this string could be a special character: "2?5?0?0?0?0".
For the "special" requirement, there are options that depend on your requirements. If it's simply punctuation, you can use the character class \p{P}. If you need to specify a specific list of special characters, you can use a character group. For example, if your only special character is comma, the character group would be [,].
To put all that together, you would create a function to build the appropriate regular expression for each target word, then use that to check your sentence. Something like this:
public static void Main()
{
string sentence = "Tesla car price is $ 250,000.";
var targetWords = new string[]{ "250000", "350000", "400000"};
Console.WriteLine($"Contains target word? {ContainsTarget(sentence, targetWords)}");
}
private static bool ContainsTarget(string sentence, string[] targetWords)
{
return targetWords.Any(targetWord => ContainsTarget(sentence, targetWord));
}
private static bool ContainsTarget(string sentence, string targetWord)
{
string targetWordExpression = TargetWordExpression(targetWord);
var re = new Regex(targetWordExpression);
return re.IsMatch(sentence);
}
private static string TargetWordExpression(string targetWord)
{
var sb = new StringBuilder();
// If special characters means a specific list, use this:
string specialCharacterMatch = "[,]?";
// If special characters means any punctuation, then you can use this instead:
//string specialCharacterMatch = "\\p{P}?";
bool any = false;
foreach (char c in targetWord)
{
if (any)
{
sb.Append(specialCharacterMatch);
}
any = true;
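// Escape the character so that regex metacharacters such as '.' in "40.00" match literally
sb.Append(Regex.Escape(c.ToString()));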
}
return $"\\b{sb}\\b";
}
Working code: https://dotnetfiddle.net/5UJSur
I hope the solution below helps.
It uses a regular expression to remove non-alphanumeric characters,
and outputs the original string if it contains any matching word from wrdList.
string s = "Tesla car price is $ 250,000.";
string[] wrdList = new string[3] { "250000", "Apple", "40.00" };
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
string str = rgx.Replace(s, "");
if (wrdList.Any(str.Contains))
{
Console.Write(s);
}
else
{
Console.Write("No Match Found!");
}
Uploaded to a fiddle for more exploration:
https://dotnetfiddle.net/zbwuDy
In addition, a paragraph can be split into an array of sentences and iterated through. See the fiddle below.
https://dotnetfiddle.net/AvO6FJ

Extract version number

I am trying to get just the version number from an HTML link.
Take this for example
firefox-10.0.2.bundle
I have got it to take everything after the - with
string versionNum = name.Split('-').Last();
versionNum = Regex.Replace(versionNum, "[^0-9.]", "");
which gives you an output of
10.0.2
However, if the link is like this
firefox-10.0.2.source.tar.bz2
the output will look like
10.0.2...2
How can I make it so that it just chops everything off after the third .? Or can I make it so that when first letter is detected it cuts that and everything that follows?
You could solve this with a single regex match.
Here is an example:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
// The dots are escaped so they match a literal '.' rather than any character
Regex regex = new Regex(@"\d+\.\d+\.\d+");
Match match = regex.Match("firefox-10.0.2.source.tar.bz2");
if (match.Success)
{
Console.WriteLine(match.Value);
}
}
}
Alternatively, after you split "firefox-10.0.2.source.tar.bz2" down to "10.0.2.source.tar.bz2":
string a = "10.0.2.source.tar.bz2";
string[] list = a.Split('.');
string output = "";
foreach (var item in list)
{
    // keep only the segments that are whole numbers
    if (int.TryParse(item, out _))
        output += item + ".";
}
// remove the trailing dot
output = output.TrimEnd('.');
Although late, I feel that this answer would be much more apt:
Regex r = new Regex(@"[\d\.]+(?![a-zA-Z\-])");
Match m = r.Match(name);
Console.WriteLine(m.Value);
Improvements -
Though @Samuel's answer works, what happens if the build is 10.2.2.3? His regex would give 10.2.2, which is a partial answer and therefore wrong.
With the regex I have posted, the match would be complete.
Explanation -
[\d\.]+ matches all the combination of numbers and dots such as 10.2.2.34.56.78 and even just 10 if the build is 10.bundle
(?![a-zA-Z\-]) is a negative look-ahead which ensures that the match is not followed by any letter or dash.
Being robust is absolutely vital to any code, so my posted answer should work pretty well under any circumstances (because the link could be anything).
Here's a version which can handle 1-4 numbers (not just digits) in the input string, and returns a version number:
public static Version ExtractVersionNumber(string input)
{
Match match = Regex.Match(input, @"(\d+\.){0,3}\d+");
if (match.Success)
{
return new Version(match.Value);
}
return null;
}
void Main()
{
Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5source.tar.bz2"));
Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6source.tar.bz2"));
}
Explanation
There are essentially 2 parts:
(\d+\.){0,3} - match a number (uninterrupted sequence of 1 or more digits) immediately followed by a dot. Match this 0 to 3 times.
\d+ - match a number (sequence of 1 or more digits).
These work as follows:
When there's only 1 number (or even if there's only 1 number followed by a dot), the first part will match nothing and the second part will match the number.
When there are 2 numbers separated by a dot, the first part matches the first number and the dot, and the second part matches the second number.
For 3 numbers separated by dots, the first part gets the first 2 numbers and dots, and the second part gets the third number.
For 4 or more numbers separated by dots, the first part gets the first 3 numbers and dots, and the second part gets the fourth number. Any subsequent numbers and dots are ignored.
PS: If you wanted to ensure that you only got the number after the hyphen (e.g. to avoid getting 4.0.1 given the string "firefox4.0.1-10.0.2.source.tar.bz2") you could add a positive lookbehind to say "the character immediately before the version number must be a hyphen": (?<=-)(\d+\.){0,3}\d+.
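A small sketch of the hyphen-anchored variant, assuming the version always directly follows a hyphen. The names HyphenVersionDemo and Extract are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenVersionDemo
{
    // Lookbehind variant from the note above: the version must be
    // immediately preceded by a hyphen. Returns null when nothing matches.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, @"(?<=-)(\d+\.){0,3}\d+");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Without the lookbehind, the first match here would be 4.0.1
        Console.WriteLine(Extract("firefox4.0.1-10.0.2.source.tar.bz2")); // prints 10.0.2
    }
}
```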

Regex matching more than {7}

I regularly extract Dell service tags from a huge list, and I have a bit of code that is supposed to extract the 7-character alphanumeric tags, but if there is extra text in the document, it will sometimes extract that extra text as well.
My Pattern:
Regex rServTag_Pattern = new Regex(@".*(?=.{7})(?=.*\d)(?=.*[a-zA-Z]).*");
var mTag = rServTag_Pattern.Match(Clipboard.GetText());
For the most part it works, but it sometimes extracts more than what is needed, which gets annoying over time. How can I make sure it extracts only the 7-character alphanumeric string?
Example service tags: 7DJHT90, LK2JHN4, and so on (these are not actual service tags).
Just use
var rServTag = new Regex(@"(?=([a-zA-Z]+\d[a-zA-Z\d]+|\d+[a-zA-Z][a-zA-Z0-9]+))[a-zA-Z0-9]{7}");
If you need to avoid extracting 7 letter+digit combinations from inside longer text, you can add word boundaries:
var rServTag = new Regex(@"\b(?=([a-zA-Z]+\d[a-zA-Z\d]+|\d+[a-zA-Z][a-zA-Z0-9]+))[a-zA-Z0-9]{7}\b");
I would split your problem into two steps:
Split the input by a delimiter
Process each split string
In your case, I would split Clipboard.GetText() on all non-alphanumeric characters:
string[] splitArray = Regex.Split(Clipboard.GetText(), @"[^a-zA-Z\d]+");
foreach (string s in splitArray)
{
// process s
}
Then for each split string s, apply a regex that only matches strings which have at least one letter (?=.*[a-zA-Z]), have at least one digit (?=.*\d), and are exactly 7 characters long ^[a-zA-Z\d]{7}$:
new Regex(@"^(?=.*[a-zA-Z])(?=.*\d)[a-zA-Z\d]{7}$");
Example:
Regex regex = new Regex(@"^(?=.*[a-zA-Z])(?=.*\d)[a-zA-Z\d]{7}$");
string[] splitArray = Regex.Split(Clipboard.GetText(), @"[^a-zA-Z\d]+");
foreach (string s in splitArray)
{
if (regex.IsMatch(s))
{
// s is a valid service tag
}
}
Given the input "123ABCD, ABCDEFG... ABCD123, 123AAAAAAAA", splitArray will equal ["123ABCD", "ABCDEFG", "ABCD123", "123AAAAAAAA"].
regex.IsMatch(s) will return true for s "123ABCD" and "ABCD123".
Use word boundaries to isolate 7 characters.
Regex rServTag_Pattern = new Regex(@".*\b[A-Z\d]{7}\b.*");
This assumes only capitals and digits in the service tag (based on OP's sample input)
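A sketch of pulling every tag out of a larger text with word boundaries, under the same assumption of capitals and digits only. The leading and trailing .* are left out here so that each match is just the tag, and the two look-aheads (an addition, not part of the pattern above) require at least one letter and one digit. ServiceTagFinder and FindTags are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ServiceTagFinder
{
    // Word-bounded run of exactly 7 capitals/digits that contains
    // at least one letter and at least one digit (assumed tag format).
    private static readonly Regex TagPattern =
        new Regex(@"\b(?=[A-Z\d]*[A-Z])(?=[A-Z\d]*\d)[A-Z\d]{7}\b");

    // Collects every match in the text, in order of appearance.
    public static List<string> FindTags(string text)
    {
        var tags = new List<string>();
        foreach (Match m in TagPattern.Matches(text))
            tags.Add(m.Value);
        return tags;
    }

    public static void Main()
    {
        string text = "Tags: 7DJHT90, LK2JHN4, ABCDEFG, 1234567 and 123AAAAAAAA.";
        Console.WriteLine(string.Join(", ", FindTags(text))); // prints 7DJHT90, LK2JHN4
    }
}
```

The all-letter, all-digit and over-long candidates in the sample are skipped, because each fails either a look-ahead or the trailing word boundary.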

Regex to match a repeated set of possible characters

I'm writing a short text token replacement system which takes the form:
$varName(opt1|opt2|opt3)
It's designed to easily swap out things based on arbitrary values like this:
$gender(he|she)
I figured the best way to get and process those was a regex that matches the pattern, but I can't figure out how to recognise the options between the brackets, because they can repeat an arbitrary number of times and there is always one fewer pipe character than there are options.
Any help?
(I'm using C# as the regex host)
EDIT:
I tried this but it only seems to work with something with 2 options
\$[a-zA-Z]+\(([a-zA-Z]+\|)+[a-zA-Z]+\)
Something like this should work:
string text = "$gender(he|she|it|alien)";
string pattern = @"\$(\w+)\(([\w\|]*)\)";
Match match = Regex.Match(text, pattern);
string varName = match.Groups[1].Value;
string[] values = match.Groups[2].Value.Split('|');
Console.WriteLine(varName + ": ");
foreach (string value in values)
{
Console.WriteLine(" " + value);
}
This is what it prints out:
gender:
he
she
it
alien
varName has the name of the variable, and then values is an array of strings containing each option.
However, if you put in something like "$gender()" with no values, or "$gender(he|she|)" with an extra pipe on the end, you'll get empty strings in the result. If that might be a problem, try this:
string[] values = match.Groups[2].Value.Split('|').Where((s) => !string.IsNullOrEmpty(s)).ToArray();
I figured it out.
I was forgetting to account for numbers in the options.
\$[a-zA-Z]+\(([a-zA-Z0-9]+\|)+[a-zA-Z0-9]+\)

named groups splitting regardless of position of match

Having a hard time explaining what I mean, so here is what I want to do
I want any sentence to be parsed along the pattern of
text #something a few words [someothertext]
for this, the matching sentence would be
Jeremy is trying #20 times to [understand this]
And I would name 4 groups, as text, time, who, subtitle
However, I could also write
#20 Jeremy is trying [understand this] times to
and still get the tokens
#20
Jeremy is trying
times to
understand this
corresponding to the right groups
As long as the delimited tokens can separate the 2 text only tokens, I'm fine.
Is this even possible? I've tried a few regexes and failed miserably (I'm still experimenting, but I find myself spending way too much time learning it).
Note: The order of the tokens can be random. If this isn't possible with regex then I guess I can live with a fixed order.
edit: fixed a typo. clarified further what I wanted.
You can alternate on the different types of text. Using named groups means that one group would have a Success value equal to true for each match.
This pattern should do what you need:
@"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*"
(?<Number>#\d+\b) - matches # followed by one or more digits, up to a word boundary
(?<Subtitle>\[.+?]) - non-greedy matching of text between square brackets
\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s* - skips whitespace on either side of the text, and the named capture group matches one character at a time, as long as the negative look-ahead confirms that what follows that character is not the start of one of the other 2 patterns of interest (numbers and subtitles).
Example usage:
var inputs = new[]
{
"Jeremy is trying #20 times to [understand this]",
"#20 Jeremy is trying [understand this] times to"
};
string pattern = @"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*";
foreach (var input in inputs)
{
Console.WriteLine("Input: " + input);
foreach (Match m in Regex.Matches(input, pattern))
{
// skip first group, which is the entire matched text
var group = m.Groups.Cast<Group>().Skip(1).First(g => g.Success);
Console.WriteLine(group.Value);
}
Console.WriteLine();
}
Alternately, this example demonstrates how to pair the named groups to the matches:
var re = new Regex(pattern);
foreach (var input in inputs)
{
Console.WriteLine("Input: " + input);
var query = from Match m in re.Matches(input)
from g in re.GetGroupNames().Skip(1)
where m.Groups[g].Success
select new
{
GroupName = g,
Value = m.Groups[g].Value
};
foreach (var item in query)
{
Console.WriteLine("{0}: {1}", item.GroupName, item.Value);
}
Console.WriteLine();
}
So if I understand this correctly, you're looking for four phrases:
1) 1+ words of normal text
2) 1 word of text prefixed by a #
3) 1+ words of normal text
4) 1+ words of text wrapped by [ ]
My (admittedly slow and regex-less) suggestion would be to find the indexes of the #, [, and ] characters, then use several calls to string.Substring().
This would be acceptable for relatively small strings and a relatively small number of iterations, although with much larger strings this would be extremely slow.
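A rough sketch of that index-based idea, assuming the input contains exactly one #token (ended by a space or the end of the string) and one [bracketed] token, with the #token outside the brackets. SubstringTokenizer and Tokenize are hypothetical names, and the tokens come back in the order they appear.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringTokenizer
{
    public static List<string> Tokenize(string input)
    {
        int hash = input.IndexOf('#');
        int open = input.IndexOf('[');
        int close = open < 0 ? -1 : input.IndexOf(']', open);
        if (hash < 0 || close < 0)
            throw new FormatException("Expected one #token and one [bracketed] token.");

        // The #token runs up to the next space, or to the end of the input.
        int hashEnd = input.IndexOf(' ', hash);
        if (hashEnd < 0) hashEnd = input.Length;

        // Delimited spans as (start, end-exclusive), ordered by position.
        var spans = new List<(int Start, int End)> { (hash, hashEnd), (open, close + 1) };
        spans.Sort((a, b) => a.Start.CompareTo(b.Start));

        var tokens = new List<string>();
        int cursor = 0;
        foreach (var span in spans)
        {
            // Plain text between the previous token and this one.
            AddIfNotBlank(tokens, input.Substring(cursor, span.Start - cursor));
            // The delimited token itself, with any square brackets stripped.
            tokens.Add(input.Substring(span.Start, span.End - span.Start).Trim('[', ']'));
            cursor = span.End;
        }
        AddIfNotBlank(tokens, input.Substring(cursor));
        return tokens;
    }

    private static void AddIfNotBlank(List<string> tokens, string piece)
    {
        string trimmed = piece.Trim();
        if (trimmed.Length > 0) tokens.Add(trimmed);
    }

    public static void Main()
    {
        foreach (string token in Tokenize("#20 Jeremy is trying [understand this] times to"))
            Console.WriteLine(token);
    }
}
```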
