C# regex pattern to getting words

C# regex pattern to getting words - c#

I am trying to figure out the pattern that will get words from a string. Say for instance my string is:
string text = "HI/how.are.3.a.d.you.&/{}today 2z3";
I tried to eliminate anything under 1 letter or number but it doesn't work:
Regex.Split(s, #"\b\w{1,1}\b");
I also tried this:
Regex.Splits(text, #"\W+");
But it outputs:
"HI how are a d you today"
I just want to get all the words so that my final string is:
"HI how are you today"

To get all words that are at least 2 characters long you can use this pattern: \b[a-zA-Z]{2,}\b.
string text = "HI/how.are.3.a.d.you.&/{}today 2z3";
var matches = Regex.Matches(text, #"\b[a-zA-Z]{2,}\b");
string result = String.Join(" ", matches.Cast<Match>().Select(m => m.Value));
Console.WriteLine(result);
As others have pointed out in the comments, "A" and "I" are valid words. In case you decide to match those you can use this pattern instead:
var matches = Regex.Matches(text, #"\b(?:[a-z]{2,}|[ai])\b",
RegexOptions.IgnoreCase);
In both patterns I've used \b to match word-boundaries. If you have input such as "1abc2" then "abc" wouldn't be matched. If you want it to be matched then remove the \b metacharacters. Doing so from the first pattern is straightforward. The second pattern would change to [a-z]{2,}|[ai].

Related

Regex for a URL with illegal characters \\

From the following string:
google.com/local/reviews?placeid\\u003dChIJ070npYRaeEgRZNoxwuYYrew\\u0026q\\u003d
To extract u003dChIJ070npYRaeEgRZNoxwuYYrew although this value will change every time.
I have tried
Regex r = new Regex(#"("(?<=\placeid\\\s+)\p{L}+");
Which does not work.
I am guilty of neglecting my knowledge is regex so I apologise if this is painfully easy.

There are no whitespace chars in the string that you want to match with \s+ and there are 2 backslashes.
Using \p{L}+ only matches any letter and the string that you want also contains numbers.
(?<=placeid\\\\\s*)[\p{L}\p{N}]+
Regex demo
For example
string pattern = #"(?<=placeid\\\\\s*)[\p{L}\p{N}]+";
string input = #"google.com/local/reviews?placeid\\u003dChIJ070npYRaeEgRZNoxwuYYrew\\u0026q\\u003d";
Match m = Regex.Match(input, pattern);
Console.WriteLine(m.Value);
Output
u003dChIJ070npYRaeEgRZNoxwuYYrew

How to split string by another string

I have this string (it's from EDI data):
ISA*ESA?ISA*ESA?
The * indicates it could be any character and can be of any length.
? indicates any single character.
Only the ISA and ESA are guaranteed not to change.
I need this split into two strings which could look like this: "ISA~this is date~ESA|" and
"ISA~this is more data~ESA|"
How do I do this in c#?
I can't use string.split, because it doesn't really have a delimeter.

You can use Regex.Split for accomplishing this
string splitStr = "|", inputStr = "ISA~this is date~ESA|ISA~this is more data~ESA|";
var regex = new Regex($#"(?<=ESA){Regex.Escape(splitStr)}(?=ISA)", RegexOptions.Compiled);
var items = regex.Split(inputStr);
foreach (var item in items) {
Console.WriteLine(item);
}
Output:
ISA~this is date~ESA
ISA~this is more data~ESA|
Note that if your string between the ISA and ESA have the same pattern that we are looking for, then you will have to find some smart way around it.
To explain the Regex a bit:
(?<=ESA) Look-behind assertion. This portion is not captured but still matched
(?=ISA) Look-ahead assertion. This portion is not captured but still matched
Using these look-around assertions you can find the correct | character for splitting

Simply use the
int x = whateverString.indexOf("?ISA"); // replace ? with the actual character here
and then just use the substring from 0 to that indexOf, indexOf to length.
Edit:
If ? is not known,
can we just use the regex Pattern and Matcher.
Matcher matcher = Patter.compile("ISA.*ESA").match(whateverString);
if(matcher.find()) {
matcher.find();
int x = matcher.start();
}
Here x would give that start index of that match.
Edit: I mistakenly saw it as java one, for C#
string pattern = #"ISA.*ESA";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = myRegex.Match(whateverString); // m is the first match
while (m.Success)
{
Console.writeLine(m.value);
m = m.NextMatch(); // more matches
}

RegEx will probably be the best for this. See this link
Mask would be
ISA(?<data1>.*?)ESA.ISA(?<data2>.*?)ESA.
This will give you 2 groups with data you need
Match match = Regex.Match(input, #"ISA(?<data1>.*?)ESA.ISA(?<data2>.*?)ESA.",RegexOptions.IgnoreCase);
if (match.Success)
{
var data1 = match.Groups["data1"].Value;
var data2 = match.Groups["data2"].Value;
}
Use Regex.Matches If you need multiple matches found, and specify different RegexOptions if needed.

It's kinda hacky but you could do...
string x = "ISA*ESA?ISA*ESA?";
x = x.Replace("*","~"); // OR SOME OTHER DELIMITER
string[] y = x.Split('~');
Not perfect in all situations, but it could solve your problem simply.

You could split by "ISA" and "ESA" and then put the parts back together.
string input = "ISA~this is date~ESA|ISA~this is more data~ESA|";
string start = "ISA",
end = "ESA";
var splitedInput = input.Split(new[] { start, end }, StringSplitOptions.None);
var firstPart = $"{start}{splitedInput[1]}{end}{splitedInput[2]}";
var secondPart = $"{start}{splitedInput[3]}{end}{splitedInput[4]}";
firstPart = "ISA~this is date~ESA|"
secondPart = "ISA~this is more data~ESA|";

Use a Regex like ISA(.+?)ESA and select the first group
string input = "ISA~mycontent+ESA";
Match match = Regex.Match(input, #"ISA(.+?)ESA",RegexOptions.IgnoreCase);
if (match.Success)
{
string key = match.Groups[1].Value;
}

Instead of "splitting" by a string, I would instead describe your question as "grouping" by a string. This can easily be done using a regular expression:
Regular expression: ^(ISA.*?(?=ESA)ESA.)(ISA.*?(?=ESA)ESA.)$
Explanation:
^ - asserts position at start of the string
( - start capturing group
ISA - match string ISA exactly
.*?(?=ESA) - match any character 0 or more times, positive lookahead on the
string ESA (basically match any character until the string ESA is found)
ESA - match string ESA exactly
. - match any character
) - end capturing group
repeat one more time...
$ - asserts position at end of the string
Try it on Regex101
Example:
string input = "ISA~this is date~ESA|ISA~this is more data~ESA|";
Regex regex = new Regex(#"^(ISA.*?(?=ESA)ESA.)(ISA.*?(?=ESA)ESA.)$",
RegexOptions.Compiled);
Match match = regex.Match(input);
if (match.Success)
{
string firstValue = match.Groups[1].Value; // "ISA~this is date~ESA|"
string secondValue = match.Groups[2].Value; // "ISA~this is more data~ESA|"
}

There are two answers to the question "How to split a string by another string".
var matches = input.Split(new [] { "ISA" }, StringSplitOptions.RemoveEmptyEntries);
and
var matches = Regex.Split(input, "ISA").ToList();
However, the first removes empty entries, while the second does not.

C# Regex - starts with pattern1 not contain pattern2

for the following input string contains all of these:
a1.aaa[SUBSCRIBED]
a1.bbb
a1.ccc
b1.ddd
d1.ddd[SUBSCRIBED]
I want to get the output:
bbb
ccc
which means: all the words that come after "a1." And not contain the substring "[SUBSCRIBED]"

all the words comes after "a1." And not contains the substring
"[SUBSCRIBED]"
Why regex? Following is crystal clear:
var result = strings
.Where(s => s.StartsWith("a1.") && !s.Contains("[SUBSCRIBED]"))
.Select(s => s.Substring(3));

Tim's answer makes sense. However if you insist on it I would venture that a Regex would look like this though.
^a1\.(.*)(?<!\[SUBSCRIBED\])$
with ^a1 meaning starts with a1
\.(.*) taking any number of character
and the negative lookbehind (?<!\[SUBSCRIBED\])$ would refuse text ending with [SUBSCRIBED]

You may use
^a1\.(?!.*\[SUBSCRIBED])(.*)
See the regex demo.
Details
^ - start of string
a1\. - a literal a1. substring
(?!.*\[SUBSCRIBED]) - a negative lookahead that fails the match if there is a [SUBSCRIBED] substring is present after any 0+ chars (other than newline if the RegexOptions.Singleline option is not used)
(.*) - Group 1: the rest of the line up to the end (if you use RegexOptions.Singleline option, . will match newlines as well).
C# code:
var result = string.Empty;
var m = Regex.Match(s, #"^a1\.(?!.*\[SUBSCRIBED])(.*)");
if (m.Success)
{
result = m.Groups[1].Value;
}

Improve RegEx search

Using DirectoryServices.AccountManagement I'm getting users DistinguishedName which looks like so:
CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
I need to get first OU value from this.
I found similar solution: C# Extracting a name from a string
And using some tweaks I created this code:
string input = #"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
Match m = Regex.Match(input, #"OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
This code returns STORE as expected, but if I change Groups[1] to Groups[0] I get almost same result as input string:
OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
How can I change this regex so it will return only values of OU? SO that in this example I get array of 2 matches. If I would have more OU in my string then array would be longer.
EDIT:
I've converted my code (using #dasblinkenlight suggestions) into function:
private static List<string> GetOUs()
{
var input = #"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
var mm = Regex.Matches(input, #"OU=([a-zA-Z\\]+)");
return (from Match m in mm select m.Groups[1].Value).ToList();
}
Is that correct?

Your regular expression is fine (almost), you are just using a wrong API.
Remove the parts of the regexp that match up to the ending anchor $, and change the call of Match for a call of Matches, and get the matches in a loop, like this:
var input = #"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
var mm = Regex.Matches(input, #"OU=([a-zA-Z\\]+)");
foreach (Match m in mm)
Console.WriteLine(m.Groups[1].Value);
}

Your existing regex:
#"OU=([a-zA-Z\\]+)\,.*$"
Matches OU=, then some letters and backslashes ([a-zA-Z\\]+), then a comma, then any characters (.*) to the end of the line ($).
Thus a single match will always match the entire line after the first OU section.
Modify your regex by removing the ,.*$ at the end, at it will match each OU group:
#"OU=([a-zA-Z\\]+)"
Also note that the parentheses are a capturing group. They are useful if you also want to capture just the value part by itself, but if you are not using that, they are not necessary, and you can just have this:
#"OU=[a-zA-Z\\]+"

It's beacuse you are mixing up matches and groups
string input = #"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection mc = Regex.Matches(input, #"OU=([a-zA-Z\\]+),");
foreach(Match m in mc)
{
Console.WriteLine(m.Result("$1"));
}

Group[0] returns the full match:
Group[1] returns the first Pattern in the match [i.e. everything in the first parenthesis '(' ')' ]
So if you wanted to get exactly those 2 occurances of OU.. you could do this:
Match m = Regex.Match(input, #"OU=([a-zA-Z\\]+)\,OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Group[0] returns the full match: (which you don't want)
Group[1] returns the first Pattern in the match [i.e everything in the first parenthesis '(' ')' ]
Group[2] returns the second Pattern in the match [i.e. everything in the second parenthesis '(' ')' ]
Giving:
STORE
COMPANY
But I'm assuming you don't want to be so explicit with your Regex for each Pattern you are interested in.
If you want to get multiple matches, then you need to do Regex's Matches call that returns a Matchcollection.
MatchCollection ms = Regex.Matches(...);
This still won't work with your current Regex though, because everything from STORE so the end of the line will be in the first match. If you only want to get the pattern "1-or-more-letters" after a "OU="
You only need:
#"OU=([a-zA-Z\\]+)"
So your code would be:
string input = #"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection ms = Regex.Matches(input, #"OU=([a-zA-Z\\]+)");
foreach (Match m in ms)
{
Console.WriteLine(m.Groups[1].Value);// get the string in the first "(" ")"
}

RegEx Problem using .NET

I have a little problem on RegEx pattern in c#. Here's the rule below:
input: 1234567
expected output: 123/1234567
Rules:
Get the first three digit in the input. //123
Add /
Append the the original input. //123/1234567
The expected output should looks like this: 123/1234567
here's my regex pattern:
regex rx = new regex(#"((\w{1,3})(\w{1,7}))");
but the output is incorrect. 123/4567

I think this is what you're looking for:
string s = #"1234567";
s = Regex.Replace(s, #"(\w{3})(\w+)", #"$1/$1$2");
Instead of trying to match part of the string, then match the whole string, just match the whole thing in two capture groups and reuse the first one.

It's not clear why you need a RegEx for this. Why not just do:
string x = "1234567";
string result = x.Substring(0, 3) + "/" + x;

Another option is:
string s = Regex.Replace("1234567", #"^\w{3}", "$&/$&"););
That would capture 123 and replace it to 123/123, leaving the tail of 4567.
^\w{3} - Matches the first 3 characters.
$& - replace with the whole match.
You could also do #"^(\w{3})", "$1/$1" if you are more comfortable with it; it is better known.

Use positive look-ahead assertions, as they don't 'consume' characters in the current input stream, while still capturing input into groups:
Regex rx = new Regex(#"(?'group1'?=\w{1,3})(?'group2'?=\w{1,7})");
group1 should be 123, group2 should be 1234567.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# regex pattern to getting words - c#

Related

Regex for a URL with illegal characters \\

How to split string by another string

C# Regex - starts with pattern1 not contain pattern2

Improve RegEx search

RegEx Problem using .NET

Categories

Resources