Regular expression for performing task being done by string functions - c#

The code below does the following, and I intend to integrate it into a larger application:
Split the large input string on the dot (.) character wherever it occurs.
Store the resulting substrings in the array result[].
In the foreach loop, test each substring for an occurrence of the keyword.
If a match occurs, print up to 300 characters of the original input string, starting from the position of the matched substring.
string[] result = input.Split('.');
foreach (string str in result)
{
    // Console.WriteLine(str);
    Match m = Regex.Match(str, keyword);
    if (m.Success)
    {
        // Position of the matching sentence in the original input.
        int start = input.IndexOf(str);
        if ((input.Length - start) < 300)
        {
            // Fewer than 300 characters remain: print the rest.
            Console.WriteLine(input.Substring(start, input.Length - start));
            break;
        }
        else
        {
            Console.WriteLine(input.Substring(start, 300));
            break;
        }
    }
}
The input is in fact a large amount of text, and I think this should be done with a regular expression. Being a novice, I am not able to put everything together using regular expressions. What I have so far:
Match the keyword: Match m = Regex.Match(str, keyword);
Print 300 characters starting from the dot (.), i.e. starting from the matched sentence: "^.\w{0,300}"
What I intend to do is:
Search for the keyword in the input text.
As soon as a match is found, start from the sentence containing the keyword and print up to 300 characters from the input string.
How should I proceed? Please help.

If I got it right, all you need to do is find your keyword and capture everything that follows until you reach the first dot or the maximum number of characters:
@"keyword([^\.]{0,300})"
See sample demo here.
C# code:
var regex = new Regex(@"keyword([^\.]{0,300})");
foreach (Match match in regex.Matches(input))
{
    var result = match.Groups[1].Value;
    // work with the result
}

Try this regex:
(?<=\.?)([\w\s]{0,300}keyword.*?)(?=\.)
Explanation:
(?= subexpression) Zero-width positive lookahead assertion.
(?<= subexpression) Zero-width positive lookbehind assertion.
*? Matches the previous element zero or more times, but as few times as possible.
And some simple code (here the keyword is print):
foreach (Match match in Regex.Matches(input,
    @"(?<=\.?)([\w\s]{0,300}print.*?)(?=\.)"))
{
    Console.WriteLine(match.Groups[1].Value);
}
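For comparison, here is a minimal sketch of the behaviour described in the question, done with plain string methods instead of a regex. It assumes the keyword is a literal string (the question passes it to Regex.Match as a pattern) and that a sentence starts right after the previous dot. SentenceExcerpt and From are hypothetical names, not something from the posts above.

```csharp
using System;

public static class SentenceExcerpt
{
    // Returns up to maxLength characters of input, starting at the sentence
    // that contains the first occurrence of keyword.
    // Returns an empty string when the keyword does not occur.
    public static string From(string input, string keyword, int maxLength = 300)
    {
        int hit = input.IndexOf(keyword, StringComparison.Ordinal);
        if (hit < 0)
        {
            return string.Empty;
        }

        // The sentence starts just after the closest '.' before the keyword,
        // or at index 0 when no dot precedes it (LastIndexOf returns -1).
        int start = input.LastIndexOf('.', hit) + 1;

        // Never read past the end of the input.
        int length = Math.Min(maxLength, input.Length - start);
        return input.Substring(start, length);
    }
}
```

Like the Split-based loop in the question, this keeps the whitespace that follows the dot, so call Trim() on the result if that matters. It also avoids splitting the whole text into an array, which may help on large inputs.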

Related

Regular expression not working in dotnet C# but works in online editors [duplicate]

I want to get a Substring out of a String.
The Substring I want is a sequence of numerical characters.
Input
"abcdefKD-0815xyz42ghijk";
"dag4ah424KD-42ab333k";
"BeverlyHills90210KD-433Nokia3310";
Generally it could be any String, but they all have one thing in common:
There is a part that starts with KD-
and ends with a number
Everything after the number should be dropped.
In the examples above this number would be 0815, 42 and 433 respectively, but it could be any number.
Right now I have a substring that contains all numerical characters after KD-, but I would like to have only the 0815-ish part of the string.
What I have so far:
String toMakeSub = "abcdef21KD-0815xyz429569468949489694694689ghijk";
toMakeSub = toMakeSub.Substring(toMakeSub.IndexOf("KD-") + "KD-".Length);
String result = Regex.Replace(toMakeSub, "[^0-9]", "");
The result is 0815429569468949489694694689, but I want only the 0815 (it could be any length though, so cutting after four digits is not possible).
It's as easy as the following pattern:
(?<=KD-)\d+
The way to read this
(?<=subpattern) : Zero-width positive lookbehind assertion. Continues matching only if subpattern matches on the left.
\d : Matches any decimal digit.
+ : Matches previous element one or more times.
Example
var input = "abcdef21KD-0815xyz429569468949489694694689ghijk";
var regex = new Regex(@"(?<=KD-)\d+");
var match = regex.Match(input);
if (match.Success)
{
    Console.WriteLine(match.Value);
}
input = "abcdef21KD-0815xyz429569468949489694694689ghijk, KD-234dsfsdfdsf";
// or to match multiple times
var matches = regex.Matches(input);
foreach (Match matchValue in matches)
{
    Console.WriteLine(matchValue.Value);
}
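As a quick check against the three inputs from the question, the same pattern can be wrapped in a small helper. This is only a sketch: KdNumber and Extract are hypothetical names, and returning an empty string when there is no KD- number is my assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KdNumber
{
    // Lookbehind for "KD-", then the run of digits that follows it.
    private static readonly Regex Pattern = new Regex(@"(?<=KD-)\d+");

    // Returns the digits directly after the first "KD-",
    // or an empty string when there are none.
    public static string Extract(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Because \d+ stops at the first non-digit, the digits that appear later in the string (42, 333, 3310 in the samples) are never picked up.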

How do I regex match each individual word within backticks?

I am trying to get results for each individual word within backticks. For example,
if I have something like this text
some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`
I want the search results to be:
match
these_words
th_is_wor
THIS_WOR
thi_sqw
word_snake
I'm essentially trying to get each "word", word being one or more english letter or underscore characters, between each set of backticks.
I currently have the following regex that seems to match ALL the text between each set of backticks:
/(?<=`)(\b([^`\]|\w|_)*\b)(?=`)/gi
This uses a positive lookbehind to find text that comes after a ` character: (?<=`)
Followed by a capture group for one or more things such that the thing is not a `, not a \, is a word character, or is an _ character within word boundaries: (\b([^`\]|\w|_)*\b)
Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.
This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:
match these_words th_is_wor
THIS_WOR thi_sqw
word_snake
If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!
Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.
Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!
using System;
using System.Text.RegularExpressions;
class Program {
static void Main(string[] args) {
string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
// Call Matches method without specifying any options.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.
Console.WriteLine();
// Call Matches method for case-insensitive matching.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {}
}
}
His explanation for this was as follows, but you can paste his regex into regexr.com for more info:
var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";
var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick
var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word
// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";
// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
With C# you can use the Group.Captures Property and then get the capture group values.
Note that \w also matches _
`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`
Explanation
` Match a backtick literally
(?: Non-capture group to repeat as a whole
[\p{Zs}\t]* Match optional spaces
(\w+) Capture group 1, match 1+ word characters
[\p{Zs}\t]* Match optional spaces
)+ Close the non-capture group and repeat it 1 or more times
` Match a backtick literally
See a .NET regex demo and a C# demo.
For example:
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
Console.WriteLine(String.Join(',', result));
}
Output
match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake
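If one flat list of words is wanted rather than one line per backtick pair, a two-pass variant may be easier to read: first match each backtick-delimited span, then pull the \w+ runs out of it. This is a sketch of a different technique from the Group.Captures answer above, and BacktickWords and Extract are hypothetical names. It assumes backticks come in balanced pairs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BacktickWords
{
    // Returns every run of word characters found between pairs of backticks.
    public static List<string> Extract(string text)
    {
        var words = new List<string>();

        // Pass 1: each `...` span; group 1 holds the text between the backticks.
        foreach (Match span in Regex.Matches(text, "`([^`]*)`"))
        {
            // Pass 2: each word (letters, digits, underscore) inside that span.
            foreach (Match word in Regex.Matches(span.Groups[1].Value, @"\w+"))
            {
                words.Add(word.Value);
            }
        }

        return words;
    }
}
```

Swapping the backtick in the first pattern for a double quote should give the same behaviour for quoted text.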

Find String Between Two Identical Control Separators?

I'm reading from a file, and need to find a string that is enclosed by two identical non-ascii values/control separators, in this case 'RS'.
How would I go about doing this? Would I need some form of regex?
RS stands for Record Separator, and it has a value of 30 (or 0x1E in hexadecimal). You can use this regular expression:
\x1E([\w\s]*?)\x1E
That matches the RS, then lazily matches any run of letters, digits, underscores or whitespace, and then the RS again. The ? makes the regex match as few characters as possible, in case there are more RS characters afterwards.
If you prefer not to match numbers, you could use [a-zA-Z\s] instead of [\w\s].
Example:
string fileContents = "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end.";
MatchCollection matches = Regex.Matches(fileContents, @"\x1E([\w\s]*?)\x1E");
if (matches.Count == 0)
return; // Not found, display an error message and exit.
foreach (Match match in matches)
{
if (match.Groups.Count > 1)
Console.WriteLine(match.Groups[1].Value);
}
As you can see, you get a collection of Match objects, and each match.Value holds the whole matched string including the separators. match.Groups holds all matched groups: the first one is the whole matched string again (that's the default), followed by each of your own groups (those between parentheses). In this case you only have one group in your regex, so you just need the second entry in that list.
Using regex you can do something like this:
string pattern = string.Format("{0}(.*){1}",firstString,secondString);
var matches = Regex.Matches(myString, pattern);
foreach (Match match in matches)
{
foreach (Capture capture in match.Captures)
{
// Do stuff. With the current pattern you should remove firstString and secondString from capture.Value.
}
}
After that, use Regex.Matches to find the strings that match the pattern built above.
Remember to escape all regex special characters in firstString and secondString.
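The escaping step can be handled by Regex.Escape. Below is a minimal sketch of that idea, assuming a lazy (.*?) group so that each pair of delimiters produces its own match, and RegexOptions.Singleline so the enclosed text may span lines. Between and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Between
{
    // Returns the text found between each occurrence of first and second.
    public static List<string> Find(string text, string first, string second)
    {
        // Regex.Escape turns characters such as * or . into literals.
        string pattern = Regex.Escape(first) + "(.*?)" + Regex.Escape(second);

        var found = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern, RegexOptions.Singleline))
        {
            // Group 1 is the enclosed text without the delimiters.
            found.Add(match.Groups[1].Value);
        }

        return found;
    }
}
```

Reading Groups[1] instead of Value avoids having to strip the delimiters afterwards.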
You can use Regex.Matches, I'm using X as the separator in this example:
var fileContents = "Xsomething1X Xsomething2X Xsomething3X";
var results = Regex.Matches(fileContents, @"(X).*?(\1)");
Then you can loop over results to do anything you want with the matches.
The \1 in the regex means "reference the first group". I've put X between () so it becomes group 1, then I use \1 to say that the match in this place should be exactly the same as group 1.
You don't need a regular expression for that.
Read the contents of the file (File.ReadAllText).
Split on the separator character (String.Split).
If you know there's only one occurrence of your string, take the second array element (result[1]). Otherwise, take every other entry (result.Where((x, i) => i % 2 == 1)).
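The split-based steps above could look roughly like this. It is a sketch that works on a string already read from the file and assumes the separators come in pairs; RecordFields and Between are hypothetical names.

```csharp
using System;
using System.Linq;

public static class RecordFields
{
    // Splits on the separator and keeps every other piece (indexes 1, 3, 5, ...),
    // which are the pieces enclosed by a pair of separators.
    public static string[] Between(string contents, char separator = '\u001E')
    {
        return contents
            .Split(separator)
            .Where((part, index) => index % 2 == 1)
            .ToArray();
    }
}
```

If the file contains an odd number of separators, the last piece returned is text that follows an unclosed separator, so validate the count first when that can happen.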

Replace all characters and first 0's (zeroes)

I am trying to replace all characters inside a string using a regular expression, except the number, and the number should not start with 0.
How can I achieve this using a regular expression?
I have tried multiple things like @"^([1-9]+)(0+)(\d*)" and "(?<=[1-9])0+", but those do not work.
Some examples of the text could be hej:\\\\0.0.0.22, hej:22, hej:\\\\?022 and hej:\\\\?22, and the result should be 22 in all cases.
Rather than replace, try and match against [1-9][0-9]*$ on your string. Grab the matched text.
Note that as .NET regexes match Unicode number characters if you use \d, here the regex restricts what is matched to a simple character class instead.
(note: regex assumes matches at end of line only)
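A small sketch of the match-instead-of-replace idea, checked against the four examples from the question. TrailingNumber and Extract are hypothetical names, and returning an empty string when nothing matches is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingNumber
{
    // A non-zero digit followed by any digits, anchored at the end of the input.
    private static readonly Regex Pattern = new Regex(@"[1-9][0-9]*$");

    // Returns the number at the end of the input without leading zeroes,
    // or an empty string when the input does not end in such a number.
    public static string Extract(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success ? match.Value : string.Empty;
    }
}
```

The regex engine tries start positions from left to right, so in ?022 it skips the 0 (which cannot match [1-9]) and matches 22.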
According to one of your comments hej:\\\\0.011.0.022 should yield 110022. First select the relevant string part from the first non zero digit up to the last number not being zero:
([1-9].*[1-9]\d*)|[1-9]
[1-9] is the first non zero digit
.* are any number of any characters
[1-9]\d* are numbers, starting at the first non-zero digit
|[1-9] includes cases consisting of only one single non zero digit
Then remove all non-digits (\D):
string result;
Match match = Regex.Match(input, @"([1-9].*[1-9]\d*)|[1-9]");
if (match.Success) {
    // Strip the dots and any other non-digit characters from the selection.
    result = Regex.Replace(match.Value, @"\D", "");
} else {
    result = "";
}
Use the following:
[1-9][0-9]*$
You don't need to do any recursion, just match that.
Here is something you can try, The87Boy. You can play around with it or add to the pattern as you like.
string strTargetString = @"hej:\\\\*?0222\";
string pattern = "[\\\\hej:0.?*]";
string replacement = " ";
Regex regEx = new Regex(pattern);
string newRegStr = Regex.Replace(regEx.Replace(strTargetString, replacement), @"\s+", " ");
Result from the above example: " 222 " (the digits that survive, padded by one space on each side; call Trim() to drop the padding).
