Get all strings between triple Pipes, e.g. |||Hello|||, from a text - c#

I wanted to get all strings that are surrounded by triple pipes (like |||Hello|||) from a text and found this regex in C#:
Regex regex = new Regex(@".*?\|\|\|(\w+)\|\|\|"); // searches for strings that are surrounded by three pipes >>> |||string|||
foreach (Match match in regex.Matches(strContent))
{
    lstReturn.Add(match.Groups[1].Value);
}
It works as it should with small strings, but not on a large text (freezes without response).
Can you please tell me how I can make this query faster or suggest an alternative?

The .*? at the start of your pattern makes matching slower, since the engine needs to perform more checks every time the subsequent subpatterns fail. Whenever the next char does not start a ||| sequence, the .*? is "expanded" to match that char, and the rest of the pattern is tried again. With very long strings, this leads to excessive backtracking.
You need to remove .*?, since you do not need the part before |||word|||. That leaves \|\|\|(\w+)\|\|\|.
This second pattern also allows for internal optimization, since the regex engine knows the match will start with a hardcoded | char.
You can compare the matching steps of .*?\|\|\|(\w+)\|\|\| and \|\|\|(\w+)\|\|\|:
First one: [image: regex debugger steps]
Second one: [image: regex debugger steps]
You can see that the "red arrows" denoting backtracking fire more often in the first image.
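A minimal sketch of the fix, assuming all you need is the list of words between the pipes. The class and method names (PipeExtractor, ExtractPipeWords) are made up for illustration; the pattern is the second one above, with the leading .*? removed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PipeExtractor
{
    // No leading .*? here, so every match attempt starts at a literal '|'.
    private static readonly Regex TriplePipe = new Regex(@"\|\|\|(\w+)\|\|\|");

    // Hypothetical helper: returns every word wrapped in triple pipes.
    public static List<string> ExtractPipeWords(string text)
    {
        var result = new List<string>();
        foreach (Match match in TriplePipe.Matches(text))
        {
            result.Add(match.Groups[1].Value);
        }
        return result;
    }

    static void Main()
    {
        var words = ExtractPipeWords("foo |||Hello||| bar |||World||| baz");
        Console.WriteLine(string.Join(", ", words)); // prints: Hello, World
    }
}
```

Note that each match consumes its closing pipes, so in a string like |||a|||b||| only a is reported.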

Related

Obtain a particular URL from a large string

I have some exported performance data from Chrome in C# and it contains a large number of URLs. I want one specifically, and only the first occurrence of it. Actually it could be any occurrence, as it's repeated a number of times. However, if I have a string of various garbage and URLs mixed in, how would I find the one that starts with https and ends in mpa?
So it would be like https://thisisaurl.com/2020/11/20/14243324324/324234/test.mpa Note everything between the https and mpa could be different. Actually the thisisaurl.com will probably stay the same but can't be sure right now. Just know the URL would end in mpa.
I've been playing with something like this:
var linkParser = new Regex(@"\b(?:https?://|mpa\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
foreach (Match m in linkParser.Matches(logs[i].Message))
    Console.WriteLine(m.Value);
But wasn't giving me what I'm looking for. Just some other URL starting with https. Appreciate any help.
Also example below:
{"columnNumber":104001,"functionName":"","lineNumber":1,"scriptId":"9","url":"https://www.blabla.com/assets/build/js/show/videoTop-b8f5d35a3719d4f31aee.min.js"},{"columnNumber":82859,"functionName":"makeRequestStandard","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":82357,"functionName":"makeRequest","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":168917,"functionName":"request","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":162205,"functionName":"send","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":132652,"functionName":"postHeartbeat","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":131860,"functionName":"sendHeartbeat","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"hasPostData":true,"headers":{"Content-Type":"application/json","Referer":"https://www.bla.com/info/me/a4Af8ptKlr5gthauHnde5C9JdeJcNnWa","User-Agent":"Mozilla/5.0
https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa\",\":null,\"evs\":[{\"name\":\"\",\"attr\":
So in the case above I want
https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa
Note the string contains a lot more than above, that's just a middle snippet.
You had the right idea using a regular expression; you just need to tweak it a bit. I was able to match the string you were looking for with a quick and dirty regex of https:\/\/[^\s]+[\w]\.mpa. It will match the characters https: literally; \/ will match the character / literally (the backslash is optional in .NET, but harmless); [^\s]+ will match one or more non-whitespace characters (^ is a negation, \s is short for whitespace characters); [\w] will match a word character (i.e. the last letter of value in value.mpa); and \.mpa will match .mpa literally (the dot has to be escaped, otherwise it matches any character). You can tweak it as needed for case insensitivity or other needs.
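A small sketch of how this could be used to get only the first occurrence. It is a variation on the pattern above, not the exact regex from the answer: it assumes the URL sits inside JSON-like text, so quotes and backslashes are excluded from the URL, and it uses a lazy quantifier so the match stops at the first .mpa. The names MpaUrlFinder and FindFirstMpaUrl are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class MpaUrlFinder
{
    // Assumption: a URL never contains whitespace, a double quote or a backslash.
    private static readonly Regex MpaUrl =
        new Regex(@"https://[^\s""\\]+?\.mpa", RegexOptions.IgnoreCase);

    // Returns the first URL ending in .mpa, or an empty string if there is none.
    public static string FindFirstMpaUrl(string text)
    {
        // Regex.Match stops at the first match; Value is "" when nothing matched.
        return MpaUrl.Match(text).Value;
    }

    static void Main()
    {
        string log =
            "\"url\":\"https://tags.blabla2.com/utag/i/comsite/prod/utag.js\"},"
            + "{\"x\":\"https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa\\\",\\\":null";
        Console.WriteLine(FindFirstMpaUrl(log));
        // prints: https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa
    }
}
```

Regex.Match is used instead of Regex.Matches because only the first hit is needed, so the rest of the log is never scanned.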

How do I select all including sensitive case (regex) in c#?

I have a problem with a regex command.
I have a file with tons of lines and a lot of special characters;
this is an Example with all sensitive case 0123456789/*-+.&é"'(-è_çà)=~#{[|`\^#]}²$*ù^%µ£¨¤,;:!?./§<>AZERTYUIOPMLKJHGFDSQWXCVBNazertyuiopmlkjhgfdsqwxcvbn
I tried many regex commands but never get the expected result,
I have to select everything from Example to the end
I tried this command on https://www.regextester.com/ :
\sExample(.*?)+
Image of the result here
And when I tried it in C# the only result I get was : Example
I don't understand why --'
Here's a quick chat about greedy and pessimistic:
Here is test data:
Example word followed by another word and then more
Here are two regex:
Example.*word
Example.*?word
The first is greedy. Regex will match Example, then it will take .*, which consumes everything all the way to the END of the string, and then works backwards, spitting one character at a time back out, trying to make the match succeed. It will succeed when Example word followed by another word is matched, the .* having matched word followed by another (and the spaces at either end).
The second is pessimistic (more commonly called lazy); it nibbles forwards along the string one character at a time, trying to match. Regex will match Example, then it'll take one more character into the .*? wildcard, then check if it found word, which it did. So the .*? only matches a single space, and the full match in pessimistic mode is Example word.
Because you say you want the whole string after Example, I recommend a greedy quantifier, so it just immediately takes the whole string that remains and declares a match, rather than nibbling forwards one character at a time (slow).
This, then, will match (and capture) everything after Example:
\sExample(.*)
The brackets make a capture group. In C# we can name the group using ?<namehere> at the start of the brackets, and then everything that .* matches can be retrieved with:
Regex r = new Regex(@"\sExample(?<x>.*)"); // verbatim string, so \s reaches the regex engine intact
Match m = r.Match("an Exampleblahblah"); // \s needs a whitespace char before Example
Console.WriteLine(m.Groups["x"].Value); //prints: blahblah
Note that . doesn't match a newline unless you enable RegexOptions.Singleline when you create the regex, so keep that in mind if your data contains newlines.
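The difference between the two quantifiers, and the effect of RegexOptions.Singleline, can be seen with a short sketch like the following. The helper name FirstMatch and the two-line test string are made up for illustration; the first test string and both patterns are the ones from the explanation above.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Returns the text of the first match of pattern in input ("" if none).
    public static string FirstMatch(string input, string pattern,
                                    RegexOptions options = RegexOptions.None)
    {
        return Regex.Match(input, pattern, options).Value;
    }

    static void Main()
    {
        string data = "Example word followed by another word and then more";

        // Greedy: .* runs to the end, then backs off to the LAST "word".
        Console.WriteLine(FirstMatch(data, @"Example.*word"));
        // prints: Example word followed by another word

        // Lazy: .*? grows one char at a time and stops at the FIRST "word".
        Console.WriteLine(FirstMatch(data, @"Example.*?word"));
        // prints: Example word

        string twoLines = "Example one\ntwo";
        // Without Singleline, '.' stops at the newline.
        Console.WriteLine(FirstMatch(twoLines, @"Example.*").Length); // prints: 11
        // With Singleline, '.' also matches '\n'.
        Console.WriteLine(
            FirstMatch(twoLines, @"Example.*", RegexOptions.Singleline).Length); // prints: 15
    }
}
```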

Fastest regex for first occurrence of a word

I would like my regex to capture the following kind of strings as two Urls with "%3f" inside them.
https://*****%3f****%3D,https://*****%3f****%3D …
Where each string URL of this type should be captured by itself. Note - The * is here for simplification and the URLS can be in any part of the big string with anything in between.
My regex now is:
(https://\S+?%3f)(?<toDelete>\S+?%3D)
But I've been asked to see if there's a non-lazy approach for this (or just a faster version), as lazy matching is much slower than greedy matching, and this regex will be run over huge strings and data flows.
Note that the reason I can't simply use \S* is that doing so would capture everything from the first http to the last %3D in one match.
You could probably split the string on commas and then take the substring up to the %3f value.
If you want to make the \S*? pattern work "faster" you must take into account what kind of context this part of a pattern should be aware of.
You are matching any char that is not a whitespace char, any amount of times, up to the first occurrence of %3f. That is, you want to match any chars other than % and whitespace or % chars that are not followed with 3f. That makes (?:[^\s%]|%(?!3f))*. However, alternation ruins the whole idea of optimization. You need to use the "unroll-the-loop" approach: [^%\s]*(?:%(?!3f)[^%\s]*)*.
So, the whole pattern will look like
https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f
Or with the Delete part:
(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)
For short strings, this last pattern might work a tiny bit slower than the \S+? based pattern, but it becomes much more efficient when the matched string becomes longer.
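Here is a rough sketch of the unrolled pattern in use. The class name, the method name and the sample URLs are invented for the example; the pattern itself is the one given above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UnrolledUrlMatcher
{
    // Unrolled loop: runs of "safe" chars, with a '%' allowed in between
    // only when it does not start the terminator (%3f or %3D).
    private static readonly Regex UrlPattern = new Regex(
        @"(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)");

    // Hypothetical helper: returns each matched URL as a separate string.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    static void Main()
    {
        string input = "https://a.example/p%20q%3fx%3D,https://b.example/r%3fy%3D";
        foreach (string url in FindUrls(input))
        {
            Console.WriteLine(url);
        }
        // prints:
        // https://a.example/p%20q%3fx%3D
        // https://b.example/r%3fy%3D
    }
}
```

The %20 in the first URL shows why the lookahead is needed: a % that does not start %3f is simply consumed and the loop continues. The lookaheads are case-sensitive as written, so %3F would not end the first part unless RegexOptions.IgnoreCase is added.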

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check whether the following string matches this pattern:
string str = "CRSSA.T,";
var pattern = @"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
The site http://www.regexr.com/ says it is not a match (everything matches except the last comma), but this code prints True. Is that possible?
thanks ahead! :)
First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W) and are not always intuitive. E.g. \w includes the underscore and, depending on the regex engine, a lot more than just [A-Za-z0-9_]; in .NET, \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.
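The effect of the anchors can be checked with a short sketch like this one. The class and method names are made up; the two patterns are the simplified regex from above, once without and once with ^ and $.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Unanchored: true if ANY substring matches.
    public static bool MatchesAnywhere(string s)
    {
        return Regex.IsMatch(s, @"(\w+\.\w+)+(,\w+\.\w+)*");
    }

    // Anchored: true only if the WHOLE string matches.
    public static bool MatchesWhole(string s)
    {
        return Regex.IsMatch(s, @"^(\w+\.\w+)+(,\w+\.\w+)*$");
    }

    static void Main()
    {
        Console.WriteLine(MatchesAnywhere("CRSSA.T,"));  // prints: True (substring CRSSA.T)
        Console.WriteLine(MatchesWhole("CRSSA.T,"));     // prints: False (trailing comma)
        Console.WriteLine(MatchesWhole("CRSSA.T,AB.C")); // prints: True
    }
}
```

In .NET, $ also matches just before a final newline; if that must be rejected as well, \z can be used in place of $.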

To find everything between { }

I'm new to regex and was hoping for a pointer towards finding words between { } brackets where the first letter is uppercase and the following letters are lowercase. I also want to ignore any numbers, as well as words which contain numbers.
{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}
so I would only want to bring back matches for:
Test
Tesgd
Abc
I've looked at using \b and \w for words that are bound and [Az] for upper followed by lower but not sure how to only get the words which are between the brackets only as well.
Here is your solution:
Regex r = new Regex(@"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
    Console.WriteLine(match.Groups["myword"].Value);
I assumed it is OK to match words that are inside braces but not at the deepest nesting level.
Let's dissect the regex a bit. AAA means an arbitrary expression. www means an arbitrary identifier (sequence of letters)
. is any character
[A-Z] is as you can guess any upper case letter.
[^}] is any character but }
,|}|\Z means , or } or end-of-string
*? means match what came before 0 or more times, but lazily (do a minimal match if possible, and spit out what you swallowed to make as many matches as possible)
(?<=AAA) means AAA should match on the left before you really try to match something.
(?=AAA) means AAA should match on the right after you really match something.
(?<www>AAA) means match AAA and give the string you matched the name www. With the ExplicitCapture option, only named groups like this one capture.
(?<depth>) matches the empty string, so it always succeeds, and pushes a capture onto the "depth" stack.
(?<-depth>) also matches the empty string and pops a capture from the "depth" stack. It fails if the stack is empty.
We use the last two items to ensure that we are inside braces. It would be much simpler if there were no nested braces, or if matches occurred only at the deepest nesting level.
The regular expression works on your example and probably has no bugs. However I tend to agree with others, you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful but only if you are willing to spend effort to learn them.
Edit: I corrected a careless mistake in the regex (replaced .*? with [^}]*? in two places). Moral of the story: it's very easy to introduce bugs in regexes.
In answer to your original question, I would have offered this regex:
\b[A-Z][a-z]+\b(?=[^{}]*})
The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters it can as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
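As a rough illustration of that lookahead for the non-nested case, here is a sketch run against the sample string from the question. The class and method names (BraceWords, Find) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BraceWords
{
    // A capitalized word, followed (after any non-brace chars) by a closing brace.
    private static readonly Regex WordInBraces =
        new Regex(@"\b[A-Z][a-z]+\b(?=[^{}]*})");

    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordInBraces.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }

    static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", Find(s))); // prints: Test, Tesgd, Abc
    }
}
```

Test2 and Tsg12 are rejected by the trailing \b, because a digit directly after the letters is still a word character.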
Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using Balanced Groups:
Regex r = new Regex(@"
    \b[A-Z][a-z]+\b
    (?!
      (?>
        [^{}]+
        |
        { (?<Open>)
        |
        } (?<-Open>)
      )*
      $
      (?(Open)(?!))
    )", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
foreach (Match m in r.Matches(s))
{
    Console.WriteLine(m.Value);
}
output:
Testc
Testf
Testh
I'm still using a lookahead, but this time I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, then by the time the lookahead reaches the end of the string ($), the value of Open will be zero. Otherwise, whether it's positive or negative, the conditional construct (?(Open)(?!)) will interpret it as "true" and try to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.
Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?
Do the filtering in two steps. Use the regular expression
@"\{(.*)\}"
to pull out the pieces between the brackets, and the regular expression
@"\b([A-Z][a-z]+)\b"
to pull out each of the words that begins with a capital letter and is followed by lower case letters.
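The two steps could be wired together roughly as follows. One deliberate change, which is an assumption on my part: the first pattern here is \{([^{}]*)\} rather than \{(.*)\}, because the greedy .* would span from the first { to the last } and also pick up words between separate brace groups. This version only handles non-nested braces. The names TwoStepFilter and Find are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TwoStepFilter
{
    // Step 1: each non-nested { ... } group (assumed variant of \{(.*)\}).
    private static readonly Regex BraceGroup = new Regex(@"\{([^{}]*)\}");
    // Step 2: words made of one uppercase letter followed by lowercase letters.
    private static readonly Regex CapitalizedWord = new Regex(@"\b([A-Z][a-z]+)\b");

    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match braces in BraceGroup.Matches(text))
        {
            string inner = braces.Groups[1].Value;
            foreach (Match word in CapitalizedWord.Matches(inner))
            {
                words.Add(word.Groups[1].Value);
            }
        }
        return words;
    }

    static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", Find(s))); // prints: Test, Tesgd, Abc
    }
}
```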
