Regex matching to extract multi-line text regions (C#)


I'm looking to capture text regions in a large text block, created in the following format:
...
[region:region-name]
multi line
text block
[/region]
...
[region:another-region-name]
more
multi-line text
[/region]
I have this almost worked out with
\[region:(?'link'.*)\](?'text'(.|[\r\n])*)\[/region\]
This works when there is only one region in the entire text. But when there are multiple, it gives me just one block, with every other region included in the 'text' of that one.
I have a feeling that this is to be solved using a negative look ahead, but being a non-pro with regex, I don't know how to modify the above to do it right.
Can someone help?

You can do this without lookahead:
\[region:(?'link'.*)\](?'text'(?s).*?)\[/region\]
The additional ? makes the * quantifier lazy, so it will match as few characters as possible. And the (?s) allows the dot to match newlines after this position, so you don't have to use the (.|[\r\n]) construction (an alternative would be [\s\S]).

You don't need a negative lookahead, just need to change (?'text'(.|[\r\n])*) to be "non-greedy", so that it will match the first instance of [/region] rather than the last. You can do this by adding ? after *, so the resulting pattern would be:
\[region:(?'link'.*)\](?'text'(.|[\r\n])*?)\[/region\]
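To see the lazy quantifier at work, here is a minimal C# sketch that runs the (?s) variant of the pattern over a block containing several regions. The class name RegionExtractor and the Trim() call on the captured text are assumptions added for illustration; they are not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegionExtractor
{
    // *? is lazy, so 'text' stops at the first [/region] instead of the last.
    // (?s) lets '.' match newlines, but only inside the 'text' group.
    private static readonly Regex RegionPattern =
        new Regex(@"\[region:(?'link'.*)\](?'text'(?s).*?)\[/region\]");

    // Returns (region name, region text) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var regions = new List<KeyValuePair<string, string>>();
        foreach (Match m in RegionPattern.Matches(input))
        {
            // Trim() drops the newlines right after the opening tag and
            // right before the closing tag (an illustrative choice).
            regions.Add(new KeyValuePair<string, string>(
                m.Groups["link"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return regions;
    }
}
```

Each Match now covers exactly one [region:...] ... [/region] pair, so iterating over Matches yields one entry per region.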

Related

Regex groups expression not capturing content

I'm trying to create a large regex expression where the plan is to capture 6 groups.
It's going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail from my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups: the word and the 2 number blocks.
Any idea why this is?
Thank you.

The problem is that \w{+} matches a single word character followed by one or more { characters and then a } character. If you want one or more word characters, use + without the curly braces. Braces are meant for specifying an exact count or a range, and are matched as literal characters when their contents do not follow that format.
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
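As a rough check of the corrected expression in C#, the sketch below pulls the six groups out of one tab-separated line. The class name LogLineParser is made up for illustration. Two small departures from the pattern above are assumptions of this sketch: the dot before the fractional seconds is escaped so it only matches a literal '.', and the pattern is anchored with ^.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineParser
{
    // Six groups: timestamp, level, class, two 5-digit numbers, message.
    // \w+ replaces the faulty \w{+}.
    private static readonly Regex LinePattern = new Regex(
        @"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*)$");

    // Returns the six captured fields, or an empty array if the line does not match.
    public static string[] Parse(string line)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success) return new string[0];

        var fields = new string[6];
        for (int i = 0; i < 6; i++)
        {
            fields[i] = m.Groups[i + 1].Value; // group 0 is the whole match
        }
        return fields;
    }
}
```

Because the sample line in the question was pasted with spaces, the test input needs real tab characters (\t) between the fields for this pattern to match.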

RegEx to capture text between two delimiter characters including 'shared'

If I have the following text...
The quick :brown:fox: jumped over the lazy :dog:.
I would like a regular expression to capture all the words that are between 2 : characters. In the above example it should return :brown:, :fox:, :dog:.
So far, I have this (\:{1}.\w*\s*\:{1}) which returns :brown: and :dog:. I can't quite figure out how to share the : between the 2 matching groups so that it will also return ':fox:'.

Here is a simple pattern which can be made to work:
(?<=:)(\w+)(?=:)
This uses lookarounds to make sure that one or more word characters are surrounded before and after by colons.
The match would be available as the first capture group. Actually, it should also be available as the entire match itself, because lookarounds do not consume anything.
I like the above lookaround approach because it is clean and simple (at least in my mind). If, for some reason, you don't want any lookarounds, then just use the following pattern:
:(\w+):
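But note that now you explicitly have to access the first capture group to obtain the matching word without colons on either side. Also note that this pattern consumes both colons, so in the example above it returns brown and dog but misses fox, whose leading colon was already used up by the previous match.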
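The difference between asserting the colons and consuming them is easy to see side by side. This is a small sketch; the class and method names are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ColonWords
{
    // Lookarounds only assert that a colon is there, so the colon closing
    // one word can also open the next word.
    public static string[] WithLookarounds(string input)
    {
        return Regex.Matches(input, @"(?<=:)\w+(?=:)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Here both colons are part of the match, so a colon that closed one
    // match is no longer available to open the following one.
    public static string[] WithoutLookarounds(string input)
    {
        return Regex.Matches(input, @":(\w+):")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }
}
```

On the sample sentence, the first method finds brown, fox and dog, while the second finds only brown and dog. If the surrounding colons are wanted in the output, they can be added back around each word afterwards.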

C# (.net) RegEx.Match Substring between newlines - using newline as positive lookahead limit

I have been tinkering with RegEx and got some great results and I want to keep using it.
Right now I am stuck at finding a string that is set between 2 newlines. Here is the sample target text (note this is one of thousands of possible texts):
Substance information in Wikipedia
FORMULA
CH2O
Grafik
Molar mass: 30,03 g/mol
The target is "CH2O".
I tried (?<=FORMULA).*(?=Grafik) with RegexOptions.Singleline and it starts right after FORMULA but goes all the way down and ignores Grafik.
I tried it without singleline but it returns nothing since the . stops at the \n. Since I want the newline as a limit, the following has no singleline.
The closest I have gotten were these:
(?<=FORMULA)[\w\W]+(?=Grafik)
(?<=FORMULA)[\w\W]*(?=Grafik)
However, if the Grafik changes, I'd like to track the newline instead of it.
(?<=FORMULA)[\w\W]*(?=\n) or (?<=FORMULA)[\w\W]*(?=\r) will still match Grafik for some reason...
Does anyone know a more optimal way to make the positive lookahead the newline?
Please don't answer anything unrelated to RegEx.

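Would this work for you?
(?<=FORMULA\s+)\S+
It matches the first run of non-whitespace characters that follows FORMULA and at least one whitespace character (such as the newline), so the match ends at the next whitespace or newline.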
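A short C# sketch of that pattern against the sample text follows. The class name FormulaFinder is an assumption for illustration. Note that the \s+ inside the lookbehind relies on .NET supporting variable-length lookbehind, which many other regex flavors do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormulaFinder
{
    // Returns the first whitespace-free token after "FORMULA" plus whitespace,
    // or an empty string if there is none.
    public static string Find(string text)
    {
        // No RegexOptions needed: \s already matches \r and \n.
        Match m = Regex.Match(text, @"(?<=FORMULA\s+)\S+");
        return m.Success ? m.Value : string.Empty;
    }
}
```

Since \S+ cannot cross whitespace, this only suits formulas written without spaces, as in the CH2O sample.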

Matching multiple lines up until a separator line?

Learning myself some Regex, while trying to parse a datasheet, and I'm thinking there's not an easy way (in Regex, I mean.. in C#, sure!) to do this. Say I have a file with the lines:
0000AA One Token - Value
0000AA Another Token- Another Value
0000AA YA Token - Yet Another
0000AA Yes, Another - Even More
0000AA
0000AA ______________________________________________________________________
0000AA This line - while it will match the regex, shouldn't.
So I have an easy multi-line regex:
^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*?)$
This loads All the 'Tokens' into 'token', and all the values into 'value' group. Pretty simple! However, the Regex ALSO matches the bottom line, putting 'This line' into the token, and 'while it will [...]' into the value.
Essentially, I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone, or will I need to modify my incoming string first to .Split() on the ____ separator line?
Cheers all --Mike.

Parsing such a text file with regex only would not be using the right tool for the job. Although possible, it would be both inefficient and unnecessarily complex.
I would actually not load all the text into a string and split on this line either, as it's not the most efficient way of doing this. I would rather read through the file in a loop, one line at a time, processing each line as needed. Then stop processing when you reach this particular line.
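The line-at-a-time approach could look roughly like the sketch below. It assumes that every line starts with a six-character code (as in the sample data) and that the separator is that code followed by a run of underscores. Both patterns and the DatasheetReader name are illustrative guesses, not taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class DatasheetReader
{
    // Assumed line shape: six-character code, whitespace, "token - value".
    private static readonly Regex TokenLine =
        new Regex(@"^\s*\w{6}\s+(?<token>.*?)-(?<value>.*)$");

    // Assumed separator shape: the code followed by four or more underscores.
    private static readonly Regex SeparatorLine =
        new Regex(@"^\s*\w{6}\s+_{4,}\s*$");

    public static List<KeyValuePair<string, string>> ReadTokens(TextReader reader)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            // Stop as soon as the separator is reached; later lines are never read.
            if (SeparatorLine.IsMatch(line)) break;

            Match m = TokenLine.Match(line);
            if (m.Success)
            {
                pairs.Add(new KeyValuePair<string, string>(
                    m.Groups["token"].Value.Trim(),
                    m.Groups["value"].Value.Trim()));
            }
        }
        return pairs;
    }
}
```

Passing a StreamReader over the file (or a StringReader in tests) keeps memory use flat regardless of file size, which is the point of this answer.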

"I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone?"
Sure it's possible. Add a lookahead to make sure such a line follows, something like:
(?=(?s).*^\w{6}[ \t]+_{4,})
Add this to the end of your expression to make sure that such a line follows. Eg:
(?m)^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*)$(?=(?s).*^\w{6}[ \t]+_{4,})
(I also added the m and s flags inline in the expression.)
This is not very efficient though, as the regex engine will probably need to scan through most of the string for every match.
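For comparison, here is a hedged C# sketch of the lookahead idea. It is adapted rather than copied: the line prefix is taken to be any six word characters (matching the \w{6} used in the lookahead), [ \t] replaces \s so the prefix cannot spill across lines, and line endings are assumed to be plain \n. The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SeparatorLookahead
{
    private static readonly Regex AboveSeparator = new Regex(
        // (?m) makes ^ and $ match at line boundaries.
        @"(?m)^[ \t]*\w{6}[ \t]+(?<token>.*?)-(?<value>.*)$" +
        // Succeed only if a separator line (code + underscores) appears later.
        // (?s) is scoped to the lookahead, so '.' crosses newlines only there.
        @"(?=(?s).*^\w{6}[ \t]+_{4,})");

    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in AboveSeparator.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["token"].Value.Trim(),
                m.Groups["value"].Value.Trim()));
        }
        return pairs;
    }
}
```

Lines after the separator fail the lookahead because no further separator follows them. Every successful match rescans ahead to the separator, which is the inefficiency mentioned above.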

Regex match words that are not part of a larger word

I am trying to use Regex in C# to look for a list of keywords in a bunch of text. However I want to be very specific about what the "surrounding" text can be for something to count as a keyword.
So for example, the keyword "hello" should be found in (hello), hello., hello< but not in hellothere.
My main problem is that I don't REQUIRE the separators, if the keyword is the first word or the last word it's okay. I guess another way to look at it is that the beginning-of-the-file and the end-of-the-file should be acceptable separators.
I'm new to Regex so I was hoping someone could help me get the pattern right. So far I have:
[ <(.]+?keyword[<(.]+?
where <, (, . are some example separators and keyword is of course the keyword I am looking for.

You could use the word boundary anchor:
\bkeyword\b
which would find your keyword only when not part of a larger word.

You will want to look into the word boundary (\b) to avoid matching keywords that appear as a part of another word (as in your hellothere example).
You can also add matching at beginning of line (^) and end of line ($) to control the position where keywords may appear.
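A minimal sketch of the word-boundary approach in C# is shown below. The KeywordFinder name is made up, and the Regex.Escape call is an added precaution for keywords that contain regex metacharacters, which the answers above do not mention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordFinder
{
    // True if 'keyword' occurs in 'text' as a whole word.
    public static bool ContainsKeyword(string text, string keyword)
    {
        // \b matches between a word character and a non-word character,
        // and also at the very start or end of the input.
        string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.IsMatch(text, pattern);
    }
}
```

Because \b also matches at the start and end of the input, no separate handling is needed for a keyword that is the first or last word. It does assume the keyword itself begins and ends with a word character.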

I think you want something like:
(^|[ <(.])+?keyword($|[<(.]+?)
The ^ and $ chars symbolise the start and end of the input text, respectively. (If you specify the Multiline option, they match the start/end of each line rather than of the whole text; without it they anchor to the whole input, which seems to be what you want.)
