Learning myself some Regex, while trying to parse a datasheet, and I'm thinking there's not an easy way (in Regex, I mean.. in C#, sure!) to do this. Say I have a file with the lines:
0000AA One Token - Value
0000AA Another Token- Another Value
0000AA YA Token - Yet Another
0000AA Yes, Another - Even More
0000AA
0000AA ______________________________________________________________________
0000AA This line - while it will match the regex, shouldn't.
So I have an easy multi-line regex:
^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*?)$
This loads all the tokens into the 'token' group and all the values into the 'value' group. Pretty simple! However, the regex ALSO matches the bottom line, putting 'This line' into the token and 'while it will [...]' into the value.
Essentially, I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone, or will I need to modify my incoming string first to .Split() on the ____ separator line?
Cheers all --Mike.
Parsing such a text file with regex only would not be using the right tool for the job. Although possible, it would be both inefficient and unnecessarily complex.
I would actually not load all the text into a string and split on this line either, as it's not the most efficient way of doing this. I would rather read through the file in a loop, one line at a time, processing each line as needed. Then stop processing when you reach this particular line.
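As a rough illustration of that loop, here is a minimal sketch in Python rather than C#. The names ENTRY, SEPARATOR and parse_until_separator are made up for the example, and the six-character prefix is matched loosely with \w{6} because the sample lines and the question's regex disagree on whether the letters or the digits come first.

```python
import re

# Hypothetical patterns for this sketch; the prefix is assumed to be any
# six word characters followed by whitespace.
ENTRY = re.compile(r"^\s*\w{6}\s+(?P<token>.*?)\s*-\s*(?P<value>.*?)\s*$")
SEPARATOR = re.compile(r"^\s*\w{6}\s+_{4,}\s*$")


def parse_until_separator(lines):
    """Collect (token, value) pairs and stop at the first separator line."""
    pairs = []
    for line in lines:
        if SEPARATOR.match(line):
            break  # everything after the ____ line is ignored
        m = ENTRY.match(line)
        if m:
            pairs.append((m.group("token"), m.group("value")))
    return pairs


sample = [
    "0000AA One Token - Value",
    "0000AA Another Token- Another Value",
    "0000AA YA Token - Yet Another",
    "0000AA Yes, Another - Even More",
    "0000AA",
    "0000AA ______________________________________________________________________",
    "0000AA This line - while it will match the regex, shouldn't.",
]
pairs = parse_until_separator(sample)
```

In real use the same function could be handed an open file object, since iterating over a file yields one line at a time and nothing after the separator is ever read.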
I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone?
Sure it's possible. Add a lookahead to make sure such a line follows, something like:
(?=(?s).*^\w{6}[ \t]+_{4,})
Add this to the end of your expression to make sure that such a line follows. E.g.:
(?m)^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*)$(?=(?s).*^\w{6}[ \t]+_{4,})
(Also added m and s flags in the expression.)
This is not very efficient tho, as the regex engine will probably need to scan through most of the string for every match.
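For anyone who wants to try the lookahead idea outside .NET, here is a sketch of the same technique in Python. It is an adaptation, not the exact expression above: Python writes named groups as (?P<name>...), only accepts a mid-pattern flag in the scoped form (?s:...), and I have assumed a loose \w{6} prefix and [ \t] instead of \s so that a match cannot run across line breaks.

```python
import re

# Entry line, followed by a lookahead that demands a separator line
# somewhere later in the text. (?s:.*) lets only that part cross newlines.
PATTERN = re.compile(
    r"^[ \t]*\w{6}[ \t]+(?P<token>[^\n]*?)-(?P<value>[^\n]*)$"
    r"(?=(?s:.*)^\w{6}[ \t]+_{4,})",
    re.MULTILINE,
)

text = "\n".join([
    "0000AA One Token - Value",
    "0000AA Another Token- Another Value",
    "0000AA YA Token - Yet Another",
    "0000AA Yes, Another - Even More",
    "0000AA",
    "0000AA ______________________________________________________________________",
    "0000AA This line - while it will match the regex, shouldn't.",
])

# Strip the padding around the hyphen after matching.
pairs = [
    (m.group("token").strip(), m.group("value").strip())
    for m in PATTERN.finditer(text)
]
```

The line below the separator still fits the entry part of the pattern, but the lookahead finds no separator after it, so it is left out.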
One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.
Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Explanation:
the crucial part here is the if-else regex a la (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if a type-parameter pair is followed by a comma and another word character, match the comma and its surrounding whitespace; otherwise match only optional whitespace.
However, I feel using regex could turn out to be the wrong choice for your task at hand. There are some lightweight parser frameworks out there, e.g. Sprache.
You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables.
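One thing to keep in mind with the optional comma is that it also accepts a list with no commas at all, or with a trailing comma. A stricter way to write the list is "one parameter, then any number of comma-plus-parameter repeats". The sketch below shows that in Python (whose re module has no lookahead conditionals, so the if-else form cannot be copied over directly); SIGNATURE and parse_signature are names invented for the example.

```python
import re

TYPE = r"(?:bool|int|string|float)"
PARAM = rf"{TYPE}\s+\w+"

# Parameter list: empty, or one parameter followed by ", parameter" repeats.
SIGNATURE = re.compile(
    rf"(?:\s|^)(?P<ret>{TYPE}|void)\s+(?P<name>\w+)\s*"
    rf"\(\s*(?P<params>(?:{PARAM}(?:\s*,\s*{PARAM})*)?)\s*\)"
)


def parse_signature(text):
    """Return (return_type, name, [(type, name), ...]) or None if no match."""
    m = SIGNATURE.search(text)
    if m is None:
        return None
    params = [
        tuple(p.split())
        for p in m.group("params").split(",")
        if p.strip()
    ]
    return m.group("ret"), m.group("name"), params


parsed = parse_signature("void FunctionName(int a, string b)")
```

Splitting the captured list on commas afterwards sidesteps the fact that a repeated capture group only keeps its last repetition in Python (in .NET the Captures collection of the group would give you every repetition).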
I need to check a string that contains a list of e-mails. These emails are usually separated by commas, but I need to check if somewhere in that list there is a delimiter other than a comma. Here's an example:
email1#email.com,email2#email.com,email3#email.com#email4#email.com
I need to identify that different character and replace it with a comma.
I cannot just use a regex to identify special characters other than the comma and replace them, because emails may contain some of these characters. So I need to find whatever sits between two e-mails.
I made the following regex to identify an e-mail and I believe it will cover most of the emails:
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#[a-z0-9]+(\.[a-z0-9]+)+$
But I'm a little lost on how to use it to solve my problem in C#. I need to capture whatever was between two matches of this regex and replace it with a comma.
Could anyone help me?
Thank you.
Your problem is unsolvable because the delimiter cannot always be determined, even by a human.
Consider this input where the delimiter is a .:
user#server.co.uk.user#otherServer.com
Is this:
user#server.co | uk.user#otherServer.com
or is it:
user#server.co.uk | user#otherServer.com
Or this input:
user#server.intuser#otherServer.com
Is it delimiter u:
user#server.int | ser#otherServer.com
Or delimiter t:
user#server.in | user#otherServer.com
If you're not willing to accept a certain percentage of failures, you're better off looking for ways not to receive this input to begin with.
([^#,]+#[^.]+\.\w{3}(?!,|$)).
Try this. Replace with $1, and see the demo:
http://regex101.com/r/tF4jD3/15
P.S. This will work for email IDs of the format something#something.com.
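To see what this replacement does and where it stops working, here is a small Python sketch of the same pattern (Python writes the replacement as \1 instead of $1). BAD_DELIMITER and normalize are names made up for the example, and the addresses keep the # notation used throughout this thread.

```python
import re

# An address ending in a three-character TLD that is NOT followed by a comma
# or the end of the input, plus the single character that comes after it.
BAD_DELIMITER = re.compile(r"([^#,]+#[^.]+\.\w{3}(?!,|$)).")


def normalize(addresses):
    """Replace the one character after such an address with a comma."""
    return BAD_DELIMITER.sub(r"\1,", addresses)


fixed = normalize(
    "email1#email.com,email2#email.com,email3#email.com#email4#email.com"
)
slash = normalize("email1#email.com/email2#email.com")
glued = normalize("email1#email.comemail2#email.com")
```

The first two inputs come out as clean comma-separated lists. The third one, where two addresses are glued together with no delimiter, loses the first letter of the second address, because the pattern always treats exactly one character as the delimiter. TLDs that are not three characters long are not handled either.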
I can't think of an elegant way to achieve this. If you don't mind an inelegant solution, you can replace any top-level domain plus one character with the same TLD plus a comma.
You'll end up replacing ".com#" with ".com,", ".eu*" with ".eu," and so on. The replacement could be done with Regex, with one pass for each TLD you want to handle.
One option you could try is to split the incoming string on the # symbol and check that each part of the resulting array, except the first and last, has a comma in it.
If you find one that is missing the comma, search for the .com or .net or .org in that element and stick a comma after it.
Lastly, just splice the list back together with the # symbol.
Thanks for the replies.
The string must have only commas as the delimiter.
The example I mentioned was just an illustration. The list was generated by a jQuery plugin with a flaw that was only noticed after it had allowed entries like "email1#email.comemail2#email.com", or any other combination that deviates from the standard "email1#email.com,email2#email.com", to be saved.
My main concern is cases like "email1#email.com/email2#email.com"
I'm trying to automate a search for this kind of inconsistency, as prevention.
I thought about using regex but I really do not know if it is the best approach.
I am now thinking that, as it is not a critical part of the system, it would be simpler just to use a list of invalid characters to do the replacement.
But I will try vks's solution.
Thank you all.
How do I parse a text file like:
{:block1:}
%param1%= value1
%param2% = value2
%paramn% =valuen
{:block2:}
1st html - sourcecode Just copy 1:1
{:block3:}
2nd html - sourcecode Just copy 1:1
...{:block4:}
3rd html - sourcecode Just copy 1:1
I would like to convert the data to an XmlDocument.
Blocks are identified by {::} and params are identified by %%=
Thanx a lot.
What I'm looking for is more an idea than complete code. I have found many examples of reading ini files using RegEx and a TextReader to get some lines. The problem is that more than one {:block:} may appear within a line. There are so many whitespaces, line breaks...
If the problem is that more than one {:block:} can appear within a line, could you replace every "{" with "\r\n{" (in other words, a newline followed by "{") to guarantee that every block starts on its own line? Would the extra whitespace cause a problem? Otherwise, you could write a Regex to identify only those blocks where you need to insert a line break.
Whitespace and line breaks are both matched by the Regex character class \s. A common way to use \s in Regex is either as "\s+" or "\s*", depending on whether the whitespace is required or optional.
It would also help if you were more specific about particular problems.
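As one possible starting point, the idea can be sketched without caring about line breaks at all: split the text on the {:name:} markers and handle each body on its own. The example below is in Python with xml.etree.ElementTree standing in for XmlDocument. The element names ("document", "block", "param") are invented, and treating a block as a parameter block whenever it contains a %name%= line is an assumption on my part.

```python
import re
import xml.etree.ElementTree as ET

BLOCK_MARKER = re.compile(r"\{:(\w+):\}")
PARAM = re.compile(r"%(\w+)%\s*=\s*(.*)")


def to_xml(text):
    """Build an XML tree with one <block> element per {:name:} marker."""
    root = ET.Element("document")
    # With one capture group, split() returns
    # [preamble, name1, body1, name2, body2, ...].
    parts = BLOCK_MARKER.split(text)
    for name, body in zip(parts[1::2], parts[2::2]):
        block = ET.SubElement(root, "block", name=name)
        params = PARAM.findall(body)
        if params:
            # Assumed rule: a body with %key%= lines is a parameter block.
            for key, value in params:
                ET.SubElement(block, "param", name=key).text = value.strip()
        else:
            # Anything else is copied 1:1 as text.
            block.text = body.strip()
    return root


source = (
    "{:block1:}\n%param1%= value1\n%param2% = value2\n"
    "{:block2:}\n<p>hi</p> {:block3:}<b>x</b>"
)
root = to_xml(source)
```

Because the split works on the markers themselves, it does not matter whether two blocks share a line, as {:block3:} does in the sample.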
I'm looking to capture text regions in a large text block, created in the following format:
...
[region:region-name]
multi line
text block
[/region]
...
[region:another-region-name]
more
multi-line text
[/region]
I have this almost worked out with
\[region:(?'link'.*)\](?'text'(.|[\r\n])*)\[/region\]
This works if I only had one region in the entire text. But, when there are multiple, this gives me just one block with every other 'region' included in the 'text' of that one.
I have a feeling that this is to be solved using a negative look ahead, but being a non-pro with regex, I don't know how to modify the above to do it right.
Can someone help?
You can do this without lookahead:
\[region:(?'link'.*)\](?'text'(?s).*?)\[/region\]
The additional ? makes the * quantifier lazy, so it will match as few characters as possible. And the (?s) allows the dot to match newlines after this position, so you don't have to use the (.|[\r\n]) construction (an alternative would be [\s\S]).
You don't need a negative lookahead, just need to change (?'text'(.|[\r\n])*) to be "non-greedy", so that it will match the first instance of [/region] rather than the last. You can do this by adding ? after *, so the resulting pattern would be:
\[region:(?'link'.*)\](?'text'(.|[\r\n])*?)\[/region\]
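A quick way to convince yourself of the greedy/lazy difference is to run both variants over the sample. This is a sketch in Python, so the named groups are written (?P<name>...) instead of (?'name'...) and the dot-matches-newline switch uses the scoped form (?s:...); GREEDY and LAZY are just names for the example.

```python
import re

text = (
    "...\n"
    "[region:region-name]\nmulti line\ntext block\n[/region]\n"
    "...\n"
    "[region:another-region-name]\nmore\nmulti-line text\n[/region]\n"
)

# Greedy body: runs to the LAST [/region] in the text.
GREEDY = re.compile(r"\[region:(?P<link>.*)\](?P<text>(?s:.*))\[/region\]")
# Lazy body: stops at the FIRST [/region] after each opening tag.
LAZY = re.compile(r"\[region:(?P<link>.*)\](?P<text>(?s:.*?))\[/region\]")

greedy_count = len(GREEDY.findall(text))
regions = {
    m.group("link"): m.group("text").strip()
    for m in LAZY.finditer(text)
}
```

The greedy pattern finds a single match that swallows both regions, while the lazy one yields one entry per region.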
I'm currently working on a parser for our internal log files (generated by log4php, log4net and log4j). So far I have a nice regular expression to parse the logs, except for one annoying bit: Some log messages span multiple lines, which I can't get to match properly. The regex I have now is this:
(?<date>\d{2}/\d{2}/\d{2})\s(?<time>\d{2}:\d{2}:\d{2},\d{3})\s(?<message>.+)
The log format (which I use for testing the parser) is this:
07/23/08 14:17:31,321 log
message
spanning
multiple
lines
07/23/08 14:17:31,321 log message on one line
When I run the parser right now, I get only the line the log starts on. If I change it to span multiple lines, I get only one result (the whole log file).
#samjudson:
You need to pass the RegexOptions.Singleline flag into the regular expression, so that "." matches all characters, not just all characters except newlines (which is the default).
I tried that, but then it matches the whole file. I also tried to set the message-group to .+? (non-greedy), but then it matches a single character (which isn't what I'm looking for either).
The problem is that the pattern for the message matches on the date-group as well, so when it doesn't break on a new-line it just goes on and on and on.
I use this regex for the message group now. It works, unless there's a pattern IN the log message which is the same as the start of the log message.
(?<message>(.(?!\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2},\d{3}\s\[\d{4}\]))+)
This will only work if the log message doesn't contain a date at the beginning of the line, but you could try adding a negative look-ahead assertion for a date in the "message" group:
(?<date>\d{2}/\d{2}/\d{2})\s(?<time>\d{2}:\d{2}:\d{2},\d{3})\s(?<message>(.(?!^\d{2}/\d{2}/\d{2}))+)
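Note that this requires the RegexOptions.Multiline flag so that ^ matches at the start of each line, plus RegexOptions.Singleline if the dot is to carry the match across line breaks.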
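Here is a sketch of that negative look-ahead run against the sample log, written in Python (named groups become (?P<name>...), and re.MULTILINE and re.DOTALL play the roles of the Multiline and Singleline options). ENTRY and log are names chosen for the example, and the sample text deliberately has no trailing newline.

```python
import re

ENTRY = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s"
    # Take one character at a time, as long as it is not immediately
    # followed by a date at the start of a line.
    r"(?P<message>(?:.(?!^\d{2}/\d{2}/\d{2}))+)",
    re.MULTILINE | re.DOTALL,
)

log = "\n".join([
    "07/23/08 14:17:31,321 log",
    "message",
    "spanning",
    "multiple",
    "lines",
    "07/23/08 14:17:31,321 log message on one line",
])

messages = [m.group("message") for m in ENTRY.finditer(log)]
```

The newline just before the second date is the character that fails the look-ahead, so the first message ends with "lines" and the second entry is matched separately.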
You obviously need to be able to distinguish "message lines" from "log lines"; if you allow the message part to start with a date/time after a newline, then there is simply no way to determine what is part of a message and what is not. So, instead of using the dot, you need an expression that allows anything that does not include a newline followed by a date and time.
Personally, however, I would not use a regular expression to parse the whole log entry. I prefer using my own loop to iterate over each line and use one simple regular expression to determine whether a line is the start of a new entry or not. Also from the point of readability this would have my preference.
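A minimal sketch of that loop, in Python for brevity: one simple regex decides whether a line starts a new entry, and every other line is appended to the entry before it. ENTRY_START and parse_log are names invented for the example, and lines that appear before the first entry are simply dropped, which is an assumption about what you would want.

```python
import re

# A line that begins with "MM/DD/YY HH:MM:SS,mmm " starts a new log entry.
ENTRY_START = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s"
    r"(?P<message>.*)$"
)


def parse_log(lines):
    """Group raw lines into entries; continuation lines extend the message."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        m = ENTRY_START.match(line)
        if m:
            entries.append(m.groupdict())
        elif entries:
            entries[-1]["message"] += "\n" + line
    return entries


entries = parse_log([
    "07/23/08 14:17:31,321 log\n",
    "message\n",
    "spanning\n",
    "multiple\n",
    "lines\n",
    "07/23/08 14:17:31,321 log message on one line\n",
])
```

The same caveat as with the pure-regex approach applies: a continuation line that happens to begin with a date and time will be taken as a new entry.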
The problem you have is that you need to terminate the RegEx pattern so it knows when one message ends and the next starts.
When you were running in default mode the newline was working as an implicit terminator.
The problem is that once the dot is allowed to match newlines (single-line mode) there's no terminator, so the pattern will gobble up the whole file. A non-greedy match takes as few characters as possible, which will be just one.
Now, if you use the date of the next message as the terminator, I think your parser will only get every other entry.
Is there something else in the file you could use to terminate the pattern?
You might find it a lot easier to parse the file with a proper parser generator - ANTLR can generate one in C#... Context Free parsers only seem hard until you "get" them - after that, they are much simpler and friendlier to use than Regular Expressions...
You need to pass the RegexOptions.Singleline flag into the regular expression, so that "." matches all characters, not just all characters except newlines (which is the default).