I'm trying to create a large regular expression that captures 6 groups.
It will be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option, even if it is heavier than performing a split.
Breaking down the groups in more detail, from my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 digits, 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I reduce the expression to the following, it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups: the word and the 2 number blocks.
Any idea why this is?
Thank you.
The problem is that \w{+} matches a word character followed by one or more { characters and then a final } character. If you want one or more word characters, use the plus without the curly braces. Curly braces are meant for specifying an exact count or a range, such as {4} or {2,5}; when their contents do not fit that format, they match literal curly braces.
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for its explanation panel, to see whether your expression matches up with what you want spelled out in words. However, to test an expression for use in C#, you should use something else, like http://regexstorm.net/tester
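To see the difference concretely, here is a minimal sketch in Python, whose re module also reads {+} as literal braces for this pattern. The sample line is the one from the question with the tabs written as \t; the names LINE, PREFIX, SUFFIX, BROKEN and FIXED are only illustrative.

```python
import re

# Sample line from the question, with the tab separators written explicitly.
LINE = ("2020-03-10T14:09:13.3250000\tVERB\tCallingClass\t17503\t20870\t"
        "Whatever content: this log line had (etc)")

PREFIX = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t"
SUFFIX = r"\t(\d{5})\t(\d{5})\t(.*$)"

BROKEN = re.compile(PREFIX + r"(\w{+})" + SUFFIX)  # {+} is read as literal braces
FIXED = re.compile(PREFIX + r"(\w+)" + SUFFIX)     # one or more word characters

broken_match = BROKEN.match(LINE)
fixed_match = FIXED.match(LINE)

print(broken_match)          # None: no "{" follows a word character in the line
print(fixed_match.groups())  # the six captured fields
```

The broken pattern only matches input such as a{{}, which is why removing that group made the reduced expression work.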
Text:
[A]I'm an example text [] But I want to be included [[]]
[A]I'm another text without a second part []
Regex:
\[A\][\s\S]*?(?:(?=\[\])|(?=\[\[\]\]))
Using the above regex, it's not possible to capture the second part of the first text.
Demo
Is there a way to tell the regex to be greedy on the 'or'-part? I want to capture the biggest group possible.
Edit 1:
Original Attempt:
Demo
Edit 2:
What I want to achieve:
In our company, we use a web service to report our working time. I want to develop a desktop application to easily keep an eye on the time worked. I successfully downloaded the server's response (with all the necessary data), but unfortunately this data is in quite a bad state for processing.
Therefore I need to split the whole page into different days. Unfortunately, a single day may have multiple time sets, e.g. 06:05 - 10:33; 10:55 - 13:13. The regular expression posted above splits a day's dataset after the first time set (so after 10:33). Therefore I want the regex to handle the or-part "greedily": if expression 1 (the larger one) matches, skip the second expression; if expression 1 does not match, use the second one.
I have changed your regex (actually simpler) to do what you want:
\[A\].*\[?\[\]\]?
It starts by matching the '[A]', then matches any number of any characters (greedy) and finally one or two '[]'.
Edit:
This will prefer double square brackets:
\[A\].*(?:\[\[\]\]|\[\])
You may use
\[A][\s\S]*?(?=\[A]|$)
See the regex demo.
Details
\[A] - a [A] substring
[\s\S]*? - any 0+ chars as few as possible
(?=\[A]|$) - a location that is immediately followed with [A] or end of string.
In C#, you actually may even use a split operation:
Regex.Split(s, @"(?!^)(?=\[A])")
See this .NET regex demo. The (?!^)(?=\[A]) regex matches a location in a string that is not at the start and that is immediately followed with [A].
If instead of A there can be any letter, replace A with [A-Z] or [A-Z]+.
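As a rough illustration of the lazy-match-plus-lookahead pattern above, here is a sketch in Python using the two sample lines from the question joined by a newline. TEXT and ENTRY are illustrative names, not part of the original answer.

```python
import re

TEXT = ("[A]I'm an example text [] But I want to be included [[]]\n"
        "[A]I'm another text without a second part []")

# Lazily consume characters until the next "[A]" or the end of the input.
ENTRY = re.compile(r"\[A][\s\S]*?(?=\[A]|$)")

# Each raw match may end with the newline that precedes the next "[A]".
entries = [m.strip() for m in ENTRY.findall(TEXT)]
for entry in entries:
    print(entry)
```

Because the stop condition is the next [A] rather than a closing [] or [[]], each entry keeps everything that belongs to it, including the second part of the first text.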
I'm trying to get the following using Regex.
This is sample input:
-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emcc=ME@HOST.COM -embcc=YOU@HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear but sometimes they do and then I need to
capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.
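The iterate-and-validate approach described above might look roughly like this in Python. The function name parse_options, the KNOWN_KEYS set and the EMAIL_ prefixing are assumptions made for the sketch; only the regex comes from the answer.

```python
import re

# Key: anything but whitespace or "="; value: a quoted string or a run of non-whitespace.
OPTION = re.compile(r'-em([^\s=]+)=("[^"]*"|[^\s]+)')
KNOWN_KEYS = {"to", "from", "cc", "bcc", "subject"}  # assumed set of accepted options

def parse_options(args):
    """Return a dict such as {"EMAIL_TO": ...}; reject unrecognized keys."""
    options = {}
    for key, value in OPTION.findall(args):
        if key not in KNOWN_KEYS:
            raise ValueError(f"unknown option: -em{key}")
        options["EMAIL_" + key.upper()] = value.strip('"')
    return options

parsed = parse_options(
    '-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emcc=ME@HOST.COM -emsubject="MY SUBJECT"'
)
print(parsed)
```

Since the regex accepts any key, a mistyped option such as -emsubjcet=something still matches and can be reported by name instead of silently failing.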
I have code that searches a folder that contains SQL patch files. I want to add file names to an array if they match the following name format:
number-text-text.sql
What Regex would I use to match number-text-text.sql?
Would it be possible to make the Regex match file names where the number part of the file name is between two numbers? If so what would be the Regex syntax for this?
The following regex gets you halfway there:
\d+-[a-zA-Z]+-[a-zA-Z]+\.sql
Matching a specific range is trickier, as regex doesn't have a simple way to handle numeric ranges. To limit the match to a filename with a number between 2 and 13, this is the regex:
([2-9]|1[0-3])-[a-zA-Z]+-[a-zA-Z]+\.sql
Your regular expression should be:
(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql
You would then use the first captured group to check if your number is between the two numbers you desire. Don't try to check if a number is within a range with a regular expression; do it in two steps. Your code will be much clearer for it.
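A minimal sketch of that two-step check, written in Python for brevity. The helper name patch_number_in_range and the sample file names are made up for the example, and the bounds are treated as inclusive.

```python
import re

PATCH_NAME = re.compile(r"(\d+)-[a-zA-Z]+-[a-zA-Z]+\.sql")

def patch_number_in_range(filename, low, high):
    """True if filename looks like number-text-text.sql and low <= number <= high."""
    match = PATCH_NAME.fullmatch(filename)
    return match is not None and low <= int(match.group(1)) <= high

names = ["4-add-index.sql", "12-drop-table.sql", "40-fix-view.sql", "notes.txt"]
selected = [n for n in names if patch_number_in_range(n, 2, 13)]
print(selected)
```

Changing the range then means changing two integers rather than rewriting the pattern.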
How about:
\d+-[^-]+-[^-]+\.sql
Edit: You want just letters, so here it is without specific ranges.
\d+-[a-z]+-[a-z]+\.sql - You'll also want to use the i flag (RegexOptions.IgnoreCase in C#); here it is in js/perl:
/\d+-[a-z]+-[a-z]+\.sql/i
Ranges are more difficult. Here's an example of how to match 0-255:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
So to match (0-255)-text-text.sql, you'd have this:
/^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])-[a-z]+-[a-z]+\.sql/i
(I put the digits in a non-capturing group and matched from the beginning of the string to prevent partial matches on the number and in case you're expecting numbered groups or something).
Basically every time you need another digit of possibility, you'll need to add a new condition inside this case. The smaller the digit you'd like to match, the more cases you'll need as well. What is your desired min/max? AFAIK there's not a simple way to do this dynamically (although I'd love for someone to show me I'm wrong about that).
The simplest way to get around this would be to simply capture the digits, and use native syntax to see if it's in your range. Example in js:
var match = filename.match(/(\d+)-[a-z]+-[a-z]+\.sql/i);
// match[1] is a string, so convert it before comparing
var number = match ? parseInt(match[1], 10) : NaN;
if (number > minimumNumber && number < maximumNumber) {
    doStuff();
}
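The 0-255 alternation above is easy to get subtly wrong, so it can be worth checking by brute force. The sketch below does that in Python; OCTET_FILE and the Some-Patch file name are illustrative only.

```python
import re

OCTET_FILE = re.compile(
    r"^(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])-[a-z]+-[a-z]+\.sql",
    re.IGNORECASE,
)

# Numbers from 0 to 999 whose file name is accepted by the pattern.
accepted = [n for n in range(1000) if OCTET_FILE.match(f"{n}-Some-Patch.sql")]
print(accepted[0], accepted[-1], len(accepted))
```

The leading ^ matters here: without it, a name such as 999-Some-Patch.sql would still match starting at its last digit.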
This should work:
select '4-dfsg-asdfg.sql' ~ E'^[0-9]+-[a-zA-Z]+-[a-zA-Z]+\\.sql$'
This restricts the TEXT to simple ASCII characters. May or may not be what you want.
This is tested in PostgreSQL. Regular expression flavors differ a lot between implementations. You probably know that?
Anchors at begin ^ and end $ are optional, depending how you are going to do it.
I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option: stating explicitly that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference, but it's an important thing to keep in mind depending on what data/patterns you might be dealing with.
On a side note, I recommend using the C# verbatim string specifier @ for regular expression patterns, so that you do not need to double-escape things in regex patterns. So I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex.
Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
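The three variants discussed in these answers can be compared side by side. This is a small sketch in Python rather than C#; the greedy, lazy and negated-class constructs shown behave the same way in both.

```python
import re

markup = "[BName] [BAddress]"

greedy = re.findall(r"\[B.+\]", markup)        # .+ runs to the last ]
lazy = re.findall(r"\[B.+?\]", markup)         # .+? stops at the first ]
negated = re.findall(r"\[B[^\]]+\]", markup)   # [^\]] can never cross a ]

print(greedy)   # ['[BName] [BAddress]']
print(lazy)     # ['[BName]', '[BAddress]']
print(negated)  # ['[BName]', '[BAddress]']
```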
Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means I only allow users to search for 1-character strings. If the user tries to search for multi-character strings, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched strings' length is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.
a regexp would be .{1}
This will allow any character, though. If you only want alphanumerics, you can use [a-z0-9]{1} or the broader shorthand \w{1} (which also accepts uppercase letters and the underscore, among others).
Another option is to limit the number of characters a user can type in the input field: set a maxlength on it.
Yet another option is to save the form's input field to a char rather than a string, although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char?
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
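One way to read that suggestion is to run the user's pattern against each character separately, so a pattern that needs more than one character simply never matches. Here is a sketch in Python; the helper name matching_characters is hypothetical.

```python
import re

def matching_characters(pattern, text):
    """Return (index, character) pairs for each character the pattern fully matches."""
    regex = re.compile(pattern)
    return [(i, ch) for i, ch in enumerate(text) if regex.fullmatch(ch)]

print(matching_characters(r"[A-C]", "ABxC"))   # [(0, 'A'), (1, 'B'), (3, 'C')]
print(matching_characters(r"ab", "abab"))      # []
```

An empty result for a pattern the user expected to match is a reasonable trigger for the error message. Note that patterns relying on surrounding context, such as lookarounds, only ever see the single character.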
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
E.g.:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
Which cases do you need to support?
This would be somewhat easy to parse and validate.