I need help building a regex - C#

It is my first time working with regex and I am a little lost. To give you a little background, I am making a program that reads a text file line by line and saves each line in a string called "line". If the line starts with either a tab or a whitespace, followed by a number or numbers and dots (such as 1 or 1.2.1, for instance), followed by another tab or whitespace, it copies the line to another file.
So far I have built this regex, but it does not work:
string pattern = @"(\t| ) *[0-9.] (\t| )";
if (line.StartsWith(pattern))
{
//copy line
}
Also, is line.StartsWith correct? Or should I use something like rgx.Matches(pattern)?

Your pattern contains a character class without a quantifier, [0-9.], which matches only a single digit or a single dot.
To avoid matching, for example, only dots, you could first match digits, followed by an optional repeated part that matches a dot and then digits again: [0-9]+(?:\.[0-9]+)*
Note that a space in the pattern has meaning: in the part  (\t| ) at the end of your pattern, the literal space before the group must match as well, so 2 characters are expected there.
You could simplify the pattern by using a character class to match either a tab or a space instead of an alternation, and if you don't need the capturing group, you could omit it.
Instead of using StartsWith, you could use, for example, IsMatch:
^[ \t][0-9]+(?:\.[0-9]+)*[ \t]
^ Start of string
[ \t] Match a single tab or space
[0-9]+ Match 1+ digits 0-9
(?:\.[0-9]+)* Repeat 0+ times a dot and 1+ digits
[ \t] Match a single tab or space
For example
string s = "\t1.2.1 ";
Regex regex = new Regex(@"^[ \t][0-9]+(?:\.[0-9]+)*[ \t]");
if (regex.IsMatch(s)) {
//copy line
}
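The line-by-line filtering described in the question could be sketched as follows, using the pattern from the answer above. The class name `NumberedLineFilter`, the `Filter` helper, and the sample lines are illustrative assumptions, not part of the original post; an in-memory array stands in for the text file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberedLineFilter
{
    // Pattern from the answer: leading tab/space, dotted number, trailing tab/space.
    private static readonly Regex Numbered =
        new Regex(@"^[ \t][0-9]+(?:\.[0-9]+)*[ \t]");

    // Hypothetical helper: returns only the lines that would be copied.
    public static List<string> Filter(IEnumerable<string> lines)
    {
        var kept = new List<string>();
        foreach (string line in lines)
        {
            if (Numbered.IsMatch(line))
            {
                kept.Add(line);
            }
        }
        return kept;
    }

    public static void Main()
    {
        string[] sample =
        {
            "\t1.2.1 Scope",      // kept
            " 3\tOverview",       // kept
            "1.2 no leading gap", // skipped: no leading tab or space
            " ... only dots",     // skipped: no digits
        };
        foreach (string line in Filter(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

With real files, the same helper could be fed from `File.ReadLines(sourcePath)` and its result written with `File.WriteAllLines(targetPath, ...)`, where both paths are placeholders.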

Related

Regex start new match at specific pattern

Hello, I'm kind of new to regex and have a small, maybe simple question.
I have the given text:
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
17.11.2020 15:32 typical Pat. seems sleeping
Additional test
My current regex (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?(.*)
matches only up to "sleeping", but creates 3 matches correctly.
But I need the "Additional test" text in the second group as well.
I tried something like (\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?([,.:\w\s]*), but now I have only one huge match because the second group takes everything until the end.
How can I match everything until a new line starting with a date, and create a new match from there on?
If you are sure there is only one additional line to be matched you can use
(?m)^(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2})\s*(.*(?:\n.*)?)
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line
(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}) - Group 1: a datetime string
\s* - zero or more whitespaces
(.*(?:\n.*)?) - Group 2: any zero or more chars other than a newline char as many as possible and then an optional line, a newline followed with any zero or more chars other than a newline char as many as possible.
If there can be any amount of lines, you may consider
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)
See this regex demo. Here,
(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2}) - matches the same as above, just \s is replaced with [\p{Zs}\t] that only matches horizontal whitespace
[\p{Zs}\t]* - 0+ horizontal whitespace chars
(?s) - now, . will match any chars including a newline
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\n\d{2}\.\d{2}\.\d{4}|\z) - up to the leftmost occurrence of a newline, followed with a date string, or up to the end of string.
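As a rough illustration, the second pattern above can be applied in C# like this. The class name `LogEntryParser` and the `Parse` helper are made up for the example, and the sample text assumes plain "\n" line endings (with "\r\n", group 2 would keep a trailing "\r").

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogEntryParser
{
    // Pattern taken from the answer: group 1 is the datetime, group 2 is
    // everything up to the next line that starts with a date (or end of text).
    private static readonly Regex Entry = new Regex(
        @"(?m)^(\d{2}\.\d{2}\.\d{4}[\p{Zs}\t]\d{2}:\d{2})[\p{Zs}\t]*(?s)(.*?)(?=\n\d{2}\.\d{2}\.\d{4}|\z)");

    // Hypothetical helper: returns one match per dated entry.
    public static MatchCollection Parse(string text)
    {
        return Entry.Matches(text);
    }

    public static void Main()
    {
        string text =
            "17.11.2020 15:32 typical Pat. seems sleeping\n" +
            "Additional test\n" +
            "17.11.2020 15:32 typical Pat. seems sleeping\n" +
            "Additional test\n" +
            "17.11.2020 15:32 typical Pat. seems sleeping\n" +
            "Additional test";

        foreach (Match m in Parse(text))
        {
            // Show the embedded newline as " | " so each entry prints on one line.
            Console.WriteLine("[" + m.Groups[1].Value + "] "
                + m.Groups[2].Value.Replace("\n", " | "));
        }
    }
}
```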
Your character class [,.:\w\s]* repeats \s with the * quantifier, and since \s also matches newlines, it matches too much.
You can match the rest of the line, a newline, and the next line in the same group using (.*\r?\n.*), because .* on its own does not match a newline.
^(\d{2}.\d{2}.\d{4}\s\d{2}:\d{2})\s?(.*\r?\n.*)
If multiple lines can follow, match all following lines that do not start with a date like pattern.
^(\d{2}\.\d{2}\.\d{4})\s*(.*(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)*)
Explanation
^ Start of the string
( Capture group1
\d{2}\.\d{2}\.\d{4} Match a date like pattern
) Close group 1
\s* Match 0+ whitespace chars (Or match whitespace chars without newlines [^\S\r\n]*)
( Capture group 2
.* Match the whole line
(?:\r?\n(?!\d{2}\.\d{2}\.\d{4}).*)* Optionally repeat matching the whole line if it does not start with a date like pattern
) Close group 2

How can I filter out certain combinations?

I'm trying to filter the input of a TextBox using a Regex. I need up to three numbers before the decimal point and I need two after it. This can be in any form.
I've tried changing the regex commands around, but it creates errors and single inputs won't be valid. I'm using a TextBox in WPF to collect the data.
bool containsLetter = Regex.IsMatch(units.Text, "^[0-9]{1,3}([.] [0-9] {1,3})?$");
if (containsLetter == true)
{
MessageBox.Show("error");
}
return containsLetter;
I want the regex filter to accept these types of inputs:
111.11,
11.11,
1.11,
1.01,
100,
10,
1,
As mentioned in the comments, spaces are characters that will be interpreted literally in your regex pattern.
Therefore in this part of your regex:
([.] [0-9] {1,3})
a space is expected between . and [0-9],
and the same goes after [0-9], where the regex would match 1 to 3 spaces.
That being said, for readability purposes you have several ways to construct your regex.
1) Put the comments out of the regex:
string myregex = @"\s" // Match any whitespace once
+ @"\n" // Match one newline character
+ @"[a-zA-Z]"; // Match any letter
2) Add comments within your regex by using the syntax (?#comment)
needle(?# this will find a needle)
3) Activate free-spacing mode within your regex:
nee # this will find a nee...
dle # ...dle (the split means nothing when white-space is ignored)
doc: https://www.regular-expressions.info/freespacing.html
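In C#, free-spacing mode is enabled with RegexOptions.IgnorePatternWhitespace. Here is a minimal sketch applied to the question's input format. It assumes the fractional part, when present, must have exactly two digits (the question says "two after it"), and the names `AmountValidator` and `IsValid` are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmountValidator
{
    // Whitespace and # comments inside the pattern are ignored in this mode,
    // so the spaces below no longer have to be matched by the input.
    private static readonly Regex Amount = new Regex(
        @"^              # start of input
          [0-9]{1,3}     # one to three digits
          (?:            # optional fractional part:
              [.]        #   a literal dot
              [0-9]{2}   #   exactly two digits (assumption)
          )?
          $              # end of input",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsValid(string text)
    {
        return Amount.IsMatch(text);
    }

    public static void Main()
    {
        string[] samples = { "111.11", "1.01", "100", "1", "1111", "1.1", "abc" };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + IsValid(s));
        }
    }
}
```

Note that IsMatch returns true for valid input here, so an error message would be shown when `IsValid` returns false.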

Parsing text between quotes with .NET regular expressions

I have the following input text:
@"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38"
I would like to parse the values with the #name=value syntax as name/value pairs. Parsing the previous string should result in the following named captures:
name:"foo"
value:"bar"
name:"name"
value:"John \""The Anonymous One\"" Doe"
name:"age"
value:"38"
I tried the following regex, which got me almost there:
@"(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>[A-Za-z0-9_-]+|(?="").+?(?=(?<!\\)""))"
The primary issue is that it captures the opening quote in "John \""The Anonymous One\"" Doe". I feel like this should be a lookbehind instead of a lookahead, but that doesn't seem to work at all.
Here are some rules for the expression:
Name must start with a letter and can contain any letter, number, underscore, or hyphen.
Unquoted value must have at least one character and can contain any letter, number, underscore, or hyphen.
Quoted value can contain any character including any whitespace and escaped quotes.
Edit:
Here's the result from regex101.com:
(?:(?<=\s)|^)#(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)"))
(?:(?<=\s)|^) Non-capturing group
# matches the character # literally
(?<name>\w+[A-Za-z0-9_-]+?) Named capturing group name
\s* match any white space character [\r\n\t\f ]
= matches the character = literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)")) Named capturing group value
1st Alternative: [A-Za-z0-9_-]+
[A-Za-z0-9_-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
A-Z a single character in the range between A and Z (case sensitive)
a-z a single character in the range between a and z (case sensitive)
0-9 a single character in the range between 0 and 9
_- a single character in the list _- literally
2nd Alternative: (?=").+?(?=(?<!\\)")
(?=") Positive Lookahead - Assert that the regex below can be matched
" matches the characters " literally
.+? matches any character (except newline)
Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
(?=(?<!\\)") Positive Lookahead - Assert that the regex below can be matched
(?<!\\) Negative Lookbehind - Assert that it is impossible to match the regex below
\\ matches the character \ literally
" matches the characters " literally
You can use a very useful .NET regex feature where multiple same-named captures are allowed. Also, there is an issue with your (?<name>) capture group: it allows a digit in the first position, which does not meet your 1st requirement.
So, I suggest:
(?si)(?:(?<=\s)|^)#(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))
Note that you cannot debug .NET-specific regexes at regex101.com, you need to test them in .NET-compliant environment.
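One way to try the suggested pattern in a .NET environment is sketched below. The regex is the one from the answer, written as a verbatim string; the class name `PairParser` and the `Parse` helper are assumptions made for this example. Both alternatives write to the same-named group "value", so a single lookup retrieves whichever one matched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Suggested pattern from the answer; the two (?<value>...) groups share one name.
    private static readonly Regex Pair = new Regex(
        @"(?si)(?:(?<=\s)|^)#(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))");

    // Hypothetical helper: collects the name/value pairs in order of appearance.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string input =
            @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
        foreach (KeyValuePair<string, string> pair in Parse(input))
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);
        }
    }
}
```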
Use string methods.
Split
string myLongString = @"This is some text #foo=bar #name=""John \""The Anonymous One\"" Doe"" #age=38";
string[] nameValues = myLongString.Split('#');
From there either use Split function with "=" or use IndexOf("=").

Regular Expressions + Including one space in pattern

I'm trying to figure out how to write a pattern to match to the following: "3Z 5Z". The numbers in this can vary, but the Z's are constant. The issue I'm having is trying to include the white space... Currently I have this as my pattern
pattern = @"\b*Z\s*Z\b";
The '*' represent the wildcard for the number preceding the "Z", but it doesn't seem to want to work with the space in it. For example, I can use the following pattern successfully for matching to the same thing without the space (i.e. 3Z5Z)
pattern = @"\b*Z*Z\b";
I am writing this program in .NET 4.0 (C#). Any help is much appreciated!
EDIT: This pattern is part of a larger string, for example:
"3Z 10Z lock 425"
Try this:
pattern = @"\b\d+Z\s+\d+Z\b";
Explanation:
"
\b # Assert position at a word boundary
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
Z # Match the character “Z” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
Z # Match the character “Z” literally
\b # Assert position at a word boundary
"
By the way:
\b*
Should throw an exception. \b is a word anchor. You can't quantify it.
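A short sketch of the suggested pattern pulling the pair out of a larger string, as in the question's edit. The class name `ZPairFinder` and the `FindFirst` helper are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZPairFinder
{
    // Pattern from the answer: digits + Z, whitespace, digits + Z, on word boundaries.
    private static readonly Regex ZPair = new Regex(@"\b\d+Z\s+\d+Z\b");

    // Hypothetical helper: returns the first match, or an empty string if there is none
    // (Match.Value is empty for a failed match).
    public static string FindFirst(string text)
    {
        return ZPair.Match(text).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FindFirst("3Z 10Z lock 425")); // prints "3Z 10Z"
    }
}
```

Because the pattern uses \s+, an input without the space (such as 3Z5Z) is not matched; \s* would accept both forms.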
Try this code.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string txt="3Z 5Z";
            string re1="(\\d+)"; // Integer Number 1
            string re2="(Z)"; // Any Single Character 1
            string re3="( )"; // Any Single Character 2
            string re4="(\\d+)"; // Integer Number 2
            string re5="(Z)"; // Any Single Character 3
            Regex r = new Regex(re1+re2+re3+re4+re5,RegexOptions.IgnoreCase|RegexOptions.Singleline);
            Match m = r.Match(txt);
            if (m.Success)
            {
                String int1=m.Groups[1].ToString();
                String c1=m.Groups[2].ToString();
                String c2=m.Groups[3].ToString();
                String int2=m.Groups[4].ToString();
                String c3=m.Groups[5].ToString();
                Console.Write("("+int1.ToString()+")"+"("+c1.ToString()+")"+"("+c2.ToString()+")"+"("+int2.ToString()+")"+"("+c3.ToString()+")"+"\n");
            }
            Console.ReadLine();
        }
    }
}
In addition to the other posts, I would add anchors for the beginning and end of the string.
pattern = @"^\d+Z\s\d+Z$";

Using Regular Expression Match a String that contains numbers letters and dashes

I need to match this string 011Q-0SH3-936729 but not 345376346 or asfsdfgsfsdf
It has to contain characters AND numbers AND dashes
Pattern could be 011Q-0SH3-936729 or 011Q-0SH3-936729-SDF3 or 000-222-AAAA or 011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729 and I want it to be able to match anyone of those. Reason for this is that I don't really know if the format is fixed and I have no way of finding out either so I need to come up with a generic solution for a pattern with any number of dashes and the pattern recurring any number of times.
Sorry this is probably a stupid question, but I really suck at Regular expressions.
TIA
foundMatch = Regex.IsMatch(subjectString,
@"^ # Start of the string
(?=.*\p{L}) # Assert that there is at least one letter
(?=.*\p{N}) # and at least one digit
(?=.*-) # and at least one dash.
[\p{L}\p{N}-]* # Match a string of letters, digits and dashes
$ # until the end of the string.",
RegexOptions.IgnorePatternWhitespace);
should do what you want. If by letters/digits you meant "only ASCII letters/digits" (and not international/Unicode letters, too), then use
foundMatch = Regex.IsMatch(subjectString,
@"^ # Start of the string
(?=.*[A-Z]) # Assert that there is at least one letter
(?=.*[0-9]) # and at least one digit
(?=.*-) # and at least one dash.
[A-Z0-9-]* # Match a string of letters, digits and dashes
$ # until the end of the string.",
RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
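To see the three lookaheads at work, here is a compact sketch of the ASCII-only variant checked against the strings from the question. The class name `KeyValidator` and the `IsKey` helper are assumptions for illustration; the pattern is the one above with the comments removed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyValidator
{
    // Each lookahead scans from the start: one letter, one digit, one dash required.
    private static readonly Regex Key = new Regex(
        @"^(?=.*[A-Z])(?=.*[0-9])(?=.*-)[A-Z0-9-]*$",
        RegexOptions.IgnoreCase);

    public static bool IsKey(string candidate)
    {
        return Key.IsMatch(candidate);
    }

    public static void Main()
    {
        string[] samples =
        {
            "011Q-0SH3-936729", // letters, digits and dashes: accepted
            "000-222-AAAA",     // accepted
            "345376346",        // digits only: rejected
            "asfsdfgsfsdf",     // letters only: rejected
        };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + IsKey(s));
        }
    }
}
```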
EDIT:
This will match any of the keys provided in your comments:
^[0-9A-Z]+(-[0-9A-Z]+)+$
This means the key starts with a digit or letter and has at least one dash.
Without more info about the regularity of the dashes or otherwise, this is the best we can do:
Regex.IsMatch(input, @"[A-Z0-9\-]+\-[A-Z0-9]")
Although this will also match -A-0
Most naive implementation EVER (might get you started):
([0-9]|[A-Z])+(-)([0-9]|[A-Z])+(-)([0-9]|[A-Z])+
Tested with Regex Coach.
EDIT:
That matches only three groups; here is another, slightly better one:
([0-9A-Z]+\-)+([0-9A-Z]+)
Are you applying the regex to a whole string (i.e., validating or filtering)? If so, Tim's answer should put you right. But if you're plucking matches from a larger string, it gets a bit more complicated. Here's how I would do that:
string input = @"Pattern could be 011Q-0SH3-936729 or 011Q-0SH3-936729-SDF3 or 000-222-AAAA or 011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729 but not 345-3763-46 or ASFS-DFGS-FSDF or ASD123FGH987.";
Regex pluckingRegex = new Regex(
@"(?<!\S) # start of 'word'
(?=\S*\p{L}) # contains a letter
(?=\S*\p{N}) # contains a digit
(?=\S*-) # contains a hyphen
[\p{L}\p{N}-]+ # gobble up letters, digits and hyphens only
(?!\S) # end of 'word'
", RegexOptions.IgnorePatternWhitespace);
foreach (Match m in pluckingRegex.Matches(input))
{
Console.WriteLine(m.Value);
}
output: 011Q-0SH3-936729
011Q-0SH3-936729-SDF3
000-222-AAAA
011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729-011Q-0SH3-936729
The negative lookarounds serve as 'word' boundaries: they ensure the matched substring starts either at the beginning of the string or after a whitespace character ((?<!\S)), and ends either at the end of the string or before a whitespace character ((?!\S)).
The three positive lookaheads work just like Tim's, except they use \S* to skip whatever precedes the first letter/digit/hyphen. We can't use .* in this case because that would allow it to skip to the next word, or the next, etc., defeating the purpose of the lookahead.
