Regex doesn't give me expected result - c#

Okay, I give up - time to call upon the regex gurus for some help.
I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.
Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.
Here's my simple code:
const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") | (("+dataLine+@"\r\n)*"+dataLine +")";
Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
return validCSVDataPattern.IsMatch(fileContents);
}
This gives me a regex pattern of
(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) | ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)
However, if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? The C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).
Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.

Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.
For example, this text matches the expression:
foo, bar
and therefore this text also matches:
var result = calculate(foo, bar);
You can see where this is going.
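Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect. Because the pattern contains a top-level |, wrap the whole alternation in a group first, as in ^(...|...)$; otherwise ^ applies only to the first alternative and $ only to the second.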
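A minimal sketch of the difference, reusing the word and line sub-patterns from the question. The class and method names (AnchorDemo, LooksLikeCsv, LooksLikeCsvUnanchored) are made up for illustration, and the alternation is dropped on the assumption that the multi-line branch already covers the single-line case:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Sub-patterns taken from the question.
    const string DataWord = @"[a-z0-9_\-]+";
    const string DataLine = "(" + DataWord + @"\s*,\s*)*" + DataWord;

    // Unanchored: succeeds if ANY substring of the input looks like a data line.
    static readonly Regex Unanchored =
        new Regex(DataLine, RegexOptions.IgnoreCase);

    // Anchored: the whole input must be one or more data lines.
    static readonly Regex Anchored =
        new Regex("^(?:" + DataLine + @"\r\n)*" + DataLine + "$",
                  RegexOptions.IgnoreCase);

    public static bool LooksLikeCsvUnanchored(string s) => Unanchored.IsMatch(s);

    public static bool LooksLikeCsv(string s) => Anchored.IsMatch(s);

    public static void Main()
    {
        string code = "var result = calculate(foo, bar);";
        string csv = "foo, bar\r\nbaz,qux";

        Console.WriteLine(LooksLikeCsvUnanchored(code)); // True
        Console.WriteLine(LooksLikeCsv(code));           // False
        Console.WriteLine(LooksLikeCsv(csv));            // True
    }
}
```

The unanchored pattern reports a match on the C# line because a single word such as "var" already satisfies it.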

Here is a better pattern which looks for CSV groups such as XXX, or yyy for one to many in each line:
^([\w\s_\-]*,?)+$
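^ - Start of each line (requires RegexOptions.Multiline; otherwise ^ matches only at the start of the string)
( - a CSV match group start
[\w\s_\-]* - Valid characters: \w (letters, digits, and underscore), \s (whitespace), and - in each CSV value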
,? - maybe a comma
)+ - End of the csv match group, 1 to many of these expected.
That will validate a whole file, line by line for a basic CSV structure and allow for empty ,, situations.

I came up with this regex:
^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$
Tests
asbc_- , khkhkjh, lkjlkjlkj_-, j : PASS
asbc, : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j : PASS
If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use
^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$
Loop through all the lines to see if the file is ok:
const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
foreach (var line in lines)
{
if (!validCSVDataPattern.IsMatch(line))
return false;
}
return true;
}

I think this is what you're looking for:
@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"
The noteworthy changes are:
Added anchors (^ and $), because the regex is totally pointless without them
Removed spaces (which have to match literal spaces, and I don't think that's what you intended)
Replaced the \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots)
The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)
p.s., In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.

Related

How can I filter out certain combinations?

I'm trying to filter the input of a TextBox using a Regex. I need up to three numbers before the decimal point and I need two after it. This can be in any form.
I've tried changing the regex commands around, but it creates errors and single inputs won't be valid. I'm using a TextBox in WPF to collect the data.
bool containsLetter = Regex.IsMatch(units.Text, "^[0-9]{1,3}([.] [0-9] {1,3})?$");
if (containsLetter == true)
{
MessageBox.Show("error");
}
return containsLetter;
I want the regex filter to accept these types of inputs:
111.11,
11.11,
1.11,
1.01,
100,
10,
1,
As it has been mentioned in the comment, spaces are characters that will be interpreted literally in your regex pattern.
Therefore in this part of your regex:
([.] [0-9] {1,3})
a space is expected between . and [0-9],
the same goes for after [0-9] where the regex would match 1 to 3 spaces.
That being said, for readability you have several ways to construct your regex.
1) Put the comments outside the regex:
string myregex = @"\s" // Match any whitespace once
+ @"\n" // Match one newline character
+ @"[a-zA-Z]"; // Match any letter
2) Add comments within your regex by using the syntax (?#comment)
needle(?# this will find a needle)
3) Activate free-spacing mode within your regex:
nee # this will find a nee...
dle # ...dle (the split means nothing when white-space is ignored)
doc: https://www.regular-expressions.info/freespacing.html
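In .NET, free-spacing mode is switched on with RegexOptions.IgnorePatternWhitespace. A small sketch for the TextBox case, assuming "two after it" means exactly two decimal digits when a decimal point is present; the names DecimalInputDemo and IsValidAmount are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DecimalInputDemo
{
    // With IgnorePatternWhitespace, unescaped spaces and #-comments in the
    // pattern are ignored, so the pattern can be laid out for readability.
    static readonly Regex Amount = new Regex(@"
        ^
        [0-9]{1,3}            # one to three digits before the decimal point
        (?: [.] [0-9]{2} )?   # optionally a point followed by exactly two digits
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsValidAmount(string s) => Amount.IsMatch(s);

    public static void Main()
    {
        foreach (var s in new[] { "111.11", "1.01", "100", "1", "1111", "1.1", "abc" })
            Console.WriteLine($"{s}: {IsValidAmount(s)}");
    }
}
```

Note that the question's variable is named containsLetter but holds true for valid input, so the error message should be shown when the match fails, not when it succeeds.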

Regex to extract string between parentheses which also contains other parentheses

I've been trying to figure this out, but I don't think I understand Regex well enough to get to where I need to.
I have string that resemble these:
filename.txt(1)attribute, 2)attribute(s), more!)
otherfile.txt(abc, def)
Basically, a string that always starts with a filename, then has some text between parentheses. And I'm trying to extract that part which is between the main parentheses, but the text that's there can contain absolutely anything, even some more parentheses (it often does.)
Originally, there was a 'hacky' expression made like this:
/\(([^#]+)\)\g
And it worked, until we ran into a case where the input string contained a # and we were stuck. Obviously...
I can't change the way the strings are generated, it's always a filename, then some parentheses and something of unknown length and content inside.
I'm hoping for a simple Regex expression, since I need this to work in both C# and in Perl -- is such a thing possible? Or does this require something more complex, like its own parsing method?
Instead of excluding the # symbol, you can match any character with . and add a quantifier that matches zero or more times. You can also simplify the regex by dropping the capturing group:
\(.*\)
Here is the explanation for the regular expression:
\( matches the character ( literally.
.* matches any character (except for line terminators).
The * quantifier matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).
\) matches the character ) literally.
You can use regex101 to compose and debug your regular expressions.
Regex seems overkill to me in this case. Can be more reliably achieved using string manipulation methods.
int first = str.IndexOf("(");
int last = str.LastIndexOf(")");
string subString = null;
// require the closing parenthesis to come after the opening one
if (first != -1 && last > first)
{
    subString = str.Substring(first + 1, last - first - 1);
}
I've never used Perl, but I'll venture a guess that it has equivalent methods.
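For comparison, here is a sketch of both approaches side by side on the sample input from the question. The class and method names are hypothetical, and the regex variant adds a capturing group plus RegexOptions.Singleline (so that . also matches newlines), which the answers above do not use:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenDemo
{
    // Regex version: the greedy .* runs from the first '(' to the LAST ')'.
    public static string ExtractWithRegex(string s)
    {
        Match m = Regex.Match(s, @"\((.*)\)", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    // String version: first '(' and last ')' located directly.
    public static string ExtractWithIndexOf(string s)
    {
        int first = s.IndexOf('(');
        int last = s.LastIndexOf(')');
        return (first != -1 && last > first)
            ? s.Substring(first + 1, last - first - 1)
            : null;
    }

    public static void Main()
    {
        string input = "filename.txt(1)attribute, 2)attribute(s), more!)";
        Console.WriteLine(ExtractWithRegex(input));   // 1)attribute, 2)attribute(s), more!
        Console.WriteLine(ExtractWithIndexOf(input)); // 1)attribute, 2)attribute(s), more!
    }
}
```

Both return null when no enclosing pair of parentheses is found.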

Replace with wildcards

I need some advice. Suppose I have the following string: Read Variable
I want to find all pieces of text like this in a string and make all of them like the following: Variable = MessageBox.Show. So as additional examples:
"Read Dog" --> "Dog = MessageBox.Show"
"Read Cat" --> "Cat = MessageBox.Show"
Can you help me? I need a fast advice using RegEx in C#. I think it is a job involving wildcards, but I do not know how to use them very well... Also, I need this for a school project tomorrow... Thanks!
Edit: This is what I have done so far and it does not work: Regex.Replace(String, "Read ", " = Messagebox.Show").
You can do this
string ns = Regex.Replace(yourString, @"Read\s+(.*?)(?:\s|$)", "$1 = MessageBox.Show");
\s+ matches 1 to many space characters
(.*?)(?:\s|$) matches 0 to many characters till the first space (i.e \s) or till the end of the string is reached(i.e $)
$1 represents the first captured group i.e (.*?)
You might want to clarify your question... but here goes:
If you want to match the next word after "Read " in regex, use Read (\w*) where \w is the word character class and * is the greedy match operator.
If you want to match everything after "Read " in regex, use Read (.*)$ where . will match all characters and $ means end of line.
With either regex, you can use a replace of $1 = MessageBox.Show as $1 will reference the first matched group (which was denoted by the parenthesis).
Complete code:
replacedString = Regex.Replace(inStr, @"Read (.*)$", "$1 = MessageBox.Show");
The problem with your attempt is, that it cannot know that the replacement string should be inserted after your variable. Let's assume that valid variable names contain letters, digits and underscores (which can be conveniently matched with \w). That means, any other character ends the variable name. Then you could match the variable name, capture it (using parentheses) and put it in the replacement string with $1:
output = Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");
Note that \s+ matches one or more arbitrary whitespace characters. \w+ matches one or more letters, digits and underscores. If you want to restrict variable names to letters only, this is the place to change it:
output = Regex.Replace(input, @"Read\s+([a-zA-Z]+)", "$1 = MessageBox.Show");
Here is a good tutorial.
Finally, note that in C# it is advisable to write regular expressions as verbatim strings (@"..."). Otherwise, you will have to double-escape everything so that the backslashes get through to the regex engine, which really lessens the readability of the regex.
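Putting the \w+ variant into a complete program, as a rough sketch (ReadRewriteDemo and Rewrite are names invented for this example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReadRewriteDemo
{
    // Captures the word after "Read" and moves it in front of the assignment.
    public static string Rewrite(string input) =>
        Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");

    public static void Main()
    {
        Console.WriteLine(Rewrite("Read Dog"));           // Dog = MessageBox.Show
        Console.WriteLine(Rewrite("Read Cat; Read Bird")); // Cat = MessageBox.Show; Bird = MessageBox.Show
    }
}
```

Regex.Replace rewrites every occurrence in the input, so several "Read X" statements in one string are all handled in a single call.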

C# Regex remove line

I need to apply a regex in C#.
The string looks like the following:
MSH|^~\&|OAZIS||C2M||20110310222404||ADT^A08|00226682|P|2.3||||||ASCII
EVN|A08
PD1
PV1|1|test
And what I want to do is delete all the lines that only contain 3 characters (with no delimiters '|'). So in this case, the 'PD1' line (3rd line) has to be deleted.
Is this possible with a regex?
Thx
The following will do what you want without regular expressions.
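string inputString = File.ReadAllText("fileLocation"); // the whole file as one string
string resultingString = "";
foreach (var line in inputString.Split(new[] { "\n" }, StringSplitOptions.None))
{
    // keep lines longer than 3 characters, and any line containing a delimiter
    if (line.Trim().Length > 3 || line.Contains("|"))
        resultingString += line + "\n";
}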
This assumes that you have your file as one large string. And it gives you another string with the necessary lines removed.
(Or you could do it with the file directly:
string[] goodLines =
// read all of the lines of the file
File.ReadLines("fileLocation").
// filter out the ones you want
Where(line => line.Trim().Length > 3 || line.Contains("|")).ToArray();
You end up with a String[] with all of the correct lines in your file.)
This:
(?<![|])[^\n]{4}\n
Regex matched what you wanted in the online regex tester I used, however I believe that the {4} should actually be a {3}, so try switching them if it doesn't work for you.
EDIT:
This also works: \n[^|\n]{3}\n and is probably closer to what you are looking for.
EDIT 2:
The number in braces is definitely {3}, tested it at home.
Why not just open the file, create a temporary output file, and run through the lines one by one? If a line has exactly 3 characters, skip it. If the file can be held in memory entirely, you could use File.ReadAllLines() to get an array of strings that represents the file line by line.
Are the three characters always going to be by themselves on a line? If so, you can use beginning of string/end of string markers.
Here's a Regex that matches three characters that are by themselves on a string:
\A.{3}\z
\A is the start of the string.
\z is the end of the string.
. is any character, {3} with 3 occurrences
^ - start of line.
\w - word character
{3} - repeated exactly 3 times
$ - end of line
^\w{3}$
Just a general observation from the solutions I've seen posted so far. The original question included the comment "delete all the lines that only contain 3 characters" [my emphasis]. I'm not sure if you meant literally "only 3 characters", but in case you did, you may want to change the logic of the proposed solutions from things like
if (line.Trim().Length > 3 ...)
to
if (line.Trim().Length != 3 ...)
...just in case lines with 2 characters are indeed valid, for example. (Same idea for the proposed regex solutions.)
This regex will identify the lines that meet your exclusion criteria ^[^|]{3}$ then it's just a matter of iterating over all lines (with data) and checking which ones meet exclusion criteria. Like this for instance.
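// RegexOptions.Multiline makes ^ and $ match at every line, not just the ends of the string
foreach (Match match in Regex.Matches(data, @"^.+$", RegexOptions.Multiline))
{
    if (!Regex.IsMatch(match.Value, @"^[^|]{3}$"))
    {
        // Do something with the legitimate match.Value, such as writing the line to the target file.
    }
}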
The question is a little vague.
As stated, the answer is something like this
(?:^|(?<=\n))[^\n|]{3}(?:\n|$) which allows whitespace in the match.
So "#\t)" will also be deleted.
To limit the characters to visual (non-whitespace), you could use
(?:^|(?<=\n))[^\s|]{3}(?:\n|$)
which doesn't allow whitespace.
For both the context is a single string, replacement is '' and global.
Example context in perl: s/(?:^|(?<=\n))[^\n|]{3}(?:\n|$)//g
try this:
text = System.Text.RegularExpressions.Regex.Replace(
text,
@"^[^|]{3}(?:\r\n|[\r\n]|$)",
"",
System.Text.RegularExpressions.RegexOptions.Multiline);
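You can do it using Regex
string output = Regex.Replace(input, "^[a-zA-Z0-9]{3}$", "", RegexOptions.Multiline);
[a-zA-Z0-9] will match any letter or digit
{3} will match exactly 3 of them
RegexOptions.Multiline makes ^ and $ match at each line instead of only at the ends of the whole string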
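As a hedged sketch that combines the ideas above: one regex-based and one LINQ-based way to drop lines of exactly three characters that contain no '|'. The names ShortLineDemo, RemoveWithRegex, and RemoveWithLinq are made up, and the regex variant also consumes the line terminator so that no empty line is left behind:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ShortLineDemo
{
    // Multiline: ^ matches at the start of every line.
    // [^|\r\n]{3} is exactly three characters that are neither '|' nor a line break.
    // (?:\r?\n|\z) consumes the line terminator, or stops at the end of the string.
    public static string RemoveWithRegex(string text) =>
        Regex.Replace(text, @"^[^|\r\n]{3}(?:\r?\n|\z)", "", RegexOptions.Multiline);

    // Same filter without a regex; assumes "\n" as the separator and
    // ignores a trailing '\r' when measuring the line length.
    public static string RemoveWithLinq(string text) =>
        string.Join("\n", text.Split('\n')
            .Where(line => line.TrimEnd('\r').Length != 3 || line.Contains("|")));

    public static void Main()
    {
        string input = string.Join("\n",
            new[] { @"MSH|^~\&|OAZIS", "EVN|A08", "PD1", "PV1|1|test" });

        Console.WriteLine(RemoveWithRegex(input));
        Console.WriteLine(RemoveWithLinq(input));
    }
}
```

Both calls print the three remaining lines (MSH, EVN, PV1) with the PD1 line removed.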

Looking for a quote matching Reg Ex

I'm after a regex for C# which will turn this:
"*one*" *two** two and a bit "three four"
into this:
"*one*" "*two**" two and a bit "three four"
IE a quoted string should be unchanged whether it contains one or many words.
Any words with asterisks to be wrapped in double quotes.
Any unquoted words with no asterisks to be unchanged.
Nice to haves:
If multiple asterisks could be merged into one in the same step that would be better.
Noise words - eg and, a, the - which are not part of a quoted string should be dumped.
Thanks for any help / advice.
Julio
The following regex will do what you're looking for:
\*+ # Match 1 or more *
(
\w+ # Capture character string
)
\*+ # Match 1 or more *
If you use this in conjunction with this replace statement, all the words matched by (\w+) will be wrapped in "*...*":
string s = "\"one\" *two** two and a bit \"three four\"";
Regex r = new Regex(@"\*+(\w+)\*+");
var output = r.Replace(s, @"""*$1*""");
Note: This will leave the below string unquoted:
*two two*
If you wish to match those strings as well, use this regex:
\*+([^*]+)\*+
EDIT: updated code.
This solution works for your request, as well as the nice to have items:
string text = @"test the ""one"" and a *two** two and a the bit ""three four"" a";
string result = Regex.Replace(text, @"\*+(.*?)\*+", @"""*$1*""");
string noiseWordsPattern = @"(?<!"") # match if double quote prefix is absent
\b # word boundary to prevent partial word matches
(and|a|the) # noise words
\b # word boundary
(?!"") # match if double quote suffix is absent
";
// to use the commented pattern use RegexOptions.IgnorePatternWhitespace
result = Regex.Replace(result, noiseWordsPattern, "", RegexOptions.IgnorePatternWhitespace);
// or use this one line version instead
// result = Regex.Replace(result, @"(?<!"")\b(and|a|the)\b(?!"")", "");
// remove extra spaces resulting from noise words replacement
result = Regex.Replace(result, @"\s+", " ");
Console.WriteLine("Original: {0}", text);
Console.WriteLine("Result: {0}", result);
Output:
Original: test the "one" and a *two** two and a the bit "three four" a
Result: test "one" "*two*" two bit "three four"
The 2nd regex replacement for noise words causes potential duplicate of blank spaces. To remedy this side effect I added the 3rd regex replacement to clean it up.
Something like this. ArgumentReplacer is a callback that is called for each match. The return value is substituted into the returned string.
void Main() {
    string text = "\"one\" *two** and a bit \"three *** four\"";
    string finderRegex = @"
          (""[^""]*"")           # quoted
        | ([^\s""*]*\*[^\s""]*)  # with asterisks
        | ([^\s""]+)             # without asterisks
    ";
    // print the rewritten string (a void method cannot return it)
    Console.WriteLine(Regex.Replace(text, finderRegex, ArgumentReplacer,
        RegexOptions.IgnorePatternWhitespace));
}
public static String ArgumentReplacer(Match theMatch) {
    // Don't touch quoted arguments, and arguments with no asterisks
    if (theMatch.Groups[2].Value.Length == 0)
        return theMatch.Value;
    // Quote arguments with asterisks, and replace sequences of such
    // by a single one. C# format strings use {0}, not %s.
    return String.Format("\"{0}\"",
        Regex.Replace(theMatch.Value, @"\*\*+", "*"));
}
Alternatives to the left in the pattern have priority over those to the right. This is why I just needed to write "[^\s""]+" in the last alternative.
The quotes, on the other hand, are only matched if they occur at the beginning of the argument. They will not be detected if they occur in the middle of the argument, and we must stop before those if they occur.
Given that you wish to match pairs of quotes, I don’t think your language is regular, therefore I don’t think RegEx is a good solution. As the saying goes:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
See "When not to use Regex in C# (or Java, C++ etc)"
I've decided to follow the advice of a couple of responses and go with a parser solution. I've tried the regexes contributed so far and they seem to fail in some cases. That's probably an indication that regexes aren't the appropriate solution to this problem. Thanks for all responses.
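The poster did not share the parser they ended up with, so the following is only an illustrative sketch of what a small hand-written tokenizer for this task could look like. Everything in it is an assumption: the names QuoteTokenizer, Normalize, and CollapseAsterisks, the noise-word list (taken from the examples in the question), and the decision to join tokens with single spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteTokenizer
{
    // Hypothetical noise-word list, based on the examples in the question.
    static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "a", "the" };

    public static string Normalize(string input)
    {
        var output = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            int start = i;
            if (input[i] == '"')
            {
                // Quoted string: copy through the closing quote unchanged
                // (or to the end of the input if the quote is never closed).
                int close = input.IndexOf('"', i + 1);
                i = close == -1 ? input.Length : close + 1;
                output.Add(input.Substring(start, i - start));
                continue;
            }

            // Bare word: runs until whitespace or the next quote.
            while (i < input.Length && !char.IsWhiteSpace(input[i]) && input[i] != '"')
                i++;
            string word = input.Substring(start, i - start);

            if (word.Contains("*"))
                output.Add("\"" + CollapseAsterisks(word) + "\""); // wrap in quotes
            else if (!NoiseWords.Contains(word))
                output.Add(word);                                   // keep as-is
            // unquoted noise words are dropped
        }
        return string.Join(" ", output);
    }

    // Replaces each run of consecutive asterisks with a single one.
    static string CollapseAsterisks(string word)
    {
        var sb = new StringBuilder();
        foreach (char c in word)
            if (c != '*' || sb.Length == 0 || sb[sb.Length - 1] != '*')
                sb.Append(c);
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("\"*one*\" *two** two and a bit \"three four\""));
        // "*one*" "*two*" two bit "three four"
    }
}
```

Because quoted strings are consumed as a single token before any other rule applies, asterisks and noise words inside quotes are left alone, which is the part the regex attempts struggled with.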
