Looking for a quote matching Reg Ex - c#

I'm after a regex for C# which will turn this:
"*one*" *two** two and a bit "three four"
into this:
"*one*" "*two**" two and a bit "three four"
IE a quoted string should be unchanged whether it contains one or many words.
Any words with asterisks to be wrapped in double quotes.
Any unquoted words with no asterisks to be unchanged.
Nice to haves:
If multiple asterisks could be merged into one in the same step that would be better.
Noise words - eg and, a, the - which are not part of a quoted string should be dumped.
Thanks for any help / advice.
Julio

The following regex will do what you're looking for:
\*+ # Match 1 or more *
(
\w+ # Capture character string
)
\*+ # Match 1 or more *
If you use this in conjunction with the following replace statement, every word matched by (\w+) will be wrapped in double quotes with a single asterisk on each side, e.g. "*two*":
string s = "\"one\" *two** two and a bit \"three four\"";
Regex r = new Regex(@"\*+(\w+)\*+");
var output = r.Replace(s, @"""*$1*""");
Note: This will leave the below string unquoted:
*two two*
If you wish to match those strings as well, use this regex:
\*+([^*]+)\*+
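To make the difference between the two patterns above concrete, here is a minimal sketch. The class and member names (AsteriskQuoter, SingleWord, AnyText, Quote) are made up for illustration; only the two patterns come from the answer.

```csharp
using System.Text.RegularExpressions;

public static class AsteriskQuoter
{
    // Pattern from the answer: only a single word (\w+) between the asterisks.
    public const string SingleWord = @"\*+(\w+)\*+";

    // Looser variant from the answer: anything except '*' between the asterisks.
    public const string AnyText = @"\*+([^*]+)\*+";

    // Wraps each match in double quotes with a single '*' on each side.
    public static string Quote(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "\"*$1*\"");
    }
}
```

With SingleWord the input *two two* is left alone because of the space, while AnyText quotes it as a whole. Neither pattern knows about existing quotes, so text that is already quoted and contains asterisks would be wrapped a second time.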

EDIT: updated code.
This solution works for your request, as well as the nice to have items:
string text = @"test the ""one"" and a *two** two and a the bit ""three four"" a";
string result = Regex.Replace(text, @"\*+(.*?)\*+", @"""*$1*""");
string noiseWordsPattern = @"(?<!"") # match if double quote prefix is absent
\b # word boundary to prevent partial word matches
(and|a|the) # noise words
\b # word boundary
(?!"") # match if double quote suffix is absent
";
// to use the commented pattern use RegexOptions.IgnorePatternWhitespace
result = Regex.Replace(result, noiseWordsPattern, "", RegexOptions.IgnorePatternWhitespace);
// or use this one line version instead
// result = Regex.Replace(result, @"(?<!"")\b(and|a|the)\b(?!"")", "");
// remove extra spaces resulting from noise words replacement
result = Regex.Replace(result, @"\s+", " ");
Console.WriteLine("Original: {0}", text);
Console.WriteLine("Result: {0}", result);
Output:
Original: test the "one" and a *two** two and a the bit "three four" a
Result: test "one" "*two*" two bit "three four"
The 2nd regex replacement (for noise words) can leave duplicate spaces behind. To remedy this side effect I added the 3rd regex replacement to clean them up.

Something like this. ArgumentReplacer is a callback that is called for each match. The return value is substituted into the returned string.
static void Main() {
    string text = "\"one\" *two** and a bit \"three *** four\"";
    string finderRegex = @"
(""[^""]*"") # quoted
| ([^\s""*]*\*[^\s""]*) # with asterisks
| ([^\s""]+) # without asterisks
";
    // Main is void, so print the result instead of returning it
    Console.WriteLine(Regex.Replace(text, finderRegex, ArgumentReplacer,
        RegexOptions.IgnorePatternWhitespace));
}
public static String ArgumentReplacer(Match theMatch) {
    // Don't touch quoted arguments, and arguments with no asterisks
    if (theMatch.Groups[2].Value.Length == 0)
        return theMatch.Value;
    // Quote arguments with asterisks, and collapse runs of asterisks
    // into a single one. String.Format uses {0} placeholders, not %s.
    return String.Format("\"{0}\"",
        Regex.Replace(theMatch.Value, @"\*\*+", "*"));
}
Alternatives to the left in the pattern have priority over those to the right. This is why I only needed to write [^\s""]+ in the last alternative.
The quotes, on the other hand, are only matched if they occur at the beginning of the argument. They will not be detected if they occur in the middle of the argument, and we must stop before those if they occur.

Given that you wish to match pairs of quotes, I don’t think your language is regular, therefore I don’t think RegEx is a good solution. E.g
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.”
Now they have two problems.
See "When not to use Regex in C# (or Java, C++ etc)"

I've decided to follow the advice of a couple of responses and go with a parser solution. I've tried the regexes contributed so far and they seem to fail in some cases. That's probably an indication that regexes aren't the appropriate solution to this problem. Thanks for all responses.
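For anyone landing here later, a hand-written scanner for this does not need to be large. The following is only a minimal sketch, not the parser the asker ended up with. It assumes that a quote always starts a quoted run that ends at the next quote (or at the end of the input), that the noise words are just the three from the question, and that tokens are rejoined with single spaces. The names QueryNormalizer, Normalize and CollapseAsterisks are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QueryNormalizer
{
    // Assumed noise-word list, taken from the examples in the question.
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "a", "the" };

    public static string Normalize(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            if (input[i] == '"')
            {
                // Quoted string: copy it through unchanged, up to the
                // closing quote or the end of the input.
                int close = input.IndexOf('"', i + 1);
                int end = close < 0 ? input.Length : close + 1;
                tokens.Add(input.Substring(i, end - i));
                i = end;
                continue;
            }

            // Unquoted word: runs until whitespace or a quote.
            int start = i;
            while (i < input.Length && !char.IsWhiteSpace(input[i]) && input[i] != '"') i++;
            string word = input.Substring(start, i - start);

            if (word.IndexOf('*') >= 0)
                tokens.Add("\"" + CollapseAsterisks(word) + "\"");  // quote wildcard words
            else if (!NoiseWords.Contains(word))
                tokens.Add(word);                                   // keep ordinary words
            // unquoted noise words are dropped
        }
        return string.Join(" ", tokens);
    }

    // Replaces each run of '*' with a single '*'.
    private static string CollapseAsterisks(string word)
    {
        var sb = new StringBuilder();
        foreach (char c in word)
        {
            if (c == '*' && sb.Length > 0 && sb[sb.Length - 1] == '*') continue;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling QueryNormalizer.Normalize on the input from the question keeps both quoted strings as they are, turns *two** into "*two*", and drops the unquoted "and" and "a".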

Related

Removing Sub-string with some pattern from a string

I have a string something like JSON format:
XYZ DIV Parameters: width=\"1280\" height=\"720\", session=\"1\"
Now I want to remove width=\"1280\" height=\"720\" from this string.
Note: There can be any number in place of 1280 and 720. So, I can't just replace it with null.
Please tell me how to solve it? Either by Regex or any other better method possible.
Regex to be replaced with empty string:
(width|height)=\\"\d+\\"
Code:
string input = @"XYZ DIV Parameters: width=\""1280\"" height=\""720\"", session=\""1\""";
string output = Regex.Replace(input, @"(width|height)=\\""\d+\\""", string.Empty);
You could do a find and replace using the following regex:
width=\\"\d*\\" replace with a blank string, as well as replacing height=\\"\d*\\" with a blank string.
This removes the entire text of width=\"XYZ\". If you only want to blank out the numbers, replace with a string that suits your needs (width=\"\" for example).
If you can guarantee the width and height will ALWAYS be in that format and ALWAYS follow each other separated by a space, you can combine that into one bigger regex find/replace using width=\\"\d*\\" height=\\"\d*\\".
A little more explanation on the regex so you take something away, not just a quick fix :)
width=\\"\d*\\" breaks down to:
width= pretty simple, just find the text you are looking for to start your removal.
\\" since \ is a special char in regex you have to escape it, then the " char can just follow it up like normal.
\d* digits \d, zero or more of them *. No lazy or possessive modifier is needed here: \d can only match digits, so it stops at the closing \" by itself. (A trailing + as in \d*+ is a possessive quantifier in some regex flavors, not a non-greedy one, and .NET rejects it as a nested quantifier; the lazy form would be \d*?.)
\\" to end the regex out
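As a quick check of the breakdown above, here is a small sketch that applies the width/height pattern to the sample string and also probes which quantifier spellings the Regex constructor accepts. DimensionStripper, Strip and IsValidPattern are names invented for this example, and the statement about \d*+ reflects the .NET behaviour I am aware of rather than anything stated in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DimensionStripper
{
    // Matches width=\"123\" or height=\"123\", where the backslashes are
    // literal characters in the input.
    private static readonly Regex Dimension =
        new Regex(@"(width|height)=\\""\d+\\""");

    public static string Strip(string input)
    {
        return Dimension.Replace(input, string.Empty);
    }

    // Returns false when the Regex constructor rejects the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Note that Strip only removes the two attributes themselves; the spaces and the comma around them stay in the string, so a follow-up cleanup may be wanted depending on what the result is used for.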
This will do it:
string resultString = null;
try {
Regex regexObj = new Regex(@"^(.*?)width=\\"".*?\\"" height=\\"".*?\\""(.*?)$", RegexOptions.IgnoreCase);
resultString = regexObj.Replace(subjectString, @"$1width=\""\"" height=\""\""$2");
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}

Regex doesn't give me expected result

Okay, I give up - time to call upon the regex gurus for some help.
I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.
Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.
Here's my simple code:
const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") | (("+dataLine+@"\r\n)*"+dataLine +")";
Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
return validCSVDataPattern.IsMatch(fileContents);
}
This gives me a regex pattern of
(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) | ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)
However, if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? The C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).
Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.
Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.
For example, this text matches the expression:
foo, bar
and therefore this text also matches:
var result = calculate(foo, bar);
You can see where this is going.
Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect.
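A minimal sketch of the effect, using the single-line part of the pattern from the question. CsvShapeCheck, Unanchored and Anchored are names made up for this example, and the non-capturing groups are my own addition so that the anchors apply to the whole expression.

```csharp
using System.Text.RegularExpressions;

public static class CsvShapeCheck
{
    private const string Word = @"[a-z0-9_\-]+";
    private const string Line = "(?:" + Word + @"\s*,\s*)*" + Word;

    // Without anchors: succeeds as soon as any substring fits the pattern.
    public static readonly Regex Unanchored =
        new Regex(Line, RegexOptions.IgnoreCase);

    // With anchors: the whole input has to fit the pattern.
    public static readonly Regex Anchored =
        new Regex("^(?:" + Line + ")$", RegexOptions.IgnoreCase);
}
```

For the input var result = calculate(foo, bar); the unanchored version reports a match (the word var alone is enough), while the anchored version does not. The plain line foo, bar passes both.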
Here is a better pattern which looks for CSV groups such as XXX, or yyy for one to many in each line:
^([\w\s_\-]*,?)+$
^ - Start of each line
( - a CSV match group start
[\w\s_\-]* - Valid characters: \w (letters, digits and underscore), whitespace \s, and - in each CSV field
,? - maybe a comma
)+ - End of the csv match group, 1 to many of these expected.
That will validate a whole file, line by line for a basic CSV structure and allow for empty ,, situations.
I came up with this regex:
^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$
Tests
asbc_- , khkhkjh, lkjlkjlkj_-, j : PASS
asbc, : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j : PASS
If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use
^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$
Loop through all the lines to see if the file is ok:
const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
foreach (var line in lines)
{
if (!validCSVDataPattern.IsMatch(line))
return false;
}
return true;
}
I think this is what you're looking for:
@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"
The noteworthy changes are:
Added anchors (^ and $), because the regex is totally pointless without them
Removed spaces (which have to match literal spaces, and I don't think that's what you intended)
Replaced the \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots)
The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)
p.s., In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.

Replace with wildcards

I need some advice. Suppose I have the following string: Read Variable
I want to find all pieces of text like this in a string and turn each of them into the following: Variable = MessageBox.Show. So as additional examples:
"Read Dog" --> "Dog = MessageBox.Show"
"Read Cat" --> "Cat = MessageBox.Show"
Can you help me? I need some quick advice on using RegEx in C#. I think it is a job involving wildcards, but I do not know how to use them very well... Also, I need this for a school project tomorrow... Thanks!
Edit: This is what I have done so far and it does not work: Regex.Replace(String, "Read ", " = Messagebox.Show").
You can do this
string ns = Regex.Replace(yourString, @"Read\s+(.*?)(?:\s|$)", "$1 = MessageBox.Show");
\s+ matches 1 to many space characters
(.*?)(?:\s|$) matches 0 to many characters till the first space (i.e \s) or till the end of the string is reached(i.e $)
$1 represents the first captured group i.e (.*?)
You might want to clarify your question... but here goes:
If you want to match the next word after "Read " in regex, use Read (\w*) where \w is the word character class and * is the greedy match operator.
If you want to match everything after "Read " in regex, use Read (.*)$ where . will match all characters and $ means end of line.
With either regex, you can use a replace of $1 = MessageBox.Show as $1 will reference the first matched group (which was denoted by the parenthesis).
Complete code:
replacedString = Regex.Replace(inStr, @"Read (.*)$", "$1 = MessageBox.Show");
The problem with your attempt is that it cannot know that the replacement string should be inserted after your variable. Let's assume that valid variable names contain letters, digits and underscores (which can be conveniently matched with \w). That means any other character ends the variable name. Then you could match the variable name, capture it (using parentheses) and put it in the replacement string with $1:
output = Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");
Note that \s+ matches one or more arbitrary whitespace characters. \w+ matches one or more letters, digits and underscores. If you want to restrict variable names to letters only, this is the place to change it:
output = Regex.Replace(input, @"Read\s+([a-zA-Z]+)", "$1 = MessageBox.Show");
Here is a good tutorial.
Finally, note that in C# it is advisable to write regular expressions as verbatim strings (@"..."). Otherwise, you will have to double escape everything so that the backslashes get through to the regex engine, and that really lessens the readability of the regex.
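To illustrate the last point, here is a small sketch with the same pattern written once as a verbatim string and once as a regular string with doubled backslashes. ReadRewriter and its two methods are hypothetical names used only for this example.

```csharp
using System.Text.RegularExpressions;

public static class ReadRewriter
{
    // Verbatim string: backslashes reach the regex engine as written.
    public static string Rewrite(string input)
    {
        return Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");
    }

    // Regular string: every backslash has to be doubled for the C# compiler.
    public static string RewriteEscaped(string input)
    {
        return Regex.Replace(input, "Read\\s+(\\w+)", "$1 = MessageBox.Show");
    }
}
```

Both methods behave the same; "Read Dog" becomes "Dog = MessageBox.Show" either way. The verbatim form is simply easier to read and to copy into a regex tester.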

regex.replace #number;#

What would be the regex expression to find (PoundSomenumberSemiColonPound) (aka #Number;#)? I used this but not working
string st = Regex.Replace(string1, @"(#([\d]);#)", string.Empty);
You're looking for #\d+;#.
\d matches a single numeric character
+ matches one or more of the preceding character.
(\x23\d+\x3B\x23)
# and / are both used as delimiters around patterns, hence the trouble. Try using the above (when I run into trouble with specific characters I usually revert to their hex equivalents; asciitable.com has a good reference).
EDIT Forgot to group for replacement.
EDITv2 The below worked for me:
String string1 = "sdlfkjsld#132;#sdfsdfsdf#1;#sdfsdfsf#34d;#sdfs";
String string2 = System.Text.RegularExpressions.Regex.Replace(string1, @"(\x23\d+\x3B\x23)", String.Empty);
Console.WriteLine("from: {0}\r\n to: {1}", string1, string2);
Output:
from: sdlfkjsld#132;#sdfsdfsdf#1;#sdfsdfsf#34d;#sdfs
to: sdlfkjsldsdfsdfsdfsdfsdfsf#34d;#sdfs
Press any key to continue . . .
You don't need a character class when using \d, and as SLaks points out you need + to match one or more digits. Also, since you're not capturing anything the parentheses are redundant too, so something like this should do it
string st = Regex.Replace(string1, @"#\d+;#", string.Empty);
You may need to escape the # symbols, since they're interpreted as comment markers in some contexts (in .NET, only when RegexOptions.IgnorePatternWhitespace is set), in addition to @SLaks' comment about using + to allow multiple digits.
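On the question of whether # has to be escaped, a short sketch may help. HashTokenStripper and its methods are names invented here. The behaviour shown assumes the standard .NET rule that # only starts a comment when RegexOptions.IgnorePatternWhitespace is in effect.

```csharp
using System.Text.RegularExpressions;

public static class HashTokenStripper
{
    // Default options: '#' is an ordinary character, no escaping needed.
    public static string Strip(string s)
    {
        return Regex.Replace(s, @"#\d+;#", string.Empty);
    }

    // IgnorePatternWhitespace: '#' would start a comment, so it is escaped.
    public static string StripVerbose(string s)
    {
        return Regex.Replace(s, @"\# \d+ ; \#", string.Empty,
            RegexOptions.IgnorePatternWhitespace);
    }

    // Same option but unescaped '#': the whole pattern is read as a comment,
    // which leaves an empty pattern and therefore an unchanged string.
    public static string StripVerboseUnescaped(string s)
    {
        return Regex.Replace(s, @"#\d+;#", string.Empty,
            RegexOptions.IgnorePatternWhitespace);
    }
}
```

On the sample string from the answer above, Strip and StripVerbose both remove #132;# and #1;# and leave #34d;# in place, because the d breaks the run of digits.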

Regex for SQL query gives an empty MatchCollection

I try to keep it brief and concise. I have to write a program that takes queries in SQL form and searches an XML. Right now I am trying to disassemble a string into logical pieces so I can work with them. I have a string as input and want to get a MatchCollection as output.
Please note that the test string below is of a special format that I impose on the user to keep things simple. Only one statement per line is permitted and nested queries are excluded.
string testString = "select apples \n from dblp \r where we ate \n group by all of them \r HAVING NO SHAME \n";
I use Regex with the following pattern:
Regex reg = new Regex(@"(?<select> \A\bselect\b .)" +
@"(?<from> ^\bfrom\b .)" +
@"(?<where> ^\bwhere\b .)" +
@"(?<groupBy> ^\bgroup by\b .)" +
@"(?<having> ^\bhaving\b .)"
, RegexOptions.IgnoreCase | RegexOptions.Multiline
);
As far as I know this should give me matches for every group with the test string. I would be looking for an exact match of "select" at the start of each line followed by any characters except newlines.
Now I create the collection:
MatchCollection matches = reg.Matches(testString);
To make sure it worked, I used a foreach and printed the matches like this:
foreach(Match match in matches)
{
Console.WriteLine("Select: {0}", match.Groups["select"]);
//and so on
}
The problem is that the collection is always empty. There must be a flaw in the Regex somewhere but I am too inexperienced to find it. Could you please assist me? Thank you very much!
I tried using .* instead of just . until I was told that . would even match multiple characters. I have no doubt that this could be a problem, but even after replacing it I get no result.
I fail to see why it is so difficult to match a line starting with a defined word and having any characters appended to it until the regex finds a newline. Seems to me that this should be a relatively easy task.
I think you need to explicitly match the line terminators, as well as handle spaces better as others have suggested. Assuming the user can choose between \r and \n, try
@"(?<select>\Aselect .+)[\n\r]" +
@"(?<from>\s*from .+)[\n\r]" +
@"(?<where>\s*where .+)[\n\r]" +
@"(?<groupBy>\s*group by .+)[\n\r]" +
@"(?<having>\s*having .+)[\n\r]"
As long as you are using regular expressions, you probably want to do a bit better:
@"\Aselect (?<select>.+)[\n\r]" +
@"\s*from (?<from>.+)[\n\r]" +
@"\s*where (?<where>.+)[\n\r]" +
@"\s*group by (?<groupBy>.+)[\n\r]" +
@"\s*having (?<having>.+)[\n\r]"
My biggest problem with regular expressions for this sort of use is that the only error message you can give is that things failed. You can't give the user any further information about what they did wrong.
There may be a problem with the newline matching: is it LF (Unix standard), CR (MacOS), or CR LF (Windows)? If you don't know, perhaps you should match it with: [\n\r]+
edit: You included some whitespace in your test string, surrounding the newlines, that you don't account for in your regex.
(?<from>^\s*from\b.*[\n\r]+$)
As you said, it's easy enough to match the keyword(s) and then use (.+) to match the rest of the line. But you have to match all of the intervening characters, and you aren't doing that. (The ^ line anchor matches the position following the line separator, not the separator itself.) You can use \s+ to consume the line separator as well as any leading whitespace on the next line.
@"select\s+(?<select>.+)\s+" +
@"from\s+(?<from>.+)\s+" +
@"where\s+(?<where>.+)\s+" +
@"group by\s+(?<groupBy>.+)\s+" +
@"having\s+(?<having>.+)";
I also rearranged things so that the SQL keywords aren't captured; that seems redundant, since you're using named groups.
I haven't tried to build a working regex for you, but I can see several issues. Others pointed out the first two issues, but not the third one.
You can't use a single dot to match the variable parts such as "apples". Try \w+ or \S+
Your string has embedded line breaks. You need to match those with [\r\n]+ or \s+
The .NET regex engine treats \n as a line break, but NOT \r or \r\n. Thus, ^ will match after \n, but NOT after \r. If you do step 2, you don't need the anchors anyway, so remove them.
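Putting the points from these answers together, here is a rough sketch that uses the \s+ based pattern suggested above and also shows the ^ behaviour with \r versus \n. SqlClauseParser, Parse and LineStartsWithFrom are hypothetical names. Trimming the captured values is my own addition, since . in .NET also matches \r and the captures can therefore carry trailing whitespace.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SqlClauseParser
{
    private static readonly string[] Names =
        { "select", "from", "where", "groupBy", "having" };

    // \s+ consumes the line separators and any padding around them,
    // so no ^ or $ anchors are needed.
    private static readonly Regex Clauses = new Regex(
        @"select\s+(?<select>.+)\s+" +
        @"from\s+(?<from>.+)\s+" +
        @"where\s+(?<where>.+)\s+" +
        @"group by\s+(?<groupBy>.+)\s+" +
        @"having\s+(?<having>.+)",
        RegexOptions.IgnoreCase);

    // Returns the clause bodies keyed by group name, or null if the
    // input does not have the expected five-clause shape.
    public static Dictionary<string, string> Parse(string sql)
    {
        Match m = Clauses.Match(sql);
        if (!m.Success) return null;

        var result = new Dictionary<string, string>();
        foreach (string name in Names)
            result[name] = m.Groups[name].Value.Trim();
        return result;
    }

    // With RegexOptions.Multiline, ^ matches after \n but not after a lone \r.
    public static bool LineStartsWithFrom(string text)
    {
        return Regex.IsMatch(text, @"^from", RegexOptions.Multiline);
    }
}
```

For the test string from the question, Parse yields apples, dblp, we ate, all of them and NO SHAME for the five groups. As noted in one of the answers, a single regex like this can only report success or failure, so it cannot tell the user which clause was malformed.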
