I try to keep it brief and concise. I have to write a program that takes queries in SQL form and searches an XML. Right now I am trying to disassemble a string into logical pieces so I can work with them. I have a string as input and want to get a MatchCollection as output.
Please note that the test string below is in a special format that I impose on the user to keep things simple. Only one statement per line is permitted and nested queries are excluded.
string testString = "select apples \n from dblp \r where we ate \n group by all of them \r HAVING NO SHAME \n";
I use Regex with the following pattern:
Regex reg = new Regex(@"(?<select> \A\bselect\b .)" +
@"(?<from> ^\bfrom\b .)" +
@"(?<where> ^\bwhere\b .)" +
@"(?<groupBy> ^\bgroup by\b .)" +
@"(?<having> ^\bhaving\b .)"
, RegexOptions.IgnoreCase | RegexOptions.Multiline
);
As far as I know this should give me matches for every group with the test string. I would be looking for an exact match of "select" at the start of each line followed by any characters except newlines.
Now I create the collection:
MatchCollection matches = reg.Matches(testString);
To make sure it worked, I used a foreach and printed the matches like this:
foreach(Match match in matches)
{
Console.WriteLine("Select: {0}", match.Groups["select"]);
//and so on
}
The problem is that the collection is always empty. There must be a flaw in the Regex somewhere but I am too inexperienced to find it. Could you please assist me? Thank you very much!
I tried using .* instead of just . until I was told that . would even match multiple characters. I have no doubt that this could be a problem, but even after replacing it I get no result.
I fail to see why it is so difficult to match a line starting with a defined word and having any characters appended to it until the regex finds a newline. Seems to me that this should be a relatively easy task.
I think you need to explicitly match the line terminators, as well as handle spaces better as others have suggested. Assuming the user can choose between \r and \n, try
@"(?<select>\Aselect .+)[\n\r]" +
@"(?<from>\s*from .+)[\n\r]" +
@"(?<where>\s*where .+)[\n\r]" +
@"(?<groupBy>\s*group by .+)[\n\r]" +
@"(?<having>\s*having .+)[\n\r]"
As long as you are using regular expressions, you probably want to do a bit better:
@"\Aselect (?<select>.+)[\n\r]" +
@"\s*from (?<from>.+)[\n\r]" +
@"\s*where (?<where>.+)[\n\r]" +
@"\s*group by (?<groupBy>.+)[\n\r]" +
@"\s*having (?<having>.+)[\n\r]"
My biggest problem with regular expressions for this sort of use is that the only error message you can give is that things failed. You can't give the user any further information about what they did wrong.
There may be a problem with the newline matching: is it LF (Unix standard), CR (MacOS), or CR LF (Windows)? If you don't know, perhaps you should match it with: [\n\r]+
Edit: You included some whitespace in your test string, surrounding the newlines, that you don't account for in your regex.
(?<from>^\s*from\b.*[\n\r]+$)
As you said, it's easy enough to match the keyword(s) and then use (.+) to match the rest of the line. But you have to match all of the intervening characters, and you aren't doing that. (The ^ line anchor matches the position following the line separator, not the separator itself.) You can use \s+ to consume the line separator as well as any leading whitespace on the next line.
@"select\s+(?<select>.+)\s+" +
@"from\s+(?<from>.+)\s+" +
@"where\s+(?<where>.+)\s+" +
@"group by\s+(?<groupBy>.+)\s+" +
@"having\s+(?<having>.+)";
I also rearranged things so that the SQL keywords aren't captured; that seems redundant, since you're using named groups.
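As a minimal sketch of the \s+ approach, assuming the one-clause-per-line format from the question: the class name ClauseParser, the \A and \z anchors, and the lazy .+? quantifiers are my own additions (the lazy form keeps trailing spaces out of the captured values), so treat this as an illustration rather than a finished parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClauseParser
{
    // Keywords are matched but not captured. \s+ consumes the line
    // separator (\r or \n) together with any spaces around it.
    private static readonly Regex ClauseRegex = new Regex(
        @"\A\s*select\s+(?<select>.+?)\s+" +
        @"from\s+(?<from>.+?)\s+" +
        @"where\s+(?<where>.+?)\s+" +
        @"group by\s+(?<groupBy>.+?)\s+" +
        @"having\s+(?<having>.+?)\s*\z",
        RegexOptions.IgnoreCase);

    // Returns a single Match; check Success before reading the groups.
    public static Match Parse(string query)
    {
        return ClauseRegex.Match(query);
    }
}
```

With the test string from the question, ClauseParser.Parse(testString).Groups["where"].Value should come back as "we ate". Note that this still only tells you whether the whole query matched, not which clause was wrong.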
I haven't tried to build a working regex for you, but I can see several issues. Others pointed out the first two issues, but not the third one.
1. You can't use a single dot to match the variable parts such as "apples". Try \w+ or \S+.
2. Your string has embedded line breaks. You need to match those with [\r\n]+ or \s+.
3. The .NET regex engine treats \n as a line break, but NOT \r or \r\n. Thus, ^ will match after \n, but NOT after \r. If you do step 2, you don't need the anchors anyway, so remove them.
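The third point is easy to check for yourself. Here is a small sketch (the class and method names are made up for the demonstration) that compares ^ in Multiline mode against a lookbehind that also accepts \r as a line separator:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineAnchorDemo
{
    // Words at positions where ^ matches in Multiline mode
    // (start of input, or right after \n).
    public static string[] WithCaret(string text)
    {
        return Regex.Matches(text, @"^\w+", RegexOptions.Multiline)
                    .Cast<Match>().Select(m => m.Value).ToArray();
    }

    // Words at the start of input or right after either \r or \n.
    public static string[] WithLookbehind(string text)
    {
        return Regex.Matches(text, @"(?<=\A|[\r\n])\w+")
                    .Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

For the input "one\rtwo\nthree", WithCaret skips "two" because it follows a bare \r, while WithLookbehind finds all three words.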
I have string that I would like to remove any word following a "\", whether in the middle or at the end, such as:
testing a\determiner checking test one\pronoun
desired result:
testing a checking test one
I have tried a simple regex that removes anything between the backslash and whitespace, but it gives the following result:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*\s");
string output = regex.Replace(input, " ");
Result:
testing a one\pronoun
It looks like this regex matches from the backslash until the last whitespace in the string. I cannot seem to figure out how to match from the backslash to the next whitespace. Also, I am not guaranteed a whitespace at the end, so I would need to handle that. I could continue processing the string and remove any text after the backslash, but I was hoping I could handle both cases in one step.
Any advice would be appreciated.
Change .*, which matches any characters, to \w*, which only matches word characters.
Regex regex = new Regex(@"\\\w*");
string output = regex.Replace(input, "");
".*" matches zero or more characters of any kind. Consider using "\w+" instead, which matches one or more "word" characters (not including whitespace).
Using "+" instead of "*" would allow a backslash followed by a non-"word" character to remain unmatched. For example, no matches would be found in the sentence "Sometimes I experience \ an uncontrollable compulsion \ to intersperse backslash \ characters throughout my sentences!"
With your current pattern, .* tells the parser to be "greedy," that is, to take as much of the string as possible until it hits a space. Adding a ? right after that * tells it instead to make the capture as small as possible--to stop as soon as it hits the first space.
Next, you want to end at not just a space, but at either a space or the end of the string. The $ symbol captures the end of the string, and | means or. Group those together using parentheses and your group collectively tells the parser to stop at either a space or the end of the string. Your code will look like this:
string input = @"testing a\determiner checking test one\pronoun";
Regex regex = new Regex(@"\\.*?(\s|$)");
string output = regex.Replace(input, " ");
Try this regex (\\[^\s]*)
(\\[^\s]*)
1st Capturing group (\\[^\s]*)
\\ matches the character \ literally
[^\s]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ].
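To make the negated-class idea concrete, here is a short sketch. It uses \S* (equivalent to [^\s]*) and replaces with an empty string so the existing spaces between words are kept; the class name BackslashTagStripper is just a placeholder.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackslashTagStripper
{
    // Removes a backslash and every non-whitespace character after it.
    // Works the same whether the tag is mid-string or at the very end.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"\\\S*", "");
    }
}
```

Because the pattern never consumes whitespace, there is no need for a separate end-of-string case.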
Consider the following string fragment:
var someInput = ..... +
"admin-state : up" +
"opr-state/tx-rate-ds : up :32093" +
"cur-op-mode : g993-2-8d" +
"tx-rate-us : 5048" +
"tx-rate-ds : 32093" +
"noise-margin-down : 204" +
"noise-margin-up : 165" +
"actual-tps-tc-mode : ptm" +
"overrule-state : not-created" +
.....;
I am trying to extract the three sections of the line:
"opr-state/tx-rate-ds : up :32093"
I am using regexstorm to try out my expressions. And to get each of the values I came up with these:
@"(?<paramName>opr-.[^\s]*)" // Gets "opr-state/tx-rate-ds"
@"opr.*:\s*(?<middle>.*(?=:))" // Gets "up"
@"opr.*:\s*.*:(?<value>[\d]*)" // Gets 32093
The problem is that this works when each line of the input is considered independently, but I am getting the input as a single string. That is basically the same as running the regex in single-line mode in the tester, so the results I get in the application are as follows:
@"(?<paramName>opr-.[^\s]*)" // Gets "opr-state/tx-rate-ds"
@"opr.*:\s*(?<middle>.*(?=:))" // Gets everything from the first ": up" until the last ":" before "not-created"
@"opr.*:\s*.*:(?<value>[\d]*)" // Gets 32093
So trying to phrase what I want this expression to do would be something like:
In a single string, find whatever is between opr.*:\s* and the
following colon
So far I've tried changing the options on the Match method to run it as Singleline and changing the expression to opr.*:\s*(?<middle>[^:]) but none of those have worked.
I really suck at regular expressions, please help.
Thank you.
The problem you're facing is that the regex engine is greedy by default. Any quantifier, such as *, ?, or {n,m}, will try to match as much as it can, only backtracking if the rest of the pattern doesn't match. I find this article quite useful for understanding the internals:
Watch Out for The Greediness!.
Solution:
Use lazy quantifiers adding an extra ? immediately afterwards. Examples:
.*?
\s+?
[a-z]{5,}?
These will try to match as little as they can, only consuming more characters when the engine backtracks.
In your case, it works if you modify the expression to opr.*?:\s*(?<middle>[^:]+)
However, let's try a different approach. In regular expressions, it helps to be as specific as you can. If you look at it from another angle, all you're trying to match in every token are characters except colons (:) or, even better, anything except colons and whitespace.
Code:
Regex regex = new Regex(@"(?<paramName> opr-[^\s:]+ ) # literal `opr-` followed by any chars except whitespace or `:`
\s*:\s* # separator: literal `:` optionally surrounded by any number of whitespace chars
(?<middle> [^\s:]+ ) # any chars except whitespace or `:`
\s*:\s* # separator
(?<value> \d+ ) # 1 or more digits (an integer)
"
, RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match ItemMatch in regex.Matches(someInput))
{
Console.WriteLine("{0}\t{1}\t{2}",
ItemMatch.Groups["paramName"].Value,
ItemMatch.Groups["middle"].Value,
ItemMatch.Groups["value"].Value);
}
*Notice I used RegexOptions.IgnorePatternWhitespace to ignore spaces in the pattern, and to allow the comments.
The [^\s:]+ is a character class to match all characters, except:
\s whitespace
: a literal colon
Using that construct, you don't need to worry about greediness.
Use non-greedy repetition:
@"opr.*?:\s*(?<middle>.*?(?=:))"
.* tries to match as many characters as possible. .*? will make it match only as little as it needs. And given that you have set clear boundaries (:), little is just enough.
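A quick way to compare the two variants side by side is sketched below. The helper name and the shortened one-line input are my own assumptions, and the captured value is trimmed because .*? stops right before the colon and therefore keeps the space in front of it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    public const string Greedy = @"opr.*:\s*(?<middle>.*(?=:))";
    public const string Lazy = @"opr.*?:\s*(?<middle>.*?(?=:))";

    // Returns the trimmed "middle" group, or null when nothing matches.
    public static string Middle(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["middle"].Value.Trim() : null;
    }
}
```

On the input "opr-state/tx-rate-ds : up :32093 tx-rate-us : 5048", the lazy pattern yields "up", while the greedy one runs on to a later colon and yields "32093 tx-rate-us".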
I'm trying to learn regex, but still have no clue. I have this line of code, which successfully separates the placeholder 'FirstWord' by the '{' delimiter from all following text:
var regexp = new Regex(@"(?<FirstWord>.*?)\{(?<TextBetweenCurlyBrackets>.*?)\}");
Which reads this string with no problem:
Greetings{Hello World}
What I want to do is to replace the '{' with a character chain like for instance '/>>'
so I tried this:
var regexp = new Regex(@"(?<FirstWord>.*?)\/>>(?<OtherText>.*?)\");
I removed the last bracket and replaced the first one with '/>>', but it throws an ArgumentException. What would the correct character combination look like?
/ does not need to be escaped, unless you use it as the pattern delimiter:
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)\"
Also, your trailing \ leaves the pattern ending in an incomplete escape sequence, which is what raises the ArgumentException. Remove it:
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)"
And since you most likely want to fetch until the END of the string (.*? will fetch as few characters as required to satisfy the expression), you should use $ at the end or some other sort of delimiter (whitespace, line break, etc.).
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)$"
Example:
(.*?)/>>(.*?)$
Removing the trailing $ will yield the empty string for the second match group, because "" is the shortest string that satisfies the expression .*?
(.*?)/>>(.*?)$ on This/>>Test One will match This and Test One
(.*?)/>>(.*?)\s on This/>>Test One will match This and Test
(.*?)/>>(.*?) on This/>>Test One will match This and ""
Note: I'm saying "" is the shortest string possible satisfying the expression .*? on purpose! A frequent mistake is to interpret .*?a as "everything until a":
Regex is greedy by default!
Searching for the expression (.*?)a$ on "caba" will NOT capture just b. It will capture cab, because cab followed by a satisfies the expression AND cab is the shortest string possible for a match starting at that position.
One might also expect b to be matched, but the regex engine works from left to right, hence it stops once it has found cab, even though b would be shorter.
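The cases above can be reproduced with a tiny helper. This is only a sketch; the class name and the string-array return shape are assumptions made for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyGroupDemo
{
    // Returns capture groups 1 and 2 of the first match,
    // or null when the pattern does not match at all.
    public static string[] Groups(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }
}
```

Calling LazyGroupDemo.Groups("This/>>Test One", @"(.*?)/>>(.*?)$") gives "This" and "Test One"; dropping the $ leaves the second group empty, and Regex.Match("caba", "(.*?)a$").Groups[1].Value is "cab".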
I'm currently facing a (little) blocking issue. I'd like to replace one substring with another using a regular expression. But here is the catch: I suck at regex.
Regex.Replace(contenu, "Request.ServerVariables("*"))",
"ServerVariables('test')");
Basically I'd like to replace whatever is between the " by "test". I tried ".{*}" as a pattern but it doesn't work.
Could you give me some tips, I'd appreciate it!
There are several issues you need to take care of.
1. You are using special characters in your regex (., parens, quotes), so you need to escape these with a backslash. And you need to escape the backslashes with another backslash as well because we're in a C# string literal, unless you prefix the string with @, in which case the escaping rules are different.
2. The expression to match "any number of whatever characters" is .*. In this case, you would want to match any number of non-quote characters, which is [^"]*.
3. In contrast to (1) above, the replacement string is not a regular expression, so you don't want any backslashes there.
4. You need to store the return value of the replace somewhere.
The end result is
var result = Regex.Replace(contenu,
@"Request\.ServerVariables\(""[^""]*""\)",
"Request.ServerVariables('test')");
Based purely on my knowledge of regex (and not how they are done in C#), the pattern you want is probably:
"[^"]*"
ie - match a " then match everything that's not a " then match another "
You may need to escape the double-quotes to make your regex-parser actually match on them... that's what I don't know about C#
Try to avoid where you can the '.*' in regex, you can usually find what you want to get by avoiding other characters, for example [^"]+ not quoted, or ([^)]+) not in parenthesis. So you may just want "([^"]+)" which should give you the whole thing in [0], then in [1] you'll find 'test'.
You could also just replace '"' with '' I think.
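For illustration, here is a small sketch that combines the escaped pattern with a named capturing group, so you can both rewrite the call and read out what was between the quotes. The class name, method names, and the group name "name" are placeholders of mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ServerVariableRewriter
{
    // In a verbatim string, "" stands for one literal double quote.
    private static readonly Regex Call = new Regex(
        @"Request\.ServerVariables\(""(?<name>[^""]*)""\)");

    // Replaces every quoted argument with 'test'.
    public static string Rewrite(string source)
    {
        return Call.Replace(source, "Request.ServerVariables('test')");
    }

    // Returns the first quoted argument, or null if there is none.
    public static string FirstName(string source)
    {
        Match m = Call.Match(source);
        return m.Success ? m.Groups["name"].Value : null;
    }
}
```

Remember that Regex.Replace returns a new string, so the result has to be assigned to something.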
Taryn East's regex includes the *. You should remove it, if it is just a placeholder for any value:
"[^"]"
BTW: You can test this regex with this cool editor: http://rubular.com/r/1MMtJNF3kM
How can I replace lone instances of \n with \r\n (LF alone with CRLF) using a regular expression in C#?
I know how to do it using plain String.Replace, like:
myStr = myStr.Replace("\n", "\r\n");
myStr = myStr.Replace("\r\r\n", "\r\n");
However, this is inelegant, and would destroy any "\r+\r\n" already in the text (although they are not likely to exist).
It might be faster if you use this.
(?<!\r)\n
It basically looks for any \n that is not preceded by a \r. This would most likely be faster, because in the other case, almost every letter matches [^\r], so it would capture that, and then look for the \n after that. In the example I gave, it would only stop when it found a \n, and then look before that to see if it found \r.
Will this do?
[^\r]\n
Basically it matches a '\n' that is preceded with a character that is not '\r'.
If you want it to detect lines that start with just a single '\n' as well, then try
([^\r]|^)\n
which says that it should match a '\n', but only those that are the first character of a line or those that are not preceded by '\r'.
There might be special cases to check: since you're messing with the definition of lines itself, the '^' might not work too well. But I think you should get the idea.
EDIT: credit @Kibbee. Using a lookbehind is clearly better, since it won't capture the matched preceding character and should help with any edge cases as well. So here's a better regex, and the code becomes:
myStr = Regex.Replace(myStr, "(?<!\r)\n", "\r\n");
I was trying to do the code below to a string and it was not working.
myStr.Replace("(?<!\r)\n", "\r\n")
I used Regex.Replace and it worked
Regex.Replace( oldValue, "(?<!\r)\n", "\r\n")
I guess that "myStr" is an object of type String; in that case, String.Replace does not treat its arguments as a regex.
\r and \n are the equivalents for CR and LF.
My best guess is that if you know that you have an \n for EACH line, no matter what, then you should first strip out every \r. Then replace all \n with \r\n.
The answer chakrit gives would also work, but then you need to use regex, and you don't say what "myStr" is...
Edit: Looking at the other examples tells me one thing: why do things the difficult way when you can do them the easy way? Just because regex exists doesn't mean you must use it :D
Edit 2: A tool is very valuable when fiddling with regex, XPath, and whatnot that give you strange results. May I point you to: http://www.regexbuddy.com/
myStr = Regex.Replace(myStr, "([^\r])\n", "$1\r\n");
$ may need to be a \
Try this: Replace(Char.ConvertFromUtf32(10), Char.ConvertFromUtf32(13) + Char.ConvertFromUtf32(10))
If I know the line endings must be one of CRLF or LF, something that works for me is
myStr = Regex.Replace(myStr, "\r?\n", "\r\n");
This essentially does the same as neslekkiM's answer, except it performs only one replace operation on the string rather than two. It is also compatible with regex engines that don't support negative lookbehinds or backreferences.
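Both single-pass variants from this thread can be wrapped up as below. This is a sketch with made-up names; either method should leave existing CRLF pairs untouched and only upgrade lone LFs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndings
{
    // Negative lookbehind: match \n only when it is not preceded by \r.
    public static string ToCrLfLookbehind(string text)
    {
        return Regex.Replace(text, @"(?<!\r)\n", "\r\n");
    }

    // Optional \r: match LF or CRLF and rewrite both as CRLF.
    public static string ToCrLfOptional(string text)
    {
        return Regex.Replace(text, @"\r?\n", "\r\n");
    }
}
```

Note the use of Regex.Replace here. String.Replace treats its arguments as literal text, which is why passing it a pattern has no effect.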