How can I replace lone instances of \n with \r\n (LF alone with CRLF) using a regular expression in C#?
I know how to do it using plain String.Replace, like:
myStr = myStr.Replace("\n", "\r\n");
myStr = myStr.Replace("\r\r\n", "\r\n");
However, this is inelegant, and would destroy any "\r+\r\n" already in the text (although they are not likely to exist).
It might be faster if you use this.
(?<!\r)\n
It basically looks for any \n that is not preceded by a \r. This would most likely be faster because, in the other case, almost every character matches [^\r], so the engine would capture that character and then look for the \n after it. In the example I gave, it only stops when it finds a \n, and then looks before it to see whether there is a \r.
Will this do?
[^\r]\n
Basically it matches a '\n' that is preceded by a character that is not '\r'.
If you want it to detect lines that start with just a single '\n' as well, then try
([^\r]|$)\n
This says that it should match a '\n', but only one that is the first character of a line or one that is not preceded by '\r'.
There might be special cases to check: since you're messing with the definition of lines itself, the '$' might not work too well. But I think you should get the idea.
EDIT: credit to @Kibbee. Using a lookaround (a negative look-behind here) is clearly better, since it won't capture the matched preceding character and should help with any edge cases as well. So here's a better regex, and the code becomes:
myStr = Regex.Replace(myStr, "(?<!\r)\n", "\r\n");
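A minimal sketch of the look-behind approach as a complete program. NormalizeToCrLf is a hypothetical helper name used only for illustration, and the sample string is made up:

```csharp
using System;
using System.Text.RegularExpressions;

// Converts every lone LF to CRLF and leaves existing CRLF pairs untouched.
// NormalizeToCrLf is a hypothetical helper name, not part of .NET.
string NormalizeToCrLf(string text)
{
    // (?<!\r)\n : a "\n" that is not immediately preceded by "\r".
    return Regex.Replace(text, @"(?<!\r)\n", "\r\n");
}

string mixed = "one\ntwo\r\nthree\n";
string normalized = NormalizeToCrLf(mixed);

// Strings are immutable, so the result has to be assigned; 'mixed' is unchanged.
Console.WriteLine(normalized == "one\r\ntwo\r\nthree\r\n"); // True
```

Because the look-behind consumes no characters, only the "\n" itself is replaced, so no backreference is needed in the replacement string.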
I tried to apply the code below to a string, and it was not working.
myStr.Replace("(?<!\r)\n", "\r\n")
I used Regex.Replace instead, and it worked:
Regex.Replace( oldValue, "(?<!\r)\n", "\r\n")
I guess that "myStr" is an object of type String; in that case, Replace does not treat its argument as a regex.
\r and \n are the equivalents for CR and LF.
My best guess is that if you know that you have an \n for EACH line, no matter what, then you first should strip out every \r. Then replace all \n with \r\n.
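The two-step idea above can be sketched without any regex. ToCrLfPlain is a hypothetical helper name, and the sketch assumes every line really does end in \n, as stated:

```csharp
using System;

// Two plain String.Replace passes, no regex involved.
// ToCrLfPlain is a hypothetical helper name.
string ToCrLfPlain(string text)
{
    return text
        .Replace("\r", "")      // 1) drop every CR, leaving only LF line ends
        .Replace("\n", "\r\n"); // 2) turn every LF into CRLF
}

Console.WriteLine(ToCrLfPlain("a\nb\r\nc") == "a\r\nb\r\nc"); // True
```

Note that the first pass removes every \r, so text that uses a lone \r as its line break would lose those breaks.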
The answer chakrit gives would also work, but then you need to use regex, and you don't say what "myStr" is...
Edit: looking at the other examples tells me one thing: why do things the difficult way when you can do them the easy way? Just because regex exists does not mean you must use it :D
Edit2: A tool is very valuable when fiddling with regex, XPath, and whatnot that give you strange results. May I point you to: http://www.regexbuddy.com/
myStr = Regex.Replace(myStr, "([^\r])\n", "$1\r\n");
$ may need to be a \ in other regex engines; .NET uses $1 in the replacement string.
Try this: myStr = myStr.Replace(Char.ConvertFromUtf32(10), Char.ConvertFromUtf32(13) + Char.ConvertFromUtf32(10)), where 10 is LF and 13 is CR.
If I know the line endings must be one of CRLF or LF, something that works for me is
myStr = Regex.Replace(myStr, "\r?\n", "\r\n");
This essentially does the same as neslekkiM's answer, except that it performs only one replace operation on the string rather than two. It is also compatible with regex engines that don't support negative lookbehinds or backreferences.
Related
I'm currently facing a (little) blocking issue. I'd like to replace a substring with another one using a regular expression. But here is the catch: I suck at regex.
Regex.Replace(contenu, "Request.ServerVariables("*"))",
"ServerVariables('test')");
Basically I'd like to replace whatever is between the " by "test". I tried ".{*}" as a pattern but it doesn't work.
Could you give me some tips, I'd appreciate it!
There are several issues you need to take care of.
1. You are using special characters in your regex (., parens, quotes); you need to escape these with a backslash. You also need to escape each backslash with another backslash because we're in a C# string literal, unless you prefix the string with @, in which case the escaping rules are different.
2. The expression to match "any number of whatever characters" is .*. In this case, you would want to match any number of non-quote characters, which is [^"]*.
3. In contrast to (1) above, the replacement string is not a regular expression, so you don't want any backslashes there.
4. You need to store the return value of the replace somewhere.
The end result is
var result = Regex.Replace(contenu,
@"Request\.ServerVariables\(""[^""]*""\)",
"Request.ServerVariables('test')");
Based purely on my knowledge of regex (and not how they are done in C#), the pattern you want is probably:
"[^"]*"
ie - match a " then match everything that's not a " then match another "
You may need to escape the double-quotes to make your regex-parser actually match on them... that's what I don't know about C#
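On the C# side, the double quotes can be escaped in two ways: with a backslash in a regular string literal, or by doubling them in a verbatim (@) string literal. A small sketch, using a made-up sample input in the spirit of the question:

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern, "[^"]*", written as two kinds of C# string literal.
string escaped  = "\"[^\"]*\"";   // regular literal: \" is one double quote
string verbatim = @"""[^""]*""";  // verbatim literal: "" is one double quote

Console.WriteLine(escaped == verbatim); // True

// Hypothetical sample input; the variable name 'contenu' comes from the question.
string contenu = "Request.ServerVariables(\"HTTP_HOST\")";
string result = Regex.Replace(contenu, verbatim, "'test'");

Console.WriteLine(result); // Request.ServerVariables('test')
```

The regex engine itself gives double quotes no special meaning, so the escaping here is only for the C# compiler.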
Where you can, try to avoid '.*' in a regex; you can usually get what you want by excluding other characters instead, for example [^"]+ for text that is not quoted, or ([^)]+) for text that is not in parentheses. So you may just want "([^"]+)", which should give you the whole thing in [0], and then in [1] you'll find 'test'.
You could also just replace '"' with '' I think.
Taryn East's regex includes the *. You should remove it if it is just a placeholder for any value:
"[^"]"
BTW: You can test this regex with this cool editor: http://rubular.com/r/1MMtJNF3kM
Alright, Regex gurus, how can I change my logic to fix this one?
I've made a regex:
(,[,]+)
It's supposed to remove extra commas on the end of a line. (end of line being \r\n) when formatted as a string.
It works (sort of).
This is the string:
Date,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24,\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24,,,,,\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24,,,,,\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24,,\r\n
When I run that regex, it gives a result of:
Date,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24,\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24\r\nDate,1-Jul-18,1-Jul-19,1-Jul-20,1-Jul-21,1-Jul-22,1-Jul-23,1-Jul-24\r\n
I need to remove the comma at the end of the first line (I think I need to be finding \r\n and killing any commas before that, until a non-comma).
Any thoughts about how to do this?
Thanks
(,+$) perhaps? (One or more commas followed immediately by the end of a line.)
If your language supports positive lookahead, try this -
([,]*)(?=\\r\\n)
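A sketch of the lookahead idea in C#, assuming the text contains real CR LF characters rather than the literal backslash sequences. StripTrailingCommas is a hypothetical helper name and the input is a shortened version of the question's data. The original pattern (,[,]+) needs at least two commas, which is why the single trailing comma on the first line survives; ,+ also covers that case.

```csharp
using System;
using System.Text.RegularExpressions;

// Removes any run of commas that sits directly before a CRLF line break.
// StripTrailingCommas is a hypothetical helper name.
string StripTrailingCommas(string csv)
{
    // ,+        one or more commas
    // (?=\r\n)  followed by CRLF, which is checked but not consumed
    return Regex.Replace(csv, @",+(?=\r\n)", "");
}

string input = "Date,1-Jul-18,\r\nDate,1-Jul-18,,,,\r\n";
Console.WriteLine(StripTrailingCommas(input) == "Date,1-Jul-18\r\nDate,1-Jul-18\r\n"); // True
```

In .NET, $ with RegexOptions.Multiline matches just before \n, not before \r\n, so a pattern like ,+$ would likely miss commas that are followed by \r\n; the explicit lookahead avoids that.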
I think you can match one or more , followed by \r\n by using ,+\\r\\n. I don't know how to do that replacement in C#, sorry. In Perl I would do
perl -pi -e 's/,+\\r\\n/\\r\\n/g' c.txt
(assuming that c.txt is a file containing your input text).
Hola. I'm failing to write a method to test for words within a plain text or html document. I was reasonably literate with regex, and I am newer to c# (from way more java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext, @"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes. for example, the distinct value "c++".
edit:
a real excerpt is given below:
"...administration- c, c++, perl, shell programming..."
The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character.
If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
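One possible variation on the (^|\W) and (\W|$) idea, not taken from the answer itself, is to use lookarounds so that the neighbouring characters are checked without being consumed. A minimal sketch; ContainsToken is a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

// ContainsToken is a hypothetical helper name.
bool ContainsToken(string text, string token)
{
    // (?<!\w) : no word character directly before the token
    // (?!\w)  : no word character directly after the token
    string pattern = @"(?<!\w)" + Regex.Escape(token) + @"(?!\w)";
    return Regex.IsMatch(text, pattern);
}

string excerpt = "...administration- c, c++, perl, shell programming...";
Console.WriteLine(ContainsToken(excerpt, "c++"));      // True
Console.WriteLine(ContainsToken("abc++ only", "c++")); // False
```

Unlike \b, these assertions also succeed at the very start or end of the input and next to punctuation, so a token that ends in a non-word character such as "+" can still be found.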
Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.
I don't claim to be a RegEx guru at all, and I am a bit confused on what this statement is doing. I am trying to refactor and this is being called on a key press and eating a lot of CPU.
Regex.Replace(_textBox.Text, "(?<!\r)\n", Environment.NewLine);
Thanks.
The regular expression (?<!\r)\n will match any \n character that is not preceded by a \r character. The syntax (?<!expr) is a negative look-behind assertion and means that expr must not match the part that’s before the current position.
In addition to the answers explaining what the regex does (match all \n's without a \r before it), I'd just like to point out that this use of Replace() is most likely never necessary, unless you have users hellbent on typing just \n's somehow. And even then, you probably don't need it on the keypress, just when the text as a whole is used (i.e. after the data is submitted somehow).
And if that was put in there to sanitize copy-pasted text, then you can refactor it to only run when a large amount of the text has been changed.
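If the call has to stay, one way to make it cheaper is to build the Regex once and skip it when the text contains no \n at all. This is only a sketch: NormalizeNewlines and loneLf are hypothetical names, and whether it helps in practice would need measuring.

```csharp
using System;
using System.Text.RegularExpressions;

// Built once and reused by every call below.
Regex loneLf = new Regex(@"(?<!\r)\n");

// NormalizeNewlines is a hypothetical helper name.
string NormalizeNewlines(string text)
{
    // Cheap early exit: with no "\n" at all there is nothing to replace.
    if (text.IndexOf('\n') < 0)
        return text;

    return loneLf.Replace(text, Environment.NewLine);
}

Console.WriteLine(NormalizeNewlines("no line breaks here"));
```

The helper could then be called when the text is submitted rather than on every key press.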
It's replacing every \n that is not preceded by a \r with the Environment.NewLine string. This string is the platform-specific newline (on Windows it will be the string "\r\n").
The call replaces any newline character \n that isn't preceded by a carriage return character \r with the platform-specific NewLine character(s).
The NewLine character is:
\r\n for non-Unix platforms
\n for Unix platforms
I try to keep it brief and concise. I have to write a program that takes queries in SQL form and searches an XML. Right now I am trying to disassemble a string into logical pieces so I can work with them. I have a string as input and want to get a MatchCollection as output.
Please note that the test string below is in a special format that I impose on the user to keep things simple. Only one statement per line is permitted, and nested queries are excluded.
string testString = "select apples \n from dblp \r where we ate \n group by all of them \r HAVING NO SHAME \n";
I use Regex with the following pattern:
Regex reg = new Regex(@"(?<select> \A\bselect\b .)" +
@"(?<from> ^\bfrom\b .)" +
@"(?<where> ^\bwhere\b .)" +
@"(?<groupBy> ^\bgroup by\b .)" +
@"(?<having> ^\bhaving\b .)"
, RegexOptions.IgnoreCase | RegexOptions.Multiline
);
As far as I know this should give me matches for every group with the test string. I would be looking for an exact match of "select" at the start of each line followed by any characters except newlines.
Now I create the collection:
MatchCollection matches = reg.Matches(testString);
To make sure it worked, I used a foreach and printed the matches like:
foreach(Match match in matches)
{
Console.WriteLine("Select: {0}", match.Groups["select"]);
//and so on
}
The problem is that the collection is always empty. There must be a flaw in the Regex somewhere but I am too inexperienced to find it. Could you please assist me? Thank you very much!
I tried using .* instead of just . until I was told that . would even match multiple characters. I have no doubt that this could be a problem, but even after replacing it I get no result.
I fail to see why it is so difficult to match a line starting with a defined word and having any characters appended to it until the regex finds a newline. Seems to me that this should be a relatively easy task.
I think you need to explicitly match the line terminators, as well as handle spaces better as others have suggested. Assuming the user can choose between \r and \n, try
@"(?<select>\Aselect .+)[\n\r]" +
@"(?<from>\s*from .+)[\n\r]" +
@"(?<where>\s*where .+)[\n\r]" +
@"(?<groupBy>\s*group by .+)[\n\r]" +
@"(?<having>\s*having .+)[\n\r]"
As long as you are using regular expressions, you probably want to do a bit better:
@"\Aselect (?<select>.+)[\n\r]" +
@"\s*from (?<from>.+)[\n\r]" +
@"\s*where (?<where>.+)[\n\r]" +
@"\s*group by (?<groupBy>.+)[\n\r]" +
@"\s*having (?<having>.+)[\n\r]"
My biggest problem with regular expressions for this sort of use is that the only error message you can give is that things failed. You can't give the user any further information about what they did wrong.
There may be a problem with the newline matching: is it LF (Unix standard), CR (MacOS), or CR LF (Windows)? If you don't know, perhaps you should match it with: [\n\r]+
edit: You included some whitespace in your test string, surrounding the newlines, that you don't account for in your regex.
(?<from>^\s*from\b.*[\n\r]+$)
As you said, it's easy enough to match the keyword(s) and then use (.+) to match the rest of the line. But you have to match all of the intervening characters, and you aren't doing that. (The ^ line anchor matches the position following the line separator, not the separator itself.) You can use \s+ to consume the line separator as well as any leading whitespace on the next line.
@"select\s+(?<select>.+)\s+" +
@"from\s+(?<from>.+)\s+" +
@"where\s+(?<where>.+)\s+" +
@"group by\s+(?<groupBy>.+)\s+" +
@"having\s+(?<having>.+)";
I also rearranged things so that the SQL keywords aren't captured; that seems redundant, since you're using named groups.
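For reference, a sketch of that pattern run against the test string from the question. The Trim() calls are an addition of this sketch: since . also matches \r and spaces, the greedy groups can end with trailing whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

string testString = "select apples \n from dblp \r where we ate \n group by all of them \r HAVING NO SHAME \n";

Regex reg = new Regex(
    @"select\s+(?<select>.+)\s+" +
    @"from\s+(?<from>.+)\s+" +
    @"where\s+(?<where>.+)\s+" +
    @"group by\s+(?<groupBy>.+)\s+" +
    @"having\s+(?<having>.+)",
    RegexOptions.IgnoreCase);

Match match = reg.Match(testString);
if (match.Success)
{
    // Trim() drops the trailing spaces and "\r" that the greedy groups pick up.
    Console.WriteLine("Select:   {0}", match.Groups["select"].Value.Trim());
    Console.WriteLine("From:     {0}", match.Groups["from"].Value.Trim());
    Console.WriteLine("Where:    {0}", match.Groups["where"].Value.Trim());
    Console.WriteLine("Group by: {0}", match.Groups["groupBy"].Value.Trim());
    Console.WriteLine("Having:   {0}", match.Groups["having"].Value.Trim());
}
```

The whole statement is one match with five named groups, so a single Match call is enough here and no MatchCollection is needed.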
I haven't tried to build a working regex for you, but I can see several issues. Others pointed out the first two issues, but not the third one.
1. You can't use a single dot to match the variable parts such as "apples". Try \w+ or \S+.
2. Your string has embedded line breaks. You need to match those with [\r\n]+ or \s+.
3. The .NET regex engine treats \n as a line break, but NOT \r or \r\n. Thus, ^ will match after \n, but NOT after \r. If you do step 2, you don't need the anchors anyway, so remove them.
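The anchor behaviour described in the last point can be checked with a small experiment. The input string is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Three "lines" separated by a lone CR and a lone LF.
string text = "a\rb\nc";

// With RegexOptions.Multiline, ^ matches at the start of the input and after each "\n".
MatchCollection starts = Regex.Matches(text, "^[a-z]", RegexOptions.Multiline);

foreach (Match m in starts)
    Console.WriteLine(m.Value); // prints "a" and "c", but not "b"
```

The "b" that follows the lone \r is not treated as the start of a line, which is why the ^-anchored groups in the question never line up with the \r-separated parts of the test string.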