Regex replace Windows line break characters - c#

I have this bit of code, which is supposed to replace the Windows linebreak character (\r\n) with an empty character.
However, it does not seem to replace anything, as if I view the string after the regex expression is applied to it, the linebreak characters are still there.
private void SetLocationsAddressOrGPSLocation(Location location, string locationString)
{
//Regex check for the characters a-z|A-Z.
//Remove any \r\n characters (Windows Newline characters)
locationString = Regex.Replace(locationString, @"[\\r\\n]", "");
int test = Regex.Matches(locationString, @"[\\r\\n]").Count; //Curiously, this outputs 0
int characterCount = Regex.Matches(locationString, @"[a-zA-Z]").Count;
//If there were characters, set the location's address to the locationString
if (characterCount > 0)
{
location.address = locationString;
}
//Otherwise, set the location's coordinates to the locationString.
else
{
location.coordinates = locationString;
}
} //End void SetLocationsAddressOrGPSLocation()

You are using a verbatim string literal, so \\ reaches the regex engine as an escaped, literal \.
So your regex actually matches the characters \, r and n.
Use
locationString = Regex.Replace(locationString, @"[\r\n]+", "");
This [\r\n]+ pattern will make sure you will remove each and every \r and \n symbol, and you won't have to worry if you have a mix of newline characters in your file. (Sometimes, I have both \n and \r\n endings in text files).
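To make the difference concrete, here is a minimal sketch that runs both patterns side by side. The class and method names are made up for illustration; only Regex.Replace comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakDemo
{
    // Removes every carriage return and line feed, whatever the mix of endings.
    public static string StripLineBreaks(string text)
    {
        return Regex.Replace(text, @"[\r\n]+", "");
    }

    // The pattern from the question. In a verbatim string, \\ reaches the regex
    // engine as an escaped backslash, so the class matches '\', 'r' and 'n'.
    public static string StripWithDoubledBackslashes(string text)
    {
        return Regex.Replace(text, @"[\\r\\n]", "");
    }
}
```

With this sketch, StripLineBreaks("12 Main St\r\nSpringfield\n") gives "12 Main StSpringfield", while StripWithDoubledBackslashes("corner\r\n") eats the letters r and n and leaves the real line break in place.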

Related

How to get the Indexof backslash(\) in C#?

is there way to find out the indexof backslash in string variable?
i have string
var str = " \"AAP, FB, VOD, ART, BAG, CAT, DDL\"\n "
int stIdx = str.indexof('\"') // output as 6
int edIdx = str.indexof('\', stIdx+1); // output as -1
output i'm looking for is as below
AAP, FB, VOD, ART, BAG, CAT, DDL
You need to escape your backslash to use it as a char value.
Because the backslash is a special character, '\' does not refer to a backslash character; it starts an escape sequence that lets you escape the character that follows.
int index = str.IndexOf('\\'); //should give you the right answer
A \" in a string is not a backslash followed by a quote, but just a quote. The backslash in the string literal escapes the quote: it should be just a character, not the end of the string constant.
The \n is not an escaped 'n' (it wouldn't need escaping) but a (single) newline character.
Your string literal doesn't really contain any backslashes.
Your code, with typos fixed:
var str = " \"AAP, FB, VOD, ART, BAG, CAT, DDL\"\n ";
int stIdx = str.IndexOf('"')+1; // output as 7 - you don't want position of " but 'A'
int edIdx = str.IndexOf('"', stIdx); // output as 39
Console.WriteLine($"from {stIdx} to {edIdx}"); // "from 7 to 39"
Console.WriteLine(str.Substring(stIdx, edIdx-stIdx)); // "AAP, FB, VOD, ART, BAG, CAT, DDL"
In a character literal (delimited by single quotes), you do not need to escape the double quote, so '"' works fine. Although it is no problem if you do escape it: '\"' - which is still a single " character.
The character sequence '\n' is used for a newline.
string str = "Hi\n";
In the above example str consists of the characters 'H', 'i', and a newline character. There is no '\' character.
Try the following:
int edIdx = str.IndexOf('\n', stIdx+1);
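As a quick check of the points above, this small sketch compares a regular literal with a verbatim one. The class and field names are hypothetical; the lengths follow directly from the C# escaping rules.

```csharp
using System;

public static class LiteralDemo
{
    // Regular literal: \n is a single newline character.
    public static readonly string Regular = "Hi\n";   // 'H', 'i', newline

    // Verbatim literal: the backslash is kept as-is.
    public static readonly string Verbatim = @"Hi\n"; // 'H', 'i', '\', 'n'
}
```

Regular.Length is 3 and Regular.IndexOf('\\') is -1, because there is no backslash in the string. Verbatim.Length is 4 and Verbatim.IndexOf('\\') is 2.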

Trim Non-alphanum from beginning and end of string

What is the best way to trim ALL non-alphanumeric characters from the beginning and end of a string? I tried adding the characters I do not need manually, but that doesn't work well. I just need to trim anything that is not alphanumeric.
I tried using this function:
string something = "()&*1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9&^";
string somethingNew = Regex.Replace(something, @"[^\p{L}-\s]+", "");
But it removes all characters that are non alpha numeric from the string. What I basically want is like this:
"test1" -> test1
#!#!2test# -> 2test
(test3) -> test3
##test4---- -> test4
I do want to support unicode characters but not symbols..
EDIT:
The output of the example should be:
Littering aaaannnndóú
Regards
Assuming you want to trim non-alphanumeric characters from the start and end of your string:
s = new string(s.SkipWhile(c => !char.IsLetterOrDigit(c))
.TakeWhile(char.IsLetterOrDigit)
.ToArray());
@"[^\p{L}\s-]+(test\d*)|(test\d*)[^\p{L}\s-]+","$1"
You can use String function String.Trim Method (Char[]) in .NET library to trim the unnecessary characters from the given string.
From MSDN : String.Trim Method (Char[])
Removes all leading and trailing occurrences of a set of characters
specified in an array from the current String object.
Before trimming the unwanted characters, you first need to identify whether each character is a letter or digit; if it is non-alphanumeric, you can use the String.Trim Method (Char[]) to remove it.
You need to use the Char.IsLetterOrDigit() function to identify whether the character is alphanumeric or not.
From MSDN: Char.IsLetterOrDigit()
Indicates whether a Unicode character is categorized as a letter or a
decimal digit.
Try This:
string str = "()&*1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9&^";
foreach (char ch in str)
{
if (!char.IsLetterOrDigit(ch))
str = str.Trim(ch);
}
Output:
1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9
If you need to remove any character which is not alphanumeric, you can use IsLetterOrDigit paired with a Where to go through every character. And because we're working at the char level, we'll need a little Concat at the end to bring everything back into a string.
string result = string.Concat(input.Where(char.IsLetterOrDigit));
which you can easily convert into an extension method
public static class Extensions
{
public static string ToAlphaNum(this string input)
{
return string.Concat(input.Where(char.IsLetterOrDigit));
}
}
that you can use like this :
string testString = "#!#!\"(test123)\"";
string result = testString.ToAlphaNum(); //test123
Note: this will remove every non-alphanumeric character from your string, if you really need to remove only those at the beginning/end, please add more details about what defines a beginning or an end and add more examples.
And you could also replace all the non-letters/numbers at the beginning and/or end of the line:
^[^\p{L}\p{N}]*|[^\p{L}\p{N}]*$
used as
resultString = Regex.Replace(subjectString, @"^[^\p{L}\p{N}]*|[^\p{L}\p{N}]*$", "", RegexOptions.Multiline);
If you really want to only remove characters at the beginning and end of the "String" and not do this line by line, then remove the ^$ match at linebreak option (RegexOptions.Multiline)
If you wanted to include leading or trailing underscores, as characters to be retained, you could simplify the regex to:
^\W+|\W+$
The core of the regex:
[^\p{L}\p{N}]
is a negated character class: it matches any character that is NOT in the Unicode class of Letters \p{L} or Numbers \p{N}
In other words:
Trim non-unicode alphanumeric characters
^[^\p{L}\p{N}]*|[^\p{L}\p{N}]*$
Options: Case sensitive; Exact spacing; Dot doesn't match line breaks; ^$ match at line breaks; Parentheses capture
Match this alternative «^[^\p{L}\p{N}]*»
Assert position at the beginning of a line «^»
Match any single character NOT present in the list below «[^\p{L}\p{N}]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A character from the Unicode category “letter” «\p{L}»
A character from the Unicode category “number” «\p{N}»
Or match this alternative «[^\p{L}\p{N}]*$»
Match any single character NOT present in the list below «[^\p{L}\p{N}]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A character from the Unicode category “letter” «\p{L}»
A character from the Unicode category “number” «\p{N}»
Assert position at the end of a line «$»
Created with RegexBuddy
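For comparison, here is a minimal sketch of two ways to trim only the ends of the whole string: the regex above without RegexOptions.Multiline, and a plain index scan. The class and method names are invented for this example. One caveat: \p{N} covers every Unicode number category, while char.IsLetterOrDigit only accepts decimal digits, so the two can disagree on characters such as superscript digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlphaNumTrim
{
    // Regex version: strip leading and trailing runs of characters that are
    // neither Unicode letters nor Unicode numbers. Without Multiline,
    // ^ and $ anchor to the string as a whole.
    public static string TrimWithRegex(string s)
    {
        return Regex.Replace(s, @"^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$", "");
    }

    // Index-based version: move both ends inward until a letter or digit is hit.
    public static string TrimWithLoop(string s)
    {
        int start = 0;
        int end = s.Length - 1;
        while (start <= end && !char.IsLetterOrDigit(s[start])) start++;
        while (end >= start && !char.IsLetterOrDigit(s[end])) end--;
        return s.Substring(start, end - start + 1);
    }
}
```

Both versions turn "(test3)" into "test3" and "##test4----" into "test4", and both return an empty string when the input contains no letters or digits at all.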
Without using regex:
In Java, you could do: (in c# syntax would be nearly the same with same functionality)
while (true) {
if (word.length() == 0) {
return ""; // bad
}
if (!Character.isLetter(word.charAt(0))) {
word = word.substring(1);
continue; // so we are doing front first
}
if (!Character.isLetter(word.charAt(word.length()-1))) {
word = word.substring(0, word.length()-1);
continue; // then we are doing end
}
break; // if front is done, and end is done
}
you could use this pattern
^[^[:alnum:]]+|[^[:alnum:]]+$
with g option

Remove whitespace near a character using regex in a long text

How do I delete one or more whitespace characters next to a given character in a long text? I do not want to remove whitespace that is not adjacent to the matching character; only the whitespace directly before or after it should go, not all whitespace in the input string. For example:
[text][space][space]![space][text] should result in [text]![text]
[text][space][space]![space][space][space][text] should result in [text]![text]
[text][space]![space][space][text] should result in [text]![text]
[text][space]![space][text] should result in [text]![text]
[text]![space][space][text] should result in [text]![text]
[text][space][space]![text] should result in [text]![text]
[text][space][space]! should result in [text]!
![space][space][text] should result in ![text]
The code I am going to write is:
for (int i = 0 to length of string)
{
if (string[i] == character) //which is the desired character "!"
{
int location = i+1;
//remove all whitespace after the character till a non-whitespace character
//is found or string ends
while (string[location] == whitespace)
{
string[location].replace(" ", "");
location++;
}
int location = i-1;
//remove all whitespace before the character till a non-whitespace character
//is found or string ends
while (string[location] == whitespace)
{
string[location].replace(" ", "");
location--;
}
}
}
Is there a better way of removing whitespaces near a character using Regex?
UPDATE: I do not want to remove other white spaces which are not present adjacent to the matching string. For example:
some_text[space]some_other_text[space][space]![space]some_text[space]some_other_text
is
some_text[space]some_other_text!some_text[space]some_other_text
string input = "This is text with far too much " +
"whitespace.";
string pattern = "\\s*!\\s*";
string replacement = "!";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
taken from http://msdn.microsoft.com/de-de/library/vstudio/xwewhkd1.aspx
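Applied to the examples from the question, the same pattern can be wrapped like this. It is only a sketch: the class and method names are made up, and the target character is hard-coded to '!'.

```csharp
using System.Text.RegularExpressions;

public static class BangSpacing
{
    // Removes any whitespace directly before and after each '!',
    // leaving all other whitespace in the text untouched.
    public static string Tighten(string text)
    {
        return Regex.Replace(text, @"\s*!\s*", "!");
    }
}
```

Tighten("some_text some_other_text  ! some_text some_other_text") returns "some_text some_other_text!some_text some_other_text". For a different target character that has a special meaning in regex syntax, the character would need to be escaped in the pattern, for example with Regex.Escape.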

Replace special character with white space through regex

I have a function which replace character.
public static string Replace(string value)
{
value = Regex.Replace(value, "[\n\r\t]", " ");
return value;
}
Given value="abc\nbcd abcd abcd\ ", any unwanted whitespace in the string should also be removed. That is, I want a result like this:
value="abcabcdabcd". Please help me change the regex pattern to get the desired result. Thanks a lot.
If you need to remove any number of whitespace characters from the string, probably you're looking for something like this:
value = Regex.Replace(value, @"\s+", "");
where \s matches any whitespace character and + means one or more times.
Instead of replacing your newline, tab, etc. characters with a space, just replace all whitespace with nothing:
public static string RemoveWhitespace(string value)
{
return Regex.Replace(value, "\\s", "");
}
\s is a special character group that matches all whitespace characters. (The backslash is doubled because the backslash has a special meaning in C# strings as well.) The following MSDN link contains the exact definition of that character group:
Character Classes: White-Space Character: \s
You may want to try \s, which matches whitespace characters. With the statement Regex.Replace(value, @"\s", ""), the output will be "abcabcdabcd".

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.
If you want to remove lines containing only whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
.TrimEnd();
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
string line;
while((line = reader.ReadLine()) != null)
{
if (line.Trim().Length > 0)
writer.WriteLine(line);
}
outputString = writer.ToString();
}
off the top of my head...
string cleaned = Regex.Replace(input, @"\s*(\n)", "$1"); // "fixed" is a C# keyword, so the variable is renamed
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf
Using LINQ:
var result = string.Join("\r\n",
multilineString.Split(new string[] { "\r\n" }, ...None)
.Where(s => !string.IsNullOrWhiteSpace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.
Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by @Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three alternative patterns here, each of which I replace with string.Empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only whitespace and preceded by a line break (but does not match that first preceding line break).
(?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match that first preceding line break).
(\r?\n)+\z - matches one to unlimited line breaks at the end of the tested string (trailing line breaks as you called them)
That satisfies your test perfectly! But also satisfies both \r\n and \n line break styles! Test it out! I believe this will be the most correct answer, although simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: @Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing whitespace at the end of the test string. So let's change that last pattern to this:
\b\s+\z The \b is a word boundary (beginning or END of a word), the \s+ is one or more whitespace characters, the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover. That case is inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.
\A\s+ - The \A match the beginning of the file, the \s+ match one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file
not good. I would use this one using JSON.net:
var o = JsonConvert.DeserializeObject(prettyJson);
var minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
string result = Regex.Replace(inputs[i], pattern, "",
RegexOptions.Multiline);
Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.
string corrected =
System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
I'll go with:
public static string RemoveEmptyLines(string value) {
using (StringReader reader = new StringReader(value)) {
StringBuilder builder = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Trim().Length > 0)
builder.AppendLine(line);
}
return builder.ToString();
}
}
Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There is no need for special configuration or multi-line support. Nevertheless, you can add the s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
If it's only spaces, why don't you use the C# string method Replace?
string yourstring = "A O P V 1.5";
yourstring = yourstring.Replace(" ", string.Empty); // strings are immutable, so keep the returned value
The result will be "AOPV1.5".
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
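Note that RemoveEmptyEntries only drops lines that are completely empty, so a line holding just spaces or tabs survives. A minimal sketch of a split-and-join variant that also drops whitespace-only lines might look like the following. The class name, method name, and the newLine parameter are assumptions made for this example, and it relies on LINQ and the string.Join overload for IEnumerable<string>, so it needs .NET 4 or later rather than the .NET 2.0 mentioned in the question.

```csharp
using System;
using System.Linq;

public static class BlankLines
{
    // Splits on \r\n, \n, or \r, drops lines that are empty or whitespace-only,
    // and rejoins with the given separator, so no trailing newline is left.
    public static string Strip(string text, string newLine = "\r\n")
    {
        string[] lines = text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
        return string.Join(newLine, lines.Where(line => line.Trim().Length > 0));
    }
}
```

With the bounty's test input, BlankLines.Strip("test\r\n \r\nthis\r\n\r\n") returns "test\r\nthis". Leading whitespace on non-blank lines is preserved because only the blank-line check uses Trim.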
Here is something simple if working against each individual line...
(^\s+|\s+|^)$
Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
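For anyone who wants to try the pattern, here is a small sketch that wraps it in a helper. The class and method names are hypothetical; the pattern itself is quoted unchanged from above and, as noted, assumes \r\n line endings and no regex options.

```csharp
using System.Text.RegularExpressions;

public static class BlankLineRegex
{
    // Pattern quoted from the answer above; it assumes \r\n line endings
    // and is applied without RegexOptions.Multiline.
    private const string Pattern = @"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$";

    public static string Strip(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }
}
```

BlankLineRegex.Strip("test\r\n \r\nthis\r\n\r\n") returns "test\r\nthis", and BlankLineRegex.Strip("\r\ntest\r\n ! test") returns "test\r\n ! test", keeping the indentation of the second line.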
String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}
I'm not sure whether it is efficient, but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());
Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
s = Regex.Replace(s, @"^[^\n\S]*\n", "");
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, @"^[ \t\r]*\n", "");
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, @"^[ \t\r]*\n?", "");
