.NET Regex To Remove Line Breaks Within Quotes

I am trying to clean up a text file so that it can be imported into Excel but the text file contains line breaks within several of the double quoted fields. The file is tab delimited.
Example would be:
"12313"\t"1234"\t"123
5679"
"test"\t"test"\t"test"
"test"\t"test"\t"test"
"12313"\t"1234"\t"123
5679"
I need to remove the line breaks so that it will ultimately display like:
"12313"\t"1234"\t"1235679"
"test"\t"test"\t"test"
"test"\t"test"\t"test"
"12313"\t"1234"\t"1235679"
The "\t" is the tab delimiter.
I've looked at several other solutions on SO but they don't seem to deal with multiple lines. We've tried using several CSV parser solutions but can't seem to get them to work for this scenario. The goal is to pass the entire string into a REGEX expression and have it return with all line breaks between quotes removed while the line breaks outside of the quotes remain.

You can use this regex:
(?!(([^"]*"){2})*[^"]*$)\n+
This matches one or more newline characters that are not followed by an even number of quotes, i.e. newlines that sit inside a quoted field. (It assumes there are no escaped quotes in the data.)

This worked for me:
var fixedCsvFileContent = Regex.Replace(csvFileContent, @"(?!(([^""]*""){2})*[^""]*$)\n+", string.Empty);
This didn't work:
var fixedCsvFileContent = Regex.Replace(csvFileContent, @"(?!(([^""]*""){2})*[^""]*$)\n+", string.Empty, RegexOptions.Multiline);
So do not add RegexOptions.Multiline when running this pattern over the whole input string: in multiline mode $ also matches at the end of every line, so the lookahead no longer scans to the end of the input and no line break is removed.
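To make the difference concrete, here is a minimal sketch that runs the pattern from the answers above over a shortened version of the sample input. The class and method names (QuotedLineBreaks, Remove) are made up for illustration, and the sketch assumes \n line endings and no escaped quotes, as the original answer does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedLineBreaks
{
    // Pattern from the answer above: a run of \n that is NOT followed
    // by an even number of quotes up to the end of the input.
    private const string Pattern = @"(?!(([^""]*""){2})*[^""]*$)\n+";

    // Hypothetical helper: removes line breaks that fall inside quoted fields.
    public static string Remove(string text, RegexOptions options = RegexOptions.None)
    {
        return Regex.Replace(text, Pattern, string.Empty, options);
    }
}
```

Calling QuotedLineBreaks.Remove(input) joins the broken third field, while passing RegexOptions.Multiline leaves the input unchanged. If the file uses \r\n, the \r would still be left behind by this pattern.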

If just removing blank lines works:
string text = Regex.Replace( inputString, @"\n\n", "" , RegexOptions.None | RegexOptions.Multiline );

I have been running into a similar problem, but some of my files can be really large. Running a regex over the whole content would be a heavy solution, so instead I wanted something like ReadLine that ignores line breaks within quotes. This is the solution I am using.
It is an extension to the StreamReader class, used for reading the CSV files. Like some of the regex solutions here, it ensures there is an even number of quotes: it calls ReadLine, checks whether the number of quotes is odd, and if so calls ReadLine again until the number of quotes is even:
public static class Extensions
{
    public static string ReadEntry(this StreamReader sr)
    {
        // Get the first line; null means the end of the stream was reached
        string strReturn = sr.ReadLine();
        if (strReturn == null)
        {
            return null;
        }
        // Get more lines until the number of quotes is even
        while (strReturn.GetNumberOf("\"").IsOdd())
        {
            string strNow = sr.ReadLine();
            if (strNow == null)
            {
                // Stream ended inside an unclosed quote: stop instead of looping forever
                break;
            }
            strReturn += strNow;
        }
        // Then return what we've gotten
        return strReturn;
    }
    public static int GetNumberOf(this string s, string strSearchString)
    {
        // Assumes a single-character search string, as used above
        return s.Length - s.Replace(strSearchString, "").Length;
    }
    public static Boolean IsOdd(this int i)
    {
        return i % 2 != 0;
    }
}

string output = Regex.Replace(input, @"(?<=[^""])\r\n", string.Empty);

Related

How do I get a non lowercase string after quotes in the titlecase condition

In my article titles, I use CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower()); but it does not seem to work correctly after double quotes, at least for Turkish.
For example, an article's title like this:
KİRA PARASININ ÖDENMEMESİ NEDENİYLE YAPILAN "İLAMSIZ TAHLİYE"
TAKİPLERİNDE "TAKİP TALEBİ"NİN İÇERİĞİ.
After using the method like this:
private static string TitleCase(this string str)
{
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
}
var art_title = textbox1.Text.TitleCase();
It returns:
Kira Parasının Ödenmemesi Nedeniyle Yapılan "İlamsız Tahliye"
Takiplerinde "Takip Talebi"Nin İçeriği.
The problem is here. Because it must be like this:
... "Takip Talebi"nin ...
but it is like this:
... "Takip Talebi"Nin ...
What's more, in MS Word, when I click "Start a Word Initial Expense," it transforms the text in the same way:
... "Takip Talebi"Nin ...
But it is absolutely wrong. How can I fix this problem?
EDIT: Firstly I cut the sentence from the blanks and obtained the words. If a word includes double quote, it would get a lowercase string until the first space after the second double quote. Here is the idea:
private static string _TitleCase(this string str)
{
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
}
public static string TitleCase(this string str)
{
var words = str.Split(' ');
string sentence = null;
var i = 1;
foreach (var word in words)
{
var space = i < words.Length ? " " : null;
if (word.Contains("\""))
{
// After every second quotes, it would get a lowercase string until the first space after the second double quote... But how?
}
else
sentence += word._TitleCase() + space;
i++;
}
return sentence?.Trim();
}
Edit - 2 After 3 Hours: After 9 hours, I found a way to solve the problem. I believe it is not at all scientific, so please don't condemn me for it. If the whole problem is the double quotes, I replace them with a placeholder that I think is unique, such as a number or a letter unused in Turkish (alpha, beta, omega, etc.), before passing the string to ToTitleCase. ToTitleCase then performs the title-case conversion without any problems. Afterwards I replace the placeholder with double quotes again. Please share here if you have a more programmatic or scientific solution.
Here is my non-programmatic solution:
public static string TitleCase(this string str)
{
str = str.Replace("\"", "9900099");
str = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
return str.Replace("9900099", "\"").Trim();
}
var art_title = textbox1.Text.TitleCase();
And the result:
Kira Parasının Ödenmemesi Nedeniyle Yapılan "İlamsız Tahliye" Takiplerinde "Takip Talebi"nin İçeriği

Indeed, the Microsoft documentation for ToTitleCase states that ToTitleCase is (at least currently) not linguistically correct. In fact, it is REALLY hard to do this correctly (see these blog posts by the great Michael Kaplan: Sometimes, uppercasing sucks and "Michael, why does ToTitleCase suck so much?").
I'm not aware of any service or library providing a linguistically correct version.
So - unless you want to spend a lot of effort - you probably have to live with this inaccuracy.

You can find the apostrophe or quote character with RegEx and replace the character after it.
For apostrophe
Regex.Replace(str, "’(?:.)", m => m.Value.ToLower());
or
Regex.Replace(str, "'(?:.)", m => m.Value.ToLower());

C# Replace with regex

I'm new to VB, C#, and am struggling with regex. I think I've got the following code format to replace the regex match with blank space in my file.
EDIT: Per comments this code block has been changed.
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");
fileContents = fileContents.Replace(fileContents, @"regex", "");
regex = new Regex(pattern);
regex.Replace(filecontents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);
My files are formatted like this:
"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,
So far, I have a regex that finds any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two). I'm testing the regex in Expresso:
(?<=,")([^"]+,[^"]+)(?=",)
I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?
SOLVED:
Combined [^"]+ with look behind/ahead:
(?<=,"[^"]+)(,)(?=[^"]+",)
FINAL EDIT:
Here's my final complete solution:
//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");
//find all quoted fields that contain a comma
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");
//remove every comma inside each matched field
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

Figured it out by combining the [^"]+ with the look ahead ?= and look behind ?<= so that it finds strings beginning with ,"[anything that's not double quotes, one or more times] then has a comma, then ends with [anything that's not double quotes, one or more times]",
(?<=,"[^"]+)(,)(?=[^"]+",)

Try to parse out all your columns with this:
Regex regex = new Regex("(?<=\").*?(?=\")");
Then you can just do:
foreach(Match match in regex.Matches(fileContents))
{
fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",",string.Empty));
}
Might not be as fast but should work.

I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text.
This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.
I find keeping your regexes simple will pay benefits when you're trying to maintain them later.
Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.
string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedtext = regex.Replace(filecontents, m => m.ToString().Replace(",",""));

What you have there is an irregular language. This is because a comma can mean different things depending upon where it is in the text stream. Strangely Regular Expressions are designed to parse regular languages where a comma would mean the same thing regardless of where it is in the text stream. What you need for an irregular language is a parser. In fact Regular expressions are mostly used for tokenizing strings before they are entered into a parser.
While what you are trying to do can be done using regular expressions it is likely to be very slow. For example you can use the following (which will work even if the comma is the first or last character in the field). However every time it finds a comma it will have to scan backwards and forwards to check if it is between two quotation characters.
(?<=,"[^"]*),(?=[^"]*",)
Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often have quotation characters in the middle of fields where there may also be a comma. In these cases applications like MS Excel will typically double the quote up to show that it is not the end of the field. Like this:
"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,
In this case you are going to be out of luck with a regular expression.
Thankfully the code to deal with CSV files is very simple:
public static IList<string> ParseCSVLine(string csvLine)
{
List<string> result = new List<string>();
StringBuilder buffer = new StringBuilder();
bool inQuotes = false;
char lastChar = '\0';
foreach (char c in csvLine)
{
switch (c)
{
case '"':
if (inQuotes)
{
inQuotes = false;
}
else
{
if (lastChar == '"')
{
buffer.Append('"');
}
inQuotes = true;
}
break;
case ',':
if (inQuotes)
{
buffer.Append(',');
}
else
{
result.Add(buffer.ToString());
buffer.Clear();
}
break;
default:
buffer.Append(c);
break;
}
lastChar = c;
}
result.Add(buffer.ToString());
buffer.Clear();
return result;
}
PS. There are another couple of issues often run into with CSV files which the code I have given doesn't solve. First is what happens if a field has an end of line character in the middle of it? Second is how do you know what character encoding a CSV file is in? The former of these two issues is easy to solve by modifying my code slightly. The second however is near impossible to do without coming to some agreement with the person supplying the file to you.
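As a rough sketch of the first point (line breaks inside a field), the same quote-tracking idea can be applied one level up, to split raw text into records before each record is parsed. The names CsvRecords and SplitRecords are hypothetical, and the sketch assumes quotes inside fields are doubled, as in the Excel example above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvRecords
{
    // Splits raw CSV text into records; a line break ends a record
    // only when it occurs outside a quoted field.
    public static List<string> SplitRecords(string text)
    {
        var records = new List<string>();
        var buffer = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '"')
            {
                // A doubled quote ("") toggles twice, so the state is unchanged afterwards.
                inQuotes = !inQuotes;
                buffer.Append(c);
            }
            else if (!inQuotes && (c == '\r' || c == '\n'))
            {
                // Treat \r\n as a single record terminator.
                if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
                {
                    i++;
                }
                records.Add(buffer.ToString());
                buffer.Clear();
            }
            else
            {
                // Everything else, including line breaks inside quotes, is kept.
                buffer.Append(c);
            }
        }
        if (buffer.Length > 0)
        {
            records.Add(buffer.ToString());
        }
        return records;
    }
}
```

Each returned record still contains its quotes and any embedded line breaks, so it can be handed to a per-record parser afterwards. Blank lines come back as empty records in this sketch.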

How to eliminate ALL line breaks in string?

I have a need to get rid of all line breaks that appear in my strings (coming from db).
I do it using code below:
value.Replace("\r\n", "").Replace("\n", "").Replace("\r", "")
I can see that there's at least one character acting like line ending that survived it. The char code is 8232.
It's very lame of me, but I must say this is the first time I have a pleasure of seeing this char. It's obvious that I can just replace this char directly, but I was thinking about extending my current approach (based on replacing combinations of "\r" and "\n") to something much more solid, so it would not only include the '8232' char but also all others not-found-by-me yet.
Do you have a bullet-proof approach for such a problem?
EDIT#1:
It seems to me that there are several possible solutions:
1. use Regex.Replace
2. remove all chars if it's IsSeparator or IsControl
3. replace with " " if it's IsWhiteSpace
4. create a list of all possible line endings ( "\r\n", "\r", "\n",LF ,VT, FF, CR, CR+LF, NEL, LS, PS) and just replace them with empty string. It's a lot of replaces.
I would say that the best results will be after applying 1st and 4th approaches but I cannot decide which will be faster. Which one do you think is the most complete one?
EDIT#2
I posted an answer below.

Below is the extension method solving my problem. lineSeparator and paragraphSeparator can of course be defined somewhere else, as static values etc.
public static string RemoveLineEndings(this string value)
{
if(String.IsNullOrEmpty(value))
{
return value;
}
string lineSeparator = ((char) 0x2028).ToString();
string paragraphSeparator = ((char)0x2029).ToString();
return value.Replace("\r\n", string.Empty)
.Replace("\n", string.Empty)
.Replace("\r", string.Empty)
.Replace(lineSeparator, string.Empty)
.Replace(paragraphSeparator, string.Empty);
}

According to Wikipedia, there are numerous line terminators you may need to handle (including the one you mention).
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
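The terminators listed above can be handled in a single pass with one regular expression. This is only a sketch: the class and method names are invented here, and whether VT and FF should count as line breaks depends on your data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineTerminators
{
    // CR+LF comes first so the pair is replaced as one unit;
    // the character class covers LF, VT, FF, CR, NEL, LS and PS.
    private static readonly Regex Terminators =
        new Regex(@"\r\n|[\n\v\f\r\u0085\u2028\u2029]");

    // Hypothetical helper: replaces every listed terminator with the given text.
    public static string Replace(string value, string replacement)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }
        return Terminators.Replace(value, replacement);
    }
}
```

Note that Regex.Replace interprets $ sequences in the replacement text as substitutions, so a replacement containing a literal $ would need escaping.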

8232 (0x2028) and 8233 (0x2029) are the only other ones you might want to eliminate. See the documentation for char.IsSeparator.

Props to Yossarian on this one, I think he's right. Replace all whitespace with a single space:
data = Regex.Replace(data, @"\s+", " ");

I'd recommend removing ALL the whitespace (char.IsWhiteSpace) and replacing it with a single space. IsWhiteSpace takes care of all the weird Unicode whitespace characters.
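A minimal sketch of that idea without a regex, assuming that every run of whitespace should collapse to one space (WhiteSpaceCollapser and Collapse are names made up for this example):

```csharp
using System;
using System.Text;

public static class WhiteSpaceCollapser
{
    // Replaces each run of whitespace (as defined by char.IsWhiteSpace)
    // with a single space character.
    public static string Collapse(string value)
    {
        var sb = new StringBuilder(value.Length);
        bool previousWasSpace = false;
        foreach (char c in value)
        {
            if (char.IsWhiteSpace(c))
            {
                // Emit one space for the first whitespace char of a run only.
                if (!previousWasSpace)
                {
                    sb.Append(' ');
                }
                previousWasSpace = true;
            }
            else
            {
                sb.Append(c);
                previousWasSpace = false;
            }
        }
        return sb.ToString();
    }
}
```

Because char.IsWhiteSpace also returns true for U+2028 and U+2029, the character 8232 from the question is covered as well.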

This is my first attempt at this, but I think this will do what you want....
var controlChars = from c in value.ToCharArray() where Char.IsControl(c) select c;
foreach (char c in controlChars)
value = value.Replace(c.ToString(), "");
Also, see this link for details on other methods you can use: Char Methods

Have you tried string.Replace(Environment.NewLine, "") ? That usually gets a lot of them for me.

Check out this link: http://msdn.microsoft.com/en-us/library/844skk0h.aspx
You will have to play around and build a regular expression that works for you. But here's the skeleton...
static void Main(string[] args)
{
StringBuilder txt = new StringBuilder();
txt.Append("Hello \n\n\r\t\t");
txt.Append( Convert.ToChar(8232));
System.Console.WriteLine("Original: <" + txt.ToString() + ">");
System.Console.WriteLine("Cleaned: <" + CleanInput(txt.ToString()) + ">");
System.Console.Read();
}
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
return Regex.Replace(strIn, @"[^\w\.@-]", "");
}

Assuming that 8232 is unicode, you can do this:
value.Replace("\u2028", string.Empty);

personally i'd go with
public static String RemoveLineEndings(this String text)
{
StringBuilder newText = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
if (!char.IsControl(text, i))
newText.Append(text[i]);
}
return newText.ToString();
}

If you've a string say "theString" then
use the method Replace and give it the arguments shown below:
theString = theString.Replace(System.Environment.NewLine, "");

Here are some quick solutions with .NET regex:
To remove any whitespace from a string: s = Regex.Replace(s, @"\s+", ""); (\s matches any Unicode whitespace chars)
To remove all whitespace BUT CR and LF: s = Regex.Replace(s, @"[\s-[\r\n]]+", ""); ([\s-[\r\n]] is a character class containing a subtraction construct, it matches any whitespace but CR and LF)
To remove any vertical whitespace, subtract \p{Zs} (any horizontal whitespace but tab) and \t (tab) from \s: s = Regex.Replace(s, @"[\s-[\p{Zs}\t]]+", "");.
Wrapping the last one into an extension method:
public static string RemoveLineEndings(this string value)
{
return Regex.Replace(value, @"[\s-[\p{Zs}\t]]+", "");
}

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.

If you want to remove lines that contain only whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
.TrimEnd();

string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
string line;
while((line = reader.ReadLine()) != null)
{
if (line.Trim().Length > 0)
writer.WriteLine(line);
}
outputString = writer.ToString();
}

off the top of my head...
string result = Regex.Replace(input, @"\s*(\n)", "$1");
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf

Using LINQ:
var result = string.Join("\r\n",
multilineString.Split(new string[] { "\r\n" }, ...None)
.Where(s => !string.IsNullOrWhiteSpace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.

Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by @Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three alternative patterns here, each of which I replace with string.Empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only white space and preceded by a line break (but does not match that preceding line break).
(?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match that preceding line break).
(\r?\n)+\z - matches one or more line breaks at the end of the tested string (trailing line breaks, as you called them).
That satisfies your test perfectly, and it handles both \r\n and \n line break styles. Test it out! I believe this will be the most correct answer: although a simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: @Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:
\b\s+\z The \b is a word boundary (beginning or END of a word), the \s+ is one or more white space characters, the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.
\A\s+ - The \A matches the beginning of the file, the \s+ matches one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file

not good. I would use this one using JSON.net:
var o = JsonConvert.DeserializeObject(prettyJson);
var minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);

In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
string result = Regex.Replace(inputs[i], pattern, "",
RegexOptions.Multiline);
Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.

string corrected =
System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");

I'll go with:
public static string RemoveEmptyLines(string value) {
using (StringReader reader = new StringReader(value)) {
StringBuilder builder = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Trim().Length > 0)
builder.AppendLine(line);
}
return builder.ToString();
}
}

Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}

In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There're no needs for special configuration neither multi line support. Nevertheless you can add s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;

If it's only white space, why don't you use the C# string method Replace?
string yourstring = "A O P V 1.5";
yourstring = yourstring.Replace(" ", string.Empty);
The result will be "AOPV1.5".

char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);

Here is something simple if working against each individual line...
(^\s+|\s+|^)$

Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)

String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}

I'm not sure whether it is efficient, but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());

Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);

s = Regex.Replace(s, @"^[^\n\S]*\n", "");
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, @"^[ \t\r]*\n", "");
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, @"^[ \t\r]*\n?", "");

Replace Line Breaks in a String C#

How can I replace Line Breaks within a string in C#?

Use replace with Environment.NewLine
myString = myString.Replace(System.Environment.NewLine, "replacement text"); //add a line terminating ;
As mentioned in other posts, if the string comes from another environment (OS) then you'd need to replace that particular environments implementation of new line control characters.

The solutions posted so far either only replace Environment.NewLine or they fail if the replacement string contains line breaks because they call string.Replace multiple times.
Here's a solution that uses a regular expression to make all three replacements in just one pass over the string. This means that the replacement string can safely contain line breaks.
string result = Regex.Replace(input, @"\r\n?|\n", replacementString);

To extend The.Anyi.9's answer, you should also be aware of the different types of line break in general use. Dependent on where your file originated, you may want to look at making sure you catch all the alternatives...
string replaceWith = "";
string removedBreaks = Line.Replace("\r\n", replaceWith).Replace("\n", replaceWith).Replace("\r", replaceWith);
should get you going...

I would use Environment.Newline when I wanted to insert a newline for a string, but not to remove all newlines from a string.
Depending on your platform you can have different types of newlines, but even inside the same platform often different types of newlines are used. In particular when dealing with file formats and protocols.
string ReplaceNewlines(string blockOfText, string replaceWith)
{
return blockOfText.Replace("\r\n", replaceWith).Replace("\n", replaceWith).Replace("\r", replaceWith);
}

If your code is supposed to run in different environments, I would consider using the Environment.NewLine constant, since it is specifically the newline used in the specific environment.
line = line.Replace(Environment.NewLine, "newLineReplacement");
However, if you get the text from a file originating on another system, this might not be the correct answer, and you should replace with whatever newline constant is used on the other system. It will typically be \n or \r\n.

If you want to "clean" the new lines, flamebaud's comment suggesting the regex @"[\r\n]+" is the best choice.
using System;
using System.Text.RegularExpressions;
class MainClass {
public static void Main (string[] args) {
string str = "AAA\r\nBBB\r\n\r\n\r\nCCC\r\r\rDDD\n\n\nEEE";
Console.WriteLine (str.Replace(System.Environment.NewLine, "-"));
/* Result:
AAA
-BBB
-
-
-CCC
DDD---EEE
*/
Console.WriteLine (Regex.Replace(str, @"\r\n?|\n", "-"));
// Result:
// AAA-BBB---CCC---DDD---EEE
Console.WriteLine (Regex.Replace(str, @"[\r\n]+", "-"));
// Result:
// AAA-BBB-CCC-DDD-EEE
}
}

Use the ReplaceLineEndings method, new in .NET 6:
myString = myString.ReplaceLineEndings();
It replaces all newline sequences in the current string with Environment.NewLine.
Documentation:
ReplaceLineEndings
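There is also an overload that takes the replacement text, which is closer to what the question asks for. A small sketch, assuming .NET 6 or later (the wrapper names Strip and Normalize are made up for illustration):

```csharp
using System;

public static class LineEndingDemo
{
    // Removes every recognized newline sequence.
    public static string Strip(string s)
    {
        return s.ReplaceLineEndings(string.Empty);
    }

    // Normalizes every recognized newline sequence to a single \n.
    public static string Normalize(string s)
    {
        return s.ReplaceLineEndings("\n");
    }
}
```

According to the documentation, the recognized sequences include CR, LF, CRLF, NEL, FF, LS and PS, so mixed line endings are handled in one call.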

Don't forget that Replace doesn't do the replacement in the string itself, but returns a new string with the characters replaced. The following will remove line breaks (not replace them). I'd use @Brian R. Bondy's method if replacing them with something else, perhaps wrapped as an extension method. Remember to check for null values before calling Replace or the extension methods provided.
string line = ...
line = line.Replace( "\r", "").Replace( "\n", "" );
As extension methods:
public static class StringExtensions
{
public static string RemoveLineBreaks( this string lines )
{
return lines.Replace( "\r", "").Replace( "\n", "" );
}
public static string ReplaceLineBreaks( this string lines, string replacement )
{
return lines.Replace( "\r\n", replacement )
.Replace( "\r", replacement )
.Replace( "\n", replacement );
}
}

To make sure all possible ways of line breaks (Windows, Mac and Unix) are replaced you should use:
myString.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", replacement);
in this order, so that no extra line breaks are produced when a combination of line ending characters is found.

Why not both?
string ReplacementString = "";
Regex.Replace(strin.Replace(System.Environment.NewLine, ReplacementString), @"(\r\n?|\n)", ReplacementString);
Note: Replace strin with the name of your input string.

I needed to replace the \r\n with an actual carriage return and line feed and replace \t with an actual tab. So I came up with the following:
public string Transform(string data)
{
string result = data;
char cr = (char)13;
char lf = (char)10;
char tab = (char)9;
result = result.Replace("\\r", cr.ToString());
result = result.Replace("\\n", lf.ToString());
result = result.Replace("\\t", tab.ToString());
return result;
}

var answer = Regex.Replace(value, "(\n|\r)+", replacementString);

Because a new line can be delimited by \n, \r, or \r\n, we first replace \r\n and \r with \n, and only then split the data string.
The following lines should go in the parseCSV method:
function parseCSV(data) {
//alert(data);
//replace Windows new lines
data = data.replace(/\r\n/g, "\n");
//replace old Mac new lines
data = data.replace(/\r/g, "\n");
//split into rows
var rows = data.split("\n");
}

Use the .Replace() method
Line = Line.Replace("\n", "whatever you want to replace with");

The best way to replace line breaks safely is
yourString.Replace("\r\n","\n") //handling windows linebreaks
.Replace("\r","\n") //handling mac linebreaks
This should produce a string with only \n (i.e. linefeed) as line breaks.
This code is also useful for fixing mixed line breaks.

Another option is to create a StringReader over the string in question. On the reader, do .ReadLine() in a loop. Then you have the lines separated, no matter what (consistent or inconsistent) separators they had. With that, you can proceed as you wish; one possibility is to use a StringBuilder and call .AppendLine on it.
The advantage is, you let the framework decide what constitutes a "line break".
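
A minimal sketch of the StringReader approach might look like the following. The class name, method name, and separator parameter are assumptions made for illustration; it appends a caller-supplied separator between lines instead of calling AppendLine.

```csharp
using System.IO;
using System.Text;

public static class LineReaderDemo
{
    // Hypothetical helper: reads the text line by line and joins the
    // lines with the given separator. StringReader.ReadLine treats
    // \r, \n and \r\n as line terminators.
    public static string JoinLines(string text, string separator)
    {
        var builder = new StringBuilder();
        using (var reader = new StringReader(text))
        {
            string line;
            bool first = true;
            while ((line = reader.ReadLine()) != null)
            {
                if (!first)
                {
                    builder.Append(separator);
                }
                builder.Append(line);
                first = false;
            }
        }
        return builder.ToString();
    }
}
```

Note that a single trailing line break at the end of the input is dropped by this sketch, because ReadLine does not report an empty line after the final terminator.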

string s = Regex.Replace(source_string, "\n", "\r\n");
or
string s = Regex.Replace(source_string, "\r\n", "\n");
depending on which way you want to go.
Hope it helps.
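
One caveat: replacing "\n" with "\r\n" on a string that already contains \r\n yields \r\r\n. A possible way around that, sketched here with hypothetical names, is to match every line-break style in one pattern, with \r\n listed first so that it wins over a lone \r.

```csharp
using System.Text.RegularExpressions;

public static class NewlineConverter
{
    // Converts \r\n, \r and \n to \r\n without doubling existing \r\n pairs.
    public static string ToCrLf(string text)
    {
        return Regex.Replace(text, @"\r\n|\r|\n", "\r\n");
    }

    // Converts \r\n and \r to \n; existing \n characters are left alone.
    public static string ToLf(string text)
    {
        return Regex.Replace(text, @"\r\n|\r", "\n");
    }
}
```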

If you want to replace only the newlines (the input here is a verbatim string, so it holds the literal two-character sequences \r and \n):
var input = @"sdfhlu \r\n sdkuidfs\r\ndfgdgfd";
var match = @"[\\ ]+";
var replaceWith = " ";
Console.WriteLine("input: " + input);
var x = Regex.Replace(input.Replace(@"\n", replaceWith).Replace(@"\r", replaceWith), match, replaceWith);
Console.WriteLine("output: " + x);
If you want to replace newlines, tabs and white spaces (the input here is a regular string, so it holds real control characters):
var input = "sdfhlusdkuidfs\r\ndfgdgfd";
var match = @"\s+";
var replaceWith = "";
Console.WriteLine("input: " + input);
var x = Regex.Replace(input, match, replaceWith);
Console.WriteLine("output: " + x);

This is a very long-winded one-liner, but it is the only solution I have found to work if you cannot use special character escapes like "\r" and "\n", \x0d and \u000D, or System.Environment.NewLine as parameters to the replace() method:
MyStr.replace( System.String.Concat( System.Char.ConvertFromUtf32(13).ToString(), System.Char.ConvertFromUtf32(10).ToString() ), ReplacementString );
This is somewhat off-topic, but to get it to work inside Visual Studio's XML .props files, which invoke .NET via the XML properties, I had to dress it up as shown below.
The Visual Studio XML --> .NET environment just would not accept special character escapes like "\r" and "\n", \x0d and \u000D, or System.Environment.NewLine as parameters to the replace() method.
$([System.IO.File]::ReadAllText('MyFile.txt').replace( $([System.String]::Concat($([System.Char]::ConvertFromUtf32(13).ToString()),$([System.Char]::ConvertFromUtf32(10).ToString()))),$([System.String]::Concat('^',$([System.Char]::ConvertFromUtf32(13).ToString()),$([System.Char]::ConvertFromUtf32(10).ToString())))))

Based on @mark-bayers' answer, and for cleaner output:
string result = Regex.Replace(ex.Message, @"(\r\n?|\r?\n)+", "replacement text");
It replaces \r\n, \n and \r, preferring the longest match, and collapses multiple consecutive occurrences into a single replacement.
