Parsing files with C# and the Replace method

I'm trying to parse a bunch of files with the string Replace method. It does what I expect, but it doesn't feel practical: I will process 10K files, and in the first 72 alone I found about 30 values that need to be replaced. This is the rule:
My goal:
Replace every instance of ':' for which both of these hold:
1- the 2nd or 3rd character forward is not another ':'
2- the 2nd or 3rd character backward is not another ':'
In other words, any time I find a ':' that is not preceded by two or three characters and another ':' (as in :00: or :12A:), I should replace it with a '*'.
This is the method that I have so far:
private static string cleanMesage(string str)
{
string result = String.Empty;
try
{
result = str.Replace("BNF:", "BNF*").Replace("B/O:", "B/O*").Replace("O/B:", "O/B*");
result = result.Replace("Epsas:", "Epsas*").Replace("2017:", "2017*").Replace("BANK:", "BANK*");
result = result.Replace("CDT:", "CDT*").Replace("ENT:", "").Replace("GB22:", "GB22*");
result = result.Replace("A / C:", "A/C*").Replace("ORD:", "ORD*").Replace("A/C:", "A/C*");
result = result.Replace("REF:", "REF*").Replace("ISIN:", "ISIN*").Replace("PAY:", "PAY*");
result = result.Replace("DEPOSITO:", "DEPOSITO*").Replace("WITH:", "WITH*");
result = result.Replace("Operaciones:", "Operaciones*").Replace("INST:", "INST*");
result = result.Replace("DETAIL:", "DETAIL*").Replace("WITH:", "WITH*").Replace("BO:", "BO*");
result = result.Replace("CUST:", "CUST*").Replace("ISIN:", "ISIN*").Replace("SEDL:", "SEDL*");
result = result.Replace("Enero:", "Enero*").Replace("enero:", "Enero*");
result = result.Replace("agosto:", "agosto*").Replace("febrero:", "febrero*");
result = result.Replace("marzo:", "marzo*").Replace("abril:", "abril*");
result = result.Replace("mayo:", "mayo*").Replace("junio:", "junio*").Replace("RE:", "RE:*");
result = result.Replace("julio:", "julio*").Replace("septiembre:", "septiembre*");
result = result.Replace("NIF:", "NIF*").Replace("INST:", "INST*").Replace("SHS:", "SHS*")
.Replace("SK:", "");
result = result.Replace("PARTY:", "PARTY*").Replace("SEDOL:", "SEDOL*").Replace("PD:", "PD*");
}
catch (Exception e)
{
}
return result;
}
And this is some sample data:
:13: <-- keep /ISIN/XS SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace UX DZ
TU:<- replace02ZUG12 RZTU:<- replace W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 IDDSIN:<- replace
:31: <-- keep 1201000100CD05302,24NSUC20523531001//00520023531014
:13: <-- keep /ISIN/XS0153242003 SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace00ZUG12 UX DZ
TU:02ZUG12 RZTU:0.30241 W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 ISIN:XS0153242003
:31: <-- keep 1201000100DD121253,25S202IMSSMSZUX534C//S0322211DF4301
S F/O 0150001400
:13: <-- keep XNF:<- replace this

If your goal is to replace every ':' that is not followed by 2 or 3 other characters, you could try the System.Text.RegularExpressions library and simplify your cleanMessage function in the following way.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
    string pattern = @":(\s)"; // a ':' followed by a whitespace character
    Regex rgx = new Regex(pattern);
    string replaceResult = rgx.Replace(str, "*$1"); // replaces the ':' with a '*' and keeps the whitespace
    return replaceResult;
}
If your goal is to replace every ':' that is not followed by 2 or 3 other characters, and whose 2nd or 3rd character forward or backward is not another ':', you could change cleanMessage to the following instead.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
    string pattern = @"([^:]{2}.):(\s[^:]{2})";
    // 2 characters that cannot be ':', then any character, then a ':', then a whitespace character and 2 more characters that cannot be ':'
    // For instance, "BNF: :F" would FAIL and not get replaced, but "BNF: HH" would pass and become "BNF* HH"
    Regex rgx = new Regex(pattern);
    string replaceResult = rgx.Replace(str, "$1*$2"); // replaces the ':' with a '*'
    return replaceResult;
}
More information on the System.Text.RegularExpressions library replace can be found at
https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx

As @dymanoid mentioned, regular expressions are one way to handle this. The following should give you what you want:
result = Regex.Replace(str, @"([a-zA-Z0-9]{2,3}):", "$1*");
However for large datasets this won't perform well. In that case I'd look at walking through str character by character using a for-loop. If the current character is not a colon, add it to the result string and to a temporary string. When the current character is a colon (:) and the temporary string has a length of 2 or 3, write an asterisk to the result and clear the temporary string.
In this case you don't do any string replacement, you just select what to write to a new string.
See here for a speed comparison between string replacement and regex replacement.
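The character-by-character idea can be sketched as follows. This is only a minimal illustration, not the answerer's code: it assumes the rule means "keep a ':' only when it opens or closes a tag such as :00: or :12A:, with two or three letters or digits between the colons", and the names ColonCleaner and IsTag are hypothetical.

```csharp
using System;
using System.Text;

public static class ColonCleaner
{
    // Assumption: a "tag" is ':' + 2 or 3 letters/digits + ':'.
    // True when s[start] and s[end] are both ':' and everything between is a letter or digit.
    private static bool IsTag(string s, int start, int end)
    {
        if (start < 0 || end >= s.Length) return false;
        if (s[start] != ':' || s[end] != ':') return false;
        for (int i = start + 1; i < end; i++)
        {
            if (!char.IsLetterOrDigit(s[i])) return false;
        }
        return true;
    }

    public static string CleanMessage(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            // Keep every non-colon, and every colon that opens or closes a tag.
            bool keep = s[i] != ':'
                || IsTag(s, i, i + 3) || IsTag(s, i, i + 4)
                || IsTag(s, i - 3, i) || IsTag(s, i - 4, i);
            sb.Append(keep ? s[i] : '*');
        }
        return sb.ToString();
    }
}
```

The string is written once into a StringBuilder, so no intermediate strings are created the way a long chain of Replace calls creates them.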

Related

C# Regex split() without removing the split condition character

I am splitting a string with regex using its Split() method.
var splitRegex = new Regex(@"[\s|{]");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
The string is just a sample of the input. There are two different input structures: one with a space before the { and one without. I want to split the input at the { bracket in order to get the following result:
/Tests/ShowMessage
{ 'Text': 'foo' }
If there is a space, the string gets split there (the space gets removed) and I get my desired result. But if there isn't a space, the string is split on the {, so the { gets removed, which I don't want. How can I use Regex.Split() without removing the split condition character?
The square brackets create a character set, which matches exactly one of the characters inside it. For what you want, start by removing them.
To match any number of whitespace characters, add *, which gives \s*.
\s is a whitespace character
* means zero or more
To avoid removing the split condition character, you can use a lookahead assertion (?=...).
(?=...) or (?!...) is a lookahead assertion
The combined regex looks like this: \s*(?={)
This is a really good and detailed documentation of all the different regex parts; you might want to have a look at it. Furthermore, you can test your regex easily and for free here.
In order to not include the curly brace in the match, you can put it into a lookahead:
\s*(?={)
That will match any number of whitespace characters up to the position before an open curly brace.
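As a rough sketch of how the lookahead pattern could be used with Regex.Split, assuming the two input shapes from the question: the class and method names (BraceSplitter, SplitAtBrace) are made up for illustration, and the brace is escaped as \{ purely as a precaution.

```csharp
using System.Text.RegularExpressions;

public static class BraceSplitter
{
    // \s* consumes optional whitespace; (?=\{) only asserts that '{' follows,
    // so the brace itself is not part of the match and survives the split.
    private static readonly Regex SplitRegex = new Regex(@"\s*(?=\{)");

    public static string[] SplitAtBrace(string input)
    {
        // The count argument 2 limits the result to at most two substrings.
        return SplitRegex.Split(input, 2);
    }
}
```

Both "/Tests/ShowMessage { 'Text': 'foo' }" and "/Tests/ShowMessage{ 'Text': 'foo' }" should then yield the path as the first element and the text starting at { as the second.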
You can use regular string split, on "{" and trim the spaces off:
var bits = "/Tests/ShowMessage { 'Text': 'foo' }".Split("{", StringSplitOptions.RemoveEmptyEntries);
bits[0] = bits[0].TrimEnd();
bits[1] = "{" + bits[1];
If you want to use the RegEx route, you can add the { back if you change the regex a bit:
var splitRegex = new Regex(@"\s*{");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
splittedText[1] = "{" + splittedText[1];
It means "split at each occurrence of (zero or more whitespace followed by {)". The split operation removes your spaces (which you want) and your { (which you don't want), but you can put the { back and be certain you get the result you want.
var splitedList = srt.Text.Replace(".", ".#").Replace("?", "?#").Replace("!", "!#").Split(new[] { "#"}, StringSplitOptions.RemoveEmptyEntries).ToList();
This splits the text on ., ! and ? without removing the condition characters. For a more reliable result, replace # with a character that cannot occur in the text, such as '®'. That is all there is to it. No Regex.Split, which is slow and difficult to adapt to many different criteria.
passing-> "Hello. I'am dev!"
result (split condition character exist )
"Hello."
"I'am dev!"

Regex in C# to process a text

I am trying to remove some text and keep only small text from the string.
Actually I am very new to regex, I have read an article and did not get it very well.
Here is an example of my text (every line in separate string object)
2015-03-08 10:30:00 /user841/column-width
2015-03-08 10:30:01 /user849/connect
2015-03-08 10:30:01 /user262/open-level2-price/some other text
2015-03-08 10:30:01 /user839/open-detailed-quotes
I want to process them using regex in c# and have the following output:
column-width
connect
open-level2-price/some other text
open-detailed-quotes
I have used the following line to do that but it throws an exception:
Match match = Regex.Match(line, @"*./user\d+/*.");
The Exception:
System.ArgumentException: 'parsing "*./user\d+/*." - Quantifier {x,y} following nothing.'
could anyone help please!
The error you get is caused by the fact that you try to quantify the start of the pattern, which is considered an error in a .NET regex. Perhaps, you meant to use .* instead of the *. (to match any 0+ chars greedily, as many as possible), but it is certainly not what you need judging by the expected results.
You need
/user\d+/(.*)
See the regex demo
Details:
/user - a literal substring /user
\d+ - 1 or more digits (use RegexOptions.ECMAScript option to only match ASCII digits with \d in a .NET regex)
/ - a literal /
(.*) - A capturing group #1 that matches any 0+ chars other than a newline (replace * with + to match at least 1 char).
C#:
var results = Regex.Matches(s, @"/user\d+/(.*)")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
Instead of using Regex, just split on the '/' character and use the last index of the array (using LINQ):
string inputString = "2015-03-08 10:30:01 /user262/open-level2-price";
inputString.Split('/').Last();
Split returns an array of strings, in your case with the sample input above the string array would look like:
array[0] = "2015-03-08 10:30:01 "
array[1] = "user262"
array[2] = "open-level2-price"
You indicate you always want the last part so just use LINQ to take the .Last() index of the array.
Fiddle here
Here's a simple example of how to use the Regex.Replace static method.
https://dotnetfiddle.net/JuUF9E
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string[] lines = new string[] {
"2015-03-08 10:30:00 /user841/column-width",
"2015-03-08 10:30:01 /user849/connect",
"2015-03-08 10:30:01 /user262/open-level2-price",
"2015-03-08 10:30:01 /user839/open-detailed-quotes"
};
string pattern = @"(.*/.*/)(.*)";
string replacement = "$2";
foreach(var line in lines)
{
Console.WriteLine(Regex.Replace(line, pattern, replacement));
}
}
}
I don't know why you're trying to do this simple thing with regex; you just have to read the lines, split by '/', then select the last element, and that's it. For example, if you have that data in a file, you can use something like this:
string newString = "";
StreamReader sr = new StreamReader("log.txt");
while (!sr.EndOfStream)
{
    string[] splitted = sr.ReadLine().Split('/');
    if (splitted.Length > 0)
        newString += splitted[splitted.Length - 1];
}
sr.Close();
At the end, the newString variable will contain what you want. Alternatively, you can add every line to a list if you need to do something else with the data.
How about using a lookaround?
var line = "2015-03-08 10:30:01 /user839/open-detailed otes/dsada/dsa/das/dsadsa";
// dsadsa
var match = Regex.Match(line, @"(?!.*/).*").Value;

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this. See demo. Replace by empty string. You can use a lookahead here to check whether the quoted string contains the word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched characters with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is called a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*'): because this part of the regex is inside a capturing group, the matched characters are stored in group index 1.
| is the regex alternation operator.
'[^']*' matches all the remaining single-quoted strings (those that don't contain dark). It won't match a single-quoted string containing dark, because those strings were already matched by the pattern before the | alternation operator.
Finally, replacing every match with the characters in group index 1 gives you the desired output.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim leading and trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again

Search string pattern

If I have a string like MCCORMIC 3H R Final 08-26-2011.dwg or even MCCORMIC SMITH 2N L Final 08-26-2011.dwg and I wanted to capture the R in the first string or the L in the second string in a variable, what is the best method for doing so? I was thinking about trying the below statement but it does not work.
string filename = "MCCORMIC 3H R Final 08-26-2011.dwg";
string WhichArea = "";
int WhichIndex = 0;
WhichIndex = filename.IndexOf("Final");
WhichArea = filename.Substring(WhichIndex - 1,1); //Trying to get the R in front of word Final
Just split by space:
var parts = filename.Split(new [] {' '},
StringSplitOptions.RemoveEmptyEntries);
WhichArea = parts[parts.Length - 3];
It looks like the file names have a very specific format, so this will work just fine.
Even with any number of spaces, using StringSplitOptions.RemoveEmptyEntries means spaces will not be part of the split result set.
Code updated to deal with both examples - thanks Nikola.
I had to do something similar, but with MicroStation drawings instead of AutoCAD. I used regex in my case. Here's what I did, just in case you feel like making it more complex.
string filename = "MCCORMIC 3H R Final 08-26-2011.dwg";
string filename2 = "MCCORMIC SMITH 2N L Final 08-26-2011.dwg";
Console.WriteLine(TheMatch(filename));
Console.WriteLine(TheMatch(filename2));
public string TheMatch(string filename) {
Regex reg = new Regex(@"[A-Za-z0-9]*\s*([A-Z])\s*Final .*\.dwg");
Match match = reg.Match(filename);
if(match.Success) {
return match.Groups[1].Value;
}
return String.Empty;
}
I don't think Oded's answer covers all cases. The first example has two words before the wanted letter, and the second one has three words before it.
My opinion is that the best way to get this letter is by using RegEx, assuming that the word Final always comes after the letter itself, separated by any number of spaces.
Here's the RegEx code:
using System.Text.RegularExpressions;
private string GetLetter(string fileName)
{
string pattern = @"\S(?=\s*?Final)";
Match match = Regex.Match(fileName, pattern);
return match.Value;
}
And here's the explanation of RegEx pattern:
\S(?=\s*?Final)
\S // Anything other than whitespace
(?=\s*?Final) // Positive look-ahead
\s*? // Whitespace, unlimited number of repetitions, as few as possible.
Final // Exact text.

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.
If you want to remove lines containing only whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
    Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
    .TrimEnd();
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
string line;
while((line = reader.ReadLine()) != null)
{
if (line.Trim().Length > 0)
writer.WriteLine(line);
}
outputString = writer.ToString();
}
Off the top of my head...
string result = Regex.Replace(input, @"\s*(\n)", "$1");
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf
Using LINQ:
var result = string.Join("\r\n",
multilineString.Split(new string[] { "\r\n" }, ...None)
.Where(s => !string.IsNullOrWhiteSpace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.
Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by @Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three alternative patterns in here, each of which I replace with string.Empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only white space and preceded by a line break (but does not match the first preceding line break).
(?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match the first preceding line break).
(\r?\n)+\z - matches one or more line breaks at the end of the tested string (trailing line breaks, as you called them)
That satisfies your test perfectly! But also satisfies both \r\n and \n line break styles! Test it out! I believe this will be the most correct answer, although simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: @Will pointed out a potential flaw in the last alternative of the above regex: it won't match multiple line breaks containing white space at the end of the test string. So let's change that last alternative to this:
\b\s+\z The \b is a word boundary (beginning or END of a word), the \s+ is one or more white space characters, the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file, including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.
\A\s+ - The \A match the beginning of the file, the \s+ match one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file
Not good. I would use this one, using JSON.NET:
var o = JsonConvert.DeserializeObject(prettyJson);
var minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
string result = Regex.Replace(inputs[i], pattern, "",
RegexOptions.Multiline);
Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.
string corrected =
System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
I'll go with:
public static string RemoveEmptyLines(string value) {
using (StringReader reader = new StringReader(value)) {
StringBuilder builder = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Trim().Length > 0)
builder.AppendLine(line);
}
return builder.ToString();
}
}
Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There is no need for special configuration or multi-line support. Nevertheless, you can add the s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
If it's only whitespace, why don't you use the C# string method?
string yourstring = "A O P V 1.5";
yourstring = yourstring.Replace(" ", string.Empty);
The result will be "AOPV1.5".
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
Here is something simple if working against each individual line...
(^\s+|\s+|^)$
Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
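For illustration, the pattern above could be applied from C# roughly like this. It is a sketch under the same assumption stated in the footnote (newlines are \r\n and no RegexOptions are set); the wrapper class name BlankLineStripper is hypothetical, while StripWhitespace is the name used in the bounty test.

```csharp
using System.Text.RegularExpressions;

public static class BlankLineStripper
{
    // First alternative: whitespace-only lines at the start of the string or after a line break.
    // Second alternative: the final line break plus any trailing whitespace.
    private const string Pattern = @"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$";

    public static string StripWhitespace(string input)
    {
        // No RegexOptions: ^ and $ anchor to the whole string, not to each line.
        return Regex.Replace(input, Pattern, string.Empty);
    }
}
```

With the bounty input "test\r\n \r\nthis\r\n\r\n", the first alternative removes the whitespace-only line and the second removes the trailing line breaks, leaving "test\r\nthis".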
String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}
I'm not sure whether it's efficient, but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());
Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
s = Regex.Replace(s, @"^[^\n\S]*\n", "");
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, @"^[ \t\r]*\n", "");
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, @"^[ \t\r]*\n?", "");
