Regular Expression to Group Strings - c#

I am relatively new to regular expressions, so please excuse me.
I am currently trying to group lines based on a record line. For example, I want all lines following a record line to be grouped into one string, up to the next record line. I have been trying to use regular expressions and have a result that is very close to what I want; however, there is a newline at the beginning of the array that I am reading it into.
This is the code I am using to split the data up:
using (StreamReader sr = new StreamReader(file))
{
string line;
line = sr.ReadToEnd();
string[] parts = Regex.Split(line, @"(?=PA11)");
List<string> parameterList = new List<string>(parts);
foreach (string s in parameterList)
{
listBox1.Items.Add(s);
}
}
And the result looks like this:
*newline*
LINE 000001 000001 TEST A B TEST OUTPUT *More Lines*
LINE 000002 000002 TEST A B TEST OUTPUT *More Lines*
If anyone can tell me what it is I am doing wrong, I would greatly appreciate it. Thank you in advance.

If your need is that simple, don't use a REGEX.
using (StreamReader sr = new StreamReader(file))
{
string line = sr.ReadLine();
while( line != null ){
if( line.StartsWith( "PA11" ) ){
string[] parts = line.Split(' ');
List<string> parameterList = new List<string>(parts);
foreach (string s in parameterList)
listBox1.Items.Add(s);
}
line = sr.ReadLine(); // advance to the next line; without this the loop never ends
}
}

Looks to me like it's not inserting a newline but a blank entry. Your regex matches the very beginning of the input because the first line starts with PA11, and it doesn't consume any characters, so the first item in the parts array is an empty string. You should be able to prevent that by forcing the regex to consume some characters, such as the newline preceding the PA11 line:
string[] parts = Regex.Split(line, @"[\r\n]+(?=PA11)");
...or by making sure it doesn't match unless there's a newline before PA11:
string[] parts = Regex.Split(line, @"(?<=[\r\n])(?=PA11)");
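To see the difference concretely, here is a minimal sketch. The sample text and the RecordSplitDemo class are made up for illustration; only the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RecordSplitDemo
{
    // Zero-width split: the lookahead also matches at index 0,
    // so the first element of the result is an empty string.
    public static string[] SplitZeroWidth(string text)
    {
        return Regex.Split(text, @"(?=PA11)");
    }

    // Consumes the line break before each PA11 line,
    // so nothing can match at index 0.
    public static string[] SplitOnLineBreak(string text)
    {
        return Regex.Split(text, @"[\r\n]+(?=PA11)");
    }

    public static void Main()
    {
        // Hypothetical sample data: two records, each starting with PA11.
        string text = "PA11 first\r\nLINE 000001\r\nPA11 second\r\nLINE 000002";

        Console.WriteLine(SplitZeroWidth(text).Length);   // 3 (first element is empty)
        Console.WriteLine(SplitOnLineBreak(text).Length); // 2
    }
}
```

Note that the second pattern also removes the line break at the end of each record, which may or may not be what you want in the list box.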

Why not use String.Split? string[] parts = line.Split(new[] { "PA11" }, StringSplitOptions.None);
You can then reinsert the delimiter into each part.

The reason it creates an empty [0] element is there is probably whitespace (newline) at the beginning of the string.
The below will work, code tested here-> http://www.ideone.com/tsOlI (I'm no .NET expert)
string[] parts = Regex.Split(line, @"(?=(?<!^\s*)PA11)");
Expanded:
(?= # look ahead, we're at the first 'PA11'
(?<!^\s*) # but only if it is not preceded by just whitespace since the start of the string
PA11 # ok, this 'PA11' is good to split
) # end look ahead
Beware that if there is anything other than whitespace before the first PA11,
it will create a [0] element with that block.
It could be done a little more meaningfully in a match-all context with something like this:
(?:^\s*|(?<=\n))\s*(PA11.*?)(?=\n+PA11|$)
Use the single-line modifier, or change .*? to [\S\s]*?.
It will only match from the beginning of a block to just before the next beginning (or the end of the string),
and it strips residual boundary whitespace characters.

Related

C# insert string in multiline string

I have a multiline string (from a txt-file using ReadAllText).
the string looks like this:
R;0035709310000026542510X0715;;;
R;0035709310000045094410P1245;;;
R;0035709310000026502910Z1153;;;
I want to insert a ";" at position 22 in each line, so it looks like this:
R;00357093100000265425;10X0715;;;
R;00357093100000450944;10P1245;;;
R;00357093100000265029;10Z1153;;;
The multiline string always contains the same amount of data per line, but not always 3 lines; sometimes there are more.
How do I do this? Please show some code.
Thanks a lot :-)
Best regards
Bent
Try this ...
using System.IO;
using System.Linq;
var lines = File.ReadAllLines("data.txt");
var results = lines.Select(x => x.Insert(22, ";"));
Step 1, don't use ReadAllText(). Use ReadAllLines() instead.
string[] myLinesArray = File.ReadAllLines(...);
Step 2, replace all lines (strings) with the changed version.
for(int i = 0; i < myLinesArray.Length; i++)
myLinesArray[i] = myLinesArray[i].Insert(22, ";");
Step 3, Use WriteAllLines()
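The three steps above can be sketched as one small program. This is only an illustration: the FileSemicolonInserter class name and taking the path from the command line are assumptions, and the length check is an addition, since String.Insert throws when the index lies past the end of a line.

```csharp
using System;
using System.IO;

public static class FileSemicolonInserter
{
    // Returns a copy of the lines with ";" inserted at index 22
    // of every line that is long enough to have that position.
    public static string[] InsertSemicolons(string[] lines)
    {
        string[] result = new string[lines.Length];
        for (int i = 0; i < lines.Length; i++)
        {
            // Guard short lines: Insert(22, ...) would throw on them.
            result[i] = lines[i].Length >= 22 ? lines[i].Insert(22, ";") : lines[i];
        }
        return result;
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: the file path is the only argument.
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: program <path>");
            return;
        }
        string[] lines = File.ReadAllLines(args[0]);           // Step 1
        File.WriteAllLines(args[0], InsertSemicolons(lines));  // Steps 2 and 3
    }
}
```

Keeping the transformation in a method that works on a string array makes it easy to try out without touching any file.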
try this
string s ="R;0035709310000026542510X0715;;;";
s = s.Insert(22,";");
Console.Write(s);
or use Regex
string s =@"R;0035709310000026542510X0715;;;
R;0035709310000045094410P1245;;;
R;0035709310000026502910Z1153;;;";
string resultString = Regex.Replace(s, "^.{22}", "$0;", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Console.Write(resultString);
I think it would be better to read the source file line by line and modify the line as you go.
You could build up your new file in a StringBuilder or, if is large,
write it to a new file, used to replace the source at the end.
Something like this,
using System.IO;
string tempFileName = Path.GetTempFileName();
using (StreamWriter target = File.CreateText(tempFileName))
{
using(StreamReader source = File.OpenText("YourSourceFile.???"))
{
while (source.Peek() >= 0)
{
target.WriteLine(source.ReadLine().Insert(22, ";"));
}
}
}
File.Delete("YourSourceFile.???");
File.Move(tempFileName, "YourSourceFile.???");
This approach is especially appropriate for large files, since it avoids loading all the data into memory at once. Performance should be good for all but very large files or, I guess, files whose lines are very (very) long.
As suggested, you can use the Insert method to achieve your goal.
If your file contains a lot of lines and you need to work on 1 line at a time, you might also consider reading it line by line from a TextReader.
You could go with Regex:
myString = Regex.Replace(myString, @"(^.{22})", "$1;", RegexOptions.Multiline);
Explanation:
You have 3 string arguments:
1st one is the input
2nd is the pattern
3rd is the replacement string
In the pattern:
() is a capturing group: you can refer to it in the replacement string with $n, n being the 1-based index of the capturing group in the pattern. In this case, $1 is whatever matched "(^.{22})"
"^" is the beginning of a line (because we set the Multiline option; otherwise it would be the beginning of the input string)
"." matches any character except a newline
{22} means you want the preceding pattern (in this case ".", any character) 22 times
So what that means is:
"In any line with 22 characters or more, replace the first 22 characters with those same 22 characters plus ";"."

how can i optimize the performance of this regular expression?

I'm using a regular expression to replace commas that are not inside text-qualifying quotes with tabs.
I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long.
I saw an example of using a regex on file content that looked like this
String FileContent = ReadFile(FilePath, ErrInfo);
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");
That replace can understandably take its sweet time on a decent sized file.
Is there a more efficient way to run this regex?
Would it be faster to read the file line by line and run the regex per line?
It seems you're trying to convert comma separated values (CSV) into tab separated values (TSV).
In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
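A rough sketch of that last idea, assuming the file is processed one line at a time and that quoted fields never span lines. The CsvLineConverter class and the ToTabSeparated method are names made up for this example; the regex is the one from the question, minus the unneeded capture group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CsvLineConverter
{
    // Matches a comma only when the rest of the line contains balanced quotes,
    // i.e. when the comma itself is outside a quoted field.
    private static readonly Regex QuoteAwareComma =
        new Regex(@",(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    public static string ToTabSeparated(string line)
    {
        // Fast path: without any quote, every comma is a separator.
        if (line.IndexOf('"') < 0)
            return line.Replace(',', '\t');

        // Slow path: fall back to the quote-aware regex for this line only.
        return QuoteAwareComma.Replace(line, "\t");
    }

    public static void Main()
    {
        Console.WriteLine(ToTabSeparated("a,b,c"));
        Console.WriteLine(ToTabSeparated("a,\"b,c\",d"));
    }
}
```

If most lines contain no quotes, the regex runs rarely, and when it does run the lookahead only scans a single line instead of the rest of the file.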
The problem is the lookahead, which looks all the way to the end on each comma, resulting in O(n^2) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quotes while replacing:
Regex csvRegex = new Regex(@"
(?<Quoted>
"" # Open quotes
(?:[^""]|"""")* # not quotes, or two quotes (escaped)
"" # Closing quotes
)
| # OR
(?<Comma>,) # A comma
",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
match => match.Groups["Comma"].Success ? "\t" : match.Value);
Here we match free commas and quoted strings. The Replace method takes a callback that checks whether we found a comma and replaces accordingly.
The simplest optimization would be
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
Console.WriteLine(r.Replace(line, "\t"));
I haven't profiled it, but I wouldn't be surprised if the speedup was huge.
If that's not enough I suggest some manual labour:
var input = new StreamReader(File.OpenRead("input.txt"));
char[] toMatch = ",\"".ToCharArray ();
string line;
while (null != (line = input.ReadLine()))
{
var result = new StringBuilder(line);
bool inquotes = false;
for (int index=0; -1 != (index = line.IndexOfAny (toMatch, index)); index++)
{
bool isquote = (line[index] == '\"');
inquotes = inquotes != isquote;
if (!(isquote || inquotes))
result[index] = '\t';
}
Console.WriteLine (result);
}
PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)

C# Reading strings from a file and finding matches between Start string and an End String different by one character

Suppose I have a txt file with strings {ABAA, AAAA, ABZA, ABZZ, and AAZZ} and my Start word is AAAA and my end word is AAZZ.
I need to find all the words between the start word and end word different by one character; so from the example given my results would be: AAAA, ABZZ and AAZZ.
At the moment what I am doing is creating a list and reading the file line-by-line and passing it to the list.
// 1 Declare new List.
List<string> lines = new List<string>();
// 2
// Use using StreamReader for disposing.
using (StreamReader sr = new StreamReader(PATH))
{
// 3
// Use while != null pattern for loop
string line;
while ((line = sr.ReadLine()) != null)
{
// 4
// Insert logic here.
// ...
// "line" is a line in the file. Add it to our List.
lines.Add(line);
}
}
My question is: how do I look for strings different by one character? Do I need to break the string that I read from the file into characters and do a comparison to my Start and End Strings?
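// Requires using System.Linq; true when a and b differ in at most one position.
bool CompareStrings(string a, string b) { return a.Length == b.Length && a.Zip(b, (x, y) => new { x, y }).Where(p => p.x != p.y).Take(2).Count() <= 1; }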
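A self-contained sketch of the same idea, applied to the word list from the question. It assumes that "different by one character" means equal length and at most one differing position; the WordDistance class and the DiffersByAtMostOne method are illustrative names.

```csharp
using System;
using System.Linq;

public static class WordDistance
{
    // True when a and b have the same length and differ in at most one position.
    public static bool DiffersByAtMostOne(string a, string b)
    {
        if (a.Length != b.Length)
            return false;

        // Pair up the characters position by position and count the mismatches.
        return a.Zip(b, (x, y) => x != y).Count(differs => differs) <= 1;
    }

    public static void Main()
    {
        string[] words = { "ABAA", "AAAA", "ABZA", "ABZZ", "AAZZ" };

        // Print every word within one character of the end word.
        foreach (string word in words.Where(w => DiffersByAtMostOne(w, "AAZZ")))
            Console.WriteLine(word); // prints ABZZ, then AAZZ
    }
}
```

No splitting into separate character arrays is needed, because a string can already be enumerated character by character.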
Regular expressions are very good at finding this sort of thing and .NET has excellent support for regular expressions. First you need to define the regular expression.
Your requirements are a bit vague but according to your description, example data and example results I'm inferring that you want to match the start word and every word that varies from the end word by exactly one character. The regex you need is:
\bAAAA\b|\bAAZ\w\b|\bAA\wZ\b|\bA\wZZ\b|\b\wAZZ\b
Let me break that down from left to right.
'\b' means "word boundary" which could be whitespace or a curly brace or other such non-word character.
'AAAA' is your start word and would be matched literally
'\b' means "word boundary"
'|' means "alternation" which essentially means "match the expression on the left OR match the expression on the right"
'\b' means "word boundary"
'AAZ\w' is the first permutation of one-character differences from your end word. '\w' means "any word character."
'\b' means "word boundary"
'\bAA\wZ\b' is the second permutation of one-character differences from your end word.
'\bA\wZZ\b' is the third permutation.
'\b\wAZZ\b' is the fourth and final permutation and would also match the end word.
See http://www.regular-expressions.info/reference.html for definitions of "word boundary" and "word character."
Now for the code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
string pattern = @"\bAAAA\b|\bAAZ\w\b|\bAA\wZ\b|\bA\wZZ\b|\b\wAZZ\b";
// 1 Declare new List.
List<string> lines = new List<string>();
// 2
// Use using StreamReader for disposing.
using (StreamReader sr = new StreamReader(PATH))
{
// 3
// Use while != null pattern for loop
string line;
while ((line = sr.ReadLine()) != null)
{
// 4
if (Regex.IsMatch(line, pattern, RegexOptions.IgnoreCase))
{
// ...
// "line" is a line in the file. Add it to our List.
lines.Add(line);
}
}
}
I'm not sure of all the requirements, but this function should return the number of distinct characters that two words have in common.
private int CheckWord(string startWord, string otherWord)
{
List<char> start = new List<char>(startWord.ToArray());
List<char> wordt = new List<char>(otherWord.ToArray());
return start.Intersect(wordt).Count();
}
This call CheckWord("start", "srart"); returns 4. Match that number against the length of the string to determine how different they are.

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.
If you want to remove lines containing any whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
.TrimEnd();
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
string line;
while((line = reader.ReadLine()) != null)
{
if (line.Trim().Length > 0)
writer.WriteLine(line);
}
outputString = writer.ToString();
}
off the top of my head...
string result = Regex.Replace(input, @"\s*(\n)", "$1"); // "fixed" is a reserved word in C#
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf
Using LINQ:
var result = string.Join("\r\n",
multilineString.Split(new string[] { "\r\n" }, ...None)
.Where(s => !string.IsNullOrWhiteSpace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.
Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by @Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three alternative patterns in here which I am willing to replace with string.Empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one to unlimited lines containing only white space and preceded by a line break (but does not match the first preceding line break).
(?<=\r?\n)(\r?\n)+ - matches one to unlimited empty lines with no content that are preceded by a line break (but does not match the first preceding line break).
(\r?\n)+\z - matches one to unlimited line breaks at the end of the tested string (trailing line breaks as you called them)
That satisfies your test perfectly! But also satisfies both \r\n and \n line break styles! Test it out! I believe this will be the most correct answer, although simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: @Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:
\b\s+\z The \b is a word boundary (beginning or END of a word), the \s+ is one or more white space characters, and the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file, including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.
\A\s+ - The \A matches the beginning of the file, the \s+ matches one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file
not good. I would use this one using JSON.net:
var o = JsonConvert.DeserializeObject(prettyJson);
var minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
string result = Regex.Replace(inputs[i], pattern, "",
RegexOptions.Multiline);
Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.
string corrected =
System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
I'll go with:
public static string RemoveEmptyLines(string value) {
using (StringReader reader = new StringReader(value)) {
StringBuilder builder = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Trim().Length > 0)
builder.AppendLine(line);
}
return builder.ToString();
}
}
Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There is no need for special configuration or multi-line support. Nevertheless, you can add the s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
If it's only spaces, why don't you use the C# string method?
string yourstring = "A O P V 1.5";
yourstring = yourstring.Replace(" ", string.Empty);
The result will be "AOPV1.5".
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
Here is something simple if working against each individual line...
(^\s+|\s+|^)$
Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
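As a quick check of the pattern above against the bounty test case, here is a minimal sketch. The BlankLineStripper class is a made-up wrapper; StripWhitespace is the function name used in the bounty text, and no RegexOptions are set, as the footnote says.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlankLineStripper
{
    // Removes whitespace-only lines and the trailing newline,
    // assuming \r\n line endings throughout.
    public static string StripWhitespace(string text)
    {
        return Regex.Replace(text, @"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$", string.Empty);
    }

    public static void Main()
    {
        string result = StripWhitespace("test\r\n \r\nthis\r\n\r\n");
        Console.WriteLine(result == "test\r\nthis"); // True
    }
}
```

For input with bare \n line endings the pattern would need adjusting, for example by replacing each \r\n with \r?\n, which is untested here.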
String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}
I'm not sure whether it is efficient, but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());
Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
s = Regex.Replace(s, @"^[^\n\S]*\n", "", RegexOptions.Multiline);
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, @"^[ \t\r]*\n", "", RegexOptions.Multiline);
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, @"^[ \t\r]*\n?", "", RegexOptions.Multiline);

Regex not working in .NET

So I'm trying to match up a regex and I'm fairly new at this. I used a validator and it works when I paste the code but not when it's placed in the codebehind of a .NET2.0 C# page.
The offending code is supposed to be able to split on a single semi-colon but not on a double semi-colon. However, when I used the string
"entry;entry2;entry3;entry4;"
I get a nonsense array that contains empty values, the last letter of the previous entry, and the semi-colons themselves. The online javascript validator splits it correctly. Please help!
My regex:
((;;|[^;])+)
Split on the following regular expression:
(?<!;);(?!;)
It means: match semicolons that are neither preceded nor followed by another semicolon.
For example, this code
var input = "entry;entry2;entry3;entry4;";
foreach (var s in Regex.Split(input, @"(?<!;);(?!;)"))
Console.WriteLine("[{0}]", s);
produces the following output:
[entry]
[entry2]
[entry3]
[entry4]
[]
The final empty field is a result of the semicolon on the end of the input.
If the semicolon is a terminator at the end of each field rather than a separator between consecutive fields, then use Regex.Matches instead
foreach (Match m in Regex.Matches(input, @"(.+?)(?<!;);(?!;)"))
Console.WriteLine("[{0}]", m.Groups[1].Value);
to get
[entry]
[entry2]
[entry3]
[entry4]
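The sample input has no doubled semicolons, so here is a small sketch that exercises that case as well. The SemicolonSplitter class and the input string are made up for illustration; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SemicolonSplitter
{
    // Splits on a semicolon only when it is neither preceded
    // nor followed by another semicolon.
    public static string[] SplitOnSingleSemicolons(string input)
    {
        return Regex.Split(input, @"(?<!;);(?!;)");
    }

    public static void Main()
    {
        // The doubled semicolon stays inside the first field.
        foreach (string field in SplitOnSingleSemicolons("a;;b;c;d"))
            Console.WriteLine("[{0}]", field); // prints [a;;b], [c], [d]
    }
}
```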
Why not use String.Split on the semicolon?
string sInput = "Entry1;entry2;entry3;entry4";
string[] sEntries = sInput.Split(';');
// Do what you have to do with the entries in the array...
Hope this helps,
Best regards,
Tom.
As tommieb75 wrote, you can use String.Split with the StringSplitOptions enumeration to control which elements end up in the resulting array:
string input = "entry1;;entry2;;;entry3;entry4;;";
char[] charSeparators = new char[] {';'};
// Split a string delimited by characters and return all non-empty elements.
string[] result = input.Split(charSeparators, StringSplitOptions.RemoveEmptyEntries);
The result would contain only 4 elements like this:
<entry1><entry2><entry3><entry4>
