Regular Expression for parsing ASCII data

Right now I have a couple separate regular expressions to filter data from a string but I'm curious if there's a way to do it all in one go.
Sample Data:
(DATA$0$34.0002,5.3114$34.0002,5.2925$34.0004,5.3214$34.0007,2.2527$34.0002,44.3604$34.0002,43.689$34.0004,38.3179$34.0007,8.1299)
Need to verify there's an open and close parentheses ( )
Need to verify there's a "DATA$0" after the open parenthesis
Need to split the results by $
Need to split that subset by comma
Need to capture only the last item of that subset (i.e. 5.3114, 5.2925, 5.3214, etc.)
My first check is on the parentheses, using \(([^)]+)\) as my regex with the RightToLeft and ExplicitCapture options (some lines can have multiple data sets).
Next I filter for DATA$0 using (?:\(DATA\$0)
Finally I do my splits and take the last value in the array to get what I need but I'm trying to figure out if there's a better way.
string DataPattern = @"(?:\(DATA\$0)";
string ParenthesisPattern = @"\(([^)]+)\)";
RegexOptions options = RegexOptions.RightToLeft | RegexOptions.ExplicitCapture;
StreamReader sr = new StreamReader(FilePath);
while (!sr.EndOfStream)
{
    string line = sr.ReadLine();
    Console.WriteLine(line);
    Match parentMatch = Regex.Match(line, ParenthesisPattern, options);
    if (parentMatch.Success)
    {
        string value = parentMatch.Value;
        Match dataMatch = Regex.Match(value, DataPattern);
        if (dataMatch.Success)
        {
            string output = parentMatch.Value.Replace("(DATA$0", "").Replace(")", "");
            string[] splitOutput = Regex.Split(output, @"\$");
            foreach (string x in splitOutput)
            {
                if (!string.IsNullOrEmpty(x))
                {
                    string[] splitDollar = Regex.Split(x, ",");
                    if (splitDollar.Length > 0)
                        Console.WriteLine("Value: " + splitDollar[splitDollar.Length - 1]);
                }
            }
        }
        else
            Console.WriteLine("NO DATA");
    }
    else
        Console.WriteLine("NO PARENTHESIS");
    Console.ReadLine();
}
TIA

You can use
var results = Regex.Matches(text, @"(?<=\(DATA\$0[^()]*,)[^(),$]+(?=(?:\$[^()]*)?\))")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
Details:
(?<=\(DATA\$0[^()]*,) - a positive lookbehind that matches a location that is immediately preceded with (DATA$0, zero or more chars other than ( and ) (as many as possible) and a comma
[^(),$]+ - one or more chars other than (, ), $ and a comma
(?=(?:\$[^()]*)?\)) - the current location must be immediately followed with an optional occurrence of a $ char and then zero or more chars other than ( and ), and then a ) char.
An alternative:
var results = Regex.Matches(text, @"(?:\G(?!^)|\(DATA\$0)[^()]*?,([^(),$]+)(?=(?:\$[^()]*)?\))")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value)
    .ToList();
Details:
(?:\G(?!^)|\(DATA\$0) - either the end of the previous successful match, or (DATA$0 string
[^()]*? - zero or more chars other than ( and ), as few as possible
, - a comma
([^(),$]+) - Group 1: one or more chars other than (, ), ,, $
(?=(?:\$[^()]*)?\)) - a positive lookahead matching the location that is immediately followed with an optional occurrence of a $ char followed with zero or more chars other than ( and ), and then a ) char.
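As a quick check of the lookbehind-based pattern above, here is a minimal, self-contained sketch. The class name DataParser and the method name ExtractLastValues are made up for illustration, and the input is a shortened version of the sample line from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DataParser
{
    // Returns the value after the comma in every "$a,b" pair,
    // but only inside a parenthesized block that starts with (DATA$0.
    public static List<string> ExtractLastValues(string text)
    {
        return Regex.Matches(text, @"(?<=\(DATA\$0[^()]*,)[^(),$]+(?=(?:\$[^()]*)?\))")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        string line = "(DATA$0$34.0002,5.3114$34.0002,5.2925)";
        foreach (string value in ExtractLastValues(line))
            Console.WriteLine("Value: " + value);   // Value: 5.3114, then Value: 5.2925
    }
}
```

A line without the (DATA$0 prefix, or without the surrounding parentheses, simply produces an empty list, so the separate "NO DATA" and "NO PARENTHESIS" checks collapse into one "no matches" case.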

Related

Regex How to Match 2 fields

How would I capture both the filenames inside the quotes and the numbers following them as named captures (Regex / C#)?
Files("fileone.txt", 5969784, "file2.txt", 45345333)
For every occurrence in the string, I need to capture the filename (e.g. "fileone.txt") and the integer that follows it, so that a loop can process each pair.
I am trying to use this https://regex101.com/r/MwMzBo/1 but I am having issues matching without the '[' and ']'.
I need to be able to loop over each filename + size pair, moving to the next one each time.
Any help is appreciated!
UPDATE
string file = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333, \"file2.txt\", 45345333)";
var regex = new Regex(@"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");
var match = regex.Match(file);
var names = match.Groups["file"].Captures.Cast<Capture>();
var lengths = match.Groups["number"].Captures.Cast<Capture>();
var filelist = names.Zip(lengths, (f, n) => new { file = f.Value, length = long.Parse(n.Value) }).ToArray();
foreach (var item in filelist)
{
    // Only returning 1 pair result, ignoring the rest
}
Reading match.Value to confirm what is being read: only the first pair is being picked up.
while (match.Success)
{
    MessageBox.Show(match.Value);
    match = match.NextMatch();
}
Now we are getting all results properly. I read that Regex.Match only returns the first matched result, which explains a lot.

You can use
(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)
Details:
(?:\G(?!\A)\s*,\s*|\w+\() - end of the previous successful match and a comma enclosed with zero or more whitespaces, or a word and an opening ( char
(?:""(?<file>.*?)""|'(?<file>.*?)') - ", Group "file" capturing any zero or more chars other than a newline char as few as possible and then a ", or a ', Group "file" capturing any zero or more chars other than a newline char as few as possible and then a '
\s*,\s* - a comma enclosed with zero or more whitespaces
(?<number>\d+) - Group "number": one or more digits.
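Since each filename/size pair is a separate match, one way to loop over the pairs is Regex.Matches rather than a single Regex.Match. The sketch below assumes that approach; FileListParser and ParsePairs are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FileListParser
{
    private static readonly Regex PairPattern = new Regex(
        @"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");

    // Each match holds exactly one pair: the "file" group and the "number" group.
    public static List<KeyValuePair<string, long>> ParsePairs(string input)
    {
        var pairs = new List<KeyValuePair<string, long>>();
        foreach (Match m in PairPattern.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, long>(
                m.Groups["file"].Value,
                long.Parse(m.Groups["number"].Value)));
        }
        return pairs;
    }

    public static void Main()
    {
        string input = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333)";
        foreach (var pair in ParsePairs(input))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```

Because \G anchors every match after the first to the end of the previous one, the scan stops at the first element that does not fit the "quoted name, number" shape.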

I like doing it in smaller pieces:
string input = "cov('Age', ['5','7','9'])";
string pattern1 = @"\((?'key'[^,]+),\s+\[(?'values'[^\]]+)";
Match match = Regex.Match(input, pattern1);
string key = match.Groups["key"].Value.Trim(new char[] {'\''});
string pattern2 = @"'(?'value'[^']+)'";
string values = match.Groups["values"].Value;
MatchCollection matches = Regex.Matches(values, pattern2);
int[] number = matches.Cast<Match>().Select(x => int.Parse(x.Value.Replace("'",string.Empty))).ToArray();

Regex.Split string into substrings by a delimiter while preserving whitespace

I created a Regex to split a string by a delimiter ($), but it's not working the way I want.
var str = "sfdd fgjhk fguh $turn.bak.orm $hahr*____f";
var list = Regex.Split(str, @"(\$\w+)").Where(x => !string.IsNullOrEmpty(x)).ToList();
foreach (var item in list)
{
    Console.WriteLine(item);
}
Output:
"sfdd fgjhk fguh "
"$turn"
".bak.orm "
"$hahr"
"*____f"
The problem is \w+ is not matching any periods or stars. Here's the output I want:
"sfdd fgjhk fguh "
"$turn.bak.orm"
" "
"$hahr*____f"
Essentially, I want to split a string by $ and make sure $ appears at the beginning of a substring and nowhere else (it's okay for a substring to be $ only). I also want to make sure whitespace characters are preserved as in the first substring, but any match should not contain whitespace as in the second and fourth cases. I don't care for case sensitivity.

It appears you want to split with a pattern that starts with a dollar and then has any 0 or more chars other than whitespace and dollar chars:
var list = Regex.Split(s, @"(\$[^\s$]*)")
    .Where(x => !string.IsNullOrEmpty(x))
    .ToList();
Details
( - start of a capturing group (so that Regex.Split keeps the matched tokens in the resulting array)
\$ - a dollar sign
[^\s$]* - a negated character class matching 0 or more chars other than whitespace (\s) and dollar symbols
) - end of the capturing group.
To include a second delimiter, you may use @"([€$][^\s€$]*)".
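To see the split on the string from the question, here is a small runnable sketch. DollarSplitter and SplitKeepingTokens are illustrative names, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DollarSplitter
{
    // The capturing group makes Regex.Split return the $-tokens as well as
    // the text between them; empty pieces are dropped afterwards.
    public static List<string> SplitKeepingTokens(string input)
    {
        return Regex.Split(input, @"(\$[^\s$]*)")
            .Where(piece => !string.IsNullOrEmpty(piece))
            .ToList();
    }

    public static void Main()
    {
        var pieces = SplitKeepingTokens("sfdd fgjhk fguh $turn.bak.orm $hahr*____f");
        foreach (var piece in pieces)
            Console.WriteLine("\"" + piece + "\"");
        // "sfdd fgjhk fguh "
        // "$turn.bak.orm"
        // " "
        // "$hahr*____f"
    }
}
```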

Match properties using regex

I have a string that represents a set of properties, for example:
AB=0, TX="123", TEST=LDAP, USR=" ", PROPS="DN=VB, XN=P"
I need to extract these properties as:
AB=0
TX=123
TEST=LDAP
USR=
PROPS=DN=VB, XN=P
To resolve this problem I tried to use a regex, but without success.
public IEnumerable<string> SplitStr(string input)
{
Regex reg= new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))", RegexOptions.Compiled);
foreach (Match match in reg.Matches(input))
{
yield return match.Value.Trim(',');
}
}
I can't find a regex that produces the expected output. With the above regex the output is:
AB=0
123
TEST=LDAP
DN=VB, XN=P
Can anyone help me?

You may use
public static IEnumerable<string> SplitStr(string input)
{
    var matches = Regex.Matches(input, @"(\w+=)(?:""([^""]*)""|(\S+)\b)");
    foreach (Match match in matches)
    {
        yield return string.Concat(match.Groups.Cast<Group>().Skip(1).Select(x => x.Value)).Trim();
    }
}
The regex details:
(\w+=) - Group 1: one or more word chars and a = char
(?:""([^""]*)""|(\S+)\b) - a non-capturing group matching either of the two alternatives:
"([^"]*)" - a ", then 0 or more chars other than " and then a "
| - or
(\S+)\b - any 1+ chars other than whitespace, as many as possible, up to the word boundary position.
The string.Concat(match.Groups.Cast<Group>().Skip(1).Select(x => x.Value)).Trim() code omits the Group 0 (whole match) value from the groups, takes Group 1, 2 and 3 and concats them into a single string, and trims it afterwards.
C# test:
var s = "AB=0, TX=\"123\", TEST=LDAP, USR=\" \", PROPS=\"DN=VB, XN=P\"";
Console.WriteLine(string.Join("\n", SplitStr(s)));
Output:
AB=0
TX=123
TEST=LDAP
USR=
PROPS=DN=VB, XN=P

Another way could be to use 2 capturing groups where the first group captures the first part including the equals sign and the second group captures the value after the equals sign.
Then you can concatenate the groups and use Trim to remove the double quotes. If you also want to remove the whitespaces after that, you could use Trim again.
([^=\s,]+=)("[^"]+"|[^,\s]+)
That will match
( First capturing group
[^=\s,]+= Match 1+ times not an equals sign, comma or whitespace char, then match = (if the property name can contain a comma, you could instead use a character class that lists what you allow, for example [\w,]+)
) Close group
( Second capturing group
"[^"]+" Match from opening till closing double quote
| Or
[^,\s]+ Match 1+ times not a comma or whitespace char
)
Your code might look like:
public IEnumerable<string> SplitStr(string input)
{
    foreach (Match m in Regex.Matches(input, @"([^=\s,]+=)(""[^""]+""|[^,\s]+)"))
    {
        yield return string.Concat(m.Groups[1].Value, m.Groups[2].Value.Trim('"'));
    }
}

Regex to find longest text fragment where last letter of word matches first letter of next word

For example if I had a text like
first line of text
badger Royal lemon, night trail
light of. Random string of words
that don't match anymore.
My result would have to be lines of words where the last character of each word matches the first character of the next word, even if there are separators in between. In this case:
badger Royal lemon, night trail
light
What is the easiest way to do this if I want to use Regex?

A regular expression that matches each of the sequences of words would be:
(?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+
You'll need to adjust the \W part depending on your rules regarding allowing full-stops, semi-colons, commas, etc.
Note this also assumes single letter words break a sequence.
You could then loop over each of the occurrences and find the longest:
try {
    Regex regexObj = new Regex(@"(?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success) {
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        // @todo here test and keep the longest match.
        matchResults = matchResults.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
// (?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+
//
// Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Numbered capture
//
// Match the regular expression below «(?:\b\w+(\w)\b[\W]*(?=\1))*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Assert position at a word boundary (position preceded or followed, but not both, by a Unicode letter, digit, or underscore) «\b»
//    Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//    Match the regex below and capture its match into backreference number 1 «(\w)»
//       Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w»
//    Assert position at a word boundary (position preceded or followed, but not both, by a Unicode letter, digit, or underscore) «\b»
//    Match a single character that is NOT a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «[\W]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\1)»
//       Match the same text that was most recently matched by capturing group number 1 (case insensitive; fail if the group did not participate in the match so far) «\1»
// Match the same text that was most recently matched by capturing group number 1 (case insensitive; fail if the group did not participate in the match so far) «\1»
// Match a single character that is a “word character” (Unicode; any letter or ideograph, digit, connector punctuation) «\w+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
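The "keep the longest match" step left as a todo in the loop could be filled in roughly as follows. This is only a sketch: WordChains and FindLongestChain are hypothetical names, and "longest" is taken to mean the match with the most characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordChains
{
    private static readonly Regex ChainPattern = new Regex(
        @"(?:\b\w+(\w)\b[\W]*(?=\1))*\1\w+",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Walks every match and remembers the one with the greatest length.
    // Returns null when the text contains no chain of two or more words.
    public static string FindLongestChain(string text)
    {
        string longest = null;
        for (Match m = ChainPattern.Match(text); m.Success; m = m.NextMatch())
        {
            if (longest == null || m.Length > longest.Length)
                longest = m.Value;
        }
        return longest;
    }

    public static void Main()
    {
        // Two chains exist here: "ab bc" and "dog goat tiger".
        Console.WriteLine(FindLongestChain("ab bc xx dog goat tiger"));
    }
}
```

Ties are resolved in favor of the earliest match, since a later match only replaces the stored one when it is strictly longer.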

I know that this is NOT a regex implementation but ... maybe it helps. This is a simple implementation in C#:
public static string Process (string s)
{
var split = s.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
if (split.Length < 2)
return null; // impossible to find something if the length is not at least two
string currentString = null;
string nextString = null;
for (var i = 0; i < split.Length - 1; i++)
{
var str = split[i];
if (str.Length == 0) continue;
var lastChar = str[str.Length - 1];
var nextStr = split[i + 1];
if (nextStr.Length == 0) continue;
var nextChar = nextStr[0];
if (lastChar == nextChar)
{
if (currentString == null)
{
currentString = str;
nextString = nextStr.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[0];
}
else
{
if (str.Length > currentString.Length)
{
currentString = str;
nextString = nextStr.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[0];
}
}
}
}
return currentString == null ? null : currentString + "\n" + nextString;
}

Regex won't really be able to tell you which match is the longest in the string.
But, using @DeanTaylor's method with a global match, you could store the longest match
based on its string length.
This is a slight variation of his regex, but it works the same.
(?:\w*(\w)\W+(?=\1))+\w+
Formatted:
(?:
    \w*
    ( \w )      # (1)
    \W+
    (?= \1 )
)+
\w+

Split by comma if that comma is not located between two double quotes

I am looking to split such string by comma :
field1:"value1", field2:"value2", field3:"value3,value4"
into a string[] that would look like:
0 field1:"value1"
1 field2:"value2"
2 field3:"value3,value4"
I am trying to do that with Regex.Split but can't seem to work out the regular expression.

It'll be much easier to do this with Matches than with Split, for example
string[] asYouWanted = Regex.Matches(input, @"[A-Za-z0-9]+:"".*?""")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
although if there is any chance of your values (or fields!) containing escaped quotes (or anything similarly tricky), then you might be better off with a proper CSV parser.
If you do have escaped quotes in your values, I think the following regex should work - give it a test:
@"field3:""value3\\"",value4""", @"[A-Za-z0-9]+:"".*?(?<=(?<!\\)(\\\\)*)"""
The added (?<=(?<!\\)(\\\\)*) is supposed to make sure that the " it stops matching on is preceded by only an even number of backslashes, as an odd number of backslashes means it is escaped.
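A small sketch of the escaped-quote variant in use; QuotedFieldSplitter and SplitFields are names invented for this example, and the test input with a backslash-escaped quote is an assumption about what such data might look like.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedFieldSplitter
{
    // Matches name:"value" items; a closing quote only counts when it is
    // preceded by an even number of backslashes (i.e. it is not escaped).
    public static string[] SplitFields(string input)
    {
        return Regex.Matches(input, @"[A-Za-z0-9]+:"".*?(?<=(?<!\\)(\\\\)*)""")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // Verbatim string, so the input is: a:"x", b:"y\",z", c:"w"
        string input = @"a:""x"", b:""y\"",z"", c:""w""";
        foreach (string field in SplitFields(input))
            Console.WriteLine(field);
    }
}
```

On input without any backslashes, such as the line from the question, the lookbehind always succeeds and the pattern behaves like the simpler one.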

Untested but this should be Ok:
string[] parts = input.Split(new string[] { ",\"" }, StringSplitOptions.None);
Remember to add the " back on the end if you need it.

string[] arr = str.Split(new string[] { "\"," }, StringSplitOptions.None).Select(s => s + "\"").ToArray();
Split by "\"," as webnoob mentioned, then append the trailing " to each piece using Select, and convert the result to an array.

Try this:
// (\w.+?):"(\w.+?)"
//
// Match the regular expression below and capture its match into backreference number 1 «(\w.+?)»
//    Match a single character that is a “word character” (letters, digits, and underscores) «\w»
//    Match any single character that is not a line break character «.+?»
//       Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters “:"” literally «:"»
// Match the regular expression below and capture its match into backreference number 2 «(\w.+?)»
//    Match a single character that is a “word character” (letters, digits, and underscores) «\w»
//    Match any single character that is not a line break character «.+?»
//       Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the character “"” literally «"»
try {
    Regex regObj = new Regex(@"(\w.+?):""(\w.+?)""");
    Match matchResults = regObj.Match(sourceString);
    // Collect the matches in a list, since their number is not known up front
    List<string> results = new List<string>();
    while (matchResults.Success) {
        results.Add(matchResults.Value);
        matchResults = matchResults.NextMatch();
    }
    string[] arr = results.ToArray();
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

The easiest inbuilt way is here. I checked it and it is working fine. It splits "Hai,\"Hello,World\"" into {"Hai","Hello,World"}
