Parsing a formatted string with RegEx or similar


I have an application which sends a TCP message to a server, and gets one back.
The message it gets back is in this format:
0,"120"1,"Data Field 1"2,"2401"3,"Data Field 3"1403-1,"multiple occurence 1"1403-2,"multiple occurence 2"99,""
So basically it is a set of fields concatenated together.
Each field has a tag, a comma, and a value - in that order.
The tag is the number, the value is in quotes, and the comma separates them.
0,"120"
0 is the tag, 120 is the value.
A complete message always starts with a 0 field and ends with 99,"" field.
To complicate things, some tags have dashes because they are split into more than 1 value.
The order of the numbers is not significant.
(For reference, this is a "Fedex Tagged Transaction" message).
So I'm looking for a decent way of validating that we have a "complete" message (i.e. it has the 0 and 99 fields). Because it's from a TCP message, I guess I have to account for not having received the full message yet.
Then splitting it up to get all the values I need.
The best I have come up with for parsing is some poor regex and some cleaning-up afterwards.
The heart of it is this: (\d?\d?\d?\d?-?\d?\d,") to split it
string r = @"(\d?\d?\d?\d?-?\d?\d,"")";
string[] strArray = Regex.Split(receivedData, r);
Assert.AreEqual(14, strArray.Length, "Array length should be 14, since we have 7 fields.");
Dictionary<string, string> fields = new Dictionary<string, string>();
//Now put it into a dictionary which should be easier to work with than an array
for (int i = 0; i <= strArray.Length-2; i+=2)
{
fields.Add(strArray[i].Trim('"').Trim(','), strArray[i + 1].Trim('"'));
}
Which doesn't really work.
It has a lot of quotes and commas left over, and doesn't seem particularly well-formed...
I'm not good with Regex so I can't put together what I need it to do.
I don't even know if it is the best way.
Any help appreciated.

I suggest you use Regex.Matches rather than Regex.Split. This way you can iterate over all the matches, and use capture groups to just grab the data you want directly, while still maintaining structure. I provided a regex that should work for this below in the example:
MatchCollection matchlist = Regex.Matches(receivedData, @"(?<tag>\d+(?:-\d+)?),""(?<data>.*?)""");
foreach (Match match in matchlist)
{
string tag = match.Groups["tag"].Value;
string data = match.Groups["data"].Value;
}
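The completeness check the question asks about (a 0 field first, a 99,"" field last, nothing left over) could be layered on top of the same kind of pattern. The following is only a sketch: the TaggedMessage class and TryParse method are names invented here, and it assumes that values never contain a double quote. It returns false for a partial buffer so the caller can keep reading from the socket and try again.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TaggedMessage
{
    // One field: tag (digits, optionally "-digits"), a comma, then a quoted value.
    // \G forces each match to start exactly where the previous one ended.
    private static readonly Regex Field =
        new Regex(@"\G(?<tag>\d+(?:-\d+)?),""(?<data>[^""]*)""\s*");

    // Returns true and fills 'fields' only when 'buffer' holds one whole message.
    public static bool TryParse(string buffer, out List<KeyValuePair<string, string>> fields)
    {
        fields = new List<KeyValuePair<string, string>>();
        int consumed = 0;

        Match m = Field.Match(buffer);
        while (m.Success)
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value, m.Groups["data"].Value));
            consumed = m.Index + m.Length;
            m = m.NextMatch();
        }

        // Complete means: every character was consumed, the first tag is 0,
        // and the last field is 99 with an empty value.
        bool complete = consumed == buffer.Length
            && fields.Count >= 2
            && fields[0].Key == "0"
            && fields[fields.Count - 1].Key == "99"
            && fields[fields.Count - 1].Value == "";

        if (!complete)
        {
            fields.Clear();
        }
        return complete;
    }
}
```

A list of pairs is used instead of a dictionary so that field order is preserved and a repeated tag cannot throw.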

Try this expression
\d*(-\d*)?,"[^"]*"
Match count: 7
0,"120"
1,"Data Field 1"
2,"2401"
3,"Data Field 3"
1403-1,"multiple occurence 1"
1403-2,"multiple occurence 2"
99,""

Related

Reading in a text file more 'intelligently'

I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep its designated id#, something like: read "aabcdef" and store it into an array at spot 208.
The 2 issues I'm running into are:
I've never read from a file in C#. Is there a way to read, say, from the start of a line to whitespace as a string, and then the next string as an int until the end of the line?
Given the nature and size of these files, I do not know the highest ID value in each file (not all numbers are used, so some files could house a number like 3000 but only actually list 200 variables). So how could I store these variables flexibly when I don't know how big the array/list/stack/etc. would need to be?

Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure you have no duplicate numbers because Dictionary keys should be unique otherwise you will get an exception.
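If the file may contain duplicate numbers or lines that are not in the "name number" shape, a plain loop is a little more forgiving than ToDictionary. This is a minimal sketch; VariableTable and Parse are made-up names, and keeping the first occurrence of a duplicate id is an assumption, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;

public static class VariableTable
{
    // Parses "name id" lines into id -> name.
    // Malformed lines are skipped; for a repeated id the first name wins.
    public static Dictionary<int, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<int, string>();
        char[] separators = { ' ', '\t' };

        foreach (string line in lines)
        {
            string[] parts = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            int id;
            if (parts.Length != 2 || !int.TryParse(parts[1], out id))
            {
                continue; // blank lines, "..." placeholders, non-numeric ids
            }

            if (!table.ContainsKey(id))
            {
                table.Add(id, parts[0]);
            }
        }
        return table;
    }
}
```

It could be called as VariableTable.Parse(File.ReadLines("path")), after which table[208] gives aabcdef for the sample data.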

I've been thinking about how I would improve other answers and I've found this alternative solution based on Regex which makes the search into the whole string (either coming from a file or not) safer.
Note that you can alter the regular expression to include other separators. The sample expression detects spaces and tabs.
In the end, I found that MatchCollection gives a safer result: you always know that the second capture group (Groups[2]) is an integer and the first (Groups[1]) is text, because the regular expression does a lot of checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, @"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Check that using regexps, int.Parse should never fail because
// it matched numbers only!
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- Look up by key (the number)
Update:
As we discussed in chat, this solution didn't work for the actual input you showed me. The approach itself is fine; the problem was your particular case, where keys look like "[something].[something]" (for example: address.Name).
I've changed given regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers the case of key having a dot.
It's about improving the matching regular expression to fit your requirements! ;)
Update 2:
Since you told me that you need keys having any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now it means that key is anything excepting spaces and tabs.
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}

validate excel worksheet name

I'm getting the error below when setting the worksheet name dynamically. Does anyone have a regexp to validate the name before setting it?
The name that you type does not exceed 31 characters.
The name does not contain any of the following characters: : \ / ? * [ or ]
You did not leave the name blank.

You can use this method to check whether the sheet name is valid:
private bool IsSheetNameValid(string sheetName)
{
if (string.IsNullOrEmpty(sheetName))
{
return false;
}
if (sheetName.Length > 31)
{
return false;
}
char[] invalidChars = new char[] {':', '\\', '/', '?', '*', '[', ']'};
if (invalidChars.Any(sheetName.Contains))
{
return false;
}
return true;
}

To do worksheet validation for those specified invalid characters using Regex, you can use something like this:
string wsName = @"worksheetName"; //verbatim string to take special characters literally
bool nameIsValid = !string.IsNullOrEmpty(wsName) && wsName.Length <= 31 && !Regex.IsMatch(wsName, @"[:\\/\?\*\[\]]");
This also includes a check to see if the worksheet name is null or empty, or if it's longer than 31 characters. Those two checks aren't done via Regex for the sake of simplicity and to avoid over-engineering this problem. They run before the regex so that a null name is rejected instead of causing an exception.

Let's match the start of the string, then between 1 and 31 things that aren't on the forbidden list, then the end of the string. Requiring at least one means we refuse empty strings:
^[^:\/\\\?\*\[\]]{1,31}$
There's at least one nuance that this regex will miss: this will accept a sequence of spaces, tabs and newlines, which will be a problem if that is considered to be blank (as it probably is).
If you take the length check out of the regex, then you can get the blankness check by doing something like:
^[^:\/\\\?\*\[\]]*[^\s:\/\\\?\*\[\]][^:\/\\\?\*\[\]]*$
How does that work? If we defined our class above as WORKSHEET, that would be:
^[^WORKSHEET]*[^\sWORKSHEET][^WORKSHEET]*$
So we match zero or more non-forbidden characters, then a character that is neither forbidden nor whitespace, then zero or more non-forbidden characters. The key is that we demand at least one non-whitespace character in the middle section.
But we've lost the length check. It's hard to do both the length check and the regex in one expression. In order to count, we have to phrase things in terms of matching n times, and the things being matched have to be known to be of length 1. But in order to allow whitespace to be placed freely - as long as it's not all whitespace - we need to have a part of the match that is not necessarily of length 1.
Well, that's not quite true. At this point this starts to become a really bad idea, but nevertheless: onwards, into the breach! (for educational purposes only)
Instead of using * for the possibly-blank sections, we can specify the number we expect of each, and include all the possible ways for those three sections to add up to at most 31. The middle section is always exactly one character, so the other two may total at most 30: fix the length of the first section at each value from 0 to 30 and cap the last section at whatever remains. That gives 31 alternatives:
^[^WORKSHEET]{0}[^\sWORKSHEET][^WORKSHEET]{0,30}$
|^[^WORKSHEET]{1}[^\sWORKSHEET][^WORKSHEET]{0,29}$
|^[^WORKSHEET]{2}[^\sWORKSHEET][^WORKSHEET]{0,28}$
....
|^[^WORKSHEET]{30}[^\sWORKSHEET][^WORKSHEET]{0}$
Obviously if this was a good idea, you'd write a program that generates that expression rather than specifying it all by hand (and getting something wrong). But I don't think I need to tell you it's not a good idea. It is, however, the only answer I have to your question.
While admittedly not actually answering your question, I think @HatSoft has the right approach, encoding the conditions directly and clearly. After all, I'm now satisfied that an answer to your question as asked is not actually a helpful thing.
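Purely for illustration, here is roughly what generating such an alternation could look like. SheetNameRegex and Build are names made up for this sketch. It fixes the length of the leading section at k and caps the trailing section with {0,n}, so that names shorter than the maximum also match, and it includes the colon in the forbidden set.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SheetNameRegex
{
    // Forbidden characters : \ / ? * [ ] written for use inside a character class.
    private const string Forbidden = @":\\/?*\[\]";

    // Builds one alternative per possible position k of a required
    // non-whitespace character: k allowed chars, the non-blank char,
    // then at most (maxLength - 1 - k) further allowed chars.
    public static string Build(int maxLength)
    {
        string allowed = "[^" + Forbidden + "]";
        string nonBlank = @"[^\s" + Forbidden + "]";

        var alternatives = Enumerable.Range(0, maxLength)
            .Select(k => allowed + "{" + k + "}" + nonBlank
                         + allowed + "{0," + (maxLength - 1 - k) + "}");

        return "^(?:" + string.Join("|", alternatives.ToArray()) + ")$";
    }
}
```

Regex.IsMatch(name, SheetNameRegex.Build(31)) then accepts "Sheet1" and " a " but rejects "", "   ", "a:b" and anything longer than 31 characters. The direct checks are still far easier to read.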

You might want to do a check for the name History as this is a reserved sheet name in Excel.

Something like that?
public string validate(string name)
{
foreach (char c in Path.GetInvalidFileNameChars())
name = name.Replace(c.ToString(), "");
if (name.Length > 31)
name = name.Substring(0, 31);
return name;
}

Regex C# for ~ delimited text file

I am trying to match 190 in the following ~ delimited text file
GPSE~21~ADVANCED PAVING~P.O. BOX 12847~Ogden~UT~84201~190~12/5/2008~OVER 60~2/3/2009~112458~12/5/2008~12/5/2008~5176~WESTERN GAS PROCESSOR, GRANGER~MOUNTAIN GAS PLANT~GRANGER~WY~82934~7533~TESORO REFINING~474 WEST 900 NORTH~SALT LAKE CITY~UT~841031494~BUT~Freight~5000~0.0577~288.5~360.63
GPSE~21~ADVANCED PAVING~P.O. BOX 12847~Ogden~UT~84201~190~12/5/2008~OVER 60~2/3/2009~~12/5/2008~12/5/2008~~~~~~~~~~~~~~FUEL SURCHARGE~288.5~0.25~72.13~360.63
There are basically 2 lines with the number 190. I want to use regex to match "190". I am new to regex and I don't know how to match this. Can anyone help me create a regular expression to match "190" in both lines? Thanks.

Since you essentially only need to get the 8th field, you won't need regular expressions at all.
This little snippet should do the trick (nicely wrapped in a method for easy usage - I even did the error handling part for you):
public string GetInvoiceNumber(string line)
{
if(line == null)
{
throw new ArgumentNullException("line");
}
var res = line.Split('~');
if(res.Length < 8)
{
throw new ArgumentException("The given line of text does not contain an invoice number!", "line");
}
return res[7];
}

A regex to match "190" between ~ symbols would be:
/~190~/
If you're trying to match the eighth field in your ~ delimited file, split on ~, then take the eighth field. In Perl, for example:
my @fields = split /~/, $string;
my $wanted = $fields[7];
Your question is fairly ambiguous as to what you're actually trying to do.

EDIT: Oops! I just realized you want the regex in C#. Ignore this answer in that case.
One solution uses a Perl regular expression. It matches any run of characters other than '~' followed by a '~', and repeats that seven times. After that, it captures all characters up to the next '~' (which is the eighth field of your file). The parentheses save that content in the variable '$1'.
/(?:[^~]*~){7}([^~]*)/
A test:
Content of script.pl
use warnings;
use strict;
while ( <DATA> ) {
print qq[$1\n] if m/(?:[^~]*~){7}([^~]*)/;
}
__DATA__
GPSE~21~ADVANCED PAVING~P.O. BOX 12847~Ogden~UT~84201~190~12/5/2008~OVER 60~2/3/2009~112458~12/5/2008~12/5/2008~5176~WESTERN GAS PROCESSOR, GRANGER~MOUNTAIN GAS PLANT~GRANGER~WY~82934~7533~TESORO REFINING~474 WEST 900 NORTH~SALT LAKE CITY~UT~841031494~BUT~Freight~5000~0.0577~288.5~360.63
GPSE~21~ADVANCED PAVING~P.O. BOX 12847~Ogden~UT~84201~190~12/5/2008~OVER 60~2/3/2009~~12/5/2008~12/5/2008~~~~~~~~~~~~~~FUEL SURCHARGE~288.5~0.25~72.13~360.63
Running the script:
perl script.pl
And result:
190
190
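Since the question asks for C#, the same expression could be used there roughly as follows. This is a sketch: TildeFields and GetEighthField are hypothetical names, and the leading ^ is an addition so the count always starts at the beginning of the line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TildeFields
{
    // Skip seven "field~" groups from the start of the line,
    // then capture everything up to the next '~' (the eighth field).
    private static readonly Regex EighthField =
        new Regex(@"^(?:[^~]*~){7}([^~]*)");

    // Returns the eighth field, or null when the line has fewer than eight fields.
    public static string GetEighthField(string line)
    {
        Match m = EighthField.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For both sample lines this returns 190, the same result as splitting on '~' and taking index 7.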

c# Split sentence

Is it possible to split combined words like this into two?
e.g. "Firstname" to
"First"
"Name"
I have a bunch of properties, e.g. FirstName, LastName etc., and I need to display them on my page. That's why I need to split the property names into a more readable form.

Your aim is fuzzy.
If the properties always contain uppercase letters, you can find the positions of all uppercase letters in the word and divide it at those positions.
If uppercase letters are not guaranteed, the best way would be to create a transformation table. The table would define pairs of the initial property name and the resulting text. That gives you a simple map for the transformation.

Edit: OP specified that he needs to split property names
If you follow CamelCase naming convention for properties (i.e. "FirstName" instead of "Firstname"), you can split the words by upper case characters quite easily.
string[] SplitByCaps(string input)
{
StringBuilder output = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
if (i > 0 && Char.IsUpper(c))
output.Append(' ');
output.Append(c);
}
return output.ToString().Split(' ');
}
Original answer:
I would say, for practical purposes, it's not possible to do this for any arbitrary string.
Of course it is possible to write a program to do this, but whatever your actual needs are, that program would be overkill. There might also be libraries that already do this, but they would be so heavy that you wouldn't want to take a dependency on them.
Any program which could achieve this would have to know all words in the English language (let's not even consider multilanguage solutions). You would also require an intelligent lexical parser, because for any word, there might be more than one possible way to split it.
I suggest you look into some other way to solve your particular problem.

Unless you have a dictionary of all 'single' words the only solution I can think of is to split on upper letters:
FirstName -> First Name
The problem will still exist for UIFilter -> UI Filter.
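One way to handle acronym cases such as UIFilter is to split on a zero-width boundary instead of on every capital. The following is a sketch with made-up names (NameSplitter, SplitWords). It assumes plain ASCII letters, and it still cannot split an all-lowercase tail such as "Firstname".

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameSplitter
{
    // Boundary 1: a lowercase letter followed by an uppercase one (First|Name).
    // Boundary 2: inside a run of capitals, just before the capital that
    //             starts a normal word (UI|Filter).
    private static readonly Regex Boundary =
        new Regex(@"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    public static string[] SplitWords(string name)
    {
        return Boundary.Split(name);
    }
}
```

With this, string.Join(" ", NameSplitter.SplitWords("UIFilter")) gives "UI Filter" and "FirstName" gives "First Name".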

You can use substring to get the first 5 characters from the string. Then replace the first five characters in original string to blank.
string str = "Firstname";
string firstPart = str.Substring(0,5); // "First"
string secondPart = str.Replace(firstPart,""); // "name"
If you want to make it generic for any word to be split, then you need to have some definite criteria on which you can divide the word into parts. Without definite criteria, it is not possible to split the string as expected by you.

C# - Finding which prefix applies

just looking for a bit of help in terms of the best way to proceed with the following problem:
I have a list of a bunch of dialled numbers, don't think you need me to show you but for e.g.
006789 1234
006656 1234
006676 1234
006999 1234
007000 1234
006999 6789
Now: I also have a list of prefixes (prefix being the bit dialled first, also tells you where the call is going(important bit)). Important also - they have leading 0's and, they are of differing length.
say for e.g.
006789 = australia
006789 = russia
006656 = france
006676 = austria
0069 = brazil
00700 = china
So what I am trying to do is write a C# algorithm to find which prefix to apply.
The logic works as follows, say we have one dialled number and these prefixes
dialled number:0099876 5555 6565,
prefix1: 0099876 = Lyon (France)
prefix2: 0099 = France
Now both prefixes apply, except "the more detailed one" always wins. i.e. this call is to Lyon (France) and 0099876 should be result even though 0099 also applies.
Any help on getting me started with this algorithm would be great, because looking at it, I'm not sure if I should be comparing strings or ints! I have tried .Contains with strings, but as shown in my examples, that doesn't work if the prefix appears later in the number
i.e.
6999 6978
6978 1234
Cheers!!!

Looks like a good match for a trie to me. Given that your prefixes are guaranteed to be short, this should be nice and quick to search. You could also find all matching prefixes at the same time. The longest prefix would be the last to match in the trie and would be O(m) to find (worst case), where m is the length of the prefix.
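A minimal sketch of that idea follows. PrefixTrie, Add and FindLongest are names invented for illustration, and the dialled number is assumed to have had its spaces removed already. The lookup walks the number digit by digit and remembers the last node at which a prefix ended.

```csharp
using System;
using System.Collections.Generic;

public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Label; // non-null only where a known prefix ends
    }

    private readonly Node root = new Node();

    public void Add(string prefix, string label)
    {
        Node node = root;
        foreach (char c in prefix)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Label = label;
    }

    // Returns the label of the longest stored prefix of 'number', or null if none applies.
    public string FindLongest(string number)
    {
        Node node = root;
        string best = null;
        foreach (char c in number)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                break; // no stored prefix continues with this digit
            }
            if (node.Label != null)
            {
                best = node.Label; // a longer match replaces a shorter one
            }
        }
        return best;
    }
}
```

After adding "0099" as France and "0099876" as Lyon (France), FindLongest("009987655556565") returns Lyon (France), while FindLongest("00991234") falls back to France.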

I guess you could sort your prefixes by length (longest first).
Then when you need to process a number, you can run through the prefixes in order, and stop when yourNumber.StartsWith(prefix) is true.

Find longest. Use LINQ:
prefixes.Where(p => number.StartsWith(p)).OrderByDescending(p => p.Length).FirstOrDefault();

If you already know what prefixes you're looking for, you're better off using a HashMap (I believe it's a Dictionary in C#) to store the prefix and the country it corresponds to. Then, for every number that comes in, you can do a standard search on all the prefixes in the list. Store the ones that match in a list, and then pick the longest match.

Another approach would be to shorten the dialed number by one from the right and test if this number is within the list:
Dictionary<string, string> numbers = new Dictionary<string, string>();
//Build up the possible numbers from somewhere
numbers.Add("006789", "australia");
numbers.Add("006790", "russia");
numbers.Add("006656", "france");
numbers.Add("006676", "austria");
numbers.Add("0069", "brazil");
numbers.Add("00700", "china");
numbers.Add("0099876", "Lyon (France)");
numbers.Add("0099", "France");
//Get the dialed number from somewhere
string dialedNumber = "0099 876 1234 56";
//Remove all whitespaces (maybe minus signs, plus sign against double zero, remove brackets, etc)
string normalizedNumber = dialedNumber.Replace(" ", "");
string searchForNumber = normalizedNumber;
while (searchForNumber.Length > 0)
{
if(numbers.ContainsKey(searchForNumber))
{
Console.WriteLine("The number '{0}' is calling from {1}", dialedNumber, numbers[searchForNumber]);
return;
}
searchForNumber = searchForNumber.Remove(searchForNumber.Length - 1);
}
Console.WriteLine("The number '{0}' doesn't contain any valid prefix", dialedNumber);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.
