Parse text into key/value pairs or JSON - C#

I have text in the following format, and I was wondering what the best approach might be to create a user object from it, with the fields as its properties.
I don't know regular expressions that well. I was looking at the string methods in C#, particularly IndexOf and LastIndexOf, but I think that would be too messy as there are approximately 15 fields.
I am trying to do this in C#.
Some characteristics:
The keys/fields are fixed and known beforehand, so I know that I have to look for things like title, company, etc.
The address part is single-valued, and following that there are some multi-valued fields.
A multi-valued field may or may not end with a comma (,).
There are one or two line breaks between the fields, e.g. "country" is followed by 2 line breaks before we encounter "interest".
Title: Mr
Company: abc capital
Address1: 42 mystery lane
Zip: 112312
Country: Ireland
Interest: Biking, Swimming, Hiking,
Topic of Interest: Europe, Asia, Capital

This will split the data up into key/value pairs and store them in a dictionary. You may have to modify it further to meet additional requirements.
var dictionary = data
.Split(
new[] {"\r\n"},
StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Split(':'))
.ToDictionary(
k => k[0].Trim(),
v => v[1].Trim());

I'd probably go with something like this:
private Dictionary<string, IEnumerable<string>> ParseValues(string providedValues)
{
Dictionary<string, IEnumerable<string>> parsedValues = new Dictionary<string, IEnumerable<string>>();
string[] lines = providedValues.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries); //Your newline character here might differ, being '\r', '\n', '\r\n'...
foreach (string line in lines)
{
string[] lineSplit = line.Split(':');
string key = lineSplit[0].Trim();
IEnumerable<string> values = lineSplit[1].Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries).Select(x => x.Trim()); //Removing empty entries here will ensure you don't get an empty for the "Interest" line, where you have 'Hiking' followed by a comma, followed by nothing else
parsedValues.Add(key, values);
}
return parsedValues;
}
or if you subscribe to the notion that readability and maintainability are not as cool as a great big chain of calls:
private static Dictionary<string, IEnumerable<string>> ParseValues(string providedValues)
{
return providedValues.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(x => x.Split(':')).ToDictionary(key => key[0].Trim(), value => value[1].Split(new char[]{ ','}, StringSplitOptions.RemoveEmptyEntries).Select(x => x.Trim()));
}

I strongly recommend getting more familiar with regular expressions for cases like this. Parsing semi-structured text is easy and logical with regular expressions.
For example, this one (it and the ones that follow are just variants; there are many ways to do it depending on what you need):
title:\s*(.*)\s+comp.*?:\s*(.*)\s+addr.*?:\s*(.*)\s+zip:\s*(.*)\s+country:\s*(.*)\s+inter.*?:\s*(.*)\s+topic.*?:\s*(.*)
gives result
1. Mr
2. abc capital
3. 42 mystery lane
4. 112312
5. Ireland
6. Biking, Swimming, Hiking,
7. Europe, Asia, Capital
or - more open to anything:
\s(.*?):\s(.*)
parses your input into nice groups like this:
Match 1
1. Title
2. Mr
Match 2
1. Company
2. abc capital
Match 3
1. Address1
2. 42 mystery lane
Match 4
1. Zip
2. 112312
Match 5
1. Country
2. Ireland
Match 6
1. Interest
2. Biking, Swimming, Hiking,
Match 7
1. Topic of Interest
2. Europe, Asia, Capital
I am not familiar with C# (and its dialect of regular expressions); I just wanted to awaken your interest ...
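For C#, the "open to anything" idea above could be sketched roughly as follows. This is only a sketch under assumptions, not the answerer's code: the pattern is adapted so that it is anchored per line with RegexOptions.Multiline, the names input, pattern and fields are illustrative, and the code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input from the question, including the blank line after "Country".
string input = "Title: Mr\r\nCompany: abc capital\r\nAddress1: 42 mystery lane\r\n" +
               "Zip: 112312\r\nCountry: Ireland\r\n\r\n" +
               "Interest: Biking, Swimming, Hiking,\r\nTopic of Interest: Europe, Asia, Capital";

// One match per "Key: value" line. [^:\r\n]+ keeps the key on a single line,
// and [^\r\n]* stops the value at the end of that line.
var pattern = new Regex(@"^[ \t]*(?<key>[^:\r\n]+):[ \t]*(?<value>[^\r\n]*)",
                        RegexOptions.Multiline);

var fields = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);
foreach (Match m in pattern.Matches(input))
{
    string key = m.Groups["key"].Value.Trim();
    // RemoveEmptyEntries drops the empty entry left by a trailing comma.
    string[] values = m.Groups["value"].Value
        .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < values.Length; i++)
        values[i] = values[i].Trim();
    fields[key] = values;
}

foreach (var pair in fields)
    Console.WriteLine(pair.Key + " = [" + string.Join(" | ", pair.Value) + "]");
```

Single-valued fields simply end up as one-element arrays here; a value that itself contains a comma or a colon would need extra handling.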


sorting on List<string> with middle 2 character

I would like to sort a list by the 2 middle characters. For example, the list contains the following:
body1text
body2text
body11text
body3text
body12text
body13text
if I apply list.OrderBy(r => r.body), it will sort as follows:
body1text
body11text
body12text
body13text
body2text
body3text
But I need the following result:
body1text
body2text
body3text
body11text
body12text
body13text
Is there an easy way to sort by the middle 2-digit number?
Regards
Shuvra
The issue here is that your numbers are compared as strings, so string.Compare("11", "2") will return -1 meaning that "11" is less than "2". Assuming that your string is always in format "body" + n numbers + "text" you can match numbers with regex and parse an integer from result:
new[]
{
"body1text"
,"body2text"
,"body3text"
,"body11text"
,"body12text"
,"body13text"
}
.OrderBy(s => int.Parse(Regex.Match(s, @"\d+").Value))

How can I match the given pattern using Regex in C#?

I have the following input:
-key1:"val1" -key2: "val2" -key3:(val3) -key4: "(val4)" -key5: val5 -key6: "val-6" -key-7: val7 -key-eight: "val 8"
With only the following assumption about the pattern:
Keys always start with a - followed by a value delimited by :
How can I match and extract each key and its corresponding value?
I have so far come up with the following regex:
-(?<key>\S*):\s?(?<val>\S*)
But it's currently not matching the complete value for the last argument as it contains a space but I cannot figure out how to match it.
The expected output should be:
key1 "val1"
key2 "val2"
key3 (val3)
key4 "(val4)"
key5 val5
key6 "val-6"
key-7 val7
key-eight val 8
Any help is much appreciated.
Guessing that you want to only allow whitespace characters that are not at the beginning or end, change your regex to:
-(?<key>\S*):\s?(?<val>\S+(\s*[^-\s])*)
This assumes that the character - preceded by whitespace always means a new key is beginning, so it cannot be part of any value.
For this example:
-key: value -key2: value with whitespace -key3: value-with-hyphens -key4: v
The matches are:
-key: value, -key2: value with whitespace, -key3: value-with-hyphens, -key4: v.
It also works perfectly well on your provided example.
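As a rough illustration of how that pattern could be applied in C#, here is a minimal sketch run against the input from the question. The names commandLine and pairs are illustrative, and the code assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Input string from the question.
string commandLine = "-key1:\"val1\" -key2: \"val2\" -key3:(val3) -key4: \"(val4)\" " +
                     "-key5: val5 -key6: \"val-6\" -key-7: val7 -key-eight: \"val 8\"";

// Pattern from the answer above: a value may contain whitespace as long as
// the whitespace is not followed by '-', which is taken as the start of a new key.
var pairs = new List<KeyValuePair<string, string>>();
foreach (Match m in Regex.Matches(commandLine, @"-(?<key>\S*):\s?(?<val>\S+(\s*[^-\s])*)"))
{
    pairs.Add(new KeyValuePair<string, string>(m.Groups["key"].Value, m.Groups["val"].Value));
}

foreach (var pair in pairs)
    Console.WriteLine(pair.Key + " " + pair.Value);
```

Quotation marks and parentheses are kept as part of the value; calling Trim('"') on the value would be one way to strip the quotes if they are not wanted.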
A low-tech (non-regex) solution, just as an alternative. Trim the leftovers, and call ToDictionary if you need a dictionary.
var results = input.Split(new[] { " -" }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim('-').Split(':'));
Output
key1 -> "val1"
key2 -> "val2"
key3 -> (val3)
key4 -> "(val4)"
key5 -> val5
key6 -> "val-6"
key-7 -> val7
key8 -> "val 8"
Try this regex using Replace function:
(?:^|(?!\S)\s*)-|\s*:\s*
and replace with "\n". You should get key values in separate lines.
I presume you're wanting to keep the brackets and quotation marks as that's what you're doing in the example you gave? If so then the following should work:
-(?<key>\S+):+\s?(?<val>\S+\s?\d+\)?\"?)
This does presume that all val's end with a number though.
EDIT:
Given that the val doesn't always end with a number, but I'm guessing it always starts with val, this is what I have:
-(?<key>\S+):+\s?(?<val>\"?\(?(val)+\s?\S+)
Seems to be working properly...
This should do the trick
-(?<key>\S*):\s*(?<value>(?(?=")((")(?:(?=(\\?))\2.)*?\1))(\S*))
Basically it uses an if/then/else conditional to detect whether the value starts with ", in the form (?(?=")(true regex)(false regex)). The false regex is yours, \S*, while the true regex tries to match the opening and closing quotes: (")(?:(?=(\\?))\2.)*?\1.

Trying to match multiple words multiple times, any order using regex

I'm trying to check if a text contains two or more specific words. The words can be in any order and can show up in the text multiple times, but each at least once.
If the text is a match, I will need to get the location of the words.
Let's say we have the text:
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, length 4
dollar : index 48, length 6
coke : index 84, length 4
What I have already is this (which I think is a little bit wrong, because the text should contain each word at least once, so there should be a + instead of the *):
(?:(\bcoke\b))*(?:(\bdollar\b))*
With that regex, RegexBuddy highlights all three words if I ask it to highlight group 1 and group 2.
But when I run this in C# I don't get any results.
Can you point me in the right direction?
I don't think what you want is possible using only regular expressions.
Here is a possible solution using regular expressions and linq:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(@"\b(?:"+string.Join("|", words)+@")\b", RegexOptions.IgnoreCase);
var text = @"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6
How about this, it is pret-tay bad but I think it has a shot at working and it is pure RegEx no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the \W's if you want it to capture cokeDollar or dollarCoKe etc.

How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on the platforms to get the current location and then ascertain the current address. The address is returned in string format with line breaks.
The address can look like this:
111 Mandurah Tce
Mandurah WA 6210
Australia
or
The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia
I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works)
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null);
List<string> splitList = new List<string>(splitLine);
string streetAddress = addyList.ElementAt (count - 3).ToString ();
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);
I would like to extract the street number. In the previous examples this would be easy, but what is the best way to do it, taking into account that the number might be Lot 111 (I only need to capture the 111, not the word LOT), or 123A, or 8/123? Sometimes something like 111-113 is also returned.
I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?
These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:
PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031
and that is just to get the number. You will also have to deal with all the possible types of streets such as Lane, Road, Place, Av, Parkway.
Then there are street types such as:
12 Grand Ridge Road suburb_name
This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.
I have done a lot of work in this area and found that the huge number of valid address patterns meant simple regexes didn't solve the problem on large amounts of data.
I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses so should work well for you.
Regex can capture the parts of a match into groups. Each parentheses () defines a group.
([^\d]*)(\d*)(.*)
For "Lot 222 Mandurah Tce" this returns the following groups
Group 0: "Lot 222 Mandurah Tce" (the input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"
Explanation:
[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.
string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value; // --> "222"
string afterNumber = match.Groups[3].Value; // --> " Mandurah Tce"
If a group finds no match, match.Groups[i].Value will be an empty string ("") for that group.
You could check if the content starts with a number for each entry in the splitLine.
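// Split the street address line on whitespace (streetAddress comes from the question's code)
string[] splitLine = streetAddress.Split(null);
var streetNumber = string.Empty;
foreach (var s in splitLine)
{
// Take the first token that starts with a digit
if (Regex.IsMatch(s, @"^\d"))
{
streetNumber = s;
break;
}
}
// An empty streetNumber means no token started with a digit; deal with that case another way
Console.WriteLine("My streetnumber is " + streetNumber);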
Yea I think you have to identify what will work.
If:
it is always in the address line and it must always start with a Digit
nothing else in that line can start with a digit (or if something else does you know which always comes in what order, ie the code below will always work if the street number is always first)
you want every contiguous character following the digit that isn't whitespace (the - and / examples suggest that to me)
Then it could be as simple as:
var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);
You would sort of have to sift and see if any of those assumptions are broken.
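To help with that sifting, here is a small sketch that runs the same pattern over the number formats mentioned in the question. The sample lines, and the names addressLines, streetNumberPattern and results, are made up for illustration; the code assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample lines covering the formats mentioned in the question.
string[] addressLines =
{
    "111 Mandurah Tce",
    "Lot 111 Mandurah Tce",
    "123A Mandurah Tce",
    "8/123 Mandurah Tce",
    "111-113 Mandurah Tce",
    "The Glades"
};

var streetNumberPattern = new Regex(@"(?:\s|^)\d[^\s]*");
var results = new string[addressLines.Length];
for (int i = 0; i < addressLines.Length; i++)
{
    Match m = streetNumberPattern.Match(addressLines[i]);
    // Trim removes the leading whitespace that (?:\s|^) may have consumed;
    // an empty string marks a line with no token starting with a digit.
    results[i] = m.Success ? m.Value.Trim() : string.Empty;
    Console.WriteLine(addressLines[i] + " -> \"" + results[i] + "\"");
}
```

Only the first digit-led token on each line is taken, so a line such as "Unit 5 12 Example St" would yield 5 rather than 12.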

Parse string with encapsulated int to an array that contains other numbers to be ignored

During my coding I've come across a problem that involves parsing a string like this:
{15} there are 194 red balloons, {26} there are 23 stickers, {40} there are 12 jacks, ....
My code needs to pull the sentences and the numbers into two separate arrays.
I've solved the part that parses the sentences into their own array by using *.Remove(0, 5) to strip the leading part. The problem with that approach is that I had to make sure the file was always written to a standard where the prefix is {##}. That is not as elegant as I would like: sometimes the number would be {3}, and I would be forced to write it as { 3}.
As there was also a chance of the string containing other numbers, I wasn't able to simply parse out the integers first.
int?[] array = y.Split(',')
.Select(z =>
{
int value;
return int.TryParse(z, out value) ? value : (int?)null;
})
.ToArray();
So anyway, back to the problem at hand: I need to be able to parse out each "{##}" into an array, with each number as its own element.
Here's one way to do it using positive lookaheads/lookbehinds:
string s = "{15} there are 194 red balloons, {26} there are 23 stickers, {40} there are 12 jacks";
// Match all digits preceded by "{" and followed by "}"
int[] matches = Regex.Matches(s, @"(?<={)\d+(?=})")
.Cast<Match>()
.Select(m => int.Parse(m.Value))
.ToArray();
// Yields [15, 26, 40]
