How to pull text from this string of data - c#

I need to pull the city and state out of strings of data that look as follows:
8 mi SSW of Newtown, PA
10 mi SE of Milwaukee, WI
29 Miles E of Orlando, FL
As of right now I am passing each string individually into a method
string statusLocation = "8 mi SSW of Newtown, PA";
etc. one at a time.
What would be the best way to search this string for the city state? I was thinking either regex or substring and index of the comma etc. I wasn’t quite sure what kind of issues I would run into if a state is 3 characters or a city has a comma in it because this is Canada data as well and I am not sure how they abbreviate stuff.

You could split on spaces and cap the number of parts at five:
string str = "8 mi SSW of Newtown, PA";
var parts = str.Split(new[] {' '}, 5);
parts then looks like this: { "8", "mi", "SSW", "of", "Newtown, PA" }, and you can access the "Newtown, PA" easily with parts[4].

You could use this regular expression:
of (.*), ([a-zA-Z]{2})$
That will capture everything after the "of", up to a comma that is followed by a space, then two letters, then the end of the line. For example:
var regex = new Regex("of (.*), ([a-zA-Z]{2})$");
var strings = new[]
{
    "8 mi SSW of Newtown, PA",
    "10 mi SE of Milwaukee, WI",
    "29 Miles E of Orlando, FL"
};
foreach (var str in strings)
{
    var match = regex.Match(str);
    var city = match.Groups[1];
    var state = match.Groups[2];
    Console.Out.WriteLine("state = {0}", state);
    Console.Out.WriteLine("city = {0}", city);
}
This of course assumes some consistency with the data, like the state being two letters.
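The question also asks about region codes longer than two letters and city names that contain a comma. Here is a minimal sketch that avoids a fixed-length state, assuming the region is whatever follows the last comma and the city starts after the first " of ". The names LocationParser and TryParse are made up for illustration and do not come from the answers above.

```csharp
using System;

public static class LocationParser
{
    // Splits "8 mi SSW of Newtown, PA" into city and region.
    // Assumption: the region follows the LAST comma and the city follows the FIRST " of ".
    public static bool TryParse(string input, out string city, out string region)
    {
        city = string.Empty;
        region = string.Empty;
        if (string.IsNullOrWhiteSpace(input)) return false;

        int ofIndex = input.IndexOf(" of ", StringComparison.OrdinalIgnoreCase);
        int commaIndex = input.LastIndexOf(',');
        int cityStart = ofIndex + 4; // skip past " of "
        if (ofIndex < 0 || commaIndex < cityStart) return false;

        city = input.Substring(cityStart, commaIndex - cityStart).Trim();
        region = input.Substring(commaIndex + 1).Trim();
        return city.Length > 0 && region.Length > 0;
    }
}
```

Because it splits on the last comma, a city such as "Foo, Bar" stays intact, and the region can be any length. Whether that matches the real Canadian data is something you would have to check against actual samples.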

Related

Split by period ignore few cases

I am splitting a string on a period followed by a space (". "), but I want to skip the split when the period belongs to one of a few patterns such as MR., JR., Dr., or a single letter ([oneletter].).
The pattern list is static and case insensitive.
Examples:
1) My Name is MR. ABC and working for XYZ.
Output: No split. Just one line
2) My Name is Mr. ABC. I work for XYZ.
Output: string[0] = My Name is Mr. ABC.
string[1] = I work for XYZ.
3) My Name is ABC. I work for XYZ.
Output: string[0] = My Name is ABC.
string[1] = I work for XYZ.
4) My Name is MR. ABC Jr. DEF. I work for XYZ.
Output: string[0] = My Name is MR. ABC Jr. DEF. (MR. and Jr. are ignoring cases )
string[1] = I work for XYZ.
Using sln's regex pattern here's a mock up of how it should work
List<string> ignores = new List<string>(){ "MR", "MS", "MRS", "DR", "PROF" };
ignores = ignores.Select(x => @"\b" + x).ToList();
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
foreach (char letter in alphabet.ToCharArray())
{
    ignores.Add(@"\b" + letter);
}
string test = "This is a test for Prof. Plum. Here is a test for Ms. White. This is A. Test. Welcome to GMR. Next Line.";
string regexPattern = $@"(?<!{string.Join("|", ignores)})\.\s";
string[] results = Regex.Split(test, regexPattern, RegexOptions.IgnoreCase);
results holds the five sentences, split after "Plum", "White", "Test" and "GMR" (though you need to re-add the . to the end of all but the last value)
Edited to add all single character ignores
Edited to only account for whole words on ignore list
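Re-adding the period after the split can be folded into a small helper. This is only a sketch of the same negative-lookbehind idea: SentenceSplitter and its Split method are hypothetical names, and the abbreviation list is whatever you pass in (the test below adds "JR" because the question's examples use it).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    public static List<string> Split(string text, IEnumerable<string> abbreviations)
    {
        // Each abbreviation must start at a word boundary, so "GMR." still ends a sentence.
        var ignores = abbreviations.Select(a => @"\b" + Regex.Escape(a)).ToList();
        ignores.Add(@"\b[A-Z]"); // any single letter, e.g. "A."

        string pattern = @"(?<!" + string.Join("|", ignores) + @")\.\s";
        string[] parts = Regex.Split(text, pattern, RegexOptions.IgnoreCase);

        var sentences = new List<string>();
        for (int i = 0; i < parts.Length; i++)
        {
            // Regex.Split consumed the ". " separator, so restore the period
            // on every piece except the last one.
            sentences.Add(i < parts.Length - 1 ? parts[i] + "." : parts[i]);
        }
        return sentences;
    }
}
```

Called with "My Name is MR. ABC Jr. DEF. I work for XYZ." and the list MR, MS, MRS, DR, PROF, JR, it gives back the two strings from example 4 of the question.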

How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on the platforms to get the current location and then ascertain the current address. The address is returned in string format with line breaks.
The address can look like this:
111 Mandurah Tce
Mandurah WA 6210
Australia
or
The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia
I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works)
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null);
List<string> splitList = new List<string>(splitLine);
string streetAddress = addyList.ElementAt (count - 3).ToString ();
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);
I would like to extract the street number. In the previous examples this would be easy, but what is the best way to do it given that the number might be Lot 111 (only the 111 needs capturing, not the word Lot), 123A or 8/123, and that sometimes something like 111-113 is also returned?
I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?
These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:
PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031
and that is just to get the number. You will also have to deal with all the possible types of streets such as Lane, Road, Place, Av, Parkway.
Then there are street types such as:
12 Grand Ridge Road suburb_name
This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.
I have done a lot of work in this area and found that the huge number of valid address patterns meant simple regexes didn't solve the problem on large amounts of data.
I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses so should work well for you.
Regex can capture the parts of a match into groups. Each pair of parentheses () defines a group.
([^\d]*)(\d*)(.*)
For "Lot 222 Mandurah Tce" this returns the following groups
Group 0: "Lot 222 Mandurah Tce" (the input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"
Explanation:
[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.
string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value; // --> "222"
string afterNumber = match.Groups[3].Value; // --> " Mandurah Tce"
If a group finds no match, match.Groups[i].Value will be an empty string ("") for that group.
You could check whether each entry of the split address line starts with a number.
string[] splitLine = addressLine.Split(' '); // addressLine: the line that holds the street address
var streetNumber = string.Empty;
foreach (var s in splitLine)
{
    // Take the first entry that starts with a digit
    if (Regex.IsMatch(s, @"^\d"))
    {
        streetNumber = s;
        break;
    }
}
// Deal with an empty value another way
Console.WriteLine("My streetnumber is " + streetNumber);
Yea I think you have to identify what will work.
If:
it is always in the address line and it must always start with a Digit
nothing else in that line can start with a digit (or if something else does you know which always comes in what order, ie the code below will always work if the street number is always first)
you want every contiguous character after the digit that isn't whitespace (the - and / examples suggest that to me)
Then it could be as simple as:
var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);
You would sort of have to sift and see if any of those assumptions are broken.
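To see how those assumptions hold up against the formats in the question (Lot 111, 123A, 8/123, 111-113), here is a small sketch. StreetNumberExtractor is a hypothetical helper, not a built-in type, and it uses a lookbehind instead of (?:\s|^) so the match does not include the leading space.

```csharp
using System.Text.RegularExpressions;

public static class StreetNumberExtractor
{
    // First whitespace-delimited token that starts with a digit.
    // (?<!\S) means "not preceded by a non-space", i.e. start of line or after whitespace.
    private static readonly Regex NumberToken = new Regex(@"(?<!\S)\d\S*");

    // Returns the token, or an empty string when no token starts with a digit.
    public static string Extract(string addressLine)
    {
        Match match = NumberToken.Match(addressLine ?? string.Empty);
        return match.Success ? match.Value : string.Empty;
    }
}
```

It returns "111" for "Lot 111 Mandurah Tce" and "8/123" for "8/123 Mandurah Tce", but it would also return "123" for "PO Box 123", so it is only as good as the assumption that the first digit-led token is the street number.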

Match the pattern in the string using c#

Let's say my texts are:
New York, NY is where I live.
Boston, MA is where I live.
Kentwood in the Pines, CA is where I live.
How do I extract just "New York", "Boston", "Kentwood in the Pines".
I can extract the state name with the pattern @"\b,\s(?<state>\w\w)\s\w+\s\w+\s\w\s\w+"
I am using regular expression but I'm not able to figure out how to extract city names as city names can be more than two words or three.
Just substring from the beginning of the string to the first comma:
var city = input.Substring(0, input.IndexOf(','));
This will work if your format is always [City], [State] is where I live. and [City] never contains a comma.
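One thing to watch with the one-liner: IndexOf returns -1 when the input has no comma, and Substring(0, -1) throws. A minimal guarded sketch follows (CityExtractor is a made-up name, and returning an empty string for bad input is just one possible choice):

```csharp
using System;

public static class CityExtractor
{
    // Returns everything before the first comma, or an empty string
    // when the input is null or contains no comma.
    public static string ExtractCity(string input)
    {
        if (input == null) return string.Empty;
        int commaIndex = input.IndexOf(',');
        // IndexOf returns -1 when there is no comma; Substring(0, -1) would throw.
        return commaIndex < 0 ? string.Empty : input.Substring(0, commaIndex).Trim();
    }
}
```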
This is what you need:
static void Main(string[] args)
{
    string exp = "New York, NY is where I live. Boston, MA is where I live. Kentwood in the Pines, CA is where I live.";
    // + rather than * so that no empty matches are returned at each comma
    string reg = @"[\w\s]+(?=,)";
    var matches = Regex.Matches(exp, reg);
    foreach (Match m in matches)
    {
        // Trim the leading space left over from the previous sentence
        Console.WriteLine(m.ToString().Trim());
    }
    Console.ReadLine();
}

How to split a string into 2 strings using RegEx?

I need a regEx to retrieve the street of a string and the streetnumber. Let's consider that the streetname starts from the beginning until there is a whitespace followed by a number
example:
Original string: 'Jan van Rijswijcklaan 123'
Result should be: 'Jan van Rijswijcklaan' as the streetname and '123' as the streetnumber.
any help is appreciated.
UPDATE
I was able to get the streetname and number, but sometimes I had streetnumbers like '123b a1', and then the code failed to pick out the streetnumber: the result was only 'a1' instead of '123b a1'.
So at the moment I'm dealing with 2 scenarios:
When streetname contains only alphabetic characters and number contains only digits - like 'Jan van Rijswijcklaan 123'
When streetname contains only alphabetic characters and number contains alphanumeric characters - like 'Jan van Rijswijcklaan 123b a1'
Here is the code I tried:
string street = Regex.Match(streetWithNum, @"^[^0-9]*").Value + ";";
string number = Regex.Match(streetWithNum, @"\w\d*\w?\d*$").Value + ";";
Use a positive lookahead pattern to find the split position:
var s = "Jan van Rijswijcklaan 124";
var result = Regex.Split(s, @"\s(?=\d)");
Console.WriteLine("street name: {0}", result[0]);
Console.WriteLine("street number: {0}", result[1]);
prints:
street name: Jan van Rijswijcklaan
street number: 124
note: use Int32.TryParse to convert street number from string to int, if you need to
I'm not a fan of regex, do you notice that?
IEnumerable<string> nameParts = "Jan van Rijswijcklaan 124".Split()
.TakeWhile(word => !word.All(Char.IsDigit));
string name = string.Join(" ", nameParts);
If you want to take both, the street-name and the number:
string[] words = "Jan van Rijswijcklaan 124".Split();
var streetNamePart = words.TakeWhile(w => !w.All(Char.IsDigit));
var streetNumberPart = words.SkipWhile(w => !w.All(Char.IsDigit));
Console.WriteLine("street-name: {0}", string.Join(" ", streetNamePart));
Console.WriteLine("street-number: {0}", string.Join(" ", streetNumberPart));
Then just use the answer of @Ilya_Ivanov with the lookahead:
var result = Regex.Split(s, @"\s(?=\d)");
Here is a non-regex solution as well:
string str = "Jan van Rijswijcklaan 124";
var array = str.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
string streetname = "";
string streetnumber = "";
foreach (var item in array)
{
    // Words starting with a digit go to the number, everything else to the name
    if (Char.IsNumber(item[0]))
        streetnumber += item;
    else
        streetname += item + " ";
}
Console.WriteLine(streetname.TrimEnd());
Console.WriteLine(streetnumber);
Output will be:
Jan van Rijswijcklaan
124
this should do:
Regex r = new Regex(@"(.+?) (\d+)$");
Match m = r.Match("Jan van Rijswijcklaan 124");
String street = m.Groups[1].Value;
String number = m.Groups[2].Value;
I typed this from memory, don't blame me for typos :)
edit: the '$' at the end of the regex string makes sure the number match occurs only at the end of the input string.
edit 2: just removed the typos and tested the code, it works now.
edit 3: the expression can be read as: Collect as many characters as you can get into group 1 without being greedy (.+?) but leave a sequence of digits at the end of the string, after a whitespace, for group 2 (\d+)$
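For the second scenario in the question's update ('123b a1'), a pattern that ends in (\d+)$ will not match, because the number part is no longer digits only. A hedged sketch that covers both scenarios, assuming the street name never contains a digit and the number part begins at the first digit; StreetSplitter and TrySplit are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class StreetSplitter
{
    // street: shortest run of non-digits, number: from the first digit to the end.
    private static readonly Regex Pattern =
        new Regex(@"^(?<street>\D+?)\s+(?<number>\d.*)$");

    public static bool TrySplit(string input, out string street, out string number)
    {
        Match m = Pattern.Match((input ?? string.Empty).Trim());
        street = m.Success ? m.Groups["street"].Value : string.Empty;
        number = m.Success ? m.Groups["number"].Value : string.Empty;
        return m.Success;
    }
}
```

With 'Jan van Rijswijcklaan 123b a1' this gives 'Jan van Rijswijcklaan' and '123b a1'. Street names that themselves contain digits would break the assumption.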

Parsing Google calendar to DDay.iCal

I'm working on an application that parses Google Calendar data, fetched via the Google API, into DDay.iCal.
The main attributes and properties are handled easily, e.g. ev.Summary = evt.Title.Text;
The problem is that when I get a recurring event, the XML contains a field like:
<gd:recurrence>
DTSTART;VALUE=DATE:20100916
DTEND;VALUE=DATE:20100917
RRULE:FREQ=YEARLY
</gd:recurrence>
or
<gd:recurrence>
DTSTART:20100915T220000Z
DTEND:20100916T220000Z
RRULE:FREQ=YEARLY;BYMONTH=9;WKST=SU"
</gd:recurrence>
using the following code:
String[] lines =
    evt.Recurrence.Value.Split(new char[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
foreach (String line in lines)
{
    if (line.StartsWith("R"))
    {
        RecurrencePattern rp = new RecurrencePattern(line);
        ev.RecurrenceRules.Add(rp);
    }
    else
    {
        ISerializationContext ctx = new SerializationContext();
        ISerializerFactory factory = new DDay.iCal.Serialization.iCalendar.SerializerFactory();
        ICalendarProperty property = new CalendarProperty();
        IStringSerializer serializer = factory.Build(property.GetType(), ctx) as IStringSerializer;
        property = (ICalendarProperty)serializer.Deserialize(new StringReader(line));
        ev.Properties.Add(property);
        Console.Out.WriteLine(property.Name + " - " + property.Value);
    }
}
RRULEs are parsed correctly, but the problem is that other property (datetimes) values are empty...
Here is the starting point of what I'm doing, going off of the RFC-5545 spec's recurrence rule. It isn't complete to the spec and may break given certain input, but it should get you going. I think this should all be doable using RegEx, and something as heavy as a recursive descent parser would be overkill.
RRULE:(?:FREQ=(DAILY|WEEKLY|SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?
I am building this up using http://regexstorm.net/tester.
The test input I'm using is:
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101
Sample matching results would look like:
Index  Position  Matched String                                                $1      $2  $3  $4  $5
0      90        RRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101          WEEKLY      8   FR  20141101
1      236       RRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101  WEEKLY  5   8   FR  20141101
2      390       RRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101                     WEEKLY          FR  20141101
Usage is like:
string freqPattern = @"RRULE:(?:FREQ=(DAILY|WEEKLY|SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?";
MatchCollection mc = Regex.Matches(rule, freqPattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
    string frequency = m.Groups[1].ToString();
    string count = m.Groups[2].ToString();
    string interval = m.Groups[3].ToString();
    string byday = m.Groups[4].ToString();
    string until = m.Groups[5].ToString();
    System.Console.WriteLine("recurrence => frequency: \"{0}\", count: \"{1}\", interval: \"{2}\", byday: \"{3}\", until: \"{4}\"", frequency, count, interval, byday, until);
}
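The pattern above expects the parts in a fixed order (FREQ, COUNT, INTERVAL, BYDAY, UNTIL), so a rule such as RRULE:FREQ=YEARLY;BYMONTH=9;WKST=SU is only partly captured. If order independence matters, one alternative is to skip the regex and split the rule on ';' and '='. This is a rough sketch rather than a spec-complete parser; RRuleParser is a hypothetical name and it does no validation of the values.

```csharp
using System;
using System.Collections.Generic;

public static class RRuleParser
{
    // Turns "RRULE:FREQ=WEEKLY;INTERVAL=8" into { FREQ: WEEKLY, INTERVAL: 8 }.
    // Returns an empty dictionary when the line is not an RRULE line.
    public static Dictionary<string, string> Parse(string line)
    {
        var parts = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        const string prefix = "RRULE:";
        if (line == null || !line.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return parts;

        string[] pairs = line.Substring(prefix.Length)
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string pair in pairs)
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip malformed entries that have no name
            parts[pair.Substring(0, eq).Trim()] = pair.Substring(eq + 1).Trim();
        }
        return parts;
    }
}
```

Parse("RRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101")["FREQ"] then yields "WEEKLY", and keys that are absent from a rule are simply missing from the dictionary.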
This is a great example of when to use regular expressions. Try this out for general parsing:
\s*(\w+):((\w+=\w+;)+(\w+=\w+)?|\w+)
Or, you might decide to have something more schema-specific.
\s*(?:DTSTART:)(?'Start'\w+)
\s*(?:DTEND:)(?'End'\w+)
\s*(?:RRULE:)(?'Rule'(\w+=\w+;)+(\w+=\w+)?)
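For what it's worth, here is how the schema-specific patterns might be wired up and read back by group name. RecurrenceFields is a made-up wrapper. Note the limits of the patterns as written: the DTSTART one does not match the DTSTART;VALUE=DATE:... form, and the RRULE one needs at least one ';' in the rule, so in those cases the sketch just returns an empty string.

```csharp
using System.Text.RegularExpressions;

public static class RecurrenceFields
{
    public static string Start(string text) =>
        Capture(text, @"\s*(?:DTSTART:)(?'Start'\w+)", "Start");

    public static string End(string text) =>
        Capture(text, @"\s*(?:DTEND:)(?'End'\w+)", "End");

    public static string Rule(string text) =>
        Capture(text, @"\s*(?:RRULE:)(?'Rule'(\w+=\w+;)+(\w+=\w+)?)", "Rule");

    // Returns the named group's value, or an empty string when the pattern does not match.
    private static string Capture(string text, string pattern, string group)
    {
        Match m = Regex.Match(text ?? string.Empty, pattern);
        return m.Success ? m.Groups[group].Value : string.Empty;
    }
}
```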
