Splitting Street names with c#/regex

I'm currently fighting with regex for the first time.
My goal is to split multiple street names separated by "/". There are some special cases to note: there may be whitespace before and after the "/", and the slash must be followed by a letter rather than a number (because house numbers are sometimes separated by a slash too).
I have nearly achieved my goal. The string currently splits as I wish when it contains only two street names, but with three street names I have a problem: it splits only once.
My two current regex attempts look like this:
/.([A-Za-z]+).*? (works great, but only with two streets; additional ones are ignored)
/.([A-Za-z]+).* (works with multiple streets, but stops after a whitespace inside a street name)
To make it clearer, I attached some screenshots:
In the first screenshot, the string is split the way I want, but only once, and the third street is ignored.
In the second screenshot, I added a "?" at the end of the regex. Now the third street is considered, but the second street is cut off after a whitespace.
I hope you can help me.

using System.Text.RegularExpressions;
using static System.Console;

int z = 0;
string[] arr = new[]
{
    "Street name 1 / Street name 2 / Street name 3",
    "Street name 1 /Street name 2",
    "Street name 1 / 2"
};
// Split on "/" with optional surrounding whitespace, but only when a letter follows.
string pattern = @"(?i)\s*/\s*(?=[a-z])";
foreach (var x in arr)
{
    WriteLine($"Record {++z}");
    var streets = Regex.Split(x, pattern);
    foreach (var street in streets)
    {
        WriteLine("\t" + street);
    }
}
/*
Output:
Record 1
    Street name 1
    Street name 2
    Street name 3
Record 2
    Street name 1
    Street name 2
Record 3
    Street name 1 / 2
*/
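
The question also mentions house numbers that are themselves separated by a slash. As a minimal sketch (the sample address is made up for illustration), the same pattern leaves such a slash alone because the lookahead (?=[a-z]) requires a letter right after the separator:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above: a slash with optional surrounding whitespace,
// matched only when the next character is a letter.
const string pattern = @"(?i)\s*/\s*(?=[a-z])";

// Hypothetical sample: "5/7" is a house number, the second slash separates two streets.
string input = "Main St 5/7 / Oak Ave";
string[] streets = Regex.Split(input, pattern);

foreach (string street in streets)
{
    Console.WriteLine(street);
}
// Prints:
// Main St 5/7
// Oak Ave
```

A street name that starts with a digit (for example "5th Avenue") would not be split off by this pattern, so it only fits data where street names begin with a letter.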

Related

How to split a street address after the first number?

I'm not sure how to solve this but I need to split a string into 2 parts. Take the string below for example:
North Street 57A 1floor
I need to split this into 2 parts.
Part 1 "North Street 57" and part 2 "A 1floor"
But if the address is just "North Street 57" then I don't need to split the string at all, so the key here is to identify if the first occurrence of street number is only digits or a combination of digits and characters (57A)
I have a lot of different address names so the text can vary. Can this be achieved?

If you always want to split after the first occurrence of a number, you can use a regular expression for that.
Here's a full example:
using System;
using System.Text.RegularExpressions;

string input = "North Street 57A 1floor";
var regex = new Regex(@"(?<=\d)(?=\D)");
var parts = regex.Split(input, 2);
foreach (var part in parts)
    Console.WriteLine(part);
Output:
North Street 57
A 1floor
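The pattern (?<=\d)(?=\D) matches the empty position between a digit and the non-digit character that follows it, that is, right after a run of digits. Regex.Split(string input, int count) with count = 2 then ensures that at most two parts are returned, so only the first such position is used. For an input such as "North Street 57", which ends in digits, there is no such position and the string is returned unsplit.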

Split by period ignore few cases

I am splitting a string on a period followed by a space (". "), but I want to skip the split when the period belongs to one of a few patterns such as MR. , JR. , Dr. or [one letter].
The pattern list is static and case insensitive.
Examples:
1) My Name is MR. ABC and working for XYZ.
Output: No split. Just one line
2) My Name is Mr. ABC. I work for XYZ.
Output: string[0] = My Name is Mr. ABC.
string[1] = I work for XYZ.
3) My Name is ABC. I work for XYZ.
Output: string[0] = My Name is ABC.
string[1] = I work for XYZ.
4) My Name is MR. ABC Jr. DEF. I work for XYZ.
Output: string[0] = My Name is MR. ABC Jr. DEF. (MR. and Jr. are ignored cases)
string[1] = I work for XYZ.

Using sln's regex pattern here's a mock up of how it should work
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

List<string> ignores = new List<string>() { "MR", "MS", "MRS", "DR", "PROF" };
// Anchor each abbreviation at a word boundary so that, e.g., "GMR" is not treated as "MR".
ignores = ignores.Select(x => @"\b" + x).ToList();
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
foreach (char letter in alphabet.ToCharArray())
{
    ignores.Add(@"\b" + letter);
}
string test = "This is a test for Prof. Plum. Here is a test for Ms. White. This is A. Test. Welcome to GMR. Next Line.";
string regexPattern = $@"(?<!{string.Join("|", ignores)})\.\s";
string[] results = Regex.Split(test, regexPattern, RegexOptions.IgnoreCase);
For this test string, results holds the five sentences (though you need to re-add the . to the end of all but the last value).
Edited to add all single character ignores
Edited to only account for whole words on ignore list
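
One way to avoid re-adding the period is to split on the whitespace only and check the period in a lookbehind. The following is a minimal sketch under that assumption, not part of the original answer; it reuses the answer's ignore list and test string, and replaces the alphabet loop with the character class [A-Z]:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Abbreviations that must not end a sentence (taken from the answer above),
// plus any single letter, each anchored to a word start with \b.
string[] ignores = { "MR", "MS", "MRS", "DR", "PROF", "[A-Z]" };
string ignored = string.Join("|", ignores.Select(x => @"\b" + x));

// Split on the whitespace that follows a period, so the period stays in the sentence.
// The inner negative lookbehind rejects periods that directly follow an ignored word.
string pattern = $@"(?<=(?<!{ignored})\.)\s";

string test = "This is a test for Prof. Plum. Here is a test for Ms. White. This is A. Test. Welcome to GMR. Next Line.";
string[] results = Regex.Split(test, pattern, RegexOptions.IgnoreCase);

foreach (string sentence in results)
{
    Console.WriteLine(sentence);
}
// Prints:
// This is a test for Prof. Plum.
// Here is a test for Ms. White.
// This is A. Test.
// Welcome to GMR.
// Next Line.
```

This relies on .NET allowing lookbehinds of arbitrary length, including nested ones; other regex engines may reject the pattern.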

C# Regex Match Address - Mark groups optional

I need to parse a German address that I get in one string like "Example Street 5b". I want to split it in groups: Street, Number and Additional Information.
For example: address = Test Str. 5b
-> Street: "Test Str." Number: "5", Add.: "b"
My code looks like that:
string street = "";
string number = "";
string addition = "";
//this works:
string address = "Test Str. 5b";
//this doesn't match, but I want it in the street group:
//string address = "Test Str.";
Match adressMatch = Regex.Match(address, @"(?<street>.*?\.*)\s*(?<number>[1-9][0-9]*)\s*(?<addition>.*)");
street = adressMatch.Groups["street"].Value;
number = adressMatch.Groups["number"].Value;
addition = adressMatch.Groups["addition"].Value;
That code works well for the example and most other cases.
My problem:
If the address does not contain a number, the match fails. I tried adding *? after the number group and several other things, but then the whole string was parsed into "addition", while "street" and "number" remained empty. If the number is missing, I want the string to be parsed into "street" and "number", and "addition" should remain empty.
Thanks in advance :)

I would do it like this: I'd match the street into the street group, then match the number - if any - into the number group, and then the rest into the addition group.
Then, if the number group does not succeed, the addition value should be moved to the number group, which can be done easily within C# code.
So, use
(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)
Note the changes: the first .*? is turned greedy, the * quantifier after \. is removed, and the number group is made optional together with the \s* in front of it.
Then, use this logic (C# sample snippet):
using System;
using System.Text.RegularExpressions;

string street = "";
string number = "";
string addition = "";
//string address = "Test Str. 5b"; // => Test Str. | 5 | b
string address = "Test Str. b"; // => Test Str. | b |
Match adressMatch = Regex.Match(address, @"(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)");
if (adressMatch.Success) {
    street = adressMatch.Groups["street"].Value;
    addition = adressMatch.Groups["addition"].Value;
    if (adressMatch.Groups["number"].Success)
        number = adressMatch.Groups["number"].Value;
    else
    {
        // No digits found: treat the remainder as the number and leave the addition empty.
        number = adressMatch.Groups["addition"].Value;
        addition = string.Empty;
    }
}
Console.WriteLine("Street: {0}\nNumber: {1}\nAddition: {2}", street, number, addition);

How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on the platforms to get the current location and then ascertain the current address. The address is returned in string format with line breaks.
The address can look like this:
111 Mandurah Tce
Mandurah WA 6210
Australia
or
The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia
I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works)
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null);
List<string> splitList = new List<string>(splitLine);
string streetAddress = addyList.ElementAt (count - 3).ToString ();
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);
I would like to extract the street number. In the previous examples this would be easy, but what is the best way to do it, taking into account that the number might be Lot 111 (only the 111 needs to be captured, not the word "Lot"), 123A or 8/123? Sometimes something like 111-113 is also returned.
I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?

These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:
PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031
and that is just to get the number. You will also have to deal with all the possible types of streets such as Lane, Road, Place, Av, Parkway.
Then there are street types such as:
12 Grand Ridge Road suburb_name
This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.
I have done a lot of work in this area and found that, because of the huge number of valid address patterns, simple regexes didn't solve the problem on large amounts of data.
I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses, so it should work well for you.

A regex can capture parts of a match into groups. Each pair of parentheses () defines a group.
([^\d]*)(\d*)(.*)
For "Lot 222 Mandurah Tce" this returns the following groups:
Group 0: "Lot 222 Mandurah Tce" (the entire match, here the whole input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"
Explanation:
[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.
string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value; // --> "222"
string afterNumber = match.Groups[3].Value; // --> " Mandurah Tce"
If a group matches nothing, match.Groups[i].Value is an empty string ("") for that group.
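
To see how this pattern behaves on the other formats from the question, here is a small sketch. The sample lines are made up for illustration, built from the formats the question lists:

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above: non-digits, then digits, then the rest.
const string pattern = @"([^\d]*)(\d*)(.*)";

// Hypothetical sample lines covering the formats mentioned in the question.
string[] samples =
{
    "Lot 111 Mandurah Tce",
    "123A Mandurah Tce",
    "8/123 Mandurah Tce",
    "111-113 Mandurah Tce",
    "The Glades"
};

foreach (string sample in samples)
{
    Match match = Regex.Match(sample, pattern);
    // Groups[2] holds only the first run of digits; it is "" when there is none.
    Console.WriteLine($"\"{sample}\" -> number \"{match.Groups[2].Value}\", rest \"{match.Groups[3].Value}\"");
}
```

Only the first run of digits lands in group 2, so "8/123" yields "8" and "111-113" yields "111", with the remainder in group 3. Whether that is the desired number depends on how unit numbers and ranges should be treated.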

You could check, for each entry in splitLine, whether it starts with a digit.
// Assumption: streetAddress is the street line extracted in the question's code.
string[] splitLine = streetAddress.Split(' ');
var streetNumber = string.Empty;
foreach (var s in splitLine)
{
    // Take the first entry that starts with a digit
    if (Regex.IsMatch(s, @"^\d"))
    {
        streetNumber = s;
        break;
    }
}
// An empty streetNumber (no entry found) has to be handled separately
Console.WriteLine("My streetnumber is " + streetNumber);

Yeah, I think you have to identify what will work.
If:
it is always in the address line and it must always start with a digit
nothing else in that line can start with a digit (or, if something else does, you know the order in which they come; the code below will always work if the street number is always first)
you want every contiguous non-whitespace character following the digit (the - and / examples suggest that to me)
Then it could be as simple as:
var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);
You would have to sift through your data and see whether any of those assumptions are broken.
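
As a rough sketch of how that regex could be wrapped (the helper name ExtractStreetNumber and the sample lines are assumptions for illustration), note that the match can include the leading whitespace character, so the result is trimmed:

```csharp
using System;
using System.Text.RegularExpressions;

// First token that starts with a digit, preceded by whitespace or the line start.
var numberRegex = new Regex(@"(?:\s|^)\d[^\s]*");

// Hypothetical helper: returns the street number token, or "" when none is found.
string ExtractStreetNumber(string addressLine)
{
    Match match = numberRegex.Match(addressLine);
    // The match may include the leading whitespace character, so trim it.
    return match.Success ? match.Value.Trim() : string.Empty;
}

Console.WriteLine(ExtractStreetNumber("Lot 111 Mandurah Tce"));   // 111
Console.WriteLine(ExtractStreetNumber("8/123 Mandurah Tce"));     // 8/123
Console.WriteLine(ExtractStreetNumber("111-113 Mandurah Tce"));   // 111-113
Console.WriteLine(ExtractStreetNumber("123A Mandurah Tce"));      // 123A
```

Unlike a digits-only group, this keeps the whole token ("8/123", "111-113", "123A"), so any further breakdown into unit, range or letter suffix is left to the caller.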

Regex pattern for house letter

I'm looking for a regex pattern that finds a house letter.
Examples (looking for the letter d):
1. Streetname 3d, 7000 Town Country.
2. Streetname 3 d, 7000 Town Country.
3. Streetname 13d, 7000 Town Country.
4. Streetname 13 d, 7000 Town Country.
I'm writing in C#.

Some combination of:
const string address = "Streetname 3d, 7000 Town Country";
string streetPart = address.Split(',')[0];
char letter = streetPart[streetPart.Length - 1];
bool isLetter = char.IsLetter(letter);
Debug.WriteLine("{0}, isLetter: {1}", letter, isLetter);
will probably work...
Outputs: d, isLetter: True

I think this pattern works in your 4 cases.
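I haven't tested the code, so just try it and tell me.
// Assumption: address is a string[] of input lines and houseLetter a string[] of the same length.
string sPattern = "[a-zA-Z 0-9]*([a-zA-Z]),.*";
int i = 0;
foreach (string s in address)
{
    Match m = Regex.Match(s, sPattern);
    if (m.Success) {
        // Group 1 is the single letter directly before the comma.
        houseLetter[i] = m.Groups[1].Value;
    } else {
        houseLetter[i] = "Not Found";
    }
    i++;
}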

If you are thinking that there is a regex that will universally solve this problem, stop thinking it. With your scheme my parents' street address would be
Ioakim 3rd 4242, 7000 Town Country // "Ioakim 3rd" is the street name
As you see, you are definitely going to have some percentage of wrong results. Are your four examples the only cases where you need guaranteed correct results?
