How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on each platform to get the current location and then ascertain the current address. The address is returned as a string with line breaks.
The address can look like this:
111 Mandurah Tce
Mandurah WA 6210
Australia
or
The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia
I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works)
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null);
List<string> splitList = new List<string>(splitLine);
string streetAddress = addyList.ElementAt (count - 3).ToString ();
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);
I would like to extract the street number. In the examples above this would be easy, but what is the best way to do it given that the number might be "Lot 111" (I only need to capture the 111, not the word "Lot"), 123A, or 8/123? Sometimes something like 111-113 is also returned.
I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?

These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:
PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031
and that is just to get the number. You will also have to deal with all the possible types of streets such as Lane, Road, Place, Av, Parkway.
Then there are street types such as:
12 Grand Ridge Road suburb_name
This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.
I have done a lot of work in this area and found that the huge number of valid address patterns meant simple regexes didn't solve the problem on large amounts of data.
I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses, so it should work well for you.

Regex can capture the parts of a match into groups. Each pair of parentheses () defines a group.
([^\d]*)(\d*)(.*)
For "Lot 222 Mandurah Tce" this returns the following groups
Group 0: "Lot 222 Mandurah Tce" (the whole match, which here is the entire input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"
Explanation:
[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.
string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value; // --> "222"
string afterNumber = match.Groups[3].Value; // --> " Mandurah Tce"
If a group finds no match, match.Groups[i].Value will be an empty string ("") for that group.
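To see how this three-group pattern behaves on the other formats from the question, here is a minimal sketch with named groups instead of indexes. The class name StreetLine, the group names, and the added ^ and $ anchors are illustrative choices, not part of the answer above.

```csharp
using System.Text.RegularExpressions;

public static class StreetLine
{
    // Same three-part split as above, with named groups and explicit anchors.
    private static readonly Regex Parts =
        new Regex(@"^(?<before>\D*)(?<number>\d*)(?<after>.*)$");

    // Returns the text before the first digit run, the digit run itself,
    // and everything after it. Missing parts come back as "".
    public static (string Before, string Number, string After) Split(string line)
    {
        Match m = Parts.Match(line);
        return (m.Groups["before"].Value.Trim(),
                m.Groups["number"].Value,
                m.Groups["after"].Value.Trim());
    }
}
```

StreetLine.Split("Lot 222 Mandurah Tce") gives ("Lot", "222", "Mandurah Tce"). For "8/123 Mandurah Tce" the number part is only "8" and the rest is "/123 Mandurah Tce", so unit-style and range-style numbers would need extra handling.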

You could check whether each entry in splitLine starts with a digit.
// addressLine is the street line, e.g. "Lot 111 Mandurah Tce"
string[] splitLine = addressLine.Split(' ');
var streetNumber = string.Empty;
foreach (var s in splitLine)
{
    // Take the first entry that starts with a digit
    if (Regex.IsMatch(s, @"^\d"))
    {
        streetNumber = s;
        break;
    }
}
// streetNumber is still empty here if no entry started with a digit
Console.WriteLine("My streetnumber is " + streetNumber);

Yes, I think you have to identify which assumptions will hold.
If:
the number is always in the address line and always starts with a digit
nothing else in that line can start with a digit (or, if something else does, you know the order they come in; the code below will always work if the street number is always first)
you want every contiguous non-whitespace character after the digit (the - and / examples suggest that to me)
Then it could be as simple as:
var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);
// mtch.Value can begin with the whitespace matched by \s, so Trim() it before use
You would have to sift through your data and see if any of those assumptions are broken.
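Under those three assumptions, a minimal sketch could look like the following. The class name StreetNumber is hypothetical, and the pattern swaps the (?:\s|^) prefix for a (?<!\S) lookbehind so that the match does not include a leading space.

```csharp
using System.Text.RegularExpressions;

public static class StreetNumber
{
    // A digit at the start of a whitespace-delimited token,
    // followed by the rest of that token.
    private static readonly Regex Token = new Regex(@"(?<!\S)\d\S*");

    // Returns the first token that starts with a digit, or "" if there is none.
    public static string Extract(string addressLine)
    {
        Match m = Token.Match(addressLine);
        return m.Success ? m.Value : string.Empty;
    }
}
```

With the formats from the question, "Lot 111 Mandurah Tce" yields "111", "123A Mandurah Tce" yields "123A", "8/123 Mandurah Tce" yields "8/123" and "111-113 Mandurah Tce" yields "111-113". A line such as "PO Box 123" would yield "123", which is one of the broken-assumption cases to watch for.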

Related

How to extract the first 3 free standing characters from a string?

I have a program that needs to parse town names. Sometimes the user enters the correct town name but often the users enter the post code as the town name.
In case I cannot match the town name with a valid town name, I am assuming that the input contains the post code. The first 3 free standing characters of the post code uniquely identify the town.
Post codes have this format 3 letters followed by 3 digits, e.g. ABC123.
However some users enter the digits before the letters and some users combine the town name and the post code, e.g.
123ABC
Pretty city ABC123
How do I extract the first 3 free standing characters?
Free standing = to the left and right of the 3 characters are no other characters.
For the below strings ABC are the first 3 free standing characters.
ABC123
123ABC
ABC 123
123 ABC
123 ABC 456
ABC12DEF
123 ABC DEF
DE 123 ABC
Pretty city ABC123
These next strings do not have 3 free standing characters.
123ABCDEF
ABCD123
123ABCD
123 ABCD
Somename1234
1234Somename
Case is irrelevant.
Here are my attempts
Using regex. Does not work for "Pretty City ABC123"
Regex rgx = new Regex("[a-zA-Z]{3}");
string hamster = "ABC123";
var code = rgx.Match(hamster);
Awkward function
private static string GetCode(string pig)
{
    var code = "";
    var canstart = true;
    for (int i = 0; i < pig.Length; i++)
    {
        //Console.WriteLine(code);
        if (code.Length == 3)
        {
            if (char.IsLetter(pig[i]))
            {
                canstart = false;
                code = "";
            }
            else
            {
                break;
            }
        }
        if (char.IsLetter(pig[i]) && canstart)
        {
            code += pig[i];
        }
        else if (!char.IsLetter(pig[i]) && !canstart)
        {
            canstart = true;
        }
    }
    if (code.Length != 3)
    {
        code = "";
    }
    return code;
}

You can use
(?<![a-zA-Z])[a-zA-Z]{3}(?![a-zA-Z])
Details:
(?<![a-zA-Z]) - a negative lookbehind that matches a location that is not immediately preceded with an ASCII letter
[a-zA-Z]{3} - three ASCII letters
(?![a-zA-Z]) - a negative lookahead that matches a location that is not immediately followed with an ASCII letter.
In C#:
var rgx = new Regex(@"(?<![a-zA-Z])[a-zA-Z]{3}(?![a-zA-Z])");
var hamster = "ABC123";
var code = rgx.Match(hamster)?.Value;
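As a quick check against the sample strings in the question, the pattern can be wrapped in a small helper. This is only a sketch: the class name PostCodePrefix and the upper-casing of the result are assumptions, not part of the answer.

```csharp
using System.Text.RegularExpressions;

public static class PostCodePrefix
{
    // Exactly three ASCII letters with no letter directly before or after.
    private static readonly Regex ThreeLetters =
        new Regex(@"(?<![a-zA-Z])[a-zA-Z]{3}(?![a-zA-Z])");

    // Returns the first free-standing run of three letters, upper-cased,
    // or "" when the input contains none.
    public static string Extract(string input)
    {
        Match m = ThreeLetters.Match(input);
        return m.Success ? m.Value.ToUpperInvariant() : string.Empty;
    }
}
```

PostCodePrefix.Extract returns "ABC" for "Pretty city ABC123" and "DE 123 ABC", and "" for "123ABCDEF" and "Somename1234". Note that a town name containing a three-letter word (for example "The big ABC123") would return that word first, so the town-name lookup still has to run before this fallback.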

How to split a street address after the first number?

I'm not sure how to solve this but I need to split a string into 2 parts. Take the string below for example:
North Street 57A 1floor
I need to split this into 2 parts.
Part 1 "North Street 57" and part 2 "A 1floor"
But if the address is just "North Street 57" then I don't need to split the string at all, so the key here is to identify if the first occurrence of street number is only digits or a combination of digits and characters (57A)
I have a lot of different address names so the text can vary. Can this be achieved?

If you always want to split after the first occurrence of a number, you may use Regular Expression for that.
Here's a full example:
string input = "North Street 57A 1floor";
var regex = new Regex(@"(?<=\d)(?=\D)");
var parts = regex.Split(input, 2);
foreach (var part in parts)
Console.WriteLine(part);
Output:
North Street 57
A 1floor
The pattern (?<=\d)(?=\D) gets the position after a string of digits. Then, we use Regex.Split(string input, int count) where count=2 to ensure that it returns two parts at maximum.
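The question also asks that "North Street 57" is left unsplit. A small sketch of how the same pattern behaves in that case, wrapped in a hypothetical helper (the names AddressSplitter and SplitAfterFirstNumber are illustrative):

```csharp
using System.Text.RegularExpressions;

public static class AddressSplitter
{
    // Zero-width position between a digit and the non-digit that follows it.
    private static readonly Regex AfterNumber = new Regex(@"(?<=\d)(?=\D)");

    // Splits once after the first run of digits. Returns a single element
    // when no digit is followed by another character.
    public static string[] SplitAfterFirstNumber(string address)
    {
        return AfterNumber.Split(address, 2);
    }
}
```

For "North Street 57" the result has one element, because the lookahead (?=\D) needs a character after the last digit. For "North Street 57 1floor" the split still happens, and the second part starts with the space, so trimming the parts may be needed.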

C# Regex Match Address - Mark groups optional

I need to parse a German address that I get in one string like "Example Street 5b". I want to split it in groups: Street, Number and Additional Information.
For example: address = Test Str. 5b
-> Street: "Test Str." Number: "5", Add.: "b"
My code looks like that:
string street = "";
string number = "";
string addition = "";
//this works:
string address = "Test Str. 5b";
//this doesn't match, but I want it in the street group:
//string address = "Test Str.";
Match adressMatch = Regex.Match(address, @"(?<street>.*?\.*)\s*(?<number>[1-9][0-9]*)\s*(?<addition>.*)");
street = adressMatch.Groups["street"].Value;
number = adressMatch.Groups["number"].Value;
addition = adressMatch.Groups["addition"].Value;
That code works well for the example and most other cases.
My problem:
If the address does not contain a number, the function fails. I tried to add *? after the number group and several other things, but then the whole string got parsed into the "addition" and "street" and "number" remain empty. But if the number is missing, I want the string to parse into "street" and "number" and "addition" shall remain empty.
Thanks in advance :)

I would do it like this: I'd match the street into the street group, then match the number - if any - into the number group, and then the rest into the addition group.
Then, if the number group does not succeed, the addition value should be moved to the number group, which can be done easily within C# code.
So, use
(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)
Note the changes: the first .*? is turned greedy, the * quantifier after \. is removed, and the number group is made optional together with the \s* in front.
Then, use this logic (C# sample snippet):
string street = "";
string number = "";
string addition = "";
//string address = "Test Str. 5b"; // => Test Str. | 5 | b
string address = "Test Str. b"; // => Test Str. | b |
Match adressMatch = Regex.Match(address, @"(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)");
if (adressMatch.Success) {
street = adressMatch.Groups["street"].Value;
addition = adressMatch.Groups["addition"].Value;
if (adressMatch.Groups["number"].Success)
number = adressMatch.Groups["number"].Value;
else
{
number = adressMatch.Groups["addition"].Value;
addition = string.Empty;
}
}
Console.WriteLine("Street: {0}\nNumber: {1}\nAddition: {2}", street, number, addition);
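The question can also be read as wanting an address without a number to land entirely in "street", with "number" and "addition" left empty. A minimal sketch under that reading follows; the class name GermanAddress and the pattern are illustrative and are not the pattern from the answer above.

```csharp
using System.Text.RegularExpressions;

public static class GermanAddress
{
    // Lazy street, then an optional "number addition" tail, anchored at both ends.
    // Without the tail, the lazy street group grows until it covers the whole input.
    private static readonly Regex Pattern = new Regex(
        @"^(?<street>.*?)(?:\s+(?<number>[1-9][0-9]*)\s*(?<addition>.*))?$");

    public static (string Street, string Number, string Addition) Parse(string address)
    {
        Match m = Pattern.Match(address.Trim());
        return (m.Groups["street"].Value,
                m.Groups["number"].Value,
                m.Groups["addition"].Value);
    }
}
```

GermanAddress.Parse("Test Str. 5b") gives ("Test Str.", "5", "b") and GermanAddress.Parse("Test Str.") gives ("Test Str.", "", ""). Street names that themselves contain a number preceded by a space would be cut at that number, so this only suits data where the first such number is the house number.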

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.

This works for your example:
var result = Regex.Replace(
    input,
    @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
    m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
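The whitespace-removing variant mentioned above could be sketched as follows. The class and method names are hypothetical. The sketch also keeps a leading space when the match begins with one, since \s? can consume the space in front of the first matched word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccountMasker
{
    // Pattern from the answer above: a run of digit-bearing words that is
    // not directly preceded by another digit-bearing word.
    private const string Pattern = @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+";

    public static string Mask(string input)
    {
        return Regex.Replace(input, Pattern, m =>
        {
            // Keep the separating space if the match swallowed one.
            string lead = char.IsWhiteSpace(m.Value[0]) ? m.Value.Substring(0, 1) : "";
            // Remove all whitespace before taking the last four characters.
            string compact = Regex.Replace(m.Value, @"\s", "");
            string tail = compact.Substring(Math.Max(0, compact.Length - 4));
            return lead + "xxxx" + tail;
        });
    }
}
```

With this version, "111 2233 33" becomes "xxxx3333" instead of "xxxx3 33".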

I don't think regex is the best way to solve this problem, which is why I am posting this answer. For situations this complex, building the corresponding regex is too difficult and, what is worse, its clarity and adaptability are much lower than those of a longer-code approach.
The code below delivers the functionality you are after; it is clear enough and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
    if (word.Any(x => char.IsDigit(x)))
    {
        // Accumulate consecutive words that contain a digit
        previousNum = true;
        tempOutput = tempOutput + word;
    }
    else
    {
        if (previousNum)
        {
            if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
            output = output + " " + tempOutput;
            previousNum = false;
            tempOutput = ""; // reset so the next run of words starts clean
        }
        output = output + " " + word;
    }
}
if (previousNum)
{
    if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
    output = output + " " + tempOutput;
    previousNum = false;
}
output = output.TrimStart(); // drop the leading separator

Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)

Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

Regex for checking for compass directions

I'm looking to match the 8 main directions as might appear in a street or location prefix or suffix, such as:
N Main
south I-22
124 Grover Ave SE
This is easy to code using a brute force list of matches and cycle through every match possibility for every street address, matching once with a start-of-string anchor and once with a end-of-string anchor. My blunt starting point is shown farther down, if you want to see it.
My question is if anyone has some clever ideas for compact, fast-executing patterns to accomplish the same thing. You can assume:
Compound directions always start with the north / south component. So I need to match South East but not EastSouth
The pattern should not match [direction]-ern words, like "Northern" or "Southwestern"
The match will always be at the very beginning or very end of the string.
I'm using C#, but I'm just looking for a pattern so I'm not emphasizing the language. /s(outh)?/ is just as good as @"s(outh)?" for me or future readers.
SO emphasizes real problems, so FYI this is one. I'm parsing a few hundred thousand nasty, unvalidated user-typed address strings. I want to check if the start or end of the "street" field (which is free-form jumble of PO boxes, streets, apartments, and straight up invalid junk) begins or ends with a compass direction. I'm trying to deconstruct these free form strings to find similar addresses which may be accidental or intentional variations and obfuscations.
My blunt attempt
Core pattern: /n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)/
In a function:
public static Tuple<Match, Match> MatchDirection(String value) {
string patternBase = @"n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)";
Match[] matches = new Match[2];
string[] compassPatterns = new[] { @"^(" + patternBase + @")\b", @"\b(" + patternBase + @")$" };
for (int i = 0; i < 2; i++) { matches[i] = Regex.Match(value, compassPatterns[i], RegexOptions.IgnoreCase); }
return new Tuple<Match, Match>(matches[0], matches[1]);
}
In use, where sourceDt is a table with all the addresses:
var parseQuery = sourceDt.AsEnumerable()
.Select((DataRow row) => {
string addr = ((string)row["ADDR_STREET"]).Trim();
Tuple<Match, Match> dirMatches = AddressParser.MatchDirection(addr);
return new string[] { addr, dirMatches.Item1.Value, dirMatches.Item2.Value };
})

Edit: Actually, this is probably the wrong answer. I am keeping it only so that people do not suggest the same thing. Figuring out tokenization for "South East" is a task in itself. I still doubt that a regex will be very usable either.
Original answer:
Don't... your initial RegExp attempt is already non-readable.
Dictionary look up for each word you want from the tokenized string ("brute force approach") already gives you linear time on length and constant time per word. And it is very easy to customize with new words.

(^[nesw][^n\s]*)|([nesw][^n\s]*$)
So this will match a line that:
begins or ends with a word that:
Begins with a cardinal direction
Doesn't have an n otherwise in it (to get rid of the "-ern"s)

Perl/PCRE compatible expression:
(?xi)
(^)?
\b
(?:
n(?:orth)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
s(?:outh)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
e(?:ast)?
|
w(?:est)?
)
\b
(?(1)|$)
I think C# supports all the features used here.
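For C# readers, here is a minimal sketch that reuses the alternation above but drops the conditional in favor of two anchored patterns, in the spirit of the question's own function. The class name CompassDirection and the method name Find are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class CompassDirection
{
    // North/south first, optionally followed by east/west; or east/west alone.
    private const string Dir =
        @"(?:(?:n(?:orth)?|s(?:outh)?)(?:\s*(?:e(?:ast)?|w(?:est)?))?|e(?:ast)?|w(?:est)?)";

    private static readonly Regex AtStart =
        new Regex("^" + Dir + @"\b", RegexOptions.IgnoreCase);

    private static readonly Regex AtEnd =
        new Regex(@"\b" + Dir + "$", RegexOptions.IgnoreCase);

    // Returns the direction found at the start and at the end of the street
    // string. Each part is "" when no direction is present there.
    public static (string Prefix, string Suffix) Find(string street)
    {
        string s = street.Trim();
        return (AtStart.Match(s).Value, AtEnd.Match(s).Value);
    }
}
```

CompassDirection.Find("N Main") gives ("N", ""), CompassDirection.Find("124 Grover Ave SE") gives ("", "SE"), and "Northern Blvd" gives ("", "") because the trailing \b rejects the "-ern" forms.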
