How to split a string into 2 strings using RegEx? - c#

I need a regex to extract the street name and the street number from a string. Assume the street name runs from the beginning of the string up to a whitespace character that is followed by a digit.
Example:
Original string: 'Jan van Rijswijcklaan 123'
Result should be: 'Jan van Rijswijcklaan' as the street name and '123' as the street number.
Any help is appreciated.
UPDATE
I was able to get the street name and number, but sometimes the street number looks like '123b a1', and then the code failed to extract it: the result for the street number was only 'a1' instead of '123b a1'.
So at the moment I'm dealing with 2 scenarios:
When the street name contains only alphabetic characters and the number contains only digits, like 'Jan van Rijswijcklaan 123'
When the street name contains only alphabetic characters and the number contains alphanumeric characters, like 'Jan van Rijswijcklaan 123b a1'
Here is the code I tried:
string street = Regex.Match(streetWithNum, @"^[^0-9]*").Value + ";";
string number = Regex.Match(streetWithNum, @"\w\d*\w?\d*$").Value + ";";

Use a positive lookahead to define the split condition (a whitespace character followed by a digit):
var s = "Jan van Rijswijcklaan 124";
var result = Regex.Split(s, @"\s(?=\d)");
Console.WriteLine("street name: {0}", result[0]);
Console.WriteLine("street number: {0}", result[1]);
prints:
street name: Jan van Rijswijcklaan
street number: 124
Note: use Int32.TryParse to convert the street number from string to int, if you need to.

I'm not a fan of regex, do you notice that?
IEnumerable<string> nameParts = "Jan van Rijswijcklaan 124".Split()
.TakeWhile(word => !word.All(Char.IsDigit));
string name = string.Join(" ", nameParts);
If you want to take both, the street-name and the number:
string[] words = "Jan van Rijswijcklaan 124".Split();
var streetNamePart = words.TakeWhile(w => !w.All(Char.IsDigit));
var streetNumberPart = words.SkipWhile(w => !w.All(Char.IsDigit));
Console.WriteLine("street-name: {0}", string.Join(" ", streetNamePart));
Console.WriteLine("street-number: {0}", string.Join(" ", streetNumberPart));

Then just adjust @Ilya_Ivanov's answer, which uses a lookahead:
var result = Regex.Split(s, @"\s(?=\d)");

Here is a non-regex solution as well:
string str = "Jan van Rijswijcklaan 124";
var array = str.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
string streetname = "";
string streetnumber = "";
foreach (var item in array)
{
    // Words that start with a digit go to the number, all others to the name.
    if (Char.IsNumber(item[0]))
        streetnumber += item;
    else
        streetname += item + " ";
}
Console.WriteLine(streetname.TrimEnd());
Console.WriteLine(streetnumber);
Output will be:
Jan van Rijswijcklaan
124

This should do:
Regex r = new Regex(@"(.+?) (\d+)$");
Match m = r.Match("Jan van Rijswijcklaan 124");
String street = m.Groups[1].Value;
String number = m.Groups[2].Value;
I typed this from memory, don't blame me for typos :)
Edit: the '$' at the end of the regex makes sure the number is matched only at the end of the input string.
Edit 2: just removed the typos and tested the code, it works now.
Edit 3: the expression can be read as: lazily collect characters into group 1 (.+?), but leave a sequence of digits at the end of the string, after a space, for group 2 (\d+)$.
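For the two scenarios in the question's update, a variation on this pattern may help. The sketch below is not the answer's code: it assumes the street name contains no digits and treats everything from the first whitespace-then-digit boundary onward as the number, so '123b a1' stays in one piece. The names StreetSplitter and TrySplit are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class StreetSplitter
{
    // Assumption: the street name contains no digits, and the number part
    // starts at the first digit that follows whitespace.
    private static readonly Regex Pattern =
        new Regex(@"^(?<street>\D+?)\s+(?<number>\d.*)$");

    public static bool TrySplit(string input, out string street, out string number)
    {
        Match m = Pattern.Match(input);
        // On failure, return the whole input as the street and no number.
        street = m.Success ? m.Groups["street"].Value : input;
        number = m.Success ? m.Groups["number"].Value.TrimEnd() : "";
        return m.Success;
    }

    static void Main()
    {
        foreach (string s in new[] { "Jan van Rijswijcklaan 123", "Jan van Rijswijcklaan 123b a1" })
        {
            TrySplit(s, out string street, out string number);
            Console.WriteLine("{0} | {1}", street, number);
        }
    }
}
```

Street names that themselves contain digits would need a different rule, so this is only a starting point.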

Related

Split by period ignore few cases

I am splitting a string on a period followed by a space (". "), but I want to skip the split when the period belongs to one of a few patterns such as MR., JR., Dr., or a single letter followed by a period.
The pattern list is static (case insensitive).
Examples:
1) My Name is MR. ABC and working for XYZ.
Output: No split. Just one line
2) My Name is Mr. ABC. I work for XYZ.
Output: string[0] = My Name is Mr. ABC.
string[1] = I work for XYZ.
3) My Name is ABC. I work for XYZ.
Output: string[0] = My Name is ABC.
string[1] = I work for XYZ.
4) My Name is MR. ABC Jr. DEF. I work for XYZ.
Output: string[0] = My Name is MR. ABC Jr. DEF. (MR. and Jr. are ignored cases)
string[1] = I work for XYZ.
Using sln's regex pattern, here's a mock-up of how it should work:
List<string> ignores = new List<string>(){ "MR", "MS", "MRS", "DR", "PROF" };
// Prefix each entry with \b so only whole words are ignored.
ignores = ignores.Select(x => @"\b" + x).ToList();
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
foreach (char letter in alphabet.ToCharArray())
{
    // Single-letter initials such as "A." are ignored too.
    ignores.Add(@"\b" + letter);
}
string test = "This is a test for Prof. Plum. Here is a test for Ms. White. This is A. Test. Welcome to GMR. Next Line.";
string regexPattern = $@"(?<!{string.Join("|", ignores)})\.\s";
string[] results = Regex.Split(test, regexPattern, RegexOptions.IgnoreCase);
The results are the individual sentences (though you need to re-add the . to the end of all but the last value).
Edited to add all single character ignores
Edited to only account for whole words on ignore list
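If re-adding the period afterwards is inconvenient, one possible variation is to split only on the whitespace and use lookbehinds for the period, so each sentence keeps its own full stop. This is a sketch under assumptions: the ignore list (including JR) and the names SentenceSplitter and Split are illustrative, not taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class SentenceSplitter
{
    // Hypothetical ignore list; [A-Z] covers single-letter initials.
    private const string Ignored = "MR|MS|MRS|DR|PROF|JR|[A-Z]";

    // Split on whitespace that follows a period, unless that period ends an
    // ignored whole word. Only the whitespace is consumed, so periods are kept.
    private static readonly Regex Boundary = new Regex(
        @"(?<!\b(?:" + Ignored + @")\.)(?<=\.)\s+",
        RegexOptions.IgnoreCase);

    public static string[] Split(string text) => Boundary.Split(text);

    static void Main()
    {
        foreach (string s in Split("My Name is MR. ABC Jr. DEF. I work for XYZ."))
            Console.WriteLine(s);
    }
}
```

Because of RegexOptions.IgnoreCase, a sentence that ends in a single lowercase letter would also be treated as an initial and not split.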

c# Regex - Only get numbers and whitespaces in one string and only text and whitespaces in another

How do I get only the numbers (including whitespace) in one string and only the text (including whitespace) in another?
I've tried this:
string value1 = "123 45 New York";
string result1 = Regex.Match(value1, @"^[\w\s]*$").Value;
string value2 = "123 45 New York";
string result2 = Regex.Match(value2, @"^[\w\s]*$").Value;
result1 needs to be "123 45"
result2 needs to be " New York"
Try the following code:
string value1 = "123 45 New York";
string digitsAndSpaces = Regex.Match(value1, @"([0-9 ]+)").Value;
string value2 = "123 45 New York";
string lettersAndSpaces = Regex.Match(value2, @"([A-Za-z ])+([A-Za-z ]+)").Value;
Update:
How do I allow characters like å ä ö in the result from value2?
string value3 = "å ä ö";
string speclettersAndSpaces = Regex.Match(value3, @"([a-zÀ-ÿ ])+([a-zÀ-ÿ ]+)").Value;
The following regex matches runs of digits with only spaces between them, and likewise runs of letters with only spaces between them.
Regex: (?:\d[0-9 ]*\d)|(?:[A-Za-z][A-Za-z ]*[A-Za-z])
Details:
(?:) Non-capturing group
\d matches a digit (equal to [0-9])
[] Match a single character present in the list
* Matches between zero and unlimited times
| or
Output:
Match 1
Full match 0-6 `123 45`
Match 2
Full match 7-15 `New York`
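In C#, the same idea could be applied with Regex.Matches. The sketch below is an assumption-based variant: it swaps [A-Za-z] for the Unicode letter class \p{L} so that å, ä and ö are accepted, and it makes the tail of each alternative optional so single digits or single letters also match. RunExtractor and Extract are illustrative names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RunExtractor
{
    // \p{L} matches any Unicode letter, so accented letters need no
    // hand-written ranges. Each run starts and ends on a non-space character.
    private static readonly Regex Runs =
        new Regex(@"\d(?:[\d ]*\d)?|\p{L}(?:[\p{L} ]*\p{L})?");

    public static string[] Extract(string input) =>
        Runs.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        foreach (string run in Extract("123 45 Malmö Östra"))
            Console.WriteLine("[{0}]", run);
    }
}
```

Unlike the expected " New York" in the question, the runs returned here carry no leading or trailing spaces.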

C# Regex Match Address - Mark groups optional

I need to parse a German address that I get in one string like "Example Street 5b". I want to split it in groups: Street, Number and Additional Information.
For example: address = Test Str. 5b
-> Street: "Test Str." Number: "5", Add.: "b"
My code looks like this:
string street = "";
string number = "";
string addition = "";
//this works:
string address = "Test Str. 5b";
//this doesn't match, but I want it in the street group:
//string address = "Test Str.";
Match adressMatch = Regex.Match(address, @"(?<street>.*?\.*)\s*(?<number>[1-9][0-9]*)\s*(?<addition>.*)");
street = adressMatch.Groups["street"].Value;
number = adressMatch.Groups["number"].Value;
addition = adressMatch.Groups["addition"].Value;
That code works well for the example and most other cases.
My problem:
If the address does not contain a number, the match fails. I tried adding *? after the number group and several other things, but then the whole string ended up in "addition" while "street" and "number" remained empty. If the number is missing, I want the string to go into "street", and "number" and "addition" should remain empty.
Thanks in advance :)
I would do it like this: I'd match the street into the street group, then match the number - if any - into the number group, and then the rest into the addition group.
Then, if the number group does not succeed, the addition value should be moved to the number group, which can be done easily within C# code.
So, use
(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)
Note the changes: the first .*? is turned greedy, the * quantifier after \. is removed, and the number group is made optional together with the \s* in front.
Then, use this logic (C# sample snippet):
string street = "";
string number = "";
string addition = "";
//string address = "Test Str. 5b"; // => Test Str. | 5 | b
string address = "Test Str. b"; // => Test Str. | b |
Match adressMatch = Regex.Match(address, @"(?<street>.*\.)(?:\s*(?<number>[1-9][0-9]*))?\s*(?<addition>.*)");
if (adressMatch.Success) {
    street = adressMatch.Groups["street"].Value;
    addition = adressMatch.Groups["addition"].Value;
    if (adressMatch.Groups["number"].Success)
        number = adressMatch.Groups["number"].Value;
    else
    {
        // No numeric part: move the remainder into the number slot.
        number = adressMatch.Groups["addition"].Value;
        addition = string.Empty;
    }
}
Console.WriteLine("Street: {0}\nNumber: {1}\nAddition: {2}", street, number, addition);
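The pattern above relies on the street ending in a period. For addresses such as "Example Street 5b", a different arrangement might be worth trying: a lazy street group followed by one optional group that holds both number and addition. This is a sketch under assumptions, not the answer's method. It leaves number and addition empty when no number is found, and the names AddressParser and Parse are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AddressParser
{
    // Assumption: the house number is the first run of digits that follows
    // whitespace and starts with 1-9; everything after it is the addition.
    private static readonly Regex Pattern = new Regex(
        @"^(?<street>.*?)(?:\s+(?<number>[1-9][0-9]*)\s*(?<addition>.*))?$");

    public static (string Street, string Number, string Addition) Parse(string address)
    {
        Match m = Pattern.Match(address.Trim());
        // Groups that did not take part in the match have an empty Value.
        return (m.Groups["street"].Value,
                m.Groups["number"].Value,
                m.Groups["addition"].Value);
    }

    static void Main()
    {
        Console.WriteLine(Parse("Test Str. 5b"));
        Console.WriteLine(Parse("Test Str."));
    }
}
```

Street names that contain a number of their own would be cut at that number, so real data may need extra rules.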

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
    input,
    @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
    m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
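The whitespace-stripping idea mentioned above could look roughly like the following. This is a sketch with a simpler, hypothetical pattern of its own (no lookbehind): it matches a run of digit-containing words, removes the whitespace inside the run, and keeps the last four characters. AccountMasker and Mask are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class AccountMasker
{
    // A word containing a digit, followed by any further such words
    // separated by whitespace. (Hypothetical pattern, not the one above.)
    private static readonly Regex DigitWords =
        new Regex(@"\b\w*\d\w*(?:\s+\w*\d\w*)*");

    public static string Mask(string input) =>
        DigitWords.Replace(input, m =>
        {
            // Remove whitespace first so the kept tail is four real characters.
            string compact = Regex.Replace(m.Value, @"\s+", "");
            return "xxxx" + compact.Substring(Math.Max(0, compact.Length - 4));
        });

    static void Main()
    {
        Console.WriteLine(Mask("this is a a1234 b5678 test string"));
        Console.WriteLine(Mask("111 2233 33"));
    }
}
```

Since the match starts at the word itself rather than at the preceding space, the space before the masked run is left in place.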
I don't think regex is the best way to solve this problem, which is why I am posting this answer. For situations this complex, building the corresponding regex is difficult and, what is worse, its clarity and adaptability are much lower than those of a longer, code-based approach.
The code below delivers the functionality you are after; it is clear enough and can be easily extended.
// Requires: using System.Linq;
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false; // true while inside a run of digit-containing words
string tempOutput = "";   // concatenation of the words in the current run
foreach (string word in temp)
{
    if (word.Any(char.IsDigit))
    {
        previousNum = true;
        tempOutput = tempOutput + word;
    }
    else
    {
        if (previousNum)
        {
            // Flush the finished run, keeping only its last 4 characters.
            if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
            output = output + " " + tempOutput;
            previousNum = false;
            tempOutput = ""; // reset so a later run starts fresh
        }
        output = output + " " + word;
    }
}
if (previousNum)
{
    // The input ended inside a run; flush it as well.
    if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
    output = output + " " + tempOutput;
    previousNum = false;
}
output = output.TrimStart(); // drop the leading space added by the first append
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
    match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
        match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

How to pull text from this string of data

I need to pull the city and state out of strings of data that look as follows:
8 mi SSW of Newtown, PA
10 mi SE of Milwaukee, WI
29 Miles E of Orlando, FL
As of right now I am passing each string individually into a method
string statusLocation = "8 mi SSW of Newtown, PA";
etc. one at a time.
What would be the best way to search this string for the city and state? I was thinking of either a regex or Substring with the index of the comma. I wasn't quite sure what kind of issues I would run into if a state is 3 characters or a city has a comma in it, because this includes Canadian data as well and I am not sure how they abbreviate things.
You could do a
string str = "8 mi SSW of Newtown, PA";
var parts = str.Split(new[] {' '}, 5);
parts then looks like this: { "8", "mi", "SSW", "of", "Newtown, PA" }, and you can access the "Newtown, PA" easily with parts[4].
You could use this regular expression:
of (.*), ([a-zA-Z]{2})$
That will capture everything after the "of", up to a comma that is followed by a space, then two letters, then the end of the line. For example:
var regex = new Regex("of (.*), ([a-zA-Z]{2})$");
var strings = new[]
{
    "8 mi SSW of Newtown, PA",
    "10 mi SE of Milwaukee, WI",
    "29 Miles E of Orlando, FL"
};
foreach (var str in strings)
{
    var match = regex.Match(str);
    var city = match.Groups[1];
    var state = match.Groups[2];
    Console.Out.WriteLine("state = {0}", state);
    Console.Out.WriteLine("city = {0}", city);
}
This of course assumes some consistency with the data, like the state being two letters.
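To loosen those assumptions a little, the pattern could accept two- or three-letter region codes and report failure instead of returning empty groups. The sketch below is only one way to do that. The 2-to-3 letter range is a guess at what the data may contain, and LocationParser and TryParse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class LocationParser
{
    // Assumptions: the location follows the first standalone "of", the region
    // code is 2 or 3 letters, and the last comma separates city from region.
    private static readonly Regex Pattern =
        new Regex(@"\bof\s+(?<city>.+),\s*(?<state>[A-Za-z]{2,3})\s*$");

    public static bool TryParse(string input, out string city, out string state)
    {
        Match m = Pattern.Match(input);
        city = m.Success ? m.Groups["city"].Value.Trim() : "";
        state = m.Success ? m.Groups["state"].Value.ToUpperInvariant() : "";
        return m.Success;
    }

    static void Main()
    {
        if (TryParse("29 Miles E of Orlando, FL", out string city, out string state))
            Console.WriteLine("city = {0}, state = {1}", city, state);
    }
}
```

Because the city group is greedy, a city name that itself contains a comma stays intact and only the final comma is treated as the separator.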
