Parsing with Regex - c#

I've been searching for a good guide on this, but I can't quite figure out the regex syntax.
I have the following string that I need parsed:
[ 2013.11.22 22:50:30 ] System > Firstname Surname was kicked by Moderator
The variables I need to pull out should look a bit like this:
[ <yyyy>.<MM>.<dd> <hh>:<mm>:<ss> ] System > <username> was kicked by <moderatorname>
So, basically the timestamp and who was kicked by whom (alphanumeric names). Here's what's confusing me a little: both the username and the name of the moderator could potentially be in 2 or even 3 parts divided by spaces, and I guess the username could even be "was kicked", which would surely screw with the parsing.
I haven't done a lot of regex before, so I'm not that good at the syntax. Looking at a few guides, I've come this far:
string text = "[ 2013.11.22 22:50:30 ] System > Firstname Surname was kicked by Moderator";
var input = text.ToLower();
Match m = Regex.Match(input, @"(?i:\[\s)(?<year>\d{4})\.(?<month>\d{1,2})\.(?<day>\d{1,2})\s(?<hour>\d{1,2})\:(?<minute>\d{1,2})\:(?<second>\d{1,2})\s\]");
This works for parsing the timestamp, but the text part following is giving me some trouble. I'm not really sure how to approach the issue.
Any help is appreciated, thank you

Use this:
\[\s*(?<yyyy>\d+)\.(?<MM>\d+)\.(?<dd>\d+)\s+(?<hh>\d+)\:(?<mm>\d+)\:(?<ss>\d+)\s+\] System > (?<username>.+) was kicked by (?<moderatorname>\w+)
Demo here:
http://regex101.com/r/kU2xA8
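For what it's worth, here is a minimal C# sketch of how a pattern along these lines might be used. The class and method names (KickLogParser, TryParse) are made up for illustration. It also departs from the pattern above in one way: the moderator name is captured with .+ rather than \w+, on the assumption that moderator names can contain spaces as the question describes. Because the username group is greedy, the split happens at the last " was kicked by " in the line.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class KickLogParser
{
    // Date and time are captured separately; both name groups use .+ so
    // multi-word names are kept whole (an assumption about the log format).
    private static readonly Regex KickLine = new Regex(
        @"^\[\s*(?<date>\d{4}\.\d{2}\.\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})\s*\]\s+" +
        @"System > (?<username>.+) was kicked by (?<moderatorname>.+)$");

    public static bool TryParse(string line, out DateTime when, out string user, out string moderator)
    {
        when = default(DateTime);
        user = null;
        moderator = null;

        Match m = KickLine.Match(line);
        if (!m.Success)
            return false;

        // Let DateTime validate the numbers (month 13, hour 25 and so on are rejected).
        string stamp = m.Groups["date"].Value + " " + m.Groups["time"].Value;
        if (!DateTime.TryParseExact(stamp, "yyyy.MM.dd HH:mm:ss",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out when))
            return false;

        user = m.Groups["username"].Value;
        moderator = m.Groups["moderatorname"].Value;
        return true;
    }
}
```

Calling KickLogParser.TryParse on the sample line from the question should give "Firstname Surname" as the user and "Moderator" as the moderator. A name that itself contains " was kicked by " is still ambiguous, and no regex can fully resolve that.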

It may be your intention to only use regex, if so, fair enough. Otherwise may I suggest this could be simpler for the date part.
string date = "2013.11.22 22:50:30";
DateTime dateTime = DateTime.ParseExact(date, "yyyy.MM.dd HH:mm:ss", CultureInfo.InvariantCulture);
or use DateTime.Parse() if there's less certainty about the format.
I'll have a look at one big regex, but my approach would be to just pick up the usernames with regex, using something like this:
System > (((?!System|\swas).)+)\swas (whoops, I'm picking up additional things)
and
(?<=kicked by).*


Formatting dashes in string interpolation

I have just been checking out the new string interpolation feature in C# 6.0 (refer to the Language Features page at Roslyn for further detail). With the current syntax (which is expected to change), you can do something like this (example taken from a blog post I'm writing just now):
var dob2 = "Customer \{customer.IdNo} was born on \{customer.DateOfBirth:yyyyMdd}";
However, I can't seem to include dashes in the formatting part, such as:
var dob2 = "Customer \{customer.IdNo} was born on \{customer.DateOfBirth:yyyy-M-dd}";
If I do that, I get the error:
Error CS1056 Unexpected character '-' StringInterpolation Program.cs 21
Is there any way I can get dashes to work in the formatting part? I know I can just use string.Format(), but I want to see if it can be done with string interpolation, just as an exercise.
Edit: since it seems like nobody knows what I'm talking about, see my blog post on the subject to see how it's supposed to work.
The final version is more user friendly:
var text = $"The time is {DateTime.Now:yyyy-MM-dd HH:mm:ss}";
With the version of string interpolation that's in VS 2015 Preview, you can use characters like dashes in the interpolation format by enclosing it in another pair of quotes:
var dob2 = "Customer \{customer.IdNo} was born on \{customer.DateOfBirth : "yyyy-M-dd"}";

Parsing a text file into fields using multiple delimiter types

I'm attempting to parse log files from a chat using C#. The problem I'm running into is that the format isn't really designed for parsing, as it doesn't use standard delimiters. Here's an example of a typical line from the file:
2010-08-09 02:07:54 [Message] Skylar Morris -> (ATL)City Waterfront: I'll be right back
date time messageType userName -> roomName: message
The fields I'd like to store are:
Date and Time joined as a DateTime type
messageType
userName
roomName
message
If it was separable by a standard delimiter like space, tab, or comma it would be fairly simple but I'm at a loss on how to attack this.
As a follow up, using this code as a template:
List<String> fileContents = new List<String>();
string input = @"2010-08-09 02:07:54 [Message] Skylar Morris -> (ATL)City Waterfront: I'll be right back";
string pattern = @"(.*)\[(.*)\](.*)->(.+?):(.*)";
foreach (string result in Regex.Split(input, pattern))
{
fileContents.Add(result.Trim());
}
I'm getting 7 elements (one empty before and after) instead of the 5 that are expected. How can I rectify this?
// requires: using System.Linq;
foreach (string result in Regex.Split(input, pattern)
    .Where(s => !string.IsNullOrEmpty(s)))
{
    fileContents.Add(result.Trim());
}
Ok, managed to resolve it with the above code.
You know that old adage about "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."?
Well, in this case, you really do need regular expressions.
This one should cover you in this case:
([\d]{4}-[\d]{2}-[\d]{2} [\d]{2}:[\d]{2}:[\d]{2}) \[([\w]+)\] ([a-zA-Z0-9 ]+) -> (\([\w]+\)[a-zA-Z0-9 ]+): (.*)
You should really test it though. I just threw this together and it may not handle everything you could see.
Try this:
.*\[(.*)\](.*)->(.+?):(.*)
It uses the fact that the message type is in square brackets [],
the name is between [] and ->,
the room name is between -> and :,
and the message is everything afterwards. :)
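As a side note on the 7 elements: Regex.Split includes the captured groups in its result, and when the pattern matches the whole line the text before and after the match is empty, which is where the two extra entries come from. A sketch that sidesteps this by using Regex.Match with named groups is below. The ChatLine and ChatLogParser names are invented for the example, and the pattern assumes the exact spacing of the sample line (single spaces around "->" and after the colon).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed line.
public sealed class ChatLine
{
    public DateTime Timestamp { get; set; }
    public string MessageType { get; set; }
    public string UserName { get; set; }
    public string RoomName { get; set; }
    public string Message { get; set; }
}

public static class ChatLogParser
{
    // user and room are lazy so they stop at the first " -> " and the first ": ".
    private static readonly Regex LinePattern = new Regex(
        @"^(?<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<type>[^\]]+)\] " +
        @"(?<user>.+?) -> (?<room>.+?): (?<message>.*)$");

    // Returns null when the line does not have the expected shape.
    public static ChatLine Parse(string line)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success)
            return null;

        DateTime stamp;
        if (!DateTime.TryParseExact(m.Groups["stamp"].Value, "yyyy-MM-dd HH:mm:ss",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out stamp))
            return null;

        return new ChatLine
        {
            Timestamp = stamp,
            MessageType = m.Groups["type"].Value,
            UserName = m.Groups["user"].Value,
            RoomName = m.Groups["room"].Value,
            Message = m.Groups["message"].Value
        };
    }
}
```

One limitation of this sketch: a room name that itself contains ": " would be cut short, since the room group stops at the first colon followed by a space.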

Regex for Date in format dd.MM.yyyy?

Trying to create a regex for Date in this format dd.MM.yyyy.
I want to use it with a DataAnnotation, like this
[RegularExpression(@"<theregex>")]
public DateTime Date { get; set; }
I suggest that regex is not exactly the best way to do this. You haven't given the context, so it's hard to guess what you're doing overall... but you might want to create a DateExpression attribute or such and then do:
return DateTime.ParseExact(value, "dd.MM.yyyy", CultureInfo.InvariantCulture);
in wherever your converter is defined.
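A rough sketch of what the parsing side of that could look like, using TryParseExact so that bad input yields false rather than an exception. DottedDate is a made-up helper name, and how it gets wired into an attribute or converter is left open since the question doesn't give that context.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of any library.
public static class DottedDate
{
    // True only for real calendar dates written as dd.MM.yyyy.
    public static bool TryParse(string value, out DateTime date)
    {
        return DateTime.TryParseExact(
            value,
            "dd.MM.yyyy",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out date);
    }
}
```

Unlike a regex, this rejects impossible dates such as 31.02.2012 while accepting 29.02.2012.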
^(0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)[0-9]{2}$
This regex matches 01.01.1900, 01.01.2000 but doesn't match 1.1.2000 or 1/1/00.
[0-3]{0,1}[0-9]\.[0-1]{0,1}[0-9]\.[0-9]{2,4}
matches:
28.2.96, 1.11.2008 and 12.10.2005
Just because I always find this site useful, here is an online regex checker. On the right hand side it also has examples and community contributions. In there are a number of Date matching regex variations.
You can type in a load of dates you wish to match and try some of the different examples if you're not sure which is best for you. As with anything, there are multiple ways to solve the problem, but it might help you choose the one that fits best.
^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$
This one also accepts - and / as separators. You can remove them if you want.

C# Extracting a name from a string

I want to extract 'James\, Brown' from the string below, but I don't always know what the name will be. The comma is causing me some difficulty, so what would you suggest to extract James\, Brown?
OU=James\, Brown,OU=Test,DC=Internal,DC=Net
Thanks
A regex is likely your best approach
static string ParseName(string arg) {
var regex = new Regex(@"^OU=([a-zA-Z\\]+\,\s+[a-zA-Z\\]+)\,.*$");
var match = regex.Match(arg);
return match.Groups[1].Value;
}
You can use a regex:
string input = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
Match m = Regex.Match(input, "^OU=(.*?),OU=.*$");
Console.WriteLine(m.Groups[1].Value);
A quite brittle way to do this might be...
string name = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string[] splitUp = name.Split("=".ToCharArray(),3);
string namePart = splitUp[1].Replace(",OU","");
Console.WriteLine(namePart);
I wouldn't necessarily advocate this method, but I've just come back from a departmental Christmas lunch and my brain is not fully engaged yet.
I'd start off with a regex to split up the groups:
Regex rx = new Regex(@"(?<!\\),");
String test = "OU=James\\, Brown,OU=Test,DC=Internal,DC=Net";
String[] segments = rx.Split(test);
But from there I would split up the parameters in the array manually, so that you don't have to use a regex that depends on more than the separator character used. Since this looks like an LDAP query, it might not matter if you always look at segments[0], but there is a chance that the name might be set as "CN=". You can cover both cases by just reading the query like this:
String name = segments[0].Split(new[] { '=' }, 2)[1];
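Put together, the two steps might look like the sketch below. DnParser and its method names are invented for the example. It only handles the simple case of a comma escaped with a single backslash; a backslash that is itself escaped right before a separator comma would confuse the lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DnParser
{
    // Matches commas that are NOT preceded by a backslash.
    private static readonly Regex UnescapedComma = new Regex(@"(?<!\\),");

    // Value of the first component (OU=, CN=, ...) with escapes left in place.
    public static string FirstValue(string dn)
    {
        string[] segments = UnescapedComma.Split(dn);
        string[] pair = segments[0].Split(new[] { '=' }, 2);
        return pair.Length == 2 ? pair[1] : string.Empty;
    }

    // Same value with the escaped comma turned back into a plain comma.
    public static string FirstValueUnescaped(string dn)
    {
        return FirstValue(dn).Replace(@"\,", ",");
    }
}
```

For the string in the question, FirstValue should return James\, Brown and FirstValueUnescaped should return James, Brown.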
That looks suspiciously like an LDAP or Active Directory distinguished name formatted according to RFC 2253/4514.
Unless you're working with well known names and/or are okay with a fragile hackaround (like the regex solutions) - then you should start by reading the spec.
If you, like me, generally hate implementing code according to RFCs - then hope this guy did a better job following the spec than you would. At least he claims to be 2253 compliant.
If the slash is always there, I would look at potentially using RegEx to do the match, you can use a match group for the last and first names.
^OU=([a-zA-Z]+)\\,\s([a-zA-Z]+)
That RegEx will match names that consist of letters only; you will need to refine it a bit for better matching of non-standard names. Here is a RegEx tester to help you along the way if you go this route.
Replace \, with your own preferred magic string (perhaps &#44;), split on the remaining commas or search until the first comma, then replace your magic string with a single comma.
i.e. something like:
string originalStr = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string replacedStr = originalStr.Replace(@"\,", "&#44;");
string name = replacedStr.Substring(0, replacedStr.IndexOf(","));
Console.WriteLine(name.Replace("&#44;", ","));
Assuming you're running in Windows, use PInvoke with DsUnquoteRdnValueW. For code, see my answer to another question: https://stackoverflow.com/a/11091804/628981
If the format is always the same:
string line = GetStringFromWherever();
int start = line.IndexOf("=") + 1;//+1 to get start of name
int end = line.IndexOf("OU=",start) -1; //-1 to remove comma
string name = line.Substring(start, end - start);
Forgive if syntax is not quite right - from memory. Obviously this is not very robust and fails if the format ever changes.

Phone Number Formatting, OnBlur

I have a .NET WinForms textbox for a phone number field. After allowing free-form text, I'd like to format the text as a "more readable" phone number after the user leaves the textbox. (Outlook has this feature for phone fields when you create/edit a contact)
1234567 becomes 123-4567
1234567890 becomes (123) 456-7890
(123)456.7890 becomes (123) 456-7890
123.4567x123 becomes 123-4567 x123
etc
A fairly simple-minded approach would be to use a regular expression. Depending on which type of phone numbers you're accepting, you could write a regular expression that looks for the digits (for US-only, you know there can be 7 or 10 total - maybe with a leading '1') and potential separators between them (period, dash, parens, spaces, etc.).
Once you run the match against the regex, you'll need to write the logic to determine what you actually got and format it from there.
EDIT: Just wanted to add a very basic example (by no means is this going to work for all of the examples you posted above). Geoff's suggestion of stripping non-numeric characters might help out a bit depending on how you write your regex.
Regex regex = new Regex(@"(?<areaCode>([\d]{3}))?[\s.-]?(?<leadingThree>([\d]{3}))[\s.-]?(?<lastFour>([\d]{4}))[x]?(?<extension>[\d]{1,})?");
string phoneNumber = "701 123-4567x324";
Match phoneNumberMatch = regex.Match(phoneNumber);
if(phoneNumberMatch.Success)
{
if (phoneNumberMatch.Groups["areaCode"].Success)
{
Console.WriteLine(phoneNumberMatch.Groups["areaCode"].Value);
}
if (phoneNumberMatch.Groups["leadingThree"].Success)
{
Console.WriteLine(phoneNumberMatch.Groups["leadingThree"].Value);
}
if (phoneNumberMatch.Groups["lastFour"].Success)
{
Console.WriteLine(phoneNumberMatch.Groups["lastFour"].Value);
}
if (phoneNumberMatch.Groups["extension"].Success)
{
Console.WriteLine(phoneNumberMatch.Groups["extension"].Value);
}
}
I think the easiest thing to do is to first strip any non-numeric characters from the string so that you just have a number then format as mentioned in this question
I thought about stripping any non-numeric characters and then formatting, but I don't think that works so well for the extension case (123.4567x123)
Lop off the extension then strip the non-numeric character from the remainder. Format it then add the extension back on.
Start: 123.4567x123
Lop: 123.4567
Strip: 1234567
Format: 123-4567
Add: 123-4567 x123
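The steps above can be sketched roughly as follows. PhoneFormatter is a made-up name, and the sketch assumes US-style numbers of 7 or 10 digits with an optional extension introduced by 'x' or 'X'. Anything else is returned unchanged rather than guessed at.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PhoneFormatter
{
    public static string Format(string input)
    {
        // Lop: everything after the first 'x' or 'X' is treated as the extension.
        string main = input;
        string ext = string.Empty;
        int x = input.IndexOfAny(new[] { 'x', 'X' });
        if (x >= 0)
        {
            main = input.Substring(0, x);
            ext = Regex.Replace(input.Substring(x + 1), @"\D", "");
        }

        // Strip: keep only the digits of the main number.
        string digits = Regex.Replace(main, @"\D", "");

        // Format: only 7 and 10 digit numbers are handled in this sketch.
        string formatted;
        if (digits.Length == 7)
            formatted = digits.Substring(0, 3) + "-" + digits.Substring(3);
        else if (digits.Length == 10)
            formatted = "(" + digits.Substring(0, 3) + ") " + digits.Substring(3, 3) + "-" + digits.Substring(6);
        else
            return input; // unknown shape: leave the user's text alone

        // Add: put the extension back on, if there was one.
        return ext.Length > 0 ? formatted + " x" + ext : formatted;
    }
}
```

In a WinForms app this could be called from the textbox's Leave event handler, assigning the result back to the Text property.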
I don't know of any way other than doing it yourself by possibly making some masks and checking which one it matches and doing each mask on a case by case basis. Don't think it'd be too hard, just time consuming.
My guess is that you could accomplish this with a conditional statement to look at the input and then parse it into a specific format. But I'm guessing there is going to be a good amount of logic to investigate the input and format the output.
This works for me. Worth checking performance if you are doing this in a tight loop...
public static string FormatPhoneNumber(string phone)
{
phone = Regex.Replace(phone, @"[^\d]", "");
if (phone.Length == 10)
return Regex.Replace(phone,
"(?<ac>\\d{3})(?<pref>\\d{3})(?<num>\\d{4})",
"(${ac}) ${pref}-${num}");
else if ((phone.Length < 16) && (phone.Length > 10))
return Regex.Replace(phone,
"(?<ac>\\d{3})(?<pref>\\d{3})(?<num>\\d{4})(?<ext>\\d{1,5})",
"(${ac}) ${pref}-${num} x${ext}");
else
return string.Empty;
}
