How can I Ensure that a String Matches a Certain Format? - c#

How can I check that a string matches a certain format? For example, how can I check that a string matches the format of an IP address, proxy address (or any custom format)?
I found this code but I am unable to understand what it does. Please help me understand the match string creation process.
string pattern = @"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$";
//create our Regular Expression object

Regex matching is made simple:
Regex r = new Regex(@"your_regexp");
if (r.Match(whatever).Success)
{
// Do_something
}
This code will invoke some actions if the whatever string matches the your_regexp regular expression.
So what are they, these regular expressions (also abbreviated as regex or regexp)? They're nothing but string patterns, designed to be used as filters for other strings.
Let's assume you have a lot of HTTP headers and you want to get GET moofoo HTTP/1.1 only. You may use the string.Contains(other_string) method, but regexps make this process more precise, less error-prone, flexible and handy.
A regexp consists of blocks, which may be used for replacement later. Each block defines which symbols the string can contain at some position. Blocks let you list these symbols explicitly or use patterns to ease your work.
Symbols which may or may not be at the current string position are specified as follows:
if you are sure that these symbols MUST be there, just use them "as is". In our example, this matches the word HTTP, which is always present in HTTP headers.
if you know all possible variations, use the | (logical OR) operator. Note: all variants must be enclosed by block signs, i.e. round brackets. Read below for details. In our case this one matches the word GET: this header could use GET, POST, PUT or DELETE.
if you know all possible symbol ranges, use range blocks (character classes): for example, letters could be specified as [a-z] or [a-zA-Z], and word characters as \w. (POSIX classes such as [[:alpha:]] exist in some other regex flavors but are not supported by .NET.) Square brackets are the signs of range blocks. A range block matches exactly one symbol unless it is followed by a count operator, which defines repetitions:
? (means 'may or may not be present')
+ (stands for 'once or more')
* (stands for 'zero or more')
{A,} (stands for 'A or more')
{A,B} (means 'not less than A and not greater than B times')
{,B} (stands for 'not more than B'; in .NET write this as {0,B}, because {,B} is treated as literal text)
if you know which symbol ranges must not be present, use the NOT operator (^) within the range, at the very beginning: [^a-z] matches each symbol of 132==? while [^\d] matches each symbol of abc==? (\d stands for any digit and is roughly equal to [0-9]; in .NET, \d also matches other Unicode digits unless RegexOptions.ECMAScript is used). Note: ^ is also used to mark the very beginning of the entire string if it is not used within a range block: ^moo matches moofoo and not foomoo. To finish the idea, $ matches the very end of the entire string: moo$ matches foomoo and not moofoo.
if you don't care which symbol to match, use the dot: . matches any symbol except a line break, so .* is the most commonly used pattern to match any number of any symbols.
Note: a block that you want to refer to later, or that contains | alternatives, should be enclosed by round brackets ((phrase) is a good block example).
Note: all non-standard and reserved symbols (such as the tab symbol \t, round brackets ( and ), etc.) should be escaped (i.e. written with a backslash before the symbol: \(, \t, \.) if they do not belong to any block and should be matched as-is. For example, in our case there are two escape sequences within the HTTP/1.1 block: \/ and \.. These two should be matched as-is. (In .NET the slash does not need escaping, but escaping it is harmless.)
Now let's use everything described above to create a regexp that matches our example HTTP header:
(GET|POST|PUT|DELETE) will match the HTTP method
\ (a backslash followed by a space) will match the <SP> symbol (space, as it is defined in the HTTP specification)
(\S+) will match the requested resource (moofoo in our example), i.e. any run of non-space symbols
HTTP\/ will help us to match HTTP requests only
(\d+\.\d+) will match the HTTP version (this will match not only 1.1, but 12.34 too)
^ and $ will be our string border-limiters
Gathering all these statements together gives us this regexp: ^(GET|POST|PUT|DELETE)\ (\S+)\ HTTP\/(\d+\.\d+)$.
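To see the pieces working together in C#, here is a minimal sketch. It assumes the request line has a resource between the method and the HTTP version, as in GET moofoo HTTP/1.1, so it includes a (\S+) block for it. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RequestLineDemo
{
    // Groups: 1 = method, 2 = requested resource (assumed: no spaces), 3 = version.
    private static readonly Regex RequestLine =
        new Regex(@"^(GET|POST|PUT|DELETE) (\S+) HTTP/(\d+\.\d+)$");

    // Returns true and fills the out parameters when the line matches.
    public static bool TryParse(string line, out string method, out string target, out string version)
    {
        Match m = RequestLine.Match(line);
        method = m.Success ? m.Groups[1].Value : null;
        target = m.Success ? m.Groups[2].Value : null;
        version = m.Success ? m.Groups[3].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        if (TryParse("GET moofoo HTTP/1.1", out var method, out var target, out var version))
        {
            Console.WriteLine($"{method} {target} {version}");
        }
    }
}
```

Each pair of round brackets becomes a numbered group, which is how the individual parts can be read back after the match.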

Regular Expressions is what you use to perform a lookup on a string. A pattern is defined and you use this pattern to work out the matches for your expression. This is best seen by example.
Here is a sample set of code I wrote last year for checking if an entered string is a valid frequency of Hz, KHz, MHz, GHz or THz.
Understanding regular expressions will come from trial and error. Read up on the regular expressions documentation here: http://msdn.microsoft.com/en-us/library/2k3te2cs(v=vs.80).aspx
The expression below took me about 6 hours to get working, due to misunderstanding what certain terms meant, and where I needed brackets etc. But once I had this one cracked the other 6 were very simple.
/// <summary>
/// Checks the given string against a regular expression to see
/// if it is a valid hertz measurement, which can be used
/// by this formatter.
/// </summary>
/// <param name="value">The string value to be tested</param>
/// <returns>Returns true, if it is a valid hertz value</returns>
private Boolean IsValidValue(String value)
{
    //Regular Expression Explanation
    //
    //Start (^)
    //Negative numbers allowed (-?)
    //At least 1 digit (\d+)
    //Optional (. followed by at least 1 digit) ((\.\d+)?)
    //Optional (optional whitespace + (any number of characters (\s?(([h].*)?([k].*)?([m].*)?([g].*)?([t].*)?)+)?
    // of which must contain at least one of the following letters (h,k,m,g,t))
    // before the remainder of the string.
    //End ($)
    String expression = @"^-?\d+(\.\d+)?(\s?(([h].*)?([k].*)?([m].*)?([g].*)?([t].*)?)+)?$";
    return Regex.IsMatch(value, expression, RegexOptions.IgnoreCase);
}
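Note that the expression above accepts any text after one of the unit letters (for example, 5 hello passes because of the h). If only the exact suffixes Hz, KHz, MHz, GHz and THz should be accepted, a stricter variant could look like the sketch below. This is an alternative under that assumption, not the original code; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FrequencyCheck
{
    // Optional sign, digits, optional fraction, optional space,
    // then an optional unit: hz with an optional k/m/g/t prefix.
    private const string Expression = @"^-?\d+(\.\d+)?\s?([kmgt]?hz)?$";

    public static bool IsValidFrequency(string value)
    {
        return Regex.IsMatch(value, Expression, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        foreach (string sample in new[] { "100", "-1.5 kHz", "2.4GHz", "5 hello" })
        {
            Console.WriteLine($"{sample}: {IsValidFrequency(sample)}");
        }
    }
}
```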

May I suggest you read the regex wiki page.

It looks like you are specifically looking for regular expressions which support IP addresses with port numbers. This thread may be useful; IPs with port numbers are discussed in detail, and there are some examples given:
http://www.codeproject.com/Messages/2829242/Re-Using-Regex-in-Csharp-for-ip-port-format.aspx
Keep in mind that a structurally valid IP is different from a completely valid IP that only has valid numbers in it. For example, 999.999.999.999:0000 has a valid structure, but it is not a valid IP address.
Alternatively, IPAddress.TryParse() may work for you, but I have not tried it myself.
http://msdn.microsoft.com/en-us/library/system.net.ipaddress.tryparse.aspx
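For what it's worth, here is a minimal sketch of the IPAddress.TryParse() route for an address with a port. The address:port layout and the helper name IsIPv4WithPort are assumptions for illustration. IPAddress.TryParse on its own is fairly lenient (it accepts shorthand such as a bare number), so the sketch also insists on four dot-separated parts.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Net.Sockets;

public static class EndpointCheck
{
    // Hypothetical helper: checks text of the form "a.b.c.d:port".
    public static bool IsIPv4WithPort(string text)
    {
        int colon = text.LastIndexOf(':');
        if (colon <= 0)
        {
            return false; // no port separator, or nothing before it
        }

        string host = text.Substring(0, colon);
        string portText = text.Substring(colon + 1);

        // Require exactly four parts, because TryParse accepts shorter forms too.
        if (host.Split('.').Length != 4)
        {
            return false;
        }

        return IPAddress.TryParse(host, out IPAddress address)
            && address.AddressFamily == AddressFamily.InterNetwork
            && ushort.TryParse(portText, NumberStyles.None, CultureInfo.InvariantCulture, out _);
    }

    public static void Main()
    {
        Console.WriteLine(IsIPv4WithPort("192.168.0.1:8080"));
        Console.WriteLine(IsIPv4WithPort("999.999.999.999:0000"));
    }
}
```

Letting the framework check the numeric ranges avoids encoding 0-255 in the pattern itself, at the cost of accepting whatever octet notations TryParse accepts.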

Related

How to match regular expression starting exactly at a given index?

With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
regular expression ab
input string: ababab
Now, I can search for matches for the regular expression (named expr in the following) in the input string, for instance, starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this.
As my string in which testing occurs might be ... "long" (think possibly 4 digit numbers of characters), I would, however, prefer to avoid the overhead that would presumably occur in all three cases one way or another:
| # | Work-Around | Drawback |
|---|---|---|
| 1 | I could check the resulting match to see whether its Index property matches the supplied index. | Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached). |
| 2 | I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index. | As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.) |
| 3 | I could prepend "^.{#}" (with # being replaced with the character index to test) for each expression and match from the beginning, then fish out the actually interesting match with a capturing group. | I need to test the same regex on multiple possible start positions throughout my input string. As each time, the number of skipped characters changes, that would mean compiling a new regex every time, rather than re-using the one that I have, which again feels somewhat unclean. |
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful, as in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea about the expected length of a match in advance.
It appears you can use the \G anchor: the pattern \Gab will match when matching starts at index 2 and will fail when it starts at index 1. See this C# demo:
Regex expr = new Regex(@"\Gab");
Console.WriteLine(expr.Match("ababab", 1)?.Success); // => False
Regex expr2 = new Regex(@"\Gab");
Console.WriteLine(expr2.Match("ababab", 2)?.Success); // => True
As per the documentation, the \G anchor matches like this:
"The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started."
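Since the question mentions testing the same regex at several start positions, here is a small sketch that builds the \G pattern once and reuses it for every index, with no substring copies. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredMatchDemo
{
    // \G pins the match to the start position passed to Match(input, startat).
    private static readonly Regex Expr = new Regex(@"\Gab");

    // True only if a match begins exactly at the given index.
    public static bool MatchesAt(string input, int index)
    {
        return Expr.Match(input, index).Success;
    }

    public static void Main()
    {
        string input = "ababab";
        for (int i = 0; i < input.Length; i++)
        {
            Console.WriteLine($"{i}: {MatchesAt(input, i)}");
        }
    }
}
```

For an arbitrary existing pattern, wrapping it as \G(?:pattern) keeps any top-level alternation inside the anchor.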

Regex groups expression not capturing content

I'm trying to create a large regular expression where the plan is to capture 6 groups.
It is going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab-separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option even if it is heavier than performing a split.
Breaking down the groups in more detail from my thought process:
Matches the date (I'm considering changing this to a x number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I use a reduced expression to the following it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups, the word and the 2 numbers blocks.
Any idea why is this?
Thank you.
The problem is \w{+} is going to match a word character followed by one or more { characters and then a final } character. If you want one or more word characters then just use plus without the curly braces (which are meant for specifying a specific number or number range, but will match literal curly braces if they do not adhere to that format).
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
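As a rough C# sketch of the corrected expression in use: the version below adds named groups so the six captures are easier to read back, and also escapes the dot before the fractional seconds so it only matches a literal dot. The group names (time, level, tag, id1, id2, message) are made up for illustration, and the sample line assumes tab separators.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineParser
{
    // Same structure as the corrected pattern, with hypothetical group names.
    private static readonly Regex Line = new Regex(
        @"^(?<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t" +
        @"(?<level>[A-Za-z]{4})\t" +
        @"(?<tag>\w+)\t" +
        @"(?<id1>\d{5})\t" +
        @"(?<id2>\d{5})\t" +
        @"(?<message>.*)$");

    public static Match Parse(string line)
    {
        return Line.Match(line);
    }

    public static void Main()
    {
        string sample = "2020-03-10T14:09:13.3250000\tVERB\tCallingClass\t17503\t20870\tWhatever content: this log line had (etc)";
        Match m = Parse(sample);
        if (m.Success)
        {
            Console.WriteLine(m.Groups["tag"].Value);
            Console.WriteLine(m.Groups["message"].Value);
        }
    }
}
```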

Assist me on building my own regex

I'm completely new in this area, I need a regex that follows these rules:
Only numbers and symbols are allowed.
Must start with a number and ends with a number.
Must not contain more than 1 symbol in a row (for example, 123+-4567 is not accepted but 12+345-67 is accepted).
I tried ^[0-9]*[+-*/][0-9]*$ but I think it's a stupid try.
You were close with your attempt. This one should work.
^[0-9]+([+*/-][0-9]+)*$
explanation:
^ matches beginning of the string
[0-9]+ matches 1 or more digits.
[+*/-] matches one from specified symbols
([+*/-][0-9]+)* matches group of symbol followed by at least one digit, repeated 0 or more times
$ matches end of string
We'll build that one from individual parts and then we'll see how we can be smarter about that:
Numbers
\d+
will match an integer. Not terribly fancy, but we need to start somewhere.
Must start with a number and end with a number:
^\d+.*\d+$
Pretty straightforward. We don't know anything about the part in between, though (also the last \d+ will only match a single digit; we might want to fix that eventually).
Only numbers and symbols are allowed. Depending on the complexity of the rest of the regex this might be easier by explicitly spelling it out or using a negative lookahead to make sure there is no non-(number|symbol) somewhere in the string. I'll go for the latter here because we need that again:
(?!.*[^\d+*/-])
Sticking this to the start of the regex makes sure that the regex won't match if there is any non-(number|symbol) character anywhere in the string. Also note that I put the - at the end of the character class. This is because it has a certain special meaning when used between two other characters in a character class.
Must not contain more than one symbol in a row. This is a variation on the one before. We just make sure that there never is more than one symbol by using a negative lookahead to disallow two in sequence:
(?!.*[+/*-]{2})
Putting it all together:
(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$
Testing it:
PS Home:\> '123+-4567' -match '(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$'
False
PS Home:\> '123-4567' -match '(?!.*[^\d+*/-])(?!.*[+/*-]{2})^\d+.*\d+$'
True
However, I only literally interpreted your rules. If you're trying to match arithmetic expressions that can have several operands and operators in sequence (but without parentheses), then you can approach that problem differently:
Numbers again
\d+
Operators
[+/*-]
A number followed by an operator
\d+[+/*-]
Using grouping and repetition to match a number followed by any number of repetitions of an operator and another number:
\d+([+/*-]\d+)*
Anchoring it so we match the whole string:
^\d+([+/*-]\d+)*$
Generally, for problems where it works, this latter approach works better and leads to more understandable expressions. The former approach has its merits, but most often only in implementing password policies (apart from »cannot repeat any of your previous 30689 passwords«).
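The tests above are PowerShell; since the question may well be about C#, here is a minimal sketch of the second expression used from C#. It swaps \d for [0-9], because in .NET \d also matches non-ASCII Unicode digits by default, which is an assumption about what "numbers" should mean here. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExpressionValidator
{
    // A number, then any number of (operator, number) pairs.
    private static readonly Regex Pattern = new Regex(@"^[0-9]+([+/*-][0-9]+)*$");

    public static bool IsValid(string text)
    {
        return Pattern.IsMatch(text);
    }

    public static void Main()
    {
        foreach (string sample in new[] { "12+345-67", "123+-4567", "42", "12+" })
        {
            Console.WriteLine($"{sample}: {IsValid(sample)}");
        }
    }
}
```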

Validation for Phone Number to allow () and space, but regex I am using is not allowing to enter those.

string Phno=txt_EditPhno.Text;
bool check = false;
Regex regexObj = new Regex(@"^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$");
if ((String.IsNullOrEmpty(Phno))||(regexObj.IsMatch(Phno)))
{}
I am using this regular expression to validate a phone number that may contain spaces, - and (). But it does not accept any of the symbols mentioned above. Is the regular expression I am using wrong, or am I doing it in the wrong way?
The RegEx string you listed is working correctly:
System.Text.RegularExpressions.Regex regexObj = new System.Text.RegularExpressions.Regex(@"^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$");
regexObj.IsMatch("(555)-867-5309")
true
regexObj.IsMatch("+15558675309")
true
regexObj.IsMatch("+1.555.867.5309")
true
regexObj.IsMatch("+15558675309ext12345")
true
regexObj.IsMatch("+15558675309x12345")
true
The error in your code must be somewhere else. You can always use a tool like RegExLib.com to test out your RegEx.
Using a slightly different approach might also be useful to you... I've used a type of approach before in getting telephone number information that involves pulling out the needed information and reformatting it - you may have requirements that don't fit this solution, but I'd like to suggest it anyways.
Using this match expression:
(?i)^\D*1?\D*([2-9])\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)[^x]*?\s*(?:(?:e?(x)(?:\.|t\.?|tension)?)\D*(\d+))?.*$
and this replace expression:
($1$2$3) $4$5$6-$7$8$9$10 $11$12
you should be able to reformat these inputs as indicated:
Input Output
----------------------------- --------------------------------
"1323-456-7890 540" "(323) 456-7890 "
"8648217634" "(864) 821-7634 "
"453453453322" "(453) 453-4533 "
"#404-327-4532" "(404) 327-4532 "
"172830923423456" "(728) 309-2342 "
"17283092342x3456" "(728) 309-2342 x3456"
"jh345gjk26k65g3245" "(345) 266-5324 "
"jh3g24235h2g3j5h3x245324" "(324) 235-2353 x245324"
"12345678925x14" "(234) 567-8925 x14"
"+1 (322)485-9321" "(322) 485-9321 "
"804.555.1234" "(804) 555-1234 "
I'll grant you it's not the most efficient expression, but an inefficient regex is not usually a problem when run on a short amount of text, especially when written with knowledge and a small amount of care.
To break down the parsing expression a little bit:
(?i)^\D*1?\D* # mode=ignore case; optional "1" in the beginning
([2-9])\D*(\d)\D*(\d)\D* # three digits* with anything in between
(\d)\D*(\d)\D*(\d)\D* # three more digits with anything in between
(\d)\D*(\d)\D*(\d)\D*(\d)[^x]*? # four more digits with anything in between
\s* # optional whitespace
(?:(?:e?(x)(?:\.|t\.?|tension)?) # extension indicator (see below)
\D*(\d+))? # optional anything before a series of digits
.*$ # and anything else to the end of the string
* The three digits cannot start with 0 or 1. The extension indicator can be x, ex, xt, ext (all of which can have a period at the end), extension, or xtension (which cannot have a period at the end).
As written, the extension (the digits, that is) has to be a contiguous series of numbers (but they usually are, as your given expression assumes).
The idea is to use the regex engine to pull out the first 10 digits, where the first digit cannot be "0" or "1", because domestic U.S. telephone numbers do not start with those (except as a switch, which is not always needed and depends not on the destination phone but on the phone you're typing it into). It will then try to pull out anything up to an 'x', and capture the 'x', along with the first contiguous string of digits after that.
It allows considerable tolerance in formatting of the input, while at the same time stripping out harmful data or meta-characters, then produces a consistently-formatted telephone number (something that is often appreciated on many levels).
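As a sketch of how the match and replace expressions might be wired together in C#: the parsing expression has twelve capturing groups (ten digits, the 'x', and the extension digits), so the replacement below refers to groups 11 and 12 for the extension and uses the explicit ${n} form to avoid any ambiguity around group 10. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormatter
{
    // Groups 1-10: the ten digits; 11: the 'x' marker; 12: extension digits.
    private static readonly Regex Phone = new Regex(
        @"^\D*1?\D*([2-9])\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)[^x]*?\s*(?:(?:e?(x)(?:\.|t\.?|tension)?)\D*(\d+))?.*$",
        RegexOptions.IgnoreCase);

    private const string Replacement =
        "(${1}${2}${3}) ${4}${5}${6}-${7}${8}${9}${10} ${11}${12}";

    // Input that does not match is returned unchanged by Regex.Replace.
    public static string Normalize(string input)
    {
        return Phone.Replace(input, Replacement);
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("17283092342x3456"));
        Console.WriteLine(Normalize("+1 (322)485-9321"));
    }
}
```

Groups that did not take part in the match are replaced with an empty string, which is why numbers without an extension end in a trailing space.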

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code return a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using the C# verbatim string specifier @ for regular expression patterns, so that you do not need to double-escape things in regex patterns. So I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex.
Try the regex string \\[B.+?\\] instead. .+ on its own (same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
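A quick way to compare the three patterns discussed in these answers is to count their matches on the same input. This is only a sketch; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagCounter
{
    // Number of non-overlapping matches of pattern in input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string markup = "[BName] [BAddress]";
        Console.WriteLine(Count(@"\[B.+\]", markup));      // greedy
        Console.WriteLine(Count(@"\[B.+?\]", markup));     // lazy
        Console.WriteLine(Count(@"\[B[^\]]+\]", markup));  // negated class
    }
}
```

The greedy pattern finds a single match spanning both tags, while the lazy and negated-class versions each find the two tags separately.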
