Regular expression to check valid property name in c# - c#

I need to validate user input for a property name to retrieve.
For example user can type "Parent.Container" property for windows forms control object or just "Name" property. Then I use reflection to get value of the property.
What I need is to check if user typed legal symbols of c# property (or just legal word symbols like \w) and also this property can be composite (contain two or more words separated with dot).
I have this as of now, is this a right solution?
^([\w]+\.)+[\w]+$|([\w]+)
I used Regex.IsMatch method and it returned true when I passed "?someproperty", though "\w" does not include "?"

I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.
Clarifying what we want
First we need to know which valid we want: valid according to the runtime or valid according to the language? Examples:
Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
For verbatim identifiers (@ prefix) the language treats the @ as part of the identifier, but the runtime doesn't see it.
Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.
"Valid" based on the runtime
C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA 334 spec. Its rule says:
The rules for identifiers given in this subclause correspond exactly
to those recommended by the Unicode Standard Annex 15 except that
underscore is allowed as an initial character (as is traditional in
the C programming language), Unicode escape sequences are permitted in
identifiers, and the “@” character is allowed as a prefix to enable
keywords to be used as identifiers.
The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:
@"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
All the ugly details
The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforces it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0100\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input to be done with it.
The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and El­mo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
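The three caveats above can be handled outside the regex. The following is only a sketch: the class and method names are made up for illustration, and whether you want to strip formatting characters or reject "__" at all depends on your needs.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class IdentifierCleanup
{
    // Hypothetical helper: normalizes to Form C and drops formatting (Cf) characters,
    // so that identifiers the spec treats as equivalent end up as the same string.
    public static string Canonicalize(string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormC);
        return new string(normalized
            .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());
    }

    // Identifiers containing "__" are reserved, so a caller may want to reject them.
    public static bool UsesReservedDoubleUnderscore(string identifier) =>
        identifier.Contains("__");
}
```

Running the canonicalization before the regex check keeps the pattern itself simple, at the cost of accepting inputs that only differ by invisible characters.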
Nesting, generics, etc.
Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.
"Valid" based on the language
This is just the previous regex but also allowing for:
Prefixed @
Unicode escape sequences
The first is a trivial modification: just add @?.
Unicode escape sequences are of form @"\\[Uu][\dA-Fa-f]{4}". We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass to convert the escape sequences: replace all \\[Uu][\dA-Fa-f]{4} with the corresponding character.
So putting it all together, a check for whether a string is valid from a C# language standpoint would be:
// Needs: using System; using System.Globalization; using System.Text.RegularExpressions;
bool IsValidIdentifier(string input)
{
    if (input is null) { throw new ArgumentNullException(nameof(input)); }

    // Technically the input must be in normal form C. Implementations aren't required
    // to verify that though, so you could remove this check if your runtime doesn't
    // mind.
    if (!input.IsNormalized())
    {
        return false;
    }

    // Convert escape sequences to the characters they represent. This pass handles
    // four-hex-digit sequences written as \u0000 or \U0000, where 0 is a hex digit.
    // (Per the language spec, \U is followed by eight hex digits; that form is not
    // handled here.)
    MatchEvaluator replacer = (Match match) =>
    {
        string hex = match.Groups[1].Value;
        var codepoint = int.Parse(hex, NumberStyles.HexNumber);
        return new string((char)codepoint, 1);
    };
    var escapeSequencePattern = @"\\[Uu]([\dA-Fa-f]{4})";
    var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer, RegexOptions.CultureInvariant);

    // Now do the real check.
    var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
    return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}
Back to the original question
The asker is long gone, but I feel obliged to include an answer to the actual question:
// One or more identifiers separated by dots, e.g. "Name" or "Parent.Container".
// (parts.All needs using System.Linq;)
string[] parts = input.Split('.');
return parts.All(IsValidIdentifier);
Sources
ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7

Not the best, but this will work. Demo here.
^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$
Note that
* Number 0-9 is not allowed as first character
* @ is allowed only as first character, but not anywhere else (compiler will strip off though)
* _ is allowed
Edit
Looking at your requirement, the below Regex will be more useful, as input property name need not have @ in it. Check here.
^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$

What you posted in the comments is almost right. But it won't detect single properties like "Name".
^(?:[\w]+\.)*\w+$
Works as expected. Just changed the + to * and the group to non-capturing group since you are not concerned about groups here.
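A short check, assuming the pattern quoted in the question, shows why "?someproperty" was accepted: the second alternative ([\w]+) has no anchors, so IsMatch succeeds as soon as any run of word characters appears anywhere in the input. The class and member names below are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PropertyPathCheck
{
    // Pattern from the question: the right-hand alternative is not anchored.
    public const string Original = @"^([\w]+\.)+[\w]+$|([\w]+)";

    // Pattern from this answer: one anchored expression covers both cases.
    public const string Fixed = @"^(?:[\w]+\.)*\w+$";

    public static bool MatchesOriginal(string s) => Regex.IsMatch(s, Original);

    public static bool MatchesFixed(string s) => Regex.IsMatch(s, Fixed);
}
```

With these, MatchesOriginal("?someproperty") is true while MatchesFixed("?someproperty") is false, and both "Name" and "Parent.Container" pass the fixed pattern.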

Related

C# toUpper for language without Uppercase

When using String.toUpper() are there any additional precautions which must be taken when attempting to "format" a language which does not contain uppercase characters such as Arabic?
string arabic = "مرحبا بالعالم";
string upper= arabic.ToUpper();
Sidebar: Never call the parameterless .ToUpper() or .ToLower() when localization matters, because those overloads silently use the current culture and do not make your intent (about localization) clear. You should prefer an explicit culture instead, e.g. CultureInfo.TextInfo.ToUpper.
But to answer your question: case-conversions do not affect characters not subject to casing, they are kept as-is. This also happens in en-US and other Latin-alphabet languages because characters like digits 0, 1, 2 etc don't have cases either - so your Arabic characters will be preserved as-is.
Note how the non-alphabetical and already-uppercase characters are ignored:
"abcDEF1234!##" -> "ABCDEF1234!##"
Another thing to be aware of is that some languages have characters that don't have a one-to-one mapping between lowercase and uppercase forms, namely the Turkish I, which is written up here: https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/ (and it's why FxCop yells at you if you ever use ToLower instead of ToUpper, and why you should use StringComparison.OrdinalIgnoreCase or CurrentCultureIgnoreCase and never str1.ToLower() == str2.ToLower() for case-insensitive string comparison).
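The points above can be tried out with a small sketch. The helper names are hypothetical, and culture-specific results (such as the Turkish I) depend on the culture data available on the machine, so no output is claimed for them here.

```csharp
using System;
using System.Globalization;

public static class CasingDemo
{
    // Characters without case, such as Arabic letters or digits, come back unchanged.
    public static string UpperInvariant(string text) => text.ToUpperInvariant();

    // Culture-aware upper-casing with the culture stated explicitly.
    public static string UpperFor(string text, string cultureName) =>
        CultureInfo.GetCultureInfo(cultureName).TextInfo.ToUpper(text);

    // Case-insensitive comparison without creating upper- or lower-cased copies.
    public static bool SameIgnoringCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

Calling UpperFor("i", "tr-TR") and UpperFor("i", "en-US") side by side is a quick way to see whether your environment applies the Turkish casing rules.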

Domain Name Regex Including IDN Characters c#

I want my domain name to not contain more than one consecutive (.), '/' or any other special characters. But it can contain IDN Characters such as Á, ś, etc... I can achieve all requirements (except IDN) by using this regex:
@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
Problem is that this regex denies IDN characters too. I want a regex which will allow IDN characters. I did a lot of research but I can't figure it out.
Brief
Regex contains a character class that allows you to specify Unicode general categories \p{}. The MSDN regex documentation contains the following:
\p{ name } Matches any single character in the Unicode general
category or named block specified by name.
Also, as a sidenote, I noticed your regex contains an unescaped .. In regex the dot character . has a special meaning of any character (except newline unless otherwise specified). You may need to change this to \. to ensure proper functionality.
Code
Editing your existing code to include Unicode character classes instead of simply the ASCII letters, you should attain the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*.)+[\p{L}\p{N}]{2,}$
Explanation
\p{L} Represents the Unicode character class for any letter in any language/script
\p{N} Represents the Unicode character class for any number in any language/script (based on your character samples, you can probably keep 0-9, but I figured I would show you the general concept and give you slightly additional information)
This site gives a quick and general overview of the most used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
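As a rough illustration of the \p{L} / \p{N} approach, here is a sketch of a shape check. It is a variant of the pattern above with the label separator escaped as \. and the hyphen moved to the end of the character class; the class and method names are made up. It only checks the overall shape and does not implement the IDNA rules.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeDomainShape
{
    // Labels start with a letter or number, may continue with letters, numbers,
    // underscore or hyphen, and are separated by a literal dot.
    private static readonly Regex Shape = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$",
        RegexOptions.CultureInvariant);

    public static bool LooksLikeDomain(string s) => Shape.IsMatch(s);
}
```

For example, "ślązak.pl" and "example.com" pass, while "exa..mple.com" and "example/.com" do not.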
This question cannot be answered with a simple regex that allows all sorts of Unicode character classes since the IDN Character Categorization defines many illegal characters and there are other limitations.
AFAIK, IDN domain names start with xn--. This way extended UTF-8 characters are enabled in domain names, e.g. 大众汽车.cn is a valid domain name (volkswagen in Chinese). To validate this domain name using regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车) pass.
In order to do so, you will need to encode domain names to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA), Doc/PDF.
Libidn comes with a CLI tool called idn that allows you to convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as ACE-encoded equivalent of UTF-8 URL.
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some samples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
If you would like to use Unicode character classes \p{} instead, you should use the following as specified by the IDN FAQ:
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=V}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
See also: Perl Unicode properties
"There are several reasons why one may need to validate a domain or an Internationalized Domain Name.
To accept only the functional domains which resolve when probed through a DNS query
To accept the strings which can potentially act (get registered and subsequently resolved, or only for the sake of information) as domain name
Depending on the need, the way a domain name can be validated differs a great deal.
Validating domain names purely against the technical specification, regardless of whether they resolve through DNS, is a slightly more complex problem than writing a regex with a certain number of Unicode classes.
There is a host of RFCs (5891, 5892, 5893, 5894 and 5895) that together define the structure of a valid domain name (IDNs specifically, domains in general). It involves not only various Unicode character classes but also some context-specific rules that need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names as per the latest IDNA protocol, i.e. IDNA 2008.
C# provides the System.Globalization.IdnMapping class, which converts a domain name to its equivalent Punycode version. You can use this class to check whether the user-submitted domain complies with the IDNA norms. If it does not, the conversion throws an exception, which tells you the submission is invalid.
If you want to dig deeper into the topic, refer to the thoroughly researched documents produced by the "Universal Acceptance Steering Group" (https://uasg.tech/): "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, those interested in the overall process, challenges and issues of implementing an internationalized email solution can go through the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email)."
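The IdnMapping approach described above could look roughly like this. It is a sketch, not a definitive validator: the wrapper name TryToAscii is made up, and the exact rule set IdnMapping applies may depend on the .NET version and operating system.

```csharp
using System;
using System.Globalization;

public static class IdnCheck
{
    // Returns true and the ACE (Punycode) form when IdnMapping accepts the name.
    public static bool TryToAscii(string domain, out string ascii)
    {
        try
        {
            ascii = new IdnMapping().GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // IdnMapping signals a name it cannot convert by throwing ArgumentException.
            ascii = null;
            return false;
        }
    }
}
```

A caller would pass the user's input, for example "大众汽车.cn", and run any further ASCII-only checks on the returned ACE string.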

RegEx different substitutions based on groups?

So I'm relatively n00bish at regular expressions, and doing a little practicing.
I'm playing with a dog-simple "deobfuscator" that just looks for [dot] or (dot) or [at] or (at). Case-insensitive, and with or w/out any number of spaces before or after the match(es).
This is for the usual: someemail [AT] domain (dot) com type of thing. I want to obviously turn it into someemail@domain.com.
The RegEx I've come up with does the matching fine, but now I want to replace with either a . or a @ depending on the match.
i.e.
I want the group matching the "dot" group to replace it with the literal ., and the group matching the "at" group with the literal @.
I know I could just write 2 different (almost identical) RegEx's and run it through both, but for the sake of education, I'm trying to see if I can do it all in one RegEx?
Here's the RegEx I came up with (probably not the smallest possible, which I'd also be interested in seeing):
+(\[|\()(dot)(\)|\]) +| +(\[|\()(at)(\)|\]) +
NOTE: before each + there's an empty space, for matching spaces.
What I'm looking for is what I would use to do the replacement(s) properly?
Update: Sorry all, forgot to add which language I was working with for this. In this case, I'm using a clipboard utility that can run RegEx's on its input (whatever gets copied to the clipboard), and the engine it uses is C#/VB.NET. Ultimate goal for this little project is to just be able to copy an "obfuscated" email address or URL, and run the RegEx on it so that it's set on the clipboard in its "unobfuscated" state.
That said, I do tend to use RegEx's on many different languages, so converting them between languages generally isn't an issue.
.NET regex does not support conditional replacement patterns.
for the sake of education, I'm trying to see if I can do it all in one RegEx?
There are other regex engines that allow conditional replacement logic in a single regex replacement operation with conditional replacement patterns.
There are 3 engines that support this type of replacements: JGsoft V2, Boost, and PCRE2.
For conditionals to work in Boost, you need to pass regex_constants::format_all to regex_replace. For them to work in PCRE2, you need to pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute.
In PCRE2:
${1:+matched:unmatched} where 1 is a number between 1 and 99 referencing a numbered capturing group. If your regex contains named capturing groups then you can reference them in a conditional by their name: ${name:+matched:unmatched}.
If you want a literal colon in the matched part, then you need to escape it with a backslash. If you want a literal closing curly brace anywhere in the conditional, then you need to escape that with a backslash too. Plus signs have no special meaning beyond the :+ that starts the conditional, so they don't need to be escaped.
Also, see The Boost-Specific Format Sequences:
When specifying the format_all flag to regex_replace(), the escape sequences recognized are the same as those above for format_perl. In addition, conditional expressions of the following form are recognized:
?Ntrue-expression:false-expression
where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it as \(.
In Boost replacement patterns, literal ( and ) must be escaped.
The syntax for JGsoft V2 replacement string conditionals is the same as that in the C++ Boost library.
So, your regex can be contracted to ( +)[[(](?:(dot)|(at))[])]( +):
( +) - Group 1: one or more spaces
[[(] - a [ or (
(?:(dot)|(at)) - Either (Group 2) a dot substring or (Group 3) an at substring
[])] - a ) or ]
( +) - Group 4: one or more spaces
And replace with $1(?{2}.:@)$4:
$1 - Group 1 value,
(?{2}.:@) - if Group 2 (dot) matched, replace with ., else with @
$4 - Group 4 value.
This replacement syntax is available in Notepad++.
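Since .NET has no conditional replacement patterns, a MatchEvaluator callback can make the same decision in code. The following is a minimal sketch for the C# engine mentioned in the question; the class name and the choice to swallow the surrounding spaces are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Deobfuscator
{
    // Optional spaces, an opening [ or (, then "dot" (group 1) or "at" (group 2),
    // a closing ] or ), and optional spaces again.
    private static readonly Regex Token = new Regex(
        @" *[\[(](?:(dot)|(at))[\])] *",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // The lambda plays the role of the conditional replacement:
    // "." when the dot group took part in the match, "@" otherwise.
    public static string Deobfuscate(string input) =>
        Token.Replace(input, m => m.Groups[1].Success ? "." : "@");
}
```

For example, Deobfuscate("someemail [AT] domain (dot) com") gives "someemail@domain.com".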
If you are using Java, try replaceAll method from String class.
And finally you need to normalize it with white spaces:
- Pure Java - String after = before.trim().replaceAll("\\s+", " ");
- Pure Java - String after = before.replaceAll("\\s{2,}", " ").trim();
- Apache commons lang3 - String after = StringUtils.normalizeSpace(String str);
- ...

Why can users not put escape sequences in their input by default?

So I'm working on this challenge in which I have to take in user input, check if it contains an escape sequence and then execute the escape sequence.
My question is: why do escape sequences execute in predetermined string variables, but not when you take a user's input and store it in a variable? That input happens to contain an escape sequence such as \n, but it does not execute.
No user input Ex:
string noInput = "this is a escape \n sequence";
Console.WriteLine(noInput);
Console.ReadLine();
Output is : This is an escape
sequence
or user input Ex:
string input = Console.ReadLine();
Console.WriteLine(input);
Console.ReadLine();
Output is : This is an escape \n sequence
Hopefully i explained my question well enough. I'm assuming this may be because of security but would like to know the answer.
"Escape sequence" is a feature of the language / compiler.. in this case C#.
The relevant language specification can be found at - 2.4.4.5 String literals
Note that the reference is to an older version of language specification, but still applies.
Latest version can be found here.
From the spec -
A character that follows a backslash character (\) in a regular-string-literal-character must be one of the following characters: ', ", \, 0, a, b, f, n, r, t, u, U, x, v. Otherwise, a compile-time error occurs. The example
string a = "hello, world"; // hello, world
string b = @"hello, world"; // hello, world
string c = "hello \t world"; // hello world
string d = @"hello \t world"; // hello \t world
Point is, that a .Net language is free to define what special characters in a string literal will be treated as escape sequences.. however it is typically what has been used for ages from languages like C and C++ in old days.
When you are accepting user input.. The input is (obviously?) treated as a literal string. (Another way to think is, a compiled .Net program is obviously compiler and language independent.. the runtime a.k.a CLR doesn't have the concept of escape sequences in strings)
If you wish to provide such features (may be you have a good scenario).. you have limited options..
Use upcoming compiler features like Roslyn to process the input string for you. I have never personally looked at which specific API in Roslyn will help you do that, but it has to be there, given that Roslyn is supposed to be the compiler itself.
Note that a con of this approach is, that Roslyn may be pretty heavyweight to include in your app for only one feature.
Write a small routine yourself, which tries to perform same escaping as the compiler. For production quality code, this can be tricky (you have to understand and follow the specification to exactly match it.. and perhaps keep your implementation up to date, as it may change with future versions of C# - Like what if new escape sequence is introduced).
Although, practically speaking.. escape sequences in C# specification should not change willy nilly.. but I would not bet on it.
Find a third party library, which already does it for you (included for sake of completeness of the answer.)
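To give an idea of the "write a small routine yourself" option, here is a deliberately incomplete sketch. It handles only a handful of simple escapes and leaves anything else as typed; the class name and the set of supported escapes are choices made for this example, not a full implementation of the specification.

```csharp
using System;
using System.Text;

public static class EscapeDecoder
{
    // Replaces a few two-character escapes typed by the user with the
    // single character they stand for in a C# string literal.
    public static string Unescape(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '\\' || i == input.Length - 1)
            {
                sb.Append(c); // ordinary character, or a trailing backslash
                continue;
            }

            char next = input[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 't': sb.Append('\t'); break;
                case 'r': sb.Append('\r'); break;
                case '0': sb.Append('\0'); break;
                case '\\': sb.Append('\\'); break;
                case '"': sb.Append('"'); break;
                case '\'': sb.Append('\''); break;
                default: sb.Append('\\').Append(next); break; // unknown: keep as typed
            }
        }
        return sb.ToString();
    }
}
```

Usage would be Console.WriteLine(EscapeDecoder.Unescape(Console.ReadLine())), which prints a real line break where the user typed a backslash followed by n.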
EDIT: Proof that the string you see (in source code), is only an artifact of the source code in given language -
Compile a C# app, with string "Hello\nWorld" in it. Open the compiled binary in a binary editor. The string you'd find in the compiled binary will be without the "\n", replaced with the appropriate bytes for new line character.
When the text is in a predetermined string literal, \n is treated as a single character. When the user types '\' and then 'n', they are treated as two different characters, so the user's input is one character longer.
Try using substrings or other string manipulation to find the \n part of the user input and handle it yourself. Check it here:
http://msdn.microsoft.com/en-us/library/vstudio/ms228362(v=vs.110).aspx
;)

How can I Ensure that a String Matches a Certain Format?

How can I check that a string matches a certain format? For example, how can I check that a string matches the format of an IP address, proxy address (or any custom format)?
I found this code but I am unable to understand what it does. Please help me understand the match string creation process.
string pattern = @"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$";
//create our Regular Expression object
Regex matching is made simple:
Regex r = new Regex(@"your_regexp");
if (r.Match(whatever).Success)
{
// Do_something
}
This code will invoke some actions if the whatever string matches the your_regexp regular expression.
So what are they, these regular expressions (also abbreviated as regex or regexp)? They're nothing but string patterns, designed to be used as filters for other strings.
Let's assume you have a lot of HTTP headers and you want to get GET moofoo HTTP/1.1 only. You may use string.Contains(other_string) method, but regexps make this process more detailed, error-free, flexible and handy.
A regexp consists of blocks, which may be used for replacement in the future. Each block defines which symbols the entire string can contain at some position. Blocks let you define these symbols or use patterns to ease your work.
Symbols, which may be or may not be in the current string position are determined as follows:
if you are sure these symbols MUST be there, just use them "as is". In our example, this matches the HTTP word - this is always present in HTTP headers.
if you know all possible variations, use | (logic OR) operator. Note: all variants must be enclosed by block signs - round brackets. Read below for details. In our case this one matches GET word - this header could use GET, POST, PUT or DELETE words.
if you know all possible symbol ranges, use range blocks: for example, literals could be determined as [a-z], [\w] or [[:alpha:]]. Square brackets are the signs of range blocks. They must be used with count operator. This one is used to define repetitions. E.g. if your words/symbols should be matched once or more, you should define that using:
? (means 'could be present and could be not')
+ (stands for 'once or more')
* (stands for 'zero or more')
{A,} (stands for 'A or more')
{A,B} (means 'not less than A and not greater than B times')
{,B} (stands for 'not more than B'; .NET does not recognize this form, so write {0,B} there)
if you know which symbol ranges must not be present, use NOT operator (^) within range, at the very beginning: [^a-z] is matched by 132==? while [^\d] is matched by abc==? (\d defines all digits and is equal to [0-9] or [[:digit:]]). Note: ^ is also used to determine the very beginning of entire string if it is not used within range block: ^moo matches moofoo and not foomoo. To finish the idea, $ matches the very ending of entire string: moo$ would be matched with foomoo and not moofoo.
if you don't care which symbol to match, use star: .* is the most commonly-used pattern to match any number of any symbols.
Note: all blocks should be enclosed by round brackets ((phrase) is a good block example).
Note: all non-standard and reserved symbols (such as tab symbol \t, round brackets ( and ), etc.) should be escaped (e.g. used with back-slash before symbol representation: \(, \t, \.) if they do not belong to any block and should be matched as-is. For example, in our case there are two escape-sequences within HTTP/1.1 block: \/ and \.. These two should be matched as-is.
Now let's use all the text I've spent nearly 30 minutes typing and create a regexp to match our example HTTP header:
(GET|POST|PUT|DELETE) will match HTTP method
\ will match <SP> symbol (space as it defined in HTTP specification)
HTTP\/ would help us to match HTTP requests only
(\d+\.\d+) will match HTTP version (this will match not 1.1 only, but 12.34 too)
^ and $ will be our string border-limiters
Gathering all these statements together will give us this regexp: ^(GET|POST|PUT|DELETE)\ HTTP\/(\d+\.\d+)$.
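Here is how such a pattern could be used from C#. Note one assumption: the regexp assembled above has no block for the request target (the moofoo part of the example), so this sketch adds a (\S+) group for it. The class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RequestLine
{
    // Group 1: method, group 2: request target (added for this example),
    // group 3: HTTP version. In .NET the / needs no escaping.
    private static readonly Regex Pattern = new Regex(
        @"^(GET|POST|PUT|DELETE) (\S+) HTTP/(\d+\.\d+)$");

    public static bool TryParse(string line, out string method, out string target, out string version)
    {
        Match m = Pattern.Match(line);
        // When the match fails, each group's Value is an empty string.
        method = m.Groups[1].Value;
        target = m.Groups[2].Value;
        version = m.Groups[3].Value;
        return m.Success;
    }
}
```

Each pair of round brackets becomes a numbered group, which is what makes the blocks usable after the match.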
Regular Expressions is what you use to perform a lookup on a string. A pattern is defined and you use this pattern to work out the matches for your expression. This is best seen by example.
Here is a sample set of code I wrote last year for checking if an entered string is a valid frequency of Hz, KHz, MHz, GHz or THz.
Understanding regular expressions will come from trial and error. Read up regular expressions documentation here - http://msdn.microsoft.com/en-us/library/2k3te2cs(v=vs.80).aspx
The expression below took me about 6 hours to get working, due to misunderstanding what certain terms meant, and where I needed brackets etc. But once I had this one cracked the other 6 were very simple.
/// <summary>
/// Checks the given string against a regular expression to see
/// if it is a valid hertz measurement, which can be used
/// by this formatter.
/// </summary>
/// <param name="value">The string value to be tested</param>
/// <returns>Returns true, if it is a valid hertz value</returns>
private Boolean IsValidValue(String value)
{
    //Regular Expression Explanation
    //
    //Start (^)
    //Negative numbers allowed (-?)
    //At least 1 digit (\d+)
    //Optional (. followed by at least 1 digit) ((\.\d+)?)
    //Optional (optional whitespace + (any number of characters (\s?(([h].*)?([k].*)?([m].*)?([g].*)?([t].*)?)+)?
    // of which must contain at least one of the following letters (h,k,m,g,t))
    // before the remainder of the string.
    //End ($)
    String expression = @"^-?\d+(\.\d+)?(\s?(([h].*)?([k].*)?([m].*)?([g].*)?([t].*)?)+)?$";
    return Regex.IsMatch(value, expression, RegexOptions.IgnoreCase);
}
May I suggest you to read the regex wiki page.
It looks like you are specifically looking for regular expressions which support IP addresses with port numbers. This thread may be useful; IPs with port numbers are discussed in detail, and there are some examples given:
http://www.codeproject.com/Messages/2829242/Re-Using-Regex-in-Csharp-for-ip-port-format.aspx
Keep in mind that a structurally valid IP is different from a completely valid IP that only has valid numbers in it. For example, 999.999.999.999.:0000 has a valid structure, but it is not a valid IP address.
Alternatively, IPAddress.TryParse() may work for you, but I have not tried it myself.
http://msdn.microsoft.com/en-us/library/system.net.ipaddress.tryparse.aspx
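A minimal sketch of the IPAddress.TryParse route follows. The wrapper names are hypothetical. One thing to be aware of is that TryParse also accepts some shorthand forms with fewer than four parts, so the second method compares the parsed address back against the input to insist on a plain dotted quad.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class IpCheck
{
    // True for anything the framework can parse as an IPv4 or IPv6 address.
    public static bool IsIpAddress(string text) =>
        IPAddress.TryParse(text, out _);

    // Stricter: an IPv4 address whose canonical text equals the input.
    public static bool IsDottedQuad(string text) =>
        IPAddress.TryParse(text, out IPAddress ip)
        && ip.AddressFamily == AddressFamily.InterNetwork
        && ip.ToString() == text;
}
```

This covers range checking of each number without a hand-written pattern; a port number would still need to be split off and checked separately.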
