All these string prefixes are legal to use in C#:
"text"
@"text"
$"text"
$@"text"
Why isn't this?
@$"text"
One would have thought that the order of these operators doesn't matter, because they have no other meaning in C# but to prefix strings. I cannot think of a situation when this inverted double prefix would not compile. Is the order enforced only for aesthetic purposes?
Interpolated verbatim strings written in the @$ order were not allowed before C# version 8, for no other reason than that this order wasn't implemented. Since C# 8.0 both orders are accepted, so both of these lines will work:
var string1 = $@"text";
var string2 = @$"text";
These prefixes aren't operators. They are only interpreted by the compiler. Before C# 8.0 it understood $@ but not @$. Why? Because Microsoft's compiler team decided so.
Support for the latter order was planned for C# 8.0.
According to the Microsoft docs:
A verbatim interpolated string starts with the $ character followed by
the @ character.
The $ token must appear before the @ token in a verbatim interpolated
string.
Perhaps the syntax was simply designed this way so that the current version of the C# compiler can parse it.
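To see the two prefix orders side by side, here is a minimal sketch. It assumes a C# 9 or later compiler (it uses top-level statements), and the variable names are illustrative only.

```csharp
using System;

string dir = "logs";
string name = "World";

// $@ order: the order described in the docs quote above.
string dollarFirst = $@"C:\{dir}\{name}.txt";

// @$ order: compiles only with C# 8.0 or later.
string atFirst = @$"C:\{dir}\{name}.txt";

// Both literals are verbatim (backslashes are kept) and interpolated.
Console.WriteLine(dollarFirst);
Console.WriteLine(dollarFirst == atFirst);
```

On a compiler older than C# 8.0, only the first of the two literals should compile.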
Related
Expression
var regex = new Regex(@"{([A-z]*)(([^]|:)((\\:)|[^:])*?)(([^]|:)((\\:)|[^:])*?)}");
Breakdown
The expression is [crudely] designed to find tokens within an input, using the format: {name[:pattern[:format]]}, where the pattern and format are optional.
{
([A-z]*) // name
(([^]|:)((\\:)|[^:])*?) // regex pattern
(([^]|:)((\\:)|[^:])*?) // format
}
Additionally, the expression attempts to ignore escaped colons, thus allowing for strings such as {Time:\d+\:\d+\:\d+:hh\:mm\:ss}
Question
When testing on RegExr.com, everything works well enough; however, when I try the same pattern in C#, the input fails to match. Why?
(Any advice on general improvements to the expression is very welcome too.)
The [^] pattern is only valid in JavaScript, where it matches "not nothing", i.e. any character (although in ES5 it does not match characters outside the BMP). In C#, it is easy to match any character with . by passing the RegexOptions.Singleline option. JavaScript had no such modifier until ES2018 added the s (dotAll) flag, but you can match any character there with the [\s\S] workaround pattern.
So, the minimum change needed to make the pattern work in both regex flavors is to replace ([^]|:) with [\s\S]. There is no need to keep : as an alternative, since [\s\S] already matches a colon.
Also, do not use [A-z] as a shortcut to match ASCII letters. Either use [a-zA-Z] or [a-z] and pass a case insensitive modifier.
So, you might consider writing the expression as
{([A-Za-z]*)([\s\S]((\\:)|[^:])*?)([\s\S]((\\:)|[^:])*?)}
See a .NET regex test and a JS regex test.
Surely, there may be other enhancements here: removing redundant groups, adding support for any escape sequence (not just escaped colons), etc., but that is outside the scope of the question.
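As a rough check of the rewritten expression in .NET, the sketch below runs it against the sample token from the question. It assumes a C# 9 or later compiler (top-level statements); the variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The rewritten pattern from the answer, written as a verbatim string.
var tokenRegex = new Regex(@"{([A-Za-z]*)([\s\S]((\\:)|[^:])*?)([\s\S]((\\:)|[^:])*?)}");

Match m = tokenRegex.Match(@"{Time:\d+\:\d+\:\d+:hh\:mm\:ss}");
Console.WriteLine(m.Success);
Console.WriteLine(m.Groups[1].Value); // name
Console.WriteLine(m.Groups[2].Value); // pattern, including its leading colon
Console.WriteLine(m.Groups[5].Value); // format, including its leading colon

// [A-z] spans the ASCII range 'A'..'z', which also contains [ \ ] ^ _ and `.
bool underscoreInWideRange = Regex.IsMatch("_", "^[A-z]$");
bool underscoreInLetters = Regex.IsMatch("_", "^[A-Za-z]$");
Console.WriteLine(underscoreInWideRange);
Console.WriteLine(underscoreInLetters);
```

Note that in this form the pattern and format parts are no longer optional, because each [\s\S] must consume one character.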
I need to validate user input for a property name to retrieve.
For example user can type "Parent.Container" property for windows forms control object or just "Name" property. Then I use reflection to get value of the property.
What I need is to check whether the user typed legal symbols for a C# property (or just legal word symbols like \w). The property can also be composite (two or more words separated by dots).
I have this as of now. Is this the right solution?
^([\w]+\.)+[\w]+$|([\w]+)
I used Regex.IsMatch method and it returned true when I passed "?someproperty", though "\w" does not include "?"
I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.
Clarifying what we want
First we need to know which kind of "valid" we want: valid according to the runtime or valid according to the language? Examples:
Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
For verbatim identifiers (@ prefix) the language treats the @ as part of the identifier, but the runtime doesn't see it.
Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.
"Valid" based on the runtime
C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA 334 spec. Its rule says:
The rules for identifiers given in this subclause correspond exactly
to those recommended by the Unicode Standard Annex 15 except that
underscore is allowed as an initial character (as is traditional in
the C programming language), Unicode escape sequences are permitted in
identifiers, and the “@” character is allowed as a prefix to enable
keywords to be used as identifiers.
The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:
@"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
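A small sketch of how that runtime-level pattern behaves on a few inputs. The helper name IsRuntimeIdentifier and the sample strings are hypothetical, and the code assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// The runtime-level pattern from the text above.
const string RuntimeIdentifier =
    @"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";

// Hypothetical helper name, used only for this illustration.
bool IsRuntimeIdentifier(string s) => Regex.IsMatch(s, RuntimeIdentifier);

Console.WriteLine(IsRuntimeIdentifier("Name"));         // letters only
Console.WriteLine(IsRuntimeIdentifier("_value1"));      // underscore start, digit later
Console.WriteLine(IsRuntimeIdentifier("Foo\u0123Bar")); // the compiler has already turned \u0123 into a letter
Console.WriteLine(IsRuntimeIdentifier("1abc"));         // digit first: rejected
Console.WriteLine(IsRuntimeIdentifier("a-b"));          // hyphen is not in any allowed category
```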
All the ugly details
The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforce it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0112\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input and be done with it.
The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and Elmo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
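The three details above can each be handled outside the main regex. The following is only a sketch under that assumption: the sample strings are invented, and it assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// 1. Normalization: string.Normalize() defaults to Form C.
string decomposed = "e\u0301";            // 'e' followed by COMBINING ACUTE ACCENT
string composed = decomposed.Normalize(); // becomes the single character U+00E9
Console.WriteLine(decomposed.IsNormalized());
Console.WriteLine(composed.Length);

// 2. Formatting characters: remove category Cf (here a soft hyphen) before comparing.
string withoutFormatting = Regex.Replace("El\u00ADmo", @"\p{Cf}", "");
Console.WriteLine(withoutFormatting == "Elmo");

// 3. Reserved names: a separate check for a double underscore.
bool usesReservedForm = "my__name".Contains("__");
Console.WriteLine(usesReservedForm);
```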
Nesting, generics, etc.
Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.
"Valid" based on the language
This is just the previous regex but also allowing for:
Prefixed @
Unicode escape sequences
The first is a trivial modification: just add @?.
Unicode escape sequences are of form @"\\[Uu][\dA-Fa-f]{4}" (strictly, the C# grammar gives the \U form eight hex digits; this simplified pattern treats both forms as four). We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass to convert the escape sequences: replace all \\[Uu][\dA-Fa-f]{4} with the corresponding character.
So putting it all together, a check for whether a string is valid from a C# language standpoint would be:
// Requires: using System; using System.Globalization; using System.Text.RegularExpressions;
bool IsValidIdentifier(string input)
{
    if (input is null) { throw new ArgumentNullException(nameof(input)); }

    // Technically the input must be in normal form C. Implementations aren't required
    // to verify that though, so you could remove this check if your runtime doesn't
    // mind.
    if (!input.IsNormalized())
    {
        return false;
    }

    // Convert escape sequences to the characters they represent. The only escape
    // sequences handled here are of form \u0000 or \U0000, where 0 is a hex digit.
    // (Strictly, the C# grammar gives \U eight hex digits; this is a simplification.)
    MatchEvaluator replacer = (Match match) =>
    {
        string hex = match.Groups[1].Value;
        var codepoint = int.Parse(hex, NumberStyles.HexNumber);
        return new string((char)codepoint, 1);
    };
    var escapeSequencePattern = @"\\[Uu]([\dA-Fa-f]{4})";
    var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer, RegexOptions.CultureInvariant);

    // Now do the real check.
    var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
    return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}
Back to the original question
The asker is long gone, but I feel obliged to include an answer to the actual question:
// Requires: using System.Linq;
// Split on the dot so that "Name", "Parent.Container" and longer paths are all handled.
string[] parts = input.Split('.');
return parts.All(IsValidIdentifier);
Sources
ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7
Not the best, but this will work. Demo here.
^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$
Note that
* Digits 0-9 are not allowed as the first character
* @ is allowed only as the first character, but not anywhere else (the compiler will strip it off though)
* _ is allowed
Edit
Looking at your requirement, the regex below will be more useful, as the input property name need not have @ in it. Check here.
^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$
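A quick way to try the second pattern against a handful of inputs is sketched below. The sample inputs are invented for illustration, and the code assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the edit above: dot-separated parts, each starting with a letter or underscore.
var propertyPath = new Regex(@"^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$");

string[] candidates = { "Name", "Parent.Container", "_a.b1.c", "1abc", "?someproperty", "a..b", "a." };
foreach (string candidate in candidates)
{
    Console.WriteLine($"{candidate}: {propertyPath.IsMatch(candidate)}");
}
```

Keep in mind that \w in .NET also matches non-ASCII letters and digits after the first character, so this is looser than a pure ASCII check.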
What you posted in the comments is almost right. But it won't detect single properties like "Name".
^(?:[\w]+\.)*\w+$
This works as expected. I just changed the + to * and made the group non-capturing, since you are not concerned about groups here.
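The difference between the question's pattern and this one can be checked with a short sketch. It assumes a C# 9 or later compiler (top-level statements); the variable names are illustrative. The question's pattern accepts "?someproperty" because its second alternative, ([\w]+), is not anchored and so matches anywhere inside the input.

```csharp
using System;
using System.Text.RegularExpressions;

string original = @"^([\w]+\.)+[\w]+$|([\w]+)"; // pattern from the question
string anchored = @"^(?:[\w]+\.)*\w+$";         // pattern from this answer

// The unanchored alternative finds "someproperty" inside the input.
bool originalAcceptsBadInput = Regex.IsMatch("?someproperty", original);
bool anchoredAcceptsBadInput = Regex.IsMatch("?someproperty", anchored);
Console.WriteLine(originalAcceptsBadInput);
Console.WriteLine(anchoredAcceptsBadInput);

Console.WriteLine(Regex.IsMatch("Name", anchored));
Console.WriteLine(Regex.IsMatch("Parent.Container", anchored));

// \w includes digits, so a leading digit still passes this pattern.
Console.WriteLine(Regex.IsMatch("1abc", anchored));
```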
So I'm working on a challenge in which I have to take in user input, check whether it contains an escape sequence, and then execute the escape sequence.
My question is: why do escape sequences execute in predetermined string variables, but not when you take a user's input and store it in a variable? The input may contain an escape sequence such as \n, but it does not execute.
No user input, e.g.:
string noInput = "this is an escape \n sequence";
Console.WriteLine(noInput);
Console.ReadLine();
Output is: this is an escape
 sequence
With user input, e.g.:
string input = Console.ReadLine();
Console.WriteLine(input);
Console.ReadLine();
Output is: this is an escape \n sequence
Hopefully I explained my question well enough. I'm assuming this may be because of security, but I would like to know the answer.
"Escape sequence" is a feature of the language/compiler, in this case C#.
The relevant part of the language specification is section 2.4.4.5, String literals.
Note that the reference is to an older version of the language specification, but it still applies.
The latest version can be found here.
From the spec:
A character that follows a backslash character (\) in a regular-string-literal-character must be one of the following characters: ', ", \, 0, a, b, f, n, r, t, u, U, x, v. Otherwise, a compile-time error occurs. The example
string a = "hello, world"; // hello, world
string b = @"hello, world"; // hello, world
string c = "hello \t world"; // hello      world
string d = @"hello \t world"; // hello \t world
The point is that a .NET language is free to define which special characters in a string literal are treated as escape sequences. Typically, though, it is the set that has been used for ages in languages like C and C++.
When you accept user input, the input is (obviously?) treated as a literal string. (Another way to think of it: a compiled .NET program is compiler and language independent, and the runtime, a.k.a. the CLR, has no concept of escape sequences in strings.)
If you wish to provide such a feature (maybe you have a good scenario), you have limited options:
* Use upcoming compiler features like Roslyn to process the input string for you. I have never personally looked at which specific API in Roslyn will help you do that, but it has to be there, given that Roslyn is supposed to be the compiler itself.
  Note that a con of this approach is that Roslyn may be pretty heavyweight to include in your app for only one feature.
* Write a small routine yourself, which tries to perform the same escaping as the compiler. For production-quality code this can be tricky: you have to understand and follow the specification to match it exactly, and perhaps keep your implementation up to date, as it may change with future versions of C# (what if a new escape sequence is introduced?).
  Practically speaking, escape sequences in the C# specification should not change willy-nilly, but I would not bet on it.
* Find a third-party library which already does it for you (included for the sake of completeness of the answer).
EDIT: Proof that the string you see in source code is only an artifact of the source code in the given language:
Compile a C# app with the string "Hello\nWorld" in it. Open the compiled binary in a binary editor. The string you find in the compiled binary will not contain the "\n"; it is replaced with the appropriate bytes for the newline character.
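For the "write a small routine yourself" option, a deliberately incomplete sketch might look like the following. The helper name Unescape is hypothetical, it covers only a handful of the escapes listed in the spec quote (no \u, \x, \0 and so on), and it assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text;

// Hypothetical helper: converts a small subset of C#-style escapes in user-typed text.
static string Unescape(string input)
{
    var sb = new StringBuilder(input.Length);
    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c != '\\' || i == input.Length - 1)
        {
            sb.Append(c); // ordinary character, or a trailing backslash
            continue;
        }

        char next = input[++i];
        switch (next)
        {
            case 'n': sb.Append('\n'); break;
            case 'r': sb.Append('\r'); break;
            case 't': sb.Append('\t'); break;
            case '\\': sb.Append('\\'); break;
            case '"': sb.Append('"'); break;
            default: sb.Append('\\').Append(next); break; // unknown escape: keep as typed
        }
    }
    return sb.ToString();
}

// A verbatim literal stands in for what Console.ReadLine() would return.
string typed = @"this is an escape \n sequence";
Console.WriteLine(Unescape(typed));
```

Unknown escapes are kept as typed here rather than rejected; a routine that mirrors the compiler would report an error instead.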
When \n appears in a predetermined string, it is treated as a single character. When the user types '\' and then 'n', they are treated as two separate characters, so with user input your string is one character longer.
Try using substrings or other string manipulation to find the \n part of the user input and achieve what you want. Check it here:
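The one-character-versus-two point can be checked directly. This is a minimal sketch assuming a C# 9 or later compiler (top-level statements); the verbatim literal stands in for text a user would type.

```csharp
using System;

string fromSource = "a\nb"; // the compiler turns \n into one line-feed character
string asTyped = @"a\nb";   // what ReadLine returns when the user types: a \ n b

Console.WriteLine(fromSource.Length); // 3
Console.WriteLine(asTyped.Length);    // 4

// Simple string manipulation: replace the two typed characters with a real line feed.
string converted = asTyped.Replace(@"\n", "\n");
Console.WriteLine(converted == fromSource);
```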
http://msdn.microsoft.com/en-us/library/vstudio/ms228362(v=vs.110).aspx
;)
To review regular expressions I read this tutorial. The tutorial mentions that \b matches a word boundary (between \w and \W characters). It also gives a link where you can install Expresso (a program that helps when creating regular expressions).
So I created my regular expression in Expresso and I do indeed get a match. But when I copy the same regex to Visual Studio, I do not get a match. Take a look:
Why am I not getting a match? In the Immediate window I am showing the content of the variable output. In Expresso I get a match and in Visual Studio I don't. Why?
The C# language and .NET Regular Expressions both have their own distinct set of backslash-escape sequences, but the C# compiler is intercepting the "\b" in your string and converting it into an ASCII backspace character, so the RegEx class never sees it. You need to make your string verbatim (prefix it with an at-symbol) or escape the backslash itself so that it is passed through to RegEx, like so:
@"\bCOMPILATION UNIT";
Or
"\\bCOMPILATION UNIT"
I'll say the .NET RegEx documentation does not make this clear. It took me a while to figure this out at first too.
Fun-fact: The \r and \n characters (carriage-return and line-break respectively) and some others are recognized by both RegEx and the C# language, so the end-result is the same, even if the compiled string is different.
You should use @"\bCOMPILATION UNIT". This is a verbatim literal. When you write "\b" instead, the compiler parses \b into a special character. You can also write "\\b", whose double backslash is parsed into a real backslash, but it's generally easier to just use verbatim literals when dealing with regex.
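The three spellings can be compared in a few lines. This is a sketch with an invented input string, assuming a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string input = "BEGIN COMPILATION UNIT";

// Regular literal: the compiler turns \b into a backspace character (U+0008).
bool regularLiteral = Regex.IsMatch(input, "\bCOMPILATION UNIT");

// Verbatim literal and escaped backslash: the regex engine sees \b, a word boundary.
bool verbatimLiteral = Regex.IsMatch(input, @"\bCOMPILATION UNIT");
bool escapedBackslash = Regex.IsMatch(input, "\\bCOMPILATION UNIT");

Console.WriteLine(regularLiteral);
Console.WriteLine(verbatimLiteral);
Console.WriteLine(escapedBackslash);
Console.WriteLine((int)"\b"[0]); // 8
```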
Both languages claim to use Perl style regular expressions. If I have one language test a regular expression for validity, will it work in the other? Where do the regular expression syntaxes differ?
The use case here is a C# (.NET) UI talking to an eventual Java back end implementation that will use the regex to match data.
Note that I only need to worry about matching, not about extracting portions of the matched data.
There are quite (a lot of) differences.
Character Class
* Character class subtraction [abc-[cde]]
    * .NET: YES (2.0)
    * Java: Emulated via character class intersection and negation: [abc&&[^cde]]
* Character class intersection [abc&&[cde]]
    * .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
    * Java: YES
* \p{Alpha} POSIX character class
    * .NET: NO
    * Java: YES (US-ASCII)
* Under (?x) mode COMMENTS/IgnorePatternWhitespace, space (U+0020) in character class is significant.
    * .NET: YES
    * Java: NO
* Unicode Category (L, M, N, P, S, Z, C)
    * .NET: YES, \p{L} form only
    * Java: YES
        * From Java 5: \pL, \p{L}, \p{IsL}
        * From Java 7: \p{general_category=L}, \p{gc=L}
* Unicode Category (Lu, Ll, Lt, ...)
    * .NET: YES, \p{Lu} form only
    * Java: YES
        * From Java 5: \p{Lu}, \p{IsLu}
        * From Java 7: \p{general_category=Lu}, \p{gc=Lu}
* Unicode Block
    * .NET: YES, \p{IsBasicLatin} only. (Supported Named Blocks)
    * Java: YES (name of the block is free-casing)
        * From Java 5: \p{InBasicLatin}
        * From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
* Spaces and underscores allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)
    * .NET: NO
    * Java: YES (Java 5)
Quantifier
* ?+, *+, ++ and {m,n}+ (possessive quantifiers)
    * .NET: NO
    * Java: YES
Quotation
* \Q...\E escapes a string of metacharacters
    * .NET: NO
    * Java: YES
* \Q...\E escapes a string of character class metacharacters (in character sets)
    * .NET: NO
    * Java: YES
Matching construct
* Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
    * .NET: YES
    * Java: NO
* Named capturing group and named backreference
    * .NET: YES
        * Capturing group: (?<name>regex) or (?'name'regex)
        * Backreference: \k<name> or \k'name'
    * Java: YES (Java 7)
        * Capturing group: (?<name>regex)
        * Backreference: \k<name>
* Multiple capturing groups can have the same name
    * .NET: YES
    * Java: NO (Java 7)
* Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
    * .NET: YES
    * Java: NO
Assertions
* (?<=text) (positive lookbehind)
    * .NET: Variable-width
    * Java: Obvious width
* (?<!text) (negative lookbehind)
    * .NET: Variable-width
    * Java: Obvious width
Mode Options/Flags
* ExplicitCapture option (?n)
    * .NET: YES
    * Java: NO
Miscellaneous
* (?#comment) inline comments
    * .NET: YES
    * Java: NO
References
regular-expressions.info - Comparison of Different Regex Flavors
MSDN Library Reference - .NET Framework 4.5 - Regular Expression Language
Pattern (Java Platform SE 7)
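A few of the constructs listed above as .NET-supported can be exercised with a short sketch. Only the .NET side is shown; the sample inputs are invented, and the code assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Character class subtraction: lowercase letters except vowels.
bool consonant = Regex.IsMatch("b", @"^[a-z-[aeiou]]$");
bool vowel = Regex.IsMatch("e", @"^[a-z-[aeiou]]$");

// Variable-width lookbehind: any amount of whitespace after the label.
string amount = Regex.Match("price:   42", @"(?<=price:\s*)\d+").Value;

// Conditional matching: a closing parenthesis is required only if one was opened.
string conditional = @"^(\()?\d+(?(1)\))$";
bool wrapped = Regex.IsMatch("(12)", conditional);
bool bare = Regex.IsMatch("12", conditional);
bool unclosed = Regex.IsMatch("(12", conditional);

// Inline comment inside the pattern.
bool withComment = Regex.IsMatch("ab", @"^a(?#inline comment)b$");

Console.WriteLine($"{consonant} {vowel} {amount} {wrapped} {bare} {unclosed} {withComment}");
```

Patterns like these would need rewriting before being handed to a Java back end, which is the practical concern in the question.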
Check out: http://www.regular-expressions.info/refflavors.html
Plenty of regex info on that site, and there's a nice chart that details the differences between java & .net.
C# regex has its own convention for named groups: (?<name>). I don't know of any other differences.
Java uses standard Perl-type regex as well as POSIX regex. Looking at the C# documentation on regexes, it looks like Java has all of the C# regex syntax, but not the other way around.
Compare them yourself: Java: C#:
EDIT:
Currently, no other regex flavor supports Microsoft's version of named capture.
.NET Regex supports counting, so you can match nested parentheses which is something you normally cannot do with a regular expression. According to Mastering Regular Expressions that's one of the few implementations to do that, so that could be a difference.
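The counting feature refers to .NET balancing groups. The sketch below uses a commonly cited form of the balanced-parentheses pattern; the group name open is arbitrary, and the code assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// (?<open>\()   pushes a capture for every '('.
// (?<-open>\))  pops one capture for every ')', and fails if there is nothing to pop.
// (?(open)(?!)) fails at the end if any '(' is still unclosed.
var balanced = new Regex(@"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

foreach (string sample in new[] { "(a(b)c)", "((a)", "a)b(", "" })
{
    Console.WriteLine($"\"{sample}\": {balanced.IsMatch(sample)}");
}
```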
From my experience:
Java 7 regular expressions as compared to .NET 2.0 regular expressions:
* Underscore symbol in group names is not supported.
* Groups with the same name (in the same regular expression) are not supported (although it may be really useful in expressions using "or"!).
* Groups having captured nothing have a value of null and not of an empty string.
* Group with index 0 also contains the whole match (same as in .NET) BUT is not included in groupCount().
* Group back reference in replace expressions is also denoted with a dollar sign (e.g. $1), but if the same expression contains a dollar sign as the end-of-line marker, then the back reference dollar should be escaped (\$); otherwise in Java we get the "illegal group reference" error.
* End-of-line symbol ($) behaves greedily. Consider, for example, the following expression (Java-string is given): "bla(bla(?:$|\r\n))+)?$". Here the last line of text will NOT be captured! To capture it, we must substitute "$" with "\z".
* There is no "Explicit Capture" mode.
* Empty string doesn't satisfy the ^.{0}$ pattern.
* Symbol "-" must be escaped when used inside square brackets. That is, pattern "[a-z+-]+" doesn't match string "f+g-h" in Java, but it does in .NET. To match in Java, the pattern should look like this (Java-string is given): "[a-z+\-]+".
NOTE: "(Java-string is given)" is there just to explain the double escapes in the expression.
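For reference, the .NET half of a few of these observations can be checked as follows. This sketch does not verify the Java behavior described above, which remains the author's report; it assumes a C# 9 or later compiler (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// A group that captured nothing: .NET reports an empty string rather than null.
Match m = Regex.Match("b", "(a)?b");
Console.WriteLine(m.Groups[1].Success);
Console.WriteLine(m.Groups[1].Value == string.Empty);

// Groups.Count in .NET includes group 0 (the whole match).
Console.WriteLine(m.Groups.Count);

// The empty string satisfies ^.{0}$ in .NET.
bool emptyMatches = Regex.IsMatch("", "^.{0}$");
Console.WriteLine(emptyMatches);

// An unescaped trailing '-' inside square brackets is a literal hyphen in .NET.
bool hyphenMatches = Regex.IsMatch("f+g-h", "^[a-z+-]+$");
Console.WriteLine(hyphenMatches);
```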