So I'm working on this challenge in which I have to take in user input, check whether it contains an escape sequence, and then execute the escape sequence.
My question is: why do escape sequences execute in predetermined string variables, but not when you take a user's input and store it in a variable? That input happens to contain an escape sequence such as \n, but it does not execute.
No user input Ex:
string noInput = "This is an escape \n sequence";
Console.WriteLine(noInput);
Console.ReadLine();
Output is : This is an escape
sequence
or user input Ex:
string input = Console.ReadLine();
Console.WriteLine(input);
Console.ReadLine();
Output is : This is an escape \n sequence
Hopefully I explained my question well enough. I'm assuming this may be for security reasons, but I would like to know the answer.
"Escape sequence" is a feature of the language/compiler, in this case C#.
The relevant language specification can be found at - 2.4.4.5 String literals
Note that the reference is to an older version of language specification, but still applies.
Latest version can be found here.
From the spec -
A character that follows a backslash character (\) in a regular-string-literal-character must be one of the following characters: ', ", \, 0, a, b, f, n, r, t, u, U, x, v. Otherwise, a compile-time error occurs. The example
string a = "hello, world"; // hello, world
string b = @"hello, world"; // hello, world
string c = "hello \t world"; // hello world
string d = @"hello \t world"; // hello \t world
The point is that a .NET language is free to define which special characters in a string literal are treated as escape sequences. However, they are typically the ones that have been used for ages in languages like C and C++.
When you accept user input, the input is (obviously?) treated as a literal string. (Another way to think about it: a compiled .NET program is compiler- and language-independent, and the runtime, a.k.a. the CLR, doesn't have the concept of escape sequences in strings.)
If you wish to provide such a feature (maybe you have a good scenario), you have limited options:
Use the compiler platform, Roslyn, to process the input string for you. I have never personally looked at which specific API in Roslyn will help you do that, but it has to be there, given that Roslyn is supposed to be the compiler itself.
Note that a con of this approach is that Roslyn may be pretty heavyweight to include in your app for only one feature.
Write a small routine yourself that tries to perform the same unescaping as the compiler. For production-quality code this can be tricky: you have to understand and follow the specification to match it exactly, and perhaps keep your implementation up to date, since it may change with future versions of C# (what if a new escape sequence is introduced?).
Practically speaking, escape sequences in the C# specification should not change willy-nilly, but I would not bet on it.
Find a third-party library that already does it for you (included for the sake of completeness of the answer).
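As a rough illustration of the "write a small routine yourself" option, here is a minimal sketch. The class and method names (EscapeDemo, UnescapeSimple) are made up for this example, and it only handles a handful of simple escapes; it deliberately ignores \u, \U, \x and the rest of the specification, so treat it as a starting point rather than a match for the compiler.

```csharp
using System;
using System.Text;

public static class EscapeDemo
{
    // Hypothetical helper: converts a few common two-character escapes typed by
    // the user (a backslash followed by a character) into the single character
    // they stand for. Anything it does not recognise is copied through as-is.
    public static string UnescapeSimple(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '\\' || i + 1 >= input.Length)
            {
                sb.Append(c);   // ordinary character, or a trailing lone backslash
                continue;
            }

            char next = input[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 't': sb.Append('\t'); break;
                case 'r': sb.Append('\r'); break;
                case '0': sb.Append('\0'); break;
                case '\\': sb.Append('\\'); break;
                case '"': sb.Append('"'); break;
                case '\'': sb.Append('\''); break;
                default:
                    // Unknown escape: keep both characters unchanged.
                    sb.Append('\\').Append(next);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With that in place, something like Console.WriteLine(EscapeDemo.UnescapeSimple(Console.ReadLine() ?? "")) would print the user's typed \n as an actual line break.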
EDIT: Proof that the string you see (in source code), is only an artifact of the source code in given language -
Compile a C# app, with string "Hello\nWorld" in it. Open the compiled binary in a binary editor. The string you'd find in the compiled binary will be without the "\n", replaced with the appropriate bytes for new line character.
When it is in a predetermined string, it is treated as a single character. When the user types '\' and then 'n', they are treated as two different characters. So in the case of user input, your string is one character longer.
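A quick way to see the difference is to compare lengths. This is just a sketch; LengthDemo and its field names are invented for illustration, and the verbatim literal stands in for what a user would type at the console.

```csharp
public static class LengthDemo
{
    // The compiler turns \n in a regular literal into one newline character.
    public static readonly string Compiled = "Hello\nWorld";

    // A verbatim literal keeps the backslash and the 'n' as two separate
    // characters, which is what typing \n at the console produces.
    public static readonly string Typed = @"Hello\nWorld";
}
```

Here LengthDemo.Compiled.Length is 11 and LengthDemo.Typed.Length is 12, and only the second one contains a backslash character.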
Try using substrings or other string manipulation to achieve what you want and find the \n part of the user input. Check it here:
http://msdn.microsoft.com/en-us/library/vstudio/ms228362(v=vs.110).aspx
;)
Related
When receiving an RTL string from a MySQL server that ends in a direction-agnostic character, the first char (string[0]) in the string array switches to be the ending char, as in the following example (which will hopefully render in the correct order here):
String str = "קוד (לדוגמה)";
Char a = str[0];
Char b = str[1];
In this example, a=( and b=ק, which is incorrect. a should = ק and b should = ו.
Using substring for character extraction yields the same result.
After further examination, I've learned that RTL strings are kept as LTR behind the scenes in most programming languages. Using the Unicode RTL symbol did not change the outcome.
This presents a unique problem for us: our ETL process requires iterating through all chars (not searching, since it appears regex can handle this use case), so we can't tell whether the 1st char was indeed a bracket or other symbol, or whether it was the ending character.
Any ideas on how to solve this problem would be appreciated, as we couldn't find an answer relevant to our case thus far.
Edit:
It appears the example code has the same problem we encounter when it is displayed in certain browsers.
The brackets are actually at the end of the string.
correct order: https://files.logoscdn.com/v1/files/35323612/content.png?signature=pvAgUwSaLB8WGf8u868Cv1eOqiM
Bug, which also happens with stack overflow display on some browsers: https://files.logoscdn.com/v1/files/35323580/content.png?signature=LNasMBU9NWEi_x3BeVSLG9FU5co
2nd edit:
After examining the MySQL binaries, it appears the string in MySQL starts with the bracket. However, I am unsure whether this is the proper way it should be stored, as every display we use (including but not limited to Visual Studio) shows it properly, and apart from char manipulation the string acts as if the brackets are at the end.
So to phrase the question better: how do all these systems, including MySQL Workbench, which is written in C# AFAIK, know whether to put the bracket at the beginning or the end?
After a lot of checking, it appears a common convention when using Unicode is to store the last character as the first, and vice versa, when it is an LTR or direction-neutral character in an RTL string.
The convention seems to differ a bit between text parsers, as is evident between browsers. However, the 1st char IS indeed the bracket in our case. And in the case where it's the first character, it will end up being displayed as the LAST character.
I recommend just checking the handling of your own specific storage, parsers and libraries.
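One way to do that check, independent of how any editor or browser renders the text, is to dump the stored code units in index order. The sketch below assumes nothing about MySQL itself; BidiDemo and CodeUnits are hypothetical names, and the sample string is written with \u escapes so that the source code cannot be reordered by the display.

```csharp
using System.Collections.Generic;

public static class BidiDemo
{
    // Returns each UTF-16 code unit of the string as "U+XXXX", in the order
    // the characters are stored (index 0 first), regardless of how the
    // string would be laid out on screen.
    public static List<string> CodeUnits(string s)
    {
        var result = new List<string>();
        foreach (char c in s)
        {
            result.Add("U+" + ((int)c).ToString("X4"));
        }
        return result;
    }
}
```

For example, string.Join(" ", BidiDemo.CodeUnits("(\u05E7\u05D5\u05D3")) gives U+0028 U+05E7 U+05D5 U+05D3, so the bracket really is at index 0 of that string no matter where a renderer draws it.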
All these string prefixes are legal to use in C#:
"text"
@"text"
$"text"
$@"text"
Why isn't this?
@$"text"
One would have thought that the order of these operators doesn't matter, because they have no other meaning in C# but to prefix strings. I cannot think of a situation when this inverted double prefix would not compile. Is the order enforced only for aesthetic purposes?
The @$ order for interpolated verbatim strings was not allowed before C# version 8, for no other reason than that it wasn't implemented. However, this is now possible, so both of these lines will work:
var string1 = $@"text";
var string2 = @$"text";
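A small sketch of what that means in practice, assuming a compiler targeting C# 8 or later (PrefixDemo and its method names are invented for this example): both prefix orders produce the same interpolated verbatim string.

```csharp
public static class PrefixDemo
{
    // $ first, then @: accepted since interpolated strings were introduced.
    public static string DollarAt(string name) => $@"C:\users\{name}";

    // @ first, then $: accepted from C# 8 onwards.
    public static string AtDollar(string name) => @$"C:\users\{name}";
}
```

Calling either method with "bob" returns C:\users\bob; the backslashes are kept verbatim and {name} is still interpolated.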
These prefixes aren't operators. They are only interpreted by the compiler. While it understands $@, it doesn't understand @$. Why? Because Microsoft's compiler team decided so.
However, support for the latter is planned for C# 8.0.
According to Microsoft Docs:
A verbatim interpolated string starts with the $ character followed by
the @ character.
The $ token must appear before the @ token in a verbatim interpolated
string.
Perhaps this is the way they designed it to be understood by the current version of the C# compiler.
I need to validate user input for a property name to retrieve.
For example user can type "Parent.Container" property for windows forms control object or just "Name" property. Then I use reflection to get value of the property.
What I need is to check if user typed legal symbols of c# property (or just legal word symbols like \w) and also this property can be composite (contain two or more words separated with dot).
I have this as of now, is this a right solution?
^([\w]+\.)+[\w]+$|([\w]+)
I used Regex.IsMatch method and it returned true when I passed "?someproperty", though "\w" does not include "?"
I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.
Clarifying what we want
First we need to know which valid we want: valid according to the runtime or valid according to the language? Examples:
Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
For verbatim identifiers (@ prefix) the language treats the @ as part of the identifier, but the runtime doesn't see it.
Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.
"Valid" based on the runtime
C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA 334 spec. Its rule says:
The rules for identifiers given in this subclause correspond exactly
to those recommended by the Unicode Standard Annex 15 except that
underscore is allowed as an initial character (as is traditional in
the C programming language), Unicode escape sequences are permitted in
identifiers, and the “@” character is allowed as a prefix to enable
keywords to be used as identifiers.
The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:
@"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
All the ugly details
The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforces it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0100\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input to be done with it.
The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and Elmo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
Nesting, generics, etc.
Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.
"Valid" based on the language
This is just the previous regex but also allowing for:
Prefixed @
Unicode escape sequences
The first is a trivial modification: just add @?.
Unicode escape sequences are of form @"\\[Uu][\dA-Fa-f]{4}" (strictly, the \U form takes eight hex digits in C#). We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass to convert the escape sequences: replace every escape sequence with the corresponding character.
So putting it all together, a check for whether a string is valid from a C# language standpoint would be:
// Requires: using System; using System.Globalization; using System.Text.RegularExpressions;
bool IsValidIdentifier(string input)
{
    if (input is null) { throw new ArgumentNullException(nameof(input)); }

    // Technically the input must be in normal form C. Implementations aren't required
    // to verify that though, so you could remove this check if your runtime doesn't
    // mind.
    if (!input.IsNormalized())
    {
        return false;
    }

    // Convert escape sequences to the characters they represent. The only allowed escape
    // sequences are \uXXXX (four hex digits) and \UXXXXXXXX (eight hex digits).
    MatchEvaluator replacer = (Match match) =>
    {
        string hex = match.Groups[1].Success ? match.Groups[1].Value : match.Groups[2].Value;
        int codepoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        // Leave escapes that don't name a valid Unicode scalar value untouched; the
        // leftover backslash then fails the identifier check below.
        bool isScalar = codepoint >= 0 && codepoint <= 0x10FFFF
            && (codepoint < 0xD800 || codepoint > 0xDFFF);
        return isScalar ? char.ConvertFromUtf32(codepoint) : match.Value;
    };
    var escapeSequencePattern = @"\\u([\dA-Fa-f]{4})|\\U([\dA-Fa-f]{8})";
    var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer, RegexOptions.CultureInvariant);

    // Now do the real check.
    var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
    return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}
Back to the original question
The asker is long gone, but I feel obliged to include an answer to the actual question:
// Split on the dot so that both "Name" and "Parent.Container" are handled;
// every segment must be a valid identifier on its own.
string[] parts = input.Split('.');
return Array.TrueForAll(parts, IsValidIdentifier);
Sources
ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7
Not the best, but this will work. Demo here.
^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$
Note that
* Number 0-9 is not allowed as first character
* @ is allowed only as first character, but not anywhere else (compiler will strip off though)
* _ is allowed
Edit
Looking at your requirement, the below Regex will be more useful, as input property name need not have @ in it. Check here.
^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$
What you posted in the comments is almost right. But it won't detect single properties like "Name".
^(?:[\w]+\.)*\w+$
Works as expected. Just changed the + to * and the group to non-capturing group since you are not concerned about groups here.
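The reason the pattern from the question accepted "?someproperty" is that | has the lowest precedence in a regex, so ^ and $ only apply to the first alternative; the second alternative, ([\w]+), is unanchored and matches the "someproperty" part anywhere in the input. The sketch below contrasts the two patterns; PropertyPathDemo and its members are names made up for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class PropertyPathDemo
{
    // Pattern from the question: the second alternative has no anchors.
    public const string Original = @"^([\w]+\.)+[\w]+$|([\w]+)";

    // Anchored pattern from the answer above.
    public const string Anchored = @"^(?:[\w]+\.)*\w+$";

    public static bool MatchesOriginal(string s) => Regex.IsMatch(s, Original);

    public static bool MatchesAnchored(string s) => Regex.IsMatch(s, Anchored);
}
```

MatchesOriginal("?someproperty") is true, while MatchesAnchored("?someproperty") is false and both "Name" and "Parent.Container" still pass. Keep in mind that \w also accepts digits, so a segment such as "1abc" passes this check even though it is not a legal C# identifier.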
Hola. I'm failing to write a method to test for words within a plain text or html document. I was reasonably literate with regex, and I am newer to c# (from way more java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext, @"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes. for example, the distinct value "c++".
edit:
a real excerpt is given below:
"...administration- c, c++, perl, shell programming..."
The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character.
If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
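A possible alternative to the (^|\W) and (\W|$) pair is to use lookarounds, which assert "no word character on this side" without consuming anything and also succeed at the start and end of the text. This is a swapped-in technique rather than what the answer above proposes, and SkillMatchDemo and ContainsSkill are hypothetical names for the sketch.

```csharp
using System.Text.RegularExpressions;

public static class SkillMatchDemo
{
    // Returns true when the skill appears in the text and is not glued to a
    // word character (letter, digit or underscore) on either side.
    public static bool ContainsSkill(string text, string skill)
    {
        // (?<!\w) : the previous character, if any, is not a word character.
        // (?!\w)  : the next character, if any, is not a word character.
        string pattern = @"(?<!\w)" + Regex.Escape(skill) + @"(?!\w)";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }
}
```

With the excerpt from the question, ContainsSkill("...administration- c, c++, perl, shell programming...", "c++") is true, while "abc++" and "c++11" do not count as matches for "c++".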
Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.
I want to assign some XML to a string variable.
I can do this without escaping single or double quotes by using triple quotes in Python.
Is there a similar way to do this in F# or C#?
F# 3.0 supports triple quoted strings. See Visual Studio F# Team Blog Post on 3.0 features.
The F# 3.0 Spec Strings and Characters section specifically mentions the XML scenario:
A triple-quoted string is specified by using three quotation marks
(""") to ensure that a string that includes one or more escaped
strings is interpreted verbatim. For example, a triple-quoted string
can be used to embed XML blobs:
As far as I know, there is no syntax corresponding to this in C# / F#. If you use @"str" then you have to replace each quote with two quotes, and if you just use "str" then you need to add a backslash.
In any case, there is some encoding of ":
var str = @"allows
multiline, but still need to encode "" as two chars";
var str = "need to use backslash \" here";
However, the best thing to do when you need to embed large strings (such as XML data) into your application is probably to use .NET resources (or store the data somewhere else, depending on your application). Embedding large string literals in program is generally not very recommended. Also, there used to be a plugin for pasting XML as a tree that constructs XElement objects for C#, but I'm not sure whether it still exists.
Although, I would personally vote to add """ as known from Python to F# - it is very useful, especially for interactive scripting.
In case someone ran into this question when looking for triple-quoted strings in C# (rather than F#), C# 11 now has raw string literals, and they're (IMO) better than Python's (due to how indentation is handled)!
Raw string literals are a new format for string literals. Raw string literals can contain arbitrary text, including whitespace, new lines, embedded quotes, and other special characters without requiring escape sequences. A raw string literal starts with at least three double-quote (""") characters. It ends with the same number of double-quote characters. Typically, a raw string literal uses three double quotes on a single line to start the string, and three double quotes on a separate line to end the string. The newlines following the opening quote and preceding the closing quote are not included in the final content:
string longMessage = """
    This is a long message.
    It has several lines.
        Some are indented
                more than others.
    Some should start at the first column.
    Some have "quoted text" in them.
    """;
Any whitespace to the left of the closing double quotes will be removed from the string literal. Raw string literals can be combined with string interpolation to include braces in the output text. Multiple $ characters denote how many consecutive braces start and end the interpolation:
var location = $$"""
    You are at {{{Longitude}}, {{Latitude}}}
    """;
The preceding example specifies that two braces start and end an interpolation. The third repeated opening and closing brace are included in the output string.
https://devblogs.microsoft.com/dotnet/csharp-11-preview-updates/#raw-string-literals
https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-11
As shoosh said, you want to use the verbatim string literals in C#, where the string starts with @ and is enclosed in double quotation marks. The only exception is if you need to put a double quotation mark in the string, in which case you need to double it:
System.Console.WriteLine(@"Hello ""big"" world");
would output
Hello "big" world
http://msdn.microsoft.com/en-us/library/362314fe.aspx
In C# the syntax is @"some string"
see here