I have 2 strings of the same length.
I was assuming (probably wrongly) that inserting a space between each character of each string will not change their order.
var e1 = "12*4";
var e2 = "12-4";
Console.WriteLine(String.Compare(e1,e2)); // -1 (e1 < e2)
var f1 = "1 2 * 4";
var f2 = "1 2 - 4";
Console.WriteLine(String.Compare(f1,f2)); // +1 (f1 > f2)
If I insert other characters (_ x for instance), the order is preserved.
What's going on?
Thanks in advance.
If you use Ordinal comparison, you will get the right result.
The reason is that ordinal comparison works by evaluating the numeric value of each of the chars in the string object, so inserting spaces will make no difference.
If you use other types of comparisons, there are other things involved. From the documentation:
An operation that uses word sort rules performs a culture-sensitive
comparison wherein certain nonalphanumeric Unicode characters might
have special weights assigned to them. Using word sort rules and the
conventions of a specific culture, the hyphen ("-") might have a very
small weight assigned to it so that "coop" and "co-op" appear next to
each other in a sorted list.
An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the
string. An ordinal comparison is fast but culture-insensitive. When
you use ordinal sort rules to sort strings that start with Unicode
characters (U+), the string U+xxxx comes before the string U+yyyy if
the value of xxxx is numerically less than yyyy.
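As a minimal sketch of the ordinal comparison described above (the class and method names are made up for illustration), the helper below reduces the result of String.Compare with StringComparison.Ordinal to its sign. Because '*' is U+002A and '-' is U+002D, the order no longer depends on whether spaces are inserted.

```csharp
using System;

public static class OrdinalCompareSketch
{
    // Hypothetical helper: returns -1, 0, or 1 from an ordinal
    // (code unit by code unit) comparison of the two strings.
    public static int OrdinalSign(string left, string right) =>
        Math.Sign(string.Compare(left, right, StringComparison.Ordinal));
}
```

Here OrdinalSign("12*4", "12-4") and OrdinalSign("1 2 * 4", "1 2 - 4") both return -1, so the two pairs sort the same way.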
From MSDN:
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
When comparing strings, you should call the Compare(String, String, StringComparison) method, which requires that you explicitly specify the type of string comparison that the method uses.
It suggests that there is some cultural issue which means that the last space changes the sort order of the two.
I'm trying to figure out how to handle filenames in Tamil. I need to shorten them like this:
"foobar.gif" -> "foo...gif".
I've learned today that some languages use more than one char to represent a letter and I discovered that C# has the Rune concept.
I can't get this to work with Tamil.
Take "தமிழ்.gif" for example:
I had hoped that "தமிழ்.gif".Length should be 6 but it's 9:
How can I do a proper substring, so that "தமிழ்.gif".Substring(0, 2) gives "தமி" instead of "தம"?
What am I missing?
This happens because a single visible character can be stored as more than one char. Surrogate pairs, which are pairs of char values that together represent one Unicode code point, are one such case; a base letter followed by combining marks, as in Tamil, is another.
See these questions regarding surrogate pairs: What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?
Is String.Replace(string,string) Unicode Safe in regards to Surrogate Pairs?
When a character spans more than one char, you have to work out the char indices at which each complete character starts and ends before you cut the string.
I should add that, because of this, you'll have to write some "Unicode-safe" methods for removing characters or finding indices; otherwise you may remove "half" of a valid Unicode character and be left with invalid Unicode.
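One way to get such a "Unicode-safe" substring is System.Globalization.StringInfo, which works on text elements instead of char values. The sketch below is an illustration only: the class and method names are hypothetical, and it assumes that the runtime's text element rules keep a Tamil letter together with its vowel sign or virama, which is how StringInfo treats combining marks.

```csharp
using System;
using System.Globalization;

public static class TextElementSketch
{
    // Number of text elements (user-perceived characters), not char values.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;

    // Hypothetical helper: returns the first 'count' text elements of s,
    // so a base letter is never separated from its combining marks.
    public static string TakeTextElements(string s, int count)
    {
        var info = new StringInfo(s);
        int n = Math.Min(Math.Max(count, 0), info.LengthInTextElements);
        return n == 0 ? string.Empty : info.SubstringByTextElements(0, n);
    }
}
```

With this, CountTextElements("தமிழ்.gif") gives 7 rather than 9, and TakeTextElements("தமிழ்.gif", 2) gives "தமி". The shortening itself ("foo...gif") can then be built from text element counts instead of char indices.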
I am trying to find the integers in a .txt file that I am reading in C# and divide them, but nothing I have tried has worked. Any tips?
this txt has values like this:
this is the string in the .txt file
queue1 6000
queue2 54888
queue3 1
but they change every second
expected output
queue1 0.6
queue2 5.4
queue3 0.0001
code
int value;
string text = System.IO.File.ReadAllText(@"N:\doc\mytext.txt");
if (int.TryParse(text, out value))
{
value/10000;
}
Console.ForegroundColor = ConsoleColor.Red;
System.Console.WriteLine(text);
thanks
There are several things your code needs to do:
Read the input
Find the target values in the input
Convert the target values into the desired result values
Replace the target values in the input string with the result values
Output the result
For this answer, I am assuming you are happy with your implementation of steps 1 and 5, so I'll only address steps 2, 3, and 4.
Step 2: find the target values
To find the target values in your input, you must first define what they are. According to your question and subsequent comments, your input string looks like this, with the target values bolded:
queue1 6000, queue2 54888, queue3 1
These are integer values, which are embedded in your text as discrete words.
The easiest way to find them is to use a regular expression. One thing that makes your case tricky is the fact that you have numbers embedded in your text that you do not want (e.g. in "queue1"). Fortunately, .NET regular expressions have a couple shortcuts that make it easy to write the necessary expression:
var re = new Regex(@"\b\d+\b");
\b matches a word boundary
\d matches any decimal digit
+ matches one or more of the preceding characters
So this regular expression will match distinct integers like "6000" while ignoring numbers embedded in other words like "queue1".
Step 3 - convert to desired result
This step has a couple sub-steps
Parse the string into a number
Calculate the desired result number
Format the desired result number
To parse the string into a number, you are using int.TryParse, which is one way to do it. Since the regular expression will find only valid integers, we can use the Parse method instead.
Your calculation is to divide by 10000, and you want the result to be a floating point number. If you use 10000 as your literal, the result will be an integer, so you need to do something to ensure that the result is floating point. An easy way is to use 10000.0 as your literal, like so:
int value = 6000;
var result = value / 10000.0; // typeof(result) == double
There are several options to format your output. The simplest starting point is to use the ToString method. The default behavior is often good enough, but if you want to specify the format, you can use the variant that takes a format string.
So the resulting code to parse, calculate, and format is like so:
(Int64.Parse(value) / 10000.0).ToString()
Step 4: replace the target values in the input
Since regular expressions are useful for finding things, they are also used to replace things. In your case, the replacement requires some logic that may not be handled by a straight regular expression (i.e., step 2). The .NET regular expression Replace method provides an overload with a MatchEvaluator parameter for just this scenario.
Each match that is found by the regular expression is passed to the match evaluator, which is responsible for returning the string to be used as the replacement. To make things simple, you can use a lambda expression to supply the match evaluator.
So when you put it all together, you get something like this:
string text = @"queue1 6000, queue2 54888, queue3 1";
var re = new Regex(@"\b\d+\b");
string output = re.Replace(text, m => (Int64.Parse(m.Value) / 10000.0).ToString());
// output == "queue1 0.6, queue2 5.4888, queue3 0.0001"
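ToString() without arguments uses the current culture, so on a machine whose culture uses a decimal comma the output above would contain "0,6" instead of "0.6". The following is a sketch of the same find, convert, and replace steps with the culture and format pinned down. The class name, the method name, and the "0.####" format string are choices made for this example, not part of the answer above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class QueueScaler
{
    // Matches whole-word integers such as "6000" but not the digit in "queue1".
    private static readonly Regex StandaloneInteger = new Regex(@"\b\d+\b");

    // Divides every standalone integer in 'text' by 'divisor' and formats the
    // result with the invariant culture, keeping at most four decimal places.
    public static string ScaleIntegers(string text, double divisor = 10000.0) =>
        StandaloneInteger.Replace(text, m =>
            (long.Parse(m.Value, CultureInfo.InvariantCulture) / divisor)
                .ToString("0.####", CultureInfo.InvariantCulture));
}
```

For the sample input, QueueScaler.ScaleIntegers("queue1 6000, queue2 54888, queue3 1") returns "queue1 0.6, queue2 5.4888, queue3 0.0001" regardless of the machine's regional settings.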
You can use Regular expressions to get all of the numbers in a string. The string could be the content of a file that you got from System.IO.File.ReadAlltext (for example).
You can use "[0-9]+" to get all of the integers or @"[0-9]+(\.[0-9]+)?" to get both integers and floating points.
var pattern = "[0-9]+"; // or @"[0-9]+(\.[0-9]+)?"
int[] integers = System.Text.RegularExpressions.Regex.Matches(text, pattern).Cast<Match>().Select(m => int.Parse(m.Value)).ToArray();
// for floats
//float[] floats = System.Text.RegularExpressions.Regex.Matches(text, pattern).Cast<Match>().Select(m => float.Parse(m.Value)).ToArray();
In regular expressions you define a pattern and then search the string for tokens that match that pattern. If we need to search for integers, we can use the "[0-9]+" pattern. It matches runs of one or more characters in the range 0-9 (inclusive); the '+' requires at least one digit.
If we need to search for floating points, we use @"[0-9]+(\.[0-9]+)?", which is the integer pattern ("[0-9]+") followed by an optional fractional part (@"(\.[0-9]+)?"). The dot is escaped so that it matches a literal '.' rather than any character, and the '?' means the group can occur zero or one time.
You can find more information about Regular Expressions and patterns here
You can use Regular Expressions for that:
Regex.Matches(text, @"(?<=^|\s)[0-9]+(?=$|\s)").Cast<Match>().Select(m => int.Parse(m.Value));
First, you select all parts of the string that resemble an integer, and then convert the resulting strings into integers.
(?=$|\s) is a Zero-width positive lookahead assertion. It checks if there is a whitespace or end of string at the end of the match.
(?<=^|\s) is a Zero-width positive lookbehind assertion. It checks if there is a whitespace or start of string at the beginning of the match.
You can find documentation on that here.
These are necessary so that the whitespaces are not included by the matches. If they were, you'd only get one of two integers if they are only separated by one whitespace.
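To see the lookaround version at work, here is a small self-contained sketch. The class and method names are invented for the example; the pattern is the one shown above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IntegerExtractor
{
    // Returns every integer that is delimited by whitespace or by the
    // start/end of the text. Digits glued to letters ("queue1") are skipped.
    public static int[] ExtractStandaloneIntegers(string text) =>
        Regex.Matches(text, @"(?<=^|\s)[0-9]+(?=$|\s)")
             .Cast<Match>()
             .Select(m => int.Parse(m.Value))
             .ToArray();
}
```

For the text "queue1 6000\nqueue2 54888\nqueue3 1" this yields 6000, 54888 and 1. For "1 2" it yields both 1 and 2, because the lookarounds do not consume the single space between them.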
I need to validate user input for a property name to retrieve.
For example user can type "Parent.Container" property for windows forms control object or just "Name" property. Then I use reflection to get value of the property.
What I need is to check if user typed legal symbols of c# property (or just legal word symbols like \w) and also this property can be composite (contain two or more words separated with dot).
I have this as of now, is this a right solution?
^([\w]+\.)+[\w]+$|([\w]+)
I used Regex.IsMatch method and it returned true when I passed "?someproperty", though "\w" does not include "?"
I was looking for this too, but I knew none of the existing answers are complete. After a little digging, here's what I found.
Clarifying what we want
First we need to know which valid we want: valid according to the runtime or valid according to the language? Examples:
Foo\u0123Bar is a valid property name for the C# language but not for the runtime. The difference is smoothed over by the compiler, which quietly converts the identifier to FooģBar.
For verbatim identifiers (@ prefix) the language treats the @ as part of the identifier, but the runtime doesn't see it.
Either could make sense depending on your needs. If you're feeding the validated text into Reflection methods such as GetProperty(string), you'll need the runtime-valid version. If you want the syntax that's more familiar to C# developers, though, you'd want the language-valid version.
"Valid" based on the runtime
C# version 5 is (as of 7/2018) the latest version with formal standards: the ECMA 334 spec. Its rule says:
The rules for identifiers given in this subclause correspond exactly
to those recommended by the Unicode Standard Annex 15 except that
underscore is allowed as an initial character (as is traditional in
the C programming language), Unicode escape sequences are permitted in
identifiers, and the “@” character is allowed as a prefix to enable
keywords to be used as identifiers.
The "Unicode Standard Annex 15" mentioned is Unicode TR 15, Annex 7, which formalizes the basic pattern as:
<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*
<identifier_start> ::= [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}]
<identifier_extend> ::= [{Mn}{Mc}{Nd}{Pc}{Cf}]
The {codes in curly braces} are Unicode classes, which map directly to Regex via \p{category}. So (after a little simplification) the basic regex to check for "valid" according to the runtime would be:
@"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$"
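As a quick sanity check of that pattern, the sketch below wraps it in a helper (the class and method names are hypothetical). Note that in .NET `$` also matches just before a trailing newline, so `\z` could be substituted if that matters for your input.

```csharp
using System.Text.RegularExpressions;

public static class RuntimeIdentifierSketch
{
    // The runtime-oriented pattern from above: a letter, letter-number or
    // underscore first, then letters, marks, digits, connectors or format chars.
    private static readonly Regex Identifier = new Regex(
        @"^[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$",
        RegexOptions.CultureInvariant);

    public static bool IsRuntimeValid(string name) =>
        name != null && Identifier.IsMatch(name);
}
```

IsRuntimeValid accepts "Name", "_x1" and "FooģBar", and rejects "1abc", "a-b" and the empty string.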
All the ugly details
The C# spec also requires that identifiers be in Unicode Normalization Form C. It doesn't require that the compiler actually enforces it, though. At least the Roslyn C# compiler allows non-normal-form identifiers (e.g., E\u0304\u0306) and treats them as distinct from equivalent normal-form identifiers (e.g., \u0100\u0306). And anyway, to my knowledge there's no sane way to represent such a rule with a regex. If you don't need/want the user to be able to differentiate properties that look exactly the same, my suggestion is to just run string.Normalize() on the user's input to be done with it.
The C# spec says that two identifiers are equivalent if they only differ by formatting characters. For example, Elmo (four characters) and Elmo (El\u00ADmo) are the same identifier. (Note: that's the soft-hyphen, which is normally invisible; some fonts may display it, though.) If the presence of invisible characters would cause you trouble, you can drop the \p{Cf} from the regex. That doesn't reduce which identifiers you accept—just which formats you accept.
The C# spec reserves identifiers containing "__" for its own use. Depending on your needs you may want to exclude that. That should likely be an operation separate from the regex.
Nesting, generics, etc.
Reflection, Type, IL, and perhaps other places sometimes show class names or method names with extra symbols. For example, a type name may be given as X`1+Y[T]. That extra stuff is not part of the identifier—it's an unrelated way of representing type information.
"Valid" based on the language
This is just the previous regex but also allowing for:
Prefixed @
Unicode escape sequences
The first is a trivial modification: just add @?.
Unicode escape sequences are of form @"\\[Uu][\dA-Fa-f]{4}". We may be tempted to wedge that into both [...] pairs and call it done, but that would incorrectly allow (for example) \u0000 as an identifier. We need to limit the escape sequences to ones that produce otherwise-acceptable characters. One way to do that is to do a pre-pass to convert the escape sequences: replace all \\[Uu][\dA-Fa-f]{4} with the corresponding character.
So putting it all together, a check for whether a string is valid from a C# language standpoint would be:
bool IsValidIdentifier(string input)
{
if (input is null) { throw new ArgumentNullException(); }
// Technically the input must be in normal form C. Implementations aren't required
// to verify that though, so you could remove this check if your runtime doesn't
// mind.
if (!input.IsNormalized())
{
return false;
}
// Convert escape sequences to the characters they represent. The only allowed escape
// sequences are of form \u0000 or \U0000, where 0 is a hex digit.
MatchEvaluator replacer = (Match match) =>
{
string hex = match.Groups[1].Value;
var codepoint = int.Parse(hex, NumberStyles.HexNumber);
return new string((char)codepoint, 1);
};
var escapeSequencePattern = @"\\[Uu]([\dA-Fa-f]{4})";
var withoutEscapes = Regex.Replace(input, escapeSequencePattern, replacer, RegexOptions.CultureInvariant);
// Now do the real check.
var isIdentifier = @"^@?[\p{L}\p{Nl}_][\p{Cf}\p{L}\p{Mc}\p{Mn}\p{Nd}\p{Nl}\p{Pc}]*$";
return Regex.IsMatch(withoutEscapes, isIdentifier, RegexOptions.CultureInvariant);
}
Back to the original question
The asker is long gone, but I feel obliged to include an answer to the actual question:
// Requires "using System.Linq;". Split on the dots and validate every part,
// so both "Name" and "Parent.Container" are accepted.
string[] parts = input.Split('.');
return parts.All(IsValidIdentifier);
Sources
ECMA 334 § 7.4.3; ECMA 335 § I.10; Unicode TR 15 Annex 7
Not the best, but this will work. Demo here.
^@?[a-zA-Z_]\w*(\.@?[a-zA-Z_]\w*)*$
Note that
* Number 0-9 is not allowed as first character
* @ is allowed only as first character, but not anywhere else (compiler will strip off though)
* _ is allowed
Edit
Looking at your requirement, the below Regex will be more useful, as input property name need not have @ in it. Check here.
^[a-zA-Z_]\w*(\.[a-zA-Z_]\w*)*$
What you posted in the comments is almost right. But it won't detect single properties like "Name".
^(?:[\w]+\.)*\w+$
Works as expected. Just changed the + to * and the group to non-capturing group since you are not concerned about groups here.
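The reason the pattern from the question accepted "?someproperty" is that alternation has the lowest precedence: the `^` and `$` belong only to the first branch, so the second branch `([\w]+)` matches anywhere in the input. A small sketch comparing the two patterns (the class and member names are made up for the example):

```csharp
using System.Text.RegularExpressions;

public static class PropertyPathSketch
{
    // Pattern from the question: the second alternative is not anchored.
    public const string Original = @"^([\w]+\.)+[\w]+$|([\w]+)";

    // Pattern from this answer: dot-separated \w+ segments, fully anchored.
    public const string Anchored = @"^(?:[\w]+\.)*\w+$";

    public static bool MatchesOriginal(string input) => Regex.IsMatch(input, Original);

    public static bool MatchesAnchored(string input) => Regex.IsMatch(input, Anchored);
}
```

MatchesOriginal("?someproperty") is true because "someproperty" satisfies the unanchored branch, while MatchesAnchored returns false for it and true for "Name" and "Parent.Container".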
This code:
string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", a.Length);
Console.WriteLine("Length b = {0}", b.Length);
outputs:
Length a = 3
Length b = 4
Why? The only thing I could imagine is that the Chinese character is 2 bytes long and that the .Length method returns the byte count.
Everyone else is giving the surface answer, but there's a deeper rationale too: the number of "characters" is a difficult-to-define question and can be surprisingly expensive to compute, whereas a length property should be fast.
Why is it difficult to define? Well, there's a few options and none are really more valid than another:
The number of code units (bytes or other fixed size data chunk; C# and Windows typically use UTF-16 so it returns the number of two-byte pieces) is certainly relevant, as the computer still needs to deal with the data in that form for many purposes (writing to a file, for example, cares about bytes rather than characters)
The number of Unicode codepoints is fairly easy to compute (although O(n) because you gotta scan the string for surrogate pairs) and might matter to a text editor.... but isn't actually the same thing as the number of characters printed on screen (called graphemes). For example, some accented letters can be represented in two forms: a single codepoint, or two points paired together, one representing the letter, and one saying "add an accent to my partner letter". Would the pair be two characters or one? You can normalize strings to help with this, but not all valid letters have a single codepoint representation.
Even the number of graphemes isn't the same as the length of a printed string, which depends on the font among other factors, and since some characters are printed with some overlap in many fonts (kerning), the length of a string on screen is not necessarily equal to the sum of the length of graphemes anyway!
Some Unicode points aren't even characters in the traditional sense, but rather some kind of control marker. Like a byte order marker or a right-to-left indicator. Do these count?
In short, the length of a string is actually a ridiculously complex question and calculating it can take a lot of CPU time as well as data tables.
Moreover, what's the point? Why do these metrics matter? Well, only you can answer that for your case, but personally, I find they are generally irrelevant. Limiting data entry I find is more logically done by byte limits, as that's what needs to be transferred or stored anyway. Limiting display size is better done by the display side software - if you have 100 pixels for the message, how many characters you fit depends on the font, etc., which isn't known by the data layer software anyway. Finally, given the complexity of the Unicode standard, you're probably going to have bugs at the edge cases anyway if you try anything else.
So it is a hard question with not a lot of general purpose use. Number of code units is trivial to calculate - it is just the length of the underlying data array - and the most meaningful/useful as a general rule, with a simple definition.
That's why b has length 4 beyond the surface explanation of "because the documentation says so".
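To make the competing definitions of "length" concrete, here is a sketch that computes four of them for the same string. The class and method names are invented for illustration; the APIs used (String.Length, Char.IsSurrogatePair, StringInfo, Encoding.UTF8) are standard .NET.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringLengths
{
    // UTF-16 code units: what String.Length reports, O(1).
    public static int CodeUnits(string s) => s.Length;

    // Unicode code points: a surrogate pair counts once, O(n).
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
            count++;
        }
        return count;
    }

    // Text elements (approximate graphemes): base character plus combining marks.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;

    // Bytes needed when the string is written out as UTF-8.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For "A\U00020213C" (the string b from the question) these give 4 code units, 3 code points, 3 text elements and 6 UTF-8 bytes. For "e\u0301" (an e followed by a combining acute accent) they give 2, 2, 1 and 3, which shows that even code points and displayed characters can disagree.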
From the documentation of the String.Length property:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
Your character at index 1 in "A𠈓C" is a SurrogatePair
The key point to remember is that surrogate pairs represent 32-bit
single characters.
You can try this code and it will return True
Console.WriteLine(char.IsSurrogatePair("A𠈓C", 1));
Char.IsSurrogatePair Method (String, Int32)
true if the s parameter includes adjacent characters at positions
index and index + 1, and the numeric value of the character at
position index ranges from U+D800 through U+DBFF, and the numeric
value of the character at position index+1 ranges from U+DC00 through
U+DFFF; otherwise, false.
This is further explained in String.Length property:
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters. The reason is that a
Unicode character might be represented by more than one Char. Use the
System.Globalization.StringInfo class to work with each Unicode
character instead of each Char.
As the other answers have pointed out, even though there are 3 visible characters, they are represented with 4 char objects, which is why the Length is 4 and not 3.
MSDN states that
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters.
However if what you really want to know is the number of "text elements" and not the number of Char objects you can use the StringInfo class.
var si = new StringInfo("A𠈓C");
Console.WriteLine(si.LengthInTextElements); // 3
You can also enumerate each text element like this
var enumerator = StringInfo.GetTextElementEnumerator("A𠈓C");
while(enumerator.MoveNext()){
Console.WriteLine(enumerator.Current);
}
Using foreach on the string will split the middle "letter" in two char objects and the printed result won't correspond to the string.
That is because the Length property returns the number of char objects, not the number of unicode characters. In your case, one of the Unicode characters is represented by more than one char object (SurrogatePair).
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters. The reason is that a
Unicode character might be represented by more than one Char. Use the
System.Globalization.StringInfo class to work with each Unicode
character instead of each Char.
As others said, it's not the number of characters in the string but the number of Char objects. The character 𠈓 is code point U+20213. Since the value is outside 16-bit char type's range, it's encoded in UTF-16 as the surrogate pair D840 DE13.
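The surrogate pair can be derived by hand: subtract 0x10000 from the code point, then add the top 10 bits of the result to 0xD800 and the bottom 10 bits to 0xDC00. The sketch below spells that out; the class and method names are hypothetical, and in real code char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateSketch
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogates.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

For U+20213 the offset is 0x10213, which gives 0xD840 and 0xDE13, the same two char values that char.ConvertFromUtf32(0x20213) returns.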
The way to get the length in characters was mentioned in the other answers. However, it should be used with care, as there can be many ways to represent a character in Unicode. "à" may be 1 composed character or 2 characters (a + diacritic). Normalization may be needed, as in the case of Twitter.
You should read this
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This is because Length counts UTF-16 code units (char values), which equals the number of characters only when every code point is no larger than U+FFFF. This set of code points is known as the Basic Multilingual Plane (BMP), and each of them fits in a single 2-byte char.
Unicode code points outside of the BMP are represented in UTF-16 using 4-byte surrogate pairs.
To correctly count the number of characters (3), use StringInfo
StringInfo b = new StringInfo("A𠈓C");
Console.WriteLine(string.Format("Length 2 = {0}", b.LengthInTextElements));
Okay, in .Net and C# all strings are encoded as UTF-16LE. A string is stored as a sequence of chars. Each char encapsulates the storage of 2 bytes or 16 bits.
What we see "on paper or screen" as a single letter, character, glyph, symbol, or punctuation mark can be thought of as a single Text Element. As described in Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION, each Text Element is represented by one or more Code Points. An exhaustive list of Codes can be found here.
Each Code Point needs to be encoded into binary for internal representation by a computer. As stated, each char stores 2 bytes. Code Points at or below U+FFFF can be stored in a single char. Code Points above U+FFFF are stored as a surrogate pair, using two chars to represent a single Code Point.
Given what we now know we can deduce, a Text Element can be stored as one char, as a Surrogate Pair of two chars or, if the Text Element is represented by multiple Code Points some combination of single chars and Surrogate Pairs. As if that weren't complicated enough, some Text Elements can be represented by different combinations of Code Points as described in, Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS.
Interlude
So, strings that look the same when rendered can actually be made up of a different combination of chars. An ordinal (byte by byte) comparison of two such strings would detect a difference, this may be unexpected or undesirable.
You can re-encode .NET strings so that they use the same Normalization Form. Once normalized, two strings with the same Text Elements will be encoded the same way. To do this, use the string.Normalize function. However, remember that some different Text Elements look similar to each other. :-s
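A short sketch of that idea, with hypothetical class and method names: the two helpers compare strings ordinally, once as they are and once after both have been brought to Normalization Form C. It assumes a runtime with normal globalization support, since normalization may be limited in invariant-globalization configurations.

```csharp
using System;
using System.Text;

public static class NormalizationSketch
{
    // Plain ordinal equality: compares the stored char values as they are.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Ordinal equality after normalizing both sides to Form C.
    public static bool EqualsAfterNormalization(string a, string b) =>
        string.Equals(a.Normalize(NormalizationForm.FormC),
                      b.Normalize(NormalizationForm.FormC),
                      StringComparison.Ordinal);
}
```

With the precomposed "\u00E9" and the decomposed "e\u0301", which both render as é, OrdinalEquals returns false while EqualsAfterNormalization returns true.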
So, what does this all mean in relation to the question? The Text Element '𠈓' is represented by the single Code Point U+20213, in the CJK Unified Ideographs Extension B block. This means it cannot be encoded as a single char and must be encoded as a Surrogate Pair, using two chars. This is why string b is one char longer than string a.
If you need to reliably (see caveat) count the number of Text Elements in a string you should use the
System.Globalization.StringInfo class like this.
using System.Globalization;
string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", new StringInfo(a).LengthInTextElements);
Console.WriteLine("Length b = {0}", new StringInfo(b).LengthInTextElements);
giving the output,
"Length a = 3"
"Length b = 3"
as expected.
Caveat
The .Net implementation of Unicode Text Segmentation in the StringInfo and TextElementEnumerator classes should be generally useful and, in most cases, will yield a response that the caller expects. However, as stated in Unicode Standard Annex #29, "The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries."
For my answer in this question I have to compare two characters. I thought that the normal char.CompareTo() method would allow me to specify a CultureInfo, but that's not the case.
So my question is: How can I compare two characters and specify a CultureInfo for the comparison?
There is no culture enabled comparison for characters, you have to convert the characters to strings so that you can use for example the String.Compare(string, string, CultureInfo, CompareOptions) method.
Example:
char a = 'å';
char b = 'ä';
// outputs -1:
Console.WriteLine(String.Compare(
a.ToString(),
b.ToString(),
CultureInfo.GetCultureInfo("sv-SE"),
CompareOptions.IgnoreCase
));
// outputs 1:
Console.WriteLine(String.Compare(
a.ToString(),
b.ToString(),
CultureInfo.GetCultureInfo("en-GB"),
CompareOptions.IgnoreCase
));
There is indeed a difference between comparing characters and strings. Let me try to explain the basic issue, which is quite simple: a char always represents a single UTF-16 code unit. Comparing chars always compares those numeric values, without any regard for linguistic equivalence or culture-specific ordering.
If you want to compare characters for equal meaning, you need to create a string and use the comparison methods provided there. These include support for different cultures. See Guffa's answer on how to do that.
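The difference can be sketched with two small helpers (the names are hypothetical): one compares the raw char values, the other converts to strings and lets a named culture decide. The culture-aware result depends on the culture and on the collation data available to the runtime, so no fixed result is claimed for it here.

```csharp
using System;
using System.Globalization;

public static class CharCompareSketch
{
    // Compares the numeric UTF-16 values only; no culture involved.
    public static int OrdinalSign(char a, char b) => Math.Sign(a.CompareTo(b));

    // Compares the characters as one-char strings under the given culture.
    // Pass "" for the invariant culture.
    public static int CultureSign(char a, char b, string cultureName) =>
        Math.Sign(string.Compare(a.ToString(), b.ToString(),
            CultureInfo.GetCultureInfo(cultureName), CompareOptions.None));
}
```

OrdinalSign('å', 'ä') is 1 because U+00E5 is greater than U+00E4, whatever the culture; CultureSign('å', 'ä', "sv-SE") instead asks the Swedish collation rules which letter sorts first.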
Did you try String.Compare Method?
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
String.Compare(str1, str2, false, new CultureInfo("en-US"))
I don't think CultureInfo matters while comparing chars in C#. char is already a Unicode character, so two characters can be easily compared without CultureInfo.