Bug in the string comparing of the .NET Framework - c#

It's a requirement for any comparison sort to work that the underlying order operator is transitive and antisymmetric.
In .NET, that's not true for some strings:
static void CompareBug()
{
    string x = "\u002D\u30A2"; // or just "-ア" if charset allows
    string y = "\u3042"; // or just "あ" if charset allows
    Console.WriteLine(x.CompareTo(y)); // positive one
    Console.WriteLine(y.CompareTo(x)); // positive one
    Console.WriteLine(StringComparer.InvariantCulture.Compare(x, y)); // positive one
    Console.WriteLine(StringComparer.InvariantCulture.Compare(y, x)); // positive one
    var ja = StringComparer.Create(new CultureInfo("ja-JP", false), false);
    Console.WriteLine(ja.Compare(x, y)); // positive one
    Console.WriteLine(ja.Compare(y, x)); // positive one
}
You see that x is strictly greater than y, and y is strictly greater than x.
Because x.CompareTo(x) and so on all give zero (0), it is clear that this is not an order. Not surprisingly, I get unpredictable results when I Sort arrays or lists containing strings like x and y. Though I haven't tested this, I'm sure SortedDictionary<string, WhatEver> will have problems keeping itself in sorted order and/or locating items if strings like x and y are used for keys.
Is this bug well-known? What versions of the framework are affected (I'm trying this with .NET 4.0)?
EDIT:
Here's an example where the sign is negative either way:
x = "\u4E00\u30A0"; // equiv: "一゠"
y = "\u4E00\u002D\u0041"; // equiv: "一-A"

If correct sorting matters that much for your problem, use ordinal string comparison instead of culture-sensitive comparison. Only the ordinal one guarantees the transitive and antisymmetric ordering you want.
What MSDN says:
Specifying the StringComparison.Ordinal or
StringComparison.OrdinalIgnoreCase value in a method call signifies a
non-linguistic comparison in which the features of natural languages
are ignored. Methods that are invoked with these StringComparison
values base string operation decisions on simple byte comparisons
instead of casing or equivalence tables that are parameterized by
culture. In most cases, this approach best fits the intended
interpretation of strings while making code faster and more reliable.
And it works as expected:
Console.WriteLine(String.Compare(x, y, StringComparison.Ordinal)); // -12309
Console.WriteLine(String.Compare(y, x, StringComparison.Ordinal)); // 12309
Yes, it doesn't explain why culture-sensitive comparison gives inconsistent results. Well, strange culture — strange result.
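A minimal sketch of what that looks like for sorting and for sorted collections, using the x and y from the question. The class and method names (OrdinalSortSketch, SortOrdinal) are made up for illustration; StringComparer.Ordinal, List<T>.Sort and SortedDictionary<TKey, TValue> are the standard framework APIs.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalSortSketch
{
    // Sorts a copy of the input by UTF-16 code unit values, ignoring culture rules.
    public static List<string> SortOrdinal(IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.Ordinal);
        return sorted;
    }

    public static void Main()
    {
        string x = "\u002D\u30A2";
        string y = "\u3042";

        List<string> sorted = SortOrdinal(new[] { y, x });
        Console.WriteLine(sorted[0] == x); // True: U+002D sorts before U+3042

        // A sorted collection keyed with the ordinal comparer keeps a consistent order.
        var dict = new SortedDictionary<string, int>(StringComparer.Ordinal);
        dict[x] = 1;
        dict[y] = 2;
        Console.WriteLine(dict.ContainsKey(x) && dict.ContainsKey(y)); // True
    }
}
```

The trade-off is that ordinal order is not a linguistic order, so it suits lookups and internal sorting better than lists shown to users.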

I came across this SO post while trying to figure out why I was having problems retrieving (string) keys that had been inserted into a SortedList. The cause turned out to be the odd behaviour of the comparers in .NET 4.0 and above (a1 < a2 and a2 < a3, but a1 > a3).
My struggle to figure out what was going on can be found here: c# SortedList<string, TValue>.ContainsKey for successfully added key returns false.
You may want to have a look at the "UPDATE 3" section of my SO question. It appears that the issue was reported to Microsoft in Dec 2012, and closed before the end of January 2013 as "won't be fixed". Additionally it lists a workaround that may be used.
I created an implementation of this recommended workaround, and verified that it fixed the problem that I had encountered. I also just verified that this resolves the issue you reported.
public static void SO_13254153_Question()
{
    string x = "\u002D\u30A2"; // or just "-ア" if charset allows
    string y = "\u3042"; // or just "あ" if charset allows
    var invariantComparer = new WorkAroundStringComparer();
    var japaneseComparer = new WorkAroundStringComparer(new System.Globalization.CultureInfo("ja-JP", false));
    Console.WriteLine(x.CompareTo(y)); // positive one
    Console.WriteLine(y.CompareTo(x)); // positive one
    Console.WriteLine(invariantComparer.Compare(x, y)); // negative one
    Console.WriteLine(invariantComparer.Compare(y, x)); // positive one
    Console.WriteLine(japaneseComparer.Compare(x, y)); // negative one
    Console.WriteLine(japaneseComparer.Compare(y, x)); // positive one
}
The remaining problem is that this workaround is so slow it is hardly practical for use with large collections of strings. So I hope Microsoft will reconsider closing this issue or that someone knows of a better workaround.
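The implementation of WorkAroundStringComparer is not shown in this post. One way such a comparer could be built, and this is only an assumption about the approach, is to compare culture-specific sort keys instead of the strings themselves. The SortKeyComparer name below is hypothetical; CompareInfo.GetSortKey and SortKey.Compare are real framework calls.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical comparer (not the author's class): orders strings by their sort keys.
public sealed class SortKeyComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _options;

    public SortKeyComparer(CultureInfo culture = null, CompareOptions options = CompareOptions.None)
    {
        _compareInfo = (culture ?? CultureInfo.InvariantCulture).CompareInfo;
        _options = options;
    }

    public int Compare(string a, string b)
    {
        if (ReferenceEquals(a, b)) return 0;
        if (a == null) return -1;
        if (b == null) return 1;

        // Sort keys are byte arrays; comparing them byte by byte is a consistent order.
        SortKey keyA = _compareInfo.GetSortKey(a, _options);
        SortKey keyB = _compareInfo.GetSortKey(b, _options);
        return SortKey.Compare(keyA, keyB);
    }
}

public static class SortKeyComparerDemo
{
    public static void Main()
    {
        string x = "\u002D\u30A2";
        string y = "\u3042";
        var comparer = new SortKeyComparer(new CultureInfo("ja-JP", false));

        // The two results should have opposite signs (or both be zero).
        Console.WriteLine(Math.Sign(comparer.Compare(x, y)));
        Console.WriteLine(Math.Sign(comparer.Compare(y, x)));
    }
}
```

Building two sort keys per comparison allocates on every call, which would fit the slowness described above. Computing each key once and sorting on the stored keys might reduce that cost.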

Related

Enum methods vs string methods

Is it more efficient to use enums instead of string arrays, performance-wise?
I decided to test a particular method IsDefined, versus checking for a match-up inside a string array. I created an object of Stopwatch to test the runtime for each one.
The code, below:
Defined an enum outside of class Main:
enum Color : byte { red, blue, green }
Inside Main:
string[] colArr = new string[] { "red", "blue", "green" };
string input = "green";
Stopwatch s1 = new Stopwatch();
int loopIterations = 0;
s1.Restart();
while (loopIterations++ < 100000000)
    foreach (var blah in colArr)
        if (blah == input)
            break;
s1.Stop();
Console.WriteLine("Runtime for foreach loop: {0}", s1.Elapsed);
loopIterations = 0;
s1.Restart();
while (loopIterations++ < 100000000)
    if (Enum.IsDefined(typeof(Color), input))
        continue;
s1.Stop();
Console.WriteLine("Runtime for IsDefined method returned value: {0}", s1.Elapsed);
And my output looks like this:
Runtime for foreach loop: 00:00:01.4862817
Runtime for IsDefined method returned value: 00:00:09.3421654
Press any key to continue . . .
So I wanted to ask if - assuming the code I wrote isn't, like, stupid or something - those numbers are normal, and if they are, in what way is using enums preferable to using a string array, specifically for the kind of jobs both would?
For starters, rather than performance a big reason for using enums over strings is maintainability of the code. E.g., trying to 'find all references' to Color.red can be done with a few clicks in visual studio. Trying to find strings isn't so easy. Always typing the strings is also error-prone. Although both problems could be alleviated somewhat by using constants, it's easier to use enums.
An enum can be seen as a constant integer value, which has good performance and has benefits such as using flags (masks). Comparing an int will be faster than comparing a string, but that's not what happens here. Mostly you want to do something for a specific value and you could test if(someString == "red") versus if(someColVal == Color.red), in which case the latter should be faster.
Checking if a value exists in an enum can be slower with the Enum.IsDefined, but that function has to look up the enum-values each time in this loop.
Meanwhile the first test has a pre-defined array. For the strict comparison in performance to your first test, you could do something like:
var colvalues = Enum.GetValues(typeof(Color)).Cast<Color>().ToArray(); // or hardcode: var colvalues = new[]{Color.red, Color.blue, Color.green};
var colinput = Color.red;
while (loopIterations++ < 100000000)
    foreach (var blah in colvalues)
        if (blah == colinput)
            break;
Although, as stated, finding out whether a value exists in an enum is normally not its primary function (mostly it's used for checking for a specific value). However, its integer base allows for other ways to check whether a value is in an expected range, such as mask-checking or >, >=, < or <=.
edit: Seeing the comments about user input: mostly the input would be controlled, e.g. the user is shown a menu. In a console environment that menu could be built with the numbers of the enum.
For example, given enum Color : byte { red = 1, blue, green }, the menu would be:
1. red
2. blue
3. green
The user input would be an integer. On the other hand if typing is required, IsDefined would prevent having to retype the values and is good for ease of use. For performance the names could be buffered with something like var colvalues = Enum.GetNames(typeof(Color)).ToArray();
The normal use for enums is to represent logical states or a limited range of options: in your example, if e.g. a product ever only comes in three colours. Using a string to represent colours in such a case has two drawbacks: you may misspell the name of a colour somewhere in your code, and get hard to track bugs; and string comparison is inherently slower than comparing enums (which is basically comparing integers).
IsDefined() uses type reflection, and thus should be slower than straight string comparison. There are cases where you want to convert enums to and from strings: usually when doing input/output such as saving or restoring configurations. That's slower, but input/output is typically dominated by the slowness of storage media and networks, so it's seldom a big deal.
I know this is a very old post, but I noticed that the compared code snippets are not doing the looping in a similar fashion.
In the first loop, you let the inner loop break once it finds the string in the string array, but in the second scenario you don't stop anything; you just continue if Enum.IsDefined finds the value.
When you actually let the loop in the enum scenario break when it finds the value, the enum scenario runs much faster...
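For a like-for-like measurement, both loops should do exactly one lookup per iteration. The sketch below makes that assumption and also shows the buffering idea mentioned earlier, reading the enum names once into a set. EnumLookupSketch and IsColorName are made-up names, and the timings are not claimed to match anyone's machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public enum Color : byte { red, blue, green }

public static class EnumLookupSketch
{
    // Names are read once via reflection, then looked up in a hash set.
    private static readonly HashSet<string> ColorNames =
        new HashSet<string>(Enum.GetNames(typeof(Color)), StringComparer.Ordinal);

    public static bool IsColorName(string input)
    {
        return ColorNames.Contains(input);
    }

    public static void Main()
    {
        const int iterations = 10000000;
        string input = "green";

        // One Enum.IsDefined call per iteration.
        var sw = Stopwatch.StartNew();
        int hitsIsDefined = 0;
        for (int i = 0; i < iterations; i++)
            if (Enum.IsDefined(typeof(Color), input)) hitsIsDefined++;
        sw.Stop();
        Console.WriteLine("Enum.IsDefined: {0}", sw.Elapsed);

        // One cached-set lookup per iteration.
        sw.Restart();
        int hitsCached = 0;
        for (int i = 0; i < iterations; i++)
            if (IsColorName(input)) hitsCached++;
        sw.Stop();
        Console.WriteLine("Cached names:   {0}", sw.Elapsed);

        Console.WriteLine(hitsIsDefined == hitsCached); // True: both find every match
    }
}
```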

Dynamically evaluating statement with multiple equality operators

I've seen quite a few expression parsers / tokenizers that can take a certain string and evaluate the result. For example, you could pump the string:
4+4
into the following code:
MSScriptControl.ScriptControl sc = new MSScriptControl.ScriptControl();
//' You always need to initialize a language engine
sc.Language = "VBScript";
//' this is the expression - in a real program it will probably be
//' read from a textbox control
string expr = "4+4";
double res = Convert.ToDouble(sc.Eval(expr)); // Eval returns object, so convert explicitly
and get 8. But, is there a parsing tool out there that can evaluate the string:
4 = 4 = 4
? So far, all examples fail with an error about not being able to compare a double and a boolean (which makes sense from a compiler's perspective, but from a human perspective, we can see that this is true). Has anyone come across something that can achieve this?
From a human perspective, this is only true if we think of x = y = z as a special operator (with three operands), where it implies x = y, y = z, x = z. That is a specific syntactical interpretation of the expression. A human (particularly a programmer) could also interpret it the same way most compilers do, which is to choose the left-most grouping ( x = y ) and then compare the result of that comparison (a boolean value) to z. Even to a human, this doesn't make sense under this syntax. It only seems obvious from a human perspective because humans are notoriously fuzzy when it comes to choosing a syntax that 'makes sense' for a given context.
If you really want that level of 'fuzziness', you'll need to look into something like Wolfram Alpha, which performs contextual analysis to try to find a best guess for the meaning of the expression. If you enter '4 = 4 = 4' there, it will reply True.
You need to define the syntax for your "language" and build a parser, as your expected behavior is not covered by normal expression syntax (normal languages should also evaluate it to "false", as every language I have heard of implements = as a binary operation and hence will end up with "4 = true" at some point). There are tools to build parsers for C#...
Side note: to match "a human perspective" is insanely hard problem still not solved even for human to human communication :).
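If the only special syntax needed is a chain of = between numeric literals, the "three-operand" reading described above can be sketched without a full parser. This assumes every operand is a plain number in invariant format; ChainedEquality and AllEqual are hypothetical names, and anything richer (sub-expressions, other operators) would need a real grammar.

```csharp
using System;
using System.Globalization;

public static class ChainedEquality
{
    // Treats "a = b = c" as "all operands are equal" rather than "(a = b) = c".
    public static bool AllEqual(string expression)
    {
        string[] operands = expression.Split('=');
        double first = double.Parse(operands[0].Trim(), CultureInfo.InvariantCulture);

        for (int i = 1; i < operands.Length; i++)
        {
            double value = double.Parse(operands[i].Trim(), CultureInfo.InvariantCulture);
            if (value != first) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(AllEqual("4 = 4 = 4")); // True
        Console.WriteLine(AllEqual("4 = 4 = 5")); // False
    }
}
```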
try with
var result = (int) HtmlPage.Window.Eval("4 + 4");

string(";P") is bigger or string("-_-") is bigger?

I found sorting a text file very confusing. Different algorithms and applications produce different results, for example when comparing the two strings str1=";P" and str2="-_-".
Just for reference, here are the ASCII codes for each char in those strings:
char(';') = 59; char('P') = 80;
char('-') = 45; char('_') = 95;
So I've tried different methods to determine which string is bigger, here is my result:
In Microsoft Office Excel Sorting command:
";P" < "-_-"
C++ std::string::compare(string &str2), i.e. str1.compare(str2)
";P" > "-_-"
C# string.CompareTo(), i.e. str1.CompareTo(str2)
";P" < "-_-"
C# string.CompareOrdinal(), i.e. CompareOrdinal(w1, w2)
";P" > "-_-"
As shown, the results vary! My intuition agrees with methods 2 and 4, since ASCII(';') = 59 is larger than ASCII('-') = 45.
So I have no idea why Excel and C# string.CompareTo() give the opposite answer. Note that in C# the second comparison function is named string.CompareOrdinal(). Does this imply that the default C# string.CompareTo() function is not "Ordinal"?
Could anyone explain this inconsistency?
And could anyone explain why, with CultureInfo = {en-US}, it says ;P > -_-? What's the underlying motivation or principle? I have also heard that double multiplication can differ between CultureInfo settings. It's rather a culture shock!
std::string::compare: "the result of a character comparison depends only on its character code". It's simply ordinal.
String.CompareTo: "performs a word (case-sensitive and culture-sensitive) comparison using the current culture". So, this is not ordinal, since typical users don't expect things to be sorted like that.
String.CompareOrdinal: Per the name, "performs a case-sensitive comparison using ordinal sort rules".
EDIT: CompareOptions has a hint: "For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list."
Excel 2003 (and earlier) does a sort ignoring hyphens and apostrophes, so your sort really compares ; to _, which gives the result that you have. Here's a Microsoft Support link about it. Pretty sparse, but enough to get the point across.
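To see the different modes side by side, a small sketch follows. Only the ordinal results are annotated, since they follow directly from the character codes; the culture-sensitive results depend on the runtime and its collation data, so they are printed without a claimed value. CompareModes and OrdinalSign are made-up names.

```csharp
using System;
using System.Globalization;

public static class CompareModes
{
    // Sign of the ordinal (code unit) comparison.
    public static int OrdinalSign(string a, string b)
    {
        return Math.Sign(string.CompareOrdinal(a, b));
    }

    public static void Main()
    {
        string s1 = ";P";
        string s2 = "-_-";

        Console.WriteLine(OrdinalSign(s1, s2)); // 1, because ';' (59) > '-' (45)
        Console.WriteLine(Math.Sign(string.Compare(s1, s2, StringComparison.Ordinal))); // 1

        CompareInfo enUs = new CultureInfo("en-US").CompareInfo;
        // Word sort: the hyphen may get a very small weight.
        Console.WriteLine(Math.Sign(enUs.Compare(s1, s2, CompareOptions.None)));
        // String sort: the hyphen is treated like other non-alphanumeric symbols.
        Console.WriteLine(Math.Sign(enUs.Compare(s1, s2, CompareOptions.StringSort)));
    }
}
```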

Vectorising operators in C#

I spend much of my time programming in R or MATLAB. These languages are typically used for manipulating arrays and matrices, and consequently, they have vectorised operators for addition, equality, etc.
For example, in MATLAB, adding two arrays
[1.2 3.4 5.6] + [9.87 6.54 3.21]
returns an array of the same size
ans =
11.07 9.94 8.81
Switching over to C#, we need a loop, and it feels like a lot of code.
double[] a = { 1.2, 3.4, 5.6 };
double[] b = { 9.87, 6.54, 3.21 };
double[] sum = new double[a.Length];
for (int i = 0; i < a.Length; ++i)
{
    sum[i] = a[i] + b[i];
}
How should I implement vectorised operators using C#? These should preferably work for all numeric array types (and bool[]). Working for multidimensional arrays is a bonus.
The first idea I had was to overload the operators for System.Double[], etc. directly. This has a number of problems though. Firstly, it could cause confusion and maintainability issues if built-in classes do not behave as expected. Secondly, I'm not sure if it is even possible to change the behaviour of these built-in classes.
So my next idea was to derive a class from each numerical type and overload the operators there. This creates the hassle of converting from double[] to MyDoubleArray and back, which reduces the benefit of me doing less typing.
Also, I don't really want to have to repeat a load of almost identical functionality for every numeric type. This led to my next idea of a generic operator class. In fact, someone else had also had this idea: there's a generic operator class in Jon Skeet's MiscUtil library.
This gives you a method-like prefix syntax for operations, e.g.
double sum = Operator<double>.Add(3.5, -2.44); // 1.06
The trouble is, since the array types don't support addition, you can't just do something like
double[] sum = Operator<double[]>.Add(a, b); // Throws InvalidOperationException
I've run out of ideas. Can you think of anything that will work?
Create a Vector class (actually I'd make it a struct) and overload the arithmetic operators for that class... This has probably been done already; if you do a Google search, there are numerous hits... Here's one that looks promising: Vector class...
To handle vectors of arbitrary dimension, I'd:
1. design the internal storage, which holds the individual floats for each of the vector's dimension values, as an array list of arbitrary size,
2. make the Vector constructor take the dimension as a constructor parameter,
3. in the arithmetic operator overloads, add a validation that the two vectors being added or subtracted have the same dimension.
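The design above could be sketched roughly as follows. Vec, FromArray and ToArray are names invented here, the storage is a plain double[] rather than an array list, and only + and - are shown; treat it as a starting point, not a finished library.

```csharp
using System;

public struct Vec
{
    private readonly double[] _values;

    // The dimension is fixed at construction time.
    public Vec(int dimension)
    {
        _values = new double[dimension];
    }

    // default(Vec) has no backing array, so report dimension 0 in that case.
    public int Dimension { get { return _values == null ? 0 : _values.Length; } }

    public double this[int i]
    {
        get { return _values[i]; }
        set { _values[i] = value; }
    }

    public static Vec FromArray(double[] source)
    {
        var v = new Vec(source.Length);
        Array.Copy(source, v._values, source.Length);
        return v;
    }

    public double[] ToArray()
    {
        return _values == null ? new double[0] : (double[])_values.Clone();
    }

    public static Vec operator +(Vec a, Vec b)
    {
        CheckSameDimension(a, b);
        var result = new Vec(a.Dimension);
        for (int i = 0; i < a.Dimension; i++) result[i] = a[i] + b[i];
        return result;
    }

    public static Vec operator -(Vec a, Vec b)
    {
        CheckSameDimension(a, b);
        var result = new Vec(a.Dimension);
        for (int i = 0; i < a.Dimension; i++) result[i] = a[i] - b[i];
        return result;
    }

    private static void CheckSameDimension(Vec a, Vec b)
    {
        if (a.Dimension != b.Dimension)
            throw new ArgumentException("Vectors must have the same dimension.");
    }
}

public static class VecDemo
{
    public static void Main()
    {
        Vec a = Vec.FromArray(new[] { 1.2, 3.4, 5.6 });
        Vec b = Vec.FromArray(new[] { 9.87, 6.54, 3.21 });
        Vec sum = a + b;
        Console.WriteLine(string.Join(" ", sum.ToArray()));
    }
}
```

Supporting other element types would mean repeating the struct per type or going through a generic operator helper, which is the trade-off raised in the question.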
You should probably create a Vector class that internally wraps an array and overloads the arithmetic operators. There's a decent matrix/vector code library here.
But if you really need to operate on naked arrays for some reason, you can use LINQ:
var a1 = new double[] { 0, 1, 2, 3 };
var a2 = new double[] { 10, 20, 30, 40 };
var sum = a1.Zip( a2, (x,y) => Operator<double>.Add( x, y ) ).ToArray();
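When the element type is known, the same Zip call works with a plain lambda and no MiscUtil dependency. A minimal sketch, with ZipSum and Add as made-up names and a length check added as an assumption about the desired behaviour (Zip on its own silently stops at the shorter array):

```csharp
using System;
using System.Linq;

public static class ZipSum
{
    // Element-wise sum of two arrays of equal length.
    public static double[] Add(double[] left, double[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Arrays must have the same length.");
        return left.Zip(right, (x, y) => x + y).ToArray();
    }

    public static void Main()
    {
        var a1 = new double[] { 0, 1, 2, 3 };
        var a2 = new double[] { 10, 20, 30, 40 };
        double[] sum = Add(a1, a2);
        Console.WriteLine(string.Join(", ", sum)); // 10, 21, 32, 43
    }
}
```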
Take a look at CSML. It's a fairly complete matrix library for c#. I've used it for a few things and it works well.
The XNA Framework has the classes you may be able to use. You can use it in your application like any other part of .NET. Just grab the XNA redistributable and code away.
BTW, you don't need to do anything special (like getting the game studio or joining the creator's club) to use it in your application.

working with incredibly large numbers in .NET

I'm trying to work through the problems on projecteuler.net but I keep running into a couple of problems.
The first is a question of storing large quantities of elements in a List<T>. I keep getting OutOfMemoryExceptions when storing large quantities in the list.
Now I admit I might not be doing these things in the best way but, is there some way of defining how much memory the app can consume?
It usually crashes when I get to about 100,000,000 elements :S
Secondly, some of the questions require the addition of massive numbers. I use ulong data type where I think the number is going to get super big, but I still manage to wrap past the largest supported int and get into negative numbers.
Do you have any tips for working with incredibly large numbers?
Consider System.Numerics.BigInteger.
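A short sketch of what that looks like in practice, assuming a project that references System.Numerics. BigIntegerSketch and Factorial are illustrative names; BigInteger.Pow, BigInteger.One and the arithmetic operators are part of the real type.

```csharp
using System;
using System.Numerics;

public static class BigIntegerSketch
{
    // n! grows past ulong.MaxValue at n = 21, so BigInteger is used throughout.
    public static BigInteger Factorial(int n)
    {
        BigInteger result = BigInteger.One;
        for (int i = 2; i <= n; i++)
            result *= i;
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(BigInteger.Pow(2, 100)); // 1267650600228229401496703205376
        Console.WriteLine(Factorial(25));          // 15511210043330985984000000
    }
}
```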
You need to use a large number class that uses some basic math principles to split these operations up. This implementation of a C# BigInteger library on CodeProject seems to be the most promising. The article has some good explanations of how operations with massive numbers work, as well.
Also see:
Big integers in C#
As far as Project Euler goes, you might be barking up the wrong tree if you are hitting OutOfMemory exceptions. From their website:
Each problem has been designed according to a "one-minute rule", which means that although it may take several hours to design a successful algorithm with more difficult problems, an efficient implementation will allow a solution to be obtained on a modestly powered computer in less than one minute.
As user Jakers said, if you're using Big Numbers, probably you're doing it wrong.
Of the ProjectEuler problems I've done, none have required big-number math so far.
It's more about finding the proper algorithm to avoid big numbers.
Want hints? Post here, and we might have an interesting Euler-thread started.
I assume this is C#? F# has built in ways of handling both these problems (BigInt type and lazy sequences).
You can use both F# techniques from C#, if you like. The BigInt type is reasonably usable from other languages if you add a reference to the core F# assembly.
Lazy sequences are basically just syntax friendly enumerators. Putting 100,000,000 elements in a list isn't a great plan, so you should rethink your solutions to get around that. If you don't need to keep information around, throw it away! If it's cheaper to recompute it than store it, throw it away!
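In C# the same idea can be expressed with an iterator method, so no F# reference is needed for this part. The sketch below streams ten million values through a sum without ever holding them in a list; LazySequenceSketch, Naturals and SumFirst are names made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazySequenceSketch
{
    // Produces 1, 2, 3, ... on demand; nothing is stored.
    public static IEnumerable<long> Naturals()
    {
        for (long n = 1; ; n++)
            yield return n;
    }

    // Only one element is alive at a time while summing.
    public static long SumFirst(int count)
    {
        return Naturals().Take(count).Sum();
    }

    public static void Main()
    {
        Console.WriteLine(SumFirst(10000000)); // 50000005000000
    }
}
```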
See the answers in this thread. You probably need to use one of the third-party big integer libraries/classes available or wait for .NET 4.0, which will include a native BigInteger datatype (System.Numerics.BigInteger).
As far as defining how much memory an app will use, you can check the available memory before performing an operation by using the MemoryFailPoint class.
This lets you check that enough memory is available before starting the operation, so you can find out whether it is likely to fail before running it.
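A rough sketch of that pattern, assuming the estimate of the memory needed is supplied by the caller. MemoryGateSketch and TryRun are hypothetical names; MemoryFailPoint (in System.Runtime) and InsufficientMemoryException are the real types.

```csharp
using System;
using System.Runtime;

public static class MemoryGateSketch
{
    // Runs the work only if the runtime expects the requested memory to be available.
    public static bool TryRun(int megabytesNeeded, Action work)
    {
        try
        {
            using (new MemoryFailPoint(megabytesNeeded))
            {
                work();
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            // Thrown up front by MemoryFailPoint instead of an OutOfMemoryException midway.
            return false;
        }
    }

    public static void Main()
    {
        bool ran = TryRun(16, () =>
        {
            var buffer = new byte[8 * 1024 * 1024]; // stand-in for the real workload
            Console.WriteLine("Allocated {0} bytes", buffer.Length);
        });
        Console.WriteLine(ran ? "Completed" : "Skipped: not enough memory");
    }
}
```

The gate only reports whether the memory looked available when it was checked, so the work can in principle still fail later.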
string Add(string s1, string s2)
{
    bool carry = false;
    string result = string.Empty;
    if (s1.Length < s2.Length)
        s1 = s1.PadLeft(s2.Length, '0');
    if (s2.Length < s1.Length)
        s2 = s2.PadLeft(s1.Length, '0');
    for (int i = s1.Length - 1; i >= 0; i--)
    {
        var augend = Convert.ToInt64(s1.Substring(i, 1));
        var addend = Convert.ToInt64(s2.Substring(i, 1));
        var sum = augend + addend;
        sum += (carry ? 1 : 0);
        carry = false;
        if (sum > 9)
        {
            carry = true;
            sum -= 10;
        }
        result = sum.ToString() + result;
    }
    if (carry)
    {
        result = "1" + result;
    }
    return result;
}
I am not sure if it is a good way of handling it, but I use the following in my project.
I have a "double theRelevantNumber" variable and an "int PowerOfTen" for each item and in my relevant class I have a "int relevantDecimals" variable.
So... when large numbers are encountered they are handled like this:
First they are changed to x,yyy form. So if the number 123456,789 was entered and the "powerOfTen" was 10, it would start like this:
theRelevantNumber = 123456,789
PowerOfTen = 10
The number was then: 123456,789*10^10
It is then changed to:
1,23456789*10^15
It is then rounded by the number of relevant decimals (for example 5) to 1,23456 and then saved along with "PowerOfTen = 15"
When adding or subtracting numbers, any digits outside the relevant decimals are ignored. Meaning if you take:
1*10^15 + 1*10^10 it will change to 1,00001 if "relevantDecimals" is 5 but will not change at all if "relevantDecimals" are 4.
This method lets you deal with numbers up to doubleLimit*10^intLimit without any problem, and at least for OOP it is not that hard to keep track of.
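The normalisation step described above (moving the number into x,yyy form and adjusting the power of ten) might look something like this. It is only a sketch of that one step: ScaledNumber and Normalize are invented names, and the rounding to "relevantDecimals" is left out.

```csharp
using System;

public static class ScaledNumber
{
    // Brings the mantissa into the range [1, 10) and adjusts the exponent to match.
    public static void Normalize(ref double mantissa, ref int powerOfTen)
    {
        if (mantissa == 0 || double.IsNaN(mantissa) || double.IsInfinity(mantissa))
            return;

        while (Math.Abs(mantissa) >= 10)
        {
            mantissa /= 10;
            powerOfTen++;
        }
        while (Math.Abs(mantissa) < 1)
        {
            mantissa *= 10;
            powerOfTen--;
        }
    }

    public static void Main()
    {
        double theRelevantNumber = 123456.789;
        int powerOfTen = 10;
        Normalize(ref theRelevantNumber, ref powerOfTen);

        // Roughly 1.23456789 with powerOfTen = 15 (subject to floating-point rounding).
        Console.WriteLine("{0} * 10^{1}", theRelevantNumber, powerOfTen);
    }
}
```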
You don't need to use BigInteger. You can do this even with a string array of numbers.
using System;

class Solution
{
    static void Main(String[] args)
    {
        string[] unsorted = new string[6] { "3141592653589793238", "1", "3", "5737362592653589793238", "3", "5" };
        int n = unsorted.Length;
        string[] result = SortStrings(n, unsorted);
        foreach (string s in result)
            Console.WriteLine(s);
        Console.ReadLine();
    }

    // Sorts non-negative integers given as digit strings (no sign, no leading zeros).
    static string[] SortStrings(int size, string[] arr)
    {
        Array.Sort(arr, (left, right) =>
        {
            // A shorter digit string is the smaller number.
            if (left.Length != right.Length)
                return left.Length - right.Length;
            // Same length: compare digit by digit; ordinal avoids culture-sensitive rules.
            return string.CompareOrdinal(left, right);
        });
        return arr;
    }
}
If you want to work with incredibly large numbers look here...
MIKI Calculator
I am not a professional programmer; I sometimes write code for myself, so sorry for the unprofessional use of C#, but the program works. I will be grateful for any advice and correction.
I use this calculator to generate 32-character passwords from numbers that are around 58 digits long.
Since the program adds numbers in the string format, you can perform calculations on numbers with the maximum length of the string variable. The program uses long lists for the calculation, so it is possible to calculate on larger numbers, possibly 18x the maximum capacity of the list.
