Which is bigger: string(";P") or string("-_-")? - c#

I ran into something very confusing when sorting a text file. Different algorithms and applications produce different results, for example when comparing the two strings str1=";P" and str2="-_-".
For reference, here are the ASCII codes of the characters in those strings:
char(';') = 59; char('P') = 80;
char('-') = 45; char('_') = 95;
So I tried different methods to determine which string is bigger. Here are my results:
In Microsoft Office Excel Sorting command:
";P" < "-_-"
C++ std::string::compare(string &str2), i.e. str1.compare(str2)
";P" > "-_-"
C# string.CompareTo(), i.e. str1.CompareTo(str2)
";P" < "-_-"
C# string.CompareOrdinal(), i.e. string.CompareOrdinal(str1, str2)
";P" > "-_-"
As shown, the results vary! My intuition agrees with methods 2 and 4, since ASCII(';') = 59 is larger than ASCII('-') = 45.
So I have no idea why Excel and C# string.CompareTo() give the opposite answer. I noticed that the second C# comparison function is named string.CompareOrdinal(). Does this imply that the default C# string.CompareTo() function is not "ordinal"?
Could anyone explain this inconsistency?
And could anyone explain why, with CultureInfo = {en-US}, it says ";P" < "-_-"? What is the underlying motivation or principle? I have also heard about double multiplication differing between CultureInfo settings. It's rather a culture shock!

std::string::compare: "the result of a character comparison depends only on its character code". It's simply ordinal.
String.CompareTo: "performs a word (case-sensitive and culture-sensitive) comparison using the current culture". So this is not ordinal, since typical users don't expect things to be sorted by raw character code.
String.CompareOrdinal: Per the name, "performs a case-sensitive comparison using ordinal sort rules".
EDIT: CompareOptions has a hint: "For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list."
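To make the difference concrete, here is a minimal sketch that runs both kinds of comparison on the two strings from the question. The class and method names (ComparisonDemo, Ordinal, Cultural) are made up for illustration. Only the ordinal result is predictable from the character codes; the culture-sensitive result depends on the collation data of the runtime and operating system, so no expected value is given for it.

```csharp
using System;
using System.Globalization;

public static class ComparisonDemo
{
    // Ordinal: compares UTF-16 code units one by one, like std::string::compare on ASCII data.
    public static int Ordinal(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    // Culture-sensitive: uses the linguistic sort rules of an explicitly named culture
    // instead of silently relying on the thread's current culture.
    public static int Cultural(string a, string b, string cultureName) =>
        Math.Sign(string.Compare(a, b, CultureInfo.GetCultureInfo(cultureName), CompareOptions.None));

    public static void Main()
    {
        // 1, because ';' (59) > '-' (45)
        Console.WriteLine(Ordinal(";P", "-_-"));

        // Depends on the platform's collation tables, so the output is not stated here.
        Console.WriteLine(Cultural(";P", "-_-", "en-US"));
    }
}
```

Passing the comparison type or culture explicitly, rather than calling CompareTo, makes it obvious at the call site which of the two behaviours is intended.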

Excel 2003 (and earlier) does a sort ignoring hyphens and apostrophes, so your sort really compares ; to _, which gives the result that you have. Here's a Microsoft Support link about it. Pretty sparse, but enough to get the point across.

Related

Must the FromBase64 string length be a multiple of 4 or not?

According to my understanding, the length of a base64 encoded string (i.e. the output of encode) must always be a multiple of 4.
The C# Convert.FromBase64String documentation says that its input length must be a multiple of 4.
However, if I give it a 25-character string it doesn't complain:
[convert]::FromBase64String("ei5gsIELIki+GpnPGyPVBA==")
[convert]::FromBase64String("1ei5gsIELIki+GpnPGyPVBA==")
both work. (The first one is 24 , second is 25)
[convert]::FromBase64String("11ei5gsIELIki+GpnPGyPVBA==")
fails with Invalid length exception
I assume this is a bug in the C# library, but I just want to make sure. I am writing code that sniffs strings to see if they are valid base64 strings, and I want to be sure that I understand what a valid one looks like. (One possible implementation was to give the string to System.Convert and see if it threw; why reinvent perfectly good code?)
Yes, this is a flaw (aka bug). It got started due to a perf optimization in an internal helper function named FromBase64_ComputeResultLength() which calculates the length of the byte[] result. It has this comment (edited to fit):
// For legal input, we can assume that 0 <= padding < 3. But it may be
// more for illegal input.
// We will notice it at decode when we see a '=' at the wrong place.
The "we will notice" remark is not entirely accurate. The decoder does flag an '=' if one isn't expected, but it fails to check whether there is one character too many, which is the case for the 25-char string.
You can report the problem at connect.microsoft.com; I don't see an existing report that resembles it. Do note that it is fairly unlikely that Microsoft can actually fix it any time soon, since the change would break existing programs that now successfully parse bad base64 strings. Getting rid of such problems normally requires a major .NET release, as was done for .NET 4.0, and there isn't one on the horizon afaik.
But yes, the simple workaround for you is to check if the string length is divisible by 4, use the % operator.
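A minimal sketch of that workaround, assuming you want a strict check: the length test runs first, and only then is the string handed to Convert.FromBase64String. The names Base64Check and IsValidBase64 are hypothetical helpers, not framework APIs.

```csharp
using System;

public static class Base64Check
{
    // True only if the length is a multiple of 4 AND the framework decoder accepts the string.
    public static bool IsValidBase64(string s)
    {
        if (s == null || s.Length % 4 != 0)
            return false;

        try
        {
            Convert.FromBase64String(s);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidBase64("ei5gsIELIki+GpnPGyPVBA=="));   // True  (24 chars)
        Console.WriteLine(IsValidBase64("1ei5gsIELIki+GpnPGyPVBA=="));  // False (25 chars)
        Console.WriteLine(IsValidBase64("11ei5gsIELIki+GpnPGyPVBA==")); // False (26 chars)
    }
}
```

Note that Convert.FromBase64String itself ignores white space in its input, so this strict version also rejects strings that contain line breaks or spaces unless their total length happens to be a multiple of 4. Strip white space first if such input should be accepted.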

Evaluating a mathematical expression given as text

I'm trying to make a calculator-like program where one would enter a calculation in a textbox and it would convert that calculation to an int holding the result. Here's what I have, but it doesn't work:
string calcStr = textBox1.Text;
int result = calcStr;
Any suggestions that aren't too complicated?
If I understand the problem correctly, you want to be able to parse an expression like 1 + 3 + 4 from a textbox and execute a calculation based on the input. That is actually a harder task than one might think.
One common solution is to use the Shunting-yard algorithm to parse the expression. See http://en.wikipedia.org/wiki/Shunting-yard_algorithm for more details.
Use NCalc for this kind of job... it is free, comes with source and does all the heavy lifting (parse the mathematical expression etc.) and gives you the result of the calculation.
If you're trying to simply parse out the number from a string, use a function like
Int32.Parse(string)
If you need to take out an EQUATION, like
"3+4/2"
then you'll need to extract each character one at a time and determine what it is.
Like if the string was
"32+4/12"
You'd have to loop through every character in the string and try to parse the current character into a number.
There's a function to test whether a character is a digit (char.IsDigit), or you can just check its ASCII value.
If it succeeds, take the current digit plus the next one and try again until you hit a non-digit character.
Now you can extract your numbers.
Characters that are not digits are checked against the mathematical operators you're allowing. Anything else throws an error.
Once you can extract the whole equation, you'll probably have to use something like stack operations to evaluate it. As I recall from my Assembly class, you push a bunch of numbers and operators onto the stack, and then pop them one at a time from the top, evaluating the previous number with the next number using the operator in between.
I hope this is what you were talking about. Best of luck!
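The tokenise-then-stack idea above can be sketched as follows. This is one possible implementation, not the only one: it uses the two-stack form of the shunting-yard algorithm and evaluates as it goes instead of first producing postfix output. It assumes non-negative integer literals, the operators + - * / and parentheses; unary minus and decimal literals are left out. ExpressionEvaluator and its members are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ExpressionEvaluator
{
    static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    static int Precedence(char op) => (op == '+' || op == '-') ? 1 : 2;

    // Pops one operator and two operands, applies the operator, pushes the result.
    static void ApplyTop(Stack<double> values, Stack<char> ops)
    {
        if (values.Count < 2) throw new FormatException("Missing operand.");
        char op = ops.Pop();
        double right = values.Pop(), left = values.Pop();
        switch (op)
        {
            case '+': values.Push(left + right); break;
            case '-': values.Push(left - right); break;
            case '*': values.Push(left * right); break;
            case '/': values.Push(left / right); break;
        }
    }

    public static double Evaluate(string text)
    {
        var values = new Stack<double>();
        var ops = new Stack<char>();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (IsAsciiDigit(c))
            {
                // Consume the whole run of digits as one number.
                int start = i;
                while (i < text.Length && IsAsciiDigit(text[i])) i++;
                values.Push(double.Parse(text.Substring(start, i - start), CultureInfo.InvariantCulture));
            }
            else if (c == '(') { ops.Push(c); i++; }
            else if (c == ')')
            {
                while (ops.Count > 0 && ops.Peek() != '(') ApplyTop(values, ops);
                if (ops.Count == 0) throw new FormatException("Unbalanced ')'.");
                ops.Pop(); // discard the matching '('
                i++;
            }
            else if ("+-*/".IndexOf(c) >= 0)
            {
                // Left-associative: apply pending operators of equal or higher precedence first.
                while (ops.Count > 0 && ops.Peek() != '(' && Precedence(ops.Peek()) >= Precedence(c))
                    ApplyTop(values, ops);
                ops.Push(c);
                i++;
            }
            else throw new FormatException("Unexpected character '" + c + "'.");
        }
        while (ops.Count > 0)
        {
            if (ops.Peek() == '(') throw new FormatException("Unbalanced '('.");
            ApplyTop(values, ops);
        }
        if (values.Count != 1) throw new FormatException("Malformed expression.");
        return values.Pop();
    }

    public static void Main()
    {
        Console.WriteLine(Evaluate("1 + 3 + 4")); // 8
        Console.WriteLine(Evaluate("3+4/2"));     // 5
    }
}
```

The result is a double so that a division such as 4/12 is not truncated to zero; cast or round it if an int is really what the textbox should produce.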

Max edit distance and suggestion based on word frequency

I need a spell checker with the following specification:
Very scalable.
To be able to set a maximum edit distance for the suggested words.
To get suggestion based on provided words frequencies (most common word first).
I took a look at Hunspell:
I found the MAXDIFF parameter in the man page, but it doesn't seem to work as expected. Maybe I'm using it the wrong way.
file t.aff:
MAXDIFF 1
file dico.dic:
5
rouge
vert
bleu
bleue
orange
-
NHunspell.Hunspell h = new NHunspell.Hunspell("t.aff", "dico.dic");
List<string> s = h.Suggest("bleuue");
This returns the same thing whether t.aff is empty or not:
bleue
bleu
We decided to use Apache Solr, which exactly fulfills our needs.
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck
A MAXDIFF of one should return only a few suggestions, but it can still return more than one.
Even a MAXDIFF of zero can give more than a single result, although it should lower the chance; it depends on the n-gram matching. Try a MAXDIFF of zero for fewer results, but this still doesn't guarantee that you will get a single suggestion.
For your requirement to sort on the most frequent word, the Google ngram corpus is publicly available.
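For illustration only, the two requirements from the question (a hard cap on edit distance, and ordering by word frequency) can be sketched with a brute-force scan. This is neither Hunspell's nor Solr's algorithm and it does not address scalability; the frequency numbers in SampleFrequencies are invented for the example, and FrequencySuggester is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FrequencySuggester
{
    // Invented frequencies for the words in dico.dic, purely for demonstration.
    public static readonly Dictionary<string, int> SampleFrequencies = new Dictionary<string, int>
    {
        { "rouge", 90 }, { "vert", 70 }, { "bleu", 120 }, { "bleue", 40 }, { "orange", 30 }
    };

    // Levenshtein distance (insert, delete, substitute), computed with two rolling rows.
    public static int EditDistance(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Keeps words within maxDistance of the input, most frequent first.
    public static List<string> Suggest(string word, IDictionary<string, int> frequencies, int maxDistance)
    {
        return frequencies
            .Where(kv => EditDistance(word, kv.Key) <= maxDistance)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => kv.Key)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Suggest("bleuue", SampleFrequencies, 1))); // bleue
        Console.WriteLine(string.Join(", ", Suggest("bleuue", SampleFrequencies, 2))); // bleu, bleue
    }
}
```

With a maximum distance of 1 only "bleue" survives, because "bleu" is two deletions away from "bleuue"; raising the limit to 2 admits both and the invented frequencies put "bleu" first.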

DB2 ZOS String Comparison Problem

I am comparing some CHAR data in a where clause in my sql like this,
where PRI_CODE < PriCode
The problem I am having is when the CHAR values are of different lengths.
So if PRI_CODE = '0800' and PriCode = '20' it is returning true instead of false.
It looks like it is comparing it like this
'08' < '20'
instead of like
'0800' < '20'
Does a CHAR comparison start from the Left until one or the other values end?
If so how do I fix this?
My values can have letters in them, so converting to numeric is not an option.
It's not comparing '08' with '20', it is, as you expect, comparing '0800' with '20'.
What you don't seem to expect, however, is that '0800' (the string) is indeed less than '20' (the string): strings are compared character by character from the left, and '0' sorts before '2'.
If converting it to numerics for a numeric comparison is out of the question, you could use the following DB2 function:
right ('0000000000'||val,10)
which will give you val padded on the left with zeroes to a size of 10 (ideal for a CHAR(10), for example). That will at least guarantee that the fields are the same size and the comparison will work for your particular case. But I urge you to rethink how you're doing things: per-row functions rarely scale well, performance-wise.
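The effect of the padding can be checked outside the database with a small sketch. It assumes plain ordinal (Unicode code point) ordering, which is not the same as the EBCDIC collation a z/OS system would typically use (there, letters sort before digits), so treat it as an illustration of the padding idea only. PaddedCompare and its methods are hypothetical names; Pad trims blanks first on the assumption that CHAR values arrive blank-padded.

```csharp
using System;

public static class PaddedCompare
{
    // Mirrors right('0000000000' || val, 10) for values up to 10 characters:
    // left-pad with zeroes to a fixed width. Unlike RIGHT, PadLeft never truncates.
    public static string Pad(string val, int width = 10) =>
        val.Trim().PadLeft(width, '0');

    // Character-by-character comparison of the raw values.
    public static bool LessThanRaw(string a, string b) =>
        string.CompareOrdinal(a, b) < 0;

    // The same comparison after both sides are padded to equal length.
    public static bool LessThanPadded(string a, string b) =>
        string.CompareOrdinal(Pad(a), Pad(b)) < 0;

    public static void Main()
    {
        Console.WriteLine(LessThanRaw("0800", "20"));    // True:  '0' sorts before '2'
        Console.WriteLine(Pad("20"));                    // 0000000020
        Console.WriteLine(LessThanPadded("0800", "20")); // False: "0000000800" vs "0000000020"
    }
}
```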
If you're using z/OS, you should have a few DBAs just lying around on the computer room floor waiting for work - you can probably ask one of them for advice more tailored to your specific application :-)
One thing that comes to mind is the use of an insert/update trigger and a secondary column PRI_CODE_PADDED to hold the PRI_CODE column fully padded out (using the same method as above). Then make sure your PriCode variable is similarly formatted before executing the select ... where PRI_CODE_PADDED < PriCode.
Incurring that cost at insert/update time will amortise it over all the selects you're likely to do (which, because they're no longer using per-row functions, will be blindingly fast), giving you better overall performance (assuming your database isn't one of those incredibly rare beasts that are written more than read, of course).

string.Empty.StartsWith(((char)10781).ToString()) always returns true?

I am trying to handle the following character: ⨝ (http://www.fileformat.info/info/unicode/char/2a1d/index.htm)
If you check whether an empty string starts with this character, it always returns true. This does not make any sense! Why is that?
// Visual Studio 2008 hides lines that contain this char literally (bug in Visual Studio?!?), so I wrote its Unicode code point instead.
char specialChar = (char)10781;
string specialString = specialChar.ToString();
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// both print true, WTF?!?
Console.WriteLine(string.Empty.StartsWith(specialString));
Console.WriteLine(string.Empty.StartsWith(((char)10781).ToString()));
You can fix this bug by using ordinal StringComparison:
From the MSDN docs:
"When you specify either StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, the string comparison will be non-linguistic. That is, the features that are specific to the natural language are ignored when making comparison decisions. This means the decisions are based on simple byte comparisons and ignore casing or equivalence tables that are parameterized by culture. As a result, by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable."
char specialChar = (char)10781;
string specialString = Convert.ToString(specialChar);
// prints 1
Console.WriteLine(specialString.Length);
// prints 10781
Console.WriteLine((int)specialChar);
// prints false
Console.WriteLine(string.Empty.StartsWith("A"));
// prints false
Console.WriteLine(string.Empty.StartsWith(specialString, StringComparison.Ordinal));
Nice unicode glitch ;-p
I'm not sure why it does this, but amusingly:
Console.WriteLine(string.Empty.StartsWith(specialString)); // true
Console.WriteLine(string.Empty.Contains(specialString)); // false
Console.WriteLine("abc".StartsWith(specialString)); // true
Console.WriteLine("abc".Contains(specialString)); // false
I'm guessing this is treated a bit like the non-joining character that Jon mentioned at devdays; some string functions see it, and some don't. And if it doesn't see it, this becomes "does (some string) start with an empty string", which is always true.
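One piece of the puzzle can be pinned down: String.Contains(string) always performs an ordinal comparison, whereas String.StartsWith(string) without a StringComparison argument uses the current culture. That alone accounts for the two methods disagreeing. A minimal sketch follows (PrefixDemo and StartsWithOrdinal are names made up here); the culture-sensitive call is printed without an expected value because its result depends on the collation data of the runtime.

```csharp
using System;

public static class PrefixDemo
{
    // U+2A1D, written as a code point so the source file stays plain ASCII.
    public static readonly string Special = ((char)10781).ToString();

    // Prefix test that ignores linguistic rules and compares code units only.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    public static void Main()
    {
        // True: every string starts with the empty string.
        Console.WriteLine("abc".StartsWith(""));

        // False: 'a' is not U+2A1D.
        Console.WriteLine(StartsWithOrdinal("abc", Special));

        // False: Contains(string) is an ordinal search.
        Console.WriteLine("abc".Contains(Special));

        // Culture-sensitive; the result varies with the platform's collation tables.
        Console.WriteLine("abc".StartsWith(Special));
    }
}
```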
The underlying reason for this is the default string comparison is locale aware. This means using tables of locale data for comparisons (including equality).
Many (if not most) Unicode characters have no value for many locales, and thus don't exist (or do, but match anything, or nothing).
See entries on character weights on Michael Kaplan's blog "Sorting It All Out". This series of blogs contains a lot of background information (the APIs are native, but—as I understand—the mechanisms in .NET are the same).
Quick version: this is a complex area, and getting expected (normal language) comparisons right is hard. This tends to lead to odd behaviour with code points for glyphs outside your language.
