Efficient(?) string comparison - c#

What could possibly be the reasons to use -
bool result = String.Compare(fieldStr, "PIN", true).Equals(0);
instead of,
bool result = String.Equals(fieldStr, "PIN", StringComparison.CurrentCultureIgnoreCase);
or, even simpler -
bool result = fieldStr.Equals("PIN", StringComparison.CurrentCultureIgnoreCase);
for comparing two strings in .NET with C#?
I've been assigned to a project with a large code base that makes abundant use of the first form for simple equality comparisons. I couldn't (not yet, at least) find any reason why those senior developers used that approach and not something simpler like the second or the third one. Is there any performance issue with the Equals method (static or instance)? Or is there any specific benefit to using the String.Compare method that outweighs the cost of the extra trailing .Equals(0) operation?

I can't give immediate examples, but I suspect there are cases where the first would return true but the second false. Two values may be equal in terms of sort order while still being distinct even under case-ignoring rules. For example, one culture may decide not to treat accents as important while sorting, but still view two strings differing only in accented characters as unequal. (Or it's possible that the reverse is true: two strings may be considered equal, but one comes logically before the other.)
If you're basically interested in the sort order rather than equality, then using Compare makes sense. It also potentially makes sense if the code is going to be translated - e.g. for LINQ - and that overload of Compare is supported but that overload of Equals isn't.
I'll try to come up with an example where they differ. I would certainly say it's rare. EDIT: No luck so far, having tried accents, Eszett, Turkish "I" handling, and different kinds of spaces. That's a long way from saying it can't happen, though.
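Here's a minimal sketch of the kind of harness one could use to probe for a divergence - the candidate pairs below are purely illustrative, and none of them is claimed to actually differ:

using System;

class CompareVsEquals
{
    static void Main()
    {
        // Candidate pairs to probe - purely illustrative; none is claimed to diverge.
        var pairs = new[]
        {
            ("encyclopædia", "encyclopaedia"),
            ("straße", "STRASSE"),
            ("file", "FILE")
        };

        foreach (var (a, b) in pairs)
        {
            // First form from the question: culture-sensitive, case-insensitive sort order.
            bool byCompare = string.Compare(a, b, true) == 0;

            // Second/third form: culture-sensitive, case-insensitive equality.
            bool byEquals = string.Equals(a, b, StringComparison.CurrentCultureIgnoreCase);

            Console.WriteLine($"{a} / {b}: Compare==0 -> {byCompare}, Equals -> {byEquals}");
        }
    }
}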

Related

c# string performance - what is faster to compare, string text or string length

I have to read a huge XML file which consists of over 3 million records and over 10 million nested elements.
Naturally I am using XmlTextReader, and with multiple optimization tricks and tips I have got my parsing time down to about 40 seconds from the earlier 90 seconds.
But I want to shave off as much further processing time as I can, hence the question below.
Quite a few elements are of type xs:boolean and the data provider always represents values as "true" or "false" - never "1" or "0".
For such cases my earliest code was:
if (xmlTextReader.Value == "true")
{
bool subtitled = true;
}
which I further optimized to:
if (string.Equals(xmlTextReader.Value, "true", StringComparison.OrdinalIgnoreCase))
{
bool subtitled = true;
}
I wanted to know if the following would be fastest (because it's either "true" or "false")?
if (xtr.Value.Length == 4)
{
bool subtitled = true;
}
Yes, it is faster, because you only compare exactly one value, namely the length of the string.
By comparing two strings with each other, you compare each and every character for as long as the characters keep matching. So to find a match for the string "true", you're going to do 4 comparisons before the predicate evaluates to true.
The only problem with this solution is that if someday the value changes from "true" to, let's say, "1", you're going to run into a problem here.
Comparing length will be faster, but less readable. I wouldn't use it unless I profile the performance of the code and conclude that I need this optimization.
What about comparing the first character to "t"?
Should (maybe :) be faster than comparing the whole string..
Measuring the length would almost invariably be faster. That said, unless this is an experiment in micro-optimization, I'd just focus on making the code to be readable and convey the proper semantics.
You might also try something that uses the following approach:
Boolean.TryParse(xmlTextReader.Value, out subtitled)
I know that has nothing to do with your question, but I figured I'd throw it out there anyway.
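In the reader loop that might look roughly like this (xmlTextReader and subtitled are the names from the question; the surrounding read loop is assumed):

// Assumes the reader is positioned on the text of the xs:boolean element.
bool subtitled;
if (!Boolean.TryParse(xmlTextReader.Value, out subtitled))
{
    subtitled = false; // fall back if the provider ever sends something other than "true"/"false"
}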
Can't you just write a unit test? Run each scenario, say, 1000 times and compare the timings.
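A minimal Stopwatch-based sketch of that idea (the iteration count and the test value are arbitrary):

using System;
using System.Diagnostics;

class ComparisonTiming
{
    static void Main()
    {
        const int iterations = 1000000;
        string value = "true"; // stand-in for xmlTextReader.Value
        bool subtitled = false;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            subtitled = string.Equals(value, "true", StringComparison.OrdinalIgnoreCase);
        sw.Stop();
        Console.WriteLine($"Equals: {sw.ElapsedMilliseconds} ms ({subtitled})");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            subtitled = value.Length == 4;
        sw.Stop();
        Console.WriteLine($"Length: {sw.ElapsedMilliseconds} ms ({subtitled})");
    }
}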
If you know it's either "true" or "false", the last snippet must be fastest.
Anyway, you can also write:
bool subtitled = (xtr.Value.Length == 4);
That should be even faster.
Old question, I know, but the accepted answer is wrong, or at least incorrect in its explanation.
Comparing the lengths may be the slightest bit faster, but only because string.Equals likely does some other comparisons before it, too, checks the lengths and decides that the strings are not equal.
So in practice this is an optimization of last resort.
Here you can find the source for .NET core string comparison.
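For illustration, here is a simplified sketch of the checks an ordinal equality comparison typically makes - not the actual framework code, which is vectorized and considerably more involved:

// Rough illustration only; the real implementation short-circuits on reference
// equality and on length before it ever touches individual characters.
static bool OrdinalEqualsSketch(string a, string b)
{
    if (ReferenceEquals(a, b)) return true;     // same object (or both null)
    if (a == null || b == null) return false;
    if (a.Length != b.Length) return false;     // the length check mentioned above

    for (int i = 0; i < a.Length; i++)
        if (a[i] != b[i]) return false;

    return true;
}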
String comparison and parsing are very slow in .NET; I'd recommend avoiding intensive string parsing/comparison in .NET.
If you're forced to do it, use highly optimized unmanaged or unsafe code and use parallelism.
IMHO.

DB2 ZOS String Comparison Problem

I am comparing some CHAR data in a where clause in my sql like this,
where PRI_CODE < PriCode
The problem I am having is when the CHAR values are of different lengths.
So if PRI_CODE = '0800' and PriCode = '20' it is returning true instead of false.
It looks like it is comparing it like this
'08' < '20'
instead of like
'0800' < '20'
Does a CHAR comparison start from the left and continue until one or the other value ends?
If so, how do I fix this?
My values can contain letters, so converting to numeric is not an option.
It's not comparing '08' with '20', it is, as you expect, comparing '0800' with '20'.
What you don't seem to expect, however, is that '0800' (the string) is indeed less than '20' (the string).
If converting it to numerics for a numeric comparison is out of the question, you could use the following DB2 function:
right ('0000000000'||val,10)
which will give you val padded on the left with zeroes to a size of 10 (ideal for a CHAR(10), for example). That will at least guarantee that the fields are the same size and the comparison will work for your particular case. But I urge you to rethink how you're doing things: per-row functions rarely scale well, performance-wise.
If you're using z/OS, you should have a few DBAs just lying around on the computer room floor waiting for work - you can probably ask one of them for advice more tailored to your specific application :-)
One thing that comes to mind is the use of an insert/update trigger and a secondary column PRI_CODE_PADDED to hold the PRI_CODE column fully padded out (using the same method as above). Then make sure your PriCode variable is similarly formatted before executing the select ... where PRI_CODE_PADDED < PriCode.
Incurring that cost at insert/update time will amortise it over all the selects you're likely to do (which, because they're no longer using per-row functions, will be blindingly fast), giving you better overall performance (assuming your database isn't one of those incredibly rare beasts that are written more than read, of course).
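On the application side, formatting the PriCode variable to match might look something like this in C# (the names and the width of 10 come from the discussion above; the command/parameter plumbing is assumed):

// Pad the comparison value the same way the PRI_CODE_PADDED column is padded,
// so the fixed-width character comparison behaves like a numeric-style one.
string priCode = "20";
string priCodePadded = priCode.PadLeft(10, '0');   // "0000000020"

// ... WHERE PRI_CODE_PADDED < ?   (bind priCodePadded as the parameter)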

Is there a standard way to count statements in C#

I was looking at some code length metrics other than Lines of Code. Something that Source Monitor reports is statements. This seemed like a valuable thing to know, but the way Source Monitor counted some things seemed unintuitive. For example, a for statement is one statement, even though it contains a variable definition, a condition, and an increment statement. And if a method call is nested in an argument list to another method, the whole thing is considered one statement.
Is there a standard way that statements are counted, and are there rules governing such a thing?
The first rule of metrics is "be careful what you measure". You ask for a count of statements, that's what you're going to get. As you note, that figure is perhaps not actually relevant.
If you're interested in other measures, like how "complex" code is, consider looking into other code metrics, like cyclomatic complexity.
http://en.wikipedia.org/wiki/Cyclomatic_complexity
UPDATE: Re: your comment
I agree that "doing too much" is an interesting metric. My rule of thumb is that one statement should have one side effect (usually a "local" side effect like mutating a local variable, but sometimes a visible side effect, like writing to a file) and therefore "number of statements" should be roughly correlated with how much the method is "doing" in terms of its number of side effects.
In practice, of course no one's code, my own included, actually meets that bar all the time. You might consider a metric for "how much the method is doing" to count not just statements but also, say, method calls.
To actually answer your question: I'm not aware of any industry standard that regulates what "number of statements" is. The C# specification certainly defines what a "statement" is lexically, but then of course you have to do some interpretation to do a count. For example:
void M()
{
    try
    {
        if (blah)
        {
            Frob();
            Blob();
        }
    }
    catch (Exception ex)
    { /* eat it */ }
    finally
    {
        Grob();
    }
}
How many statements are there in M? Well, the body of M consists of one statement, a try-catch-finally. So is the answer one? The body of the try contains one statement, an "if" statement. The consequence of the "if" contains one statement -- remember, a block is a statement. The block contains two statements. The finally contains one statement. The catch block contains no statements -- a catch block is not a statement, lexically -- but it certainly is highly relevant to the operation of the method!
So how many statements is that altogether? One could make a reasonable case for any number from one to six, depending on whether you count blocks as "real" statements, whether you consider child statements as in addition to their parent statement or not, and so on. There is no standards body which regulates the answer to this question that I'm aware of.
The closest you might get to a formal definition of "what is a statement" would be the C# specification itself. Good luck working out whether a particular tool's measurement agrees with your reading of the specification.
Given that metrics are best used as a guide to better/worse code, and not a strict formula, does the exact definition used by the tool make much difference?
If I have three methods, with "statement lengths" of 2500, 1500 and 150, I know which method I'll be examining first; that another tool might report 2480, 1620 and 174 isn't too important.
One of the best tools I've seen for measuring metrics is NDepend, though again I'm not 100% sure what definitions it is using. According to the website, NDepend has 82 separate metrics, including Number of instructions and Cyclomatic Complexity.
The C# Metrics Tool defines the things being counted ("statements", "operands", etc.) by using a precise C# BNF language definition. (In fact, it parses the code according to a full C# grammar and then computes structural metrics by walking over the parse tree; the SLOC count it gets by counting lines, as you'd expect.)
You might still argue that such a definition is unintuitive (grammars rarely are intuitive), but it is precise. I agree with other posters here, however, that the precise measure isn't as important as the relative value that one block of code has with respect to another. A complexity value of "173.92" just isn't very helpful by itself; compared to another complexity value of "81.02", we can say there's a good indication that the first one is more complex than the second, and that's enough to provide a focus of attention.
I think that metrics are also useful in trending; if last week this code was "81.02" complex, and this week it is "173.92", I should wonder why all that is happening in this part of the code.
You might also consider a ratio of a structural metric (e.g., cyclomatic complexity) to SLOC as an indication of "doing too much", or at least an indication of writing code that is way too dense to understand.
One simple metric is to just count the punctuation marks (;, ,, .) between tokens (so as to avoid those in strings, comments, or numbers). Thus, for (x = 0, y = 1; x < foo.Count; x++, y++) bar[y] = foo[x]; would count as 6.
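A rough sketch of such a counter - it skips string and char literals, both comment styles, and digit-preceded decimal points, but makes no attempt to handle verbatim strings or preprocessor lines:

static class PunctuationMetric
{
    // Counts ';', ',' and '.' that appear between tokens, skipping string/char
    // literals, comments, and the decimal point of numeric literals.
    public static int Count(string source)
    {
        int count = 0;
        bool inString = false, inChar = false, inLineComment = false, inBlockComment = false;

        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            char next = i + 1 < source.Length ? source[i + 1] : '\0';

            if (inLineComment)  { if (c == '\n') inLineComment = false; continue; }
            if (inBlockComment) { if (c == '*' && next == '/') { inBlockComment = false; i++; } continue; }
            if (inString)       { if (c == '\\') i++; else if (c == '"') inString = false; continue; }
            if (inChar)         { if (c == '\\') i++; else if (c == '\'') inChar = false; continue; }

            if (c == '/' && next == '/') { inLineComment = true; i++; }
            else if (c == '/' && next == '*') { inBlockComment = true; i++; }
            else if (c == '"') inString = true;
            else if (c == '\'') inChar = true;
            else if (c == '.' && i > 0 && char.IsDigit(source[i - 1])) { /* crude decimal-point skip */ }
            else if (c == ';' || c == ',' || c == '.') count++;
        }
        return count;
    }
}

Run against the for-loop example above, it should report 6.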

When querying with LINQ-to-XML, is it better/more efficient to leave element values as strings or convert them to the correct type?

I'm constantly running up against this when writing queries with LINQ-to-XML: the Value property of an XElement is a string, but the data may actually be an integer, boolean, etc.
Let's say I have a "where" clause in my query that checks if an ID stored in an XElement matches a local (integer) variable called "id". There are two ways I could do this.
1. Convert "id" to string
string idString = id.ToString();
IEnumerable<XElement> elements =
    from b in TableDictionary["bicycles"].Elements()
    where b.Element(_ns + "id").Value == idString
    select b;
2. Convert element value to int
IEnumerable<XElement> elements =
    from b in TableDictionary["bicycles"].Elements()
    where int.Parse(b.Element(_ns + "id").Value) == id
    select b;
I like option 2 because it does the comparison on the correct type. Technically, I could see a scenario where converting a decimal or double to a string would cause me to compare "1.0" to "1" (which would be unequal) versus Decimal(1.0) to Decimal(1) (which would be equal). Although a where clause involving decimals is probably pretty rare, I could see an OrderBy on a decimal column--in that case, this would be a very real issue.
A potential downside of this strategy, however, is that parsing tons of strings in a query could result in a performance hit (although I have no idea if it would be significant for a typical query). It might be more efficient to only parse element values when there is a risk that a string comparison would result in a different result than a comparison of the correct value type.
So, do you parse your element values religiously or only when necessary? Why?
Thanks!
EDIT:
I discovered a much less cumbersome syntax for doing the conversion.
3. Cast element to int
IEnumerable<XElement> elements =
    from b in TableDictionary["bicycles"].Elements()
    where (int)b.Element(_ns + "id") == id
    select b;
I think this will be my preferred method from now on...unless someone talks me out of it :)
EDIT II:
It occurred to me since posting my question that: THIS IS XML. If I really had enough data for performance to be an issue, I would probably be using a real database. So, yet another reason to go with casting.
It's difficult to assess the performance issues here without measuring. But I think you have two scenarios.
If you need to use most (or all) of the values in an expression sooner or later, then it is probably best to pay the CPU costs of converting to native types up front - discarding the XML string data early.
If you are only going to touch (evaluate or use) a few of the values, then it will most likely be cheaper in terms of CPU time to convert string data to native types lazily - at (or temporally close to) the time of consumption.
Now, this is just the CPU time considerations. I suggest that it is likely that the data itself will take up considerably less memory once converted to native value types. This lets you discard the string (XML) data early.
In short, it is rare for questions like this to have black or white answers: it will depend on your scenario, the complexity of the data, how much data there is, and when it will be used (touched or evaluated).
Update
In Dan's comment to my original answer, he asks for a general rule of thumb for cases where there is no time, or no reason, to do detailed measurements.
My suggestion is to prefer conversion to native types at XML parsing time, rather than keeping the string data around and parsing lazily. Here is my reasoning:
The code will already be burning some CPU, I/O, and memory resources at parsing time.
The code is likely to be simpler doing the conversions at load time (rather than at another time), as this can all be coded in a simple procedural way.
This is likely to be more memory efficient as well.
When the data needs to be used, it is already in a native format - this will be much better performing than dealing with string data at consumption time: comparisons and computation with native types will usually be much more efficient than dealing with data in string format. This is likely to keep the consuming code simpler as well.
Again, I'm suggesting this as a rule of thumb :) There will be scenarios where another approach is more optimal from a performance standpoint, or will make the code 'better' in some way (more cohesive, modular, easier to maintain, etc).
This is one of those cases where you will most likely need to measure the results to be sure you are doing the right thing.
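As a rough sketch of that convert-at-load-time approach - the Bicycle type and the "model" element are invented for illustration, while the "id" element, _ns, and TableDictionary come from the question:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical typed record; the fields are illustrative only.
class Bicycle
{
    public int Id;
    public string Model;
}

static class BicycleLoader
{
    // Convert to native types once, at load time, then query the typed objects.
    // 'table' would be TableDictionary["bicycles"] and 'ns' the _ns field from the question.
    public static List<Bicycle> Load(XElement table, XNamespace ns)
    {
        return table.Elements()
            .Select(b => new Bicycle
            {
                Id = (int)b.Element(ns + "id"),
                Model = (string)b.Element(ns + "model")
            })
            .ToList();
    }
}

After loading, the where clause becomes a plain integer comparison against Bicycle.Id, with no per-query parsing.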
I agree with your second edit. If performance is an issue, you will gain much more by using a more queryable data structure (or just cache a dictionary by ID from your XML for repeated lookups) than by changing how you compare/parse values.
That said, my preference would be using the various explicit cast overrides on XElement. Also, if your ID could ever be empty (better safe than sorry), you can also do an efficient cast to int?.
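For example, reusing the names from the question:

// (int?) yields null when the <id> element is missing, instead of throwing,
// so the comparison simply evaluates to false for such rows.
IEnumerable<XElement> elements =
    from b in TableDictionary["bicycles"].Elements()
    where (int?)b.Element(_ns + "id") == id
    select b;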

Why are strings notoriously expensive

What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks
Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.
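In C#, for example, a repeated-concatenation loop rewritten with StringBuilder looks like this (the loop body is just an illustration):

using System.Text;

// Appends into an internal buffer instead of allocating a brand-new
// immutable string on every '+' concatenation.
var sb = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
    sb.Append("line ").Append(i).AppendLine();
}
string result = sb.ToString();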
The problem with strings is that they are not primitive types. They are arrays.
Therefore, they suffer the same speed and memory problems as arrays (with a few optimizations, perhaps).
Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.
Since Java creates a new copy of the String object on every modification, it's advisable to use StringBuffer.
Syntax
StringBuffer strBuff = new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String result = strBuff.toString();
Many of the points here are well taken. In isolated cases you may be able to cheat and do things like using a 64-bit int to compare 8 bytes of a string at a time, but there are not a lot of generalized cases where you can optimize operations. If you have a "Pascal style" string with a numeric length field, comparisons can be short-circuited to check the rest of the string only when the lengths are the same. Other operations typically require you to handle the characters a byte at a time or completely copy them when you use them.
i.e. concatenation => get length of string 1, get length of string 2, allocate memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string library, but the overhead of setting it up for small strings would outweigh the benefits.
Pete
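In C# terms, that concatenation sequence looks conceptually like this (an illustration only, not how the framework's string.Concat is actually written):

// Conceptual illustration: measure, allocate, copy, copy.
static string ConcatSketch(string s1, string s2)
{
    char[] buffer = new char[s1.Length + s2.Length]; // allocate memory
    s1.CopyTo(0, buffer, 0, s1.Length);              // copy string 1
    s2.CopyTo(0, buffer, s1.Length, s2.Length);      // copy string 2
    return new string(buffer);                       // plus one more copy into the new string
}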
It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.
Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.
If you want a universal string that works in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between making one thing fast at the expense of another. So... either you use a "standard" string that works properly (but not optimally) everywhere, or a string implementation that is very fast in some cases and cumbersome in others.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...
Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.
You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.
