I read in CLR via C# by Jeffrey Richter that String.ToUpperInvariant() is faster than String.ToLowerInvariant(). He says that this is because the FCL uses ToUpperInvariant to normalise strings, so the method is ultra-optimised. Running a couple of quick tests on my machine, I concur that ToUpperInvariant() is indeed slightly faster.
My question is whether anybody knows how the method is actually optimised at a technical level, and/or why the same optimisations were not applied to ToLowerInvariant() as well.
Concerning the "duplicate": The proposed "duplicate" question really doesn't provide an answer to my question. I understand the benefits of using ToUpperInvariant instead of ToLowerInvariant, but what I would like to know is how/why ToUpperInvariant performs better. This point is not addressed in the "duplicate".
Since it is now easier to read the CLR source which implements InternalChangeCaseString, we can see that it mostly calls down to the Win32 function LCMapStringEx. There appear to be no notes or any discussion on the performance of passing LCMAP_UPPERCASE vs. LCMAP_LOWERCASE for the dwMapFlags parameter. InternalChangeCaseString takes an isToUpper flag which, if true, might allow slightly better optimization by the compiler (or JITter), but since the call to LCMapStringEx has to set up a P/Invoke call frame and the call itself has to do work, I'm not sure a lot of time is saved there.
Perhaps the recommendation is a holdover from some other implementation, but I can't see anything that would provide a significant speed advantage one way or the other.
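For anyone who wants to repeat the comparison themselves, here is a minimal Stopwatch-based sketch (the input string and iteration count are arbitrary choices of mine, not from the book); differences this small are easily swamped by noise, so treat the numbers with caution:

using System;
using System.Diagnostics;

class CaseBenchmark
{
    static void Main()
    {
        // Arbitrary mixed-case input so both mappings do real work.
        string s = "The Quick Brown Fox Jumps Over The Lazy Dog 1234567890";
        const int iterations = 5000000;

        // Warm up so the JIT has compiled both paths before timing.
        s.ToUpperInvariant();
        s.ToLowerInvariant();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) s.ToUpperInvariant();
        sw.Stop();
        Console.WriteLine("ToUpperInvariant: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++) s.ToLowerInvariant();
        sw.Stop();
        Console.WriteLine("ToLowerInvariant: " + sw.ElapsedMilliseconds + " ms");
    }
}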
I looked on MSDN and found the register keyword, but it's only available in C++.
Syntax:
register int x = 0;
Can you tell me how to do that with C#?
There is no way to do that in C#. C# is compiled to MSIL, which is then compiled to native code by the JIT.
It's the JIT that will decide whether a variable will go into a register or not. You shouldn't worry about this.
As MSIL is meant to be run on different architectures, it wouldn't make much sense to include such a feature in the language. Different architectures have a different number of registers, which may be of different sizes. That's why it's the JIT's job to optimize this.
By using a keyword? No.
With unmanaged code, you certainly can though... I mean, you really don't want to... but you can : )
It is useful in extreme optimizations, where you know for sure that you can do better than the JIT Compiler. However, in those circumstances, you should probably be looking at straight unmanaged C anyway. So, I strongly urge you to do that if you can.
Let's assume you can't, and this absolutely positively must be done from C#
C# is compiled to MSIL, which takes those choices out of your hands. It actually does quite well too, so well in fact that there's rarely a need to optimize by hand. But, with C# being a managed language you have to step into an unmanaged section to do it.
There are several methods, both with and without reflection, and using both inline and external code.
Firstly, you might compile that small fast section in C, ASM or some other unmanaged language as a DLL and call it unmanaged from C# in much the same way you'd call WinAPI functions... pay attention to calling conventions, there are several and each places a slightly different burden on caller/callee... for example, in terms of how parameters are passed and who clears up the stack afterwards.
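As a rough illustration of that first option, a hedged sketch of the C# side (the DLL name FastMath.dll and the exported function TransformVectors are made up here; the signature and calling convention must match whatever you actually built):

using System;
using System.Runtime.InteropServices;

static class NativeMath
{
    // Hypothetical export from a hand-optimised C/ASM DLL.
    [DllImport("FastMath.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void TransformVectors(float[] vectors, int count);

    public static void Transform(float[] vectors)
    {
        // The marshaller pins the array for the duration of the call.
        TransformVectors(vectors, vectors.Length / 4);
    }
}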
Alternatively, you could use fasmNET or similar to include inline assembly for any routines which must be ultra-fast. fasmNET can compile strings of assembler in C# (at runtime) into a blob of memory which can then be called unmanaged from C#... many examples exist online.
Alternatively, you could externally compile just the instructions you need, provide them as a byte array yourself, and call the byte array as code in the same manner as above, but without a runtime compilation step.
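A hedged sketch of that byte-array approach on x64 Windows (the machine code below is a trivial two-int add; it assumes the x64 Windows calling convention, and getting any of this wrong will crash the process):

using System;
using System.Runtime.InteropServices;

class InlineCode
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    delegate int AddDelegate(int a, int b);

    static void Main()
    {
        // x64: mov eax, ecx / add eax, edx / ret  (first two int args arrive in ECX, EDX)
        byte[] code = { 0x8B, 0xC1, 0x03, 0xC2, 0xC3 };

        // MEM_COMMIT | MEM_RESERVE = 0x3000, PAGE_EXECUTE_READWRITE = 0x40
        IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)code.Length, 0x3000, 0x40);
        Marshal.Copy(code, 0, mem, code.Length);

        var add = (AddDelegate)Marshal.GetDelegateForFunctionPointer(mem, typeof(AddDelegate));
        Console.WriteLine(add(2, 3)); // prints 5
    }
}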
There are also many tricks you can do with inline IL that can help you fine-tune your code without the JIT compiler's involvement; these may or may not be useful to you depending on your project. Custom IL sections can be accomplished both with inline IL and dynamic IL and can give you considerably more control over how your C# application runs.
Depending on how often you need to switch back and forth between managed and unmanaged, you can also create a separate application domain from your code, and load your unmanaged code into that... this can help you separate the managed/unmanaged concerns and thus avoid any costly switching back and forth.
But...
I will not give code, as to how you do it depends greatly upon what you're trying to accomplish. This is not the type of thing where you should just paste a code snippet into your project - you need to research the various methods, learn about their overheads and drawbacks, and then implement them with care, wisdom and due diligence.
Personally, I'd suggest learning C and offloading such computationally important tasks as an external service. This has the added advantage of allowing you to use processor affinity to best effect. It also allows you to write clean, normal, sensible C# for your head end.
But trust me, if your code is too slow and you think using registers for a few variables will speed things up... well... 95% of the time, it absolutely won't. C# does a tonne of work behind the scenes to wrangle those CPU resources as effectively as possible ... if you step in and snatch control of a few registers from it, it will usually end up producing less optimal code overall.
So, if pressed to guess at your best strategy, I'd suggest offloading that small task to a separate C program or service, and then use C# to throw it problems and gather output. Coupled with affinity, this can result in substantial speed gains. If you need to, it is also possible to set up shared memory between managed and unmanaged code - although this requires a lot of forward planning, may require experience using a good commercial debugger, and certainly isn't for the beginner.
Note that whichever way you go, portability WILL be adversely affected.
Re-evaluate whether you really need to do this at all. There are likely many more sensible and productive optimisations that can be done from within C#, in terms of the algorithm itself, which you should explore fully before going anywhere near the hardware.
You can't.
There aren't any real useful registers in IL and there is no guarantee that the target machine will have registers. The JIT or Ahead-of-time compiler will make those decisions for you.
I am originally a native C++ programmer. In C++, everything your program does is bound to your code, i.e., nothing happens unless you want it to happen, and every bit of memory is allocated (and deallocated) according to what you wrote. So performance is entirely your responsibility: if you do well, you get great performance.
(Note: please don't complain about code one hasn't written oneself, such as the STL; it's unmanaged C++ code after all, and that's the significant part.)
But in managed code, such as Java or C#, you don't control every process, and memory is "hidden", or not under your control, to some extent. And that makes performance relatively unknown; mostly you fear bad performance.
So my question is: what issues and key points should I look out for and keep in mind in order to achieve good performance in managed code?
I could think only of some practices such as:
Being aware of boxing and unboxing.
Choosing the correct Collection that best suits your needs and has the lowest operation cost.
But these never seem to be enough, or even convincing! In fact, perhaps I shouldn't have mentioned them.
Please note I am not asking for a C++ VS C# (or Java) code comparing, I just mentioned C++ to explain the problem.
There is no single answer here. The only way to answer this is: profile. Measure early and often. The bottlenecks are usually not where you expect them. Optimize the things that actually hurt. We use mvc-mini-profiler for this, but any similar tool will work.
You seem to be focusing on GC; now, that can sometimes be an issue, but usually only in specific cases; for the majority of systems the generational GC works great.
Obviously external resources will be slow; caching may be critical. In odd scenarios with very-long-lived data there are tricks you can do with structs to avoid long gen-2 collects; serialization (files, network, etc.), materialization (ORM), or just bad collection/algorithm choice may be the biggest issue - you cannot know until you measure.
Two things though:
make sure you understand what IDisposable and "using" mean
don't concatenate strings in loops; mass concatenation is the job of StringBuilder
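To illustrate that last point, a minimal sketch of the loop-concatenation anti-pattern and its StringBuilder equivalent:

using System.Text;

static class StringJoinDemo
{
    // Bad: each += allocates a brand-new string, O(n^2) copying overall.
    public static string BuildSlow(string[] parts)
    {
        string result = "";
        foreach (var p in parts) result += p;
        return result;
    }

    // Better: StringBuilder appends into a growable buffer.
    public static string BuildFast(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (var p in parts) sb.Append(p);
        return sb.ToString();
    }
}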
Reusing large objects is very important in my experience.
Objects on the large object heap are implicitly generation 2, and thus require a full GC to clean up. And that's expensive.
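A hedged sketch of what "reusing large objects" can look like in practice (the ~85,000-byte large-object threshold is the documented default; the class and stream usage here are just illustrative):

using System.IO;

class BufferedReader
{
    // Allocating a new large buffer per call would put each one on the LOH;
    // reusing a single buffer avoids repeated full (gen-2) collections.
    private readonly byte[] buffer = new byte[1024 * 1024]; // ~1 MB, lives on the LOH

    public int Process(Stream stream)
    {
        int total = 0, read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            total += read;   // work with buffer[0..read) here
        return total;
    }
}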
The main thing to keep in mind with performance with managed languages is that your code can change structure at runtime to be better optimized.
For example, the default JVM most people use is Sun's HotSpot VM, which will actually optimize your code as it runs by converting hot parts of the program to native code, inlining on the fly, and performing other optimizations (the CLR and other managed runtimes do the same kind of thing), which you will never get using C++.
Additionally, HotSpot will also detect which parts of your code are used the most and optimize accordingly.
So, as you can see, optimising performance on a managed system is slightly harder than on an unmanaged system, because you have an intermediate layer that can make code faster without your intervention.
I am going to invoke the law of premature optimization here and say that you should first create the correct solution then, if performance becomes an issue, go back and measure what is actually slow before attempting to optimize.
I would suggest understanding better garbage collection algorithms. You can find good books on that matter, e.g. The Garbage Collection Handbook (by Richard Jones, Antony Hosking, Eliot Moss).
Then, your question is really tied to a particular implementation, and perhaps even to a specific version of it. For instance, Mono used (e.g. in version 2.4) to use Boehm's garbage collector, but now uses a copying generational one.
And don't forget that some GC techniques can be remarkably efficient. Remember A.Appel's old paper Garbage Collection can be faster than stack allocation (but today, the cache performance matters much much more, so details are different).
I think that being aware of boxing (& unboxing) and allocation is enough. Some compilers are able to optimize these (by avoiding some of them).
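For example, a quick sketch (mine, not from the original answer) of where boxing sneaks in and how to avoid it:

using System.Collections;
using System.Collections.Generic;

class BoxingDemo
{
    static void Main()
    {
        // Boxing: every int stored in an ArrayList is wrapped in a heap object.
        var boxed = new ArrayList();
        for (int i = 0; i < 1000; i++) boxed.Add(i);      // 1000 boxes allocated

        // No boxing: List<int> stores the ints inline in its backing array.
        var unboxed = new List<int>();
        for (int i = 0; i < 1000; i++) unboxed.Add(i);    // no per-item heap allocation
    }
}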
Don't forget that GC performance can vary widely. There are good GCs (for your application) and bad ones.
And some GC implementations are quite fast, for example the one inside OCaml.
I would not bother that much: premature optimization is evil.
(And C++ memory management, even with smart pointers or reference counting, can often be viewed as a poor man's garbage collection technique; and you don't have full control over what C++ is doing, unless you re-implement your ::operator new using operating-system-specific system calls, so you don't really know its performance a priori.)
.NET Generics don't specialize on reference types, which severely limits how much inlining can be done. It may (in certain performance hotspots) make sense to forgo a generic container type in favor of a specific implementation that will be better optimized. (Note: this doesn't mean to use .NET 1.x containers with element type object).
As I understand it, recursive functions are generally less efficient than equivalent non-recursive functions because of the overhead of function calls. However, I have recently encountered a textbook saying this is not necessarily true with Java (and C#).
It does not say why, but I assume this might be because the Java compiler optimizes recursive functions in some way.
Does anyone know the details of why this is so?
The textbook is probably referring to tail-call optimization; see @Travis's answer for details.
However, the textbook is incorrect in the context of Java. Current Java compilers do not implement tail-call optimization, apparently because it would interfere with the Java security implementation, and would alter the behaviour of applications that introspect on the call stack for various purposes.
References:
Does the JVM prevent tail call optimizations?
This Sun bug requesting tail-call support ... still open.
This page (and the referenced paper) suggest that perhaps it wouldn't be that hard after all ...
There are hints that tail-call optimization might make it into Java 8.
This is usually only true for tail-recursion (http://en.wikipedia.org/wiki/Tail_call).
Tail recursion is semantically equivalent to an iterative loop, and can therefore be optimized into a loop. Below is a quote from the article that I linked to (emphasis mine):
Tail calls are significant because they can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate. The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination, or tail call optimization.
In functional programming languages, tail call elimination is often guaranteed by the language standard, and this guarantee allows using recursion, in particular tail recursion, in place of loops.
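To make the idea concrete, here is a sketch of a tail-recursive sum and the loop it is equivalent to (note that, as other answers point out, neither the JVM nor the current CLR JITs are guaranteed to actually perform this rewrite for you):

// Tail-recursive: the recursive call is the last thing the method does,
// so in principle the frame could be reused instead of growing the stack.
static long SumTail(int n, long acc)
{
    if (n == 0) return acc;
    return SumTail(n - 1, acc + n);
}

// The loop a tail-call-optimising compiler would effectively produce.
static long SumLoop(int n)
{
    long acc = 0;
    while (n != 0) { acc += n; n--; }
    return acc;
}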
Some reasons why recursive implementations can be as efficient as iterative ones under certain circumstances:
Compilers can be clever enough to optimise out the function call for certain functions, e.g. by converting a tail-recursive function into a loop. I strongly suspect some of the modern JIT compilers for Java do this.
Modern processors do branch prediction and speculative execution, which can mean that the cost of a function call is minimal, or at least not much more than the cost of an iterative loop
In situations where you need a small amount of local storage on each level of recursion, it is often more efficient to put this on the stack via a recursive function call than to allocate it in some other way (e.g. via a queue in heap memory).
My general advice however is don't bother worrying about this - the difference is so small that it is very unlikely to make a difference in your overall performance.
Guy Steele, one of the fathers of Java, wrote a paper in 1977
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, LAMBDA: The Ultimate GOTO
Abstract:
Folklore states that GOTO statements are "cheap", while procedure calls are "expensive". This myth is largely a result of poorly designed language implementations.
That's funny, because even today, Java has no tail call optimization:)
To the best of my knowledge, Java does not do any sort of recursion optimization. Knowing this is important - not because of efficiency, but because recursion at an excessive depth (a few thousand should do it) will cause a stack overflow and crash your program. (Really, considering the name of this site, I'm surprised nobody brought this up before me).
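A tiny sketch of what that failure mode looks like (the exact depth at which it blows up depends on the thread's stack size and the frame size):

class DeepRecursion
{
    static int Depth(int n) => 1 + Depth(n + 1); // no base case: recurses until the stack runs out

    static void Main()
    {
        // Throws StackOverflowException and terminates the process;
        // since .NET 2.0 it cannot be caught by a catch block.
        Depth(0);
    }
}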
I don't think so. In my experience solving programming problems on sites like UVA or SPOJ, I had to remove the recursion in order to solve the problem within the established time limit.
One way to think about it: every time a recursive call occurs, the JVM must allocate a stack frame for the function being called, whereas in a non-recursive version most of that memory is already allocated.
I'm doing a presentation in a few months about .NET performance and optimization, and I wanted to provide some samples of unnecessary optimization, i.e. things that will be done by the compiler anyway.
Where can I find some explanation of what optimizations the compiler is actually capable of, maybe with some before-and-after code?
check out these links
C# Compiler Optimizations
compiler optimization
msdn
Also check out this book on MSIL:
Microsoft Intermediate Language: Comparison Between C# and VB.NET / Niranjan Kumar
What I think would be even better than examples of "things that will be done by the compiler anyways" would be examples of scenarios where the compiler doesn't perform "optimizations" that the developer assumes will yield a performance improvement but which, in fact, won't.
For example, sometimes a developer will assume that caching a value locally will improve performance, when actually the saving of having one less value on the stack outweighs the minuscule cost of a field access that can be inlined.
Or the developer might assume that "force-inlining" a method call (essentially by stripping out the call itself and replacing with copied/pasted code) will be worthwhile, when in reality keeping the method call as-is would result in its getting inlined by the compiler only when it makes sense (when the benefit of inlining outweighs the growth in code size).
This is only a general idea, of course. I don't have concrete code samples that I can point to; but maybe you can scrounge some up if you look for them.
I would like to preface this with I'm not trying to start a fight. I was wondering if anyone had any good resources that compared C++ and C# for mathematically intensive code? My gut impression is that C# should be significantly slower, but I really have no evidence for this feeling. I was wondering if anyone here has ever run across a study or tested this themselves? I plan on running some tests myself, but would like to know if anyone has done this in a rigorous manner (google shows very little). Thanks.
EDIT: For intensive, I mean a lot of sin/cos/exp happening in tight loops
I have to periodically compare the performance of core math under runtimes and languages as part of my job.
In my most recent test, comparing C# against my optimized C++ control case on the key benchmark (transforming a long array of 4D vectors by a 4D matrix with a final normalize step), C++ was about 30x faster than C#. I can get a peak throughput of one vector every 1.8 ns in my C++ code, whereas C# got the job done in about 65 ns per vector.
This is of course a specialized case and the C++ isn't naive: it uses software pipelining, SIMD, cache prefetch, the whole nine yards of microoptimization.
C# will be slower in general, but not significantly so. In some cases, depending on the structure of the code, C# can actually be faster, as JIT analysis can frequently improve the performance of a long-running algorithm.
Edit: Here's a nice discussion of C# vs C++ performance
Edit 2:
"In general" is not really accurate. As you say, the JIT compiler can actually turn your MSIL into faster native code that the C++ compiler because it can optimize for the hardware it is running on.
You must admit, however, that the act of JIT compiling itself is resource intensive, and there are runtime checks that occur in managed code. Pre-compiled and pre-optimized code will always be faster than just JITted code. Every benchmark comparison shows it. But long-running processes that can have a fair amount of runtime analysis can be improved over pre-compiled, pre-optimized native code.
So what I said was 100% accurate. For the general case, managed code is slightly slower than pre-compiled, pre-optimized code. It's not always a significant performance hit, however, and for some cases JIT analysis can improve performance over pre-optimized native code.
For straight mathematical functions, asking whether C# is faster than C++ is not the best question. What you should be asking is:
Is the assembly produced by the CLR JITter more or less efficient than the assembly generated by the C++ compiler?
The C# compiler has much less influence on the speed of purely mathematical operations than the CLR JIT does. C# would have almost identical performance to other .NET languages (such as VB.NET, if you turn off overflow checking).
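As a small illustration of that overflow-checking point (a sketch of my own: C# arithmetic is unchecked by default, while VB.NET checks unless you disable it):

using System;

class OverflowDemo
{
    static void Main()
    {
        int a = int.MaxValue;

        // Unchecked (the C# default): wraps around silently, no runtime test emitted.
        int wrapped = unchecked(a + 1);
        Console.WriteLine(wrapped);      // -2147483648

        // Checked: the JIT emits an overflow test, as VB.NET does by default.
        int boom = checked(a + 1);       // throws OverflowException
        Console.WriteLine(boom);
    }
}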
There are extensive benchmarks here:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=csharp&lang2=gpp&box=1
Note this compares the Mono JIT to C++. AFAIK there are no extensive benchmarks of Microsoft's implementation out there, so almost everything you will hear is hearsay. :(
I think you're asking the wrong question. You should be asking if C++ can beat out the .NET family of languages in mathematical computation. Have a gander at F# timing comparisons for Runge Kutta.
You do not define "mathematically intensive" very well (understatement for: not at all).
An attempt at a breakdown:
For the basic Sin/Cos/Log functions I would not expect much difference.
For linear algebra (matrices) I would expect .NET to lose out; the (always enforced) bounds checking on arrays is only optimized away under some circumstances (see the sketch below).
You will probably have to benchmark something close to your intended domain.
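As an aside on that bounds-checking point, a sketch of the one loop pattern the JIT reliably recognises (comparing the index against arr.Length lets it hoist the check; comparing against a separately stored length generally does not):

static double SumChecked(double[] a, int n)
{
    double sum = 0;
    // 'n' is not provably the array length, so each a[i] keeps its bounds check.
    for (int i = 0; i < n; i++) sum += a[i];
    return sum;
}

static double SumUnchecked(double[] a)
{
    double sum = 0;
    // Comparing i directly against a.Length lets the JIT elide the per-element check.
    for (int i = 0; i < a.Length; i++) sum += a[i];
    return sum;
}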
I would consider using Mono.Simd to accelerate some operations. The downside is that on the MS runtime it's not accelerated.
I haven't checked recently, but the last time I did check, Microsoft's license agreement for the .NET runtime required you to agree NOT to publish any benchmarks of its performance. That tends to limit the amount of solid information that gets published.
A few others have implied it, but I'll state it directly: I think you're engaging in (extremely) premature optimization -- or trying to anyway.
Edit:
Doing a bit of looking, the license has changed (a long time ago, in fact). The current terms
say you're allowed to publish benchmarks -- but only if you meet their conditions. Some of those conditions look (to me) nearly impossible to meet. For example, you can only publish provided: "your benchmark testing was performed using all performance tuning and best practice guidance set forth in the product documentation and/or on Microsoft's support Web sites". Given the size and number of Microsoft's web sites, I don't see how anybody stands a chance of being certain they're following all the guidance they might provide.
Although that web page talks about .NET 1.1, the newer licenses seem to refer back to it as well.
So, what I remembered was technically wrong, but effectively correct anyway.
For basic math library functions there won't be much difference, because C# will call out to the same compiled code that C++ would use. For more interesting math that you won't find in the math library, there are several factors that make C# worse. The current JIT doesn't support the SSE instructions that you would have access to in C++.