How to optimize for dual, quad and higher multi-processors?

How to optimize for dual, quad and higher multi-processors? - c#

Folks, I've been programming high speed software over 20 years and know virtually every trick in the book from micro-bench making cooperative, profiling, user-mode multitasking, tail recursion, you name it for very high performance stuff on Linux, Windows, and more.
The problem is that I find myself befuddled by what happens when multiple threads of CPU intensive work are exposed to a multi-core processors.
The results from performance in micro benchmarks of various ways of sharing date between threads (on different cores) don't seem to follow logic.
It's clear that there is some "hidden interaction" between the cores which isn't obvious from my own programming code. I hear of L1 cache and other issues but those are opaque to me.
Question is: Where can I learn this stuff ? I am looking for an in depth book on how multi-core processors work, how to program to capitalize on their memory caches or other hardware architecture instead of being punished by them.
Any advice or great websites or books? After much Googling, I'm coming up empty.
Sincerely,
Wayne

This book taught me a lot about these sorts of issues about why raw CPU power is not necessary the only thing to pay attention to. I used it in grad school years ago, but I think all of the principles still apply:
http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901
And essentially a major issue in multi-process configurations is synchronizing the access to the main memory, if you don't do this right it can be a real bottleneck in the performance. It's pretty complex with the caches that have to be kept in sync.

my own question, with answer, on stackoverflow's sister site: https://softwareengineering.stackexchange.com/questions/126986/where-can-i-find-an-overview-of-known-multithreading-design-patterns/126993#126993
I will copy the answer to avoid the need for click-through:
Quote Boris:
Parallel Programming with Microsoft .NET: Design Patterns for
Decomposition and Coordination on Multicore Architectures https://rads.stackoverflow.com/amzn/click/0735651590
This is a book, I recommend wholeheartedly.
It is:
New - published last year. Means you are not reading somewhat outdated
practices.
Short - about 200+ pages, dense with information. These
days there is too much to read and too little time to read 1000+ pages
books.
Easy to read - not only it is very well written but it
introduces hard to grasps concepts in really simple to read way.
Intended to teach - each chapter gives exercises to do. I know it is
always beneficial to do these, but rarely do. This book gives very
compelling and interesting tasks. Surprisingly I did most of them and
enjoyed doing them.
additionally, if you wish to learn more of the low-level details, this is the best resource i have found: "The Art of Multiprocessor Programming" It's written using java as their code samples, which plays nicely with my C# background.
PS: I have about 5 years "hard core" parallel programming experience, (abet using C#) so hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks

My answer on "Are you concerned about multicores"
Herb Sutter's articles
Video Series on Parallel Programming

One specific cause of unexpected poor results in parallelized code is false sharing, you won't see that coming if you dont know what's going on down there (I didn't). Here a two articles that dicuss the cause and remedy for .Net:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
http://www.codeproject.com/KB/threads/FalseSharing.aspx
Rgds GJ

There are different aspects to multi-threading requiring different approaches.
On a webserver, for example, the use of thread-pools is widely used since it supposedly is "good for" performance. Such pools may contain hundreds of threads waiting to be put to work. Using that many threads will cause the scheduler to work overtime which is detrimental to performance but can't be avoided on Linux systems. For Windows the method of choice is the IOCP mechanism which recommends a number of threads not greater than the number of cores installed. It causes an application to become (I/O completion) event driven which means that no cycles are wasted on polling. The few threads involved reduce scheduler work to a minimum.
If the object is to implement a functionality that is scaleable (more cores <=> higher performance) then the main issue will be memory bus saturation. The saturation will occur due to code fetching, data reading and data writing. An incorrectly implemented code will run slower with two threads than with one. The only way around this is to reduce the memory bus work by actively:
tailoring the code to a minimal memory footprint (= fits in the code cache) and which doesn't call other functions or jump all over the place.
tailoring memory reads and writes to a minimum size.
informing the prefetch mechanism of coming RAM reads.
tailoring the work such that the ratio of work performed inside the core's own caches (L1 & L2) is as great as possible in relation to the work outside them (L3 & RAM).
To put this in another way: fit the applicable code and data chunks into as few cache lines (# 64 bytes each) as possible because ultimately this is what will decide the scaleability. If the cache/memory system is capable of x cache line operations every second your code will run faster if its requirements are five cache lines per unit of work (=> x/5) rather than eleven (x/11) or fifty-two (x/52).
Achieving this is not trivial since it requires a more or less unique solution every time. Some compilers do a good job of instruction ordering to take advantage of the host processor's pipelining. This does not necessarily mean that it will be a good ordering for multiple cores.
An efficient implementation of a scaleable code will not necessarily be a pretty one. Recommended coding techniques and styles may, in the end, hinder the code's execution.
My advice is to test how this works by writing a simple multi-threaded application in a low-level language (such as C) that can be adjusted to run in single or multi-threaded mode and then profiling the code for the different modes. You will need to analyze the code at the instruction level. Then you experiment using different (C) code constructs, data organization, etc. You may have to think outside the box and rethink the algorithm to make it more cache-friendly.
The first time will require lots of work. You will not learn what will work for all multi-threaded solutions but you will perhaps get an inkling of what not to do and what indications to look for when analyzing profiled code.

I found this link that specifically explains the issues of
multicore cache handling on CPUs that was affecting my
multithreaded program.
http://www.multicoreinfo.com/research/intel/mem-issues.pdf
The site multicoreinfo.com in general has lots of good
information and references about multicore programming.

Related

Does multi-threading equal less CPU?

I have a small list of rather large files that I want to process, which got me thinking...
In C#, I was thinking of using Parallel.ForEach of TPL to take advantage of modern multi-core CPUs, but my question is more of a hypothetical character;
Does the use of multi-threading in practicality mean that it would take longer time to load the files in parallel (using as many CPU-cores as possible), as opposed to loading each file sequentially (but with probably less CPU-utilization)?
Or to put it in another way (:
What is the point of multi-threading? More tasks in parallel but at a slower rate, as opposed to focusing all computing resources on one task at a time?

In order to not increase latency, parallel computational programs typically only create one thread per core. Applications which aren't purely computational tend to add more threads so that the number of runnable threads is the number of cores (the others are in I/O wait, and not competing for CPU time).
Now, parallelism on disk-I/O bound programs may well cause performance to decrease, if the disk has a non-negligible seek time then much more time will be wasted performing seeks and less time actually reading. This is called "churning" or "thrashing". Elevator sorting helps somewhat, true random access (such as solid state memories) helps more.
Parallelism does almost always increase the total raw work done, but this is only important if battery life is of foremost importance (and by the time you account for power used by other components, such as the screen backlight, completing quicker is often still more efficient overall).

You asked multiple questions, so I've broken up my response into multiple answers:
Multithreading may have no effect on loading speed, depending on what your bottleneck during loading is. If you're loading a lot of data off disk or a database, I/O may be your limiting factor. On the other hand if 'loading' involves doing a lot of CPU work with some data, you may get a speed up from using multithreading.
Generally speaking you can't focus "all computing resources on one task." Some multicore processors have the ability to overclock a single core in exchange for disabling other cores, but this speed boost is not equal to the potential performance benefit you would get from fully utilizing all of the cores using multithreading/multiprocessing. In other words it's asymmetrical -- if you have a 4 core 1Ghz CPU, it won't be able to overclock a single core all the way to 4ghz in exchange for disabling the others. In fact, that's the reason the industry is going multicore in the first place -- at least for now we've hit limits on how fast we can make a single CPU run, so instead we've gone the route of adding more CPUs.
There are 2 reasons for multithreading. The first is that you want to tasks to run at the same time simply because it's desirable for both to be able to happen simultaneously -- e.g. you want your GUI to continue to respond to clicks or keyboard presses while it's doing other work (event loops are another way to accomplish this though). The second is to utilize multiple cores to get a performance boost.

For loading files from disk, this is likely to make things much slower. What happens is the operating system tries to lay out files on disk such that you should only need to do an expensive disk seek once for each file. If you have a lot of threads reading a lot of files, you're gonna have contention over which thread has access to the disk, and you'll have to seek back to the right place in the file every time the next thread gets a turn.
What you can do is use exactly two threads. Set one to load all of the files in the background, and let the other remain available for other tasks, like handling user input. In C# winforms, you can do this easily with a BackgroundWorker control.

Multi-threading is useful for highly parallelizable tasks. CPU intensive tasks are perfect. Your CPU has many cores, many threads can use many cores. They'll use more CPU time, but in the end they'll use less "user" time. If your app is I/O bounded, then multithreading isn't always the solution (but it COULD help)

It might be helpful to first understand the difference between Multithreading and Parallelism, as more often than not I see them being used rather interchangeably. Joseph Albahari has written a quite interesting guide about the subject: Threading in C# - Part 5 - Parallelism

As with all great programming endeavors, it depends. By and large, you'll be requesting files from one physical store, or one physical controller which will serialize the requests anyhow (or worse, cause a LOT of head back-and-forth on a classical hard drive) and slow down the already slow I/O.
OTOH, if the controllers and the medium are separate, multiple cores loading data from them should be improved over a sequential method.

Divide work among processes or threads?

I am interning for a company this summer, and I got passed down this program which is a total piece. It does very computationally intensive operations throughout most of its duration. It takes about 5 minutes to complete a run on a small job, and the guy I work with said that the larger jobs have taken up to 4 days to run. My job is to find a way to make it go faster. My idea was that I could split the input in half and pass the halves to two new threads or processes, I was wondering if I could get some feedback on how effective that might be and whether threads or processes are the way to go.
Any inputs would be welcomed.
Hunter

I'd take a strong look at TPL that was introduced in .net4 :) PLINQ might be especially useful for easy speedups.
Genereally speaking, splitting into diffrent processes(exefiles) is inadvicable for perfomance since starting processes is expensive. It does have other merits such as isolation(if part of a program crashes) though, but i dont think they are applicable for your problem.

If the jobs are splittable, then going multithreaded/multiprocessed will bring better speed. That is assuming, of course, that the computer they run on actually has multiple cores/cpus.
Threads or processes doesn't really matter regarding speed (if the threads don't share data). The only reason to use processes that I know of is when a job is likely to crash an entire process, which is not likely in .NET.

Use threads if theres lots of memory sharing in your code but if you think you'd like to scale the program to run across multiple computers (when required cores > 16) then develop it using processes with a client/server model.

Best way when optimising code, always, is to Profile it to find out where the Logjam's are IMO.
Sometimes you can find non obvious huge speed increases with little effort.
Eqatec, and SlimTune are two free C# profilers which may be worth trying out.
(Of course the other comments about which parallelization architecture to use are spot on - it's just I prefer analysis first....

Have a look at the Task Parallel Library -- this sounds like a prime candidate problem for using it.
As for the threads vs processes dilemma: threads are fine unless there is a specific reason to use processes (e.g. if you were using buggy code that you couldn't fix, and you did not want a bad crash in that code to bring down your whole process).

Well if the problem has a parallel solution then this is the right way to (ideally) significantly (but not always) increase performance.
However, you don't control making additional processes except for running an app that launches multiple mini apps ... which is not going to help you with this problem.
You are going to need to utilize multiple threads. There is a pretty cool library added to .NET for parallel programming you should take a look at. I believe its namespace is System.Threading.Tasks or System.Threading with the Parallel class.
Edit: I would definitely suggest though, that you think about whether or not a linear solution may fit better. Sometimes parallel solutions would taken even longer. It all depends on the problem in question.

If you need to communicate/pass data, go with threads (and if you can go .Net 4, use the Task Parallel Library as others have suggested). If you don't need to pass info that much, I suggest processes (scales a bit better on multiple cores, you get the ability to do multiple computers in a client/server setup [server passes info to clients and gets a response, but other than that not much info passing], etc.).

Personally, I would invest my effort into profiling the application first. You can gain a much better awareness of where the problem spots are before attempting a fix. You can parallelize this problem all day long, but it will only give you a linear improvement in speed (assuming that it can be parallelized at all). But, if you can figure out how to transform the solution into something that only takes O(n) operations instead of O(n^2), for example, then you have hit the jackpot. I guess what I am saying is that you should not necessarily focus on parallelization.
You might find spots that are looping through collections to find specific items. Instead you can transform these loops into hash table lookups. You might find spots that do frequent sorting. Instead you could convert those frequent sorting operations into a single binary search tree (SortedDictionary) which maintains a sorted collection efficiently through the many add/remove operations. And maybe you will find spots that repeatedly make the same calculations. You can cache the results of already made calculations and look them up later if necessary.

Is C# really slower than say C++?

I've been wondering about this issue for a while now.
Of course there are things in C# that aren't optimized for speed, so using those objects or language tweaks (like LinQ) may cause the code to be slower.
But if you don't use any of those tweaks, but just compare the same pieces of code in C# and C++ (It's easy to translate one to another). Will it really be that much slower ?
I've seen comparisons that show that C# might be even faster in some cases, because in theory the JIT compiler should optimize the code in real time and get better results:
Managed Or Unmanaged?
We should remember that the JIT compiler compiles the code at real time, but that's a 1-time overhead, the same code (once reached and compiled) doesn't need to be compiled again at run time.
The GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects (like using String instead of StringBuilder). And doing that in C++ would also be costly.
Another point that I want to bring up is the better communication between DLLs introduced in .Net. The .Net platform communicates much better than Managed COM based DLLs.
I don't see any inherent reason why the language should be slower, and I don't really think that C# is slower than C++ (both from experience and lack of a good explanation)..
So, will a piece of the same code written in C# will be slower than the same code in C++ ?
In if so, then WHY ?
Some other reference (Which talk about that a bit, but with no explanation about WHY):
Why would you want to use C# if its slower than C++?

Warning: The question you've asked is really pretty complex -- probably much more so than you realize. As a result, this is a really long answer.
From a purely theoretical viewpoint, there's probably a simple answer to this: there's (probably) nothing about C# that truly prevents it from being as fast as C++. Despite the theory, however, there are some practical reasons that it is slower at some things under some circumstances.
I'll consider three basic areas of differences: language features, virtual machine execution, and garbage collection. The latter two often go together, but can be independent, so I'll look at them separately.
Language Features
C++ places a great deal of emphasis on templates, and features in the template system that are largely intended to allow as much as possible to be done at compile time, so from the viewpoint of the program, they're "static." Template meta-programming allows completely arbitrary computations to be carried out at compile time (I.e., the template system is Turing complete). As such, essentially anything that doesn't depend on input from the user can be computed at compile time, so at runtime it's simply a constant. Input to this can, however, include things like type information, so a great deal of what you'd do via reflection at runtime in C# is normally done at compile time via template metaprogramming in C++. There is definitely a trade-off between runtime speed and versatility though -- what templates can do, they do statically, but they simply can't do everything reflection can.
The differences in language features mean that almost any attempt at comparing the two languages simply by transliterating some C# into C++ (or vice versa) is likely to produce results somewhere between meaningless and misleading (and the same would be true for most other pairs of languages as well). The simple fact is that for anything larger than a couple lines of code or so, almost nobody is at all likely to use the languages the same way (or close enough to the same way) that such a comparison tells you anything about how those languages work in real life.
Virtual Machine
Like almost any reasonably modern VM, Microsoft's for .NET can and will do JIT (aka "dynamic") compilation. This represents a number of trade-offs though.
Primarily, optimizing code (like most other optimization problems) is largely an NP-complete problem. For anything but a truly trivial/toy program, you're pretty nearly guaranteed you won't truly "optimize" the result (i.e., you won't find the true optimum) -- the optimizer will simply make the code better than it was previously. Quite a few optimizations that are well known, however, take a substantial amount of time (and, often, memory) to execute. With a JIT compiler, the user is waiting while the compiler runs. Most of the more expensive optimization techniques are ruled out. Static compilation has two advantages: first of all, if it's slow (e.g., building a large system) it's typically carried out on a server, and nobody spends time waiting for it. Second, an executable can be generated once, and used many times by many people. The first minimizes the cost of optimization; the second amortizes the much smaller cost over a much larger number of executions.
As mentioned in the original question (and many other web sites) JIT compilation does have the possibility of greater awareness of the target environment, which should (at least theoretically) offset this advantage. There's no question that this factor can offset at least part of the disadvantage of static compilation. For a few rather specific types of code and target environments, it can even outweigh the advantages of static compilation, sometimes fairly dramatically. At least in my testing and experience, however, this is fairly unusual. Target dependent optimizations mostly seem to either make fairly small differences, or can only be applied (automatically, anyway) to fairly specific types of problems. Obvious times this would happen would be if you were running a relatively old program on a modern machine. An old program written in C++ would probably have been compiled to 32-bit code, and would continue to use 32-bit code even on a modern 64-bit processor. A program written in C# would have been compiled to byte code, which the VM would then compile to 64-bit machine code. If this program derived a substantial benefit from running as 64-bit code, that could give a substantial advantage. For a short time when 64-bit processors were fairly new, this happened a fair amount. Recent code that's likely to benefit from a 64-bit processor will usually be available compiled statically into 64-bit code though.
Using a VM also has a possibility of improving cache usage. Instructions for a VM are often more compact than native machine instructions. More of them can fit into a given amount of cache memory, so you stand a better chance of any given code being in cache when needed. This can help keep interpreted execution of VM code more competitive (in terms of speed) than most people would initially expect -- you can execute a lot of instructions on a modern CPU in the time taken by one cache miss.
It's also worth mentioning that this factor isn't necessarily different between the two at all. There's nothing preventing (for example) a C++ compiler from producing output intended to run on a virtual machine (with or without JIT). In fact, Microsoft's C++/CLI is nearly that -- an (almost) conforming C++ compiler (albeit, with a lot of extensions) that produces output intended to run on a virtual machine.
The reverse is also true: Microsoft now has .NET Native, which compiles C# (or VB.NET) code to a native executable. This gives performance that's generally much more like C++, but retains the features of C#/VB (e.g., C# compiled to native code still supports reflection). If you have performance intensive C# code, this may be helpful.
Garbage Collection
From what I've seen, I'd say garbage collection is the poorest-understood of these three factors. Just for an obvious example, the question here mentions: "GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects [...]". In reality, if you create and destroy thousands of objects, the overhead from garbage collection will generally be fairly low. .NET uses a generational scavenger, which is a variety of copying collector. The garbage collector works by starting from "places" (e.g., registers and execution stack) that pointers/references are known to be accessible. It then "chases" those pointers to objects that have been allocated on the heap. It examines those objects for further pointers/references, until it has followed all of them to the ends of any chains, and found all the objects that are (at least potentially) accessible. In the next step, it takes all of the objects that are (or at least might be) in use, and compacts the heap by copying all of them into a contiguous chunk at one end of the memory being managed in the heap. The rest of the memory is then free (modulo finalizers having to be run, but at least in well-written code, they're rare enough that I'll ignore them for the moment).
What this means is that if you create and destroy lots of objects, garbage collection adds very little overhead. The time taken by a garbage collection cycle depends almost entirely on the number of objects that have been created but not destroyed. The primary consequence of creating and destroying objects in a hurry is simply that the GC has to run more often, but each cycle will still be fast. If you create objects and don't destroy them, the GC will run more often and each cycle will be substantially slower as it spends more time chasing pointers to potentially-live objects, and it spends more time copying objects that are still in use.
To combat this, generational scavenging works on the assumption that objects that have remained "alive" for quite a while are likely to continue remaining alive for quite a while longer. Based on this, it has a system where objects that survive some number of garbage collection cycles get "tenured", and the garbage collector starts to simply assume they're still in use, so instead of copying them at every cycle, it simply leaves them alone. This is a valid assumption often enough that generational scavenging typically has considerably lower overhead than most other forms of GC.
"Manual" memory management is often just as poorly understood. Just for one example, many attempts at comparison assume that all manual memory management follows one specific model as well (e.g., best-fit allocation). This is often little (if any) closer to reality than many peoples' beliefs about garbage collection (e.g., the widespread assumption that it's normally done using reference counting).
Given the variety of strategies for both garbage collection and manual memory management, it's quite difficult to compare the two in terms of overall speed. Attempting to compare the speed of allocating and/or freeing memory (by itself) is pretty nearly guaranteed to produce results that are meaningless at best, and outright misleading at worst.
Bonus Topic: Benchmarks
Since quite a few blogs, web sites, magazine articles, etc., claim to provide "objective" evidence in one direction or another, I'll put in my two-cents worth on that subject as well.
Most of these benchmarks are a bit like teenagers deciding to race their cars, and whoever wins gets to keep both cars. The web sites differ in one crucial way though: they guy who's publishing the benchmark gets to drive both cars. By some strange chance, his car always wins, and everybody else has to settle for "trust me, I was really driving your car as fast as it would go."
It's easy to write a poor benchmark that produces results that mean next to nothing. Almost anybody with anywhere close to the skill necessary to design a benchmark that produces anything meaningful, also has the skill to produce one that will give the results he's decided he wants. In fact it's probably easier to write code to produce a specific result than code that will really produce meaningful results.
As my friend James Kanze put it, "never trust a benchmark you didn't falsify yourself."
Conclusion
There is no simple answer. I'm reasonably certain that I could flip a coin to choose the winner, then pick a number between (say) 1 and 20 for the percentage it would win by, and write some code that would look like a reasonable and fair benchmark, and produced that foregone conclusion (at least on some target processor--a different processor might change the percentage a bit).
As others have pointed out, for most code, speed is almost irrelevant. The corollary to that (which is much more often ignored) is that in the little code where speed does matter, it usually matters a lot. At least in my experience, for the code where it really does matter, C++ is almost always the winner. There are definitely factors that favor C#, but in practice they seem to be outweighed by factors that favor C++. You can certainly find benchmarks that will indicate the outcome of your choice, but when you write real code, you can almost always make it faster in C++ than in C#. It might (or might not) take more skill and/or effort to write, but it's virtually always possible.

Because you don't always need to use the (and I use this loosely) "fastest" language? I don't drive to work in a Ferrari just because it's faster...

Circa 2005 two MS performance experts from both sides of the native/managed fence tried to answer the same question. Their method and process are still fascinating and the conclusions still hold today - and I'm not aware of any better attempt to give an informed answer. They noted that a discussion of potential reasons for differences in performance is hypothetical and futile, and a true discussion must have some empirical basis for the real world impact of such differences.
So, the Old New Raymond Chen, and Rico Mariani set rules for a friendly competition. A Chinese/English dictionary was chosen as a toy application context: simple enough to be coded as a hobby side-project, yet complex enough to demonstrate non trivial data usage patterns. The rules started simple - Raymond coded a straightforward C++ implementation, Rico migrated it to C# line by line, with no sophistication whatsoever, and both implementations ran a benchmark. Afterwards, several iterations of optimizations ensued.
The full details are here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
This dialogue of titans is exceptionally educational and I whole heartily recommend to dive in - but if you lack the time or patience, Jeff Atwood compiled the bottom lines beautifully:
Eventually, C++ was 2x faster - but initially, it was 13x slower.
As Rico sums up:
So am I ashamed by my crushing defeat? Hardly. The managed code
achieved a very good result for hardly any effort. To defeat the
managed version, Raymond had to:
Write his own file/io stuff
Write his own string class
Write his own allocator
Write his own international mapping
Of course he used available lower level libraries to do this,
but that's still a lot of work. Can you call what's left an STL
program? I don't think so.
That is my experience still, 11 years and who knows how many C#/C++ versions later.
That is no coincidence, of course, as these two languages spectacularly achieve their vastly different design goals. C# wants to be used where development cost is the main consideration (still the majority of software), and C++ shines where you'd save no expenses to squeeze every last ounce of performance out of your machine: games, algo-trading, data-centers, etc.

C++ always have an edge for the performance. With C#, I don't get to handle memory and I have literally tons of resources available for me to do my job.
What you need to question yourself is more about which one saves you time. Machines are incredibly powerful now and most of your code should be done in a language that allows you to get the most value in the least amount of time.
If there is a core processing that takes way too long in C#, you could then build a C++ and interop your way to it with C#.
Stop thinking about your code performance. Start building value.

C# is faster than C++. Faster to write. For execution times, nothing beats a profiler.
But C# does not have as much libraries as C++ can interface easily.
And C# depends heavily on windows...

BTW, time critical applications are not coded in C# or Java, primarily due to uncertainty of when the Garbage Collection will be performed.
In modern times, application or execution speed is not as important as was previously. Development schedules, correctness and robustness are higher priorities. A high speed version of an application is no good if it has lots of bugs, crashes a lot or worse, misses an opportunity to get to market or be deployed.
Since development schedules are a priority, new languages are coming out that speed up development. C# is one of these. C# also assists in correctness and robustness by removing features from C++ that cause common problems: one example is pointers.
The differences in execution speed of an application developed with C# and one developed using C++ is negligible on most platforms. This is due to the fact that the execution bottlenecks are not language dependent but usually depend on the operating system or I/O. For example if C++ performs a function in 5 ms but C# uses 2ms, and waiting for data takes 2 seconds, the time spent in the function is insignificant compared to the time waiting for data.
Choose a language that is best suited for the developers, platform and projects. Work towards the goals of correctness, robustness and deployment. The speed of an application should be treated as a bug: prioritize it, compare to other bugs, and fix as necessary.

A better way to look at it everything is slower than C/C++ because it abstracts away rather than following the stick and mud paradigm. It's called systems programming for a reason, you program against the grain or bare metal. Doing so also grants you speed you cannot achieve with other languages like C# or Java. But alas C roots are all about doing things the hard way, so your mostly going to be writing more code and spending more time debugging it.
C is also case sensitive, also objects in C++ also follow strict rule sets. Example a purple ice cream cone may not be the same as a blue ice cream cone, though they might be cones they may not necessarily belong to the cone family and if you forget to define what cone is you bug out. Thus the properties of ice cream may or may not be clones. Now the speed argument, C/C++ uses the stack and heap approach this is where bare metal gets it's metal.
With the boost library you can achieve incredible speeds unfortunately most game studios stick to the standard library. The other reason for this might be because software written in C/C++ tends to be massive in file size, as it's a giant collection of files instead of a single file. Also take note all operating systems are written in C so generally why must we ask the question what could be faster?!
Also caching is not faster than pure memory management, sorry but this just doesn't fan out. Memory is something physical, caching is something software does in order to gain a kick in performance. One could also reason that without physical memory caching would simply not exist. It doesn't void the fact memory must be managed at some level whether its automated or manual.

Why would you write a small application that doesn't require much in the way of optimization in C++, if there is a faster route(C#)?

Getting an exact answer to your question is not really possible unless you perform benchmarks on specific systems. However, it is still interesting to think about some fundamental differences between programming languages like C# and C++.
Compilation
Executing C# code requires an additional step where the code is JIT'ed. With regard to performance that will be in favor of C++. Also, the JIT compiler is only able to optimize the generated code within the unit of code that is JIT'ed (e.g. a method) while a C++ compiler can optimize across method calls using more aggressive techniques.
However, The JIT compiler is able to optimize the generated machine code to closely match the underlying hardware enabling it to take advantage of additional hardware features if they exist. To my knowledge the .NET JIT compiler doesn't do that but it would conceiveably be able to generate different code for Atom as opposed to Pentium CPU's.
Memory access
The garbage collected architecture can in many cases create more optimal memory access patterns than standard C++ code. If the memory area used for the first generation is small enough in can stay within the CPU cache increasing performance. If you create and destroy a lot of small objects the overhead of maintaing the managed heap may be smaller than what is required by the C++ runtime. Again, this is highly dependent on the application. A study Python of performance demonstrates that a specific managed Python application is able to scale much better than the compiled version as a result of more optimal memory access patterns.

Don't let confusing!
If a C# application is written in the best case and a C++ application is written in the best case, the C++ is faster.
Many reason is here about why C++ is faster that C# inherently, such as C# uses virtual machine similar to JVM in Java. Basically higher level language has less performance (if uses in best case).
If you're an experienced professional C# programmer just like you're an experienced professional C++ programmer, developing an application using C# is much more easy and fast than C++.
Many other situations between these situations is possible. For example, you can write an C# application and C++ application so that C# app runs faster than C++ one.
For choosing a language you should note the circumstances of the project and its subject. For a general business project you should use C#. For a hight performance required project like a Video Converter or Image Processing project you should choose C++.
Update:
OK. Lets compare some practical reason about why most possible speed of C++ is more than C#. Consider a good written C# application and same C++ version:
C# uses a VM as a middle layer for executing the application. It has overhead.
AFAIK CLR could not optimises all C# codes in target machine. C++ application could be compiled on target machine with most optimisation.
In C# the most possible optimisation for runtime means most possible fast VM. VM has overhead anyway.
C# is a higher level language thus it generates more program code lines for the final process. (consider difference between an Assembly application and Ruby one! same condition is between C++ and a higher level language such as C#/Java)
If you prefer to get some more info in practice as an expert, see this. It is about Java but it also applies to C#.

The primary concern would not be speed, but stability across windows versions and upgrades. Win32 is mostly immune across windows versions making it highly stable.
When servers are decommissioned and software migrated, A lot of anxiety happens around anything using .Net and usually a lot of panic about .net versions but a Win32 application built 10 years ago just keeps running like nothing happened.

I have been specializing in optimization for about 15 years, and regularly re write C++ code, making heavy use of compiler intrinsics as much as possible because C++ performance is often nowhere near what the CPU is capable of. Cache performance often needs to be considered. Many vector maths instructions are required to replace the standard C++ floating point code.
A great deal of STL code is re written and often runs many times faster. Maths and code which makes heavy use of data can be re written with spectacular results, as the CPU approaches its optimum performance.
None of this is possible in C#. To compare their relative #real time# performance is really a staggeringly ignorant question. The fastest piece of code in C++ will be when each single assembler instruction is optimised for the task in hand, with no unnecessary instructions - at all. Where each piece of memory is used when it is required, and not copied n times because that’s what the language design requires. Where each required memory movement works in harmony with the cache.
Where the final algorithm cannot be improved, based on the exact real time requirements, considering accuracy and functionality.
Then you will be approaching an optimal solution.
To compare C# with this ideal situation is staggering. C# can’t compete. In fact, I am currently re writing a whole bunch of C# code (when I say re writing I mean removing and replacing it completely) because it is not even in the same city, let alone ball park when it comes to heavy lifting real time performance.
So please, stop fooling yourselves. C# is slow. Dead slow. All software is slowing down, and C# is making this speed decline worse. All software runs using the fetch execute cycle in assembler (you know – on the CPU). You use 10 times as many instructions; it’s going to go 10 times as slow. You cripple the cache; it’s going to go even slower. You add garbage collect to a real time piece of software then you are often fooled into thinking that the code runs ‘ok’ there are just those few moments every now and then when the code goes ‘a bit slow for a while’.
Try adding a garbage collection system to code where every cycle counts. I wonder if the stock market trading software has garbage collection (you know – on the system running on the new undersea cable which cost $300 million?). Can we spare 300 milliseconds every 2 seconds? What about flight control software on the space shuttle – is GC ok there? How about engine management software in performance vehicles? (Where victory in a season can be worth millions).
Garbage collection in real time is a complete failure.
So no, emphatically, C++ is much faster. C# is a leap backwards.

Designing library performance comparison tests

I am getting ready to perform a series of performance comparisons of various of the shelf products.
What do I need to do to show credibility in the tests? How do I design my benchmark tests so that they are respectable?
I am also interested in any suggestions on the actual design of the tests. Ways to load data without effecting the tests (Heisenberg Uncertainty Principle), or ways to monitor... etc

This is a bit tricky to answer without knowing what sort of "off the shelf" products you are trying to assess. Are you looking for UI responsiveness, throughput (e.g. email, transactions/sec), startup time, etc - all of these have different criteria for what measures you should track and different tools for testing or evaluating. But to answer some of your general questions:
Credibility - this is important. Try to make sure that whatever you are measuring has little run to run variance. Utilize the technique of doing several runs of the same scenario, get rid of outliers (i.e. your lowest and highest), and evaluate your avg/max/min/median values. If you're doing some sort of throughput test, consider making it long running so you have a good sample set. For example, if you are looking at something like Microsoft Exchange and thus are using their perf counters, try to make sure you are taking frequent samples (once per sec or every few secs) and have the test run for 20mins or so. Again, chop off the first few mins and the last few mins to eliminate any startup/shutdown noise.
Heisenburg - tricky. In most modern systems, depending on what application/measures you are measuring, you can minimize this impact by being smart about what/how you are measuring. Sometimes (like in the Exchange example), you'll see near 0 impact. Try to use as least invasive tools as possible. For example, if you're measuring startup time, consider using xperfinfo and utilize the events built into the kernel. If you're using perfmon, don't flood the system with extraneous counters that you don't care about. If you're doing some exteremely long running test, ratchet down your sampling interval.
Also try to eliminate any sources of environment variability or possible sources of noise. If you're doing something network intensive, consider isolating the network. Try to disable any services or applications that you don't care about. Limit any sort of disk IO, memory intensive operations, etc. If disk IO might introduce noise in something that is CPU bound, consider using SSD.
When designing your tests, keep repeatability in mind. If you doing some sort of microbenchmark type testing (e.g. perf unit test) then have your infrastructure support running the same operation n times exactly the same. If you're driving UI, try not to physically drive the mouse and instead use the underlying accessibility layer (MSAA, UIAutomation, etc) to hit controls directly programmatically.
Again, this is just general advice. If you have more specifics then I can try to follow up with more relavant guidance.
Enjoy!

Your question is very interesting, but a bit vague, because without knowing what to test it is not easy to give you some clues.
You can test performance from many different angles, then, depending on the use or target of the library you should try one approach or another; I will try to enumerate some of the things you may have to consider for measurement:
Multithreading: if the library uses
it or your software will use the
library in a multithreaded context
then you may have to test it with
many different processor and
multiprocessor configurations to see
how it reacts.
Startup time: its
importance depends on how intensively
will you use the library and what’s
the nature of the product being built
with it (client, server …).
Response time: for this do not take
the first execution, try to execute
the same call many times after the
first one and do an average. Using
System.Diagnostics.StopWatch could be
very useful for that.
Memory
consumption: analyze the growth,
beware of exponential ones ;). Go a
step further and measure quantity of
objects being created and disposed.
Responsiveness: you should not only
measure raw performance, how the user
feels the speed of the product it is
very important too.
Network: if the
library uses resources on the network
you may have to test it with
different bandwidth and latency
configurations, there is software to
simulate these situations.
Data:
try to create many different testing
data packages, trying to cover, for
example: a big bunch of raw data,
then a large set made of many smaller
chunks, a long iteration with small
pieces of data, …
Tools:
System.Diagnostics.Stopwatch: essential for benchmarking method calls
Performance counters: whenever available they are very useful to know what’s happening inside, allowing you to monitor the software without affecting its performance.
Profilers: there are some good memory and performance profilers in the market, but as you said, they always affect the measurements. They are good for finding bottlenecks in your software, but I don’t think you can use them for a comparison test.

Why do you care about the performance? In both cases the time taken to write the message to wherever you a storing your log will be a lot slower than anything else.
If you are really doing that match logging, then you are likely to need to index your log files so you can find the log entry you need, at that point you are not doing standard logging.

What is faster- Java or C# (or good old C)? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm currently deciding on a platform to build a scientific computational product on, and am deciding on either C#, Java, or plain C with Intel compiler on Core2 Quad CPU's. It's mostly integer arithmetic.
My benchmarks so far show Java and C are about on par with each other, and .NET/C# trails by about 5%- however a number of my coworkers are claiming that .NET with the right optimizations will beat both of these given enough time for the JIT to do its work.
I always assume that the JIT would have done it's job within a few minutes of the app starting (Probably a few seconds in my case, as it's mostly tight loops), so I'm not sure whether to believe them
Can anyone shed any light on the situation? Would .NET beat Java? (Or am I best just sticking with C at this point?).
The code is highly multithreaded and data sets are several terabytes in size.
Haskell/Erlang etc are not options in this case as there is a significant quantity of existing legacy C code that will be ported to the new system, and porting C to Java/C# is a lot simpler than to Haskell or Erlang. (Unless of course these provide a significant speedup).
Edit: We are considering moving to C# or Java because they may, in theory, be faster. Every percent we can shave off our processing time saves us tens of thousands of dollars per year. At this point we are just trying to evaluate whether C, Java, or c# would be faster.

The key piece of information in the question is this:
Every percent we can shave off our
processing time saves us tens of
thousands of dollars per year
So you need to consider how much it will cost to shave each percent off. If that optimization effort costs tens of thousands of dollars per year, then it isn't worth doing. You could make a bigger saving by firing a programmer.
With the right skills (which today are rarer and therefore more expensive) you can hand-craft assembler to get the fastest possible code. With slightly less rare (and expensive) skills, you can do almost as well with some really ugly-looking C code. And so on. The more performance you squeeze out of it, the more it will cost you in development effort, and there will be diminishing returns for ever greater effort. If the profit from this stays at "tens of thousands of dollars per year" then there will come a point where it is no longer worth the effort. In fact I would hazard a guess you're already at that point because "tens of thousands of dollars per year" is in the range of one salary, and probably not enough to buy the skills required to hand-optimize a complex program.
I would guess that if you have code already written in C, the effort of rewriting it all as a direct translation in another language will be 90% wasted effort. It will very likely perform slower simply because you won't be taking advantage of the capabilities of the platform, but instead working against them, e.g. trying to use Java as if it was C.
Also within your existing code, there will be parts that make a crucial contribution to the running time (they run frequently), and other parts that are totally irrelevant (they run rarely). So if you have some idea for speeding up the program, there is no economic sense in wasting time applying it to the parts of the program that don't affect the running time.
So use a profiler to find the hot spots, and see where time is being wasted in the existing code.
Update when I noticed the reference to the code being "multithreaded"
In that case, if you focus your effort on removing bottlenecks so that your program can scale well over a large number of cores, then it will automatically get faster every year at a rate that will dwarf any other optimization you can make. This time next year, quad cores will be standard on desktops. The year after that, 8 cores will be getting cheaper (I bought one over a year ago for a few thousand dollars), and I would predict that a 32 core machine will cost less than a developer by that time.

I'm sorry, but that is not a simple question. It would depend a lot on what exactly was going on. C# is certainly no slouch, and you'd be hard-pressed to say "java is faster" or "C# is faster". C is a very different beast... it maybe has the potential to be faster - if you get it right; but in most cases it'll be about the same, but much harder to write.
It also depends how you do it - locking strategies, how you do the parallelization, the main code body, etc.
Re JIT - you could use NGEN to flatten this, but yes; if you are hitting the same code it should be JITted very early on.
One very useful feature of C#/Java (over C) is that they have the potential to make better use of the local CPU (optimizations etc), without you having to worry about it.
Also - with .NET, consider things like "Parallel Extensions" (to be bundled in 4.0), which gives you a much stronger threading story (compared to .NET without PFX).

Don't worry about language; parallelize!
If you have a highly multithreaded, data-intensive scientific code, then I don't think worrying about language is the biggest issue for you. I think you should concentrate on making your application parallel, especially making it scale past a single node. This will get you far more performance than just switching languages.
As long as you're confined to a single node, you're going to be starved for compute power and bandwidth for your app. On upcoming many-core machines, it's not clear that you'll have the bandwidth you need to do data-intensive computing on all the cores. You can do computationally intensive work (like a GPU does), but you may not be able to feed all the cores if you need to stream a lot of data to every one of them.
I think you should consider two options:
MapReduce
Your problem sounds like a good match for something like Hadoop, which is designed for very data-intensive jobs.
Hadoop has scaled to 10,000 nodes on Linux, and you can shunt your work off either to someone else's (e.g. Amazon's, Microsoft's) or your own compute cloud. It's written in Java, so as far as porting goes, you can either call your existing C code from within Java, or you can port the whole thing to Java.
MPI
If you don't want to bother porting to MapReduce, or if for some reason your parallel paradigm doesn't fit the MapReduce model, you could consider adapting your app to use MPI. This would also allow you to scale out to (potentially thousands) of cores. MPI is the de-facto standard for computationally intensive, distributed-memory applications, and I believe there are Java bindings, but mostly people use MPI with C, C++, and Fortran. So you could keep your code in C and focus on parallelizing the performance-intensive parts. Take a look at OpenMPI for starters if you are interested.

I'm honestly surprised at those benchmarks.
In a computationally intensive product I would place a large wager on C to perform faster. You might write code that leaks memory like a sieve, and has interesting threading related defects, but it should be faster.
The only reason I could think that Java or C# would be faster is due to a short run length on the test. If little or no GC happened, you'll avoid the overhead of actually deallocating memory. If the process is iterative or parallel, try sticking a GC.Collect wherever you think you're done a bunch of objects(after setting things to null or otherwise removing references).
Also, if you're dealing with terabytes of data, my opinion is you're going to be much better off with deterministic memory allocation that you get with C. If you deallocate roughly close to when you allocate your heap will stay largely unfragmented. With a GC environment you may very well end up with your program using far more memory after a decent run length than you would guess, just because of fragmentation.
To me this sounds like the sort of project where C would be the appropriate language, but would require a bit of extra attention to memory allocation/deallocation. My bet is that C# or Java will fail if run on a full data set.

Quite some time ago Raymond Chen and Rico Mariani had a series of blog posts incrementally optimising a file load into a dictionary tool. While .NET was quicker early on (i.e. easy to make quick) the C/Win32 approach eventually was significantly faster -- but at considerable complexity (e.g. using custom allocators).
In the end the answer to which is faster will heavily depend on how much time you are willing to expend on eking every microsecond out of each approach. That effort (assuming you do it properly, guided by real profiler data) will make a far greater difference than choice of language/platform.
The first and last performance blog entries:
Chen part 1
Mariani part 1
Check final part
Mariani final part
(The last link gives an overall summary of the results and some analysis.)

It is going to depend very much on what you are doing specifically. I have Java code that beats C code. I have Java code that is much slower than C++ code (I don't do C#/.NET so cannot speak to those).
So, it depends on what you are doing, I am sure you can find something that is faster in language X than language Y.
Have you tried running the C# code through a profiler to see where it is taking the most time (same with Java and C while you are at it). Perhaps you need to do something different.
The Java HotSpot VM is more mature (roots of it going back to at least 1994) than the .NET one, so it may come down to the code generation abilities of both for that.

You say "the code is multithreaded" which implies that the algorithms are parallelisable. Also, you save the "data sets are several terabytes in size".
Optimising is all about finding and eliminating bottlenecks.
The obvious bottleneck is the bandwidth to the data sets. Given the size of the data, I'm guessing that the data is held on a server rather than on a desktop machine. You haven't given any details of the algorithms you're using. Is the time taken by the algorithm greater than the time taken to read/write the data/results? Does the algorithm work on subsets of the total data?
I'm going to assume that the algorithm works on chunks of data rather than the whole dataset.
You have two scenarios to consider:
The algorithm takes more time to process the data than it does to get the data. In this case, you need to optimise the algorithm.
The algorithm takes less time to process the data than it does to get the data. In this case, you need to increase the bandwidth between the algorithm and the data.
In the first case, you need a developer that can write good assembler code to get the most out of the processors you're using, leveraging SIMD, GPUs and multicores if they're available. Whatever you do, don't just crank up the number of threads because as soon as the number of threads exceeds the number of cores, your code goes slower! This due to the added overhead of switching thread contexts. Another option is to use a SETI like distributed processing system (how many PCs in your organisation are used for admin purposes - think of all that spare processing power!). C#/Java, as bh213 mentioned, can be an order of magnitude slower than well written C/C++ using SIMD, etc. But that is a niche skillset these days.
In the latter case, where you're limited by bandwidth, then you need to improve the network connecting the data to the processor. Here, make sure you're using the latest ethernet equipment - 1Gbps everywhere (PC cards, switches, routers, etc). Don't use wireless as that's slower. If there's lots of other traffic, consider a dedicated network in parallel with the 'office' network. Consider storing the data closer to the clients - for every five or so clients use a dedicated server connected directly to each client which mirrors the data from the server.
If saving a few percent of processing time saves "tens of thousands of dollars" then seriously consider getting a consultant in, two actually - one software, one network. They should easily pay for themselves in the savings made. I'm sure there's many here that are suitably qualified to help.
But if reducing cost is the ultimate goal, then consider Google's approach - write code that keeps the CPU ticking over below 100%. This saves energy directly and indirectly through reduced cooling, thus costing less. You'll want more bang for your buck so it's C/C++ again - Java/C# have more overhead, overhead = more CPU work = more energy/heat = more cost.
So, in summary, when it comes to saving money there's a lot more to it than what language you're going to choose.

If there is already a significant quantity of legacy C code that will be added to the system then why move to C# and Java?
In response to your latest edit about wanting to take advantage of any improvements in processing speed....then your best bet would be to stick to C as it runs closer to the hardware than C# and Java which have the overhead of a runtime environment to deal with. The closer to the hardware you can get the faster you should be able to run. Higher Level languages such as C# and Java will result in quicker development times...but C...or better yet Assembly will result in quicker processing time...but longer development time.

I participated in a few TopCoder's Marathon matches where performance was they key to victory.
My choice was C#. I think C# solutions placed slightly above Java and were slighly slower than C++... Until somebody wrote a code in C++ that was a order of magnitude faster. You were alowed to use Intel compiler and the winning code was full of SIMD insturctions and you cannot replicate that in C# or Java. But if SIMD is not an option, C# and Java should be good enough as long as you take care to use memory correctly (e.g. watch for cache misses and try to limit memory access to the size of L2 cache)

You question is poorly phrased (or at least the title is) because it implies this difference is endemic and holds true for all instances of java/c#/c code.
Thankfully the body of the question is better phrased because it presents a reasonably detailed explanation of the sort of thing your code is doing. It doesn't state what versions (or providers) of c#/java runtimes you are using. Nor does it state the target architecture or machine the code will run on. These things make big differences.
You have done some benchmarking, this is good. Some suggestions as to why you see the results you do:
You aren't as good at writing performant c# code as you are at java/c (this is not a criticism, or even likely but it is a real possibility you should consider)
Later versions of the JVM have some serious optimizations to make uncontended locks extremely fast. This may skew things in your favour (And especially the comparison with the c implementation threading primitives you are using)
Since the java code seems to run well compared to the c code it is likely that you are not terribly dependent on the heap allocation strategy (profiling would tell you this).
Since the c# code runs less well than the java one (and assuming the code is comparable) then several possible reasons exist:
You are using (needlessly) virtual functions which the JVM will inline but the CLR will not
The latest JVM does Escape Analysis which may make some code paths considerably more efficient (notably those involving string manipulation whose lifetime is stack bound
Only the very latest 32 bit CLR will inline methods involving non primitive structs
Some JVM JIT compilers use hotspot style mechanisms which attempt to detect the 'hotspots' of the code and spend more effort re-jitting them.
Without an understanding of what your code spends most of its time doing it is impossible to make specific suggestions. I can quite easily write code which performs much better under the CLR due to use of structs over objects or by targeting runtime specific features of the CLR like non boxed generics, this is hardly instructive as a general statement.

Actually it is 'Assembly language'.

Depends on what kind of application you are writing.
Try The Computer Language Benchmarks Game
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=csharp&lang2=java&box=1
http://shootout.alioth.debian.org/u64/benchmark.php?test=all&lang=csharp&lang2=java&box=1

To reiterate a comment, you should be using the GPU, not the CPU if you are doing arithmetic scientific computing. Matlab with CUDA plugins would be much more awesome than Java or c# if Matlab licensing is not an issue. The nVidia documentation shows how to compile any CUDA function into a mex file. If you need free software, I like pycuda.
If however, GPUs are not an option, I personally like C for a lot of routines because the optimizations the compiler makes are not as complicated as JIT: you don't have to worry about whether a "class" becomes like a "struct" or not. In my experience, problems can usually be broken down such that higher-level things can be written in a very expressive language like Python (rich primitives, dynamic types, incredibly flexible reflection), and transformations can be written in something like C. Additionally, there's neat compiler software, like PLUTO (automatic loop parallelization and OpenMP code generation), and libraries like Hoard, tcmalloc, BLAS (CUBLAS for gpu), etc. if you choose to go the C/C++ route.

One thing to notice is that IF your application(s) would benefit of lazy evaluation a functional programming language like Haskell may yield speedups of a totally different magnitude than the theretically optimal structured/OO code just by not evaluating unnecessary branches.
Also, if you are talking about the monetary benefit of better performance, don't forget to add the cost of maintaing your software into the equation.

Surely the answer is to go and buy the latest PC with the most cores/processors you can afford. If you buy one of the latest 2x4 core PCs you will find not only does it have twice as many cores as a quad core but also they run 25-40% faster than the previous generation of processors/machines.
This will give you approximately a 150% speed up. Far more than choosing Java/C# or C.
and whats more your get the same again every 18 months if you keep buying in new boxes!
You can sit there for months rewriting you code or I could go down to my local PC store this afternoon and be running faster than all your efforts same day.
Improving code quality/efficiency is good but sometimes implementation dollars are better spent elsewhere.

Writing in one language or another will only give you small speed ups for a large amount of work. To really speed things up you might want to look at the following:
Buying the latest fastest Hardware.
Moving from 32 bit operating system to 64 bit.
Grid computing.
CUDA / OpenCL.
Using compiler optimisation like vectorization.

I would go with C# (or Java) because your development time will probably be much faster than with C. If you end up needing extra speed then you can always rewrite a section in C and call it as a module.

My preference would be C or C++ because I'm not separated from the machine language by a JIT compiler.
You want to do intense performance tuning, and that means stepping through the hot spots one instruction at a time to see what it is doing, and then tweaking the source code so as to generate optimal assembler.
If you can't get the compiler to generate what you consider good enough assembler code, then by all means write your own assembler for the hot spot(s). You're describing a situation where the need for performance is paramount.
What I would NOT do if I were in your shoes (or ever) is rely on anecdotal generalizations about one language being faster or slower than another. What I WOULD do is multiple passes of intense performance tuning along the lines of THIS and THIS and THIS. I have done this sort of thing numerous times, and the key is to iterate the cycle of diagnosis-and-repair because every slug fixed makes the remaining ones more evident, until you literally can't squeeze another cycle out of that turnip.
Good luck.
Added: Is it the case that there is some seldom-changing configuration information that determines how the bulk of the data is processed? If so, it may be that the program is spending a lot of its time re-interpreting the configuration info to figure out what to do next. If so, it is usually a big win to write a code generator that will read the configuration info and generate an ad-hoc program that can whizz through the data without constantly having to figure out what to do.

Depends what you benchmark and on what hardware. I assume it's speed rather than memory or CPU usage.But....
If you have a dedicated machine for an app only with very large amounts of memory then java might be 5% faster.
If you go down in the real world with limited memory and more apps running on the same machine .net looks better at utilizing computing resources :see here
If the hardware is very constrained, C/C++ wins hands down.

If you are using a highly multithreaded code, I would recommend you to take a look at the upcoming Task Parallel Library (TPL) for .NET and the Parallel Pattern Library (PPL) for native C++ applications. That will save you a lot of issues with thread/dead lockíng and all other issues that you would spend a lot of time digging into and solving for yourself.
For my self, I truly believe that the memory management in the managed world will be more efficient and beat the native code in the long term.

If much of your code is in C why not keep it?
In principal and by design it's obvious that C is faster. They may close the gap over time but they always have more level os indirection and "safety". C is fast because it's "unsafe". Just think about bound checking. Interfacing to C is supported in every langauge. And so I can not see why one would not like to just wrap the C code up if it's still working and use it in whatever language you like

I would consider what everyone else uses - not the folks on this site, but the folks who write the same kind of massively parallel, or super high-performance applications.
I find they all write their code in C/C++. So, just for this fact alone (ie. regardless of any speed issues between the languages), I would go with C/C++. The tools they use and have developed will be of much more use to you if you're writing in the same language.
Aside from that, I've found C# apps to have somewhat less than optimal performance in some areas, multithreading is one. .NET will try to keep you safe from thread problems (probably a good thing in most cases), but this will cause your specific case problems (to test: try writing a simple loop that accesses a shared object using lots of threads. Run that on a single core PC and you get better performance than if you run it on a multiple core box - .net is adding its own locks to make sure you don't muck it up)(I used Jon Skeet's singleton benchmark. The static lock on took 1.5sec on my old laptop, 8.5s on my superfast desktop, the lock version is even worse, try it yourself)
The next point is that with C you tend to access memory and data directly - nothing gets in the way, with C#/Java you will use some of the many classes that are provided. These will be good in the general case, but you're after the best, most efficient way to access this (which, for your case is a big deal with multi-terabytes of data, those classes were not designed with those datasets in mind, they were designed for the common cases everyone else uses), so again, you would be safer using C for this - you'll never get the GC getting clogged up by a class that creates new strings internally when you read a couple of terabytes of data if you write it in C!
So it may appear that C#/Java can give you benefits over a native application, but I think you'll find those benefits are only realised for the kind of line-of-business applications that are commonly written.

Note that for heavy computations there is a great advantage in having tight loops which can fit in the CPU's first level cache as it avoids having to go to slower memory repeatedly to get the instructions.
Even for level two cache a large program like Quake IV gets a 10% performance increase with 4 Mb level 2 cache versus 1 Mb level 2 cache - http://www.tomshardware.com/reviews/cache-size-matter,1709-5.html
For these tight loops C is most likely the best as you have the most control of the generated machine code, but for everything else you should go for the platform with the best libraries for the particular task you need to do. For instance the netlib libraries are reputed to have very good performance for a very large set of problems, and many ports to other languages are available.

If every percentage will really save you tens of thousands of dollars, then you should bring in a domain expert to help with the project. Well designed and written code with performance considered at the initial stages may be an order of magnitude faster, saving you 90%, or $900,000. I recently found a subtle flaw in some code that sped up a process by over 100 times. A colleague of mine found an algorithm that was running in O(n^3) that he re-wrote to make it O(N log n). This tends to be where the huge performance saving are.
If the problem is so simple that you are certain that a better algorithm cannot be employed giving you significant savings, then C is most likely your best language.

The most important things are already said here. I would add:
The developer utilizes a language which the compiler(s) utilize(s) to generate machine instructions which the processor(s) utilize(s) to use system resources. A program will be "fast" when ALL parts of the chain perform optimally.
So for the "best" language choice:
take that language which you are best able to control and
which is able to instruct the compiler sufficiently to
generate nearly optimal machine code so that
the processor on the target machine is able to utilize processing resources optimally.
If you are not a performance expert you will have a hard time to archieve 'peak performance' within ANY language. Possibly C++ still provides the most options to control the machine instructions (especially SSE extensions a.s.o).
I suggest to orient on the well known 80:20 rule. This is fairly well true for all: the hardware, the languages/platforms and the developer efforts.
Developers have always relied on the hardware to fix all performance issues automatically due to an upgrade to a faster processor f.e.. What might have worked in the past will not work in the (nearest) future. The developer now has the responsibility to structure her programs accordingly for parallelized execution. Languages for virtual machines and virtual runtime environments will show some advantage here. And even without massive parallelization there is little to no reason why C# or Java shouldn't succeed similar well as C++.
#Edit: See this comparison of C#, Matlab and FORTRAN, where FORTRAN does not win alone!

Ref; "My benchmarks so far show Java and C are about on par with each other"
Then your benchmarks are severely flawed...
C will ALWAYS be orders of magnitudes faster then both C# and Java unless you do something seriously wrong...!
PS!
Notice that this is not an attempt to try to bully neither C# nor Java, I like both Java and C#, and there are other reasons why you would for many problems choose either Java or C# instead of C. But neither Java nor C# would in a correct written tests NEVER be able to perform with the same speed as C...
Edited because of the sheer number of comments arguing against my rhetoric
Compare these two buggers...
C#
public class MyClass
{
public int x;
public static void Main()
{
MyClass[] y = new MyClass[1000000];
for( int idx=0; idx < 1000000; idx++)
{
y[idx] = new MyClass();
y[idx].x = idx;
}
}
}
against this one (C)
struct MyClass
{
int x;
}
void Main()
{
MyClass y[1000000];
for( int idx = 0; idx < 1000000; idx++)
{
y[idx].x = idx;
}
}
The C# version first of all needs to store its array on the heap. The C version stores the array on the stack. To store stuff on the stack is merely changing the value of an integer value while to store stuff on the heap means finding a big enough chunk of memory and potentially means traversing the memory for a pretty long time.
Now mostly C# and Java allocates huge chunks of memory which they keep on spending till it's out which makes this logic execute faster. But even then to compare this against changing the value of an integer is like an F16 against an oil tanker speedwise...
Second of all in the C version since all those objects are already on the stack we don't need to explicitly create new objects within the loop. Yet again for C# this is a "look for available memory operation" while the C version is a ZIP (do nothing operation)
Third of all is the fact that the C version will automatically delete all these objects when they run out of scope. Yet again this is an operation which ONLY CHANGES THE VALUE OF AN INTEGER VALUE. Which would on most CPU architectures take between 1 and 3 CPU cycles. The C# version doesn't do that, but when the Garbage Collector kicks in and needs to collect those items my guess is that we're talking about MILLIONS of CPU cycles...
Also the C version will instantly become x86 code (on an x86 CPU) while the C# version would first become IL code. Then later when executed it would have to be JIT compiled, which probably alone takes orders of magnitudes longer time then only executing the C version.
Now some wise guy could probably execute the above code and measure CPU cycles. However that's basically no point at all in doing because mathematically it's proven that the Managed Version would probably take several million times the number of CPU cycles as the C version. So my guess is that we're now talking about 5-8 orders of magnitudes slower in this example. And sure, this is a "rigged test" in that I "looked for something to prove my point", however I challenge those that commented badly against me on this post to create a sample which does NOT execute faster in C and which also doesn't use constructs which you normally never would use in C due to "better alternatives" existing.
Note that C# and Java are GREAT languages. I prefer them over C ANY TIME OF THE DAY. But NOT because they're FASTER. Because they are NOT. They are ALWAYS slower then C and C++. Unless you've coded blindfolded in C or C++...
Edit;
C# of course have the struct keyword, which would seriously change the speed for the above C# version, if we changed the C# class to a value type by using the keyword struct instead of class. The struct keyword means that C# would store new objects of the given type on the stack - which for the above sample would increase the speed seriously. Still the above sample happens to also feature an array of these objects.
Even though if we went through and optimized the C# version like this, we would still end up with something several orders of magnitudes slower then the C version...
A good written piece of C code will ALWAYS be faster then C#, Java, Python and whatever-managed-language-you-choose...
As I said, I love C# and most of the work I do today is C# and not C. However I don't use C# because it's faster then C. I use C# because I don't need the speed gain C gives me for most of my problems.
Both C# and Java is though ridiculously slower then C, and C++ for that matter...

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.