I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:
void ProcessFrame(double[,] frame)
{
    int width = frame.GetLength(1);
    int height = frame.GetLength(0);
    byte[,] result = new byte[height, width];
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
            result[row, col] = ManipulatePixel(frame[row, col]);
    });
}
Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm ( ManipulatePixel() ) is, the application can't keep up with the camera's frame rate any more. However, I have noticed that despite my using parallel for loops, the application simply won't use all of the CPU that is available - the task manager performance tab shows about 60-80% CPU usage.
I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the parallel patterns library. The C++ code uses all of the CPU it can get, as I would expect, and I also tried PInvoking a C++ DLL from my C# code, doing the same algorithm that runs slowly in the C# library - it also uses all the CPU power available, CPU usage is right at 100% virtually the whole time and there is no trouble at all keeping up with the camera.
Outsourcing the code into a C++ DLL and then marshalling it back into C# is an extra hassle I'd of course rather avoid. How do I make my C# code actually make use of all the CPU potential? I have tried increasing process priority like this:
using (Process process = Process.GetCurrentProcess())
process.PriorityClass = ProcessPriorityClass.RealTime;
This has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
and then passing that to the Parallel.For() loop. This had no effect at all, but I suppose that's not surprising, since the default settings should already be optimized. I also tried setting this in the application configuration:
<runtime>
    <Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
    <GCCpuGroup enabled="true"></GCCpuGroup>
    <gcServer enabled="true"></gcServer>
</runtime>
but this actually makes it run even slower.
EDIT:
The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:
void ProcessFrame(double[,] frame)
{
    byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];
    Parallel.For(0, frame.GetLength(0), row =>
    {
        for (var col = 0; col < frame.GetLength(1); ++col)
            result[row, col] = ManipulatePixel(frame[row, col]);
    });
}
Sorry for this - I was paraphrasing code at the time and didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally wrote (i.e. the width and height variables set at the beginning of the function, and the array's length properties queried only once each instead of in the for loops' conditional statements). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster - I hadn't realized this before because C++ doesn't have managed 2D arrays, so I had to pass the pixel dimensions as separate arguments there anyway. Whether it is fast enough as it is, I cannot say yet.
Also thank you for the other answers; they contain a lot of things I don't know yet, but it's great to know what I have to look for. I'll update once I've reached a satisfactory result.
I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your posting is devoid of details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame). But here are some general tips that apply in your case.
2D arrays in .NET are significantly slower than 1D arrays and jagged arrays, even in .NET Core today - this is a longstanding bug.
See here:
https://github.com/dotnet/coreclr/issues/4059
Why are multi-dimensional arrays in .NET slower than normal arrays?
Multi-dimensional array vs. One-dimensional
So consider changing your code to use either a jagged array (which also helps with memory locality/proximity caching, as each thread would have its own private buffer) or a 1D array with your own code being responsible for bounds-checking - see the sketch at the end of this list.
Or better-yet: use stackalloc to manage the buffer's lifetime and pass that by-pointer (unsafe ahoy!) to your thread delegate.
Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.
Avoid allocating a new buffer for each frame encountered - if a frame has a limited lifespan then consider using reusable buffers using a buffer-pool.
Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot - but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need .NET Core 2.0 or later for the best accelerated functionality).
Also, avoid copying individual bytes or scalar values inside a for loop in C#, instead consider using Buffer.BlockCopy for bulk copy operations (as these can use hardware memory copy features).
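To make the 1D-array and reusable-buffer suggestions concrete, here is a minimal sketch (not your actual code, since ManipulatePixel isn't shown; it's assumed here to return a byte). The frame is treated as a flat row-major array, and the output buffer is allocated once by the caller and reused for every frame:
// Sketch only - needs "using System.Threading.Tasks;".
// Assumes "result" is a caller-owned buffer of width * height bytes, allocated once.
static void ProcessFrameFlat(double[] frame, byte[] result, int width, int height)
{
    Parallel.For(0, height, row =>
    {
        int rowOffset = row * width;                 // manual row-major indexing
        for (int col = 0; col < width; ++col)
        {
            result[rowOffset + col] = ManipulatePixel(frame[rowOffset + col]);
        }
    });
}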
Regarding your observation of "80% CPU usage": if you have a busy loop in a program, it will cause 100% CPU usage within the time-slices provided by the operating system. If you don't see 100% usage, then either:
Your code is actually running faster than real-time (this is a good thing!) - unless you're certain your program can't keep up with the input, or
Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
Ensure you aren't using any lock (Monitor) calls or using other thread or memory synchronization primitives.
Efficiency matters (this is not a truly [PARALLEL] problem, but it may - though need not - benefit from "just"-[CONCURRENT] work).
The BEST, yet a rather hard way, if ultimate performance is a MUST :
in-line an assembly, optimised as per cache-line sizes in the CPU hierarchy, and keep indexing that follows the actual memory layout of the 2D data { column-wise | row-wise }. Given there is no 2D-kernel transformation mentioned, your process does not need to "touch" any topological neighbours, so the indexing can step in whatever order "across" both ranges of the 2D domain, and ManipulatePixel() may get more efficient by transforming blocks of pixels instead of bearing all the overheads of a call for each isolated, atomicised 1px ( ILP + cache-efficiency are on your side ).
Given your target production-platform CPU family, it is best to use (block-SIMD) vectorised instructions available from AVX2, ideally AVX-512 code. As you most probably know, you may use C/C++ with AVX intrinsics for the performance optimisation, inspect the resulting assembly, and finally "copy" the best resulting assembly for inlining into your C# assembly. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet may help in almost hard-real-time production settings ( though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour ).
A CHEAP, few-seconds step :
Test and benchmark the run-time per batch of frames of a reversed composition: move the more "expensive" part, the Parallel.For(...{...}), inside the for (var col = 0; col < width; ++col){...} loop, to see how the cost of instantiating the Parallel.For() instrumentation changes.
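An illustrative sketch of that reversed composition, re-using the width / height / frame / ManipulatePixel names from the question; the only purpose is to expose the per-call instantiation costs of the Parallel.For() instrumentation in the timing:
var sw = System.Diagnostics.Stopwatch.StartNew();
for (var col = 0; col < width; ++col)          // outer loop is now sequential, per column
{
    Parallel.For(0, height, row =>             // one Parallel.For instantiation per column
    {
        ManipulatePixel(frame[row, col]);
    });
}
sw.Stop();
Console.WriteLine($"reversed composition: {sw.ElapsedMilliseconds} ms / frame");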
Next, if going this cheap way, think about re-factoring ManipulatePixel() to work on at least a block of data, aligned with the data-storage layout and a multiple of the cache-line length ( cache hits cost ~ 0.5 ~ 5 [ns] for memory accesses, versus ~ 100 ~ 380 [ns] otherwise ). Here, a will to distribute the work ( at worst per 1px ) across all NUMA-CPU cores will cost way more time, due to extended access latencies for cross-NUMA (non-local) memory addresses. Besides never re-using an expensively cached block of fetched data, you knowingly pay excessive costs for cross-NUMA (non-local) memory fetches, from which you "use" just 1px and "throw away" all the rest of the cached block ( those pixels will get re-fetched and manipulated in some other CPU core at some other time - a triple waste of time - sorry to mention it so explicitly, but when shaving each possible [ns] this cannot happen in a production pipeline ).
Anyway, let me wish you perseverance and good luck on your steps forwards to gain the needed efficiency back onto your side.
Here's what I ended up doing, mostly based on Dai's answer:
made sure to query image pixel dimensions once at the beginning of the processing functions, not within the for loops' conditional statements. With parallel loops, it would seem this creates contended access to those properties from multiple threads, which noticeably slows things down.
removed allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates one buffer for each image processing step (filtering, scaling, colorizing) only, which doesn't change in size but gets overwritten with each frame.
removed an extra data processing step where raw image data in ushort format (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
I also tried, without success, to use 1D arrays instead of 2D, but there was actually no difference in performance. I don't know if that's because the bug Dai mentioned has been fixed in the meantime, but I couldn't confirm 2D arrays being any slower than 1D arrays.
Probably also worth mentioning: the ManipulatePixel() function in my original post was actually more of a placeholder than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:
private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
    Debug.Assert(originalImg != null);
    Debug.Assert(originalImg.Length != 0);
    Debug.Assert(scaledImg != null);
    Debug.Assert(scaledImg.Length == originalImg.Length);

    ushort min = limits.Item1;
    ushort max = limits.Item2;
    int width = originalImg.GetLength(1);
    int height = originalImg.GetLength(0);

    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
        {
            ushort value = originalImg[row, col];
            if (value < min)
                scaledImg[row, col] = 0;
            else if (value > max)
                scaledImg[row, col] = 255;
            else
                scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
        }
    });
}
This is just one step and some others are much more complex but the approach would be similar.
Some of the things mentioned, like SIMD/AVX or the answer by user3666197, are unfortunately well beyond my abilities right now, so I couldn't test them out.
It's still relatively easy to put enough processing load into the stream to tank the frame rate but for my application the performance should be enough now. Thanks to everyone who provided input, I'll mark Dai's answer as accepted because I found it the most helpful.
Related
As stated in the title, I am evaluating the cost of implementing a BitArray over byte[] (I have understood that the native BitArray is pretty slow) instead of using a string representation of bits (e.g. "001001001"), but I am open to any suggestions that are more effective.
The length of the array is not known at design time, but I suppose it may be between 200 and 500 bits per array.
Memory is not a concern, so using a lot of memory to represent the array is not an issue; what matters is speed when the array is created and manipulated (they will be manipulated a lot).
Thanks in advance for your consideration and suggestions on the topic.
A few suggestions:
1) Computers don't process single bits, so an int or long will work at the same speed
2) For speed, consider writing it with unsafe code
3) new is expensive. If the objects are created a lot, you can do the following: create a bulk of 10K objects at a time and serve them from a method when required. Once the cache runs out you can recreate them. Have another method so that once an object's processing completes, you clean it up and return it to the cache
4) Make sure your manipulation is optimal
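A minimal sketch of point 1) - packing the bits into ulong words instead of a string of '0'/'1' characters. The class name and API here are made up for illustration, not a drop-in replacement for the built-in BitArray:
// Stores one bit per position in an array of 64-bit words.
public sealed class PackedBits
{
    private readonly ulong[] _words;
    public int Length { get; }

    public PackedBits(int length)
    {
        Length = length;
        _words = new ulong[(length + 63) / 64];   // round up to whole words
    }

    public bool this[int index]
    {
        get => (_words[index >> 6] & (1UL << (index & 63))) != 0;
        set
        {
            if (value) _words[index >> 6] |= 1UL << (index & 63);
            else       _words[index >> 6] &= ~(1UL << (index & 63));
        }
    }
}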
First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job and moving around and zooming is pretty smooth, takes less than a second per drawing, but once I start to zoom in a little and get to places which require more calculations it becomes really slow.
On other Mandelbrot applications my computer does really fine on places which work much slower in my application, so I'm guessing there is much I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel/GetPixel methods on the bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the square in a variable and reuse the result without the need to recalculate it.
I use 4 threads to draw the Mandelbrot, each thread does a different quarter of the image and they all work simultaneously. As I understood, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which as I understood is the fastest?
Here is how I move between the pixels and do the calculation; it's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w/2))*step + zeroX;
        double im = (i - (h/2))*step - zeroY;
        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;
        //Variables to store the squares of the real and imaginary parts.
        double multZre = 0;
        double multZim = 0;
        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /*The new real part equals re(z)^2 - im(z)^2 + re(c), we store it in a temp variable
              tempRe because we still need re(z) in the next calculation
            */
            double tempRe = multZre - multZim + re;
            /*The new imaginary part is equal to 2*re(z)*im(z) + im(c)
             * Instead of multiplying these by 2 I add re(z) to itself and then multiply by im(z), which
             * means I just do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.
            // Do the squaring now, they will be used in the next calculation.
            multZre = zRe * zRe;
            multZim = zIm * zIm;
            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }
        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question, I doubt the problem is with this because my code becomes really slow
        // as the number of ITERATIONS grows, this only executes more as the number of pixels grows).
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers, double is limited with accuracy and I'm sure there are better non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy, I just need someone to point me where I need to look and tell me if it's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? is it even that simple or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject or tell me in general about it that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal. This would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a builtin .Net class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result-set. I was thinking of targeting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier) so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point to make that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical. You might be doing twice as much calculating as you need to.
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself if it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
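As a rough sketch of that idea in WinForms terms (RenderFractal and fullSizeBitmap are placeholders for your existing drawing code and bitmap, not real methods from your post):
// Compute a quarter-resolution preview and let GDI+ stretch it while zooming.
Bitmap preview = new Bitmap(pictureBox.Width / 4, pictureBox.Height / 4);
RenderFractal(preview);                        // existing per-pixel loop, at low resolution
using (Graphics g = Graphics.FromImage(fullSizeBitmap))
{
    g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.NearestNeighbor;
    g.DrawImage(preview, 0, 0, fullSizeBitmap.Width, fullSizeBitmap.Height);
}
pictureBox.Image = fullSizeBitmap;             // re-render at full resolution once zooming stops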
Another suggestion: when you use multiple threads, you should take care that each thread doesn't read/write memory of other threads, because this will cause cache collisions and hurt performance. One good algorithm could be to split the work up into scanlines (instead of four quarters as you do now). Create a number of threads, then as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy this back to the main bitmap after each line (to avoid cache collisions).
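A sketch of that scanline idea, using Parallel.For with a per-thread row buffer as a stand-in for hand-rolled worker threads (ComputeScanline and CommitScanline are hypothetical helpers around your existing per-pixel loop and bitmap write):
Parallel.For(0, height,
    () => new byte[width * bytesPerPixel],          // one private row buffer per worker thread
    (row, state, rowBuffer) =>
    {
        ComputeScanline(row, rowBuffer);            // thread writes only into its own buffer
        CommitScanline(row, rowBuffer);             // copy the finished line into the bitmap
        return rowBuffer;
    },
    rowBuffer => { });                              // nothing to do when a thread finishes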
I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection (for converting samples to points according to bitmap size) of samples takes a significant amount of time (especially during scrolling).
So I implemented some kind of downsampling. I just skip some points: take every 2nd, every 3rd, every 50th point depending on signal characteristics. It increases speed very much but significantly distorts signal form.
Are there any smarter approaches?
We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples with a sample taken every 6 seconds more or less), so down-sampling was actually the way to go. If we didn't do that, zooming out wouldn't have much meaning, as all you would see was just a big blob of lines smeared out over the screen.
It all depends on how you are going to implement the down-sampling. There's two (simple) approaches: down-sample at the moment you get your sample or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // assume we have a field in the class which contains the data as a List<T>, named _data;
    // a separate counter tracks how many items have been returned, so "count" means the number
    // of yielded samples regardless of the start offset
    for (int i = start, returned = 0; returned < count && i < _data.Count; i += skip, ++returned)
    {
        yield return _data[i];
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for-loop (use an algorithm based on the surrounding samples to average?). However, this approach will usually smooth out any extreme spikes in your signal, so be wary of that.
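For example, a hedged sketch of the "average the surrounding samples" variant, assuming the hypothetical _data field holds double samples: each yielded point is the mean of the skipped window rather than a single raw sample:
public IEnumerator<double> GetAveragedEnumerator(int start, int count, int skip)
{
    for (int i = start, returned = 0; returned < count && i < _data.Count; i += skip, ++returned)
    {
        double sum = 0;
        int n = Math.Min(skip, _data.Count - i);   // window may be short at the end of the data
        for (int k = 0; k < n; k++)
            sum += _data[i + k];
        yield return sum / n;
    }
}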
Another approach would be to create some generalized versions of your dataset for different ranges, which update themselves whenever you receive a new signal. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory. You will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get a distortion in your signal if you generalize. You always lose data. It's up to the algorithm you choose on how this distortion will occur, and how noticeable it will be.
You need a better sampling algorithm; you can also employ the parallel processing features of C#. Refer to the Task Parallel Library.
Yes, I am using a profiler (ANTS). But at the micro-level it cannot tell you how to fix your problem. And I'm at a microoptimization stage right now. For example, I was profiling this:
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        packedCells.Add(Data[x, y].HasCar);
        packedCells.Add(Data[x, y].RoadState);
        packedCells.Add(Data[x, y].Population);
    }
}
ANTS showed that the y-loop-line was taking a lot of time. I thought it was because it has to constantly call the Height getter. So I created a local int height = Height; before the loops, and made the inner loop check for y < height. That actually made the performance worse! ANTS now told me the x-loop-line was a problem. Huh? That's supposed to be insignificant, it's the outer loop!
Eventually I had a revelation - maybe using a property for the outer-loop bound and a local for the inner-loop bound made the CLR jump often between a "locals" cache and a "this-pointer" cache (I'm used to thinking in terms of CPU caches). So I made a local for Width as well, and that fixed it.
From there, it was clear that I should make a local for Data as well - even though Data was not even a property (it was a field). And indeed that bought me some more performance.
Bafflingly, though, reordering the x and y loops (to improve cache usage) made zero difference, even though the array is huge (3000x3000).
Now, I want to learn why the stuff I did improved the performance. What book do you suggest I read?
CLR via C# by Jeffrey Richter.
It is such a great book that someone stole it from my library, together with C# in Depth.
The CLR is not involved at all here; this should all be translated to straight machine code without calls into the CLR. The JIT compiler is responsible for generating that machine code, and it has an optimizer that tries to come up with the most efficient code. It has limitations - it cannot spend a large amount of time on it.
One of the important things it does is figuring out what local variables should be stored in the CPU registers. That's something that changed when you put the Height property in a local variable. It possibly decided to store that variable in a register. But now there's one less available to store another variable. Like the x or y variable, one that's critical for speed. Yes, that will slow it down.
You got a bad diagnostic about the outer loop. That could possibly be caused by the JIT optimizer re-arranging the loop code, giving the profiler a harder time mapping the machine code back to the corresponding C# statement.
Similarly, the optimizer might have decided that you were using the array inefficiently and switched the indexing order back. Not so sure it actually does that, but not impossible.
Anyhoo, the only way you can get some insight here is by looking at the generated machine code. There are many decent books about x86 assembly code, although they might be a bit hard to find these days. Your starting point is Debug + Windows + Disassembly.
Keep in mind however that even the machine code is not a very good predictor of how efficient code is going to run. Modern CPU cores are enormously complicated and the machine code is no longer representative for what actually happens inside the core. The only tried and true way is what you've already been doing: trial and error.
Albin - no. Honestly I didn't think that running outside a profiler would change the performance difference, so I didn't bother. You think I should have? Has that been a problem for you before? (I am compiling with optimizations on though)
Running under a debugger changes the performance: when it's being run under a debugger, the just-in-time compiler automatically disables optimizations (to make it easier to debug)!
If you must, use the debugger to attach to an already-running already-JITted process.
One thing you should know about working with Arrays is that the CLR will always make sure that array-indices are not out-of-bounds. It has an optimization for 1-dimensional arrays but not for 2+ dimensions.
Knowing this, you may want to benchmark MyCell[][] Data (a jagged array) instead of MyCell[,] Data.
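A small sketch of that jagged-array alternative (MyCell, Width and Height are the names from your question; the allocation loop is the usual way to build a jagged array):
// Allocate one inner 1-D array per column/row of the grid.
MyCell[][] data = new MyCell[Width][];
for (int x = 0; x < Width; x++)
    data[x] = new MyCell[Height];

// Indexing then becomes data[x][y]; the JIT can often elide the bounds check on the
// inner 1-D array, which it cannot do for a [,] array.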
Hm, I don't think that loop unrolling is the real problem.
1. I'd try to avoid accessing the array Data three times per inner-loop iteration.
2. I'd also recommend re-thinking the three Add statements: you are apparently accessing a collection three times to add some trivial data. Make it only one access per iteration and add a data type containing the three entries:
for (int y = 0; ... ) {
    var tTemp = Data[x, y];
    packedCells.Add(new {
        tTemp.HasCar, tTemp.RoadState, tTemp.Population
    });
}
Another look reveals that you are basically vectorizing a matrix by copying it into an array (or some other sequential collection)... Is that necessary at all? Why don't you just define a specialized indexer which simulates that linear access? Even better, if you only need to enumerate the entries (in that example you do, no random access required), why don't you use an adequate LINQ expression?
Point 1) Educated guesses are not the way to do performance tuning. In this case I can guess about as well as most, but guessing is the wrong way to do it.
Point 2) Profilers need to be well understood before you know what they're actually telling you. Here's a discussion of the issues. For example, what many profilers do is tell you "where the program spends its time", i.e. where the program counter spends its time, so they are almost absolutely blind to time requested by function calls, which is what your inner loop seems to consist of.
I do a lot of performance tuning, and here is what I do. I cycle between two activities:
Overall time measurement. This doesn't require special tools. I'm not trying to measure individual routines.
"Bottleneck" location. This does not require running the code at any kind of speed, because I'm not measuring. What I'm doing is locating lines of code that are responsible for a significant percent of time. I know which lines they are because they are on the stack for that percent, and stack samples easily find them.
Once I find a "bottleneck" and fix it, I go back to the first step, measure what percent of time I saved, and do it all again on the next "bottleneck", typically from 2 to 6 times. I am helped by the "magnification effect", in which a fixed problem magnifies the percentage used by remaining problems. It works for both macro and micro optimization.
(Sorry if I can't write "bottleneck" without quotes, because I don't think I've ever found a performance problem that resembled the neck of a bottle. Rather they were all simply doing things that didn't really need to be done.)
Since the comment might be overlooked, I repeat myself: it is quite cumbersome to optimize code which is per se superfluous. You do not really need to explicitly linearize your matrix at all - see the comment above: define a linearizing adapter which implements IEnumerable<MyCell> and feed it into the formatter.
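Something along these lines would do (a rough sketch; MyCell comes from the question, and an iterator method returning IEnumerable<MyCell> plays the adapter role - needs "using System.Collections.Generic;"):
// Enumerates a MyCell[,] in row-major order without copying it into another collection first.
IEnumerable<MyCell> Linearize(MyCell[,] data)
{
    int width = data.GetLength(0);
    int height = data.GetLength(1);
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            yield return data[x, y];
}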
I am getting a warning when I try to add another answer, so I am going to recycle this one.. :) After reading Steve's comments and thinking about it for a while, I suggest the following:
If serializing a multi-dimensional array is too slow (I haven't tried it, I just believe you...) don't use it at all! It appears that your matrix is not sparse and has fixed dimensions. So define the structure holding your cells as a simple linear array with an indexer:
[Serializable()]
class CellMatrix
{
    Cell[] mCells;

    public int Rows { get; }
    public int Columns { get; }

    public Cell this[int i, int j]
    {
        get
        {
            return mCells[i + Rows * j];
        }
        // setter...
    }
    // constructor taking rows/cols...
}
A thing like this should serialize as fast as a native array does... I don't recommend hard-coding the layout of Cell in order to save a few bytes there...
Cheers,
Paul
So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.
Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.
I wrote a python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single-core PC from circa 1995 or something ridiculous, and it's only processing about 20MB/hour when I have both files converting at the same time.
Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.
EDIT:
Here's the code for my processing function...
import codecs

def read_and_format(inputfile, outputfile):
    ''' Open input and output files, then read char-by-char and add new lines after ">" '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")

    char = infile.read(1)
    while(1):
        if char == "":
            break
        else:
            outfile.write(char)
            if(char == ">"):
                outfile.write("\n")
        char = infile.read(1)

    infile.close()
    outfile.close()
EDIT2:
Thanks for the awesome responses. Increasing the read size created an unbelievable speed increase. Problem solved.
Reading and writing a single character at a time is almost always going to be slow, because disks are block-based devices, rather than character-based devices - it will read a lot more than just the one byte you're after, and the surplus parts need to be discarded.
Try reading and writing more at a time, say, 8192 bytes (8KB) and then finding and adding newlines in that string before writing it out - you should save a lot in performance because a lot less I/O is required.
As LBushkin points out, your I/O library may be doing buffering, but unless there is some form of documentation that shows this does indeed happen (for reading AND writing), it's a fairly easy thing to try before rewriting in a different language.
Why don't you just use sed?
cat giant.xml | sed 's/>/>\x0d\x0a/g' > giant-with-linebreaks.xml
Rather than reading byte by byte, which incurs a disk access for each byte read, try reading ~20 MB at a time and doing your search + replace on that :)
You can probably do this in Notepad....
Billy3
For the type of problem you describe, I suspect the algorithm you employ for comparing the data will have a much more significant effect than the I/O model or language. In fact, string allocation and search may be more expensive here than anything else.
Some general suggestions before you write this yourself:
Try running on a faster machine if you have one available. That will make a huge difference.
Look for an existing tool online for doing XML diffs ... don't write one yourself.
If you are going to write this in C# (or Java or C/C++), I would do the following:
Read a fairly large block into memory all at once (let's say between 200k and 1M)
Allocate an empty output block three times that size (to cover the worst case where every character is a '>' and gains a CR and LF)
Copy from the input block to the output block conditionally appending a CRLF after each '>' character.
Write the new block out to disk.
Repeat until all the data has been processed.
Additionally, you could also write such a program to run on multiple threads, so that while one thread performs CRLF insertions in memory, a separate thread reads blocks in from disk. This type of parallelization is complicated... so I would only do so if you really need maximum performance.
Here's a really simple C# program to get you started, if you need it. It accepts an input file path and an output path on the command line, and performs the substitution you are looking for ('>' ==> CRLF). This sample leaves much to be improved (parallel processing, streaming, some validation, etc)... but it should be a decent start.
using System;
using System.IO;

namespace ExpandBrackets
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 2)
            {
                using( StreamReader input = new StreamReader( args[0] ) )
                using( StreamWriter output = new StreamWriter( args[1] ) )
                {
                    int readSize = 0;
                    int blockSize = 100000;
                    char[] inBuffer = new char[blockSize];
                    char[] outBuffer = new char[blockSize*3];
                    while( ( readSize = input.ReadBlock( inBuffer, 0, blockSize ) ) > 0 )
                    {
                        int writeSize = TransformBlock( inBuffer, outBuffer, readSize );
                        output.Write( outBuffer, 0, writeSize );
                    }
                }
            }
            else
            {
                Console.WriteLine( "Usage: repchar {inputfile} {outputfile}" );
            }
        }

        private static int TransformBlock( char[] inBuffer, char[] outBuffer, int size )
        {
            int j = 0;
            for( int i = 0; i < size; i++ )
            {
                outBuffer[j++] = inBuffer[i];
                if (inBuffer[i] == '>') // append CR LF
                {
                    outBuffer[j++] = '\r';
                    outBuffer[j++] = '\n';
                }
            }
            return j;
        }
    }
}
All of the languages mentioned typically, at some point, revert to the C runtime library for byte by byte file access. Writing this in C will probably be the fastest option.
However, I doubt it will provide a huge speed boost. Python is fairly speedy, if you're doing things correctly.
The main way to really get a big speed improvement would be to introduce threading. If you read the data in from the file in a large block in one thread, and had a separate thread that did your newline processing + diff processing, you could dramatically improve the speed of this algorithm. This would probably be easier to implement in C++, C#, or IronPython than in C or CPython directly, since they provide very easy, high-level synchronization tools for handling the threading issues (especially when using .NET).
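A rough sketch of that two-thread pipeline in C# terms (the reader thread hands large blocks to the processing thread through a BlockingCollection; the block size and method name are arbitrary, and this only covers the newline insertion, not the diff):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static void AddNewlinesPipelined(string inputPath, string outputPath)
{
    var blocks = new BlockingCollection<char[]>(boundedCapacity: 4);

    // Producer: read large blocks from disk.
    var reader = Task.Run(() =>
    {
        using (var input = new StreamReader(inputPath))
        {
            char[] buffer = new char[1 << 20];
            int read;
            while ((read = input.ReadBlock(buffer, 0, buffer.Length)) > 0)
            {
                var block = new char[read];
                Array.Copy(buffer, block, read);
                blocks.Add(block);
            }
        }
        blocks.CompleteAdding();
    });

    // Consumer: insert newlines after '>' and write out (StreamWriter buffers internally).
    using (var output = new StreamWriter(outputPath))
    {
        foreach (var block in blocks.GetConsumingEnumerable())
        {
            foreach (char c in block)
            {
                output.Write(c);
                if (c == '>') output.Write("\r\n");
            }
        }
    }
    reader.Wait();
}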
you could try xmldiff - http://msdn.microsoft.com/en-us/library/aa302294.aspx
I haven't used it for such huge data but I think it would be reasonably optimized
I put this as a comment on another answer, but in case you miss it--you might want to look at The Shootout. It's a highly optimized set of code for various problems in many languages.
According to those results, Python tends to be about 50x slower than C (but it is faster than the other interpreted languages). In comparison, Java is about 2x slower than C. If you went to one of the faster compiled languages, I don't see why you wouldn't see a similar increase.
By the way, the figures attained from the shootout are wonderfully un-assailable, you can't really challenge them, instead if you don't believe the numbers are fair because the code to solve a problem in your favorite language isn't optimized properly, then you can submit better code yourself. The act of many people doing this means most of the code on there is pretty damn optimized for every popular language. If you show them a more optimized compiler or interpreter, they may include the results from it as well.
Oh: except C#, that's only represented by Mono, so if Microsoft's compiler is more optimized, it's not shown. All the tests seem to run on Linux machines. My guess is Microsoft's C# should run at about the same speed as Java, but the shootout lists Mono as a bit slower (about 3x as slow as C).
As others said, if you do it in C it will be pretty much unbeatable, because C buffers I/O, and getc() is inlined (if I remember correctly).
Your real performance issue will be in the diff.
Maybe there's a pretty good one out there, but for files of that size I doubt it. For fun, I'm a do-it-yourselfer. The strategy I would use is to have a rolling window in each file, several megabytes long. The search strategy for mismatches is a diagonal search: if you are at lines i and j, compare in this sequence:
line(i+0) == line(j+0)
line(i+0) == line(j+1)
line(i+1) == line(j+0)
line(i+0) == line(j+2)
line(i+1) == line(j+1)
line(i+2) == line(j+0)
and so on. No doubt there's a better way, but if I'm going to code it myself and manage the rolling windows, that's what I'd try.
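In loop form, the same diagonal order looks roughly like this (i, j, maxOffset and LinesMatch are placeholders for whatever line-windowing code you end up with):
// Walk successive anti-diagonals: for a given d, compare line(i+k) against line(j+d-k).
bool found = false;
for (int d = 0; !found && d <= maxOffset; d++)
{
    for (int k = 0; k <= d; k++)
    {
        if (LinesMatch(i + k, j + (d - k)))
        {
            found = true;
            break;
        }
    }
}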