I wrote a run-once program to read data from one table and migrate what was read into several other tables (using LINQ). It was one Main() method that extracted the data, transformed it where needed, converted some fields, etc. and inserted the data into the appropriate tables. Basically, just migrating data from one format to another. The program would take about 5 minutes to run, but it did what I needed.
While looking at the program, I thought I'd break up the huge Main() method into smaller chunks. Basically, I just refactored areas of the code and extracted them to methods.
The program still does what it's supposed to (migrate data), but it now takes twice as long, if not longer.
So, my question is: Do method calls slow down processing? None of the code itself changed, other than being put inside its own method.
Yes, function calls generally have a cost, but it's not usually very high unless your code has been refactored to the point where every function has only one line, or you're calling them billions of times :-)
The question you have to ask yourself is: do the benefits outweigh the cost?
Modularising your code will almost certainly make it easier to maintain, unless it's some Mickey-Mouse Hello-World type of program.
The other question you have to ask is, if it's run-once, why did you bother trying to improve it? If five minutes is acceptable, then the effort you spent improving it seems like a sunk cost to me. If it's going to be used a lot, or by many other people, that's one thing. But, if you're only running it (for example) once a month, why bother?
If you really want to know where the bottlenecks are, Microsoft have spent some time making it easy for you.
Though not a huge sample, consider the following C program (since that's my area of expertise):
#include <stdio.h>

static void xyzzy(void) {}          /* deliberately empty: we only measure the call */

int main(int argc, char *argv[]) {
    (void)argv;                     /* unused */
    int x = argc;                   /* runtime-dependent seed so the compiler can't pre-compute x */
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 1000000; j++) {
            x = x + 1;
            //xyzzy();              /* uncomment for the "with call" runs */
        }
    }
    printf("%d\n", x);
    return 0;
}
When compiled (without any optimisation, since I don't want the compiler second-guessing me, and using a little trickery to reduce the chances of the compiler weaving any magic before running the code), the figures I get with and without the function call (five separate runs each, using sys+user times from the time command) are:
with without
------- -------
2.452 2.264
2.451 2.358
2.468 2.342
2.390 2.233
2.374 2.249
------- -------
12.135 11.446 total
2.468 2.358 max
2.374 2.233 min
So what can we tell from that, apart from the fact I'm a lousy statistician? :-)
Based on the totals, it appears the version without the function call is about 6% faster. It's also telling that the fastest run with the function call is still slower than the slowest run without it.
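For readers who want to repeat the experiment in C# itself, here is a rough translation (a sketch of mine, not the original poster's code; the NoInlining attribute matters because the JIT would otherwise inline the empty call away and erase the difference):

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class CallOverhead
{
    [MethodImpl(MethodImplOptions.NoInlining)] // stop the JIT from inlining the empty call away
    static void Xyzzy() { }

    static void Main(string[] args)
    {
        int x = args.Length;            // runtime-dependent seed, same trick as the C version
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++)
        {
            for (int j = 0; j < 1000000; j++)
            {
                x = x + 1;
                //Xyzzy();              // uncomment for the "with call" runs
            }
        }
        sw.Stop();
        Console.WriteLine("{0} in {1} ms", x, sw.ElapsedMilliseconds);
    }
}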
I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:
void ProcessFrame(double[,] frame)
{
    int width = frame.GetLength(1);
    int height = frame.GetLength(0);
    byte[,] result = new byte[height, width];
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
            result[row, col] = ManipulatePixel(frame[row, col]); // store the transformed pixel
    });
}
Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm (ManipulatePixel()) is, the application can't keep up with the camera's frame rate any more. However, I have noticed that even though I am using parallel for loops, the application simply won't use all of the CPU that is available - the Task Manager performance tab shows about 60-80% CPU usage.
I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the Parallel Patterns Library. The C++ code uses all of the CPU it can get, as I would expect, and I also tried PInvoking a C++ DLL from my C# code, running the same algorithm that runs slowly in the C# library; it also uses all the CPU power available, CPU usage is right at 100% virtually the whole time, and there is no trouble at all keeping up with the camera.
Outsourcing the code into a C++ DLL and then marshalling it back into C# is an extra hassle I'd of course rather avoid. How do I make my C# code actually make use of all the CPU potential? I have tried increasing process priority like this:
using (Process process = Process.GetCurrentProcess())
    process.PriorityClass = ProcessPriorityClass.RealTime;
This has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
and then passing that to the Parallel.For() loop. This had no effect at all, but I suppose that's not surprising, since the default settings should already be optimized. I also tried setting this in the application configuration:
<runtime>
  <Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
  <GCCpuGroup enabled="true"></GCCpuGroup>
  <gcServer enabled="true"></gcServer>
</runtime>
but this actually makes it run even slower.
EDIT:
The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:
void ProcessFrame(double[,] frame)
{
    byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];
    Parallel.For(0, frame.GetLength(0), row =>
    {
        // note: frame.GetLength(1) is re-evaluated on every iteration of the inner loop
        for (var col = 0; col < frame.GetLength(1); ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Sorry for this; I was paraphrasing the code at the time and didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally wrote (i.e. the width and height variables set at the beginning of the function, so the array's length properties are queried only once each instead of in the for loops' conditional statements). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster - I hadn't realized this because C++ doesn't have built-in 2D arrays, so I had to pass the pixel dimensions as separate arguments anyway. Whether it is fast enough as it is, I cannot say yet.
Also thank you for the other answers; they contain a lot of things I don't know yet, but it's great to know what I have to look for. I'll update once I've reached a satisfactory result.
I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your post is short on details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame). But here are some general tips that apply to your case.
2D arrays in .NET are significantly slower than 1D arrays and jagged arrays, even in .NET Core today - this is a longstanding bug.
See here:
https://github.com/dotnet/coreclr/issues/4059
Why are multi-dimensional arrays in .NET slower than normal arrays?
Multi-dimensional array vs. One-dimensional
So consider changing your code to use either a jagged array (which also helps with memory locality/proximity caching, as each thread would have its own private buffer) or a 1D array with your own code being responsible for bounds-checking.
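As an illustration, here is a minimal sketch of the 1D-buffer approach (width, height and the frame contents are assumed from the question; the doubling at the end is just a stand-in for real per-pixel work):

double[] flat = new double[height * width];

// copy out of the 2D frame once (or fill the flat buffer directly from the camera)
for (int row = 0; row < height; row++)
    for (int col = 0; col < width; col++)
        flat[row * width + col] = frame[row, col];

// later passes index the flat buffer; the JIT can eliminate bounds checks for
// this pattern, which it cannot do for double[,]
for (int i = 0; i < flat.Length; i++)
    flat[i] = flat[i] * 2.0;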
Or better yet: use stackalloc to manage the buffer's lifetime and pass it by pointer (unsafe ahoy!) to your thread delegate.
Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.
Avoid allocating a new buffer for each frame encountered - if a frame has a limited lifespan then consider using reusable buffers using a buffer-pool.
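One ready-made way to do that is ArrayPool<T> from the System.Buffers package (built into .NET Core; available via NuGet for .NET Framework). A minimal sketch:

using System.Buffers;

// rent a reusable buffer instead of allocating one per frame; note that Rent
// may return a larger array than requested, so track the length you actually use
byte[] buffer = ArrayPool<byte>.Shared.Rent(width * height);
try
{
    // ... fill and consume this frame's pixels here ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}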
Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot - but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need to use .NET Core 2.0 or later for the best accelerated functionality).
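A hedged sketch using the SIMD-enabled Vector<T> type from System.Numerics (the System.Numerics.Vectors package; hardware-accelerated under RyuJIT on x64). It clamps a row of pixels to a range, the kind of per-pixel arithmetic discussed here:

using System;
using System.Numerics;

static class SimdClamp
{
    // clamp every pixel to [min, max], Vector<ushort>.Count lanes at a time
    public static void ClampRow(ushort[] data, ushort min, ushort max)
    {
        var vmin = new Vector<ushort>(min);
        var vmax = new Vector<ushort>(max);
        int i = 0;
        for (; i <= data.Length - Vector<ushort>.Count; i += Vector<ushort>.Count)
        {
            var v = new Vector<ushort>(data, i);                   // load one register's worth
            Vector.Min(Vector.Max(v, vmin), vmax).CopyTo(data, i); // clamp, store back
        }
        for (; i < data.Length; i++) // scalar tail for the leftover pixels
        {
            if (data[i] < min) data[i] = min;
            else if (data[i] > max) data[i] = max;
        }
    }
}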
Also, avoid copying individual bytes or scalar values inside a for loop in C#; instead consider using Buffer.BlockCopy for bulk copy operations (as these can use hardware memory copy features).
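For example (note that BlockCopy's length argument is in bytes, not elements):

// bulk-copy a whole frame of ushort pixels in one call
ushort[] source = new ushort[width * height];
ushort[] destination = new ushort[width * height];
Buffer.BlockCopy(source, 0, destination, 0, source.Length * sizeof(ushort));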
Regarding your observation of "80% CPU usage": a busy loop will use 100% CPU within the time-slices the operating system provides, so if you don't see 100% usage then either:
Your code is actually running faster than real-time (this is a good thing!), unless you're certain your program can't keep up with the input?
Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
Also ensure you aren't using any lock (Monitor) calls or other thread or memory synchronization primitives.
Efficiency matters (this workload is not truly [PARALLEL]; it may, but need not, benefit from merely [CONCURRENT] work).
The best, though rather hard, way, if ultimate performance is a must:
Inline assembly, optimised for the cache-line sizes in the CPU hierarchy, with indexing that follows the actual memory layout of the 2D data (column-wise or row-wise). Given that no 2D kernel transformation is mentioned, your process never needs to "touch" any topological neighbours, so the indexing can step across both ranges of the 2D domain in whatever order is cheapest, and ManipulatePixel() may get more efficient transforming blocks of pixels rather than bearing the full call overhead for each isolated, atomised 1-px operation (ILP and cache efficiency are on your side).
Given your target production-platform CPU family, it is best to use block-SIMD vectorised instructions from AVX2, or better still AVX-512. As you most probably know, you can prototype in C/C++ with AVX intrinsics, inspect the resulting assembly, and finally "copy" the best resulting assembly into your C# via assembly inlining. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet they may help in almost hard-real-time production settings (though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour).
A cheap, few-seconds step:
Test and benchmark the runtime per batch of frames with the composition reversed: move the more "expensive" part, the Parallel.For(...){...}, inside the for (var col = 0; col < width; ++col) {...} loop, to see how the instantiation costs of the Parallel.For() instrumentation change.
Next, if going this cheap way, think about refactoring ManipulatePixel() to work on at least a block of data, aligned with the data-storage layout and sized as a multiple of the cache-line length (a cache hit costs roughly 0.5-5 ns per memory access, versus roughly 100-380 ns otherwise). Distributing the work at 1-px granularity across all NUMA CPU cores costs far more time, due to extended access latencies for cross-NUMA (non-local) memory addresses, and it never re-uses an expensively cached block of fetched data: you "use" just 1 px and throw away the rest of the cached block, since those pixels will get re-fetched and manipulated on some other CPU core at some other time, roughly a triple waste of time. Sorry to have mentioned that explicitly, but when shaving every possible nanosecond this cannot happen in a production pipeline.
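A minimal sketch of that block-oriented refactoring (the ManipulateRow helper and the flat row-major frame layout are my assumptions for illustration, not the asker's code):

using System.Threading.Tasks;

static class BlockProcessing
{
    // block-oriented variant: the parallel delegate fires once per row instead
    // of once per pixel, so call overhead is amortised and each worker streams
    // through memory in layout order
    static void ProcessFrame(ushort[] frame, int width, int height)
    {
        Parallel.For(0, height, row =>
        {
            ManipulateRow(frame, row * width, width);
        });
    }

    // hypothetical helper: transforms 'count' contiguous pixels starting at 'offset'
    static void ManipulateRow(ushort[] frame, int offset, int count)
    {
        for (int i = offset; i < offset + count; i++)
            frame[i] = (ushort)(frame[i] >> 2); // stand-in for the real per-pixel work
    }
}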
Anyway, let me wish you perseverance and good luck in your steps forward to get the needed efficiency back on your side.
Here's what I ended up doing, mostly based on Dai's answer:
made sure to query image pixel dimensions once at the beginning of the processing functions, not within the for loops' conditional statements. With parallel loops, it would seem this creates competing accesses to those properties from multiple threads, which noticeably slows things down.
removed allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates one buffer for each image processing step (filtering, scaling, colorizing) only, which doesn't change in size but gets overwritten with each frame.
removed an extra data processing step where raw image data in the format ushort (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
I also tried, without success, to use 1D arrays instead of 2D but there is actually no difference in performance. I don't know if it's because the bug Dai mentioned was fixed in the meantime, but I couldn't confirm 2D arrays to be any slower than 1D arrays.
Probably also worth mentioning, the ManipulatePixel() function in my original post was actually more of a placeholder rather than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:
private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
    Debug.Assert(originalImg != null);
    Debug.Assert(originalImg.Length != 0);
    Debug.Assert(scaledImg != null);
    Debug.Assert(scaledImg.Length == originalImg.Length);

    ushort min = limits.Item1;
    ushort max = limits.Item2;
    int width = originalImg.GetLength(1);
    int height = originalImg.GetLength(0);

    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
        {
            ushort value = originalImg[row, col];
            if (value < min)
                scaledImg[row, col] = 0;
            else if (value > max)
                scaledImg[row, col] = 255;
            else
                scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
        }
    });
}
This is just one step and some others are much more complex but the approach would be similar.
Some of the things mentioned, like SIMD/AVX or user3666197's answer, are unfortunately well beyond my abilities right now, so I couldn't test them out.
It's still relatively easy to put enough processing load into the stream to tank the frame rate but for my application the performance should be enough now. Thanks to everyone who provided input, I'll mark Dai's answer as accepted because I found it the most helpful.
I'm trying to determine the impact of each "if" statement on the performance of my C# application when it is used in loops with a large number of iterations. I have not found an existing topic about this, so I created this one.
For the test I made two loops: one without an "if" and one with a single "if" statement. The code is the following.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Diagnostics;

namespace IfPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            int N = 500000000;
            Stopwatch sw = new Stopwatch();
            double a = 0, b = 0;
            bool f;

            sw.Restart();
            for (int i = 0; i < N; i++)
            {
                a += 1.1;
                f = a < N;
            }
            sw.Stop();
            Console.WriteLine("Without if: " + sw.ElapsedMilliseconds + " ms");

            a = 0;
            sw.Restart();
            for (int i = 0; i < N; i++)
            {
                if (a < N)
                    a += 1.1;
                else
                    b += 1.1;
            }
            sw.Stop();
            Console.WriteLine("With if: " + sw.ElapsedMilliseconds + " ms");

            Console.ReadKey();
        }
    }
}
I ran the test with "Optimize code" build option and "Start without debugging". The result is the following:
Without if: 154 ms
With if: 742 ms
This means that a single "if" statement brings an almost 5-times slowdown. I think being aware of this will be helpful.
Also, I have noticed that the presence of several extra "if"s in a large loop may slow down my final application by 25%, which in my opinion is significant.
To be specific, I'm running Monte-Carlo optimization on a set of data, which requires many loops through the whole data set. The loop contains branching that depends on the user settings. This is where the "if"s arise.
My questions to the professionals in performance aspects are:
What is the impact of extra "if"s in a loop on the time of running many iterations?
How to avoid the slowdown?
Please post your opinion if I'm going in the wrong direction.
It doesn't matter ...
You're testing 500 MILLION iterations ... and it takes less than a second ... IN THE WORST case ...
As the comments said, you'll be in a hell of a lot of trouble to begin with, since you shouldn't be running in debug mode when testing performance, and even then you'll have heaps of other things to take into consideration (performance testing is a whole big world, and it's not as simple as it usually seems).
Now, do notice that you're doing two different things in the two loops. If you want to see the cost of the if itself, both loops should do basically the same work. I'm sure the branching changes the IL code to begin with...
Last, but not least, as I said... it DOESN'T MATTER... unless you really need to run 500 million iterations, and have this in so many places that your program starts to slow down because of it.
Go for readability over obsessing about whether you can save a few microseconds on an if statement.
Feel free to read these articles by Eric Lippert (who has "only" 250K rep and was a principal developer on the C# compiler team :) who'll get you in the right direction:
c# performance benchmarks mistakes part 1
c# performance benchmarks mistakes part 2
c# performance benchmarks mistakes part 3
c# performance benchmarks mistakes part 4
(Talking about this, I would guess that garbage collection (article 4) might have been something to consider ...)
Then look at: this elaborate answer about the topic
And last, but not least, have a look at Writing Faster Managed Code: Know What Things Cost. This is by Jan Gray, from the Microsoft CLR Performance Team. I'll be honest and say I haven't read this one yet :). I will though, later on...
It goes on and on... :)
These code samples do two different things: one is a boolean assignment and the other is a conditional statement, so this is not a suitable method for evaluating performance.
Those benchmarks tell you essentially nothing at all.
There are much more things at play than just an additional if.
You also have to take branch-prediction and caching into account.
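To see branch prediction at work, here is a small self-contained sketch (my own example, not the original poster's code): on random data the if below is mispredicted about half the time; sort the array first and the same loop typically runs several times faster.

using System;
using System.Diagnostics;

class BranchDemo
{
    static void Main()
    {
        var data = new int[1 << 20];
        var rng = new Random(42);
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next(256);
        // Array.Sort(data);   // uncomment: same work, far fewer mispredictions

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int pass = 0; pass < 100; pass++)
            for (int i = 0; i < data.Length; i++)
                if (data[i] >= 128)   // roughly 50/50 and unpredictable on random data
                    sum += data[i];
        sw.Stop();
        Console.WriteLine("{0} ms (sum {1})", sw.ElapsedMilliseconds, sum);
    }
}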
Such micro-optimizations will only hinder you from writing good code.
You will spend more time optimizing useless stuff than implementing good features in your software...
Think of it this way: no amount of micro-optimization will help you if you have even a single design mistake in your code.
For example, using an unsuitable data structure (say, a list for "fast" lookups instead of a dictionary), as the sketch below shows.
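To make that concrete, a tiny sketch (my own example): the design choice between a list and a hash-based lookup dwarfs any per-"if" micro-cost.

using System;
using System.Collections.Generic;

class LookupDemo
{
    static void Main()
    {
        var list = new List<int>();
        for (int i = 0; i < 1000000; i++) list.Add(i);
        var set = new HashSet<int>(list);

        bool inList = list.Contains(999999); // O(n): may scan the whole list
        bool inSet = set.Contains(999999);   // O(1) amortised: a single hash probe
        Console.WriteLine("{0} {1}", inList, inSet);
    }
}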
I'm working on an image processing library which extends OpenCV, HALCON, ... . The library must work with .NET Framework 3.5, and since my experience with .NET is limited, I would like to ask some questions regarding performance.
I have encountered a few specific things which I cannot explain to myself properly and would like you to ask a) why and b) what is the best practise to deal with the cases.
My first question is about Math.Pow. I already found some answers here on StackOverflow which explain (a) quite well, but not what to do about it (b). My benchmark program looks like this:
Stopwatch watch = new Stopwatch(); // from System.Diagnostics
watch.Start();
for (int i = 0; i < 1000000; i++)
{
    double result = Math.Pow(4, 7); // the function call
}
watch.Stop();
The result was not very nice (~300 ms on my computer; I ran the test 10 times and calculated the average).
My first idea was to check whether this is because it is a static function. So I implemented my own class:
class MyMath
{
    public static double Pow(double x, double y) // using some expensive functions to calculate the power
    {
        return Math.Exp(Math.Log(x) * y);
    }

    public static double PowLoop(double x, int y) // using a loop
    {
        double res = x;
        for (int i = 1; i < y; i++)
            res *= x;
        return res;
    }

    public static double Pow7(double x) // using inline multiplications
    {
        return x * x * x * x * x * x * x;
    }
}
The third thing I checked was whether I could replace Math.Pow(4,7) directly with 4*4*4*4*4*4*4.
The results are (the average out of 10 test runs)
300 ms Math.Pow(4,7)
356 ms MyMath.Pow(4,7) //gives wrong rounded results
264 ms MyMath.PowLoop(4,7)
92 ms MyMath.Pow7(4)
16 ms 4*4*4*4*4*4*4
Now my situation is basically this: don't use Math.Pow. My only problem is just that... do I really have to implement my own Math class now? It seems somehow ineffective to implement my own class just for the power function. (Btw. PowLoop and Pow7 are even faster in the Release build by ~25%, while Math.Pow is not.)
So my final questions are
a) am I wrong to avoid Math.Pow altogether (except maybe for fractional exponents)? (Which makes me somehow sad.)
b) if you have code to optimize, do you really write all such mathematical operations out directly?
c) is there maybe already a faster (open-source^^) library for mathematical operations?
d) the source of my question is basically: I had assumed that the .NET Framework itself already provides very optimized code / compile results for such basic operations, be it the Math class or handling arrays, and I was a little surprised how much I gained by writing my own code. Are there other general areas in C# to look out for, where I cannot trust C# directly?
Two things to bear in mind:
You probably don't need to optimise this bit of code. You've just done a million calls to the function in less than a second. Is this really going to cause big problems in your program?
Math.Pow is probably fairly optimal anyway. At a guess, it will be calling a proper numerics library written in a lower level language, which means you shouldn't expect orders of magnitude increases.
Numerical programming is harder than you think. Even the algorithms that you think you know how to calculate, aren't calculated that way. For example, when you calculate the mean, you shouldn't just add up the numbers and divide by how many numbers you have. (Modern numerics libraries use a two pass routine to correct for floating point errors.)
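To illustrate that last point, here is a minimal sketch of a corrected two-pass mean (an illustration of the idea, not the exact routine any particular library uses):

// corrected two-pass mean: a second pass over the residuals cancels much of
// the rounding error that the naive running sum accumulates
static double Mean(double[] xs)
{
    double sum = 0;
    for (int i = 0; i < xs.Length; i++) sum += xs[i];
    double m = sum / xs.Length;

    double residual = 0; // exactly 0 in perfect arithmetic; captures rounding error here
    for (int i = 0; i < xs.Length; i++) residual += xs[i] - m;
    return m + residual / xs.Length;
}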
That said, if you decide that you definitely do need to optimise, then consider using integers rather than floating point values, or outsourcing this to another numerics library.
Firstly, integer operations are much faster than floating point. If you don't need floating point values, don't use the floating point data type. This is generally true for any programming language.
Secondly, as you have stated yourself, Math.Pow can handle real exponents. It uses a much more intricate algorithm than a simple loop, so it's no wonder it is slower than simply looping. If you get rid of the loop and just do n multiplications, you also cut the overhead of setting up the loop, making it faster still. But if you don't use a loop, you have to know the value of the exponent beforehand - it can't be supplied at runtime.
I am not really sure why Math.Exp and Math.Log are faster. But if you use Math.Log, you can't find the power of negative values.
Basically, ints are faster, and avoiding loops avoids extra overhead. But you trade off some flexibility when you go for those. It is generally a good idea to avoid reals when all you need are integers, but in this case coding up a custom function when one already exists seems a little too much.
The question you have to ask yourself is whether this is worth it. Is Math.Pow actually slowing your program down? And in any case, the Math.Pow already bundled with your language is often the fastest or very close to that. If you really wanted to make an alternate implementation that is really general purpose (i.e. not limited to only integers, positive values, etc.), you will probably end up using the same algorithm used in the default implementation anyway.
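If a custom integer power does turn out to be worth it, exponentiation by squaring is the usual middle ground: O(log y) multiplications with the exponent still supplied at runtime. A minimal sketch (my own, assuming y >= 0):

// exponentiation by squaring: O(log y) multiplications instead of O(y)
static double PowBySquaring(double x, int y)
{
    double result = 1.0;
    while (y > 0)
    {
        if ((y & 1) == 1) result *= x; // fold in the current bit of the exponent
        x *= x;                        // square for the next bit
        y >>= 1;
    }
    return result;
}

For comparison, MyMath.PowLoop above performs y - 1 multiplications, while this performs at most about 2 * log2(y).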
When you are talking about making a million iterations of a line of code then obviously every little detail will make a difference.
Math.Pow() is a function call, which will be substantially slower than your manual 4*4...*4 example.
Don't write your own class, as it's doubtful you'll be able to write anything more optimised than the standard Math class.
What's more efficient?
decimal value1, value2, formula;
This:
for (long i = 0; i < 1000000000000; i++)
{
    value1 = getVal1fromSomeWhere();
    value2 = getVal2fromSomeWhere();
    SendResultToA(value1 * value2 + value1 / value2);
    SendResultToB(value1 * value2 + value1 / value2);
}
Or this:
for (long i = 0; i < 1000000000000; i++)
{
    value1 = getVal1fromSomeWhere();
    value2 = getVal2fromSomeWhere();
    formula = value1 * value2 + value1 / value2;
    SendResultToA(formula);
    SendResultToB(formula);
}
Intuitively I would go for the latter...
I guess there's a tradeoff between having an extra assignment at each iteration (the decimal formula) and performing the computation twice per iteration with no extra variable...
EDIT :
Uhhh. God... Do I have to go through this every time I ask a question?
If I ask it, it is because YES, it DOES MATTER to me, fellows.
Not everybody lives in a gentle, non-memory-critical world. WAKE UP!
This was just an overly simple example. I am doing MILLIONS of scientific computations and cloud-based multithreaded stuff; do not take me for a noob :-)
So YES, DEFINITELY, every nanosecond counts.
PS: I almost miss C++ and pointers. Automatic memory management and GCs have definitely made developers ignorant and lazy :-P
First of all: profile first, and only do such micro-optimizations if necessary. Otherwise optimize for readability. And in your case I think the second one is easier to read.
Also, your claim that the second version has an additional assignment isn't true anyway: the result of your formula needs to be stored in a register in both versions.
The concept of the extra variable isn't valid once the code is compiled. For example in your case the compiler can store formula in the register where value1 or value2 was stored before, since their lifetimes don't overlap.
I wouldn't be surprised if the first one gets optimized into the second. I think this optimization is called common subexpression elimination. But of course it's only possible if the expression is free of side effects.
And inspecting the IL isn't always enough to see what gets optimized; the jitter optimizes too. I had some code that looked quite ugly and slow in IL, but was very short in the finally generated x86 code. And when inspecting the machine code you need to make sure it's actually optimized: for example, if you run inside VS, even the release code isn't fully optimized.
So my guess is that they are equally fast if the compiler can optimize them, and else the second one is faster since it doesn't need to evaluate your formula twice.
Unless you're doing this tens of thousands of times a second, it doesn't matter at all. Optimize towards readability and maintainability!
Edit: Haters gonna hate, okay fine, here you go. My code:
static void MethodA()
{
    for (int i = 0; i < 1000; i++) {
        var value1 = getVal1fromSomeWhere();
        var value2 = getVal2fromSomeWhere();
        SendResultToA(value1 * value2 + value1 / value2);
        SendResultToB(value1 * value2 + value1 / value2);
    }
}

static void MethodB()
{
    for (int i = 0; i < 1000; i++) {
        var value1 = getVal1fromSomeWhere();
        var value2 = getVal2fromSomeWhere();
        var formula = value1 * value2 + value1 / value2;
        SendResultToA(formula);
        SendResultToB(formula);
    }
}
And the actual x86 assembly generated by both of them:
MethodA: http://pastie.org/1532794
MethodB: http://pastie.org/1532792
These are very long because the JIT inlined getVal[1/2]fromSomewhere and SendResultTo[A/B], which I wired up to Random and Console.WriteLine. We can see that indeed, neither the CLR nor the jitter is smart enough to avoid duplicating the previous calculation, so we spend an additional 318 bytes of x86 machine code doing the extra math.
However, keep this in mind: any gains you make from these kinds of optimizations are immediately made irrelevant by even a single extra page fault or disk read/write. These days, CPUs are rarely the bottleneck in most applications; I/O and memory are. Optimize toward spatial locality (i.e. using contiguous arrays so you hit fewer page faults), and toward reducing disk I/O and hard page faults (loading code you don't need requires the OS to fault it in).
To the extent that it might matter, I think you're right. And both are equally readable (arguably).
Remember, the number of loop iterations has nothing to do with the local memory requirements. You're only talking about a few extra bytes (and the value is going to be put on the stack for passing to the function anyway), whereas the cycles you save* by caching the result of the calculation do add up significantly as the number of iterations grows.
* That is, provided that the compiler doesn't do this for you. It would be instructive to look at the IL generated in each case.
You'd have to disassemble the bytecode and/or benchmark to be sure, but I'd argue it would probably be the same, since it's trivial for the compiler to see that formula (in the loop scope) does not change and can quite easily be 'inlined' (substituted) directly.
EDIT: As user CodeInChaos correctly comments disassembling the bytecode might not be enough since it's possible the optimisation is only introduced after jitting.
Yes, I am using a profiler (ANTS). But at the micro level it cannot tell you how to fix your problem, and I'm at the micro-optimization stage right now. For example, I was profiling this:
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        packedCells.Add(Data[x, y].HasCar);
        packedCells.Add(Data[x, y].RoadState);
        packedCells.Add(Data[x, y].Population);
    }
}
ANTS showed that the y-loop-line was taking a lot of time. I thought it was because it has to constantly call the Height getter. So I created a local int height = Height; before the loops, and made the inner loop check for y < height. That actually made the performance worse! ANTS now told me the x-loop-line was a problem. Huh? That's supposed to be insignificant, it's the outer loop!
Eventually I had a revelation - maybe using a property for the outer-loop-bound and a local for the inner-loop-bound made CLR jump often between a "locals" cache and a "this-pointer" cache (I'm used to thinking in terms of CPU cache). So I made a local for Width as well, and that fixed it.
From there, it was clear that I should make a local for Data as well - even though Data was not even a property (it was a field). And indeed that bought me some more performance.
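For reference, the hoisted variant described above looks roughly like this (Width, Height, Data and packedCells as in the original snippet):

int width = Width;   // property getter called once, not Width times
int height = Height;
var data = Data;     // the field read once as well

for (int x = 0; x < width; x++)
{
    for (int y = 0; y < height; y++)
    {
        var cell = data[x, y]; // one array access per iteration instead of three
        packedCells.Add(cell.HasCar);
        packedCells.Add(cell.RoadState);
        packedCells.Add(cell.Population);
    }
}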
Bafflingly, though, reordering the x and y loops (to improve cache usage) made zero difference, even though the array is huge (3000x3000).
Now, I want to learn why the stuff I did improved the performance. What book do you suggest I read?
CLR via C# by Jeffrey Richter.
It is such a great book that someone stole it from my library, together with C# in Depth.
The CLR is not involved at all here; this should all be translated to straight machine code without calls into the CLR. The JIT compiler is responsible for generating that machine code, and it has an optimizer that tries to come up with the most efficient code. It has limitations, though: it cannot spend a large amount of time on it.
One of the important things it does is figuring out what local variables should be stored in the CPU registers. That's something that changed when you put the Height property in a local variable. It possibly decided to store that variable in a register. But now there's one less available to store another variable. Like the x or y variable, one that's critical for speed. Yes, that will slow it down.
You got a bad diagnostic about the outer loop. That could possibly be caused by the JIT optimizer re-arranging the loop code, giving the profiler a harder time mapping the machine code back to the corresponding C# statement.
Similarly, the optimizer might have decided that you were using the array inefficiently and switched the indexing order back. Not so sure it actually does that, but not impossible.
Anyhoo, the only way you can get some insight here is by looking at the generated machine code. There are many decent books about x86 assembly code, although they might be a bit hard to find these days. Your starting point is Debug + Windows + Disassembly.
Keep in mind however that even the machine code is not a very good predictor of how efficient code is going to run. Modern CPU cores are enormously complicated and the machine code is no longer representative for what actually happens inside the core. The only tried and true way is what you've already been doing: trial and error.
Albin - no. Honestly I didn't think that running outside a profiler would change the performance difference, so I didn't bother. You think I should have? Has that been a problem for you before? (I am compiling with optimizations on though)
Running under a debugger changes the performance: when it's being run under a debugger, the just-in-time compiler automatically disables optimizations (to make it easier to debug)!
If you must, use the debugger to attach to an already-running already-JITted process.
One thing you should know about working with arrays is that the CLR will always make sure that array indices are not out of bounds. It has an optimization for 1-dimensional arrays, but not for 2+ dimensions.
Knowing this, you may want to benchmark MyCell[][] Data instead of MyCell[,] Data
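A minimal sketch of that conversion, reusing Width, Height and Data from the question (a one-time copy; each inner array is 1-dimensional, so the JIT's bounds-check elimination applies when iterating it):

MyCell[][] jagged = new MyCell[Width][];
for (int x = 0; x < Width; x++)
{
    jagged[x] = new MyCell[Height];
    for (int y = 0; y < Height; y++)
        jagged[x][y] = Data[x, y];
}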
Hm, I don't think that loop unrolling is the real problem.
1. I'd try to avoid accessing the array Data three times per inner-loop iteration.
2. I'd also recommend re-thinking the three Add statements: you are accessing a collection three times per iteration to add some trivial data. Make it only one access per iteration, and add a data type containing the three entries:
for (int y = 0; ... {
    var tTemp = Data[x, y];
    packedCells.Add(new {
        tTemp.HasCar, tTemp.RoadState, tTemp.Population
    });
}
Another look reveals that you are basically vectorizing a matrix by copying it into an array (or some other sequential collection)... Is that necessary at all? Why don't you just define a specialized indexer which simulates that linear access? Even better: if you only need to enumerate the entries (in this example you do, no random access required), why don't you use an adequate LINQ expression?
Point 1) Educated guesses are not the way to do performance tuning. In this case I can guess about as well as most, but guessing is the wrong way to do it.
Point 2) Profilers need to be well understood before you know what they're actually telling you. Here's a discussion of the issues. For example, what many profilers do is tell you "where the program spends its time", i.e. where the program counter spends its time, so they are almost completely blind to time spent in function calls, which is what your inner loop seems to consist of.
I do a lot of performance tuning, and here is what I do. I cycle between two activities:
Overall time measurement. This doesn't require special tools. I'm not trying to measure individual routines.
"Bottleneck" location. This does not require running the code at any kind of speed, because I'm not measuring. What I'm doing is locating lines of code that are responsible for a significant percent of time. I know which lines they are because they are on the stack for that percent, and stack samples easily find them.
Once I find a "bottleneck" and fix it, I go back to the first step, measure what percent of time I saved, and do it all again on the next "bottleneck", typically from 2 to 6 times. I am helped by the "magnification effect", in which a fixed problem magnifies the percentage used by remaining problems. It works for both macro and micro optimization.
(Sorry if I can't write "bottleneck" without quotes, because I don't think I've ever found a performance problem that resembled the neck of a bottle. Rather they were all simply doing things that didn't really need to be done.)
Since the comment might be overlooked, I repeat myself: it is quite cumbersome to optimize code which is per se superfluous. You do not really need to explicitly linearize your matrix at all; see the comment above: define a linearizing adapter which implements IEnumerable<MyCell> and feed it into the formatter.
I am getting a warning when I try to add another answer, so I am going to recycle this one... :) After reading Steve's comments and thinking about it for a while, I suggest the following:
If serializing a multi-dimensional array is too slow (I haven't tried it, I just believe you...) don't use it at all! It appears that your matrix is not sparse and has fixed dimensions. So define the structure holding your cells as a simple linear array with an indexer:
[Serializable()]
class CellMatrix
{
    Cell[] mCells;

    public int Rows { get; private set; }
    public int Columns { get; private set; }

    public Cell this[int i, int j]
    {
        get
        {
            return mCells[i + Rows * j];
        }
        // setter...
    }

    // constructor taking rows/cols...
}
A thing like this should serialize as fast as a native array does... I don't recommend hard-coding the layout of Cell in order to save a few bytes there...
Cheers,
Paul