Why does the compiler do to optimize my code?
I have 2 functions:
public void x1() {
x++;
x++;
}
public void x2() {
x += 2;
}
public void x3() {
x = x + 2;
}
public void y3() {
x = x * x + x * x;
}
And that is what I can see with ILSpy after compiling in Release mode:
// test1.Something
public void x1()
{
this.x++;
this.x++;
}
// test1.Something
public void x2()
{
this.x += 2;
}
// test1.Something
public void x3()
{
this.x += 2;
}
// test1.Something
public void y3()
{
this.x = this.x * this.x + this.x * this.x;
}
x2 and x3 might be ok. But why is x1 not optimized to the same result? There is not reason to keep it a 2 step increment?
And why is y3 not x=2*(x*x)? Shouldn't that be faster than x*x+x*x?
That leads to the question? What kind of optimization does the C# compiler if not such simple things?
When I read articles about writing code you often hear, write it readable and the compiler will do the rest. But in this case the compiler does nearly nothing.
Adding one more example:
public void x1() {
int a = 1;
int b = 1;
int c = 1;
x = a + b + c;
}
and using ILSpy:
// test1.Something
public void x1()
{
int a = 1;
int b = 1;
int c = 1;
this.x = a + b + c;
}
Why is it not this.x = 3?
The compiler cannot perform this optimization without making an assumption that variable x is not accessed concurrently with your running method. Otherwise it risks changing the behavior of your method in a detectable way.
Consider a situation when the object referenced by this is accessed concurrently from two threads. Tread A repeatedly sets x to zero; thread B repeatedly calls x1().
If the compiler optimizes x1 to be an equivalent of x2, the two observable states for x after your experiment would be 0 and 2:
If A finishes before B, you get 2
If B finishes before A, you get 0
If A pre-empts B in the middle, you would still get a 2.
However, the original version of x1 allows for three outcomes: x can end up being 0, 1, or 2.
If A finishes before B, you get 2
If B finishes before A, you get 0
If B gets pre-empted after the first increment, then A finishes, and then B runs to completion, you get 1.
x1 and x2 are NOT the same:
if x were a public field and was accessed in a multi-threaded environment, it's entirely possible that a second thread mutates x between the two calls, which would not be possible with the code in x2.
For y2, if + and/or * were overloaded for the type of x then x*x + x*x could be different than 2*x*x.
The compiler will optimize things like (not an exhaustive list my any means):
removing local variables that are not used (freeing up registers)
removing code that does not affect the logic flow or the output.
inlining calls to simple methods
Compiler optimization should NOT change the behavior of the program (although it does happen). So reordering/combining math operations are out of scope of optimization.
write it readable and the compiler will do the rest.
Well, the compiler may do some optimization, but there is still a LOT that can be done to improve performance at design-time. Yes readable code is definitely valuable, but the compiler's job is to generate working IL that corresponds with your source code, not to change your source code to be faster.
Related
I have been practicing c# nowadays and decided to write a code that converts any decimal to any base representation for practice purpose. And i have some troubles. Since i want to practice i decided to do it with an additional function where calculations take place. First i wanted to use an array to keep my result. But since ,at the beginning, i do not know the length of the array i could not define it.So i decided to use list(somehow i assumed undeclared slots are 0 as default). This is what i end up with.
class MainClass
{
static double number;
static double baseToConvert;
static int counter = 0;
static List<double> converted = new List<double>();
public static void Main(string[] args)
{
Console.WriteLine("Enter a decimal");
number = double.Parse(Console.ReadLine());
Console.WriteLine("Enter a base you want to convert to");
baseToConvert = double.Parse(Console.ReadLine());
ConverterToBase(number);
for (int i = converted.Count - 1; i >= 0; i--)
{
Console.WriteLine(converted[i]);
}
Console.ReadLine();
}
public static void ConverterToBase(double x)
{
double temp = x;
while (x >= baseToConvert)
{
x /= baseToConvert;
counter++;
}
converted[counter] = x;
counter = 0;
if (temp - x * Math.Pow(baseToConvert, Convert.ToDouble(counter)) >= baseToConvert)
{
ConverterToBase(temp - x * Math.Pow(baseToConvert, Convert.ToDouble(counter)));
}
else
{
converted[0] = temp - x * Math.Pow(baseToConvert, Convert.ToDouble(counter));
}
}
}
But after i write inputs console gets stuck without an error. My guess is that since i do not have any elements in the list " converted[counter] " does not make sense. But i do not know maybe the problem is somewhere else.
My question is not about the way i calculate the problem(Of course any suggestions are welcomed). I just want to know what i am doing wrong and how i can handle such situation(unknown array size , use of list , accessing a variable,array,.. etc from another method... ).
Thanks.
My previous answer was wrong as pointed out by #Rufus L. There is no infinite for loop. However upon further review, there seems to be an infinite recursion going on in your code in this line:
if (temp - x * Math.Pow(baseToConvert, Convert.ToDouble(counter)) >= baseToConvert)
{
ConverterToBase(temp - x * Math.Pow(baseToConvert, Convert.ToDouble(counter)));
}
ConverterToBase calls itself and there seems to be no base case nor return statement to end the recursion.
In the method named "ConverterToBase(double x)" you want to set value of 0 element. But you didn't add any element. The converted is Empty.
Firstly add value or values to your list.
I have three cases to test the relative performance of classes, classes with inheritence and structs. These are to be used for tight loops so performance counts. Dot products are used as part of many algorithms in 2D and 3D geometry and I have run the profiler on real code. The below tests are indicative of real world performance problems I have seen.
The results for 100000000 times through the loop and application of the dot product gives
ControlA 208 ms ( class with inheritence )
ControlB 201 ms ( class with no inheritence )
ControlC 85 ms ( struct )
The tests were being run without debugging and optimization turned on. My question is, what is it about classes in this case that cause them to be so slow?
I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations then my results are identical.
ControlA 3239
ControlB 3228
ControlC 3213
They are always within 20ms of each other if the test is re-run.
The classes under investigation
using System;
using System.Diagnostics;
public class PointControlA
{
public double X
{
get;
set;
}
public double Y
{
get;
set;
}
public PointControlA(double x, double y)
{
X = x;
Y = y;
}
}
public class Point3ControlA : PointControlA
{
public double Z
{
get;
set;
}
public Point3ControlA(double x, double y, double z): base (x, y)
{
Z = z;
}
public static double Dot(Point3ControlA a, Point3ControlA b)
{
return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
}
public class Point3ControlB
{
public double X
{
get;
set;
}
public double Y
{
get;
set;
}
public double Z
{
get;
set;
}
public Point3ControlB(double x, double y, double z)
{
X = x;
Y = y;
Z = z;
}
public static double Dot(Point3ControlB a, Point3ControlB b)
{
return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
}
public struct Point3ControlC
{
public double X
{
get;
set;
}
public double Y
{
get;
set;
}
public double Z
{
get;
set;
}
public Point3ControlC(double x, double y, double z):this()
{
X = x;
Y = y;
Z = z;
}
public static double Dot(Point3ControlC a, Point3ControlC b)
{
return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
}
Test Script
public class Program
{
public static void TestStructClass()
{
var vControlA = new Point3ControlA(11, 12, 13);
var vControlB = new Point3ControlB(11, 12, 13);
var vControlC = new Point3ControlC(11, 12, 13);
var sw = Stopwatch.StartNew();
var n = 10000000;
double acc = 0;
sw = Stopwatch.StartNew();
for (int i = 0; i < n; i++)
{
acc += Point3ControlA.Dot(vControlA, vControlA);
}
Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);
acc = 0;
sw = Stopwatch.StartNew();
for (int i = 0; i < n; i++)
{
acc += Point3ControlB.Dot(vControlB, vControlB);
}
Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);
acc = 0;
sw = Stopwatch.StartNew();
for (int i = 0; i < n; i++)
{
acc += Point3ControlC.Dot(vControlC, vControlC);
}
Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
}
public static void Main()
{
TestStructClass();
}
}
This dotnet fiddle is proof of compilation only. It does not show the performance differences.
I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it but I can't understand why.
NOTE : I have tried to set a breakpoint in the debugger with JIT optimizations turned on but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.
EDIT
After the answer by #pkuderov I took his code and played with it. I changed the code and found that if I forced inlining via
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double Dot(Point3Class a)
{
return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
}
the difference between the struct and class for dot product vanished. Why with some setups the attribute is not needed but for me it was is not clear. However I did not give up. There is still a performance problem with the vendor code and I think the DotProduct is not the best example.
I modified #pkuderov's code to implement Vector Add which will create new instances of the structs and classes. The results are here
https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48
In the example I also modifed the code to pick a pseudo random vector from an array to avoid the problem of the instances sticking in the registers ( I hope ).
The results show that:
DotProduct performance is identical or maybe faster for classes
Vector Add, and I assume anything that creates a new object is slower.
Add class/class 2777ms
Add struct/struct 2457ms
DotProd class/class 1909ms
DotProd struct/struct 2108ms
The full code and results are here if anybody wants to try it out.
Edit Again
For the vector add example where an array of vectors is summed together the struct version keeps the accumulator in 3 registers
var accStruct = new Point3Struct(0, 0, 0);
for (int i = 0; i < n; i++)
accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);
the asm body is
// load the next vector into a register
00007FFA3CA2240E vmovsd xmm3,qword ptr [rax]
00007FFA3CA22413 vmovsd xmm4,qword ptr [rax+8]
00007FFA3CA22419 vmovsd xmm5,qword ptr [rax+10h]
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F vaddsd xmm0,xmm0,xmm3
00007FFA3CA22424 vaddsd xmm1,xmm1,xmm4
00007FFA3CA22429 vaddsd xmm2,xmm2,xmm5
but for class based vector version it reads and writes out the accumulator each time to main memory which is inefficient
var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);
the asm body is
// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A vmovsd xmm0,qword ptr [r14+8]
00007FFA3CA22250 vmovaps xmm7,xmm0
00007FFA3CA22255 vaddsd xmm7,xmm7,mmword ptr [r12+8]
// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C vmovsd xmm0,qword ptr [r14+10h]
00007FFA3CA22262 vmovaps xmm8,xmm0
00007FFA3CA22267 vaddsd xmm8,xmm8,mmword ptr [r12+10h]
// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E vmovsd xmm9,qword ptr [r14+18h]
00007FFA3CA22283 vmovaps xmm0,xmm9
00007FFA3CA22288 vaddsd xmm0,xmm0,mmword ptr [r12+18h]
// Move accumulator accumulator X,Y,Z back to main memory.
00007FFA3CA2228F vmovsd qword ptr [rax+8],xmm7
00007FFA3CA22295 vmovsd qword ptr [rax+10h],xmm8
00007FFA3CA2229B vmovsd qword ptr [rax+18h],xmm0
Update
After spending some time thinking about problem I think I'm aggree with #DavidHaim that memory jump overhead is not the case here because of caching.
Also I've added to your tests more options (and removed first one with inheritance). So I have:
cl = variable of class with 3 points:
Dot(cl, cl) - initial method
Dot(cl) - which is "square product"
Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz)- pass fields
st = variable of struct with 3 points:
Dot(st, st) - initial
Dot(st) - square product
Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
st6 = vairable of struct with 6 points:
Dot(st6) - wanted to check if size of struct matters
Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables.
Result times are:
Dot(cl.xyz) is the worst ~570ms,
Dot(st6), Dot(st.xyz) is the second worst ~440ms and ~480ms
the others are ~325ms
...And I don't really sure why I see these results.
Maybe for plain primitive types compiler does more aggresive pass by register optimizations, maybe it's more sure of lifetime boundaries or constantness and then more aggressive optimizations again. Maybe some kind of loop unwinding.
I think my expertise is just not enough :) But still, my results counter your results.
Full test code with results on my machine and generated IL code you can find here.
In C# classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.
So every time you get access to the inner state of a reference type variable you need to dereference the pointer to memory in the heap (it's a kind of jump), while for value types it's already on the stack or even optimized out to registers.
I think you see a difference because of this.
P.S. btw, by "most of the time are" I meant boxing; it's a technique used to place value type objects on the heap (e.g. to cast value types to an interface or for dynamic method call binding).
As I thought , this test doesn't prove much.
TLDR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserves the calls to the other two. the difference is not because structs are faster in this case, but because you skip the entire calculation part.
My settings:
Visual studio 2015 update 3
.Net framework version 4.6.1
Release mode, Any CPU (my CPU is 64 bit)
Windows 10
CPU: Processor Intel(R) Core(TM) i5-5300U CPU # 2.30GHz, 2295 Mhz, 2 Core(s), 4 Logical Processor(s)
The generated assembly for
for (int i = 0; i < n; i++)
{
acc += Point3ControlA.Dot(vControlA, vControlA);
}
is:
00DC0573 xor edx,edx // temp = 0
00DC0575 mov dword ptr [ebp-10h],edx // i = temp
00DC0578 mov ecx,edi // load vControlA as first parameter
00DC057A mov edx,edi //load vControlA as second parameter
00DC057C call dword ptr ds:[0BA4F0Ch] //call Point3ControlA.Dot
00DC0582 fstp st(0) //store the result
00DC0584 inc dword ptr [ebp-10h] //i++
00DC0587 cmp dword ptr [ebp-10h],989680h //does i == n?
00DC058E jl 00DC0578 //if not, jump to the begining of the loop
After thoughts:
The JIT compiler for some reason did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. as result, this test has the poorest performance.
Moving on to the second test:
for (int i = 0; i < n; i++)
{
acc += Point3ControlC.Dot(vControlC, vControlC);
}
Generated assembly:
00DC0612 xor edi,edi //i = 0
00DC0614 mov ecx,esi //load vControlB as the first argument
00DC0616 mov edx,esi //load vControlB as the second argument
00DC0618 call dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E fstp st(0) //store the result
00DC0620 inc edi //++i
00DC0621 cmp edi,989680h //does i == n
00DC0627 jl 00DC0614 //if not, jump to the beginning of the loop
After thoughts: this generated assembly is almost identical to the first one, but this time, the JIT did use a register for i, hence the minor performance boost over the first test.
Moving on to the test in question:
for (int i = 0; i < n; i++)
{
acc += Point3ControlC.Dot(vControlC, vControlC);
}
And for the generated assembly:
00DC06A7 xor eax,eax //i = 0
00DC06A9 inc eax //++i
00DC06AA cmp eax,989680h //does i == n ?
00DC06AF jl 00DC06A9 //if not, jump to the beginning of the loop
As we can see, the JIT has completely optimized away the call for Point3ControlC.Dot , so actually, you only pay for the loop, and not for the call itself. hence this "test" finishes first, as it didn't do much to begin with.
Can we say something about structs vs classes from this test alone? well, no.
I'm still not quit sure why has the compiler decided to optimize out the call for the struct-function while preserved the other calls. what I'm sure about is that in real-life code, the compiler can not optimize the call away if the result is used. in this mini-benchmark, we don't do much with the result and even if we did, the compiler can calculate the result on compile time. so the compiler can be more aggressive than it could have been than in real-life code.
in my project, i need to code for X1 in such a way as
in local variables i declare and wrote the condition as
x1 = 0;
if(in1_w == 1)
{
x1 = 1;
}
if((in1_w == 1) && (in2_w == 1))
{
x1 = 2;
}
i's an microcontroller based in and out so, now i need to know how to right the delay code if x1=1 and x1=2. i have written as
for(k=0;k<=x1;k++)
{
delay_40sec();
}
but don't know how to write separately?
waiting for your kind help plz
You can use
System.Threading.Thread.Sleep(x)
where x is in milliseconds.
However be aware that it is not guaranteed to be accurate nor repeatable.
Suppose I have written such a class (number of functions doesn't really matter, but in real, there will be somewhere about 3 or 4).
private class ReallyWeird
{
int y;
Func<double, double> f1;
Func<double, double> f2;
Func<double, double> f3;
public ReallyWeird()
{
this.y = 10;
this.f1 = (x => 25 * x + y);
this.f2 = (x => f1(x) + y * f1(x));
this.f3 = (x => Math.Log(f2(x) + f1(x)));
}
public double CalculusMaster(double x)
{
return f3(x) + f2(x);
}
}
I wonder if the C# compiler can optimize such a code so that it won't go through numerous stack calls.
Is it able to inline delegates at compile-time at all? If yes, on which conditions and to which limits? If no, is there an answer why?
Another question, maybe even more important: will it be significantly slower than if I had declared f1, f2 and f3 as methods?
I ask this because I want to keep my code as DRY as possible, so I want to implement a static class which extends the basic random number generator (RNG) functionality: its methods accept one delegate (e.g. from method NextInt() of the RNG) and returning another Func delegate (e.g. for generating ulongs), built on top of the former. and as long as there are many different RNG's which can generate ints, I prefer not to think about implementing all the same extended functionality ten times in different places.
So, this operation may be performed several times (i.e. initial method of the class may be 'wrapped' by a delegate twice or even three times). I wonder what will be the performance overhead like.
Thank you!
If you use Expression Trees instead of complete Func<> the compiler will be able to optimize the expressions.
Edit To clarify, note that I'm not saying the runtime would optimize the expression tree itself (it shouldn't) but rather that since the resulting Expression<> tree is .Compile()d in one step, the JIT engine will simply see the repeated subexpressions and be able to optimize, consolidate, substitue, shortcut and what else it does normally.
(I'm not absolutely sure that it does on all platforms, but at least it should be able to fully leverage the JIT engine)
Comment response
First, expression trees has in potential shall equal execution speed as Func<> (however a Func<> will not have the same runtime cost - JITting probably takes place while jitting the enclosing scope; in case of ngen, it will even be AOT, as opposed to an expresseion tree)
Second: I agree that Expression Trees can be hard to use. See here for a famous simple example of how to compose expressions. However, more complicated examples are pretty hard to come by. If I've got the time I'll see whether I can come up with a PoC and see what MS.Net and MONO actually generate in MSIL for these cases.
Third: don't forget Henk Holterman is probably right in saying this is premature optimization (although composing Expression<> instead of Func<> ahead of time just adds the flexibility)
Lastly, when you really think about driving this very far, you might consider using Compiler As A Service (which Mono already has, I believe it is still upcoming for Microsoft?).
I would not expect a compiler to optimize this. The complications (because of the delegates) would be huge.
And I would not worry about a few stack-frames here either. With 25 * x + y the stack+call overhead could be significant but call a few other methods (PRNG) and the part you are focusing on here becomes very marginal.
I compiled a quick test application where I compared the delegate approach to an approach where I defined each calculation as a function.
When doing 10.000.000 calculations for each version I got the following results:
Running using delegates: 920 ms average
Running using regular method calls: 730 ms average
So while there is a difference it is not very large and probably negligible.
Now, there may be an error in my calculations so I am adding the entire code below. I compiled it in release mode in Visual Studio 2010:
class Program
{
const int num = 10000000;
static void Main(string[] args)
{
for (int run = 1; run <= 5; run++)
{
Console.WriteLine("Run " + run);
RunTest1();
RunTest2();
}
Console.ReadLine();
}
static void RunTest1()
{
Console.WriteLine("Test1");
var t = new Test1();
var sw = Stopwatch.StartNew();
double x = 0;
for (var i = 0; i < num; i++)
{
t.CalculusMaster(x);
x += 1.0;
}
sw.Stop();
Console.WriteLine("Total time for " + num + " iterations: " + sw.ElapsedMilliseconds + " ms");
}
static void RunTest2()
{
Console.WriteLine("Test2");
var t = new Test2();
var sw = Stopwatch.StartNew();
double x = 0;
for (var i = 0; i < num; i++)
{
t.CalculusMaster(x);
x += 1.0;
}
sw.Stop();
Console.WriteLine("Total time for " + num + " iterations: " + sw.ElapsedMilliseconds + " ms");
}
}
class Test1
{
int y;
Func<double, double> f1;
Func<double, double> f2;
Func<double, double> f3;
public Test1()
{
this.y = 10;
this.f1 = (x => 25 * x + y);
this.f2 = (x => f1(x) + y * f1(x));
this.f3 = (x => Math.Log(f2(x) + f1(x)));
}
public double CalculusMaster(double x)
{
return f3(x) + f2(x);
}
}
class Test2
{
int y;
public Test2()
{
this.y = 10;
}
private double f1(double x)
{
return 25 * x + y;
}
private double f2(double x)
{
return f1(x) + y * f1(x);
}
private double f3(double x)
{
return Math.Log(f2(x) + f1(x));
}
public double CalculusMaster(double x)
{
return f3(x) + f2(x);
}
}
Note: I'm optimizing because of past experience and due to profiler software's advice. I realize an alternative optimization would be to call GetNeighbors less often, but that is a secondary issue at the moment.
I have a very simple function described below. In general, I call it within a foreach loop. I call that function a lot (about 100,000 times per second). A while back, I coded a variation of this program in Java and was so disgusted by the speed that I ended up replacing several of the for loops which used it with 4 if statements. Loop unrolling seems ugly, but it did make a noticeable difference in application speed. So, I've come up with a few potential optimizations and thought I would ask for opinions on their merit and for suggestions:
Use four if statements and totally ignore the DRY principle. I am confident this will improve performance based on past experience, but it makes me sad. To clarify, the 4 if statements would be pasted anywhere I called getNeighbors() too frequently and would then have the inside of the foreach block pasted within them.
Memoize the results in some mysterious manner.
Add a "neighbors" property to all squares. Generate its contents at initialization.
Use a code generation utility to turn calls to GetNeighbors into if statements as part of compilation.
public static IEnumerable<Square> GetNeighbors(Model m, Square s)
{
int x = s.X;
int y = s.Y;
if (x > 0) yield return m[x - 1, y];
if (y > 0) yield return m[x, y - 1];
if (x < m.Width - 1) yield return m[x + 1, y];
if (y < m.Height - 1) yield return m[x, y + 1];
yield break;
}
//The property of Model used to get elements.
private Square[,] grid;
//...
public Square this[int x, int y]
{
get
{
return grid[x, y];
}
}
Note: 20% of the time spent by the GetNeighbors function is spent on the call to m.get_Item, the other 80% is spent in the method itself.
Brian,
I've run into similar things in my code.
The two things I've found with C# that helped me the most:
First, don't be afraid necessarily of allocations. C# memory allocations are very, very fast, so allocating an array on the fly can often be faster than making an enumerator. However, whether this will help depends a lot on how you're using the results. The only pitfall I see is that, if you return a fixed size array (4), you're going to have to check for edge cases in the routine that's using your results.
Depending on how large your matrix of Squares is in your model, you may be better off doing 1 check up front to see if you're on the edge, and if not, precomputing the full array and returning it. If you're on an edge, you can handle those special cases separately (make a 1 or 2 element array as appropriate). This would put one larger statement in there, but that is often faster in my experience. If the model is large, I would avoid precomputing all of the neighbors. The overhead in the Squares may outweigh the benefits.
In my experience, as well, preallocating and returning vs. using yield makes the JIT more likely to inline your function, which can make a big difference in speed. If you can take advantage of the IEnumerable results and you are not always using every returned element, that is better, but otherwise, precomputing may be faster.
The other thing to consider - I don't know what information is saved in Square in your case, but if hte object is relatively small, and being used in a large matrix and iterated over many, many times, consider making it a struct. I had a routine similar to this (called hundreds of thousands or millions of times in a loop), and changing the class to a struct, in my case, sped up the routine by over 40%. This is assuming you're using .net 3.5sp1, though, as the JIT does many more optimizations on structs in the latest release.
There are other potential pitfalls to switching to struct vs. class, of course, but it can have huge performance impacts.
I'd suggest making an array of Squares (capacity four) and returning that instead. I would be very suspicious about using iterators in a performance-sensitive context. For example:
// could return IEnumerable<Square> still instead if you preferred.
public static Square[] GetNeighbors(Model m, Square s)
{
int x = s.X, y = s.Y, i = 0;
var result = new Square[4];
if (x > 0) result[i++] = m[x - 1, y];
if (y > 0) result[i++] = m[x, y - 1];
if (x < m.Width - 1) result[i++] = m[x + 1, y];
if (y < m.Height - 1) result[i++] = m[x, y + 1];
return result;
}
I wouldn't be surprised if that's much faster.
I'm on a slippery slope, so insert disclaimer here.
I'd go with option 3. Fill in the neighbor references lazily and you've got a kind of memoization.
ANother kind of memoization would be to return an array instead of a lazy IEnumerable, and GetNeighbors becomes a pure function that is trivial to memoize. This amounts roughly to option 3 though.
In any case, but you know this, profile and re-evaluate every step of the way. I am for example unsure about the tradeoff between the lazy IEnumerable or returning an array of results directly. (you avoid some indirections but need an allocation).
Why not make the Square class responsible of returning it's neighbours? Then you have an excellent place to do lazy initialisation without the extra overhead of memoization.
public class Square {
private Model _model;
private int _x;
private int _y;
private Square[] _neightbours;
public Square(Model model, int x, int y) {
_model = model;
_x = x;
_y = y;
_neightbours = null;
}
public Square[] Neighbours {
get {
if (_neightbours == null) {
_neighbours = GetNeighbours();
}
return _neighbours;
}
}
private Square[] GetNeightbours() {
int len = 4;
if (_x == 0) len--;
if (_x == _model.Width - 1) len--;
if (_y == 0) len--;
if (-y == _model.Height -1) len--;
Square [] result = new Square(len);
int i = 0;
if (_x > 0) {
result[i++] = _model[_x - 1,_y];
}
if (_x < _model.Width - 1) {
result[i++] = _model[_x + 1,_y];
}
if (_y > 0) {
result[i++] = _model[_x,_y - 1];
}
if (_y < _model.Height - 1) {
result[i++] = _model[_x,_y + 1];
}
return result;
}
}
Depending on the use of GetNeighbors, maybe some inversion of control could help:
public static void DoOnNeighbors(Model m, Square s, Action<s> action) {
int x = s.X;
int y = s.Y;
if (x > 0) action(m[x - 1, y]);
if (y > 0) action(m[x, y - 1]);
if (x < m.Width - 1) action(m[x + 1, y]);
if (y < m.Height - 1) action(m[x, y + 1]);
}
But I'm not sure, if this has better performance.