Can managed code impact instruction level parallelism? - c#

Is there any way I can impact instruction-level parallelism writing C# code? In other words, is there a way I can "help" the compiler produce code that best makes use of ILP? I ask because I'm trying to abstract away from a few concepts of machine architecture, and I need to know if this is possible. If it isn't, then I'm justified in abstracting away from ILP.
EDIT: You will notice that I do not want to exploit ILP using C# in any way. My question is exactly the opposite. Paraphrasing: "I hope there's no way to exploit ILP from C#".
Thanks.

ILP is a feature of the CPU. You have no way to control it.
Compilers try their best to exploit it by breaking dependency chains.
This may include the .NET JIT compiler, though I have no evidence of it.

You are at the mercy of the JIT when it comes to instruction-level parallelism.
Who knows what optimisations the JIT actually does? I would pick another language, like C++, if I really needed this.
To best exploit ILP you need to break dependency chains. This should still apply; see this thread.
However, with all the abstraction, I doubt it is still possible to exploit this effectively in all but the most extreme cases. What examples do you have where you need this?
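Breaking a dependency chain typically means something like using multiple independent accumulators. A minimal sketch (a hypothetical example, not code from the linked thread; note that reordering float additions can change the result slightly):

static float SumSingle(float[] a)
{
    float s = 0f;
    for (int i = 0; i < a.Length; i++)
        s += a[i]; // each add depends on the previous one: one long chain
    return s;
}

static float SumFourAccumulators(float[] a)
{
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    int i = 0;
    for (; i + 3 < a.Length; i += 4)
    {
        s0 += a[i];     // four independent chains;
        s1 += a[i + 1]; // the CPU can overlap these additions
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < a.Length; i++) // remainder
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}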

There is NO explicit or direct way to influence or hint to the .NET compiler, in IL or C#, to do this. This is entirely the compiler's job.
The only influence you can have is to structure your program so that it is more likely (though not guaranteed) to be optimized this way, and it would be difficult to know whether the structure was even acted on. This is well abstracted away from the .NET languages and IL.

You can exploit ILP in the CLI, so the short answer is no, you can't abstract away from it.
A bit longer:
I once wrote code for a simple image-processing task and used this kind of optimization to make my code a "bit" faster.
A "short" example:
static void Main(string[] args)
{
    const int ITERATION_NUMBER = 100;
    TimeSpan[] normal = new TimeSpan[ITERATION_NUMBER];
    TimeSpan[] ilp = new TimeSpan[ITERATION_NUMBER];
    int SIZE = 4000000;
    float[] data = new float[SIZE];
    float safe = 0.0f;
    // Normal for
    Stopwatch sw = new Stopwatch();
    for (int iteration = 0; iteration < ITERATION_NUMBER; iteration++)
    {
        // Initialization
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = 1.0f;
        }
        sw.Start();
        for (int index = 0; index < data.Length; index++)
        {
            data[index] /= 3.0f * data[index] > 2.0f / data[index] ? 2.0f / data[index] : 3.0f * data[index];
        }
        sw.Stop();
        normal[iteration] = sw.Elapsed;
        safe = data[0];
        // Initialization
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = 1.0f;
        }
        sw.Reset();
        // ILP for
        sw.Start();
        float ac1, ac2, ac3, ac4;
        int length = data.Length / 4;
        for (int i = 0; i < length; i++)
        {
            int index0 = i << 2;
            int index1 = index0;
            int index2 = index0 + 1;
            int index3 = index0 + 2;
            int index4 = index0 + 3;
            ac1 = 3.0f * data[index1] > 2.0f / data[index1] ? 2.0f / data[index1] : 3.0f * data[index1];
            ac2 = 3.0f * data[index2] > 2.0f / data[index2] ? 2.0f / data[index2] : 3.0f * data[index2];
            ac3 = 3.0f * data[index3] > 2.0f / data[index3] ? 2.0f / data[index3] : 3.0f * data[index3];
            ac4 = 3.0f * data[index4] > 2.0f / data[index4] ? 2.0f / data[index4] : 3.0f * data[index4];
            data[index1] /= ac1;
            data[index2] /= ac2;
            data[index3] /= ac3;
            data[index4] /= ac4;
        }
        sw.Stop();
        ilp[iteration] = sw.Elapsed;
        sw.Reset();
    }
    Console.WriteLine(data.All(item => item == data[0]));
    Console.WriteLine(data[0] == safe);
    Console.WriteLine();
    double normalElapsed = normal.Max(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("Normal Max.: {0}", normalElapsed));
    double ilpElapsed = ilp.Max(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("ILP Max.: {0}", ilpElapsed));
    Console.WriteLine();
    normalElapsed = normal.Average(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("Normal Avg.: {0}", normalElapsed));
    ilpElapsed = ilp.Average(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("ILP Avg.: {0}", ilpElapsed));
    Console.WriteLine();
    normalElapsed = normal.Min(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("Normal Min.: {0}", normalElapsed));
    ilpElapsed = ilp.Min(time => time.TotalMilliseconds);
    Console.WriteLine(String.Format("ILP Min.: {0}", ilpElapsed));
}
Results (on .NET Framework 4.0 Client Profile, Release):
On a Virtual Machine (I think with no ILP):
True
True
Normal Max.: 111,1894
ILP Max.: 106,886
Normal Avg.: 78,163619
ILP Avg.: 77,682513
Normal Min.: 58,3035
ILP Min.: 56,7672
On a Xeon:
True
True
Normal Max.: 40,5892
ILP Max.: 30,8906
Normal Avg.: 35,637308
ILP Avg.: 25,45341
Normal Min.: 34,4247
ILP Min.: 23,7888
Explanation of the results:
In Debug, no optimization is applied by the compiler, but the second for loop is written more optimally than the first, so there is a significant difference.
The answer seems to be in the results of executing the Release-mode builds. The IL compiler/JIT-er does its best to minimize the performance cost (I think including ILP). But when you write code like the second for loop, you can reach better results in special cases, and the second loop can outperform the first on some architectures. But
You are at the mercy of the JIT
as mentioned, sadly. It is a pity the specification does not mention that an implementation may define further optimizations, like ILP (a short paragraph could be placed there). But they cannot enumerate every form of architectural optimization, and the CLI sits at a higher level:
This is well abstracted away from the .NET languages and IL.
This is a very complex problem that can only be answered experimentally, and I don't think we could get a much more precise answer this way. And I think the question is misleading, because it doesn't depend on C#; it depends on the implementation of the CLI.
There could be many influencing factors, which makes it hard to answer a question like this correctly while treating the JIT as a black box.
I found material about loop vectorization and autothreading on pages 512-513 of the CLI specification:
http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-335.pdf
I don't think it specifies explicitly how the JIT-er needs to behave in cases like this; implementers can choose how to optimize. So I think you can have an impact: if you write more optimal code, the JIT will try to use ILP where it is possible/implemented.
Because the specification doesn't pin this down, there is a possibility.
So the answer seems to be no: I believe you cannot abstract away from ILP in the case of the CLI, since the specification doesn't promise it.
Update:
There is a blog post I had seen before but could not find again until now:
http://igoro.com/archive/gallery-of-processor-cache-effects/
Example four contains a short but proper answer to your question.
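From memory, that example contrasts a loop whose two updates depend on each other with one whose updates are independent; roughly (a paraphrase, not the blog's exact code):

int[] a = new int[2];
const int Steps = 256 * 1024 * 1024;

// Dependent: the second increment must wait for the first.
for (int i = 0; i < Steps; i++) { a[0]++; a[0]++; }

// Independent: the two increments can execute in parallel,
// so this loop typically runs noticeably faster.
for (int i = 0; i < Steps; i++) { a[0]++; a[1]++; }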

Related

3SUM - O(n^2 * log n) slower than O(n^2)?

In the scenario I present to you, my solution is supposed to be O(n^2 * log n), and the "pointers" solution, which I assume is the fastest way to solve the "3SUM" problem, is O(n^2 * 1); leaving the question of whether O(1) is faster than O(log n), using my code as the example.
Could someone explain why this seems to be the case? Please. My logic tells me that O(log n) should be as fast as O(1), if not faster.
I hope my comments on the code of my solution are clear.
Edit: I know that this does not sound very smart... log(n) grows with the input (n -> ∞), while 1... is just 1. BUT, in this case, for finding a number, how can doing sums and subtractions be faster than a binary search (log n)? It just does not enter my head.
LeetCode 3SUM problem description
O(n^2 * log n)
For an input of 3,000 values:
Iterations: 1,722,085 (61% less than the "pointers solution")
Runtime: ~92 ms (270% slower than the typical O(n^2) solution)
public IList<IList<int>> MySolution(int[] nums)
{
    IList<IList<int>> triplets = new List<IList<int>>();
    Array.Sort(nums);
    for (int i = 0; i < nums.Length; i++)
    {
        // Avoid duplicating results.
        if (i > 0 && nums[i] == nums[i - 1])
            continue;
        for (int j = i + 1; j < nums.Length - 1; j++)
        {
            // Avoid duplicating results.
            if (j > (i + 1) && nums[j] == nums[j - 1])
                continue;
            // The solution for this triplet.
            int numK = -(nums[i] + nums[j]);
            // * This is the problem.
            // Search for 'k' index in the array.
            int kSearch = Array.BinarySearch(nums, j + 1, nums.Length - (j + 1), numK);
            // 'numK' exists in the array.
            if (kSearch > 0)
            {
                triplets.Add(new List<int>() { nums[i], nums[j], numK });
            }
            // 'numK' is too small; break this loop since its value is only going to increase.
            else if (~kSearch == (j + 1))
            {
                break;
            }
        }
    }
    return triplets;
}
O(n^2)
For the same input of 3,000 values:
Iterations: 4,458,579
Runtime: ~34 ms
public IList<IList<int>> PointersSolution(int[] nums)
{
    IList<IList<int>> triplets = new List<IList<int>>();
    Array.Sort(nums);
    for (int i = 0; i < nums.Length; i++)
    {
        if (i > 0 && nums[i] == nums[i - 1])
            continue;
        int l = i + 1, r = nums.Length - 1;
        while (l < r)
        {
            int sum = nums[i] + nums[l] + nums[r];
            if (sum < 0)
            {
                l++;
            }
            else if (sum > 0)
            {
                r--;
            }
            else
            {
                triplets.Add(new List<int>() { nums[i], nums[l], nums[r] });
                do
                {
                    l++;
                }
                while (l < r && nums[l] == nums[l - 1]);
            }
        }
    }
    return triplets;
}
It seems that your conceptual misunderstanding comes from the fact that you are missing that Array.BinarySearch does some iterations too (this was indicated by the initial iteration counts in the question, which you have since changed).
So while the assumption that a binary search should be faster than simple iteration through the collection is pretty valid, you are missing that the binary search is basically an extra loop. So you should not compare those two; compare the second for loop plus the binary search of the first solution against the second loop of the second solution.
P.S.
To argue about time complexity based on runtimes with at least some degree of certainty, you need to perform several tests with increasing numbers of elements (like 100, 1,000, 10,000, 100,000, ...) and see how the runtime changes. Different inputs for the same number of elements are also recommended, because in theory you can hit optimal cases for one algorithm that are worst-case scenarios for the other.
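For example, a minimal harness along these lines (assuming MySolution from the question and random inputs):

var rnd = new Random(42);
foreach (int n in new[] { 100, 1000, 10000, 100000 })
{
    int[] nums = new int[n];
    for (int i = 0; i < n; i++)
        nums[i] = rnd.Next(-1000, 1001);

    var sw = System.Diagnostics.Stopwatch.StartNew();
    MySolution(nums);
    sw.Stop();
    Console.WriteLine(String.Format("n = {0}: {1} ms", n, sw.Elapsed.TotalMilliseconds));
}

If the runtime roughly quadruples when n doubles, you are looking at O(n^2) behavior; an extra log n factor shows up as slightly superlinear growth on top of that.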
Quick interjection -- I'm not sure your second solution (pointers) is O(n^2); it has a third inner loop. (See Stron's response below.)
I took a moment to profile your code with a generic .NET profiler, and:
That ought to do it, huh? ;)
After checking the implementation, I found that BinarySearch internally uses CompareTo, which I imagine isn't ideal (but, being a generic for an unmanaged type, it shouldn't be that bad...)
To "improve" it, I dragged BinarySearch, kicking and screaming, and replaced the CompareTo with actual comparison operators. I named this benchmark MyImproved. Here are the results:
BenchmarkDotNet results:
Interestingly, BenchmarkDotNet disregards common sense and puts MyImproved over Pointers. This may be due to some optimization which is turned off by the profiler.
Method     | Complexity     | Mean     | Error    | StdDev   | Code Size
-----------|----------------|----------|----------|----------|----------
Pointers   | O(n^2)???      | 76.76 ms | 1.465 ms | 1.628 ms | 1,781 B
My         | O(n^2 * log n) | 93.08 ms | 1.831 ms | 3.980 ms | 1,999 B
MyImproved | O(n^2 * log n) | 62.53 ms | 1.234 ms | 2.226 ms | 1,980 B
TL;DR:
.CompareTo() seemed to be bogging down the implementation of .BinarySearch(). Removing it and using actual integer comparison seemed to help a lot. Either that, or it's some funky interface stuff that I'm not prepared to investigate :)
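For reference, the kind of change meant here looks roughly like this (a sketch of a hand-specialized binary search, not the exact code from the benchmark):

// Same contract as Array.BinarySearch(int[], int, int, int), but with
// direct integer comparisons instead of IComparable<int>.CompareTo calls.
static int BinarySearchInt(int[] a, int index, int length, int value)
{
    int lo = index, hi = index + length - 1;
    while (lo <= hi)
    {
        int mid = lo + ((hi - lo) >> 1);
        if (a[mid] == value) return mid;
        if (a[mid] < value) lo = mid + 1;
        else hi = mid - 1;
    }
    return ~lo; // complement of the insertion point, like Array.BinarySearch
}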
Two tips:
Use sharplab.io to see your lowered code; it may reveal something (link).
Try running these separate tests through the BenchmarkDotNet NuGet package; it'll give you more accurate timings, and if the memory usage or allocations are considerably higher in one case, that could be your answer.
Anyway, are you running these tests in debug or release mode? I just had a thought that I haven't tested recently, but I believe that the debugger overhead can significantly affect the performance of a binary search.
Give it a go, and let me know

Why is HashSet<Point> so much slower than HashSet<string>?

I wanted to store some pixel locations without allowing duplicates, so the first thing that comes to mind is HashSet<Point> or a similar class. However, this seems to be very slow compared to something like HashSet<string>.
For example, this code:
HashSet<Point> points = new HashSet<Point>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(new Point(x, y));
        }
    }
}
takes about 22.5 seconds.
While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:
HashSet<string> points = new HashSet<string>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(x + "," + y);
        }
    }
}
So, my questions are:
Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
Is there a better way to store points without duplicates?
There are two perf problems induced by the Point struct, something you can see when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections but the string test only needs ~18 collections. That is not for free. When you see a value type induce so many collections, you need to conclude "uh-oh, too much boxing".
At issue is that HashSet<T> needs an IEqualityComparer<T> to get its job done. Since you did not provide one, it falls back to the one returned by EqualityComparer<T>.Default. That comparer can do a good job for string; it implements IEquatable<string>. But not for Point, a type that harks back to .NET 1.0 and never got the generics love. All it can do is use the Object methods.
The other issue is that Point.GetHashCode() does not do a stellar job in this test: too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.
You can solve both problems by providing the HashSet with a good comparer. Like this one:
class PointComparer : IEqualityComparer<Point>
{
    public bool Equals(Point x, Point y)
    {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj)
    {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}
And use it:
HashSet<Point> list = new HashSet<Point>(new PointComparer());
And it is now about 150 times faster, easily beating the string test.
The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).
Apart from that, the hash code algorithm worsens the problem, because it causes more calls to Equals(object obj) thus increasing the amount of boxing conversions.
Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated — something that doesn't happen with string, where the dispersion of the hashes is much larger.
You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates:
(x << 16) ^ y
For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.
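If you go the custom-struct route suggested above, a minimal sketch could look like this (Pixel is a hypothetical name):

// Implementing IEquatable<Pixel> lets EqualityComparer<T>.Default avoid
// boxing, and the shifted hash spreads the points across buckets.
struct Pixel : IEquatable<Pixel>
{
    public readonly int X, Y;
    public Pixel(int x, int y) { X = x; Y = y; }

    public bool Equals(Pixel other) { return X == other.X && Y == other.Y; }
    public override bool Equals(object obj) { return obj is Pixel && Equals((Pixel)obj); }
    public override int GetHashCode() { return (X << 16) ^ Y; }
}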

What is wrong with this fourier transform implementation

I'm trying to implement a discrete Fourier transform, but it's not working. I've probably written a bug somewhere, but I haven't found it yet.
Based on the following formula:
$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2 \pi k n / N}$
This function does the first loop, computing X_0 through X_{N-1}:
public Complex[] Transform(Complex[] data, bool reverse)
{
    var transformed = new Complex[data.Length];
    for (var i = 0; i < data.Length; i++)
    {
        // I create a method to calculate a single value
        transformed[i] = TransformSingle(i, data, reverse);
    }
    return transformed;
}
And the actual calculation; this is probably where the bug is.
private Complex TransformSingle(int k, Complex[] data, bool reverse)
{
    var sign = reverse ? 1.0 : -1.0;
    var transformed = Complex.Zero;
    var argument = sign*2.0*Math.PI*k/data.Length;
    for (var i = 0; i < data.Length; i++)
    {
        transformed += data[i]*Complex.FromPolarCoordinates(1, argument*i);
    }
    return transformed;
}
Next, an explanation of the rest of the code:
var sign = reverse ? 1.0: -1.0; The inverse DFT does not have a -1 in the argument, while a regular DFT does.
var argument = sign*2.0*Math.PI*k/data.Length; is the argument of the exponential (the 2*pi*k/N factor), and the last part applies it to each element:
transformed += data[i]*Complex.FromPolarCoordinates(1, argument*i);
I think I carefully copied the algorithm, so I don't see where I made the mistake...
Additional information
As Adam Gritt has shown in his answer, there is a nice implementation of this algorithm by AForge.net. I can probably solve this problem in 30 seconds by just copying their code. However, I still don't know what I have done wrong in my implementation.
I'm really curious where my flaw is, and what I have interpreted wrong.
My days of doing complex mathematics are a ways behind me right now so I may be missing something myself. However, it appears to me that you are doing the following line:
transformed += data[i]*Complex.FromPolarCoordinates(1, argument*i);
when it should probably be more like:
transformed += data[i]*Math.Pow(Math.E, Complex.FromPolarCoordinates(1, argument*i));
Unless you have this wrapped up into the method FromPolarCoordinates()
UPDATE:
I found the following bit of code in the AForge.NET Framework library and it shows additional Cos/Sin operations being done that are not being handled in your code. This code can be found in full context in the Sources\Math\FourierTransform.cs: DFT method.
for ( int i = 0; i < n; i++ )
{
    dst[i] = Complex.Zero;
    arg = - (int) direction * 2.0 * System.Math.PI * (double) i / (double) n;
    // sum source elements
    for ( int j = 0; j < n; j++ )
    {
        cos = System.Math.Cos( j * arg );
        sin = System.Math.Sin( j * arg );
        dst[i].Re += ( data[j].Re * cos - data[j].Im * sin );
        dst[i].Im += ( data[j].Re * sin + data[j].Im * cos );
    }
}
It is using a custom Complex class (as it was pre-4.0). Most of the math is similar to what you have implemented but the inner iteration is doing additional mathematical operations on the Real and Imaginary portions.
FURTHER UPDATE:
After some implementation and testing, I found that the code above and the code provided in the question produce the same results. I also found, based on the comments, what the difference is between what this code generates and what WolframAlpha produces: Wolfram appears to apply a normalization of 1/sqrt(N) to the results. In the Wolfram link provided, if each value is multiplied by Sqrt(2), the values are the same as those generated by the above code (rounding errors aside). I tested this by passing 3, 4, and 5 values into Wolfram and found that my results differed by Sqrt(3), Sqrt(4), and Sqrt(5), respectively. Based on the Discrete Fourier Transform article on Wikipedia, there is a normalization that makes the DFT and IDFT transforms unitary. This might be the avenue you need to look down, either to modify your code or to understand what Wolfram may be doing.
Your code is actually almost right (you are only missing a 1/N on the inverse transform). The thing is, the formula you used is the one typically used for computation, because it's lighter, but in purely theoretical settings (and in Wolfram) you would use a normalization by 1/sqrt(N) to make the transforms unitary.
i.e. your formulas would be:
X_k = 1/sqrt(N) * sum(x[n] * exp(-i*2*pi/N * k*n))
x[n] = 1/sqrt(N) * sum(X_k * exp(i*2*pi/N * k*n))
It's just a matter of convention in normalisation; only amplitudes change, so your results weren't that bad (apart from the forgotten 1/N in the reverse transform).
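For instance, a unitary wrapper over the question's Transform method might look like this (a sketch, adding nothing beyond the 1/sqrt(N) scaling discussed above):

// Scale the plain DFT/IDFT output by 1/sqrt(N) to match the unitary
// convention (which appears to be what Wolfram uses).
public Complex[] TransformUnitary(Complex[] data, bool reverse)
{
    Complex[] result = Transform(data, reverse);
    double scale = 1.0 / Math.Sqrt(data.Length);
    for (int i = 0; i < result.Length; i++)
    {
        result[i] *= scale;
    }
    return result;
}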
Cheers

Dynamic compilation for performance

I have an idea of how I can improve the performance with dynamic code generation, but I'm not sure which is the best way to approach this problem.
Suppose I have a class
class Calculator
{
    int Value1;
    int Value2;
    //..........
    int ValueN;

    void DoCalc()
    {
        if (Value1 > 0)
        {
            DoValue1RelatedStuff();
        }
        if (Value2 > 0)
        {
            DoValue2RelatedStuff();
        }
        //....
        //....
        //....
        if (ValueN > 0)
        {
            DoValueNRelatedStuff();
        }
    }
}
The DoCalc method is at the lowest level and it is called many times during calculation. Another important aspect is that ValueN are only set at the beginning and do not change during calculation. So many of the ifs in the DoCalc method are unnecessary, as many of ValueN are 0. So I was hoping that dynamic code generation could help to improve performance.
For instance if I create a method
void DoCalc_Specific()
{
    const int Value1 = 0;
    const int Value2 = 0;
    const int ValueN = 1;
    if (Value1 > 0)
    {
        DoValue1RelatedStuff();
    }
    if (Value2 > 0)
    {
        DoValue2RelatedStuff();
    }
    //....
    //....
    //....
    if (ValueN > 0)
    {
        DoValueNRelatedStuff();
    }
}
and compile it with optimizations switched on, the C# compiler is smart enough to keep only the necessary code. So I would like to create such a method at run time, based on the values of ValueN, and use the generated method during calculations.
I guess that I could use expression trees for that, but expression trees work only with simple lambda functions, so I cannot use things like if, while, etc. inside the function body. So in this case I would need to restructure the method appropriately.
Another possibility is to create the necessary code as a string and compile it dynamically. But it would be much better for me if I could take the existing method and modify it accordingly.
There's also Reflection.Emit, but I don't want to stick with it as it would be very difficult to maintain.
BTW. I'm not restricted to C#. So I'm open to suggestions of programming languages that are best suited for this kind of problem. Except for LISP for a couple of reasons.
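For illustration, .NET 4's expression-tree API does support statement blocks (Expression.Block, Expression.IfThen), so a specialized method could be stitched together roughly like this (a hypothetical sketch; CalcCompiler and the handler delegates are illustrative names):

using System;
using System.Collections.Generic;
using System.Linq.Expressions;

static class CalcCompiler
{
    // Builds a delegate equivalent to DoCalc, except that the value checks
    // are decided once here, at build time, so dead branches never make it
    // into the compiled delegate at all.
    public static Action Compile(int[] values, Action[] handlers)
    {
        var body = new List<Expression>();
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i] > 0)
                body.Add(Expression.Invoke(Expression.Constant(handlers[i])));
        }
        Expression block = body.Count > 0
            ? (Expression)Expression.Block(body)
            : Expression.Empty();
        return Expression.Lambda<Action>(block).Compile();
    }
}

Compiling once after the values are set and invoking the returned delegate in the hot loop would avoid both the compile-a-string route and hand-written Reflection.Emit.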
One important clarification. DoValue1RelatedStuff() is not a method call in my algorithm. It's just some formula-based calculation and it's pretty fast. I should have written it like this
if (Value1 > 0)
{
    // Do Value1 Related Stuff
}
I have run some performance tests, and with two ifs, one of which is disabled, the optimized method is about 2 times faster than the one with the redundant if.
Here's the code I used for testing:
public class Program
{
    static void Main(string[] args)
    {
        int x = 0, y = 2;
        var if_st = DateTime.Now.Ticks;
        for (var i = 0; i < 10000000; i++)
        {
            WithIf(x, y);
        }
        var if_et = DateTime.Now.Ticks - if_st;
        Console.WriteLine(if_et.ToString());
        var noif_st = DateTime.Now.Ticks;
        for (var i = 0; i < 10000000; i++)
        {
            Without(x, y);
        }
        var noif_et = DateTime.Now.Ticks - noif_st;
        Console.WriteLine(noif_et.ToString());
        Console.ReadLine();
    }

    static double WithIf(int x, int y)
    {
        var result = 0.0;
        for (var i = 0; i < 100; i++)
        {
            if (x > 0)
            {
                result += x * 0.01;
            }
            if (y > 0)
            {
                result += y * 0.01;
            }
        }
        return result;
    }

    static double Without(int x, int y)
    {
        var result = 0.0;
        for (var i = 0; i < 100; i++)
        {
            result += y * 0.01;
        }
        return result;
    }
}
I would usually not even think about such an optimization. How much work does DoValueXRelatedStuff() do? More than 10 to 50 processor cycles? Yes? That means you are going to build quite a complex system to save less than 10% execution time (and that seems quite optimistic to me). This can easily go down to less than 1%.
Is there no room for other optimizations? Better algorithms? And do you really need to eliminate single branches that take only one processor cycle (if the branch prediction is correct)? Yes? Shouldn't you then think about writing your code in assembler or something more machine-specific, instead of using .NET?
Could you give the order of N, the complexity of a typical method, and the ratio of expressions usually evaluating to true?
It would surprise me to find a scenario where the overhead of evaluating the if statements is worth the effort to dynamically emit code.
Modern CPUs support branch prediction and branch predication, which makes the overhead for branches in small segments of code approach zero.
Have you tried to benchmark two hand-coded versions of the code, one that has all the if-statements in place but provides zero values for most, and one that removes all of those same if branches?
If you are really into code optimisation - before you do anything - run the profiler! It will show you where the bottleneck is and which areas are worth optimising.
Also - if the language choice is not limited (except for LISP) then nothing will beat assembler in terms of performance ;)
I remember achieving some performance magic by rewriting some inner functions (like the one you have) using assembler.
Before you do anything, do you actually have a problem?
i.e. does it run long enough to bother you?
If so, find out what is actually taking time, not what you guess. This is the quick, dirty, and highly effective method I use to see where time goes.
Now, you are talking about interpreting versus compiling. Interpreted code is typically 1-2 orders of magnitude slower than compiled code. The reason is that interpreters are continually figuring out what to do next, and then forgetting, while compiled code just knows.
If you are in this situation, then it may make sense to pay the price of translating so as to get the speed of compiled code.

Can this code be optimised?

I have some image processing code that loops through 2 multi-dimensional byte arrays (of the same size). It takes a value from the source array, performs a calculation on it and then stores the result in another array.
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++)
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
The loop currently takes ~11ms, which I assume is mostly due to accessing the byte arrays values as the calculation is pretty simple (2 multiplications and 1 addition).
Is there anything I can do to speed this up? It is a time critical part of my program and this code gets called 80-100 times per second, so any speed gains, however small will make a difference. Also at the moment xSize = 768 and ySize = 576, but this will increase in the future.
Update: Thanks to Guffa (see answer below), the following code saves me 4-5ms per loop. Although it is unsafe code.
int size = ResultImageData.Length;
int counter = 0;
unsafe
{
    fixed (byte* r = ResultImageData, c = CurrentImageData, a = AlphaImageData)
    {
        while (size > 0)
        {
            *(r + counter) = (byte)(*(c + counter) * AlphaValue +
                                    *(a + counter) * OneMinusAlphaValue);
            counter++;
            size--;
        }
    }
}
To get any real speedup for this code you would need to use pointers to access the arrays; that removes all the index calculations and bounds checking.
int size = ResultImageData.Length;
unsafe
{
    fixed (byte* rp = ResultImageData, cp = CurrentImageData, ap = AlphaImageData)
    {
        byte* r = rp;
        byte* c = cp;
        byte* a = ap;
        while (size > 0)
        {
            *r = (byte)(*c * AlphaValue + *a * OneMinusAlphaValue);
            r++;
            c++;
            a++;
            size--;
        }
    }
}
Edit:
Fixed variables can't be changed, so I added code to copy the pointers to new pointers that can be changed.
These are all independent calculations so if you have a multicore CPU you should be able to gain some benefit by parallelizing the calculation. Note that you'd need to keep the threads around and just hand them work to do since the overhead of thread creation would probably make this slower rather than faster if the threads are recreated each time.
The other thing that may work is farming the work off to the graphics processor. Look at this question for some ideas, for example, using Accelerator.
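If .NET 4 is available, the thread-pool bookkeeping can be left to the TPL. A sketch using the arrays from the question, partitioned by row:

// Each y is independent, so rows can be processed concurrently; the TPL
// reuses pool threads, avoiding per-call thread creation.
System.Threading.Tasks.Parallel.For(0, ySize, y =>
{
    for (int x = 0; x < xSize; x++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
});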
An option would be to use unsafe code: fixing the arrays in memory and using pointer operations. I doubt the speed increase will be that dramatic though.
One note: how are you timing? If you are using DateTime then be aware that this class has poor resolution. You should add an outer loop and repeat the operation say ten times -- I bet the result is less than 110ms.
for (int outer = 0; outer < 10; ++outer)
{
    for (int x = 0; x < xSize; x++)
    {
        for (int y = 0; y < ySize; y++)
        {
            ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                           (AlphaImageData[x, y] * OneMinusAlphaValue));
        }
    }
}
Since each cell in the matrix is calculated entirely independently of the others, you may want to look into having more than one thread handle this. To avoid the cost of creating threads you could have a thread pool.
If the matrix is of sufficient size, it could be a very nice speed gain. On the other hand, if it is too small, it may not help (even hurt). Worth a try though.
An example (pseudo code) could be like this:
void process(int x, int y)
{
    ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                   (AlphaImageData[x, y] * OneMinusAlphaValue));
}

ThreadPool pool(3); // 3 threads big

int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++)
{
    for (int y = 0; y < ySize; y++)
    {
        pool.schedule(x, y); // this will add all tasks to the pool's work queue
    }
}
pool.waitTilFinished(); // wait until all scheduled tasks are complete
EDIT: Michael Meadows mentioned in a comment that PLINQ may be a suitable alternative: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
I'd recommend running a few empty tests to figure out what your theoretical bounds are. For example, take out the calculation from inside the loop and see how much time is saved. Try replacing the double loop with a single loop that runs the same number of times and see how much time that saves. Then you can be sure you are going down the right path for optimization (the two paths I see are flattening the double loop into a single loop and working with the multiplication [maybe using a lookup table would be faster]).
Just real quick, you can get an optimization by looping in reverse and comparing against 0. Most CPUs have a fast op for comparison to 0.
E.g.
int xSize = ResultImageData.GetLength(0) - 1;
int ySize = ResultImageData.GetLength(1) - 1; // minor optimization suggested by commenter
for (int x = xSize; x >= 0; --x)
{
    for (int y = ySize; y >= 0; --y)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
See http://dotnetperls.com/Content/Decrement-Optimization.aspx
You are probably suffering from bounds checking. As Jon Skeet states, a jagged array instead of a multidimensional one (that is, data[][] instead of data[,]) will be faster, strange as that may seem.
The compiler will optimize
for (int i = 0; i < data.Length; i++)
by eliminating the per-element range check. But it's a special case; it won't do the same for GetLength().
For the same reason, caching or hoisting the Length property (putting it in a variable like xSize) also used to be a bad thing, though I haven't been able to verify that with Framework 3.5.
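A sketch of what the jagged-array version might look like (assuming the images can be stored as byte[][], row by row):

static void Blend(byte[][] result, byte[][] current, byte[][] alpha,
                  int alphaValue, int oneMinusAlphaValue)
{
    for (int y = 0; y < result.Length; y++)
    {
        byte[] r = result[y], c = current[y], a = alpha[y];
        // The "x < r.Length" pattern lets the JIT elide the range check
        // on r; accesses to c and a are still bounds-checked.
        for (int x = 0; x < r.Length; x++)
        {
            r[x] = (byte)(c[x] * alphaValue + a[x] * oneMinusAlphaValue);
        }
    }
}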
Try swapping the x and y for loops for a more linear memory access pattern and (thus) fewer cache misses, like so:
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int y = 0; y < ySize; y++)
{
    for (int x = 0; x < xSize; x++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
If you are using LockBits to get at the image buffer, you should loop through y in the outer loop and x in the inner loop as that is how it is stored in memory (by row, not column). I would say that 11ms is pretty darn fast though...
Does the image data have to be stored in a multi-dimensional (rectangular) array? If you use jagged arrays instead, you may well find the JIT has more optimizations available (including removing the bounds checking).
If CurrentImageData and/or AlphaImageData don't change every time you run your code snippet, you could store the product prior to running the code snippet you show and avoid that multiplication in your loops.
Edit: Another thing I just thought of: sometimes int operations are quicker than byte operations. Weigh this against your processor cache utilization (you'll increase the data size considerably and run a greater risk of a cache miss).
442,368 additions and 884,736 multiplications for the calculation: I would think 11ms is actually extremely slow on a modern CPU.
While I don't know much about the specifics of .NET, I do know high-speed calculation is not its strong suit. In the past I've built Java apps with similar problems; I've always used C libraries to do the image/audio processing.
Coming from a hardware perspective, you want to make sure the memory accesses are sequential, that is, step through the buffer in the order it exists in memory. You may also need to reorder the code so that the compiler can take advantage of available instructions such as SIMD. How to approach this will depend on your compiler, and I can't help with VS.NET.
On an embedded DSP I would break out
(AlphaImageData[x, y] * OneMinusAlphaValue) and (CurrentImageData[x, y] * AlphaValue) and use SIMD instructions to calculate the buffers, possibly in parallel, before performing the addition; perhaps doing small enough chunks to keep the buffers in the CPU cache.
I believe anything you do will require more direct access to memory/the CPU than .NET allows.
You may also want to take a look at the Mono runtime and its Simd extensions. Perhaps some of your calculations can make use of SSE acceleration, as I gather that you basically do vector calculations (I don't know up to which vector size there is acceleration for multiplication, but there is for some sizes).
(Blog post announcing Mono.Simd: http://tirania.org/blog/archive/2008/Nov-03.html)
Of course, that wouldn't work on Microsoft .NET but maybe you are interested in some experimentation.
Interestingly, image data is frequently pretty similar, meaning that the calculations are likely very repetitive. Have you explored using a lookup table for the calculations? So any time 0.8 is multiplied by 128, you look up value[80, 128], which you've precalculated as 102.4? You're basically trading memory space for CPU speed, but it could work for you.
Of course, if your image data has too high a resolution (and goes to too significant a digit), this may not be practical.
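One possible realization of that idea for the blending loop discussed above (a sketch; note that rounding now happens in each table before the add, so results can differ from the original by one):

// Two 256-entry tables, rebuilt only when AlphaValue changes.
byte[] currentLut = new byte[256];
byte[] alphaLut = new byte[256];
for (int v = 0; v < 256; v++)
{
    currentLut[v] = (byte)(v * AlphaValue);
    alphaLut[v] = (byte)(v * OneMinusAlphaValue);
}

// The inner loop then becomes two lookups and one add:
// ResultImageData[x, y] =
//     (byte)(currentLut[CurrentImageData[x, y]] + alphaLut[AlphaImageData[x, y]]);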
