Matrix3x2 Performance - C#

In my graphics application, I can represent matrices using either SharpDX.Matrix3x2 or System.Numerics.Matrix3x2. However, upon running both through a performance test, I found that SharpDX's matrices handily defeat System.Numerics.Matrix3x2, by a margin of up to 70% in running time. My test was a pretty simple repeated multiplication; here's the code:
var times1 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = SharpDX.Matrix3x2.Identity;
    for (var j = 0; j < 10000; j++)
        mat *= SharpDX.Matrix3x2.Rotation(13);
    sw.Stop();
    times1.Add(sw.ElapsedTicks);
}
var times2 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = System.Numerics.Matrix3x2.Identity;
    for (var j = 0; j < 10000; j++)
        mat *= System.Numerics.Matrix3x2.CreateRotation(13);
    sw.Stop();
    times2.Add(sw.ElapsedTicks);
}
TestContext.WriteLine($"SharpDX: {times1.Average()}\nSystem.Numerics: {times2.Average()}");
I ran these tests on an Intel i5-6200U processor.
Now, my question is: how can SharpDX's matrices possibly be faster? Isn't System.Numerics.Matrix3x2 supposed to utilise SIMD instructions and therefore outperform plain code?
The implementation of SharpDX.Matrix3x2 is available here, and as you can see, it's written in plain C#.

It turns out that my testing logic was flawed - I was creating the rotation matrix inside the loop, which meant that I was benchmarking both the creation of rotation matrices and the multiplication. I revised my testing code to look like this:
var times1 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = SharpDX.Matrix3x2.Identity;
    var s = SharpDX.Matrix3x2.Scaling(13);
    var r = SharpDX.Matrix3x2.Rotation(13);
    var t = SharpDX.Matrix3x2.Translation(13, 13);
    for (var j = 0; j < 10000; j++)
    {
        mat *= s;
        mat *= r;
        mat *= t;
    }
    sw.Stop();
    times1.Add(sw.ElapsedTicks);
}
var times2 = new List<float>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    var mat = System.Numerics.Matrix3x2.Identity;
    var s = System.Numerics.Matrix3x2.CreateScale(13);
    var r = System.Numerics.Matrix3x2.CreateRotation(13);
    var t = System.Numerics.Matrix3x2.CreateTranslation(13, 13);
    for (var j = 0; j < 10000; j++)
    {
        mat *= s;
        mat *= r;
        mat *= t;
    }
    sw.Stop();
    times2.Add(sw.ElapsedTicks);
}
Now the only thing performed inside the loop is multiplication, and I began to see results indicating better performance from System.Numerics.Matrix3x2.
Another point: I didn't pay attention to the fact that SIMD optimisations only take effect in 64-bit code. These are my test results before and after changing the platform to x64:
Platform Target | System.Numerics.Matrix3x2 | SharpDX.Matrix3x2
----------------|---------------------------|------------------
AnyCPU          | 168ms                     | 197ms
x64             | 1.40ms                    | 1.43ms
When I check Environment.Is64BitProcess under AnyCPU, it returns false - and the "Prefer 32-Bit" box in Visual Studio is greyed out - so I suspect that AnyCPU is just an alias for x86 in this case, which explains why the test is two orders of magnitude faster under x64.
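As a quick sanity check (my addition, not part of the original test), both of these must be true at runtime before SIMD speedups can show up:

using System;
using System.Numerics;

// SIMD in System.Numerics requires a 64-bit RyuJIT process and
// hardware acceleration to be reported by the runtime.
Console.WriteLine($"64-bit process:   {Environment.Is64BitProcess}");
Console.WriteLine($"SIMD accelerated: {Vector.IsHardwareAccelerated}");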

There are a few other things to consider with the testing. These are just side notes and won't affect your current results; I've done some testing like this myself.
Some corresponding functions in SharpDX pass their arguments by value rather than by reference, and there are by-reference overloads you might want to play with. You've used the operators in your testing (all fine, it's a comparable test!), but in some situations the operators are slower than the by-reference functions, as the sketch below shows.
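For illustration, a minimal sketch of the by-reference path, assuming SharpDX's Matrix3x2.Multiply(ref, ref, out) overload (check the overloads available in your SharpDX version):

var mat = SharpDX.Matrix3x2.Identity;
var r = SharpDX.Matrix3x2.Rotation(13);
SharpDX.Matrix3x2 result;
for (var j = 0; j < 10000; j++)
{
    // The by-reference overload avoids copying both operands on every
    // call, unlike the * operator. Write into a separate variable, since
    // an out parameter may be assigned before the inputs are fully read.
    SharpDX.Matrix3x2.Multiply(ref mat, ref r, out result);
    mat = result;
}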

Vectorized C# code with SIMD using Vector<T> running slower than classic loop

I've seen a few articles describing how Vector<T> is SIMD-enabled and is implemented using JIT intrinsics, so the compiler will correctly output AVX/SSE/... instructions when using it, allowing much faster code than classic, linear loops (example here).
I decided to try rewriting one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code runs around 3 times slower than the original, and I'm not exactly sure why. Here are two versions of a method that checks whether two Span<float> instances agree element by element with respect to a threshold, i.e. whether each pair of items at the same position falls on the same side of the threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
As I said, I've tested both versions using BenchmarkDotNet, and the one using Vector<T> runs around 3 times slower than the other. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method consistently remains much slower.
Am I missing something obvious here?
Thanks!
EDIT: the reason why I'm using unsafe code and trying to optimize this code as much as possible without parallelizing it is that this method is already being called from within a Parallel.For iteration.
Plus, having the ability to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.
I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.
SIMD is only enabled for 64-bit processes. So make sure your app either is targeting x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [Source]
** EDIT ** After reading a blog post by Marc Gravell, I see that this can be achieved simply...
public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();
        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }
        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }
    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;
    return true;
}
Now this is not quite as quick as using arrays directly (if that's what you have), but it is still significantly faster than the non-SIMD version...
(Another edit...)
...and just for fun I thought I would see how well this stuff works when fully generic, and the answer is: very well... so you can write code like the following, and it is just as efficient as the type-specific version (well, except in the non-hardware-accelerated case, in which case it's a bit less than twice as slow - but not completely terrible...)
public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
    where T : struct
{
    if (x1.Length != x2.Length)
        throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<T, Vector<T>>();
        var vx2 = x2.NonPortableCast<T, Vector<T>>();
        var vthreshold = new Vector<T>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                return false;
        }
        // slice them to handle the remaining elements
        x1 = x1.Slice(Vector<T>.Count * vx1.Length);
        x2 = x2.Slice(Vector<T>.Count * vx1.Length);
    }
    var comparer = System.Collections.Generic.Comparer<T>.Default;
    for (int i = 0; i < x1.Length; i++)
        if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
            return false;
    return true;
}
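For reference, a small usage sketch (hypothetical values; the explicit <float> type argument is needed because the array-to-span conversion is not considered during generic type inference):

float[] a = { 0.2f, 3.5f, 7.1f, 0.9f };
float[] b = { 0.4f, 2.8f, 9.0f, 1.6f };
// True only if, at every index, a[i] and b[i] sit on the same side of the threshold.
bool same = MatchElementwiseThreshold<float>(a, b, 1.5f);
Console.WriteLine(same); // false: at index 3, 0.9 <= 1.5 but 1.6 > 1.5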
A vector is just a vector. It doesn't claim or guarantee that SIMD extensions are used. Use the SIMD-enabled types such as System.Numerics.Vector2:
https://learn.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types

How can I specify a subset of a 1-D array as the send or receive buffer in MPI.NET for C#?

I am learning C# parallel programming using MPI.NET. In the example given below, I define a 1-D array (x) in each process and then do a simple calculation on the part assigned to that process (i.e. only the assigned part of x) to obtain the corresponding part of (y). My primary interest is to gather all these assigned parts (the part of y calculated on each process) into the y-array on the root process, so I can finally calculate the sum. In other words, I want to copy each assigned part from every process into the corresponding part of the y-array located on the root process.
However, I could not do it; the only things I could do were to gather the 1-D arrays into a 2-D array, or to gather all of them into a newly defined 1-D array of size comm.Size * y.Length. From what I've read, in C++ one can do this with MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm); HOWEVER, it seems to me that MPI.Gather in C# is different and does not have the flexibility of MPI_Gather in C++.
I need to gather all the calculated parts of y from each process into the corresponding locations of the y-array on the root process. In other words: how can I specify a subset of an array as the send or receive buffer in MPI.NET for C#? I would appreciate any help in this matter.
using (new MPI.Environment(ref args))
{
    double sumSerial = 0;
    double sumParallel = 0;
    int arraySize = 100000;
    double[] x = new double[arraySize];
    double[] y = new double[arraySize];
    Intracommunicator comm = Communicator.world;
    int numProc = comm.Size;
    int numItr = arraySize / numProc;
    for (int i = 0; i < x.Length; i++)
    {
        x[i] = i;
        sumSerial += i;
    }
    int firstInx = comm.Rank * numItr;
    int lastInx = firstInx + numItr;
    for (int i = firstInx; i < lastInx; i++)
    {
        y[i] = 5.0 * x[i];
    }
    //double[][] zz = comm.Gather<double[]>(y, 0);
    double[] z = comm.GatherFlattened(y, 0);
    comm.Barrier();
    if (comm.Rank == 0)
    {
        //for (int i = 0; i < numProc; i++)
        //{
        //    for (int j = 0; j < zz[0].Length; j++)
        //    {
        //        sumParallel += zz[i][j];
        //    }
        //}
        for (int i = 0; i < z.Length; i++)
        {
            sumParallel += z[i];
        }
        Console.WriteLine("sum_Parallel: {0}; sum_Serial= {1}; Ratio: {2}; z_length: {3}",
            sumParallel, sumSerial, sumParallel / sumSerial, z.Length);
    }
}
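As far as I know, MPI.NET's Gather/GatherFlattened take whole arrays and do not expose the offset/count parameters of the C API, so one workaround (a sketch, my addition) is to copy the computed slice into a small send buffer and gather those slices; the root then receives exactly arraySize elements in rank order:

// Hypothetical sketch: send only this rank's computed slice of y.
double[] myPart = new double[numItr];
Array.Copy(y, firstInx, myPart, 0, numItr);

// GatherFlattened concatenates the slices in rank order on the root,
// so yGathered has length numProc * numItr == arraySize.
double[] yGathered = comm.GatherFlattened(myPart, 0);
if (comm.Rank == 0)
{
    double sum = 0;
    for (int i = 0; i < yGathered.Length; i++)
        sum += yGathered[i];
    Console.WriteLine("sum = {0}", sum);
}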

Parallelize transitive reduction

I have a Dictionary<int, List<int>>, where the Key represents an element of a set (or a vertex in an oriented graph) and the List is a set of other elements which are in relation with the Key (so there are oriented edges from Key to Values). The dictionary is optimized for creating a Hasse diagram, so the Values are always smaller than the Key.
I have also a simple sequential algorithm, that removes all transitive edges (e.g. I have relations 1->2, 2->3 and 1->3. I can remove the edge 1->3, because I have a path between 1 and 3 via 2).
for (int i = 1; i < dictionary.Count; i++)
{
    for (int j = 0; j < i; j++)
    {
        if (dictionary[i].Contains(j))
            dictionary[i].RemoveAll(r => dictionary[j].Contains(r));
    }
}
Would it be possible to parallelize the algorithm? I could use Parallel.For for the inner loop. However, this is not recommended (https://msdn.microsoft.com/en-us/library/dd997392(v=vs.110).aspx#Anchor_2) and the resulting speedup would not be significant (plus there might be problems with locking). Could I parallelize the outer loop?
There is a simple way to solve the parallelization problem: separate the data. Read from the original data structure and write to a new one. That way you can run it in parallel without even needing to lock.
But the parallelization is probably not even necessary; the data structures are what is inefficient. You use a dictionary where an array would be sufficient (as I understand the code, you have vertices 0..dictionary.Count-1), and List<int> for lookups. List.Contains is very inefficient; a HashSet would be better, or, for denser graphs, a BitArray. So instead of Dictionary<int, List<int>> you can use BitArray[].
I rewrote the algorithm and made some optimizations. It does not make a plain copy of the graph and delete edges; it constructs the new graph from only the right edges. It uses BitArray[] for the input graph and List<int>[] for the final graph, as the latter is far more sparse.
int sizeOfGraph = 1000;

// Create vertices of a graph.
BitArray[] inputGraph = new BitArray[sizeOfGraph];
for (int i = 0; i < inputGraph.Length; ++i)
{
    inputGraph[i] = new BitArray(i);
}

// Fill random edges.
Random rand = new Random(10);
for (int i = 1; i < inputGraph.Length; ++i)
{
    BitArray vertex_i = inputGraph[i];
    for (int j = 0; j < vertex_i.Count; ++j)
    {
        if (rand.Next(0, 100) < 50) // 50% fill ratio
        {
            vertex_i[j] = true;
        }
    }
}

// Create transitive closure.
for (int i = 0; i < sizeOfGraph; ++i)
{
    BitArray vertex_i = inputGraph[i];
    for (int j = 0; j < i; ++j)
    {
        if (vertex_i[j]) { continue; }
        for (int r = j + 1; r < i; ++r)
        {
            if (vertex_i[r] && inputGraph[r][j])
            {
                vertex_i[j] = true;
                break;
            }
        }
    }
}

// Create transitive reduction.
List<int>[] reducedGraph = new List<int>[sizeOfGraph];
Parallel.ForEach(inputGraph, (vertex_i, state, ii) =>
{
    int i = (int)ii;
    List<int> reducedVertex = reducedGraph[i] = new List<int>();
    for (int j = i - 1; j >= 0; --j)
    {
        if (vertex_i[j])
        {
            bool ok = true;
            for (int x = 0; x < reducedVertex.Count; ++x)
            {
                if (inputGraph[reducedVertex[x]][j])
                {
                    ok = false;
                    break;
                }
            }
            if (ok)
            {
                reducedVertex.Add(j);
            }
        }
    }
});
MessageBox.Show("Finished, reduced graph has "
    + reducedGraph.Sum(s => s.Count()) + " edges.");
EDIT
I wrote this:
"The code has some problems. With the direction i goes now, you can delete edges you would still need, and the result would be incorrect." This turned out to be a mistake. I was thinking this way: let's have a graph
1->0
2->1, 2->0
3->2, 3->1, 3->0
Vertex 2 gets reduced by vertex 1, so we have
1->0
2->1
3->2, 3->1, 3->0
Now vertex 3 gets reduced by vertex 2
1->0
2->1
3->2, 3->0
And we have a problem: we cannot reduce 3->0, which stayed only because 2->0 was already reduced away. But this was my mistake; the situation can never happen, because the inner cycle goes strictly from lower to higher. So instead:
Vertex 3 gets reduced by vertex 1
1->0
2->1
3->2, 3->1
and now by vertex 2
1->0
2->1
3->2
And the result is correct. I apologize for the error.

Why is my sine algorithm much slower than the default?

const double pi = 3.1415926535897;

static double mysin(double x) {
    return ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
        0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
        1.000023121) * x;
}

static void Main(string[] args) {
    Stopwatch sw = new Stopwatch();
    double a = 0;
    double[] arg = new double[1000000];
    for (int i = 0; i < 1000000; i++) {
        arg[i] = (pi / 2000000);
    }

    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + Math.Sin(arg[i]);
    }
    sw.Stop();
    double t1 = (double)(sw.Elapsed.TotalMilliseconds);

    a = 0;
    sw.Restart();
    for (int i = 0; i < 1000000; i++) {
        a = a + mysin(arg[i]);
    }
    sw.Stop();
    double t2 = (double)(sw.Elapsed.TotalMilliseconds);

    Console.WriteLine("{0}\n{1}\n", t1, t2);
    Console.Read();
}
This power series is valid on [0, pi/2], and it is 10 times slower than the built-in sine function in release mode: 1 ms vs 10 ms.
But when I copy-paste the body of mysin directly into the loop, I get practically the same time in release mode, and my code is about 4 times faster in debug mode.
a = 0;
sw.Restart();
for (int i = 0; i < 1000000; i++) {
    double x = arg[i];
    a = a + ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
        0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
        1.000023121) * x;
    //a = a + mysin(arg[i]);
}
What is the deal here? How do I make this sort of calculation faster? I am guessing the runtime automatically recognizes that the sine routine should not be called as a function but rather inlined into the loop. How do I make the compiler do the same for me?
One more question: would C++ do the same optimization for its default sin/cos functions? If not, how would I make sure that it does? Edit: I tested it, and my sine function (with 4 extra if conditions added to expand the domain to all reals) runs about 25% faster (albeit less accurately) than the default sin function. And in fact, the copy-pasted version runs slower than when I write it as a separate function.
I assume that you tested this on x86, because I cannot repro the numbers on x64. On x64, your code actually appears to be faster.
I disassembled the code for x86/release. The reason for the difference is that your method is just that, a method, whereas Math.Sin is compiled to use the x86 fsin instruction directly, thus eliminating a function call per invocation.
FWIW, the x64 code is quite different. Math.Sin is translated into clr!COMDouble::Sin.
See FSIN.
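To address the "how do I make the compiler do the same for me" part: one thing worth trying (my addition, not from the original answer) is MethodImplOptions.AggressiveInlining, which hints to the JIT that the method should be inlined at its call sites; it is a hint, not a guarantee:

using System.Runtime.CompilerServices;

// Hint to the JIT that mysin should be inlined at its call sites,
// approximating the effect of manually pasting the body into the loop.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static double mysin(double x) {
    return ((((((-0.000140298 * x - 0.00021075890) * x + 0.008703147) * x -
        0.0003853080) * x - 0.16641544) * x - 0.00010117316) * x +
        1.000023121) * x;
}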

Image processing task: Erosion C#

I am doing an image processing assignment where I want to implement erosion and dilation algorithms. For each pixel, they need to look in all directions (in this case up, down, left and right), so I'm using a plus-shaped structuring element. Here is my problem: I've got 4 nested for loops, which makes this operation very slow.
Can anyone tell me how to make the erosion process quicker without using unsafe code?
Here is what I have:
colorlistErosion = new List<Color>();
int colorValueR, colorValueG, colorValueB;
int tel = 0;
for (int y = 0; y < bitmap.Height; y++)
{
    for (int x = 0; x < bitmap.Width; x++)
    {
        Color col = bitmap.GetPixel(x, y);
        colorValueR = col.R; colorValueG = col.G; colorValueB = col.B;
        //Erosion
        for (int a = -1; a < 2; a++)
        {
            for (int b = -1; b < 2; b++)
            {
                try
                {
                    Color col2 = bitmap.GetPixel(x + a, y + b);
                    colorValueR = Math.Min(colorValueR, col2.R);
                    colorValueG = Math.Min(colorValueG, col2.G);
                    colorValueB = Math.Min(colorValueB, col2.B);
                }
                catch
                {
                    // Out-of-bounds neighbours at the image border are skipped.
                }
            }
        }
        colorlistErosion.Add(Color.FromArgb(colorValueR, colorValueG, colorValueB));
    }
}

for (int een = 0; een < bitmap.Height; een++)
    for (int twee = 0; twee < bitmap.Width; twee++)
    {
        bitmap.SetPixel(twee, een, colorlistErosion[tel]);
        tel++;
    }
"How to make the erosion process quicker without using unsafe methods?"
You can turn the inner loops into Parallel.For(). But I'm not 100% sure whether GetPixel() and especially SetPixel() are thread-safe; if they aren't, that is a deal-breaker.
Your algorithm is inherently slow due to the 4 nested loops, and you're also processing the bitmap using the slowest approach possible, bitmap.GetPixel.
Take a look at SharpGL. If they don't have your filters you can download the source code and figure out how to make your own.
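On the bitmap.GetPixel point: a minimal sketch of the usual safe-code replacement (my addition, using System.Drawing's LockBits) reads the whole image into a managed array in one call instead of Width * Height GetPixel calls:

using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

// Copy the bitmap's pixels into a managed byte array in one call.
static byte[] GetPixelBytes(Bitmap bitmap, out int stride)
{
    var rect = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
    BitmapData data = bitmap.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb);
    try
    {
        stride = data.Stride;
        byte[] buffer = new byte[data.Stride * data.Height];
        Marshal.Copy(data.Scan0, buffer, 0, buffer.Length);
        return buffer; // pixel (x, y) starts at y * stride + x * 3, in BGR order
    }
    finally
    {
        bitmap.UnlockBits(data);
    }
}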
