How to optimize this loop that purges duplicate vertices? - c#

I have this bit of code here
var vertexIndexDictionary = new Dictionary<Vector3, int>();
for (int i = 0; i < triangles.Length; i++)
{
    for (int j = 0; j < 3; j++)
    {
        var vertex = triangles[i][j];
        if (!vertexIndexDictionary.ContainsKey(vertex))
        {
            vertexIndexDictionary.Add(vertex, vertexIndexDictionary.Count);
        }
    }
}
var vertices = vertexIndexDictionary.Keys.ToArray();
which goes through the triangles array and gets rid of duplicate vertices. This triangle array can get very large, and so the running time gets really long as well. Is there some way I can achieve the same thing but faster? E.g. with another data type?
Edit:
Triangles array is initialized like this:
var triangles = new Triangle[count];
and the Triangle struct is
struct Triangle
{
    public Vector3 a;
    public Vector3 b;
    public Vector3 c;

    public Vector3 this[int i]
    {
        get
        {
            switch (i)
            {
                case 0:
                    return a;
                case 1:
                    return b;
                default:
                    return c;
            }
        }
    }
}

You could get direct access to the entry stored inside the vertexIndexDictionary with the low-level CollectionsMarshal.GetValueRefOrAddDefault method (available since .NET 6, in the System.Runtime.InteropServices namespace). Using this API you can reduce the dictionary search operations from 2-3 to just 1 per iteration. You could also consider avoiding the LINQ ToArray method, because LINQ in general is not the tool of choice when performance is of utmost importance. In this particular case avoiding ToArray is unlikely to provide any tangible benefit (LINQ happens to be internally optimized for sources that implement the ICollection<T> interface), but I'll show you how to do it anyway:
int[] indices = new int[triangles.Length * 3]; // one index per triangle corner
for (int i = 0; i < triangles.Length; i++)
{
    for (int j = 0; j < 3; j++)
    {
        Vector3 vertex = triangles[i][j];
        // Single lookup: returns a ref to the existing or newly added entry.
        ref int valueRef = ref CollectionsMarshal.GetValueRefOrAddDefault(
            vertexIndexDictionary, vertex, out bool exists);
        // Count already includes the just-added entry, hence the -1.
        if (!exists) valueRef = vertexIndexDictionary.Count - 1;
        indices[i * 3 + j] = valueRef;
    }
}
Vector3[] vertices = new Vector3[vertexIndexDictionary.Count];
vertexIndexDictionary.Keys.CopyTo(vertices, 0);

Related

Vectorized C# code with SIMD using Vector<T> running slower than classic loop

I've seen a few articles describing how Vector<T> is SIMD-enabled and is implemented using JIT intrinsics, so the compiler will correctly output AVX/SSE/... instructions when using it, allowing much faster code than classic, linear loops (example here).
I decided to try to rewrite one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code runs 3 times slower than the original, and I'm not exactly sure why. Here are two versions of a method that checks whether two Span<float> instances have, at every position, pairs of items that fall on the same side of a threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
As I said, I've tested both versions using BenchmarkDotNet, and the one using Vector<T> is running around 3 times slower than the other one. I tried running the tests with spans of different length (from around 100 to over 2000), but the vectorized method keeps being much slower than the other one.
Am I missing something obvious here?
Thanks!
EDIT: the reason why I'm using unsafe code and trying to optimize this code as much as possible without parallelizing it is that this method is already being called from within a Parallel.For iteration.
Plus, having the ability to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.
I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.
SIMD is only enabled for 64-bit processes. So make sure your app either is targeting x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [Source]
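To verify the fix took effect, a minimal runtime check along these lines (a sketch using System.Numerics) can help:
using System;
using System.Numerics;

class SimdCheck
{
    static void Main()
    {
        Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
        Console.WriteLine($"Vector.IsHardwareAccelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}"); // e.g. 8 with AVX2
    }
}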
** EDIT ** After reading a blog post by Marc Gravell, I see that this can be achieved simply...
public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();
        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }
        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }
    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;
    return true;
}
Now this is not quite as quick as using arrays directly (if that's what you have), but it is still significantly faster than the non-SIMD version...
(Another edit...)
...and just for fun I thought I would see how well this stuff works when fully generic, and the answer is: very well... so you can write code like the following, and it is just as efficient as being type-specific (well, except in the non-hardware-accelerated case, where it's a bit less than twice as slow - but not completely terrible...)
public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
    where T : struct
{
    if (x1.Length != x2.Length)
        throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<T, Vector<T>>();
        var vx2 = x2.NonPortableCast<T, Vector<T>>();
        var vthreshold = new Vector<T>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                return false;
        }
        // slice them to handle the remaining elements
        x1 = x1.Slice(Vector<T>.Count * vx1.Length);
        x2 = x2.Slice(Vector<T>.Count * vx2.Length);
    }
    var comparer = System.Collections.Generic.Comparer<T>.Default;
    for (int i = 0; i < x1.Length; i++)
        if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
            return false;
    return true;
}
A vector is just a vector. It doesn't claim or guarantee that SIMD extensions are used. Use
System.Numerics.Vector2
https://learn.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types
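For illustration, a tiny example of the SIMD-enabled fixed-size types described on that page:
using System;
using System.Numerics;

// Vector2/Vector3/Vector4 are JIT intrinsics: operations on them compile
// down to SIMD instructions on hardware that supports them.
var a = new Vector2(1f, 2f);
var b = new Vector2(3f, 4f);
Console.WriteLine(a + b);             // <4, 6>
Console.WriteLine(Vector2.Dot(a, b)); // 11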

Counting sort in singly-linked list C#

Is there any way to make a counting sort work on a singly-linked list? I haven't seen any examples, and it's quite hard to make it without them. I have an example of it for an array and would like to do the same for a singly-linked list.
Has anybody done it with a singly-linked list?
public static int[] CountingSortArray(int[] array)
{
    int[] aux = new int[array.Length];
    // find the smallest and the largest value
    int min = array[0];
    int max = array[0];
    for (int i = 1; i < array.Length; i++)
    {
        if (array[i] < min) min = array[i];
        else if (array[i] > max) max = array[i];
    }
    int[] counts = new int[max - min + 1];
    for (int i = 0; i < array.Length; i++)
    {
        counts[array[i] - min]++;
    }
    counts[0]--;
    for (int i = 1; i < counts.Length; i++)
    {
        counts[i] = counts[i] + counts[i - 1];
    }
    for (int i = array.Length - 1; i >= 0; i--)
    {
        aux[counts[array[i] - min]--] = array[i];
    }
    return aux;
}
I found one that works on an array at: http://www.geeksforgeeks.org/counting-sort/
I think with minimal effort it could be changed to a linked list; the only problem is that you'll end up traversing the linked list many, many times, since you don't have random access (e.g. []), making it rather inefficient. Since you seem to have found the same thing I did before I could finish typing, I think my answer is kinda pointless. However, I'm still a bit curious as to where you're having problems.
Here's a hint if figuring out where to start is the problem: every time you see array[i] used, you will need to traverse your linked list to get the i'th item instead.
Edit: The only reason you would need to create a 2nd linked list of frequencies is if you needed to actually do work on the resulting linked list. If you just need a sorted list of the values inside the linked list for display purposes, an array holding the frequencies would work (I suppose at the same time you could just create an array of all the values and then do the counting sort you already have on it). I apologize if I have confused my C, C++, and C++/CX somewhere along the way (I don't have a compiler handy right now), but this should give you a good idea of how to do it.
#include <cstdio>

// assumed node type for the functions below
struct node { int Value; node* Next; };

node* FindMin(node* root){ // FindMax would be nearly identical
    node* minValue = root;
    for(node* n = root; n != NULL; n = n->Next){
        if(n->Value < minValue->Value)
            minValue = n;
    }
    return minValue;
}
node* CountingSortList(node* linkedlist){
    node* root = linkedlist;
    node* min = FindMin(linkedlist);
    node* max = FindMax(linkedlist);
    int* counts = new int[max->Value - min->Value + 1](); // zero-initialized
    while(root != NULL){
        counts[root->Value - min->Value] += 1;
        root = root->Next;
    }
    // rewrite the list's values in sorted order
    int i = 0;
    root = linkedlist;
    while(root != NULL){
        if(counts[i] == 0)
            ++i;
        else{
            root->Value = i + min->Value;
            --counts[i];
            root = root->Next;
        }
    }
    delete[] counts;
    return linkedlist;
}
void push(node** head, int new_data){
    node* newNode = new node();
    newNode->Value = new_data;
    newNode->Next = (*head);
    (*head) = newNode;
}
void printList(node* root){
    while(root != NULL){
        printf("%d ", root->Value);
        root = root->Next;
    }
    printf("\n");
}
int main(void){
    node* myLinkedList = NULL;
    push(&myLinkedList, 0);
    push(&myLinkedList, 1);
    push(&myLinkedList, 0);
    push(&myLinkedList, 2);
    push(&myLinkedList, 0);
    push(&myLinkedList, 2);
    printList(myLinkedList);
    CountingSortList(myLinkedList);
    printList(myLinkedList);
    return 0;
}
The example code is more like a radix sort with base (max-min+1). Usually a counting sort looks like the code below: make a pass over the list to get min and max, make a second pass to generate the counts, then make a pass over the counts to generate a new array based on the counts (instead of copying data). Example code fragment:
for (size_t i = 0; i < array.Length; i++)
    counts[array[i] - min]++;

size_t i = 0;
for (size_t j = 0; j < counts.Length; j++){
    for (size_t n = counts[j]; n; n--){
        aux[i++] = j + min;
    }
}
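Since the question is about C#, here is a minimal sketch of that three-pass scheme on a singly-linked list; the Node class is hypothetical, standing in for whatever node type the list actually uses:
class Node { public int Value; public Node Next; }

static void CountingSortList(Node head)
{
    if (head == null) return;
    // pass 1: find min and max
    int min = head.Value, max = head.Value;
    for (Node n = head; n != null; n = n.Next)
    {
        if (n.Value < min) min = n.Value;
        if (n.Value > max) max = n.Value;
    }
    // pass 2: count occurrences of each value
    int[] counts = new int[max - min + 1];
    for (Node n = head; n != null; n = n.Next)
        counts[n.Value - min]++;
    // pass 3: rewrite the node values in sorted order
    Node cur = head;
    for (int v = 0; v < counts.Length; v++)
        for (int c = counts[v]; c > 0; c--, cur = cur.Next)
            cur.Value = v + min;
}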

Slicing a 3d image

I'm trying to slice a 3d representation of an image. The following factors are known:
DimensionX
DimensionY
DimensionZ
voxels[]
Voxels[] is an array of ushorts representing the grayscale value of each voxel.
Now I need to slice it in all possible directions, that is along x, y, or z.
I have three implementations for this, but they have one problem: they don't work when the dimensions are not all the same (except for the Z slice, which works perfectly).
These are the methods:
private ushort[] GetZSlice(int z, ushort[] voxels, int DimensionX, int DimensionY)
{
    var res = new ushort[DimensionX * DimensionY];
    for (int j = 0; j < DimensionY; j++)
    {
        for (int i = 0; i < DimensionX; i++)
        {
            res[j * DimensionX + i] = voxels[z * DimensionX * DimensionY + j * DimensionX + i];
        }
    }
    return res;
}
This method works perfectly; it does not matter what I choose as the dimensions.
The next two methods, with x or y as the depth axis, pose a harder problem.
private ushort[] GetYSlice(int y, ushort[] voxels, int DimensionX, int DimensionY, int DimensionZ)
{
    var res = new ushort[DimensionX * DimensionZ];
    for (int i = 0; i < DimensionX; i++)
    {
        for (int j = 0; j < DimensionX; j++)
        {
            res[j + i * DimensionX] = voxels[j * DimensionZ * DimensionY + y * DimensionZ + i];
        }
    }
    return res;
}

private ushort[] GetXSlice(int x, ushort[] voxels, int DimensionX, int DimensionY, int DimensionZ)
{
    var res = new ushort[DimensionY * DimensionZ];
    for (int i = 0; i < DimensionY; i++)
    {
        for (int j = 0; j < DimensionZ; j++)
        {
            res[j + i * DimensionZ] = voxels[i * DimensionY + j * DimensionZ * DimensionX + x];
        }
    }
    return res;
}
How could I improve the last 2 methods so they work with dimensions that are not equal?
Why not make a universal slice function using basis vectors? It will be far less code to manage.
Basically you have U,V axes, each mapped onto X, Y, or Z, and the slice axis W is the third, unused one. So you just loop over U,V and leave W as is. Each of U,V has a basis vector (ux,uy,uz), (vx,vy,vz) that describes the increment change in the x,y,z coordinates.
I encoded it into my LED cube class in C++...
//---------------------------------------------------------------------------
class LED_cube
{
public:
    int xs,ys,zs,***map;

    LED_cube()            { xs=0; ys=0; zs=0; map=NULL; }
    LED_cube(LED_cube& a) { xs=0; ys=0; zs=0; map=NULL; *this=a; }
    ~LED_cube()           { _free(); }
    LED_cube* operator = (const LED_cube *a) { *this=*a; return this; }
    LED_cube* operator = (const LED_cube &a);

    void _free();
    void resize(int _xs,int _ys,int _zs);
    void cls(int col);                               // clear cube with col 0x00BBGGRR
    void sphere(int x0,int y0,int z0,int r,int col); // draws sphere surface with col 0x00BBGGRR
    void slice (char *uv,int slice,int col);         // draws (XY,XZ,YZ) slice with col 0x00BBGGRR
    void glDraw();                                   // render cube by OpenGL as 1x1x1 cube at 0,0,0
};
//---------------------------------------------------------------------------
void LED_cube::slice(char *uv,int slice,int col)
{
    // detect basis vectors from uv string
    int ux=0,uy=0,uz=0,us=0;
    int vx=0,vy=0,vz=0,vs=0;
    int x=slice,y=slice,z=slice,u,v,x0,y0,z0;
    if (uv[0]=='X') { x=0; ux=1; us=xs; }
    if (uv[0]=='Y') { y=0; uy=1; us=ys; }
    if (uv[0]=='Z') { z=0; uz=1; us=zs; }
    if (uv[1]=='X') { x=0; vx=1; vs=xs; }
    if (uv[1]=='Y') { y=0; vy=1; vs=ys; }
    if (uv[1]=='Z') { z=0; vz=1; vs=zs; }
    // render slice
    if ((x>=0)&&(x<xs)&&(y>=0)&&(y<ys)&&(z>=0)&&(z<zs))
        for (u=0;u<us;u++,x+=ux,y+=uy,z+=uz)
        {
            x0=x; y0=y; z0=z;
            for (v=0;v<vs;v++,x+=vx,y+=vy,z+=vz)
                map[x][y][z]=col;
            x=x0; y=y0; z=z0;
        }
}
//---------------------------------------------------------------------------
As you can see it is quite nice and simple ... and can still be optimized much more. Here is a usage and output example:
cube.resize(32,16,20);
cube.cls(0x00202020);
cube.slice("XY",5,0x000000FF);
cube.slice("XZ",5,0x0000FF00);
cube.slice("YZ",5,0x00FF0000);
cube.glDraw();
As you have your voxels stored in a 1D array, just compute the address from x,y,z, so map[x][y][z] will become your voxels[(x*ys*zs)+(y*zs)+z], or whatever combination of axis order you have. This can be encoded entirely into the basis vectors, so you can have du=(ux*ys*zs)+(uy*zs)+uz and dv=..., and increment the address directly, not needing any multiplication later ...
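Carried back to the question's C# methods, here is a minimal sketch of the two failing slices, assuming the x-fastest layout that GetZSlice already implies (index = z*DimensionX*DimensionY + y*DimensionX + x); the layout is my assumption, not something the question states:
// Assumes voxel (x, y, z) lives at voxels[z * dimX * dimY + y * dimX + x],
// the same layout GetZSlice uses. Works for unequal dimensions.
private ushort[] GetYSlice(int y, ushort[] voxels, int dimX, int dimY, int dimZ)
{
    var res = new ushort[dimX * dimZ];
    for (int z = 0; z < dimZ; z++)
        for (int x = 0; x < dimX; x++)
            res[z * dimX + x] = voxels[z * dimX * dimY + y * dimX + x];
    return res;
}

private ushort[] GetXSlice(int x, ushort[] voxels, int dimX, int dimY, int dimZ)
{
    var res = new ushort[dimY * dimZ];
    for (int z = 0; z < dimZ; z++)
        for (int y = 0; y < dimY; y++)
            res[z * dimY + y] = voxels[z * dimX * dimY + y * dimX + x];
    return res;
}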

Parallelize transitive reduction

I have a Dictionary<int, List<int>>, where the Key represents an element of a set (or a vertex in an oriented graph) and the List is a set of other elements which are in relation with the Key (so there are oriented edges from Key to Values). The dictionary is optimized for creating a Hasse diagram, so the Values are always smaller than the Key.
I have also a simple sequential algorithm, that removes all transitive edges (e.g. I have relations 1->2, 2->3 and 1->3. I can remove the edge 1->3, because I have a path between 1 and 3 via 2).
for (int i = 1; i < dictionary.Count; i++)
{
    for (int j = 0; j < i; j++)
    {
        if (dictionary[i].Contains(j))
            dictionary[i].RemoveAll(r => dictionary[j].Contains(r));
    }
}
Would it be possible to parallelize the algorithm? I could do Parallel.For for the inner loop. However, this is not recommended (https://msdn.microsoft.com/en-us/library/dd997392(v=vs.110).aspx#Anchor_2) and the resulting speed would not increase significantly (+ there might be problems with locking). Could I parallelize the outer loop?
There is a simple way to solve the parallelization problem: separate the data. Read from the original data structure and write to a new one. That way you can run it in parallel without even needing to lock.
But the parallelization is probably not even necessary; the data structures are not efficient. You use a dictionary where an array would be sufficient (as I understand the code, you have vertices 0..result.Count-1), and List<int> for lookups. List.Contains is very inefficient; a HashSet would be better, or, for more dense graphs, a BitArray. So instead of Dictionary<int, List<int>> you can use BitArray[].
I rewrote the algorithm and made some optimizations. It does not make a plain copy of the graph and delete edges; it just constructs the new graph from only the right edges. It uses BitArray[] for the input graph and List<int>[] for the final graph, as the latter is far more sparse.
int sizeOfGraph = 1000;

//create vertices of a graph
BitArray[] inputGraph = new BitArray[sizeOfGraph];
for (int i = 0; i < inputGraph.Length; ++i)
{
    inputGraph[i] = new BitArray(i);
}

//fill random edges
Random rand = new Random(10);
for (int i = 1; i < inputGraph.Length; ++i)
{
    BitArray vertex_i = inputGraph[i];
    for (int j = 0; j < vertex_i.Count; ++j)
    {
        if (rand.Next(0, 100) < 50) //50% fill ratio
        {
            vertex_i[j] = true;
        }
    }
}

//create transitive closure
for (int i = 0; i < sizeOfGraph; ++i)
{
    BitArray vertex_i = inputGraph[i];
    for (int j = 0; j < i; ++j)
    {
        if (vertex_i[j]) { continue; }
        for (int r = j + 1; r < i; ++r)
        {
            if (vertex_i[r] && inputGraph[r][j])
            {
                vertex_i[j] = true;
                break;
            }
        }
    }
}

//create transitive reduction
List<int>[] reducedGraph = new List<int>[sizeOfGraph];
Parallel.ForEach(inputGraph, (vertex_i, state, ii) =>
{
    int i = (int)ii;
    List<int> reducedVertex = reducedGraph[i] = new List<int>();
    for (int j = i - 1; j >= 0; --j)
    {
        if (vertex_i[j])
        {
            bool ok = true;
            for (int x = 0; x < reducedVertex.Count; ++x)
            {
                if (inputGraph[reducedVertex[x]][j])
                {
                    ok = false;
                    break;
                }
            }
            if (ok)
            {
                reducedVertex.Add(j);
            }
        }
    }
});

MessageBox.Show("Finished, reduced graph has "
    + reducedGraph.Sum(s => s.Count()) + " edges.");
EDIT
I wrote this:
The code has some problems. With the direction i goes now, you can delete edges you would need, and the result would be incorrect. This turned out to be a mistake. I was thinking this way: let's have a graph
1->0
2->1, 2->0
3->2, 3->1, 3->0
Vertex 2 gets reduced by vertex 1, so we have
1->0
2->1
3->2, 3->1, 3->0
Now vertex 3 gets reduced by vertex 2
1->0
2->1
3->2, 3->0
And we have a problem, as we cannot reduce 3->0, which stayed here because 2->0 was already reduced. But it is my mistake; this would never happen. The inner cycle goes strictly from lower to higher, so instead:
Vertex 3 gets reduced by vertex 1
1->0
2->1
3->2, 3->1
and now by vertex 2
1->0
2->1
3->2
And the result is correct. I apologize for the error.

The negamax algorithm... what's wrong?

I'm trying to program a chess game and have spent days trying to fix the code. I even tried min-max but ended up with the same result. The AI always starts in the corner, moves a pawn out of the way, and then the rook just moves back and forth with each turn. If it gets eaten, the AI moves every piece from one side to the other until all are eaten. Do you know what could be wrong with the following code?
public Move MakeMove(int depth)
{
    bestmove.reset();
    bestscore = 0;
    score = 0;
    int maxDepth = depth;
    negaMax(depth, maxDepth);
    return bestmove;
}

public int EvalGame() //calculates the score from all the pieces on the board
{
    int score = 0;
    for (int i = 0; i < 8; i++)
    {
        for (int j = 0; j < 8; j++)
        {
            if (AIboard[i, j].getPiece() != GRID.BLANK)
            {
                score += EvalPiece(AIboard[i, j].getPiece());
            }
        }
    }
    return score;
}

private int negaMax(int depth, int maxDepth)
{
    if (depth <= 0)
    {
        return EvalGame();
    }
    int max = -200000000;
    for (int i = 0; i < 8; i++)
    {
        for (int j = 0; j < 8; j++)
        {
            for (int k = 0; k < 8; k++)
            {
                for (int l = 0; l < 8; l++)
                {
                    if (GenerateMove(i, j, k, l)) //generates all possible moves
                    {
                        //code to move the piece on the board
                        board.makemove(nextmove);
                        score = -negaMax(depth - 1, maxDepth);
                        if (score > max)
                        {
                            max = score;
                            if (depth == maxDepth)
                            {
                                bestmove = nextmove;
                            }
                        }
                        //code to undo the move
                        board.undomove();
                    }
                }
            }
        }
    }
    return max;
}

public bool GenerateMove(int i, int j, int k, int l)
{
    Move move;
    move.moveFrom.X = i;
    move.moveFrom.Y = j;
    move.moveTo.X = k;
    move.moveTo.Y = l;
    if (checkLegalMoves(move.moveTo, move.moveFrom)) //if a legal move
    {
        nextMove = move;
        return true;
    }
    return false;
}
This code:
public Move MakeMove(int depth)
{
    bestscore = 0;
    score = 0;
    int maxDepth = depth;
    negaMax(depth, maxDepth);
    return bestmove;
}
Notice that the best move is never set! The returned score of negaMax is never compared between the move alternatives; you're not even looping over the possible moves here.
Also, it's really hard to look for errors when the code you submit is not fully consistent. The negaMax method takes two arguments in one place in your code, then it takes four arguments in the recursive call?
I also recommend better abstraction in your code: separate the board representation, move representation, move generation, and the search algorithm. That will help you a lot. As an example: why do you need the depth counter in the move generation?
-Øystein
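To make the point concrete, here is a minimal sketch of a root loop that keeps the score and best move local; Board, Move, GenerateMoves, Make, Undo and NegaMax are placeholders, not the question's actual API:
public Move SearchRoot(Board board, int depth)
{
    Move best = default;
    int max = int.MinValue + 1; // +1 so the value survives negation
    foreach (Move m in board.GenerateMoves())
    {
        board.Make(m);
        int score = -NegaMax(board, depth - 1); // negamax: flip sign for the opponent
        board.Undo(m);
        if (score > max)
        {
            max = score;
            best = m; // the best move lives at the root, not in a global
        }
    }
    return best;
}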
You have two possible issues:
It is somewhat ambiguous, as you don't show us your variable declarations, but I think you are using too many global variables. Negamax works by calculating best moves at each node, so while searching, the values and moves should be local. In any case, it is good practice to keep the scope of variables as tight as possible; it is harder to reason about the code when traversing the game tree changes so many variables. However, your search looks like it should return the correct values.
Your evaluation does not appear to discriminate which side is playing. I don't know if EvalPiece handles this, but in any case evaluation should be from the perspective of whichever side currently has the right to move; see the sketch after this list.
You also have other issues that are not directly related to your problem:
Your move generation is scary. You're pairwise traversing every possible pair of from/to squares on the board. This is highly inefficient, and I don't understand how such a method would even work. You need only loop through all the pieces on the board, or, for a slower method, every square on the board (64 squares instead of 4096 from/to pairs).
MakeMove seems like it may be the place for the root node. Right now your scheme works, in that the last node the search exits from will be the root. However, it is common to use special routines at the root, such as iterative deepening, so it may be good to have a separate loop at the root.
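To illustrate the second point, here is a small sketch of the side-to-move convention negamax expects, assuming (my assumption, not stated in the question) that EvalGame returns a score that is positive when White is ahead:
// Negamax requires the evaluation to be relative to the side to move:
// a position must score positive for the player who is about to move.
private int Evaluate(bool whiteToMove)
{
    int score = EvalGame(); // assumed: positive means good for White
    return whiteToMove ? score : -score;
}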
