I have a sparse-matrix from Extreme.Mathematics.LinearAlgebra like:
SparseMatrix<double> J = Matrix.CreateSparse<double>(amountI, amountJ);
Now I want to fill it in a parallel loop, since filling it in parallel should be much faster.
Parallel.For(0, amountI, i =>
{
for (int j = 0; j < amountJ; j++)
J[i, j] = random.Next();
});
This throws an out-of-range exception.
However, a normal for loop works fine.
for (int i = 0; i < amountI; i++)
{
for (int j = 0; j < amountJ; j++)
J[i, j] = random.Next();
}
Also, if I use a 2D array instead of a sparse matrix it works fine.
double[,] M = new double[amountI, amountJ];
Parallel.For(0, amountI, i =>
{
for (int j = 0; j < amountJ; j++)
M[i, j] = random.Next();
});
How can I fill a sparse matrix in parallel without running into out-of-range exceptions?
I know this is a little bit late but better than nothing.
A sparse matrix is something completely different from a normal array. It stores only the non-zero values of the matrix, together with their row and column indices. For further details, read the Extreme documentation.
In general, the backend of a sparse matrix has to update its index structures, and often reallocate memory, whenever you modify the underlying data. In a normal loop these updates happen one after another and you are fine. As soon as multiple threads modify the matrix at once, you get into trouble: one thread allocates new memory, another thread immediately overwrites it, and the original thread then throws an exception.
So parallel filling of sparse matrices won't work. I haven't yet found a proper library that allows writing to a sparse matrix in parallel.
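One possible workaround, sketched here under stated assumptions: do the per-element computation in parallel into plain per-row arrays (safe as long as each iteration writes only its own row), then copy the results into the sparse matrix on a single thread. `SparseStandIn` below is a hypothetical Dictionary-based placeholder, not the Extreme.Mathematics type; with the real library the second phase would use the `J[i, j]` indexer from the question. The `compute` delegate is also an assumption and must itself be thread-safe (a shared `System.Random` instance, as in the question, is not).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a sparse matrix; NOT the Extreme.Mathematics API.
public sealed class SparseStandIn
{
    private readonly Dictionary<(int Row, int Col), double> _values =
        new Dictionary<(int Row, int Col), double>();

    // Number of stored (non-zero) entries.
    public int Count => _values.Count;

    public double this[int row, int col]
    {
        get => _values.TryGetValue((row, col), out var v) ? v : 0.0;
        set
        {
            // Store only non-zero values, as a sparse format would.
            if (value != 0.0) _values[(row, col)] = value;
            else _values.Remove((row, col));
        }
    }
}

public static class ParallelFill
{
    public static SparseStandIn Fill(int rows, int cols, Func<int, int, double> compute)
    {
        // Phase 1: compute in parallel. Each iteration owns exactly one row buffer,
        // so no two threads ever write to the same memory.
        var buffers = new double[rows][];
        Parallel.For(0, rows, i =>
        {
            var row = new double[cols];
            for (int j = 0; j < cols; j++)
                row[j] = compute(i, j);
            buffers[i] = row;
        });

        // Phase 2: insert sequentially, so the sparse structure is only ever
        // modified by one thread.
        var matrix = new SparseStandIn();
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                matrix[i, j] = buffers[i][j];
        return matrix;
    }
}
```

This only pays off when computing the values is more expensive than inserting them; for something as cheap as `random.Next()` the sequential loop is likely just as fast.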
I had a look and couldn't see anything quite answering my question.
I'm not exactly the best at creating accurate 'real life' tests, so I'm not sure if that's the problem here. Basically I want to create a few simple neural networks to create something to the effect of Gridworld. Performance of these neural networks will be critical, and I don't want the hidden layer to be a bottleneck if I can avoid it.
I would rather use more memory and be faster, so I opted to use arrays instead of lists (due to lists having an extra bounds check over arrays). The arrays aren't always full, but because the if statement (check if the element is null) gives the same result until the end, it can be predicted and there is no performance drop from that at all.
My question is about how I store the data for the network to process. I figured that because 2D arrays store all the data together, they would be better cache-wise and would run faster. But my mock-up test shows that an array of arrays performs much better in this scenario.
Some Code:
private void RunArrayOfArrayTest(float[][] testArray, Data[] data)
{
    for (int i = 0; i < testArray.Length; i++) {
        for (int j = 0; j < testArray[i].Length; j++) {
            var inputTotal = data[i].bias;
            for (int k = 0; k < data[i].weights.Length; k++) {
                inputTotal += testArray[i][k];
            }
        }
    }
}
private void Run2DArrayTest(float[,] testArray, Data[] data, int maxI, int maxJ)
{
    for (int i = 0; i < maxI; i++) {
        for (int j = 0; j < maxJ; j++) {
            var inputTotal = data[i].bias;
            for (int k = 0; k < maxJ; k++) {
                inputTotal += testArray[i, k];
            }
        }
    }
}
These are the two functions that are timed. Each 'creature' has its own network (the first for loop), each network has hidden nodes (the second for loop) and I need to find the sum of the weights for each input (the third loop). In my test I stripped it down, so it's not really what I am doing in my actual code, but the same number of loop iterations happens (the data variable would have its own 2D array, but I didn't want to possibly skew the results). From this I was trying to get a feel for which one is faster, and to my surprise the array of arrays was.
Code to start the tests:
// Array of Array test
Stopwatch timer = Stopwatch.StartNew();
RunArrayOfArrayTest(arrayOfArrays, dataArrays);
timer.Stop();
Console.WriteLine("Array of Arrays finished in: " + timer.ElapsedTicks);
// 2D Array test
timer = Stopwatch.StartNew();
Run2DArrayTest(array2D, dataArrays, NumberOfNetworks, NumberOfInputNeurons);
timer.Stop();
Console.WriteLine("2D Array finished in: " + timer.ElapsedTicks);
Just wanted to show how I was testing it. The results from this in release mode give me values like:
Array of Arrays finished in: 8972
2D Array finished in: 16376
Can someone explain to me what I'm doing wrong? Why is an array of arrays so much faster in this situation? Isn't a 2D array all stored together, meaning it would be more cache friendly?
Note that I really do need this to be fast, as it needs to sum up hundreds of thousands to millions of numbers per frame, and like I said, I don't want this to be a problem. I know this can be multithreaded quite easily in the future, because each network is completely separate and even each node is completely separate.
One last question: would something like this be possible to run on the GPU instead? I figure a GPU would not struggle with much larger numbers of networks and much larger numbers of input/hidden neurons.
In the CLR, there are two different types of array:
Vectors, which are zero-based, single-dimensional arrays
Arrays, which can have non-zero bases and multiple dimensions
Your "array of arrays" is a "vector of vectors" in CLR terms.
Vectors are significantly faster than arrays, basically. It's possible that arrays could be optimized further in later CLR versions, but I doubt that they'll get the same amount of love as vectors, as they're relatively rarely used. There's not a lot you can do to make CLR arrays faster. As you say, they'll be more cache friendly, but they have this CLR penalty.
You can improve your array-of-arrays code already, however, by only performing the first indexing operation once per row:
private void RunArrayOfArrayTest(float[][] testArray, Data[] data)
{
    for (int i = 0; i < testArray.Length; i++) {
        // These don't change in the loop below, so extract them
        var row = testArray[i];
        var inputTotal = data[i].bias;
        var weightLength = data[i].weights.Length;
        for (int j = 0; j < row.Length; j++) {
            for (int k = 0; k < weightLength; k++) {
                inputTotal += row[k];
            }
        }
    }
}
If you want to get the cache friendliness and still use a vector, you could have a single float[] and perform the indexing yourself... but I'd probably start off with the array-of-arrays approach.
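As a rough sketch of the single `float[]` idea: store the matrix row-major in one array and compute the offset of each row yourself. The class and method names here are made up for illustration, and the layout (element `[i, k]` at `i * cols + k`) is an assumption about how you would choose to pack the data.

```csharp
using System;

public static class FlatMatrix
{
    // Sums every row of a rows x cols matrix that is stored row-major
    // in a single one-dimensional array (a CLR "vector").
    public static float[] RowSums(float[] flat, int rows, int cols)
    {
        if (flat.Length != rows * cols)
            throw new ArgumentException("flat must contain rows * cols elements");

        var sums = new float[rows];
        for (int i = 0; i < rows; i++)
        {
            // Compute the start of the row once, not once per element.
            int offset = i * cols;
            float total = 0f;
            for (int k = 0; k < cols; k++)
                total += flat[offset + k];
            sums[i] = total;
        }
        return sums;
    }
}
```

For example, `FlatMatrix.RowSums(new float[] { 1, 2, 3, 4, 5, 6 }, 2, 3)` returns the two row sums 6 and 15. Whether this beats the array-of-arrays version is something to measure rather than assume.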
I have a loop that is too slow in C#. I want to know if there is a faster way to process these arrays. I'm currently working in .NET 2.0, but I'm not opposed to upgrading this project. This is part of a theoretical image processing concept involving gray levels.
Pixel count (PixCnt = 21144402)
g_len = 4625
list_1d - one-dimensional array of an image, with an upper bound of the above pixel count.
pg - gray level intensity holder.
This function creates an index of those values. hence pgidx.
int[] pgidx = new int[PixCnt];
sw = new Stopwatch();
sw.Start();
for (i = 0; i < PixCnt; i++)
{
    j = 0;
    pgidx[i] = 0;
    // Test the bound first so pg[j] is never read past the end of the array
    while (j < g_len && list_1d[i] != pg[j]) j++;
    if (j < g_len)
        pgidx[i] = j;
}
sw.Stop();
Debug.WriteLine("PixCnt Loop took " + sw.ElapsedMilliseconds + " ms");
I think using a dictionary to store what's in the pg array will speed it up. g_len is 4625 elements, so you will likely average around 2312 iterations of the inner while loop. Replacing that with a single hashed lookup in a dictionary should be faster. Since the outer loop executes 21 million times, speeding up the body of that loop should reap big rewards. I'm guessing the code below will be 100 to 1000 times faster.
var pgDict = new Dictionary<int,int>(g_len);
for (int i = 0; i < g_len; i++) pgDict.Add(pg[i], i);
int[] pgidx = new int[PixCnt];
int value = 0;
for (int i = 0; i < PixCnt; i++) {
if (pgDict.TryGetValue(list_1d[i], out value)) pgidx[i] = value;
}
Note that setting pgidx[i] to zero when a match isn't found is not necessary, because all elements of the array are already initialized to zero when the array is created.
If there is the possibility for a value in pg to appear more than once, you would want to check first to see if that key has already been added, and skip adding it to the dictionary if it has. That would mimic your current behavior of finding the first match. To do that replace the line where the dictionary is built with this:
for (int i = 0; i < g_len; i++) if (!pgDict.ContainsKey(pg[i])) pgDict.Add(pg[i], i);
If the range of the pixel values in pg allows it (say 16 bpp = 65536 entries), you can create an auxiliary array that maps every possible gray level to its index in pg. Filling this array takes a single pass over pg (after initializing it to all zeroes).
Then convert list_1d to pgidx with straight table lookups.
If the table is too big (bigger than the image), then do as @hatchet answered.
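The table-lookup idea could look roughly like this. It is a sketch that assumes the pixel and gray-level values are non-negative ints below `levelCount` (for example 65536 for 16 bpp); the class, method and parameter names are hypothetical. Filling the table from the end of `pg` backwards means the first occurrence of a value wins, which matches the linear search in the question.

```csharp
public static class GrayLevelIndex
{
    public static int[] BuildIndex(int[] pixels, int[] pg, int levelCount)
    {
        // table[v] is the index of gray level v in pg, or 0 if v does not occur
        // (the same default the original loop leaves in pgidx).
        var table = new int[levelCount];
        for (int j = pg.Length - 1; j >= 0; j--)
            table[pg[j]] = j;   // reverse pass: the first occurrence overwrites later ones

        // One array read per pixel instead of a search through pg.
        var pgidx = new int[pixels.Length];
        for (int i = 0; i < pixels.Length; i++)
            pgidx[i] = table[pixels[i]];
        return pgidx;
    }
}
```

With `pg = { 10, 20, 30, 20 }` and pixels `{ 20, 30, 10, 5, 20 }`, this yields `{ 1, 2, 0, 0, 1 }`. Note that, as in the original, index 0 is ambiguous between "matches pg[0]" and "not found".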
I'm facing a strange issue that I can't explain and I would like to know if some of you have the answer I'm lacking.
I have a small test app for testing multithreading modifications I'm making to a much larger code base. In this app I've set up two functions, one that runs a loop sequentially and one that uses Parallel.For. Both print out the time taken and the final number of elements generated. What I'm seeing is that the function that executes the Parallel.For generates fewer items than the sequential loop, and this is a huge problem for the real app (it's messing with some final results). So my question is whether someone has any idea why this could be happening and, if so, whether there's any way to fix it.
Here is the code for the function that uses the parallel.for in my test app:
static bool[] values = new bool[52];
static List<int[]> combinations = new List<int[]>();
static void ParallelLoop()
{
    combinations.Clear();
    Parallel.For(0, 48, i =>
    {
        if (values[i])
        {
            for (int j = i + 1; j < 49; j++)
                if (values[j])
                {
                    for (int k = j + 1; k < 50; k++)
                    {
                        if (values[k])
                        {
                            for (int l = k + 1; l < 51; l++)
                            {
                                if (values[l])
                                {
                                    for (int m = l + 1; m < 52; m++)
                                    {
                                        if (values[m])
                                        {
                                            int[] combination = { i, j, k, l, m };
                                            combinations.Add(combination);
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
        }
    }); // Parallel.For
}
And here is the app output:
Executing sequential loop...
Number of elements generated: 1,712,304
Executing parallel loop...
Number of elements generated: 1,464,871
Thanks in advance and if you need some clarifications I'll do my best to explain in further detail.
You can't just add items to your list from multiple threads at the same time without any synchronization mechanism. List<T>.Add() actually does some non-trivial internal work (buffers, etc.), so adding an item is not an atomic, thread-safe operation.
Either:
Provide a way to synchronize your writes
Use a collection that supports concurrent writes (see System.Collections.Concurrent namespace)
Don't use multi-threading at all
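To illustrate the first option, here is a minimal sketch that gives every worker thread its own private list and merges the lists under a lock at the end, using the `Parallel.For` overload with thread-local state. To keep it short it generates pairs rather than the five-element combinations from the question; the class and method names are invented for the example.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SafeCombinations
{
    public static List<int[]> Pairs(bool[] values)
    {
        var combinations = new List<int[]>();
        var gate = new object();

        Parallel.For(0, values.Length,
            // localInit: one private list per worker, so Add needs no lock.
            () => new List<int[]>(),
            // body: append only to the worker's own list.
            (i, state, local) =>
            {
                if (values[i])
                    for (int j = i + 1; j < values.Length; j++)
                        if (values[j])
                            local.Add(new[] { i, j });
                return local;
            },
            // localFinally: merge once per worker, under a lock.
            local => { lock (gate) combinations.AddRange(local); });

        return combinations;
    }
}
```

The lock is taken once per worker rather than once per item, so contention stays low. The order of the merged results is not deterministic; if order matters, sort afterwards or fall back to the sequential loop.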
I have the following code snippet:
// Initialise rectangular matrix with [][] instead of [,]
double[][] data = new double[m][];
for (int i = 0; i < m; i++)
data[i] = new double[n];
// Populate data[][] here...
// Code to run in parallel:
for (int i = 0; i < m; i++)
data[i] = Process(data[i]);
If this makes sense, I have a matrix of doubles. I need to apply a transformation to each individual row of the matrix. It is "embarrassingly parallel", as there is no connection for the data from one row to another.
If I do something like:
data.AsParallel().ForAll(row => { row = Process(row); });
First of all, I don't know whether data.AsParallel() knows to only look at the first subscript, or if it will enumerate all m * n doubles. Secondly, since row is the element I'm enumerating over, I have no idea if I can change it like this - I suspect not.
So, with or without PLINQ, what is a good way to parallelise this loop in C#?
Here are two ways to do it:
data.AsParallel().ForAll(row =>
{
Process(row);
});
Parallel.For(0, data.Length, rowIndex =>
{
Process(data[rowIndex]);
});
In both cases, each row is a one-dimensional array of doubles, which is a reference type, so any changes your Process method makes to the row's elements are made in the data array itself.
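If Process returns a new array instead of modifying its argument, as the `data[i] = Process(data[i])` loop in the question suggests, the index-based form lets you store the result back. A small sketch, with the helper name and the `Func` parameter being assumptions made for illustration:

```csharp
using System;
using System.Threading.Tasks;

public static class RowTransform
{
    // Replaces every row with process(row). Each iteration touches only
    // its own slot data[i], so no synchronization is needed.
    public static void ApplyToRows(double[][] data, Func<double[], double[]> process)
    {
        Parallel.For(0, data.Length, i =>
        {
            data[i] = process(data[i]);
        });
    }
}
```

Assigning to the lambda parameter in `ForAll(row => { row = ...; })` would only change the local variable, not the slot in data, which is why the index is needed here.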
We are working on a video processing application using EmguCV and recently had to do some pixel level operation. I initially wrote the loops to go across all the pixels in the image as follows:
for (int j = 0; j < Img.Width; j++ )
{
for (int i = 0; i < Img.Height; i++)
{
// Pixel operation code
}
}
The time to execute the loops was pretty bad. Then I posted on the EmguCV forum and got a suggestion to switch the loops like this:
for (int j = Img.Width; j-- > 0; )
{
for (int i = Img.Height; i-- > 0; )
{
// Pixel operation code
}
}
I was very surprised to find that the code executed in half the time!
The only thing I can think of is that the comparison in each loop accesses a property on every iteration, which it no longer has to do. Is this the reason for the speed-up? Or is there something else? I was thrilled to see this improvement and would love it if someone could clarify the reason for it.
The difference isn't the cost of branching, it's the fact that you are fetching an object property Img.Width and Img.Height in the inner loop. The optimizer has no way of knowing that these are constants for purposes of that loop.
You should get the same performance speedup by doing this.
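// Read each property once, before the loops (these can't be const,
// because the values are only known at run time)
int width = Img.Width;
int height = Img.Height;
for (int j = 0; j < width; j++)
{
    for (int i = 0; i < height; i++)
    {
        // Pixel operation code
    }
}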
Edit:
As Joshua suggests, putting Width in the inner loop will have you walking through memory sequentially, which gives better cache locality and might be faster (depending on how big your bitmap is).
int width = Img.Width;
int height = Img.Height;
for (int i = 0; i < height; i++)
{
    for (int j = 0; j < width; j++)
    {
        // Pixel operation code
    }
}
I assume you are using the System.Drawing.Image class? Looking at the implementation of .Width and .Height I see they do a function call into GDI+ (GdipGetImageHeight and GdipGetImageWidth in gdiplus.dll), which seems to be rather expensive.
By going backwards you make that call once, rather than in every iteration.
It's not the loop reversal that speeds things up -- it's the fact that you're accessing the Width and Height properties far fewer times.
It's because the CPUs are like hockey players, they go faster when going backward ;-)
More seriously:
This is not related to the direction of the loop in any way, but rather to the fact that in the original construct the loop conditions dereference the Img object to read its Width or Height property on each and every iteration, whereas the second construct evaluates these properties only once.
Also, the fact that the new condition tests against the value 0 saves even the loading of an immediate value.
This probably explains the difference (assuming the work done inside the inner loop is relatively minimal, i.e. roughly the same as the work needed to test an Object.Property, since you indicate a roughly 50% gain).
Edit:
See Michael Stum's answer, which indicates that the Img.Width/Height reference is even more costly than thought. As sometimes happens with properties, the implementation of the object may run a significant amount of code to produce the value (for example, it may do a bunch of math to get the width each time, rather than caching it). This seems to be the case with this Img object, hence the benefit of reading it only once (if you are sure that the value will remain constant for the duration of the loop logic).
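To make the number of property reads concrete, here is a small sketch with a hypothetical `CountingImage` class whose getters count how often they are called. It is not the EmguCV or System.Drawing type, just a stand-in for an object with non-trivial properties.

```csharp
// Hypothetical image-like class that counts every read of Width and Height.
public sealed class CountingImage
{
    private readonly int _width;
    private readonly int _height;

    public CountingImage(int width, int height)
    {
        _width = width;
        _height = height;
    }

    public int PropertyReads { get; private set; }

    public int Width  { get { PropertyReads++; return _width; } }
    public int Height { get { PropertyReads++; return _height; } }
}

public static class LoopPropertyDemo
{
    // Properties appear in the loop conditions, as in the original code.
    public static int ReadsWhenPropertyInCondition(int width, int height)
    {
        var img = new CountingImage(width, height);
        for (int j = 0; j < img.Width; j++)
            for (int i = 0; i < img.Height; i++) { }
        return img.PropertyReads;
    }

    // Properties are read once into locals before the loops.
    public static int ReadsWhenHoisted(int width, int height)
    {
        var img = new CountingImage(width, height);
        int w = img.Width, h = img.Height;
        for (int j = 0; j < w; j++)
            for (int i = 0; i < h; i++) { }
        return img.PropertyReads;
    }
}
```

For a 3 x 2 image the first method reports 13 reads (4 of Width plus 3 x 3 of Height) and the second reports 2. For a real frame the first count grows with the number of pixels, which is where an expensive getter starts to dominate.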