How do I parallelise this with or without PLINQ? - c#

I have the following code snippet:
// Initialise rectangular matrix with [][] instead of [,]
double[][] data = new double[m][];
for (int i = 0; i < m; i++)
    data[i] = new double[n];
// Populate data[][] here...
// Code to run in parallel:
for (int i = 0; i < m; i++)
    data[i] = Process(data[i]);
In other words, I have a matrix of doubles, and I need to apply a transformation to each individual row of the matrix. It is "embarrassingly parallel", as there is no data dependency between one row and another.
If I do something like:
data.AsParallel().ForAll(row => { row = Process(row); });
First of all, I don't know whether data.AsParallel() knows to only look at the first subscript, or if it will enumerate all m * n doubles. Secondly, since row is the element I'm enumerating over, I have no idea if I can change it like this - I suspect not.
So, with or without PLINQ, what is a good way to parallelise this loop in C#?

Here are two ways to do it:
data.AsParallel().ForAll(row =>
{
    Process(row);
});
Parallel.For(0, data.Length, rowIndex =>
{
    Process(data[rowIndex]);
});
In both cases, each one-dimensional array of doubles is a reference type, so Process receives a reference to the same array; modifying its elements inside your Process method will modify the data array in place.
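Note that both snippets assume Process mutates the row it receives. If, as in the original loop, Process instead returns a new array, you can keep the reassignment and still parallelise safely, since each iteration writes to a distinct index; a minimal sketch:
Parallel.For(0, data.Length, rowIndex =>
{
    // Each iteration touches only its own index, so no locking is needed.
    data[rowIndex] = Process(data[rowIndex]);
});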

Related

List of Lists in C#, problem with indexes

In C#, I create a list of lists and try to access/modify the elements. The problem is that an operation (e.g. adding a constant) seems to apply not only to the intended index but to all elements.
Here is the piece of code:
List<List<double>> D = new List<List<double>>();
List<double> ttmp = new List<double>(new double[256]);
for (int i = 0; i < 100; i++)
{
    D.Add(ttmp);
}
for (int i = 0; i < 100; i++)
{
    D[i][0] = D[i][0] + 1;
}
D is a list of 100 lists, each of size 256. It initially contains only zeroes. In the second loop, I ask that the first element of each of the 100 lists be incremented by one.
As a result, the entire "matrix" is filled with ones, i.e. not only D[0][0], D[1][0] ... D[99][0], but also D[0][1], D[0][2], etc.
Why is that?
NB: the C++ equivalent with vector<vector<double>> works perfectly fine...
When the code is executed, the result is not that D[0][0], D[1][0] ... D[99][0] and also D[0][1], D[0][2] are modified.
The result is that the [0] element of every inner list is equal to 100.
Why so? Because you created one list and added it to D 100 times. But that is the same list every time - when you modify it, the change shows through every index (because List<double> is a reference type).
Change it instead to:
List<List<double>> D = new List<List<double>>();
for (int i = 0; i < 100; i++)
{
    D.Add(new List<double>(new double[256]));
}
foreach (var innerList in D)
{
    innerList[0]++;
}
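You can verify the aliasing directly; a quick illustrative check using the D from the snippets above:
// Prints True with the original initialisation (one shared list),
// False with the corrected one (independent lists).
Console.WriteLine(object.ReferenceEquals(D[0], D[1]));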

Quickly apply a known sort order (old index -> new index mapping) to an array

I am trying to performance tune a routine that needs to sort 8 large arrays "in tandem", where one of the arrays is the array to sort by.
I've already taken care of sorting the first array using a method of my choosing (I'm using TimSort).
I've already taken care of making sure my array of sorted objects has a property denoting each element's original index (e.g. sortedArray[0].OriginalIndex would return 2983 if previously unsortedArray[2983] turned out to be the first item).
This means if I were to loop over my now sorted array of objects, I think I can just get all other arrays sorted in the same order in the following naïve way:
private List<object[]> SortInTandem(IndexedObj[] sortedArray, List<object[]> arraysToSort)
{
    for (int i = 0; i < sortedArray.Length; i++)
    {
        int originalIndex = sortedArray[i].OriginalIndex;
        // Swap the corresponding index from all other arrays to their new position
        foreach (object[] array in arraysToSort)
        {
            object temp = array[i];
            array[i] = array[originalIndex];
            array[originalIndex] = temp;
        }
    }
    return arraysToSort; // Returning original arrays sorted in-place
}
I believe the above algorithm to have the desired result, but it feels less efficient than it could be. (3 times as many assignments as needed?)
I also considered the following approach which minimizes assignments, but requires allocating new arrays to store sorted items, and garbage collecting the old arrays (unless I come up with a way to recycle the allocations between calls):
private List<object[]> SortInTandem(IndexedObj[] sortedArray, List<object[]> arraysToSort) =>
    arraysToSort.Select(array =>
    {
        object[] tandemArray = new object[array.Length];
        for (int i = 0; i < sortedArray.Length; i++)
            tandemArray[i] = array[sortedArray[i].OriginalIndex];
        return tandemArray;
    }).ToList(); // Returning newly-allocated arrays
This sort of thing is done continuously in a performance-critical area of code, so I'm looking for thoughts on how I might get the best of both worlds.
Thinking more about the second solution above (allocating new arrays), it occurred to me that the arrays passed in can also be "repurposed" once their sorted variants have been produced, so I actually only need to allocate one new array, and then I can reuse the ones passed in to prepare the remaining results:
// Note the allocated arraysToSort passed in will be repurposed to produce a new set of sorted
// arrays, so the caller must be sure to discard their references and only use what is returned.
private List<object[]> SortInTandem(IndexedObj[] sortedArray, List<object[]> arraysToSort)
{
    List<object[]> sortedArrays = new List<object[]>(arraysToSort.Count);
    object[] tandemArray = new object[sortedArray.Length];
    for (int i = 0; i < arraysToSort.Count; i++)
    {
        object[] array = arraysToSort[i];
        for (int j = 0; j < sortedArray.Length; j++)
            tandemArray[j] = array[sortedArray[j].OriginalIndex];
        sortedArrays.Add(tandemArray);
        tandemArray = array; // Repurpose the consumed input array as the next scratch buffer
    }
    return sortedArrays; // Returning one newly-allocated + all but one original arrays repurposed
}
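If the goal is the best of both worlds - no redundant swaps and no per-call allocation - one further option (a sketch only; the helper name and parameters are hypothetical, and the visited buffer is assumed to be allocated once and reused between calls) is to apply the permutation in place by following its cycles, so each element is written exactly once:
// Moves array[sortedArray[i].OriginalIndex] into position i for all i, in place.
static void ApplyPermutationInPlace(object[] array, IndexedObj[] sortedArray, bool[] visited)
{
    Array.Clear(visited, 0, visited.Length);
    for (int start = 0; start < array.Length; start++)
    {
        if (visited[start]) continue;
        object held = array[start]; // Value displaced at the start of this cycle
        int current = start;
        while (true)
        {
            visited[current] = true;
            int source = sortedArray[current].OriginalIndex;
            if (source == start)
            {
                array[current] = held; // Cycle closes: drop the held value back in
                break;
            }
            array[current] = array[source];
            current = source;
        }
    }
}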

How to update array inside a for loop and add it to a list

I am trying to update an array and add it to a list if a certain condition is true. As you can see in my code, the array "rows" is updated every time inside the if condition, and then it is added to "checkList".
The problem is that when I iterate through the list to check the values, it seems that only the last value of rows has been added in every entry in the list.
Here is some code to explain
int[] rows = new int[2];
List<int[]> checkList = new List<int[]>();
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        if (true)
        {
            rows[0] = i;
            rows[1] = j;
            checkList.Add(rows);
        }
    }
}
foreach (var row in checkList)
{
    Console.WriteLine(row[0] + " " + row[1]);
}
Output: "3 3" printed 16 times, rather than the expected 0 0, 0 1, ... 3 3.
I hope someone can explain this. Thanks
Arrays in .NET (like most object types) are reference types, so checkList.Add(rows); adds a reference to the same array to the list, multiple times.
Instead, you'll want to create a new array instance every time:
List<int[]> checkList = new List<int[]>();
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        if (true)
        {
            checkList.Add(new int[] { i, j });
        }
    }
}
I believe the issue here is that when you call
checkList.Add(rows);
you are adding a reference to the rows array to the list each time, not a separate copy of it. This leads to the behaviour you are seeing.
A solution would be to instantiate the array inside the loop, so a new array is created every iteration.
List<int[]> checkList = new List<int[]>();
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        if (true)
        {
            int[] rows = new int[2];
            rows[0] = i;
            rows[1] = j;
            checkList.Add(rows);
        }
    }
}
As a supplement to Matthias' answer, one of the things that's perhaps not easy to appreciate about C# is that most variables you have and use are merely a reference to something else. When you assign some variable like this:
int[] rows = new int[2];
C# creates some space in memory to keep an array of 2 integers, it attaches a reference to it, and that thing becomes your variable that you use, named rows. If you then do:
int[] rows2 = rows;
It doesn't clone the memory space used and create a new array, it just creates another reference attached to the same data in memory. If the data were a dog, it now has 2 leads attached to its collar but there is still only one dog. You can pull on either lead to urge the dog to stop peeing on a car, but it's the same dog you're affecting.
Array/list slots are just like variables in this regard. To say you have:
List<int[]> checkList = new List<int[]>();
it means declaring a list where each of its slots is a variable capable of referring to an int array. It's conceptually no different from writing:
int[] checkList0 = rows;
int[] checkList1 = rows;
int[] checkList2 = rows;
int[] checkList3 = rows;
It's just that those numbers are baked into the name, whereas a list permits you a way of varying the name programmatically (and having more than 4 slots):
checkList[0] = rows;
checkList[1] = rows;
checkList[2] = rows;
checkList[3] = rows;
checkList[0] is conceptually an entire variable name, just like checkList0 is, and each of these variables is hence just another reference to that same array in memory.
By not making a new array each time, you attached every variable slot in the list to the same array in memory. (The original answer illustrated this with a diagram: the list in black, with every slot pointing to the one blue array.) You might have changed the numbers in the array 200 times, but at the end of the operation, because there was only ever one array, you only see the final set of numbers you wrote into it. You might have attached 20 leads to your dog and pulled each of them once, but it's still the same one dog that has been stopped 20 times from peeing on 20 cars.
Matthias' answer works (and is how it should be done) because you concretely make a new array each time. (The accompanying diagram showed each list slot linking to its own new array object in memory; the numbers in blue were fabricated and not intended to represent the output you should see printed.)
You'd be forgiven for thinking that a clone would be made, because it is for int. int is a value type, which means the value is copied when it's used:
int x = 1;
int y = x;
y = y + 1;
y is now 2, but x is still 1. It'd be pretty hard work to write C# if it weren't this way, i.e. if every time you incremented some int variable, every other variable that had been assigned from it was also affected. So it's perhaps intrinsically reasonable to assume that whenever an assignment of anything is made, changes to the assigned variable don't affect whatever it was assigned from... but that's not the case. There's a clear divide between value types (types whose data is copied/cloned when they're assigned) and reference types (types whose data is not copied/cloned). While int is a value type (cloned), an int[] is a reference type (not cloned).
...and that's something you'll really need to get down with and remember.
Roll on the "what's ref/out for?" question :D
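As an aside: if you ever do want an independent copy of an array, a shallow clone is enough when the elements are value types like int. A minimal sketch:
int[] rows = { 3, 3 };
int[] copy = (int[])rows.Clone(); // A brand new array object holding the same values
copy[0] = 99;                     // Does not affect rows[0]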

Binary search slower, what am I doing wrong?

EDIT: It looks like this is normal behavior, so can anyone recommend a faster way to do these numerous intersections?
So my problem is this: I have 8000 lists (strings in each list). For each list (ranging from size 50 to 400), I'm comparing it to every other list and performing a calculation based on the intersection number. So I'll do
list1(intersect)list1= number
list1(intersect)list2= number
list1(intersect)list888= number
And I do this for every list. Previously, I had a HashList and my code was essentially this (well, I was actually searching through properties of an object, so I had to modify the code a bit, but it's basically this):
I have my two versions below, but if anyone knows anything faster, please let me know!
Loop through AllLists, getting each list, starting with list1, and then do this:
foreach (List<string> list in AllLists)
{
    if (list1_length < list_length) // just a check so I'm looping through the
                                    // smaller list
    {
        foreach (string word in list1)
        {
            if (list.Contains(word))
            {
                // simple integer count
            }
        }
    }
    // a little more code, but the same, but looping through the other list if it's smaller/bigger
}
Then I made the lists into regular lists and applied Sort(), which changed my code to:
foreach (List<string> list in AllLists)
{
    if (list1_length < list_length) // just a check so I'm looping through the
                                    // smaller list
    {
        for (int i = 0; i < list1_length; i++)
        {
            var test = list.BinarySearch(list1[i]);
            if (test > -1)
            {
                // simple integer count
            }
        }
    }
}
The first version takes about 6 seconds, the other one takes more than 20 (I just stopped it there because otherwise it would have taken more than a minute!!!) (and this is for a smallish subset of the data).
I'm sure there's a drastic mistake somewhere, but I can't find it.
Well I have tried three distinct methods for achieving this (assuming I understood the problem correctly). Please note I have used HashSet<int> in order to more easily generate random input.
setting up:
List<HashSet<int>> allSets = new List<HashSet<int>>();
Random rand = new Random();
for (int i = 0; i < 8000; ++i) {
    HashSet<int> ints = new HashSet<int>();
    for (int j = 0; j < rand.Next(50, 400); ++j) {
        ints.Add(rand.Next(0, 1000));
    }
    allSets.Add(ints);
}
the three methods I checked (the code shown is what runs in the inner loop):
the loop:
Note that you are getting duplicated results in your code (intersecting set A with set B and later intersecting set B with set A).
It won't affect your performance, thanks to the list-length check you are doing, but iterating this way is clearer:
for (int i = 0; i < allSets.Count; ++i) {
    for (int j = i + 1; j < allSets.Count; ++j) {
    }
}
first method:
used IEnumerable.Intersect() to get the intersection with the other list and checked IEnumerable.Count() to get the size of the intersection.
var intersect = allSets[i].Intersect(allSets[j]);
count = intersect.Count();
this was the slowest one averaging 177s
second method:
cloned the smaller of the two sets I was intersecting, then used ISet.IntersectWith() and checked the resulting set's Count.
HashSet<int> intersect;
HashSet<int> intersectWith;
if (allSets[i].Count < allSets[j].Count) {
    intersect = new HashSet<int>(allSets[i]);
    intersectWith = allSets[j];
} else {
    intersect = new HashSet<int>(allSets[j]);
    intersectWith = allSets[i];
}
intersect.IntersectWith(intersectWith);
count = intersect.Count;
this one was slightly faster, averaging 154s
third method:
did something very similar to what you did: iterated over the shorter set and checked ISet.Contains on the longer set.
for (int i = 0; i < allSets.Count; ++i) {
    for (int j = i + 1; j < allSets.Count; ++j) {
        HashSet<int> loopingSet;
        HashSet<int> containsSet;
        count = 0;
        if (allSets[i].Count < allSets[j].Count) {
            loopingSet = allSets[i];
            containsSet = allSets[j];
        } else {
            loopingSet = allSets[j];
            containsSet = allSets[i];
        }
        foreach (int k in loopingSet) {
            if (containsSet.Contains(k)) {
                ++count;
            }
        }
    }
}
this method was by far the fastest (as expected), averaging 66s
conclusion
the method you're using is the fastest of these three. I certainly can't think of a faster single threaded way to do this. Perhaps there is a better concurrent solution.
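If you do want to experiment with a concurrent variant, the pairwise loop parallelises naturally over the outer index, because the sets are only ever read; a minimal sketch (assuming the same allSets as above, with the counts collected into a jagged result array):
// using System.Threading.Tasks;
int n = allSets.Count;
int[][] counts = new int[n][]; // counts[i][j] = intersection size of sets i and j, for j > i
Parallel.For(0, n, i =>
{
    counts[i] = new int[n];
    for (int j = i + 1; j < n; ++j)
    {
        HashSet<int> small, large;
        if (allSets[i].Count < allSets[j].Count) { small = allSets[i]; large = allSets[j]; }
        else { small = allSets[j]; large = allSets[i]; }
        int count = 0;
        foreach (int k in small)
            if (large.Contains(k))
                ++count;
        counts[i][j] = count;
    }
});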
I've found that one of the most important considerations in iterating/searching any kind of collection is to choose the collection type very carefully. To iterate through a normal collection for your purposes will not be the most optimal. Try using something like:
System.Collections.Generic.HashSet<T>
Using the Contains() method while iterating over the shorter of the two lists (as you mentioned you're already doing) should give close to O(1) performance per lookup, the same as key lookups in the generic Dictionary type.

What's the best way to implement an unfixed multi-dimensional array in C#.NET?

For example: an array of varying-length arrays of integers.
In C++, we are used to doing things like:
int** TwoDimAry = new int*[n];
for (int i = 0; i < n; i++)
{
    TwoDimAry[i] = new int[i + n];
}
In this case, if n == 3 then the result would be an array of three pointers to arrays of integers, and would appear like this:
http://img263.imageshack.us/img263/4149/multidimarray.png
Of course, .NET arrays are managed collections, so you don't have to deal with the manual allocation/deletion.
But declaring:
int[][] TwoDimAry ;
... in C# does not appear to have the same effect - namely, you have to initialize ALL of the sub-arrays at the same time, and they have to be the same length.
I need my sub-arrays to be independent of each other, as they are in native C++.
What's the best way to implement this using managed collections? Are there any drawbacks I should be aware of?
Like C++, you need to initialize every subarray in an int[][].
However, they don't need to have the same length. (That's why it's called a jagged array)
For example:
int[][] jagged1 = new int[][] { new int[1], new int[2], new int[3] };
Your C++ code can be translated directly to C#:
int[][] TwoDimAry = new int[n][];
for (int i = 0; i < n; i++) {
    TwoDimAry[i] = new int[i + n];
}
Here is an example with a jagged array initialized with 1, 2, 3, .. elements for each row
int N = 20;
int[][] array = new int[N][]; // First index is rows, second is columns
for (int i = 0; i < N; i++)
{
    array[i] = new int[i + 1]; // Initialize i-th row with i+1 columns
    for (int j = 0; j <= i; j++)
    {
        array[i][j] = N * j + i; // Set a value for each column in the row
    }
}
I have used this enough to know that there aren't many drawbacks overall. Hybrid approaches with List<int[]> or List<int>[] also work.
In .Net, most of the time you don't want to use arrays this way at all. This is because in .Net, arrays are thought of as a different animal from a collection. Managed, yes. Collection? Well, maybe, but it confuses terms because that means something special. If you want a collection (hint: most of the time you do), look in the System.Collections namespace, particularly System.Collections.Generic. It sounds like you really want either a List<List<int>> or a List<int[]>.
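For instance, a minimal sketch (values purely illustrative) of the List<List<int>> approach, where each row is sized - and resized - independently:
List<List<int>> rows = new List<List<int>>();
rows.Add(new List<int> { 1 });       // Row 0: one element
rows.Add(new List<int> { 2, 3, 4 }); // Row 1: three elements
rows[1].Add(5);                      // Rows can grow independently
Console.WriteLine(rows[1].Count);    // Prints 4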
