I've found two different methods to get a max value from an array, but I'm not really fond of parallel programming, so I don't really understand them.
I was wondering: do these methods do the same thing, or am I missing something?
I really don't have much information about them. Not even comments...
The first method:
int[] vec = ... (I guess the content doesn't matter)
static int naiveMax()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, i =>
{
lock (obj) {
if (vec[i] > max) max = vec[i];
}
});
return max;
}
And the second one:
static int Max()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, //could be Parallel.For<int>
() => vec[0],
(i, loopState, partial) =>
{
if(vec[i]>partial) partial = vec[i];
return partial;
},
partial => {
lock (obj) {
if( partial > max) max = partial;
}
});
return max;
}
Do these do the same thing or something different, and if so, what? Thanks ;)
Both find the maximum value in an array of integers. In an attempt to find the maximum value faster, they do it "in parallel" using the Parallel.For method. Both methods fail at this, though.
To see this, we first need a sufficiently large array of integers. For small arrays, parallel processing doesn't give us a speed-up anyway.
int[] values = new int[100000000];
Random random = new Random();
for (int i = 0; i < values.Length; i++)
{
values[i] = random.Next();
}
Now we can run the two methods and see how long they take. Using an appropriate performance measurement setup (Stopwatch, array of 100,000,000 integers, 100 iterations, Release build, no debugger attached, JIT warm-up) I get the following results on my machine:
naiveMax 00:06:03.3737078
Max 00:00:15.2453303
So Max is much much better than naiveMax (6 minutes! cough).
But how does it compare to, say, PLINQ?
static int MaxPlinq(int[] values)
{
return values.AsParallel().Max();
}
MaxPlinq 00:00:11.2335842
Not bad, saved a few seconds. Now, what about a plain, old, sequential for loop for comparison?
static int Simple(int[] values)
{
int result = values[0];
for (int i = 0; i < values.Length; i++)
{
if (result < values[i]) result = values[i];
}
return result;
}
Simple 00:00:05.7837002
I think we have a winner.
Lesson learned: Parallel.For is not pixie dust that you can sprinkle over your code to
make it magically run faster. If performance matters, use the right tools and measure, measure, measure, ...
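A timing harness along those lines might look roughly like the following sketch (the exact setup used above isn't shown; the method under test, the array, and the iteration count are placeholders):
static TimeSpan Measure(Func<int[], int> method, int[] values, int iterations)
{
    method(values);                 // warm up the JIT before timing
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        method(values);
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;       // use a Release build with no debugger attached
}
// e.g. Console.WriteLine($"Simple {Measure(Simple, values, 100)}");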
They appear to do the same thing; however, they are very inefficient. The point of parallelization is to improve the speed of code that can be executed independently. Due to race conditions, discovering the maximum (as implemented here) requires a lock around the actual comparison logic, which means you're spinning up many threads and related resources simply to run the code sequentially anyway, defeating the purpose of parallelization entirely.
This is my first attempt at parallel programming.
I'm writing a test console app before using this in my real app and I can't seem to get it right. When I run this, the parallel search is always faster than the sequential one, but the parallel search never finds the correct value. What am I doing wrong?
I tried it without using a partitioner (just Parallel.For); it was slower than the sequential loop and gave the wrong number. I saw a Microsoft doc that said for simple computations, using Partitioner.Create can speed things up. So I tried that but still got the wrong values. Then I saw Interlocked, but I think I'm using it wrong.
Any help would be greatly appreciated
Random r = new Random();
Stopwatch timer = new Stopwatch();
do {
// Make and populate a list
List<short> test = new List<short>();
for (int x = 0; x <= 10000000; x++)
{
test.Add((short)(r.Next(short.MaxValue) * r.NextDouble()));
}
// Initialize result variables
short rMin = short.MaxValue;
short rMax = 0;
// Do min/max normal search
timer.Start();
foreach (var amp in test)
{
rMin = Math.Min(rMin, amp);
rMax = Math.Max(rMax, amp);
}
timer.Stop();
// Display results
Console.WriteLine($"rMin: {rMin} rMax: {rMax} Time: {timer.ElapsedMilliseconds}");
// Initialize parallel result variables
short pMin = short.MaxValue;
short pMax = 0;
// Create list partioner
var rangePortioner = Partitioner.Create(0, test.Count);
// Do min/max parallel search
timer.Restart();
Parallel.ForEach(rangePortioner, (range, loop) =>
{
short min = short.MaxValue;
short max = 0;
for (int i = range.Item1; i < range.Item2; i++)
{
min = Math.Min(min, test[i]);
max = Math.Max(max, test[i]);
}
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMin), Math.Min(pMin, min));
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMax), Math.Max(pMax, max));
});
timer.Stop();
// Display results
Console.WriteLine($"pMin: {pMin} pMax: {pMax} Time: {timer.ElapsedMilliseconds}");
Console.WriteLine("Press enter to run again; any other key to quit");
} while (Console.ReadKey().Key == ConsoleKey.Enter);
Sample output:
rMin: 0 rMax: 32746 Time: 106
pMin: 0 pMax: 32679 Time: 66
Press enter to run again; any other key to quit
The correct way to do a parallel search like this is to compute local values for each thread used, and then merge the values at the end. This ensures that synchronization is only needed at the final phase:
var items = Enumerable.Range(0, 10000).ToList();
int globalMin = int.MaxValue;
int globalMax = int.MinValue;
Parallel.ForEach<int, (int Min, int Max)>(
items,
() => (int.MaxValue, int.MinValue), // Create new min/max values for each thread used
(item, state, localMinMax) =>
{
var localMin = Math.Min(item, localMinMax.Min);
var localMax = Math.Max(item, localMinMax.Max);
return (localMin, localMax); // return the new min/max values for this thread
},
localMinMax => // called one last time for each thread used
{
lock(items) // Since this may run concurrently, synchronization is needed
{
globalMin = Math.Min(globalMin, localMinMax.Min);
globalMax = Math.Max(globalMax, localMinMax.Max);
}
});
As you can see, this is quite a bit more complex than a regular loop, and it is not even doing anything fancy like partitioning. An optimized solution would work over larger blocks to reduce overhead, but this is omitted for simplicity, and it looks like the OP is aware of such issues already.
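For reference, a block-based variant might look roughly like this (a sketch combining Partitioner.Create from System.Collections.Concurrent with the same local-then-merge pattern; it reuses the items, globalMin and globalMax variables from above):
// One range of indices per chunk; the lock is taken once per range, not per item.
var rangePartitioner = Partitioner.Create(0, items.Count);
Parallel.ForEach(rangePartitioner, range =>
{
    int localMin = int.MaxValue;
    int localMax = int.MinValue;
    for (int i = range.Item1; i < range.Item2; i++)
    {
        localMin = Math.Min(localMin, items[i]);
        localMax = Math.Max(localMax, items[i]);
    }
    lock (items)
    {
        globalMin = Math.Min(globalMin, localMin);
        globalMax = Math.Max(globalMax, localMax);
    }
});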
Be aware that multithreaded programming is difficult. While it is a great idea to try out such techniques in a playground rather than a real program, I would still suggest starting by studying the potential dangers of thread safety; good resources about this are fairly easy to find.
Not all problems will be as obviously wrong as this one, and it is quite easy to cause issues that break once in a million runs, or only when the CPU load is high, or only on single-CPU systems, or issues that are only detected long after the code is put into production. It is good practice to be paranoid whenever multiple threads may read and write the same memory concurrently.
I would also recommend learning about immutable data types and pure functions, since these are much safer and easier to reason about once multiple threads are involved.
Interlocked.Exchange is thread-safe only for the exchange itself; each Math.Min and Math.Max call can still race. You should compute the min/max for every batch separately and then join the results.
Using low-lock techniques like the Interlocked class is tricky and advanced. Considering that your experience with multithreading is not extensive, I would say go with a simple and trusty lock:
object locker = new object();
//...
lock (locker)
{
pMin = Math.Min(pMin, min);
pMax = Math.Max(pMax, max);
}
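For completeness, a lock-free merge would need a compare-and-swap retry loop, which is exactly the kind of subtlety that makes Interlocked tricky. A sketch, assuming the shared result variable is an int (the classic Interlocked overloads don't cover short):
static void InterlockedMin(ref int target, int value)
{
    // Keep retrying until either our value is no longer smaller,
    // or we successfully swap it in.
    int current = Volatile.Read(ref target);
    while (value < current)
    {
        int previous = Interlocked.CompareExchange(ref target, value, current);
        if (previous == current) break;   // our value was written
        current = previous;               // another thread updated it first; retry
    }
}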
I am trying to figure out what the difference between the following for loops is.
The first is code that I wrote while practicing algorithms on codewars.com. It times out when attempting the larger test cases.
The second is one of the top solutions. It seems functionally similar (obviously it's more concise), but it runs much faster and does not time out. Can anyone explain to me what the difference is? Also, the return statement in the second snippet is confusing to me. What exactly does this syntax mean? Maybe this is where it is more efficient.
public static long findNb(long m)
{
int sum = 0;
int x = new int();
for (int n = 0; sum < m; n++)
{
sum += n*n*n;
x = n;
System.Console.WriteLine(x);
}
if (sum == m)
{
return x;
}
return -1;
}
vs
public static long findNb(long m) // seems similar but doesn't time out
{
long total = 1, i = 2;
for(; total < m; i++) total += i * i * i;
return total == m ? i - 1 : -1;
}
The second approach uses long for the total value. Chances are that you're using an m value that's high enough to exceed the range representable by int, so your math overflows and the sum becomes a negative number. You get caught in an infinite loop, where sum can never get as big as m.
And, like everyone else says, get rid of the WriteLine.
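For illustration, a minimal fix along those lines (a sketch, not the reference solution): switch the accumulator to long and drop the WriteLine:
public static long findNb(long m)
{
    long sum = 0;
    long n = 0;
    for (; sum < m; n++)
    {
        sum += n * n * n;             // long arithmetic avoids the int overflow
    }
    return sum == m ? n - 1 : -1;     // n has already been incremented past the last term
}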
Also, the return statement in the second snippet is confusing to me. What exactly does this syntax mean?
It's a ternary conditional operator.
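In other words, that return line is just a compact if/else:
// return total == m ? i - 1 : -1;   is equivalent to:
if (total == m)
    return i - 1;
else
    return -1;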
Both approaches are roughly the same, except for the unwanted System.Console.WriteLine(x); which spoils the fun: printing to the console (UI!) is a slow operation.
If you are looking for a fast solution (especially for large m and a long loop) you can just precompute all (77936) values:
public class Solver {
static Dictionary<long, long> s_Sums = new Dictionary<long, long>();
private static void Build() {
long total = 0;
for (long i = 0; i <= 77936; ++i) {
total += i * i * i;
s_Sums.Add(total, i);
}
}
static Solver() {
Build();
}
public static long findNb(long m) {
return s_Sums.TryGetValue(m, out long result)
? result
: -1;
}
}
When I run into micro-optimisation challenges like this, I always use BenchmarkDotNet. It's the tool to use to get all the insights into performance, memory allocations, differences across .NET Framework versions, 64-bit vs 32-bit, etc.
But as others write - remember to remove the WriteLine() statement :)
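For example, a minimal BenchmarkDotNet setup for this problem could look roughly like the sketch below (class and method names are illustrative, the inputs are placeholders, and it requires the BenchmarkDotNet NuGet package):
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FindNbBenchmarks
{
    // Illustrative inputs; substitute the values you care about.
    [Params(1071225L, 91716553919377L)]
    public long M;

    [Benchmark(Baseline = true)]
    public long Loop() => FindNbLoop(M);            // the loop-based version above

    [Benchmark]
    public long Precomputed() => Solver.findNb(M);  // the dictionary-based version above

    static long FindNbLoop(long m)
    {
        long total = 1, i = 2;
        for (; total < m; i++) total += i * i * i;
        return total == m ? i - 1 : -1;
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<FindNbBenchmarks>();
}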
I ran some code performance tests, and I would like to know how the CPU cache works in this kind of situation:
Here is a classic example for a loop:
private static readonly short[] _values;
static MyClass()
{
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
}
public static void Run()
{
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
max = Math.Max(max, _values[index]);
}
}
And here is a version that computes the same thing, but is much more performant:
private static readonly short[] _values;
static MyClass()
{
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
}
public static void Run()
{
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
{
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
}
short max = Math.Max(max1, max2);
}
So I am interested to know why the second version is more efficient than the first one.
I understand it's supposed to be a CPU-cache story, but I don't really get how it happens (the values are not read twice between the loops, for instance).
EDIT:
.NET Core 4.6.27617.04
2.1.11
Intel Core i7-7850HQ 2.90GHz 64-bit
Calling it 50 million times:
MyClass1:
=> 00:00:06.0702028
MyClass2:
=> 00:00:03.8563776 (-36 %)
The last metric is the one with loop unrolling.
The difference in performance in this case is not related to caching - you have just 100 values - they fit entirely in the L2 cache already at the time you generated them.
The difference is due to out-of-order execution.
A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.
But your loop is problematic for a modern CPU because it has a dependency:
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
max = Math.Max(max, _values[index]);
}
Here each subsequent iteration is dependent on the value max from the previous one, so the CPU is forced to compute them sequentially.
Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.
So essentially the revised loop can run equally fast per iteration as the first one:
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
{
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
}
But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
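The same idea can be pushed a bit further; for example, a four-accumulator sketch (assuming the array length is a multiple of 4, as the 100-element array above is). The gains flatten out once the execution units are saturated or memory becomes the bottleneck:
short max1 = 0, max2 = 0, max3 = 0, max4 = 0;
for (var index = 0; index < _values.Length; index += 4)
{
    max1 = Math.Max(max1, _values[index]);
    max2 = Math.Max(max2, _values[index + 1]);
    max3 = Math.Max(max3, _values[index + 2]);
    max4 = Math.Max(max4, _values[index + 3]);
}
short max = Math.Max(Math.Max(max1, max2), Math.Max(max3, max4));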
Caching
CPU caching works by pre-loading the next few cache lines from memory and storing them in the CPU cache; this may be data, pointers, variable values, etc.
Code Blocks
Between your two blocks of code, the difference may not appear in the syntax. Try converting your code to IL (the intermediate language for C#, which is executed by the JIT, the just-in-time compiler); see the refs below for tools and resources.
Or just decompile your built/compiled code with the decompiler linked below and check how the compiler "optimized" it when producing the DLL/EXE files.
Other performance optimizations
Loop Unrolling
CPU Caching
Refs:
C# Decompiler
JIT
I am building software to evaluate many possible solutions and am trying to introduce parallel processing to speed up the calculations. My first attempt was to build a datatable with each row being a solution to evaluate but building the datatable takes quite some time and I am running into memory issues when the number of possible solutions goes into the millions.
The problem which warrants these solutions is structured as follows:
There is a range of dates for x number of events, which must be done in order. The solutions to evaluate could look as follows, with each solution being a row, the events being the columns, and the day numbers being the values.
Given 3 days (0 to 2) and three events:
0 0 0
0 0 1
0 0 2
0 1 1
0 1 2
0 2 2
1 1 1
1 1 2
1 2 2
2 2 2
My new plan was to use recursion and evaluate the solutions as I go rather than build a solution set to then evaluate.
for(int day = 0; day < maxdays; day++)
{
List<int> mydays = new List<int>();
mydays.Add(day);
EvalEvent(0,day,mydays);
}
private void EvalEvent(int eventnum,
int day, List<int> mydays)
{
Parallel.For(day,maxdays, day2 =>
// events must be on same day or after previous events
{
List<int> mydays2 = new List<int>();
for(int a = 0; a <mydays.Count;a++)
{
mydays2.Add(mydays[a]);
}
mydays2.Add(day2);
if(eventnum< eventcount - 1) // proceed to next event
{
EvalEvent(eventnum+1, day2,mydays2);
}
else
{
EvalSolution(mydays2);
}
});
}
My question is whether this is actually an efficient use of parallel processing, or will too many threads be spawned and slow it down? Should the parallel loop only be done on the last (or maybe last few) values of eventnum, or is there a better way to approach the problem?
The old code that was requested is pretty much as follows:
private int daterange;
private int events;
private void ScheduleIt()
{
daterange = 10;
events = 6;
CreateSolutions();
int best = GetBest();
}
private DataTable Options;
private bool CreateSolutions()
{
Options= new DataTable();
Options.Columns.Add();
for (int day1=0;day1<=daterange ;day1++)
{
Options.Rows.Add(day1);
}
for (int ev = 1; ev < events; ev++)
{
Options.Columns.Add();
foreach(DataRow dr in Options.Rows)
{dr[Options.Columns.Count-1] = dr[Options.Columns.Count-2] ;}
int rows = Options.Rows.Count;
for (int day1=1;day1<=daterange ;day1++)
{
for(int i = 0; i <rows; i++)
{
if(day1 > Convert.ToInt32(Options.Rows[i][Options.Columns.Count-2]))
{
try{
Options.Rows.Add();
for (int col=0;col<Options.Columns.Count-1;col++)
{
Options.Rows[Options.Rows.Count-1][col] =Options.Rows[i][col];
}
Options.Rows[Options.Rows.Count-1][Options.Columns.Count-1] = day1;
}
catch(Exception ex)
{
return false;
}
}
}
}
}
return true;
}
private int GetBest()
{
int bestopt = 0;
double bestscore =999999999;
Parallel.For( 0, Options.Rows.Count,opt =>
{
double score = 0;
for(int i = 0; i <Options.Columns.Count;i++)
{ score += Convert.ToInt32(Options.Rows[opt][i]); } // just a stand-in calc for a score
if (score < bestscore)
{bestscore = score;
bestopt = opt;
}
});
return bestopt;
}
Even if done without errors, it cannot significantly speed up your solution.
It looks like each level of recursion starts multiple (let's say up to k) next-level calls, across, say, n levels. This essentially means the code is O(k^n), which grows very fast. A non-algorithmic speedup of such an O(k^n) solution is essentially useless (unless both k and n are very small). In particular, running the code in parallel only gives you a constant factor of speedup (roughly the number of threads supported by your CPUs), which does not change the complexity at all.
Indeed, creating an exponentially large number of requests for new threads will likely cause more problems by itself, just from managing the threads.
In addition to not significantly improving performance, parallel code is harder to write, as it needs proper synchronization or clever data partitioning - neither of which seems to be present in your case.
Parallelization works best when the workload is bulky and balanced. Ideally you would like your work split into as many independent partitions as there are logical processors on your machine, ensuring that all partitions have approximately the same size. This way all available processors will work at maximum efficiency for approximately the same duration, and you'll get the results after the shortest time possible.
Of course you should start with a working and bug-free serial implementation, and then think about ways to partition your work. The easiest way is usually not optimal. For example, an easy path is to convert your work to a LINQ query and then parallelize it with AsParallel() (making it PLINQ). This usually results in too granular a partitioning, which introduces too much overhead. If you can't find ways to improve it, you can then go the way of Parallel.For or Parallel.ForEach, which is a bit more complex.
A LINQ implementation should probably start by creating an iterator that produces all your units of work (Events or Solutions, it's not very clear to me).
public static IEnumerable<Solution> GetAllSolutions()
{
for (int day = 0; day < 3; day++)
{
for (int ev = 0; ev < 3; ev++)
{
yield return new Solution(); // ???
}
}
}
It will certainly be helpful if you have created concrete classes to represent the entities you are dealing with.
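With such an iterator in place, a rough PLINQ sketch could look like this (Evaluate stands in for the real scoring function, and Solution for the concrete class mentioned above):
var best = GetAllSolutions()
    .AsParallel()
    .Select(solution => (Score: Evaluate(solution), Solution: solution))
    .Aggregate((a, b) => a.Score <= b.Score ? a : b);   // keep the lowest-scoring solution
Console.WriteLine(best.Score);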
I need to create an array of boolean values, which could be on the scale of 100,000s or even millions of entries. It also needs to be super-fast, so every millisecond per iteration counts.
At the time of beginning the loop, I will already know how many entries there are going to be in the array. The question is, will it be faster to create a bool array up front and fill in the values by index (which is random access - could be slow?), or should I create a List<bool>, keep adding entries to the list, and at the end return .ToArray()?
In other words:
Option 1
var array = new bool[size];
for (var n=0; n<size; n++)
array[n] = GetValue(n);
return array;
Option 2
var list = new List<bool>();
for (var n=0; n<size; n++)
list.Add(GetValue(n));
return list.ToArray();
Or maybe there's a 3rd way that's even faster?
Use a System.Collections.BitArray and don't worry about speed.
What you are suggesting above will only waste your memory. This optimizes for both speed and size, and will pack your bool values nicely (8 per byte, as the gods intended :).
Reply to the comments below: if you use a BitArray, everything will be zero at first. Set only those bits for which GetValue == true.
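A sketch of that approach (GetValue and size are from the question's Option 1; BitArray lives in System.Collections):
// All bits start out false, so only the true results need to be written.
BitArray bits = new BitArray(size);
for (var n = 0; n < size; n++)
{
    if (GetValue(n))
        bits[n] = true;
}
return bits;   // note: a BitArray, not a bool[]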
The following code seems to show (at least to me) that of the methods discussed on this page, the simple allocation to a bool[] using a loop is quickest.
The code also seems to show me that unless GetValue(n) is computationally trivial, the overhead of allocating the bytes is not the part of the process I would be hoping to optimise.
Hope this helps in some way.
edit: added the results from the run (on my machine)
-- 187ms   BitArray
-- 171ms   List<bool>().ToArray
-- 168ms   BitArray set only if true
-- 130ms   bool[] always set
-- 11460ms bool[] always set with 'complex' GetValue()
class Program
{
static void Main(string[] args)
{
BitArray bitArray = new BitArray(10000000);
bool[] boolArray = new bool[10000000];
Stopwatch sw1 = new Stopwatch();
sw1.Start();
for (int i = 0; i < 10000000; i++)
{
bitArray[i] = GetMod2(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
var list = new List<bool>();
for (int i = 0; i < 10000000; i++)
list.Add(GetMod2(i));
var boolArray2 = list.ToArray();
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
bool nextVal = GetMod2(i);
if (nextVal)
bitArray[i] = true;
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
boolArray[i] = GetMod2(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
boolArray[i] = GetRand(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
Console.ReadLine();
}
static bool GetMod2(int i)
{
return (i % 2) == 1;
}
static bool GetRand(int i)
{
return new Random().Next(2) == 1;
}
}
Go with the first. The only reason it might be "slow" is if it keeps paging data from outside the processor cache.
The list will have exactly the same problem, except it will also need to perform several memory allocations and copies.
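One small mitigation, if you do go with a List<bool>: pass the known size to the constructor so the internal array is allocated once, avoiding the repeated growth and copying mentioned above (ToArray() still makes one final copy):
var list = new List<bool>(size);   // capacity set up front
for (var n = 0; n < size; n++)
    list.Add(GetValue(n));
return list.ToArray();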
Now here's a funny old thing. Inspired by @paul, I ran these benchmark tests myself, on 10,000,000 booleans. The results (in milliseconds) are very surprising, given the discussion in the comments to this question:
BitArray: 517
BitArray + CopyTo(array): 536
List + ToArray(): 455
bool array: 483
And what a turn-up for the books! Despite the fact that the List<bool> is inserting a new record every time, while the bool[] and BitArray are initialized to false on every record and I only updated them where the value should be true, the List<bool> comes out tops, consistently, even including the .ToArray() call.
Yet another case where practical application is better than textbook knowledge, it seems... :)