Which is faster: Query Syntax vs. Loops

The following code provides two approaches that generate pairs of integers whose sum is less than 100, and they're arranged in descending order based on their distance from (0,0).
//approach 1
private static IEnumerable<Tuple<int,int>> ProduceIndices3()
{
var storage = new List<Tuple<int, int>>();
for (int x = 0; x < 100; x++)
{
for (int y = 0; y < 100; y++)
{
if (x + y < 100)
storage.Add(Tuple.Create(x, y));
}
}
storage.Sort((p1,p2) =>
(p2.Item1 * p2.Item1 +
p2.Item2 * p2.Item2).CompareTo(
p1.Item1 * p1.Item1 +
p1.Item2 * p1.Item2));
return storage;
}
//approach 2
private static IEnumerable<Tuple<int, int>> QueryIndices3()
{
return from x in Enumerable.Range(0, 100)
from y in Enumerable.Range(0, 100)
where x + y < 100
orderby (x * x + y * y) descending
select Tuple.Create(x, y);
}
This code is taken from the book Effective C# by Bill Wagner, Item 8. In the entire article, the author has focused more on the syntax, the compactness and the readability of the code, but paid very little attention to the performance, and almost didn't discuss it.
So I basically want to know, which approach is faster? And what is usually better at performance (in general) : Query Syntax or Manual Loops?
Please discuss them in detail, providing references if any. :-)

Profiling is truth, but my gut feeling would be that the loops are probably faster. The important thing is that 99 times out of 100 the performance difference just doesn't matter in the grand scheme of things. Use the more readable version and your future self will thank you when you need to maintain it later.

Running each function 1000 times:
for loop: 2623 ms
query: 2821 ms
That seems logical, since the second one is just syntactic sugar for LINQ method calls that end up doing much the same work as the first one. But I would use the second one for its readability.

Though this doesn't strictly answer your question, performance-wise I would suggest merging that x+y logic into the iteration, thus:
for (int x = 0; x < 100; x++)
for (int y = 0; y < 100 - x; y++)
storage.Add(Tuple.Create(x, y));
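
As a quick sanity check on the suggestion above, here is a minimal sketch comparing the filtered double loop with the triangular one. The class and method names (PairGeneration, Filtered, Triangular) are made up for illustration and are not from the book; the sort step is left out because it is the same in both cases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper names, used only for this illustration.
public static class PairGeneration
{
    // Original shape: visit all 100 x 100 candidates and filter with x + y < 100.
    public static List<Tuple<int, int>> Filtered()
    {
        var storage = new List<Tuple<int, int>>();
        for (int x = 0; x < 100; x++)
            for (int y = 0; y < 100; y++)
                if (x + y < 100)
                    storage.Add(Tuple.Create(x, y));
        return storage;
    }

    // Suggested shape: fold the condition into the inner loop bound,
    // so only the pairs that pass the test are ever visited.
    public static List<Tuple<int, int>> Triangular()
    {
        var storage = new List<Tuple<int, int>>();
        for (int x = 0; x < 100; x++)
            for (int y = 0; y < 100 - x; y++)
                storage.Add(Tuple.Create(x, y));
        return storage;
    }
}
```

Both methods should produce the same pairs in the same order; the second simply skips the candidates that the if statement would have rejected, which is roughly half of them.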

Related

Why is HashSet<Point> so much slower than HashSet<string>?

I wanted to store some pixel locations without allowing duplicates, so the first thing that comes to mind is HashSet<Point> or similar classes. However, this seems to be very slow compared to something like HashSet<string>.
For example, this code:
HashSet<Point> points = new HashSet<Point>();
using (Bitmap img = new Bitmap(1000, 1000))
{
for (int x = 0; x < img.Width; x++)
{
for (int y = 0; y < img.Height; y++)
{
points.Add(new Point(x, y));
}
}
}
takes about 22.5 seconds.
While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:
HashSet<string> points = new HashSet<string>();
using (Bitmap img = new Bitmap(1000, 1000))
{
for (int x = 0; x < img.Width; x++)
{
for (int y = 0; y < img.Height; y++)
{
points.Add(x + "," + y);
}
}
}
So, my questions are:
Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
Is there a better way to store points without duplicates?

There are two perf problems induced by the Point struct. You can see the first one when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections but the string test only needs ~18 collections. That is not free. When you see a value type induce so many collections, you need to conclude "uh-oh, too much boxing".
At issue is that HashSet<T> needs an IEqualityComparer<T> to get its job done. Since you did not provide one, it falls back to the one returned by EqualityComparer<T>.Default. That property can do a good job for string, since it implements IEquatable<string>. But not for Point, a type that harks from .NET 1.0 and never got the generics love. All it can do is use the Object methods.
The other issue is that Point.GetHashCode() does not do a stellar job in this test, too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.
You can solve both problems by providing the HashSet with a good comparer. Like this one:
class PointComparer : IEqualityComparer<Point> {
public bool Equals(Point x, Point y) {
return x.X == y.X && x.Y == y.Y;
}
public int GetHashCode(Point obj) {
// Perfect hash for practical bitmaps, their width/height is never >= 65536
return (obj.Y << 16) ^ obj.X;
}
}
And use it:
HashSet<Point> list = new HashSet<Point>(new PointComparer());
And it is now about 150 times faster, easily beating the string test.

The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).
Apart from that, the hash code algorithm worsens the problem, because it causes more calls to Equals(object obj) thus increasing the amount of boxing conversions.
Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated — something that doesn't happen with string, where the dispersion of the hashes is much larger.
You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates:
(x << 16) ^ y
For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.
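
A custom point struct along the lines suggested above might look like the following sketch. GridPoint and GridPointDemo are hypothetical names, not types from System.Drawing. The struct implements IEquatable<GridPoint>, so the default comparer can call a strongly typed Equals without boxing, and it uses the shifted hash (x << 16) ^ y, which assumes coordinates below 65536.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical replacement for System.Drawing.Point, for illustration only.
public readonly struct GridPoint : IEquatable<GridPoint>
{
    public int X { get; }
    public int Y { get; }

    public GridPoint(int x, int y) { X = x; Y = y; }

    // Strongly typed Equals: no boxing when called through IEquatable<GridPoint>.
    public bool Equals(GridPoint other) => X == other.X && Y == other.Y;

    public override bool Equals(object obj) => obj is GridPoint p && Equals(p);

    // Shifted hash: collision-free as long as 0 <= X, Y < 65536.
    public override int GetHashCode() => (X << 16) ^ Y;
}

public static class GridPointDemo
{
    // Adds every point of a width x height grid twice and returns the set size,
    // which should equal width * height because duplicates are rejected.
    public static int CountDistinct(int width, int height)
    {
        var points = new HashSet<GridPoint>();
        for (int pass = 0; pass < 2; pass++)
            for (int x = 0; x < width; x++)
                for (int y = 0; y < height; y++)
                    points.Add(new GridPoint(x, y));
        return points.Count;
    }
}
```

Whether this matches the speed of the custom comparer shown in the other answer would need to be measured; the point of the sketch is only that both the boxing and the weak hash can be addressed inside the type itself.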

What wrong with this implement of this arcsine approximate in C#

This is a formula to approximate arcsine(x) using a Taylor series, taken from this blog.
This is my implementation in C#. I don't know where the mistake is, but the code gives the wrong result when run:
When i = 0, the division will be 1/x. So I assign temp = 1/x at startup. For each iteration, I change "temp" after "i".
I loop until two consecutive values are very close together. When the difference between two consecutive values is very small, I return the value.
My test case:
The input is x = 1, so the expected arcsin(x) is arcsin(1) = PI/2 = 1.57079633 rad.
class Arc{
static double abs(double x)
{
return x >= 0 ? x : -x;
}
static double pow(double mu, long n)
{
double kq = mu;
for(long i = 2; i<= n; i++)
{
kq *= mu;
}
return kq;
}
static long fact(long n)
{
long gt = 1;
for (long i = 2; i <= n; i++) {
gt *= i;
}
return gt;
}
#region arcsin
static double arcsinX(double x) {
int i = 0;
double temp = 0;
while (true)
{
//i++;
var iFactSquare = fact(i) * fact(i);
var tempNew = (double)fact(2 * i) / (pow(4, i) * iFactSquare * (2*i+1)) * pow(x, 2 * i + 1) ;
if (abs(tempNew - temp) < 0.00000001)
{
return tempNew;
}
temp = tempNew;
i++;
}
}
#endregion
public static void Main(){
Console.WriteLine(arcsinX(1));
Console.ReadLine();
}
}

In many series evaluations, it is often convenient to use the quotient between terms to update the term. The quotient here is
\[
\frac{a_n}{a_{n-1}}
= \frac{(2n)!\,x^{2n+1}}{4^n\,(n!)^2\,(2n+1)} \cdot \frac{4^{n-1}\,((n-1)!)^2\,(2n-1)}{(2n-2)!\,x^{2n-1}}
= \frac{2n\,(2n-1)^2\,x^2}{4n^2\,(2n+1)}
= \frac{(2n-1)^2\,x^2}{2n\,(2n+1)}
\]
Thus a loop to compute the series value is
sum = 1;
term = 1;
n=1;
while(1 != 1+term) {
term *= (n-0.5)*(n-0.5)*x*x/(n*(n+0.5));
sum += term;
n += 1;
}
return x*sum;
The convergence is only guaranteed for abs(x)<1, for the evaluation at x=1 you have to employ angle halving, which in general is a good idea to speed up convergence.
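
The loop above can be wrapped into a complete C# method as in the sketch below. ArcsinSeries and Arcsin are names chosen here for illustration; the angle halving mentioned for x = 1 is not implemented, so the method simply rejects inputs with abs(x) >= 1.

```csharp
using System;

// Illustrative wrapper around the term-ratio loop; names are assumptions.
public static class ArcsinSeries
{
    public static double Arcsin(double x)
    {
        // The series is only summed here for |x| < 1; the endpoints
        // would need angle halving, which this sketch leaves out.
        if (Math.Abs(x) >= 1.0)
            throw new ArgumentOutOfRangeException(nameof(x), "requires |x| < 1");

        double sum = 1.0;   // running sum of a[n] / x, starting with the n = 0 term
        double term = 1.0;  // current term a[n] / x

        // Stop once adding the term no longer changes 1.0 in double precision.
        for (int n = 1; 1.0 != 1.0 + term; n++)
        {
            // Ratio a[n] / a[n-1] = (2n-1)^2 x^2 / (2n (2n+1)), written with halves.
            term *= (n - 0.5) * (n - 0.5) * x * x / (n * (n + 0.5));
            sum += term;
        }
        return x * sum;
    }
}
```

Because each term is derived from the previous one, no factorial or power is ever computed directly, which also avoids the overflow that large factorials would cause in an integer type.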

You are saving two different temp values (temp and tempNew) to check whether or not continuing computation is irrelevant. This is good, except that you are not saving the sum of these two values.
This is a summation. You need to add every new calculated value to the total. You are only keeping track of the most recently calculated value. You can only ever return the last calculated value of the series. So you will always get an extremely small number as your result. Turn this into a summation and the problem should go away.

NOTE: I've made this a community wiki answer because I was hardly the first person to think of this (just the first to put it down in a comment). If you feel that more needs to be added to make the answer complete, just edit it in!
The general suspicion is that this is down to integer overflow, namely one of your values (probably the return of fact() or the iFactSquare product) is getting too big for the type you have chosen. It's going negative because you are using signed types: when the value gets too large, it wraps around into the negative range.
Try tracking how large i gets during your calculation, and figure out how big a number it would give you if you ran that number through fact, pow and the iFactSquare product. If it's bigger than the maximum 64-bit long value like we think (the limit is a lot smaller for a 32-bit type), then try using a double instead.
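
To see where the factorial stops fitting, a small sketch like the one below can help. FactorialOverflow and FactChecked are hypothetical names; the method is the same loop as fact() in the question, wrapped in a checked block so that overflow throws instead of wrapping silently.

```csharp
using System;

// Illustration only: the question's fact() loop with overflow detection turned on.
public static class FactorialOverflow
{
    public static long FactChecked(long n)
    {
        checked
        {
            long result = 1;
            for (long i = 2; i <= n; i++)
            {
                result *= i; // throws OverflowException once the product exceeds long.MaxValue
            }
            return result;
        }
    }
}
```

20! still fits in a long, but 21! does not, so the call fact(2 * i) in the question leaves the valid range as soon as i reaches 11.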

Is there a better performing functional version of this iterative algorithm in C#?

I was hoping to figure out a way to write the below in a functional style with extension functions. Ideally this functional style would perform well compared to the iterative/loop version. I'm guessing that there isn't a way. Probably because of the many additional function calls and stack allocations, etc.
Fundamentally I think the pattern which is making it troublesome is that it both calculates a value to use for the Predicate and then needs that calculated value again as part of the resulting collection.
// This is what is passed to each function.
// Do not assume the array is in order.
var a = (0).To(999999).ToArray().Shuffle();
// Approx times in release mode (on my machine):
// Functional is avg 20ms per call
// Iterative is avg 5ms per call
// Linq is avg 14ms per call
private static List<int> Iterative(int[] a)
{
var squares = new List<int>(a.Length);
for (int i = 0; i < a.Length; i++)
{
var n = a[i];
if (n % 2 == 0)
{
int square = n * n;
if (square < 1000000)
{
squares.Add(square);
}
}
}
return squares;
}
private static List<int> Functional(int[] a)
{
return
a
.Where(x => x % 2 == 0 && x * x < 1000000)
.Select(x => x * x)
.ToList();
}
private static List<int> Linq(int[] a)
{
var squares =
from num in a
where num % 2 == 0 && num * num < 1000000
select num * num;
return squares.ToList();
}

An alternative to Konrad's suggestion. This avoids the double calculation, but also avoids even calculating the square when it doesn't have to:
return a.Where(x => x % 2 == 0)
.Select(x => x * x)
.Where(square => square < 1000000)
.ToList();
Personally, I wouldn't sweat the difference in performance until I'd seen it be significant in a larger context.
(I'm assuming that this is just an example, by the way. Normally you'd possibly compute the square root of 1000000 once and then just compare n with that, to shave off a few milliseconds. It does require two comparisons or an Abs operation though, of course.)
EDIT: Note that a more functional version would avoid using ToList at all. Return IEnumerable<int> instead, and let the caller transform it into a List<T> if they want to. If they don't, they don't take the hit. If they only want the first 5 values, they can call Take(5). That laziness could be a big performance win over the original version, depending on the context.
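
A lazy variant along those lines could look like this sketch. LazySquares and EvenSquaresBelow are names invented here, and the square is computed as a long, which is an assumption that departs from the int arithmetic in the question so that large inputs cannot wrap around.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical lazy version; nothing is computed until the caller enumerates.
public static class LazySquares
{
    public static IEnumerable<long> EvenSquaresBelow(IEnumerable<int> source, long limit)
    {
        return source.Where(x => x % 2 == 0)           // cheap test first
                     .Select(x => (long)x * x)         // square only the even numbers
                     .Where(square => square < limit); // then filter on the square
    }
}
```

A caller that needs a list can still write LazySquares.EvenSquaresBelow(a, 1000000).ToList(), while a caller that only wants a few values can write .Take(5) and stop enumerating the input early.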

Just solving your problem of the double calculation:
return (from x in a
let sq = x * x
where x % 2 == 0 && sq < 1000000
select sq).ToList();
That said, I’m not sure that this will lead to much performance improvement. Is the functional variant actually noticeably faster than the iterative one? The code offers quite a lot of potential for automated optimisation.

How about some parallel processing? Or does the solution have to be LINQ (which I consider to be slow).
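var squares = new List<int>(a.Length);
var gate = new object();
Parallel.ForEach(a, n =>
{
// List<T>.Add is not thread-safe, so guard the shared list with a lock
if (n < 1000 && n % 2 == 0) { lock (gate) { squares.Add(n * n); } }
});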
The Linq version would be:
return a.AsParallel()
.Where(n => n < 1000 && n % 2 == 0)
.Select(n => n * n)
.ToList();

I don't think there's a functional solution that will be completely on-par with the iterative solution performance-wise. In my timings (see below) the 'functional' implementation from the OP appears to be around twice as slow as the iterative implementation.
Micro-benchmarks like this one are prone to all manner of issues. A common tactic in dealing with variability problems is to repeatedly call the method being timed and compute an average time per call - like this:
// from main
Time(Functional, "Functional", a);
Time(Linq, "Linq", a);
Time(Iterative, "Iterative", a);
// ...
static int reps = 1000;
private static List<int> Time(Func<int[],List<int>> func, string name, int[] a)
{
var sw = System.Diagnostics.Stopwatch.StartNew();
List<int> ret = null;
for(int i = 0; i < reps; ++i)
{
ret = func(a);
}
sw.Stop();
Console.WriteLine(
"{0} per call timings - {1} ticks, {2} ms",
name,
sw.ElapsedTicks/(double)reps,
sw.ElapsedMilliseconds/(double)reps);
return ret;
}
Here are the timings from one session:
Functional per call timings - 46493.541 ticks, 16.945 ms
Linq per call timings - 46526.734 ticks, 16.958 ms
Iterative per call timings - 21971.274 ticks, 8.008 ms
There are a host of other challenges as well: strobe-effects with the timer use, how and when the just-in-time compiler does its thing, the garbage collector running its collections, the order that competing algorithms are run, the type of cpu, the OS swapping other processes in and out, etc.
I tried my hand at a little optimization. I removed the square from the test (num * num < 1000000) - changing it to (num < 1000) - which seemed safe since there are no negatives in the input - that is, I took the square root of both sides of the inequality. Surprisingly, I got different results as compared to the methods in the OP - there were only 500 items in my optimized output as compared to the 241,849 from the three implementations in the OP implementations. So why the difference? Much of the input when squared overflows 32 bit integers, so those extra 241,349 items came from numbers that when squared overflowed to either negative numbers or numbers under 1 million while still passing our evenness test.
optimized (functional) timing:
Optimized per call timings - 16849.529 ticks, 6.141 ms
This was one of the functional implementations altered as suggested. It output the 500 items passing the criteria as expected. It is deceptively "faster" only because it output fewer items than the iterative solution.
We can make the original implementations blow up with an OverflowException by adding a checked block around their implementations. Here is a checked block added to the "Iterative" method:
private static List<int> Iterative(int[] a)
{
checked
{
var squares = new List<int>(a.Length);
// rest of method omitted for brevity...
return squares;
}
}
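
The overflow effect described in this answer can be reproduced with a single value. In the sketch below (OverflowFilter and its methods are illustrative names), 65536 is even and its square is 2^32, which wraps to 0 in unchecked 32-bit arithmetic and therefore passes the "square < 1000000" test.

```csharp
using System;

// Illustration of why the int-based filter lets large inputs through.
public static class OverflowFilter
{
    // Same test as the question, with the wrap-around made explicit.
    public static bool PassesWithInt(int n)
    {
        int square = unchecked(n * n); // 65536 * 65536 wraps to 0
        return n % 2 == 0 && square < 1000000;
    }

    // Widening to long before multiplying keeps the true square.
    public static bool PassesWithLong(int n)
    {
        long square = (long)n * n;
        return n % 2 == 0 && square < 1000000;
    }
}
```

For inputs below 1000 the two methods agree; they only diverge once n * n no longer fits in an int.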

Modify items in List<T> fast

I tried to make this code perform faster using Parallel.ForEach and ConcurrentBag, but it's still running way too long (especially bearing in mind that in my scenario the loop bound for i may be 1,000,000 or more):
List<Point> points = new List<Point>();
for(int i = 0; i<100000;i++) {
Point point = new Point {X = i-50000, Y = i+50000, CanDelete = false};
points.Add(point);
}
foreach (Point point in points) {
foreach (Point innerPoint in points) {
if (innerPoint.CanDelete == false && (point.X - innerPoint.X) < 2) {
innerPoint.Y = point.Y;
point.CanDelete = true;
}
}
}

That code will perform WORSE in parallel, due to the data access patterns.
The best way to speed it up is to recognize that you don't need to consider all O(N^2) pairs of points, but only the ones having nearby X-coordinates.
First, sort the list by X-coordinate, O(N log N), then process forward and backward in the list from each point until you leave the neighborhood. You'll need to use indexing and not foreach.
In your sample data, the list is already sorted.
Since your distance test is symmetric, and removes matching points from consideration, you can skip looking at earlier points.
for (int j = 0; j < points.Count; ++j) {
int x1 = points[j].X;
//for (int k = j; k >= 0 && points[k].X > x1 - 2; --k ) { /* merge points */ }
for (int k = j + 1; k < points.Count && points[k].X < x1 + 2; ++k ) { /* merge points */ }
}
Not only is the complexity better, the cache behavior is far superior. And it can be split among multiple threads with far less cache contention.
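
Here is a minimal, runnable sketch of the sort-then-scan idea. Since the exact merge rule depends on the asker's intent, the merge step is replaced by simply counting pairs whose X-coordinates differ by less than a given radius; NeighbourScan and CountClosePairs are hypothetical names, and only the X values are modelled.

```csharp
using System;

// Illustration of the sorted forward scan; counting stands in for "merge points".
public static class NeighbourScan
{
    public static int CountClosePairs(int[] xs, int radius)
    {
        // Sort a copy by X so that neighbours in value are neighbours in the array.
        var sorted = (int[])xs.Clone();
        Array.Sort(sorted);

        int pairs = 0;
        for (int j = 0; j < sorted.Length; j++)
        {
            // Only look forward, and stop as soon as we leave the neighbourhood.
            for (int k = j + 1; k < sorted.Length && sorted[k] < sorted[j] + radius; k++)
            {
                pairs++; // a real implementation would merge points j and k here
            }
        }
        return pairs;
    }
}
```

When the points are spread out, as in the sample data, the inner loop runs only a handful of times per point, so the total cost is dominated by the O(N log N) sort rather than by N^2 comparisons.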

Well, I don't know exactly what you want, but let's try.
First, when creating the List, you might want to set its initial capacity, since you know how many items it will hold. That way it does not need to grow all the time.
List<Point> points = new List<Point>(100000);
Next, you could sort the list by the X property. So you would only compare each point with the points that are near it: when you find the first, forward or backward, that is too distant, you can stop comparing.

Can this code be optimised?

I have some image processing code that loops through 2 multi-dimensional byte arrays (of the same size). It takes a value from the source array, performs a calculation on it and then stores the result in another array.
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++)
{
for (int y = 0; y < ySize; y++)
{
ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
(AlphaImageData[x, y] * OneMinusAlphaValue));
}
}
The loop currently takes ~11ms, which I assume is mostly due to accessing the byte arrays values as the calculation is pretty simple (2 multiplications and 1 addition).
Is there anything I can do to speed this up? It is a time critical part of my program and this code gets called 80-100 times per second, so any speed gains, however small will make a difference. Also at the moment xSize = 768 and ySize = 576, but this will increase in the future.
Update: Thanks to Guffa (see answer below), the following code saves me 4-5ms per loop. Although it is unsafe code.
int size = ResultImageData.Length;
int counter = 0;
unsafe
{
fixed (byte* r = ResultImageData, c = CurrentImageData, a = AlphaImageData)
{
while (size > 0)
{
*(r + counter) = (byte)(*(c + counter) * AlphaValue +
*(a + counter) * OneMinusAlphaValue);
counter++;
size--;
}
}
}

To get any real speedup for this code you would need to use pointers to access the arrays; that removes all the index calculations and bounds checking.
int size = ResultImageData.Length;
unsafe
{
fixed(byte* rp = ResultImageData, cp = CurrentImageData, ap = AlphaImageData)
{
byte* r = rp;
byte* c = cp;
byte* a = ap;
while (size > 0)
{
*r = (byte)(*c * AlphaValue + *a * OneMinusAlphaValue);
r++;
c++;
a++;
size--;
}
}
}
Edit:
Fixed variables can't be changed, so I added code to copy the pointers to new pointers that can be changed.

These are all independent calculations so if you have a multicore CPU you should be able to gain some benefit by parallelizing the calculation. Note that you'd need to keep the threads around and just hand them work to do since the overhead of thread creation would probably make this slower rather than faster if the threads are recreated each time.
The other thing that may work is farming the work off to the graphics processor. Look at this question for some ideas, for example, using Accelerator.

An option would be to use unsafe code: fixing the array in memory and using pointer operations. I doubt the speed increase will be that dramatic though.
One note: how are you timing? If you are using DateTime then be aware that this class has poor resolution. You should add an outer loop and repeat the operation say ten times -- I bet the result is less than 110ms.
for (int outer = 0; outer < 10; ++outer)
{
for (int x = 0; x < xSize; x++)
{
for (int y = 0; y < ySize; y++)
{
ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
(AlphaImageData[x, y] * OneMinusAlphaValue));
}
}
}

Since it appears that each cell in the matrix is calculated entirely independently of the others, you may want to look into having more than one thread handle this. To avoid the cost of creating threads you could use a thread pool.
If the matrix is of sufficient size, it could be a very nice speed gain. On the other hand, if it is too small, it may not help (even hurt). Worth a try though.
An example (pseudo code) could be like this:
void process(int x, int y) {
ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
(AlphaImageData[x, y] * OneMinusAlphaValue));
}
ThreadPool pool(3); // 3 threads big
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++) {
for (int y = 0; y < ySize; y++) {
pool.schedule(x, y); // this will add all tasks to the pool's work queue
}
}
pool.waitTilFinished(); // wait until all scheduled tasks are complete
EDIT: Michael Meadows mentioned in a comment that plinq may be a suitable alternative: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
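
In C#, one way to express the idea without managing a pool by hand is Parallel.For from System.Threading.Tasks, which schedules work on the runtime's thread pool. The sketch below hands out one row index at a time rather than one pixel, to keep the per-task overhead small. ParallelBlend is a made-up name, and treating the alpha value as a double is an assumption, since the question does not show its type.

```csharp
using System;
using System.Threading.Tasks;

// Illustration only: row-wise parallel version of the blending loop.
public static class ParallelBlend
{
    public static byte[,] Blend(byte[,] current, byte[,] alphaImage, double alpha)
    {
        int xSize = current.GetLength(0);
        int ySize = current.GetLength(1);
        double oneMinusAlpha = 1.0 - alpha;
        var result = new byte[xSize, ySize];

        // Each x index is an independent unit of work; no two tasks
        // write to the same element, so no locking is needed.
        Parallel.For(0, xSize, x =>
        {
            for (int y = 0; y < ySize; y++)
            {
                result[x, y] = (byte)(current[x, y] * alpha + alphaImage[x, y] * oneMinusAlpha);
            }
        });
        return result;
    }
}
```

Whether this beats the single-threaded loop for a 768 x 576 image would have to be measured, as the answer says; for small images the scheduling overhead can outweigh the gain.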

I'd recommend running a few empty tests to figure out what your theoretical bounds are. For example, take out the calculation from inside the loop and see how much time is saved. Try replacing the double loop with a single loop that runs the same number of times and see how much time that saves. Then you can be sure you are going down the right path for optimization (the two paths I see are flattening the double loop into a single loop and working with the multiplication [maybe using a lookup table would be faster]).

Just real quick, you can get an optimization by looping in reverse and comparing against 0. Most CPUs have a fast op for comparison to 0.
E.g.
int xSize = ResultImageData.GetLength(0) -1;
int ySize = ResultImageData.GetLength(1) -1; //minor optimization suggested by commenter
for (int x = xSize; x >= 0; --x)
{
for (int y = ySize; y >=0; --y)
{
ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
(AlphaImageData[x, y] * OneMinusAlphaValue));
}
}
See http://dotnetperls.com/Content/Decrement-Optimization.aspx

You are probably suffering from bounds checking. Like Jon Skeet states, a jagged array instead of a multidimensional one (that is, data[][] instead of data[,]) will be faster, strange as that may seem.
The compiler will optimize
for (int i = 0; i < data.Length; i++)
by eliminating the per-element range check. But that is a special case; it won't do the same for GetLength().
For the same reason, caching or hoisting the Length property (putting it in a variable like xSize) also used to be a bad thing, though I haven't been able to verify that with Framework 3.5.

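Try swapping the x and y for loops if your data is laid out row by row (x contiguous), for a more linear memory access pattern and thus fewer cache misses. Note that a C# rectangular array indexed as [x, y] stores the last index contiguously, so for arrays declared that way the original loop order is already the linear one. The swapped version looks like this: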
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int y = 0; y < ySize; y++)
{
for (int x = 0; x < xSize; x++)
{
ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
(AlphaImageData[x, y] * OneMinusAlphaValue));
}
}

If you are using LockBits to get at the image buffer, you should loop through y in the outer loop and x in the inner loop as that is how it is stored in memory (by row, not column). I would say that 11ms is pretty darn fast though...

Does the image data have to be stored in a multi-dimensional (rectangular) array? If you use jagged arrays instead, you may well find the JIT has more optimizations available (including removing the bounds checking).

If CurrentImageData and/or AlphaImageData don't change every time you run your code snippet, you could store the product prior to running the code snippet you show and avoid that multiplication in your loops.
Edit: Another thing I just thought of: Sometimes int operations are quicker than byte operations. Offset this with your processor cache utilization (you'll increase the data size considerably and stand a greater risk of a cache miss).

With 442,368 additions and 884,736 multiplications for the calculation, I would think 11ms is actually extremely slow on a modern CPU.
While I don't know much about the specifics of .NET, I do know high-speed calculation is not its strong suit. In the past I've built Java apps with similar problems, and I've always used C libraries to do the image/audio processing.
Coming from a hardware perspective, you want to make sure the memory accesses are sequential, that is, step through the buffer in the order it exists in memory. You may also need to reorder this so that the compiler takes advantage of available instructions such as SIMD. How to approach this will end up being dependent on your compiler, and I can't help on VS.NET.
On an embedded DSP I would break out
(AlphaImageData[x, y] * OneMinusAlphaValue) and (CurrentImageData[x, y] * AlphaValue) and use SIMD instructions to calculate buffers, possibly in parallel, before performing the addition, perhaps doing small enough chunks to keep the buffers in cache on the CPU.
I believe anything you do will require more direct access to the memory/CPU than .NET allows.

You may also want to take a look at the Mono runtime and its Simd extensions. Perhaps some of your calculations can make use of the SSE acceleration as I gather that you basically do vector calculations (I don't know up to which vector size there is acceleration for multiplication but there is for some sizes)
(Blog post announcing Mono.Simd: http://tirania.org/blog/archive/2008/Nov-03.html)
Of course, that wouldn't work on Microsoft .NET but maybe you are interested in some experimentation.

Interestingly, image data is frequently pretty similar, meaning that the calculations are likely very repetitive. Have you explored using a lookup table for the calculations? Any time 0.8 is multiplied by 128, you would simply look up value[80,128], which you've precalculated as 102.4. You're basically trading memory space for CPU speed, but it could work for you.
Of course, if your image data has too high a resolution (and goes to too significant a digit), this may not be practical.
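
Because the inputs are bytes and the alpha value is fixed for a whole frame, the table only needs 256 entries per factor. The sketch below is one possible shape of that idea; LookupBlend is a hypothetical name, the images are assumed to be flattened into one-dimensional byte arrays, and the alpha value is assumed to be a double.

```csharp
using System;

// Illustration only: per-frame lookup tables replace the two multiplications.
public static class LookupBlend
{
    public static byte[] Blend(byte[] current, byte[] alphaImage, double alpha)
    {
        double oneMinusAlpha = 1.0 - alpha;

        // Precompute v * alpha and v * (1 - alpha) for every possible byte value.
        var currentTable = new double[256];
        var alphaTable = new double[256];
        for (int v = 0; v < 256; v++)
        {
            currentTable[v] = v * alpha;
            alphaTable[v] = v * oneMinusAlpha;
        }

        var result = new byte[current.Length];
        for (int i = 0; i < current.Length; i++)
        {
            // Two table reads and one addition per pixel.
            result[i] = (byte)(currentTable[current[i]] + alphaTable[alphaImage[i]]);
        }
        return result;
    }
}
```

The tables have to be rebuilt whenever the alpha value changes, which costs 512 multiplications per rebuild. Whether the table reads actually beat two multiplications on a given CPU is something to measure rather than assume.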
