Why is this faster on 64 bit than 32 bit? - c#

I've been doing some performance testing, mainly so I can understand the difference between iterators and simple for loops. As part of this I created a simple set of tests and was then totally surprised by the results. For some methods, 64 bit was nearly 10 times faster than 32 bit.
What I'm looking for is some explanation for why this is happening.
[The answer below states this is due to 64 bit arithmetic in a 32 bit app. Changing the longs to ints results in good performance on 32 and 64 bit systems.]
Here are the 3 methods in question.
private static long ForSumArray(long[] array)
{
    var result = 0L;
    for (var i = 0L; i < array.LongLength; i++)
    {
        result += array[i];
    }
    return result;
}

private static long ForSumArray2(long[] array)
{
    var length = array.LongLength;
    var result = 0L;
    for (var i = 0L; i < length; i++)
    {
        result += array[i];
    }
    return result;
}

private static long IterSumArray(long[] array)
{
    var result = 0L;
    foreach (var entry in array)
    {
        result += entry;
    }
    return result;
}
I have a simple test harness that tests this
var repeat = 10000;
var arrayLength = 100000;
var array = new long[arrayLength];
for (var i = 0; i < arrayLength; i++)
{
    array[i] = i;
}
Console.WriteLine("For: {0}", AverageRunTime(repeat, () => ForSumArray(array)));
repeat = 100000;
Console.WriteLine("For2: {0}", AverageRunTime(repeat, () => ForSumArray2(array)));
Console.WriteLine("Iter: {0}", AverageRunTime(repeat, () => IterSumArray(array)));
private static TimeSpan AverageRunTime(int count, Action method)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (var i = 0; i < count; i++)
    {
        method();
    }
    stopwatch.Stop();
    var average = stopwatch.Elapsed.Ticks / count;
    return new TimeSpan(average);
}
When I run these, I get the following results:
32 bit:
For: 00:00:00.0006080
For2: 00:00:00.0005694
Iter: 00:00:00.0001717
64 bit:
For: 00:00:00.0007421
For2: 00:00:00.0000814
Iter: 00:00:00.0000818
One thing I read from this is that using LongLength is slow. If I use array.Length, performance for the first for loop is pretty good in 64 bit, but not in 32 bit.
The other thing I read from this is that iterating over an array is as efficient as a for loop, and the code is much cleaner and easier to read!
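For reference, the int-indexed variant mentioned above would look something like this (a sketch of the "ints + array.Length" change described in the edit at the top; same summing logic as ForSumArray):

private static long ForSumArrayInt(long[] array)
{
    var result = 0L;
    // an int counter with array.Length lets the JIT recognize the pattern
    // and elide the per-access bounds checks
    for (var i = 0; i < array.Length; i++)
    {
        result += array[i];
    }
    return result;
}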

x64 processors have 64-bit general-purpose registers, so they can perform operations on 64-bit integers in a single instruction. 32-bit processors do not. This is especially relevant to your program, as it makes heavy use of long (64-bit integer) variables.
For instance, in x64 assembly, to add a couple 64 bit integers stored in registers, you can simply do:
; adds rbx to rax
add rax, rbx
To do the same operation on a 32 bit x86 processor, you'll have to use two registers and manually use the carry of the first operation in the second operation:
; adds ecx:ebx to edx:eax
add eax, ebx
adc edx, ecx
More instructions and fewer registers mean more clock cycles, more memory fetches, and so on, which ultimately results in reduced performance. The difference is especially notable in number-crunching applications.
For .NET applications, it seems that the 64-bit JIT compiler performs more aggressive optimizations improving overall performance.
Regarding your point about array iteration, the C# compiler is clever enough to recognize foreach over arrays and treat it specially. The generated code is identical to using a for loop, and it's recommended that you use foreach if you don't need to change the array element in the loop. Besides that, the runtime recognizes the pattern for (int i = 0; i < a.Length; ++i) and omits the bounds checks for array accesses inside the loop. This does not happen in the LongLength case, which results in decreased performance (in both the 32-bit and 64-bit cases); and since you'll be using long variables with LongLength, the 32-bit performance degrades even more.

The long datatype is 64 bits wide, and in a 64-bit process it is handled as a single native-length unit. In a 32-bit process it is treated as two 32-bit units. Math on these "split" types, in particular, will be processor-intensive.

Not sure of "why" but I would make sure to call your "method" at least once outside your timer loop so you're not counting 1st-time jitting. (Since this looks like C# to me).
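For illustration, here is a sketch of how the AverageRunTime harness above could exclude that first-time JIT cost (a hypothetical variant, not the poster's code):

private static TimeSpan AverageRunTime(int count, Action method)
{
    method(); // warm-up call: force JIT compilation before timing starts
    var stopwatch = Stopwatch.StartNew();
    for (var i = 0; i < count; i++)
    {
        method();
    }
    stopwatch.Stop();
    return new TimeSpan(stopwatch.Elapsed.Ticks / count);
}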

Oh, that's easy.
I assume that you are running on x86. What do you need to run these loops in assembler?
One index variable i
One result variable result
A long array to sum over.
So you need three variables. Variable access is fastest if you can keep them in registers; if you need to move them in and out of memory, you lose speed.
For 64-bit longs you need two registers each on 32 bit, and x86 has only a handful of general-purpose registers, so chances are high that not all variables can be kept in registers and some must spill to intermediate storage such as the stack. That alone slows down access considerably.
Addition of numbers:
Addition must be done twice: the first time without the carry bit and the second time with it. A 64-bit processor can do it in one instruction.
Moving/Loading:
For every one-cycle load or store of a 64-bit variable, the 32-bit code needs two cycles to move a long integer to or from memory.
Every composite datatype (a datatype consisting of more bits than the register/address width) loses considerable speed. Order-of-magnitude speed gains like this are the reason GPUs still prefer floats (32 bit) over doubles (64 bit).

As others have said, doing 64-bit arithmetic on a 32-bit machine is going to take some extra manipulation, more so for multiplication or division.
Back to your concern about iterators vs. simple for loops: iterators can have fairly complex definitions, and they will only be fast if inlining and compiler optimization are capable of replacing them with the equivalent simple form. It really depends on the type of iterator and the underlying container implementation. The simplest way to tell whether it has been optimized reasonably well is to examine the generated assembly code. Another way is to put it in a long-running loop, pause it, and look at the stack to see what it's doing.

Related

Reading time of arrays with an equal number of elements but different dimensions

I have 3 arrays that hold integer values: a 4-dimensional array, a 2-dimensional array, and a single-dimensional array, but the total number of elements is the same for each. I'm going to print all the elements in these arrays to the console. Which one prints the fastest? Or are the printing times equal?
int[,,,] Q = new int[4, 4, 4, 4];
int[,] W = new int[16,16];
int[] X = new int[256];
Unless I'm missing something, there are two main ways you could be iterating over the multi-dimensional arrays.
The first is:
int[,] W = new int[16, 16];
for (int i = 0; i < 16; i++)
{
    for (int j = 0; j < 16; j++)
        Console.WriteLine(W[i, j]);
}
This method is slower than iterating over the single-dimensional array, as the only difference is that for every 16 members you need to start a new iteration of the outer loop and re-initialize the inner loop.
The second is:
for (int i = 0; i < 256; i++)
{
    Console.WriteLine(W[i / 16, i % 16]);
}
This method is slower because every iteration you need to calculate both (i / 16) and (i % 16).
Ignoring the iteration factor, there is also the time it takes to access another pointer every iteration.
To the best of my knowledge of boolean functions*, given two pairs of integers, one pair containing bigger numbers but both having the same size in memory (as is the case for all numbers of type int in C#), the time to compute the addition of each pair is exactly the same (in terms of clock ticks). Consequently, the time for calculating the address of an array member does not depend on how big its index is.
So to summarize, unless I'm missing something or I'm way rustier than I think, there is one factor that is guaranteed to lengthen the time it takes to iterate over multidimensional arrays (the extra pointer accesses), another factor that is guaranteed to do the same but for which you can choose one of two forms (multiple loops, or additional calculations on every iteration of the loop), and there are no factors that would slow down the single-dimensional approach (no "tax" for an extra-large index).
CONCLUSIONS:
That makes it two factors working for a single-dimensional array, and none for a multi-dimensional one.
Thus, I would assume the single-dimensional array would be faster
That being said, you're using C#, so you're probably not really looking for that insignificant an edge, or you'd be using a low-level language. And if you are, you should probably either switch to a low-level language or really contemplate whether you are doing whatever it is you're trying to do in the best way possible (the only case I can think of where this could make an actual difference is if you load a database of a million-plus records into your code, and that's really bad practice).
However, if you're just starting out in C# then you're probably just overthinking it.
Whichever it is, this was a fun hypothetical, so thanks for asking it!
*by boolean functions, I mean functions at the binary level, not C# functions returning a bool value
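As an aside to the two loop shapes compared above, C# also allows foreach over a rectangular array; it visits every element in row-major order without any explicit index arithmetic. Whether it beats either option is something to measure, so treat this as a sketch only:

// W is the 16x16 rectangular array from the examples above
foreach (var value in W)
{
    Console.WriteLine(value);
}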

Most efficient way to store and retrieve a 512-bit number?

I have a string of 512 characters that contains only 0 and 1. I'm trying to represent it in a data structure that saves space. Is BitArray the most efficient way?
I'm also thinking about using 16 int32 values to store the number, which would then be 16 * 4 = 64 bytes.
Most efficient can mean many different things...
Most efficient from a memory management perspective?
Most efficient from a CPU calculation perspective?
Most efficient from a usage perspective? (In respect to writing code that uses the numbers for calculations)
For 1 - use byte[64] or long[8] - if you aren't doing calculations or don't mind writing your own calculations.
For 3 definitely BigInteger is the way to go. You have your math functions already defined and you just need to turn your binary number into a decimal representation.
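For the BigInteger route, here is a minimal sketch of turning the 0/1 string into a number (the helper name is made up, and it assumes the first character of the string is the most significant bit):

using System.Numerics;

static BigInteger FromBinaryString(string bits)
{
    var value = BigInteger.Zero;
    foreach (var c in bits)
    {
        // shift the accumulated value left and bring in the next bit
        value = (value << 1) | (c == '1' ? BigInteger.One : BigInteger.Zero);
    }
    return value;
}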
EDIT: It sounds like you don't want BigInteger due to size concerns... however, I think you'll find that you will of course have to parse this with an enumerable/yield combination, where you parse it a bit at a time and don't hold the entire data structure in memory at once.
That being said... I can help you somewhat with parsing your string into arrays of Int64s... Thanks to King King for part of this LINQ statement.
// convert string into an array of int64's
// Note that MSB is in result[0]
var result = input.Select((x, i) => i)
                  .Where(i => i % 64 == 0)
                  .Select(i => input.Substring(i, input.Length - i >= 64 ? 64 : input.Length - i))
                  .Select(x => Convert.ToUInt64(x, 2))
                  .ToArray();
If you decide you want a different array structure byte[64] or whatever it should be easy to modify.
EDIT 2: OK I got bored so I wrote an EditDifference function for fun... here you go...
static public int GetEditDistance(ulong[] first, ulong[] second)
{
    int editDifference = 0;
    var smallestArraySize = Math.Min(first.Length, second.Length);
    for (var i = 0; i < smallestArraySize; i++)
    {
        long signedDifference;
        var f = first[i];
        var s = second[i];
        var biggest = Math.Max(f, s);
        var smallest = Math.Min(f, s);
        var difference = biggest - smallest;
        if (difference > long.MaxValue)
        {
            editDifference += 1;
            signedDifference = Convert.ToInt64(difference - long.MaxValue - 1);
        }
        else
        {
            signedDifference = Convert.ToInt64(difference);
        }
        editDifference += Convert.ToString(signedDifference, 2)
                                 .Count(x => x == '1');
    }

    // if arrays are different sizes every bit is considered to be different
    var differenceOfArraySize =
        Math.Max(first.Length, second.Length) - smallestArraySize;
    if (differenceOfArraySize > 0)
        editDifference += differenceOfArraySize * 64;

    return editDifference;
}
Use BigInteger from .NET. It can easily support 512-bit numbers as well as operations on those numbers.
BigInteger.Parse("your huge number");
BitArray (with 512 bits), byte[64], int[16], long[8] (or List<> variants of those), or BigInteger will all be much more efficient than your String. I'd say that byte[] is the most idiomatic/typical way of representing data such as this, in general. For example, ComputeHash uses byte[] and Streams deal with byte[]s, and if you store this data as a BLOB in a DB, byte[] will be the most natural way to work with that data. For that reason, it'd probably make sense to use this.
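For instance, here is a sketch of packing the 0/1 string into a byte[64], assuming the first character should become the most significant bit of the first byte (input is the string from the question):

var bytes = new byte[input.Length / 8];          // 64 bytes for 512 bits
for (var i = 0; i < input.Length; i++)
{
    if (input[i] == '1')
    {
        bytes[i / 8] |= (byte)(0x80 >> (i % 8)); // MSB-first within each byte
    }
}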
On the other hand, if this data represents a number that you might do numeric things to like addition and subtraction, you probably want to use a BigInteger.
These approaches have roughly the same performance as each other, so you should choose between them based primarily on things like what makes sense, and secondarily on performance benchmarked in your usage.
The most efficient would be eight UInt64/ulong or Int64/long variables (or a single array of them), although this might not be optimal for querying and setting individual bits. One way to get around that is, indeed, to use a BitArray (which is basically a wrapper around the former approach, with some additional overhead [1]). It's a matter of choosing between ease of use and efficient storage.
If this isn't sufficient, you can always choose to apply compression, such as RLE-encoding or various other widely available encoding methods (gzip/bzip/etc...). This will require additional processing power though.
It depends on your definition of efficient.
[1] Additional overhead, as in storage overhead: BitArray internally uses an Int32 array to store values. On top of that, BitArray stores its current mutation version, the number of ints 'allocated', and a sync root. Even though the overhead is negligible for a small number of values, it can be an issue if you keep a lot of these in memory.
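To illustrate the BitArray route, a short sketch going from the 0/1 string to a BitArray and then to the packed Int32s it wraps (input is the string from the question; note that BitArray's own bit ordering may not match how you read the string):

using System.Collections;   // BitArray
using System.Linq;

var bits = new BitArray(input.Select(c => c == '1').ToArray());

var packed = new int[(bits.Length + 31) / 32];   // 16 ints for 512 bits
bits.CopyTo(packed, 0);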

best algorithm to reconcile 3 lists

I am looking for a way to reconcile elements from 3 different sources. I've simplified the elements to having just a key (string) and a version (long).
The lists are obtained concurrently (2 from separate database queries, and 1 from a memory cache on another system).
For my end result, I only care about elements whose versions are not identical across all 3 sources. So the result I care about would be a list of keys, with the corresponding versions in each system.
Element1 | system1:v100 | system2:v100 | system3:v101 |
Element2 | system1:missing | system2:v200 | system3:v200 |
The elements with identical versions can be discarded.
The 2 ways of achieving this I thought of are:
wait for all data sources to finish retrieving, and then loop through each list to aggregate a master list with a union of keys + all 3 versions (discarding all identical items).
as soon as the first list is done being retrieved, put it into a concurrent collection such as a dictionary (offered in .NET 4.0), and start aggregating the remaining lists (into the concurrent collection) as soon as they are available.
My thinking is that the second approach will be a little quicker, but probably not by much. I can't really do much until all 3 sources are present, so not much is gained from the second approach, and contention is introduced.
Maybe there is a completely different way to go about this? Also, since versions are stored as longs and there will be hundreds of thousands (possibly millions) of elements, memory allocation could be a concern (though probably not a big one, since these objects are short-lived).
HashSet is an option, as it has Union and Intersect methods:
HashSet.UnionWith Method
To use this you must override Equals and GetHashCode.
A good (unique) hash is key to performance.
If the version is always a "v" followed by a number, you could use the numeric part to build the hash, with missing treated as 0.
You have an Int32 to play with, so if each version fits in 10 bits or fewer you can create a perfect hash (three versions pack into 30 bits).
Another option is ConcurrentDictionary (there is no concurrent HashSet) and have all three sources feed into it.
You still need to override Equals and GetHashCode.
My gut feel is that three HashSets and then a Union would be faster.
If all versions are numeric and you can use 0 for missing, then you could just pack them into a UInt32 or UInt64 and put that directly in a HashSet, then unpack after the Union. Use bit shifting (<<) rather than math to pack and unpack.
The example below packs just two UInt16 values, but it runs in 2 seconds.
This is going to be faster than hashing classes.
If all three versions are long, then HashSet<integral type> will not be an option.
long1 ^ long2 ^ long3 might be a good hash, but that is not my area of expertise.
I know GetHashCode on a Tuple is bad.
class Program
{
    static void Main(string[] args)
    {
        HashSetComposite hsc1 = new HashSetComposite();
        HashSetComposite hsc2 = new HashSetComposite();
        for (UInt16 i = 0; i < 100; i++)
        {
            for (UInt16 j = 0; j < 40000; j++)
            {
                hsc1.Add(i, j);
            }
            for (UInt16 j = 20000; j < 60000; j++)
            {
                hsc2.Add(i, j);
            }
        }
        Console.WriteLine(hsc1.Intersect(hsc2).Count().ToString());
        Console.WriteLine(hsc1.Union(hsc2).Count().ToString());
    }
}

public class HashSetComposite : HashSet<UInt32>
{
    public void Add(UInt16 u1, UInt16 u2)
    {
        UInt32 unsignedKey = (((UInt32)u1) << 16) | u2;
        Add(unsignedKey);
    }

    // left over notes from long
    //ulong unsignedKey = (long) key;
    //uint lowBits = (uint) (unsignedKey & 0xffffffffUL);
    //uint highBits = (uint) (unsignedKey >> 32);
    //int i1 = (int) highBits;
    //int i2 = (int) lowBits;
}
I tested using a ConcurrentDictionary, and the above was over twice as fast. Taking locks on the inserts is expensive.
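For completeness, here is a sketch of the "override Equals and GetHashCode" route mentioned above (the field layout is made up, and the XOR hash just follows the long1 ^ long2 ^ long3 suggestion; it is not tuned):

using System;

public readonly struct Element : IEquatable<Element>
{
    public readonly string Key;
    public readonly long Version1, Version2, Version3;   // one slot per system

    public Element(string key, long v1, long v2, long v3)
    {
        Key = key; Version1 = v1; Version2 = v2; Version3 = v3;
    }

    public bool Equals(Element other) =>
        Key == other.Key && Version1 == other.Version1 &&
        Version2 == other.Version2 && Version3 == other.Version3;

    public override bool Equals(object obj) => obj is Element e && Equals(e);

    public override int GetHashCode() =>
        (Key?.GetHashCode() ?? 0) ^ (Version1 ^ Version2 ^ Version3).GetHashCode();
}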
Your problem seems suitable for an event-based solution. Basically, assign an event for the completion of each of your data sources, and keep a global concurrent hash keyed by element (mapping to the list of versions seen so far). In each event handler, go over the completed data source; if the concurrent hash already contains the key for the current element, add the version to its list, and if not, insert a new list with the given element.
But depending on your performance requirements this may overcomplicate your application. Your first method would be the simplest one to use.
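For what it's worth, a rough sketch of the aggregate-into-one-map idea from the question (the names and shapes here are made up; each source writes its version into a ConcurrentDictionary keyed by element key, and identical rows are filtered out at the end):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

var versions = new ConcurrentDictionary<string, long?[]>();

// call once per source (systemIndex 0..2), possibly from different threads
void Merge(int systemIndex, IEnumerable<KeyValuePair<string, long>> source)
{
    foreach (var item in source)
    {
        var slots = versions.GetOrAdd(item.Key, _ => new long?[3]);
        slots[systemIndex] = item.Value;   // missing versions stay null
    }
}

// after all three sources have been merged, keep only mismatched elements
var mismatched = versions.Where(kv => kv.Value.Distinct().Count() > 1).ToList();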

Fast little-endian to big-endian conversion in ASM

I have an array of uint types in C#. After checking whether the program is running on a little-endian machine, I want to convert the data to big-endian. Because the amount of data can become very large but is always even, I was thinking of treating two uint values as a ulong for better performance and programming it in ASM, so I am searching for a very fast (the fastest possible) assembler algorithm to convert little-endian to big-endian.
For a large amount of data, the bswap instruction (available in Visual C++ under the _byteswap_ushort, _byteswap_ulong, and _byteswap_uint64 intrinsics) is the way to go. This will even outperform handwritten assembly. These are not available in pure C# without P/Invoke, so:
Only use this if you have a lot of data to byte swap.
You should seriously consider writing your lowest level application I/O in managed C++ so you can do your swapping before ever bringing the data into a managed array. You already have to write a C++ library, so there's not much to lose and you sidestep all the P/Invoke-related performance issues for low-complexity algorithms operating on large datasets.
PS: Many people are unaware of the byte swap intrinsics. Their performance is astonishing, doubly so for floating point data because it processes them as integers. There is no way to beat it without hand coding your register loads for every single byte swap use case, and should you try that, you'll probably incur a bigger hit in the optimizer than you'll ever pick up.
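As a side note for newer .NET runtimes (a later addition, not something available when this answer was written): System.Buffers.Binary.BinaryPrimitives.ReverseEndianness is pure managed code that the JIT recognizes and typically compiles down to a single byte-swap instruction, so no P/Invoke is needed. A sketch:

using System.Buffers.Binary;

static void SwapAll(uint[] data)
{
    for (var i = 0; i < data.Length; i++)
    {
        data[i] = BinaryPrimitives.ReverseEndianness(data[i]);
    }
}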
You may want to simply rethink the problem; this should not be a bottleneck. Take the naive algorithm (written in CLI assembly, just for fun). Let's assume the number we want is in local variable 0:
LDLOC 0
LDC.i4 24
SHL              // x << 24
LDLOC 0
LDC.i4 0x0000ff00
AND
LDC.i4 8
SHL              // (x & 0x0000ff00) << 8
OR
LDLOC 0
LDC.i4 0x00ff0000
AND
LDC.i4 8
SHR.UN           // (x & 0x00ff0000) >> 8
OR
LDLOC 0
LDC.i4 24
SHR.UN           // x >> 24
OR
Once JIT-compiled, that's at most about 13 x86 instructions per number (and most likely the JIT will be even smarter about register use). It doesn't get more naive than that.
Now, compare that to the costs of
Getting the data loaded in (including whatever peripherals you are working with!)
Manipulating the data (doing comparisons, for instance)
Outputting the result (whatever it is)
If 13 instructions per number is a significant chunk of your execution time, then you are doing a VERY high-performance task and should have your input in the correct format! You also probably would not be using a managed language, because you would want far more control over data buffers and whatnot, and no extra array bounds checking.
If that array of data comes across a network, I would expect the cost of managing sockets to be much greater than that of a mere byte-order flip; if it comes from disk, consider pre-flipping it before running this program.
"I was thinking to consider two uint types as an ulong type"
Well, that would also swap the two uint values, which might not be desirable...
You could try some C# code in unsafe mode, which may actually perform well enough. Like:
public static unsafe void SwapInts(uint[] data) {
    int cnt = data.Length;
    fixed (uint* d = data) {
        byte* p = (byte*)d;
        while (cnt-- > 0) {
            // reverse the four bytes of one uint in place
            byte a = *p;
            p++;
            byte b = *p;
            *p = *(p + 1);
            p++;
            *p = b;
            p++;
            *(p - 3) = *p;
            *p = a;
            p++;          // p now points at the next uint
        }
    }
}
On my computer the throughput is around 2 GB per second.
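A quick usage check of the routine above (values chosen arbitrarily):

var data = new uint[] { 0x12345678, 0xCAFEBABE };
SwapInts(data);
// data is now { 0x78563412, 0xBEBAFECA }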

Is shifting bits faster than multiplying and dividing in Java? .NET? [closed]

Shifting bits left and right is apparently faster than multiplication and division operations on most, maybe even all, CPUs if you happen to be using a power of 2. However, it can reduce the clarity of code for some readers and some algorithms. Is bit-shifting really necessary for performance, or can I expect the compiler or VM to notice the case and optimize it (in particular, when the power-of-2 is a literal)? I am mainly interested in the Java and .NET behavior but welcome insights into other language implementations as well.
Almost any environment worth its salt will optimize this away for you. And if it doesn't, you've got bigger fish to fry. Seriously, do not waste one more second thinking about this. You will know when you have performance problems. And after you run a profiler, you will know what is causing it, and it should be fairly clear how to fix it.
You will never hear anyone say "my application was too slow, then I started randomly replacing x * 2 with x << 1 and everything was fixed!" Performance problems are generally solved by finding a way to do an order of magnitude less work, not by finding a way to do the same work 1% faster.
Most compilers today will do more than convert multiply or divide by a power-of-two to shift operations. When optimizing, many compilers can optimize a multiply or divide with a compile time constant even if it's not a power of 2. Often a multiply or divide can be decomposed to a series of shifts and adds, and if that series of operations will be faster than the multiply or divide, the compiler will use it.
For division by a constant, the compiler can often convert the operation to a multiply by a 'magic number' followed by a shift. This can be a major clock-cycle saver since multiplication is often much faster than a division operation.
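As an illustration of the "multiply by a magic number, then shift" idea, here is one commonly cited form for signed 32-bit division by 10 (a sketch of the transformation with a made-up helper name, not necessarily the exact code any particular compiler emits):

// computes x / 10 without a division instruction
static int DivideBy10(int x)
{
    var q = (int)(((long)x * 0x66666667L) >> 34);
    return q - (x >> 31);   // adjust so negative values truncate toward zero
}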
Henry Warren's book, Hacker's Delight, has a wealth of information on this topic, which is also covered quite well on the companion website:
http://www.hackersdelight.org/
See also a discussion (with a link or two) in:
Reading assembly code
Anyway, all this boils down to allowing the compiler to take care of the tedious details of micro-optimizations. It's been years since doing your own shifts outsmarted the compiler.
Humans are wrong in these cases:
99% of the time when they try to second-guess modern (and all future) compilers.
99.9% of the time when they try to second-guess modern (and all future) JITs at the same time.
99.999% of the time when they try to second-guess modern (and all future) CPU optimizations.
Program in a way that accurately describes what you want to accomplish, not how to do it. Future versions of the JIT, VM, compiler, and CPU can all be independently improved and optimized. If you specify something so tiny and specific, you lose the benefit of all future optimizations.
You can almost certainly depend on the literal-power-of-two multiplication optimisation to a shift operation. This is one of the first optimisations that students of compiler construction will learn. :)
However, I don't think there's any guarantee for this. Your source code should reflect your intent, rather than trying to tell the optimiser what to do. If you're making a quantity larger, use multiplication. If you're moving a bit field from one place to another (think RGB colour manipulation), use a shift operation. Either way, your source code will reflect what you are actually doing.
Note that shifting right and dividing will (in Java, certainly) give different results for negative odd numbers.
int a = -7;
System.out.println("Shift: "+(a >> 1));
System.out.println("Div: "+(a / 2));
Prints:
Shift: -4
Div: -3
Since Java doesn't have any unsigned numbers it's not really possible for a Java compiler to optimise this.
On the computers I tested, integer division is 4 to 10 times slower than other operations.
While compilers may replace division by a power of 2 so that you see no difference, division by anything that is not a power of 2 is significantly slower.
For example, I have a (graphics) program with many, many divisions by 255.
Actually my computation is:
r = (((top.R - bottom.R) * alpha + (bottom.R * 255)) * 0x8081) >> 23;
I can assure you that it is a lot faster than my previous computation:
r = ((top.R - bottom.R) * alpha + (bottom.R * 255)) / 255;
So no, compilers cannot do all the optimization tricks for you.
I would ask "what are you doing that it would matter?". First design your code for readability and maintainability. The likelyhood that doing bit shifting verses standard multiplication will make a performance difference is EXTREMELY small.
It is hardware dependent. If we are talking about a microcontroller or an i386, then shifting might be faster, but, as several answers state, your compiler will usually do the optimization for you.
On modern (Pentium Pro and beyond) hardware, pipelining makes this totally irrelevant, and straying from the beaten path usually means you lose a lot more optimizations than you can gain.
Micro optimizations are not only a waste of your time, they are also extremely difficult to get right.
If the compiler (compile-time constant) or JIT (runtime constant) knows that the divisor or multiplicand is a power of two and integer arithmetic is being performed, it will convert it to a shift for you.
According to the results of this microbenchmark, shifting is twice as fast as dividing (Oracle Java 1.7.0_72).
Most compilers will turn multiplication and division into bit shifts when appropriate. It is one of the easiest optimizations to do. So, you should do what is more easily readable and appropriate for the given task.
I am stunned as I just wrote this code and realized that shifting by one is actually slower than multiplying by 2!
(EDIT: changed the code to stop overflowing after Michael Myers' suggestion, but the results are the same! What is wrong here?)
import java.util.Date;

public class Test {
    public static void main(String[] args) {
        Date before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a *= 2;
            }
        }
        Date after = new Date();
        System.out.println("Multiplying " + (after.getTime() - before.getTime()) + " milliseconds");

        before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a = a << 1;
            }
        }
        after = new Date();
        System.out.println("Shifting " + (after.getTime() - before.getTime()) + " milliseconds");
    }
}
The results are:
Multiplying 639 milliseconds
Shifting 718 milliseconds
