I have an array of uint values in C#. After checking that the program is running on a little-endian machine, I want to convert the data to big-endian. Because the amount of data can become very large (but its length is always even), I was thinking of treating two uint values as a single ulong for better performance and programming it in assembler, so I am looking for a very fast (the fastest possible) assembler algorithm to convert little-endian to big-endian.
For a large amount of data, the bswap instruction (exposed in Visual C++ through the _byteswap_ushort, _byteswap_ulong, and _byteswap_uint64 intrinsics) is the way to go. It will even outperform handwritten assembly. These intrinsics are not available in pure C# without P/Invoke, so:
Only use this if you have a lot of data to byte swap.
You should seriously consider writing your lowest level application I/O in managed C++ so you can do your swapping before ever bringing the data into a managed array. You already have to write a C++ library, so there's not much to lose and you sidestep all the P/Invoke-related performance issues for low-complexity algorithms operating on large datasets.
PS: Many people are unaware of the byte-swap intrinsics. Their performance is astonishing, doubly so for floating-point data, because you can process it as integers. There is no way to beat them without hand-coding your register loads for every single byte-swap use case, and if you try that, you'll probably take a bigger hit in the optimizer than you'll ever gain back.
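For what it's worth, on newer runtimes (.NET Core 2.1 and later — an assumption beyond the tooling discussed above), the same effect is reachable from pure C# via System.Buffers.Binary.BinaryPrimitives.ReverseEndianness, which the JIT typically compiles down to that same bswap instruction. A minimal sketch:

using System;
using System.Buffers.Binary;

static class EndianHelper
{
    // Reverses the byte order of every element in place.
    public static void SwapToBigEndian(uint[] data)
    {
        // Big-endian hosts already match the target layout; nothing to do.
        if (!BitConverter.IsLittleEndian)
            return;

        for (int i = 0; i < data.Length; i++)
            data[i] = BinaryPrimitives.ReverseEndianness(data[i]);
    }
}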
You may want to simply rethink the problem; this should not be a bottleneck. Take the naive algorithm (written in CIL assembly, just for fun). Let's assume the number we want is in local variable 0:
ldloc.0
ldc.i4 24
shl              // x << 24
ldloc.0
ldc.i4 0x0000ff00
and
ldc.i4 8
shl              // (x & 0x0000ff00) << 8
or
ldloc.0
ldc.i4 0x00ff0000
and
ldc.i4 8
shr.un           // (x & 0x00ff0000) >> 8
or
ldloc.0
ldc.i4 24
shr.un           // x >> 24
or
Once JIT-compiled, that's roughly 13 x86 instructions per number (and the JIT will most likely be even smarter about register use). It doesn't get more naive than that.
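For readers who prefer C# to CIL, the same naive algorithm is just shifts, masks and ORs (a direct transcription of the listing above, nothing more):

static uint SwapBytes(uint x)
{
    // Move each byte of x to its mirrored position.
    return (x << 24)
         | ((x & 0x0000ff00u) << 8)
         | ((x & 0x00ff0000u) >> 8)
         | (x >> 24);
}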
Now, compare that to the costs of
Getting the data loaded in (including whatever peripherals you are working with!)
Manipulation of the data (doing comparisons, for instance)
Outputting the result (whatever it is)
If 13 instructions per number is a significant chunk of your execution time, then you are doing a VERY high-performance task and should have your input in the correct format! You also probably would not be using a managed language, because you would want far more control over data buffers and what-not, and no extra array bounds checking.
If that array of data comes across a network, I would expect the cost of managing the sockets to be much greater than a mere byte-order flip; if it's coming from disk, consider pre-flipping the data before this program runs.
"I was thinking to consider two uint types as an ulong type"
Well, that would also swap the two uint values, which might not be desirable...
You could try some C# code in unsafe mode, which may actually perform well enough. Like:
public static unsafe void SwapInts(uint[] data) {
    int cnt = data.Length;
    fixed (uint* d = data) {
        byte* p = (byte*)d;
        while (cnt-- > 0) {
            // Reverse the four bytes of the current uint in place:
            // swap p[0] with p[3] and p[1] with p[2].
            byte a = *p;        // a = p[0]
            p++;
            byte b = *p;        // b = p[1]
            *p = *(p + 1);      // p[1] = p[2]
            p++;
            *p = b;             // p[2] = old p[1]
            p++;
            *(p - 3) = *p;      // p[0] = p[3]
            *p = a;             // p[3] = old p[0]
            p++;                // move on to the next uint
        }
    }
}
On my computer the throughput is around 2 GB per second.
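A quick sanity check for the routine above (hypothetical usage, not part of the original answer):

uint[] data = { 0x11223344u, 0xAABBCCDDu };
SwapInts(data);
// data[0] == 0x44332211, data[1] == 0xDDCCBBAA
Console.WriteLine("{0:X8} {1:X8}", data[0], data[1]);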
Related
@Zbyl, I have seen your answer in this thread (Bit fields in C#) and I really like the BitVector32 method, but for the purpose of optimization: what if I have plenty of structures of a size of 8 bits / 12 bits (less than 32 bits)? Is there any way to do it with something smaller than BitVector32? Otherwise that would be a lot of exaggerated memory allocation that would never be used: I will only use the first 8 bits of a BitVector32.
Here is an example of the structure (in C) that I want to reproduce in C#:
struct iec_qualif
{
    unsigned char var :2;
    unsigned char res :2;
    unsigned char bl  :1; // blocked/not blocked
    unsigned char sb  :1; // substituted/not substituted
    unsigned char nt  :1; // not topical/topical
    unsigned char iv  :1; // valid/invalid
};
At first, I would consider the number of structs you are actually dealing with. Today's GB-memory-equipped hardware should easily cope with several thousand of your structs, even if you waste three bytes for each one (1,000,000 of your structs would then occupy 4 MB out of potentially more than 1000 MB available). With this in mind, you could even live without the bit field entirely and just use a normal byte for each field (your setters should then check the range, however), resulting in six bytes instead of four (and possibly two more for alignment), but giving you faster access to the values (getters), since no bit fiddling is necessary.
On the other hand: bit fields are, in the end, nothing more than a convenient way to let the compiler write code that you would otherwise have to write yourself. But with a little practice, handling the bits yourself is not too difficult a task; see this answer in the topic you referred to (the 'handmade accessors' part): store all the data internally in a single variable of type byte and access it via bit shifting and masking, as sketched below.
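As a minimal sketch of that 'handmade accessors' idea, applied to the iec_qualif layout above (my own illustration; names and bit positions are assumptions, adjust to your real protocol):

public struct IecQualif
{
    private byte _data; // bits 0-1: var, 2-3: res, 4: bl, 5: sb, 6: nt, 7: iv

    public byte Var
    {
        get { return (byte)(_data & 0x03); }
        set { _data = (byte)((_data & ~0x03) | (value & 0x03)); }
    }

    public byte Res
    {
        get { return (byte)((_data >> 2) & 0x03); }
        set { _data = (byte)((_data & ~(0x03 << 2)) | ((value & 0x03) << 2)); }
    }

    public bool Bl // blocked/not blocked (bit 4)
    {
        get { return (_data & (1 << 4)) != 0; }
        set { _data = (byte)(value ? _data | (1 << 4) : _data & ~(1 << 4)); }
    }

    // Sb, Nt and Iv follow the same pattern on bits 5, 6 and 7.
}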
Fifteen years ago, while programming in Pascal, I learned why powers of two are used for memory allocation. But this still seems to be state of the art.
C# Examples:
new StringBuilder(256);
new byte[1024];
int bufferSize = 1 << 12;
I still see this thousands of times, I use it myself, and I'm still questioning:
Do we need this in modern programming languages and on modern hardware?
I guess it's good practice, but what's the reason?
EDIT
For example, take a byte[] array: as stated by answers here, a power of 2 makes no sense, because the array object itself uses 16 bytes (?). So does it make sense to use 240 (= 256 - 16) for the size so that the total fits in 256 bytes?
Do we need this in modern programming languages and modern hardware? I guess its good practice, but what's the reason?
It depends. There are two things to consider here:
For sizes less than the memory page size, there's no appreciable difference between a power-of-two and an arbitrary number to allocate space;
You mostly use managed data structures with C#, so you won't even know how many bytes are really allocated underneath.
Assuming you're doing low-level allocation with malloc(), using multiples of the page size would be considered a good idea, e.g. 4096 or 8192; this is because it allows for more efficient memory management.
My advice would be to just allocate what you need and let C# handle the memory management and allocation for you.
Sadly, it's quite easy to get this wrong if you want to keep a block of memory within a single 4 KB memory page... and people don't even know it :-) (I didn't until 10 minutes ago... I only had a hunch.) An example... it's unsafe code and implementation-dependent (using .NET 4.5 at 32/64 bit):
byte[] arr = new byte[4096];

fixed (byte* p = arr)   // requires an unsafe context
{
    // Reads the int stored just before the element data: the CLR keeps the
    // array length there (offset -4 on 32 bit, -8 on 64 bit).
    int size = ((int*)p)[IntPtr.Size == 4 ? -1 : -2];
}
So the CLR has allocated at least 4096 + one or two sizeof(int) bytes... so it has spilled over a single 4 KB memory page. This is logical: it has to keep the size of the array somewhere, and keeping it together with the array is the most sensible choice (for those who know what Pascal strings and BSTRs are: yes, it's the same principle).
I'll add that every object in .NET has a syncblock number and a RuntimeType reference... they are at least int-sized, if not IntPtr-sized, for a total of between 8 and 16 bytes per object (this is explained in various places; try searching for ".NET object header" if you are interested).
It still makes sense in certain cases, but I would prefer to analyze case by case whether I need that kind of sizing, rather than blindly using it as good practice.
For example, there might be cases where you want to use exactly 8 bits of information (1 byte) to address a table.
In that case, I would let the table have the size of 2^8.
Object[] table = new Object[256];
By this, you will be able to address any object of the table using only one byte.
Even if the table is actually smaller and doesn't use all 256 places, you still have the guarantee of bidirectional mapping from table to index and from index to table, which could prevent errors that would appear, for example, if you had:
Object[] table = new Object[100];
And then someone (probably someone else) accesses it with a byte value out of table's range.
Maybe this kind of bijective behavior could be good, maybe you could have other ways to guarantee your constraints.
Probably, given how smart current compilers have become, it is not the only good practice anymore.
IMHO, anything involving arithmetic on exact powers of two is like a fast track: low-level arithmetic on a power of two takes fewer operations and bit manipulations, whereas other numbers need extra work from the CPU.
I also found this possible duplicate: Is it better to allocate memory in the power of two?
Yes, it's good practice, and it has at least one reason.
Modern processors have an L1 cache line size of 64 bytes, and if you use a buffer size of 2^n (for example 1024 or 4096), you fill cache lines completely, with no wasted space.
In some cases, this will help prevent false sharing problem (http://en.wikipedia.org/wiki/False_sharing).
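To illustrate the false-sharing point, here is a rough sketch (my own example, not from the answer) of padding per-thread counters to a full 64-byte cache line so that neighbouring counters do not share a line:

using System;
using System.Runtime.InteropServices;

// Each counter occupies a full 64-byte cache line, so two threads bumping
// adjacent counters no longer invalidate each other's line.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

class Counters
{
    private readonly PaddedCounter[] _perThread = new PaddedCounter[Environment.ProcessorCount];

    public void Increment(int threadIndex) { _perThread[threadIndex].Value++; }

    public long Total()
    {
        long sum = 0;
        foreach (var c in _perThread) sum += c.Value;
        return sum;
    }
}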
I'm working on a genetic algorithm project where I encode my chromosome in a binary string, with each 32 bits representing a value. The problem is that the function I'm optimizing has over 3000 parameters, which means I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class in which I create a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts, moves and so on.
My question is: is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for well over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly, you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to an int[] (or even long[]) and take advantage of bit operations on a whole machine word where possible.
For example, this would make shifts more efficient by roughly a factor of 4.
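As a rough sketch of what word-level manipulation could look like (my own illustration; it assumes the bits are packed into a ulong[] with the least significant word first):

// Shifts the whole packed bit string left by 'shift' bits and returns a new array.
static ulong[] ShiftLeft(ulong[] bits, int shift)
{
    int wordShift = shift / 64;
    int bitShift = shift % 64;
    var result = new ulong[bits.Length];

    for (int i = bits.Length - 1; i >= wordShift; i--)
    {
        ulong value = bits[i - wordShift] << bitShift;
        // Pull in the bits that spill over from the next lower word.
        if (bitShift != 0 && i - wordShift - 1 >= 0)
            value |= bits[i - wordShift - 1] >> (64 - bitShift);
        result[i] = value;
    }
    return result;
}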
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me see if I understand: currently you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with four values: A, C, G, T. Those can be coded in 2 bits, so a single byte can hold four values. Your 3000-element chromosome sequence can then be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bitstream. I would recommend an enum backed by a byte:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    for (byte i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]);
    return output;
}
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    for (byte i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x03); // first marker lives in the top two bits
    return result;
}
You might see a slight performance increase by using four parameters (out parameters if necessary) instead of allocating and using an array on the heap. But you lose the loop, which keeps the code compact.
Now, because you're packing markers into blocks of four per byte, if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of your last block with zero values (A). Working around this is messy, but if you keep a 32-bit integer that tells you the exact number of markers, you can simply discard anything extra you find in the byte stream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
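A quick round-trip check using the two helpers above (hypothetical usage; assumes the extension methods are in scope and System.Linq is imported):

var markers = new[] { DnaMarker.G, DnaMarker.A, DnaMarker.T, DnaMarker.C };
byte packed = markers.ToByteCode();                                  // 0b10_00_11_01
DnaMarker[] unpacked = packed.ToMarkers();
string text = string.Join("", unpacked.Select(m => m.ToString()));   // "GATC"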
And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.
I've been doing some performance testing, mainly so I can understand the difference between iterators and simple for loops. As part of this I created a simple set of tests and was then totally surprised by the results. For some methods, 64 bit was nearly 10 times faster than 32 bit.
What I'm looking for is some explanation for why this is happening.
[The answer below states this is due to 64 bit arithmetic in a 32 bit app. Changing the longs to ints results in good performance on 32 and 64 bit systems.]
Here are the 3 methods in question.
private static long ForSumArray(long[] array)
{
    var result = 0L;
    for (var i = 0L; i < array.LongLength; i++)
    {
        result += array[i];
    }
    return result;
}
private static long ForSumArray2(long[] array)
{
    var length = array.LongLength;
    var result = 0L;
    for (var i = 0L; i < length; i++)
    {
        result += array[i];
    }
    return result;
}
private static long IterSumArray(long[] array)
{
    var result = 0L;
    foreach (var entry in array)
    {
        result += entry;
    }
    return result;
}
I have a simple test harness that tests this
var repeat = 10000;
var arrayLength = 100000;
var array = new long[arrayLength];
for (var i = 0; i < arrayLength; i++)
{
    array[i] = i;
}
Console.WriteLine("For: {0}", AverageRunTime(repeat, () => ForSumArray(array)));
repeat = 100000;
Console.WriteLine("For2: {0}", AverageRunTime(repeat, () => ForSumArray2(array)));
Console.WriteLine("Iter: {0}", AverageRunTime(repeat, () => IterSumArray(array)));
private static TimeSpan AverageRunTime(int count, Action method)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (var i = 0; i < count; i++)
    {
        method();
    }
    stopwatch.Stop();
    var average = stopwatch.Elapsed.Ticks / count;
    return new TimeSpan(average);
}
When I run these, I get the following results:
32 bit:
For: 00:00:00.0006080
For2: 00:00:00.0005694
Iter: 00:00:00.0001717
64 bit
For: 00:00:00.0007421
For2: 00:00:00.0000814
Iter: 00:00:00.0000818
The things I read from this are that using LongLength is slow. If I use array.Length, performance for the first for loop is pretty good in 64 bit, but not 32 bit.
The other thing I read from this is that iterating over an array is as efficient as a for loop, and the code is much cleaner and easier to read!
x64 processors have 64-bit general-purpose registers with which they can perform operations on 64-bit integers in a single instruction. 32-bit processors do not have that. This is especially relevant to your program, as it heavily uses long (64-bit integer) variables.
For instance, in x64 assembly, to add a couple 64 bit integers stored in registers, you can simply do:
; adds rbx to rax
add rax, rbx
To do the same operation on a 32 bit x86 processor, you'll have to use two registers and manually use the carry of the first operation in the second operation:
; adds ecx:ebx to edx:eax
add eax, ebx
adc edx, ecx
More instructions and fewer registers mean more clock cycles, memory fetches, ... which ultimately results in reduced performance. The difference is very noticeable in number-crunching applications.
For .NET applications, it seems that the 64-bit JIT compiler performs more aggressive optimizations improving overall performance.
Regarding your point about array iteration: the C# compiler is clever enough to recognize foreach over arrays and treat it specially. The generated code is identical to using a for loop, and it's recommended that you use foreach if you don't need to change the array element in the loop. Besides that, the runtime recognizes the pattern for (int i = 0; i < a.Length; ++i) and omits the bounds checks for array accesses inside the loop. This does not happen in the LongLength case, which results in decreased performance (both for the 32-bit and the 64-bit case); and since you'll be using long variables with LongLength, the 32-bit performance gets degraded even more.
The long datatype is 64 bits, and in a 64-bit process it is processed as a single native-width unit. In a 32-bit process it is treated as two 32-bit units. Math, especially on these "split" types, will be processor-intensive.
Not sure of the "why", but I would make sure to call your method at least once outside your timer loop so you're not counting first-time JITting. (Since this looks like C# to me.)
Oh, that's easy.
I assume that you are using x86. What do you need for doing the loops in assembler?
One index variable i
One result variable result
The long array of values.
So you need three variables. Variable access is fastest if you can keep them in registers; if you have to move them in and out of memory, you lose speed.
For 64-bit longs on a 32-bit machine you need two registers per value, and there are only a handful of general-purpose registers, so chances are high that not all variables can be held in registers and some must live in intermediate storage such as the stack. This alone slows down access considerably.
Addition of numbers:
On 32 bit, the addition must be done in two steps: the first without the carry bit and the second with the carry bit. A 64-bit processor can do it in one instruction.
Moving/loading:
For every single load/store of a 64-bit variable on a 64-bit machine, a 32-bit machine needs two to move a long integer to or from memory.
Every composite datatype (a datatype consisting of more bits than the register/address width) loses considerable speed. Speed gains of an order of magnitude are the reason GPUs still prefer floats (32-bit) over doubles (64-bit).
As others have said, doing 64-bit arithmetic on a 32-bit machine is going to take some extra manipulation, more so for multiplication or division.
Back to your concern about iterators vs. simple for loops, iterators can have fairly complex definitions, and they will only be fast if inlining and compiler-optimization is capable of replacing them with the equivalent simple form. It really depends on the type of iterator and the underlying container implementation. The simplest way to tell if it has been optimized reasonably well is to examine the generated assembly code. Another way is to put it in a long-running loop, pause it, and look at the stack to see what it's doing.
I have found a few threads regarding this issue. Most people appear to favor using int in their C# code across the board, even if a byte or smallint would handle the data, unless it is a mobile app. I don't understand why. Doesn't it make more sense to define your C# datatype as the same datatype that is in your data storage solution?
My Premise:
If I am using a typed DataSet, LINQ to SQL classes, or POCOs, one way or another I will run into compiler datatype conversion issues if I don't keep my datatypes in sync across my tiers. I don't really like doing System.Convert all the time just because it was easier to use int across the board in the C# code. I have always used the smallest datatype needed to handle the data, in the database as well as in code, to keep my interface to the database clean. So I would bet 75% of my C# code is using byte or short as opposed to int, because that is what is in the database.
Possibilities:
Does this mean that most people who just use int for everything in code also use the int datatype for their SQL storage types and couldn't care less about the overall size of their database, or do they use System.Convert in code wherever applicable?
Why I care: I have worked on my own forever and I just want to be familiar with best practices and standard coding conventions.
Performance-wise, an int is faster in almost all cases. The CPU is designed to work efficiently with 32-bit values.
Shorter values are complicated to deal with. To read a single byte, say, the CPU has to read the 32-bit block that contains it, and then mask out the upper 24 bits.
To write a byte, it has to read the destination 32-bit block, overwrite the lower 8 bits with the desired byte value, and write the entire 32-bit block back again.
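To make that read-modify-write concrete, here is a rough sketch in C# of the mask-and-merge being described (my own illustration, not what the hardware literally executes):

// Writes byte value b into byte position k (0..3) of the 32-bit word w.
static uint WriteByte(uint w, int k, byte b)
{
    int shift = k * 8;
    uint mask = 0xFFu << shift;              // the 8 bits being replaced
    return (w & ~mask) | ((uint)b << shift); // clear them, then merge in b
}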
Space-wise, of course, you save a few bytes by using smaller datatypes. So if you're building a table with a few million rows, then shorter datatypes may be worth considering. (And the same might be good reason why you should use smaller datatypes in your database)
And correctness-wise, an int doesn't overflow easily. What if you think your value is going to fit within a byte, and then at some point in the future some harmless-looking change to the code means larger values get stored into it?
Those are some of the reasons why int should be your default datatype for all integral data. Only use byte if you actually want to store machine bytes. Only use shorts if you're dealing with a file format or protocol or similar that actually specifies 16-bit integer values. If you're just dealing with integers in general, make them ints.
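As a small illustration of that guideline (my own sketch; the reader and field name are hypothetical): keep the 16-bit type at the protocol boundary and widen to int for the rest of the code.

using System.IO;

static int ReadRecordCount(BinaryReader reader)
{
    ushort countOnWire = reader.ReadUInt16(); // exactly what the wire format specifies
    return countOnWire;                       // implicitly widened to int for general use
}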
I am only 6 years late but maybe I can help someone else.
Here are some guidelines I would use:
If there is a possibility the data will not fit in the future then use the larger int type.
If the variable is used as a struct/class field, then by default it will be padded to take up a whole 32 bits anyway, so using byte/Int16 will not save memory.
If the variable is short lived (like inside a function) then the smaller data types will not help much.
"byte" or "char" can sometimes describe the data better and can do compile time checking to make sure larger values are not assigned to it on accident. e.g. If storing the day of the month(1-31) using a byte and try to assign 1000 to it then it will cause an error.
If the variable is used in an array of roughly 100 or more I would use the smaller data type as long as it makes sense.
byte and int16 arrays are not as thread safe as an int (a primitive).
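To make the compile-time range check from the fourth guideline concrete (my own snippet, not from the answer):

byte dayOfMonth = 31;   // fine
// byte badDay = 1000;  // compile-time error: constant value 1000 cannot be converted to a byte
int anyDay = 1000;      // compiles, even though it is not a valid day of the month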
One topic that no one has brought up is the limited CPU cache. Smaller programs execute faster than larger ones because the CPU can fit more of the program into the faster L1/L2/L3 caches.
Using the int type can result in fewer CPU instructions; however, it also forces a higher percentage of the data to miss the CPU cache. Instructions are cheap to execute: modern CPU cores can execute 3-7 instructions per clock cycle, whereas a single cache miss can cost 1000-2000 clock cycles because it has to go all the way to RAM.
When memory is conserved, the rest of the application also performs better because it is not squeezed out of the cache.
I did a quick sum test with accessing random data in random order using both a byte array and an int array.
Random r = new Random(); // not declared in the original snippet; needed by the lambdas below
const int SIZE = 10000000, LOOPS = 80000;
byte[] array = Enumerable.Repeat(0, SIZE).Select(i => (byte)r.Next(10)).ToArray();
int[] visitOrder = Enumerable.Repeat(0, LOOPS).Select(i => r.Next(SIZE)).ToArray();

System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();
int sum = 0;
foreach (int v in visitOrder)
    sum += array[v];
sw.Stop();
Here are the results in time(ticks): (x86, release mode, without debugger, .NET 4.5, I7-3930k) (smaller is better)
        _________________________ Array Size _________________________
          10     100      1K     10K    100K      1M      10M
byte:    549     559     552     552     568     632     3041
int :    549     566     552     562     590    1803     4206
Accessing 1M items randomly using byte on my CPU had a 285% performance increase!
Anything under 10,000 was hardly noticeable.
int was never faster than byte for this basic sum test.
These values will vary with different CPUs with different cache sizes.
One final note: sometimes I look at the now open-source .NET Framework to see what Microsoft's experts do. The .NET Framework uses byte/Int16 surprisingly little; in fact, I could not find any uses.
You would have to be dealing with a few BILLION rows before this makes any significant difference in terms of storage capacity. Let's say you have three columns, and instead of using a byte-equivalent database type, you use an int-equivalent one.
That gives us 3 (columns) x 3 (bytes extra) per row, or 9 bytes per row.
This means that for "a few million rows" (let's say three million), you are consuming a whole extra 27 megabytes of disk space! Fortunately, as we're no longer living in the 1970s, you shouldn't have to worry about this :)
As said above, stop micro-optimising - the performance hit in converting to/from different integer-like numeric types is going to hit you much, much harder than the bandwidth/diskspace costs, unless you are dealing with very, very, very large datasets.
For the most part, 'No'.
Unless you know up front that you are going to be dealing with hundreds of millions of rows, it's a micro-optimisation.
Do what fits the domain model best. Later, if you have performance problems, benchmark and profile to pinpoint where they are occurring.
Not that I didn't believe Jon Grant and others, but I had to see for myself with our "million row table". The table has 1,018,000 rows. I converted 11 tinyint columns and 6 smallint columns into int; there were already 5 int and 3 smalldatetime columns. Four different indexes used a combination of the various data types, but obviously the new indexes now all use int columns.
Making the changes only cost me 40 MB, calculated from base table disk usage with no indexes. When I added the indexes back in, the overall change was only a 30 MB difference. So I was surprised, because I thought the index size would be larger.
So is 30 MB worth the hassle of using all the different data types? No way! I am off to INT land. Thanks, everyone, for setting this anal-retentive programmer back on the straight and blissful path of no more integer conversions... yippee!
The .NET runtime is optimised for Int32. See previous discussion at .NET Integer vs Int16?
If int is used everywhere, no casting or conversions are required. That is a bigger bang for the buck than the memory you will save by using multiple integer sizes.
It just makes life simpler.