Counting common bits in a sequence of unsigned longs

Counting common bits in a sequence of unsigned longs - c#

I am looking for a faster algorithm than the below for the following. Given a sequence of 64-bit unsigned integers, return a count of the number of times each of the sixty-four bits is set in the sequence.
Example:
4608 = 0000000000000000000000000000000000000000000000000001001000000000
4097 = 0000000000000000000000000000000000000000000000000001000000000001
2048 = 0000000000000000000000000000000000000000000000000000100000000000
counts 0000000000000000000000000000000000000000000000000002101000000001
Example:
2560 = 0000000000000000000000000000000000000000000000000000101000000000
530 = 0000000000000000000000000000000000000000000000000000001000010010
512 = 0000000000000000000000000000000000000000000000000000001000000000
counts 0000000000000000000000000000000000000000000000000000103000010010
Currently I am using a rather obvious and naive approach:
static int bits = sizeof(ulong) * 8;
public static int[] CommonBits(params ulong[] values) {
int[] counts = new int[bits];
int length = values.Length;
for (int i = 0; i < length; i++) {
ulong value = values[i];
for (int j = 0; j < bits && value != 0; j++, value = value >> 1) {
counts[j] += (int)(value & 1UL);
}
}
return counts;
}

A small speed improvement might be achieved by first OR'ing the integers together, then using the result to determine which bits you need to check. You would still have to iterate over each bit, but only once over bits where there are no 1s, rather than values.Length times.

I'll direct you to the classical: Bit Twiddling Hacks, but your goal seems slightly different than just typical counting (i.e. your 'counts' variable is in a really weird format), but maybe it'll be useful.

The best I can do here is just get silly with it and unroll the inner-loop... seems to have cut the performance in half (roughly 4 seconds as opposed to the 8 in yours to process 100 ulongs 100,000 times)... I used a qick command-line app to generate the following code:
for (int i = 0; i < length; i++)
{
ulong value = values[i];
if (0ul != (value & 1ul)) counts[0]++;
if (0ul != (value & 2ul)) counts[1]++;
if (0ul != (value & 4ul)) counts[2]++;
//etc...
if (0ul != (value & 4611686018427387904ul)) counts[62]++;
if (0ul != (value & 9223372036854775808ul)) counts[63]++;
}
that was the best I can do... As per my comment, you'll waste some amount (I know not how much) running this in a 32-bit environment. If your that concerned over performance it may benefit you to first convert the data to uint.
Tough problem... may even benefit you to marshal it into C++ but that entirely depends on your application. Sorry I couldn't be more help, maybe someone else will see something I missed.
Update, a few more profiler sessions showing a steady 36% improvement. shrug I tried.

Ok let me try again :D
change each byte in 64 bit integer into 64 bit integer by shifting each bit by n*8 in lef
for instance
10110101 -> 0000000100000000000000010000000100000000000000010000000000000001
(use the lookup table for that translation)
Then just sum everything togeter in right way and you got array of unsigned chars whit integers.
You have to make 8*(number of 64bit integers) sumations
Code in c
//LOOKTABLE IS EXTERNAL and has is int64[256] ;
unsigned char* bitcounts(int64* int64array,int len)
{
int64* array64;
int64 tmp;
unsigned char* inputchararray;
array64=(int64*)malloc(64);
inputchararray=(unsigned char*)input64array;
for(int i=0;i<8;i++) array64[i]=0; //set to 0
for(int j=0;j<len;j++)
{
tmp=int64array[j];
for(int i=7;tmp;i--)
{
array64[i]+=LOOKUPTABLE[tmp&0xFF];
tmp=tmp>>8;
}
}
return (unsigned char*)array64;
}
This redcuce speed compared to naive implemetaton by factor 8, becuase it couts 8 bit at each time.
EDIT:
I fixed code to do faster break on smaller integers, but I am still unsure about endianess
And this works only on up to 256 inputs, becuase it uses unsigned char to store data in. If you have longer input string, you can change this code to hold up to 2^16 bitcounts and decrease spped by 2

const unsigned int BYTESPERVALUE = 64 / 8;
unsigned int bcount[BYTESPERVALUE][256];
memset(bcount, 0, sizeof bcount);
for (int i = values.length; --i >= 0; )
for (int j = BYTESPERVALUE ; --j >= 0; ) {
const unsigned int jth_byte = (values[i] >> (j * 8)) & 0xff;
bcount[j][jth_byte]++; // count byte value (0..255) instances
}
unsigned int count[64];
memset(count, 0, sizeof count);
for (int i = BYTESPERVALUE; --i >= 0; )
for (int j = 256; --j >= 0; ) // check each byte value instance
for (int k = 8; --k >= 0; ) // for each bit in a given byte
if (j & (1 << k)) // if bit was set, then add its count
count[i * 8 + k] += bcount[i][j];

Another approach that might be profitable, would be to build an array of 256 elements,
which encodes the actions that you need to take in incrementing the count array.
Here is a sample for a 4 element table, which does 2 bits instead of 8 bits.
int bitToSubscript[4][3] =
{
{0}, // No Bits set
{1,0}, // Bit 0 set
{1,1}, // Bit 1 set
{2,0,1} // Bit 0 and bit 1 set.
}
The algorithm then degenerates to:
pick the 2 right hand bits off of the number.
Use that as a small integer to index into the bitToSubscriptArray.
In that array, pull off the first integer. That is the number of elements in the count array, that you need to increment.
Based on that count, Iterate through the remainder of the row, incrementing count, based on the subscript you pull out of the bitToSubscript array.
Once that loop is done, shift your original number two bits to the right.... Rinse Repeat as needed.
Now there is one issue I ignored, in that description. The actual subscripts are relative. You need to keep track of where you are in the count array. Every time you loop, you add two to an offset. To That offset, you add the relative subscript from the bitToSubscript array.
It should be possible to scale up to the size you want, based on this small example. I would think that another program could be used, to generate the source code for the bitToSubscript array, so that it can be simply hard coded in your program.
There are other variation on this scheme, but I would expect it to run faster on average than anything that does it one bit at a time.
Good Hunting.
Evil.

I believe this should give a nice speed improvement:
const ulong mask = 0x1111111111111111;
public static int[] CommonBits(params ulong[] values)
{
int[] counts = new int[64];
ulong accum0 = 0, accum1 = 0, accum2 = 0, accum3 = 0;
int i = 0;
foreach( ulong v in values ) {
if (i == 15) {
for( int j = 0; j < 64; j += 4 ) {
counts[j] += ((int)accum0) & 15;
counts[j+1] += ((int)accum1) & 15;
counts[j+2] += ((int)accum2) & 15;
counts[j+3] += ((int)accum3) & 15;
accum0 >>= 4;
accum1 >>= 4;
accum2 >>= 4;
accum3 >>= 4;
}
i = 0;
}
accum0 += (v) & mask;
accum1 += (v >> 1) & mask;
accum2 += (v >> 2) & mask;
accum3 += (v >> 3) & mask;
i++;
}
for( int j = 0; j < 64; j += 4 ) {
counts[j] += ((int)accum0) & 15;
counts[j+1] += ((int)accum1) & 15;
counts[j+2] += ((int)accum2) & 15;
counts[j+3] += ((int)accum3) & 15;
accum0 >>= 4;
accum1 >>= 4;
accum2 >>= 4;
accum3 >>= 4;
}
return counts;
}
Demo: http://ideone.com/eNn4O (needs more test cases)

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
One of them
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
v &= v - 1; // clear the least significant bit set
}
Keep in mind, that complexity of this method is aprox O(log2(n)) where n is the number to count bits in, so for 10 binary it need only 2 loops
You should probably take the metod for counting 32 bits whit 64 bit arithmetics and applying it on each half of word, what would take by 2*15 + 4 instructions
// option 3, for at most 32-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) %
0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
If you have sse4,3 capable processor you can use POPCNT instruction.
http://en.wikipedia.org/wiki/SSE4

Related

Optimal solution for "Bitwise AND" problem in C#

Problem statement:
Given an array of non-negative integers, count the number of unordered pairs of array elements, such that their bitwise AND is a power of 2.
Example:
arr = [10, 7, 2, 8, 3]
Answer: 6 (10&7, 10&2, 10&8, 10&3, 7&2, 2&3)
Constraints:
1 <= arr.Count <= 2*10^5
0 <= arr[i] <= 2^12
Here's my brute-force solution that I've come up with:
private static Dictionary<int, bool> _dictionary = new Dictionary<int, bool>();
public static long CountPairs(List<int> arr)
{
long result = 0;
for (var i = 0; i < arr.Count - 1; ++i)
{
for (var j = i + 1; j < arr.Count; ++j)
{
if (IsPowerOfTwo(arr[i] & arr[j])) ++result;
}
}
return result;
}
public static bool IsPowerOfTwo(int number)
{
if (_dictionary.TryGetValue(number, out bool value)) return value;
var result = (number != 0) && ((number & (number - 1)) == 0);
_dictionary[number] = result;
return result;
}
For small inputs this works fine, but for big inputs this works slow.
My question is: what is the optimal (or at least more optimal) solution for the problem? Please provide a graceful solution in C#. 😊

One way to accelerate your approach is to compute the histogram of your data values before counting.
This will reduce the number of computations for long arrays because there are fewer options for value (4096) than the length of your array (200000).
Be careful when counting bins that are powers of 2 to make sure you do not overcount the number of pairs by including cases when you are comparing a number with itself.

We can adapt the bit-subset dynamic programming idea to have a solution with O(2^N * N^2 + n * N) complexity, where N is the number of bits in the range, and n is the number of elements in the list. (So if the integers were restricted to [1, 4096] or 2^12, with n at 100,000, we would have on the order of 2^12 * 12^2 + 100000*12 = 1,789,824 iterations.)
The idea is that we want to count instances for which we have overlapping bit subsets, with the twist of adding a fixed set bit. Given Ai -- for simplicity, take 6 = b110 -- if we were to find all partners that AND to zero, we'd take Ai's negation,
110 -> ~110 -> 001
Now we can build a dynamic program that takes a diminishing mask, starting with the full number and diminishing the mask towards the left
001
^^^
001
^^
001
^
Each set bit on the negation of Ai represents a zero, which can be ANDed with either 1 or 0 to the same effect. Each unset bit on the negation of Ai represents a set bit in Ai, which we'd like to pair only with zeros, except for a single set bit.
We construct this set bit by examining each possibility separately. So where to count pairs that would AND with Ai to zero, we'd do something like
001 ->
001
000
we now want to enumerate
011 ->
011
010
101 ->
101
100
fixing a single bit each time.
We can achieve this by adding a dimension to the inner iteration. When the mask does have a set bit at the end, we "fix" the relevant bit by counting only the result for the previous DP cell that would have the bit set, and not the usual union of subsets that could either have that bit set or not.
Here is some JavaScript code (sorry, I do not know C#) to demonstrate with testing at the end comparing to the brute-force solution.
var debug = 0;
function bruteForce(a){
let answer = 0;
for (let i = 0; i < a.length; i++) {
for (let j = i + 1; j < a.length; j++) {
let and = a[i] & a[j];
if ((and & (and - 1)) == 0 && and != 0){
answer++;
if (debug)
console.log(a[i], a[j], a[i].toString(2), a[j].toString(2))
}
}
}
return answer;
}
function f(A, N){
const n = A.length;
const hash = {};
const dp = new Array(1 << N);
for (let i=0; i<1<<N; i++){
dp[i] = new Array(N + 1);
for (let j=0; j<N+1; j++)
dp[i][j] = new Array(N + 1).fill(0);
}
for (let i=0; i<n; i++){
if (hash.hasOwnProperty(A[i]))
hash[A[i]] = hash[A[i]] + 1;
else
hash[A[i]] = 1;
}
for (let mask=0; mask<1<<N; mask++){
// j is an index where we fix a 1
for (let j=0; j<=N; j++){
if (mask & 1){
if (j == 0)
dp[mask][j][0] = hash[mask] || 0;
else
dp[mask][j][0] = (hash[mask] || 0) + (hash[mask ^ 1] || 0);
} else {
dp[mask][j][0] = hash[mask] || 0;
}
for (let i=1; i<=N; i++){
if (mask & (1 << i)){
if (j == i)
dp[mask][j][i] = dp[mask][j][i-1];
else
dp[mask][j][i] = dp[mask][j][i-1] + dp[mask ^ (1 << i)][j][i - 1];
} else {
dp[mask][j][i] = dp[mask][j][i-1];
}
}
}
}
let answer = 0;
for (let i=0; i<n; i++){
for (let j=0; j<N; j++)
if (A[i] & (1 << j))
answer += dp[((1 << N) - 1) ^ A[i] | (1 << j)][j][N];
}
for (let i=0; i<N + 1; i++)
if (hash[1 << i])
answer = answer - hash[1 << i];
return answer / 2;
}
var As = [
[10, 7, 2, 8, 3] // 6
];
for (let A of As){
console.log(JSON.stringify(A));
console.log(`DP, brute force: ${ f(A, 4) }, ${ bruteForce(A) }`);
console.log('');
}
var numTests = 1000;
for (let i=0; i<numTests; i++){
const N = 6;
const A = [];
const n = 10;
for (let j=0; j<n; j++){
const num = Math.floor(Math.random() * (1 << N));
A.push(num);
}
const fA = f(A, N);
const brute = bruteForce(A);
if (fA != brute){
console.log('Mismatch:');
console.log(A);
console.log(fA, brute);
console.log('');
}
}
console.log("Done testing.");

int[] numbers = new[] { 10, 7, 2, 8, 3 };
static bool IsPowerOfTwo(int n) => (n != 0) && ((n & (n - 1)) == 0);
long result = numbers.AsParallel()
.Select((a, i) => numbers
.Skip(i + 1)
.Select(b => a & b)
.Count(IsPowerOfTwo))
.Sum();
If I understand the problem correctly, this should work and should be faster.
First, for each number in the array we grab all elements in the array after it to get a collection of numbers to pair with.
Then we transform each pair number with a bitwise AND, then counting the number that satisfy our 'IsPowerOfTwo;' predicate (implementation here).
Finally we simply get the sum of all the counts - our output from this case is 6.
I think this should be more performant than your dictionary based solution - it avoids having to perform a lookup each time you wish to check power of 2.
I think also given the numerical constraints of your inputs it is fine to use int data types.

Why is CRC16 calculation so slow?

I have the following CRC function:
public static ushort ComputeCRC16(byte[] data)
{
ushort i, j, crc = 0;
int size = data.Length;
for (i = 0; i < size - 2; i++)
{
crc ^= (ushort)(data[i] << 8);
for (j = 0; j < 8; j++)
{
if ((crc & 0x8000) != 0) /* Test for bit 15 */
{
crc = (ushort)((crc << 1) ^ 0x1234); /* POLYNOMIAL */
}
else
{
crc <<= 1;
}
}
}
return crc;
}
I have been trying to use it to calculate a CRC16 from a file that is around 800 Kb, but it takes forever, I mean after five minutes, the value of i was still at around 2 000, and it should go up to 800 000.
Can someone give me an explanation on why it is so slow and what I can do to solve this issue ?
I'm working with Visual Studio 2015 on a i7 processor and the computer is not old nor broken.

Replace the first lines with:
int i, j;
ushort crc = 0;
You were using a ushort for the for counter, but if size > 65535, the for cycle won't ever end.
The reason for this is that C# by default does not throw an Exception if an overflow occurs but just "silently" ignores it. Checkout the following code for a demonstration:
ushort i = ushort.MaxValue; //65535
i++; //0

How to make this small C# program which calculates prime numbers using the Sieve of Eratosthenes as efficient and economical as possible?

I've made a small C# program which calculates prime numbers using the Sieve of Eratosthenes.
long n = 100000;
bool[] p = new bool[n+1];
for(long i=2; i<=n; i++)
{
p[i]=true;
}
for(long i=2; i<=X; i++)
{
for(long j=Y; j<=Z; j++)
{
p[i*j]=false;
}
}
for(long i=0; i<=n; i++)
{
if(p[i])
{
Console.Write(" "+i);
}
}
Console.ReadKey(true);
My question is: which X, Y and Z should I choose to make my program as efficient and economical as possible?
Of course we can just take:
X = n
Y = 2
Z = n
But then the program won't be very efficient.
It seems we can take:
X = Math.Sqrt(n)
Y = i
Z = n/i
And apparently the first 100 primes that the program gives are all correct.

There are several optimisations that can be applied without making the program overly complicated.
you can start the crossing out at j = i (effectively i * i instead of 2 * i) since all lower multiples of i have already been crossed out
you can save some work by leaving all even numbers out of the array (remembering to produce the prime 2 out of thin air when needed); hence array cell k represents the odd integer 2 * k + 1
you can make things faster by turning repeated multiplication (i * j) into iterated addition (k += i); instead of looping over j in the inner loop you loop (k = i * i; k <= N; k += i)
in some cases it can be advantageous to initialise the array with 0 (false) and set cells to 1 (true) for composites; its meaning is thus 'is_composite' instead of 'is_prime'
Harvesting all the low-hanging fruit, the loops thus become (in C++, but C# should be sort of similar):
uint32_t max_factor_bit = uint32_t(sqrt(double(n))) >> 1;
uint32_t max_bit = n >> 1;
for (uint32_t i = 3 >> 1; i <= max_factor_bit; ++i)
{
if (composite[i]) continue;
uint32_t n = (i << 1) + 1;
uint32_t k = (n * n) >> 1;
for ( ; k <= max_bit; k += n)
{
composite[k] = true;
}
}
Regarding the computation of max_factor there are some caveats where the compiler can bite you, for larger values of n. There's a topic for that on Code Review.
A further, easy optimisation is to represent the bitmap as an array of bytes, with each byte standing for eight odd integers. For setting bit k in byte array a you would do a[k / CHAR_BIT] |= (1 << (k % CHAR_BIT)) where CHAR_BIT is the number of bits in a byte. However, such bit trickery is normally wrapped into an inline function to keep the code clean. E.g. in C++ I tell the compiler how to generate such functions using a template like this:
template<typename word_t>
inline
void set_bit (word_t *p, uint32_t index)
{
enum { BITS_PER_WORD = sizeof(word_t) * CHAR_BIT };
// we can trust the compiler to use masking and shifting instead of division; we cannot do that
// ourselves without having the log2 which cannot easily be computed as a constexpr
p[index / BITS_PER_WORD] |= word_t(1) << (index % BITS_PER_WORD);
}
This allows me to say set_bit(a, k) for any type of array - byte, integer, whatever - without having to write special code or use invocations; it's basically a type-safe equivalent to the old C-style macros. I'm not certain whether something similar is possible in C#. There is, however, the C# type BitArray where all that stuff is already done for you under the hood.
On pastebin there's a small demo .cpp for the segmented Sieve of Eratosthenes, where two further optimisations are applied: presieving by small integers, and sieving in small, cache friendly blocks so that the full range of 32-bit integers can be sieved in 2 seconds flat. This could give you some inspiration...
When doing the Sieve of Eratosthenes, memory savings easily translate to speed gains because the algorithm is memory-intensive and it tends to stride all over the memory instead of accessing it locally. That's why space savings due to compact representation (only odd integers, packed bits - i.e. BitArray) and localisation of access (by sieving in small blocks instead of the whole array in one go) can speed up the code by one or more orders of magnitude, without making the code significantly more complicated.
It is possible to go far beyond the easy optimisations mentioned here, but that tends to make the code increasingly complicated. One word that often occurs in this context is the 'wheel', which can save a further 50% of memory space. The wiki has an explanation of wheels here, and in a sense the odds-only sieve is already using a 'modulo 2 wheel'. Conversely, a wheel is the extension of the odds-only idea to dropping further small primes from the array, like 3 and 5 in the famous 'mod 30' wheel with modulus 2 * 3 * 5. That wheel effectively stuffs 30 integers into one 8-bit byte.
Here's a runnable rendition of the above code in C#:
static uint max_factor32 (double n)
{
double r = System.Math.Sqrt(n);
if (r < uint.MaxValue)
{
uint r32 = (uint)r;
return r32 - ((ulong)r32 * r32 > n ? 1u : 0u);
}
return uint.MaxValue;
}
static void sieve32 (System.Collections.BitArray odd_composites)
{
uint max_bit = (uint)odd_composites.Length - 1;
uint max_factor_bit = max_factor32((max_bit << 1) + 1) >> 1;
for (uint i = 3 >> 1; i <= max_factor_bit; ++i)
{
if (odd_composites[(int)i]) continue;
uint p = (i << 1) + 1; // the prime represented by bit i
uint k = (p * p) >> 1; // starting point for striding through the array
for ( ; k <= max_bit; k += p)
{
odd_composites[(int)k] = true;
}
}
}
static int Main (string[] args)
{
int n = 100000000;
System.Console.WriteLine("Hello, Eratosthenes! Sieving up to {0}...", n);
System.Collections.BitArray odd_composites = new System.Collections.BitArray(n >> 1);
sieve32(odd_composites);
uint cnt = 1;
ulong sum = 2;
for (int i = 1; i < odd_composites.Length; ++i)
{
if (odd_composites[i]) continue;
uint prime = ((uint)i << 1) + 1;
cnt += 1;
sum += prime;
}
System.Console.WriteLine("\n{0} primes, sum {1}", cnt, sum);
return 0;
}
This does 10^8 in about a second, but for higher values of n it gets slow. If you want to do faster then you have to employ sieving in small, cache-sized blocks.

Is there a good radixsort-implementation for floats in C#

I have a datastructure with a field of the float-type. A collection of these structures needs to be sorted by the value of the float. Is there a radix-sort implementation for this.
If there isn't, is there a fast way to access the exponent, the sign and the mantissa.
Because if you sort the floats first on mantissa, exponent, and on exponent the last time. You sort floats in O(n).

Update:
I was quite interested in this topic, so I sat down and implemented it (using this very fast and memory conservative implementation). I also read this one (thanks celion) and found out that you even dont have to split the floats into mantissa and exponent to sort it. You just have to take the bits one-to-one and perform an int sort. You just have to care about the negative values, that have to be inversely put in front of the positive ones at the end of the algorithm (I made that in one step with the last iteration of the algorithm to save some cpu time).
So heres my float radixsort:
public static float[] RadixSort(this float[] array)
{
// temporary array and the array of converted floats to ints
int[] t = new int[array.Length];
int[] a = new int[array.Length];
for (int i = 0; i < array.Length; i++)
a[i] = BitConverter.ToInt32(BitConverter.GetBytes(array[i]), 0);
// set the group length to 1, 2, 4, 8 or 16
// and see which one is quicker
int groupLength = 4;
int bitLength = 32;
// counting and prefix arrays
// (dimension is 2^r, the number of possible values of a r-bit number)
int[] count = new int[1 << groupLength];
int[] pref = new int[1 << groupLength];
int groups = bitLength / groupLength;
int mask = (1 << groupLength) - 1;
int negatives = 0, positives = 0;
for (int c = 0, shift = 0; c < groups; c++, shift += groupLength)
{
// reset count array
for (int j = 0; j < count.Length; j++)
count[j] = 0;
// counting elements of the c-th group
for (int i = 0; i < a.Length; i++)
{
count[(a[i] >> shift) & mask]++;
// additionally count all negative
// values in first round
if (c == 0 && a[i] < 0)
negatives++;
}
if (c == 0) positives = a.Length - negatives;
// calculating prefixes
pref[0] = 0;
for (int i = 1; i < count.Length; i++)
pref[i] = pref[i - 1] + count[i - 1];
// from a[] to t[] elements ordered by c-th group
for (int i = 0; i < a.Length; i++){
// Get the right index to sort the number in
int index = pref[(a[i] >> shift) & mask]++;
if (c == groups - 1)
{
// We're in the last (most significant) group, if the
// number is negative, order them inversely in front
// of the array, pushing positive ones back.
if (a[i] < 0)
index = positives - (index - negatives) - 1;
else
index += negatives;
}
t[index] = a[i];
}
// a[]=t[] and start again until the last group
t.CopyTo(a, 0);
}
// Convert back the ints to the float array
float[] ret = new float[a.Length];
for (int i = 0; i < a.Length; i++)
ret[i] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
return ret;
}
It is slightly slower than an int radix sort, because of the array copying at the beginning and end of the function, where the floats are bitwise copied to ints and back. The whole function nevertheless is again O(n). In any case much faster than sorting 3 times in a row like you proposed. I dont see much room for optimizations anymore, but if anyone does: feel free to tell me.
To sort descending change this line at the very end:
ret[i] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
to this:
ret[a.Length - i - 1] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
Measuring:
I set up some short test, containing all special cases of floats (NaN, +/-Inf, Min/Max value, 0) and random numbers. It sorts exactly the same order as Linq or Array.Sort sorts floats:
NaN -> -Inf -> Min -> Negative Nums -> 0 -> Positive Nums -> Max -> +Inf
So i ran a test with a huge array of 10M numbers:
float[] test = new float[10000000];
Random rnd = new Random();
for (int i = 0; i < test.Length; i++)
{
byte[] buffer = new byte[4];
rnd.NextBytes(buffer);
float rndfloat = BitConverter.ToSingle(buffer, 0);
switch(i){
case 0: { test[i] = float.MaxValue; break; }
case 1: { test[i] = float.MinValue; break; }
case 2: { test[i] = float.NaN; break; }
case 3: { test[i] = float.NegativeInfinity; break; }
case 4: { test[i] = float.PositiveInfinity; break; }
case 5: { test[i] = 0f; break; }
default: { test[i] = test[i] = rndfloat; break; }
}
}
And stopped the time of the different sorting algorithms:
Stopwatch sw = new Stopwatch();
sw.Start();
float[] sorted1 = test.RadixSort();
sw.Stop();
Console.WriteLine(string.Format("RadixSort: {0}", sw.Elapsed));
sw.Reset();
sw.Start();
float[] sorted2 = test.OrderBy(x => x).ToArray();
sw.Stop();
Console.WriteLine(string.Format("Linq OrderBy: {0}", sw.Elapsed));
sw.Reset();
sw.Start();
Array.Sort(test);
float[] sorted3 = test;
sw.Stop();
Console.WriteLine(string.Format("Array.Sort: {0}", sw.Elapsed));
And the output was (update: now ran with release build, not debug):
RadixSort: 00:00:03.9902332
Linq OrderBy: 00:00:17.4983272
Array.Sort: 00:00:03.1536785
roughly more than four times as fast as Linq. That is not bad. But still not yet that fast as Array.Sort, but also not that much worse. But i was really surprised by this one: I expected it to be slightly slower than Linq on very small arrays. But then I ran a test with just 20 elements:
RadixSort: 00:00:00.0012944
Linq OrderBy: 00:00:00.0072271
Array.Sort: 00:00:00.0002979
and even this time my Radixsort is quicker than Linq, but way slower than array sort. :)
Update 2:
I made some more measurements and found out some interesting things: longer group length constants mean less iterations and more memory usage. If you use a group length of 16 bits (only 2 iterations), you have a huge memory overhead when sorting small arrays, but you can beat Array.Sort if it comes to arrays larger than about 100k elements, even if not very much. The charts axes are both logarithmized:
(source: daubmeier.de)

There's a nice explanation of how to perform radix sort on floats here:
http://www.codercorner.com/RadixSortRevisited.htm
If all your values are positive, you can get away with using the binary representation; the link explains how to handle negative values.

By doing some fancy casting and swapping arrays instead of copying this version is 2x faster for 10M numbers as Philip Daubmeiers original with grouplength set to 8. It is 3x faster as Array.Sort for that arraysize.
static public void RadixSortFloat(this float[] array, int arrayLen = -1)
{
// Some use cases have an array that is longer as the filled part which we want to sort
if (arrayLen < 0) arrayLen = array.Length;
// Cast our original array as long
Span<float> asFloat = array;
Span<int> a = MemoryMarshal.Cast<float, int>(asFloat);
// Create a temp array
Span<int> t = new Span<int>(new int[arrayLen]);
// set the group length to 1, 2, 4, 8 or 16 and see which one is quicker
int groupLength = 8;
int bitLength = 32;
// counting and prefix arrays
// (dimension is 2^r, the number of possible values of a r-bit number)
var dim = 1 << groupLength;
int groups = bitLength / groupLength;
if (groups % 2 != 0) throw new Exception("groups must be even so data is in original array at end");
var count = new int[dim];
var pref = new int[dim];
int mask = (dim) - 1;
int negatives = 0, positives = 0;
// counting elements of the 1st group incuding negative/positive
for (int i = 0; i < arrayLen; i++)
{
if (a[i] < 0) negatives++;
count[(a[i] >> 0) & mask]++;
}
positives = arrayLen - negatives;
int c;
int shift;
for (c = 0, shift = 0; c < groups - 1; c++, shift += groupLength)
{
CalcPrefixes();
var nextShift = shift + groupLength;
//
for (var i = 0; i < arrayLen; i++)
{
var ai = a[i];
// Get the right index to sort the number in
int index = pref[( ai >> shift) & mask]++;
count[( ai>> nextShift) & mask]++;
t[index] = ai;
}
// swap the arrays and start again until the last group
var temp = a;
a = t;
t = temp;
}
// Last round
CalcPrefixes();
for (var i = 0; i < arrayLen; i++)
{
var ai = a[i];
// Get the right index to sort the number in
int index = pref[( ai >> shift) & mask]++;
// We're in the last (most significant) group, if the
// number is negative, order them inversely in front
// of the array, pushing positive ones back.
if ( ai < 0) index = positives - (index - negatives) - 1; else index += negatives;
//
t[index] = ai;
}
void CalcPrefixes()
{
pref[0] = 0;
for (int i = 1; i < dim; i++)
{
pref[i] = pref[i - 1] + count[i - 1];
count[i - 1] = 0;
}
}
}

You can use an unsafe block to memcpy or alias a float * to a uint * to extract the bits.

I think your best bet if the values aren't too close and there's a reasonable precision requirement, you can just use the actual float digits before and after the decimal point to do the sorting.
For example, you can just use the first 4 decimals (be they 0 or not) to do the sorting.

Converting a int to a BCD byte array

I want to convert an int to a byte[2] array using BCD.
The int in question will come from DateTime representing the Year and must be converted to two bytes.
Is there any pre-made function that does this or can you give me a simple way of doing this?
example:
int year = 2010
would output:
byte[2]{0x20, 0x10};

static byte[] Year2Bcd(int year) {
if (year < 0 || year > 9999) throw new ArgumentException();
int bcd = 0;
for (int digit = 0; digit < 4; ++digit) {
int nibble = year % 10;
bcd |= nibble << (digit * 4);
year /= 10;
}
return new byte[] { (byte)((bcd >> 8) & 0xff), (byte)(bcd & 0xff) };
}
Beware that you asked for a big-endian result, that's a bit unusual.

Use this method.
public static byte[] ToBcd(int value){
if(value<0 || value>99999999)
throw new ArgumentOutOfRangeException("value");
byte[] ret=new byte[4];
for(int i=0;i<4;i++){
ret[i]=(byte)(value%10);
value/=10;
ret[i]|=(byte)((value%10)<<4);
value/=10;
}
return ret;
}
This is essentially how it works.
If the value is less than 0 or greater than 99999999, the value won't fit in four bytes. More formally, if the value is less than 0 or is 10^(n*2) or greater, where n is the number of bytes, the value won't fit in n bytes.
For each byte:
Set that byte to the remainder of the value-divided-by-10 to the byte. (This will place the last digit in the low nibble [half-byte] of the current byte.)
Divide the value by 10.
Add 16 times the remainder of the value-divided-by-10 to the byte. (This will place the now-last digit in the high nibble of the current byte.)
Divide the value by 10.
(One optimization is to set every byte to 0 beforehand -- which is implicitly done by .NET when it allocates a new array -- and to stop iterating when the value reaches 0. This latter optimization is not done in the code above, for simplicity. Also, if available, some compilers or assemblers offer a divide/remainder routine that allows retrieving the quotient and remainder in one division step, an optimization which is not usually necessary though.)

Here's a terrible brute-force version. I'm sure there's a better way than this, but it ought to work anyway.
int digitOne = year / 1000;
int digitTwo = (year - digitOne * 1000) / 100;
int digitThree = (year - digitOne * 1000 - digitTwo * 100) / 10;
int digitFour = year - digitOne * 1000 - digitTwo * 100 - digitThree * 10;
byte[] bcdYear = new byte[] { digitOne << 4 | digitTwo, digitThree << 4 | digitFour };
The sad part about it is that fast binary to BCD conversions are built into the x86 microprocessor architecture, if you could get at them!

Here is a slightly cleaner version then Jeffrey's
static byte[] IntToBCD(int input)
{
if (input > 9999 || input < 0)
throw new ArgumentOutOfRangeException("input");
int thousands = input / 1000;
int hundreds = (input -= thousands * 1000) / 100;
int tens = (input -= hundreds * 100) / 10;
int ones = (input -= tens * 10);
byte[] bcd = new byte[] {
(byte)(thousands << 4 | hundreds),
(byte)(tens << 4 | ones)
};
return bcd;
}

maybe a simple parse function containing this loop
i=0;
while (id>0)
{
twodigits=id%100; //need 2 digits per byte
arr[i]=twodigits%10 + twodigits/10*16; //first digit on first 4 bits second digit shifted with 4 bits
id/=100;
i++;
}

More common solution
private IEnumerable<Byte> GetBytes(Decimal value)
{
Byte currentByte = 0;
Boolean odd = true;
while (value > 0)
{
if (odd)
currentByte = 0;
Decimal rest = value % 10;
value = (value-rest)/10;
currentByte |= (Byte)(odd ? (Byte)rest : (Byte)((Byte)rest << 4));
if(!odd)
yield return currentByte;
odd = !odd;
}
if(!odd)
yield return currentByte;
}

Same version as Peter O. but in VB.NET
Public Shared Function ToBcd(ByVal pValue As Integer) As Byte()
If pValue < 0 OrElse pValue > 99999999 Then Throw New ArgumentOutOfRangeException("value")
Dim ret As Byte() = New Byte(3) {} 'All bytes are init with 0's
For i As Integer = 0 To 3
ret(i) = CByte(pValue Mod 10)
pValue = Math.Floor(pValue / 10.0)
ret(i) = ret(i) Or CByte((pValue Mod 10) << 4)
pValue = Math.Floor(pValue / 10.0)
If pValue = 0 Then Exit For
Next
Return ret
End Function
The trick here is to be aware that simply using pValue /= 10 will round the value so if for instance the argument is "16", the first part of the byte will be correct, but the result of the division will be 2 (as 1.6 will be rounded up). Therefore I use the Math.Floor method.

I made a generic routine posted at IntToByteArray that you could use like:
var yearInBytes = ConvertBigIntToBcd(2010, 2);

static byte[] IntToBCD(int input) {
byte[] bcd = new byte[] {
(byte)(input>> 8),
(byte)(input& 0x00FF)
};
return bcd;
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.