I'm working on a data structure which subdivides items into quadrants, and one of the bottlenecks I've identified is my method to select the quadrant of the point. Admittedly, it's fairly simple, but it's called so many times that it adds up. I imagine there's got to be an efficient way to bit twiddle this into what I want, but I can't think of it.
private int Quadrant(Point p)
{
if (p.X >= Center.X)
return p.Y >= Center.Y ? 0 : 3;
return p.Y >= Center.Y ? 1 : 2;
}
Center is of type Point, coordinates are ints. Yes, I've run a code profile, and no, this isn't premature optimization.
Because this is only used internally, I suppose my quadrants don't have to be in Cartesian order, as long as they range from 0-3.
In C/C++ the fastest way would be
(((unsigned int)x >> 30) & 2) | ((unsigned int)y >> 31)
(30/31 or 62/63, depending on the size of int).
This will give the quadrants in order 0, 2, 3, 1.
Edit for LBushkin:
(((unsigned int)(x - center.x) >> 30) & 2) | ((unsigned int)(y-center.y) >> 31)
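For reference, a rough C# translation of that trick (a sketch only, assuming 32-bit int coordinates and the Point/Center fields from the question) could look like this; as noted above, the quadrants come out in the order 0, 2, 3, 1 rather than Cartesian order:
private int QuadrantBitwise(Point p)
{
    // Sketch: the sign bits of the differences select the quadrant
    // (ignoring overflow at the extremes of the int range).
    uint dx = (uint)(p.X - Center.X);
    uint dy = (uint)(p.Y - Center.Y);
    return (int)(((dx >> 30) & 2) | (dy >> 31));
}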
I don't know that you can make this code dramatically faster in C#. What you may be able to do, however, is look at how you're processing points, and see if you can avoid making unnecessary calls to this method. Perhaps you could create a QuadPoint structure that stores which quadrant a point is in (after you compute it once), so that you don't have to do so again.
But, admittedly, this depends on what your algorithm is doing, and whether it's possible to store/memoize the quadrant information. If every point is completely unique, this obviously won't help.
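As a sketch of that idea (QuadPoint and its members are hypothetical names, not an existing type), the quadrant could be computed once and carried around with the point:
struct QuadPoint
{
    public readonly Point Point;
    public readonly int Quadrant;   // cached so it is computed only once

    public QuadPoint(Point point, Point center)
    {
        Point = point;
        // Same numbering as the Quadrant method in the question.
        Quadrant = point.X >= center.X
            ? (point.Y >= center.Y ? 0 : 3)
            : (point.Y >= center.Y ? 1 : 2);
    }
}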
I've just been told about the solution which produces 0,1,2,3 quadrant results ordered correctly:
#define LONG_LONG_SIGN (sizeof(long long) * 8 - 1)
double dx = point.x - center.x;
double dy = point.y - center.y;
long long *pdx = (void *)&dx;
long long *pdy = (void *)&dy;
int quadrant = ((*pdy >> LONG_LONG_SIGN) & 3) ^ ((*pdx >> LONG_LONG_SIGN) & 1);
This solution is for x,y coordinates of double type.
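In C#, the same sign-bit idea for double coordinates can be written without pointer aliasing by using BitConverter.DoubleToInt64Bits (a sketch, assuming using System; the numbering is the same 0-3 order as above):
static int Quadrant(double px, double py, double cx, double cy)
{
    // An arithmetic shift by 63 gives 0 for non-negative values and -1 (all ones) for negative ones.
    long bx = BitConverter.DoubleToInt64Bits(px - cx);
    long by = BitConverter.DoubleToInt64Bits(py - cy);
    return (int)(((by >> 63) & 3) ^ ((bx >> 63) & 1));
}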
I've done some performance testing of this method and the branching method from the original question: my results are that the branching method is always a bit faster (currently I see a stable 160/180 ratio), so I prefer the branching method over the one with bitwise operations.
UPDATE
If someone is interested, all three algorithms were merged into the EKAlgorithms C/Objective-C repository as "Cartesian quadrant selection" algorithms:
Original branching algorithm
Bitwise algorithm by @ruslik from the accepted answer.
An alternative bitwise algorithm promoted by one of my colleagues, which is a bit slower than the second algorithm but returns quadrants in the correct order.
All algorithms there are optimized to work with double-typed points.
Performance testing showed us that in general the first (branching) algorithm is the winner on Mac OS X, though on a Linux machine we did see the second algorithm performing slightly faster than the branching one.
So, the general conclusion is to stick with the branching algorithm, because the bitwise versions do not give any performance gain.
My first try would be to get rid of the nested conditional.
int xi = p.X >= Center.X ? 1 : 0;
int yi = p.Y >= Center.Y ? 2 : 0;
int quadrants[4] = { ... };
return quadrants[xi+yi];
The array lookup in quadrants is optional if the quadrants are allowed to be renumbered. My code still needs two comparisons but they can be done in parallel.
I apologise in advance for any C# errors as I usually code C++.
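A rough C# rendering of the same idea (sketch only; the table values below are chosen to reproduce the numbering from the original question, but any 0-3 permutation works):
private static readonly int[] QuadrantMap = { 2, 3, 1, 0 };

private int Quadrant(Point p)
{
    int xi = p.X >= Center.X ? 1 : 0;
    int yi = p.Y >= Center.Y ? 2 : 0;
    return QuadrantMap[xi + yi];
}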
Perhaps something more efficient is possible when two unsigned 31 bit coordinates are stored in a 64 bit unsigned long variable.
// The following two lines are unnecessary
// if you store your coordinates in unsigned longs right away
unsigned long Pxy = (((unsigned long)P.x) << 32) + P.y;
unsigned long Centerxy = (((unsigned long)Center.x) << 32) + Center.y;
// This is the actual calculation; only 1 subtraction is needed.
// The or-ing with ones has only to be done once for a repeated use of Centerxy.
unsigned long diff = (Centerxy | (1UL << 63) | (1UL << 31)) - Pxy;
int quadrant = ((diff >> 62) & 2) | ((diff >> 31) & 1);
Taking a step back, a different solution is possible. Do not arrange your data structure split into quadrants right away, but alternately in both directions. This is also done in the related k-d tree.
I'm trying to solve a simple question on leetcode.com (https://leetcode.com/problems/number-of-1-bits/) and I encountered a strange behavior which is probably due to my lack of understanding...
My solution to the question in the link is the following:
public int HammingWeight(uint n) {
int sum = 0;
while (n > 0) {
uint t = n % 10;
sum += t == 0 ? 0 : 1;
n /= 10;
}
return sum;
}
My solution was to isolate each digit and, if it's one, increase the sum. When I ran this on my PC it worked (yes - I know it's not the optimal solution and there are more elegant solutions considering its binary representation).
But when I tried running in the leetcode editor it returned a wrong answer for the following input (00000000000000000000000000001011).
There's no real easy way to debug other than printing to the console, so I printed the value of n when entering the method and got the result of 11 instead of 1011 - on my PC I got 11. If I take a different solution - one that uses a bitwise right shift or calculates mod by 2 - then it works, even though the printed n is still 11. And I would have expected those solutions to fail as well, considering that n is "wrong" (different from my PC and the site as described).
Am I missing some knowledge regarding the representation of uint? Or binary number in a uint variable?
Your code appears to be processing it as base 10 (decimal), but hamming weight is about base 2 (i.e. binary). So: instead of doing % 10 and /= 10, you should be looking at % 2 and /= 2.
As for what uint looks like as binary: essentially like this, but ... the CPU is allowed to lie about where each of the octets actually is (aka "endianness"). The good news is: it doesn't usually expose that lie to you unless you cheat and look under the covers by looking at raw memory. As long as you use regular operators (include bitwise operators): the lie will remain undiscovered.
Side note: for binary work that is about checking a bit and shuffling the data down, & 1 and >> 1 would usually be preferable to % 2 and / 2. But as canton7 notes: there are also inbuilt operations for this specific scenario which use the CPU intrinsic instruction when possible (however: using the built-in function doesn't help you increase your understanding!).
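For completeness, on newer runtimes (.NET Core 3.0 and later) that built-in is System.Numerics.BitOperations.PopCount; a minimal use looks like:
using System.Numerics;

public int HammingWeight(uint n)
{
    // Uses the hardware popcount instruction where the CPU supports it.
    return BitOperations.PopCount(n);
}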
This Kata is poorly worded: in the examples the inputs are printed in binary representation while the outputs are printed in decimal representation, and there are no clues to help you see that.
00000000000000000000000000001011b is 11 (in decimal, 8 + 2 + 1). That is why you get 11 as input for the first test case.
There are no base-10 numbers made of 0s and 1s that you have to decode as base-2 stuff here.
To solve the Kata, you just need to work in base 2, as you succeeded in doing and as @MarcGravell explained.
Please check the code below; it should work for you.
It's a very simple way to solve it.
public int HammingWeight(uint n) {
    var result = 0;
    for (var i = 0; i < 32; i++)
    {
        // Test the lowest bit, then shift the value right by one.
        if ((n & 1) == 1) result++;
        n = n >> 1;
    }
    return result;
}
I'm working on the following practice problem from GeeksForGeeks:
Write a function Add() that returns the sum of two integers. The function should not use any of the arithmetic operators (+, ++, --, -, etc.).
The given solution in C# is:
public static int Add(int x, int y)
{
// Iterate till there is no carry
while (y != 0)
{
// carry now contains common set bits of x and y
int carry = x & y;
// Sum of bits of x and y where at least one of the bits is not set
x = x ^ y;
// Carry is shifted by one so that adding it to x gives the required sum
y = carry << 1;
}
return x;
}
Looking at this solution, I understand how it is happening; I can follow along with the debugger and anticipate the value changes before they come. But after walking through it several times, I still don't understand WHY it is happening. If this was to come up in an interview, I would have to rely on memory to solve it, not actual understanding of how the algorithm works.
Could someone help explain why we use certain operators at certain points and what those totals are supposed to represent? I know there are already comments in the code, but I'm obviously missing something...
At each iteration, you have these steps:
carry <- x & y // mark every location where the addition has a carry
x <- x ^ y // sum without carries
y <- carry << 1 // shift the carry left one column
On the next iteration, x holds the entire sum except for the carry bits, which are in y. These carries are properly bumped one column to the left, just as if you were doing the addition on paper. Continue doing this until there are no more carry bits to worry about.
Very briefly, this does the addition much as you or I would do it on paper, except that, instead of working right to left, it does all the bits in parallel.
Decimal arithmetic is more complicated than binary arithmetic, but perhaps it helps to compare them.
The algorithm that is usually taught for addition is to go through the digits one by one, remembering to "carry a one" when necessary. In the above algorithm, that is not exactly what happens - rather, all digits are added and allowed to wrap, and all the carries are collected to be applied all at once in the next step. In decimal that would look like this:
123456
777777
------ +
890123
001111 << 1
011110
------ +
801233
010000 << 1
100000
------ +
901233
000000 done
In binary arithmetic, addition without carry is just XOR.
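To see the same steps in binary, here is a small sketch (assuming using System;) that runs the loop from the solution above while printing x and y after each iteration; the sample values are arbitrary:
static int AddTraced(int x, int y)
{
    while (y != 0)
    {
        int carry = x & y;   // columns that will carry
        x = x ^ y;           // digit-wise sum with carries dropped
        y = carry << 1;      // carries moved one column to the left
        Console.WriteLine($"x = {Convert.ToString(x, 2)}, y = {Convert.ToString(y, 2)}");
    }
    return x;
}
For example, AddTraced(5, 7) prints x = 10, y = 1010, then x = 1000, y = 100, then x = 1100, y = 0, and returns 12.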
What you have here is a case of binary math on the representation in memory:
https://www.wikihow.com/Add-Binary-Numbers
Generally when programming in C#, you do not bother with the "how is it represented in memory" level of things. 55% of the time it is not worth the effort, 40% of the time it is worse than just using the built-in functions. And the remaining 5% of the time you should ask yourself why you are not programming in native C++, assembler or something with similarly low-level capabilities to begin with.
My apologies if this has been asked/answered before but I'm honestly not even sure how to word this as a question properly. I have the following bit pattern:
0110110110110110110110110110110110110110110110110110110110110110
I'm trying to perform a shift that'll preserve my underlying pattern; my first instinct was to use right rotation ((x >> count) | (x << (-count & 63))) but the asymmetry in my bit pattern results in:
0011011011011011011011011011011011011011011011011011011011011011 <--- wrong
The problem is that the most significant (far left) bit ends up being 0 instead of the desired 1:
1011011011011011011011011011011011011011011011011011011011011011 <--- right
Is there a colloquial name for this function I'm looking for? If not, how could I go about implementing this idea?
Additional Information:
While the question is language agnostic I'm currently trying to solve this using C#.
The bit patterns I'm using are entirely predictable and always have the same structure; the pattern starts with a single zero followed by n - 1 ones (where n is an odd number) and then repeats infinitely.
I'd like to accomplish this without conditional operations since they'd defeat the purpose of using bitwise manipulation in the first place but maybe I have no choice...
You've got a number structured like this:
B16 B15 B14 B13 B12 B11 B10 B09 B08 B07 B06 B05 B04 B03 B02 B01 B00
? 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
The ? needs to appear in the MSB (B15, or B63, or whatever) after the shift. Where does it come from? Well, the closest copy is found n places to the right:
B13 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
^--------------/
If your word has width w, this is 1 << (w - n).
So you can do:
var selector = 1UL << (w - n);
var rotated = (val >> 1) | ((val & selector) << (n - 1));
But you may want to shift by more than one position at a time. Then we need to build a wider mask:
? 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0
* * * * *
Here I've chosen to pretend n = 6; it just needs to be a multiple of the basic n, and larger than shift. Now:
var selector = ((1UL << shift) - 1) << (w - n);
var rotated = (val >> shift) | ((val & selector) << (n - shift));
Working demonstration using your pattern: http://rextester.com/UWYSW47054
It's easy to see that the output has period 3, as required:
1:B6DB6DB6DB6DB6DB
2:DB6DB6DB6DB6DB6D
3:6DB6DB6DB6DB6DB6
4:B6DB6DB6DB6DB6DB
5:DB6DB6DB6DB6DB6D
6:6DB6DB6DB6DB6DB6
7:B6DB6DB6DB6DB6DB
8:DB6DB6DB6DB6DB6D
9:6DB6DB6DB6DB6DB6
10:B6DB6DB6DB6DB6DB
11:DB6DB6DB6DB6DB6D
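For anyone who prefers a self-contained version over the external demo, a sketch along the lines of the code above (using the question's 64-bit pattern, with w = 64, n = 3, shift = 1) would be:
using System;

class PatternShiftDemo
{
    static void Main()
    {
        const int w = 64, n = 3, shift = 1;   // word width, pattern period, shift per step
        ulong val = 0x6DB6DB6DB6DB6DB6UL;     // 0110110...110, the pattern from the question

        for (int i = 1; i <= 6; i++)
        {
            ulong selector = ((1UL << shift) - 1) << (w - n);
            val = (val >> shift) | ((val & selector) << (n - shift));
            Console.WriteLine($"{i}:{val:X16}");
        }
    }
}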
Instead of storing a lot of repetitions of a pattern, just store a single occurrence and apply modulo operations to the indexes:
byte[] pattern = new byte[] { 0, 1, 1 };
// Get a "bit" at index "i", shifted right by "shift"
byte bit = pattern[(i - shift + 1000000 * pattern.Length) % pattern.Length];
The + 1000000 * pattern.Length term must be greater than the greatest expected shift and ensures that we get a positive sum.
This allows you to store patterns of virtually any length.
An optimization would be to store a mirrored version of the pattern. You could then shift left instead of right, which would simplify the index calculation:
byte bit = pattern[(i + shift) % pattern.Length];
Branchless Answer after a poke by #BenVoigt:
Get the last bit b by doing (n & 1);
Return n >> 1 | b << (8 * sizeof(n) - 1).
Original Answer:
Get the last bit b by doing (n & 1);
If b is 1, right shift the number by 1 bit and bitwise-OR it with 1 << (8 * sizeof(n) - 1);
If b is 0, just right shift the number by 1 bit.
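As a C# sketch of that branchless recipe for a 32-bit uint (substitute 63 for 31 when working with a ulong):
static uint RotateRightByOne(uint n)
{
    uint b = n & 1;                 // the bit that falls off the right end
    return (n >> 1) | (b << 31);    // branchless: re-insert it as the new most significant bit
}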
The problem was changed a bit through the comments.
For all reasonable n, the following problem can be solved efficiently after minimal pre-computation:
Given an offset k, get 64 bits starting at that position in the stream of bits that follows the pattern of (zero, n-1 ones) repeating.
Clearly the pattern repeats with a period of n, so only n different ulongs have to be produced for every given value of n. That could either be done explicitly, constructing all of them in pre-processing (they could be constructed in any obvious way, it doesn't really matter since that only happens once), or left more implicitly by storing only two ulongs per value for n (this works under the assumption that n < 64, see below) and then extracting a range from them with some shifting/ORing. Either way, use offset % n to compute which pattern to retrieve (since the offset is increasing in a predictable manner, no actual modulo operation is required[1]).
Even with the first method, memory consumption will be reasonable since this optimization is only an optimization for low n: in particular for n > 64 there will be fewer than 1 zero per word on average, so the "old fashioned way" of visiting every multiple of n and resetting that bit starts to skip work while the above trick would still visit every word and would not be able anymore to reset multiple bits at once.
[1]: if there are multiple n's in play at the same time, a possible strategy is keeping an array offsets where offsets[n] = offset % n, which could be updated according to: (not tested)
int next = offsets[n] + _64modn[n]; // 64 % n precomputed
offsets[n] = next - (((n - next - 1) >> 31) & n);
The idea being that n is subtracted whenever next >= n. Only one subtraction is needed since the offset and thing added to the offset are already reduced modulo n.
This offset-increment can be done with System.Numerics.Vectors, which is very feature-poor compared to actual hardware but is just about able to do this. It can't do the shift (yes, it's weird) but it can implement a comparison in a branchless way.
Doing one pass per value of n is easier, but touches lots of memory in a cache-unfriendly manner. Doing lots of different n at the same time may not be great either. I guess you'd just have to benchmark that.
Also you could consider hard-coding it for some low numbers, something like offset % 3 is fairly efficient (unlike offset % variable). This does take manual loop-unrolling which is a bit annoying, but it's actually simpler, just big in terms of lines of code.
In counting the number of bits in a word, a brute force would be something like this:
int CountNumSetBits(unsigned long n)
{
unsigned short num_setbits = 0;
while (n)
{
num_setbits += n & 1;
n >>= 1;
}
return num_setbits;
}
The big-O complexity would be O(n), where n is the number of bits in the word.
I thought of another way of writing the algorithm, taking advantage of the fact that we can obtain the lowest set bit (the first occurrence of a '1') using y = x & ~(x - 1):
int CountNumSetBitsMethod2(unsigned long n)
{
unsigned short num_setbits = 0;
int y = 0;
while (n)
{
y = n & ~(n - 1); // get first occurrence of '1'
if (y) // if we have a set bit inc our counter
++num_setbits;
n ^= y; // erase the first occurrence of '1'
}
return num_setbits;
}
If we assume that our inputs are 50% 1s and 50% 0s, it appears that the second algorithm could be twice as fast. However, the actual complexity is greater:
In method one we do the following for each bit:
1 add
1 and
1 shift
In method two we do the following for each set bit:
1 and
1 complement
1 subtraction (the result of the subtraction has to be copied to another reg)
1 compare
1 increment (if compare is true)
1 XOR
Now, in practice one can determine which algorithm is faster by performing some profiling. That is, using a stopwatch mechanism and some test data, and calling each algorithm, say, a million times.
What I want to do first, however, is see how well I can estimate the speed difference by eyeballing the code (given same number of set and unset bits).
If we assume that the subtraction takes about the same number of cycles as the add, and all the other operations are equal cycle-wise, can one conclude that each algorithm takes about the same amount of time?
Note: I am assuming here we cannot use lookup tables.
The second algorithm can be greatly simplified:
int CountNumSetBitsMethod2(unsigned long n) {
unsigned short num_setbits = 0;
while (n) {
num_setbits++;
n &= n - 1;
}
return num_setbits;
}
There are many more ways to compute the number of bits set in a word:
Using lookup tables for multiple bits at a time
Using 64-bit multiplications
Using parallel addition
Using extra tricks to shave a few cycles.
Trying to determine empirically which is faster by counting cycles is not so easy because even looking at the assembly output, it is difficult to assess the impact of instruction parallelisation, pipelining, branch prediction, register renaming and contention... Modern CPUs are very sophisticated! Furthermore, the actual code generated depends on the compiler version and configuration and the timings depend on the CPU type and release... Not to mention the variability linked to the particular sets of values used (for algorithms with variable numbers of instructions).
Benchmarking is a necessary tool, but even careful benchmarking may fail to model the actual usage correctly.
Here is a great site for this kind of bit twiddling games:
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
I suggest you implement the different versions and perform comparative benchmarks on your system. There is no definite answer, only local optima for specific sets of conditions.
Some amazing finds:
// option 3, for at most 32-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
A more classic one, usually considered the best method for counting bits in a 32-bit integer v:
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
First, the only way to know how fast things are is to measure them.
Second, to find the number of set bits in some bytes, build a lookup table of the number of set bits in a byte:
0->0
1->1
2->1
3->2
4->1
etc.
This is a common method and very fast. You can code it by hand or create it at startup.
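A sketch of that table-driven approach (written in C#, like most of this page, though the same idea carries over directly to C):
static readonly byte[] BitsSetPerByte = BuildTable();

static byte[] BuildTable()
{
    var table = new byte[256];
    for (int i = 1; i < 256; i++)
        table[i] = (byte)(table[i >> 1] + (i & 1));   // reuse the already-computed smaller entries
    return table;
}

static int CountSetBits(uint n)
{
    return BitsSetPerByte[n & 0xFF]
         + BitsSetPerByte[(n >> 8) & 0xFF]
         + BitsSetPerByte[(n >> 16) & 0xFF]
         + BitsSetPerByte[n >> 24];
}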
I have an implementation of a pseudo random number generator, specifically of George Marsaglia's XOR-Shift RNG. My implementation is here:
FastRandom.cs
It turns out that the first random sample is very closely correlated with the seed, which is fairly obvious if you take a look at the Reinitialise(int seed) method. This is bad. My proposed solution is to mix up the bits of the seed as follows:
_x = (uint)( (seed * 2147483647)
^ ((seed << 16 | seed >> 48) * 28111)
^ ((seed << 32 | seed >> 32) * 69001)
^ ((seed << 48 | seed >> 16) * 45083));
So I have significantly weakened any correlation by multiplying the seed's bits with four primes and XORing back to form _x. I also rotate the seed's bits before multiplication to ensure that bits of varying magnitudes get mixed up across the full range of values for a 32 bit value.
The four-way rotation just seemed like a nice balance between doing nothing and trying every possible rotation (32). The primes are 'finger in the air' - enough magnitude and bit structure to jumble up the bits and 'spread' them over the full 32 bits regardless of the starting seed.
Should I use bigger primes? Is there a standard approach to this problem, perhaps with a more formal basis? I am trying to do this with minimal CPU overhead.
Thanks
=== UPDATE ===
I decided to use some primes with set bits better distributed across all 32 bits. The result is that I can omit the shifts as the multiplications achieve the same effect (hashing bits across the full range of 32 bits), so I then just add the four products to give the final seed...
_x = (uint)( (seed * 1431655781)
+ (seed * 1183186591)
+ (seed * 622729787)
+ (seed * 338294347));
I could possibly get away with fewer primes/multiplications. Two seemed too few (I could still see patterns in the first samples), three looked OK, so for a safety margin I made it four.
=== UPDATE 2 ===
FYI the above reduces to the functionally equivalent:
_x = seed * 3575866506U;
I didn't spot this initially and when I did I was wondering if overflowing at different stages in the calculation would cause a different result. I believe the answer is no - the two calculations always give the same answer.
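A quick sketch to convince yourself of that: both forms wrap modulo 2^32 and multiplication distributes over addition there, so they agree for every seed. The method names here are just for illustration:
static uint MixFourProducts(int seed)
{
    unchecked
    {
        return (uint)((seed * 1431655781)
                    + (seed * 1183186591)
                    + (seed * 622729787)
                    + (seed * 338294347));
    }
}

static uint MixSingleProduct(int seed)
{
    unchecked
    {
        // 1431655781 + 1183186591 + 622729787 + 338294347 == 3575866506
        return (uint)seed * 3575866506U;
    }
}
For any seed, the two methods return the same value.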
According to some researchers, CrapWow, Crap8 and Murmur3 are the best non-cryptographic hash algorithms available today that are fast, simple, and statistically good.
More information is available at Non-Cryptographic Hash Function Zoo.
Edit: As of May, 2021 the floodberry.com links to the Non-Cryptographic Hash Function Zoo are not valid. The content can still be found on archive.org.