I wrote an Int128 type and it works great. I thought I could improve its performance with a simple idea: improve the shift operations, which are a bit clumsy.
Because they are heavily used in multiplication and division, an improvement would have a ripple effect. So I began creating a dynamic method (to shift low and rotate high), only to discover that there are no OpCodes.Rol or OpCodes.Ror instructions.
Is this possible in IL?
No.
You need to implement it with bit shifts:
UInt64 highBits = 0;
UInt64 lowBits = 1;
Int32 n = 63; // rotate-left count, 0 <= n < 128
// Rotating by 64 or more is a swap of the halves plus a rotate by n - 64.
if (n >= 64) { var tmp = highBits; highBits = lowBits; lowBits = tmp; n -= 64; }
// C# masks 64-bit shift counts to 6 bits, so n == 0 must not reach ">> 64".
var highResult = n == 0 ? highBits : (highBits << n) | (lowBits >> (64 - n));
var lowResult = n == 0 ? lowBits : (lowBits << n) | (highBits >> (64 - n));
To partially answer this question 7 years later, in case someone should need it:
You can use ROR/ROL in .NET.
MSIL doesn't directly contain ROR or ROL operations, but there are patterns that will make the JIT compiler generate ROR and ROL. RyuJIT (.NET and .NET Core) supports this.
The details of improving .NET Core to use this pattern were discussed here, and a month later the .NET Core code was updated to use it.
Looking at the implementation of SHA512 we find examples of ROR:
public static UInt64 RotateRight(UInt64 x, int n) {
return (((x) >> (n)) | ((x) << (64-(n))));
}
Extending the same pattern to ROL:
public static UInt64 RotateLeft(UInt64 x, int n) {
return (((x) << (n)) | ((x) >> (64-(n))));
}
To do this on a 128-bit integer you can process it as two 64-bit halves, then AND to extract the "carry", AND to clear the destination and OR to apply. This has to be mirrored in both directions (low->high and high->low). I'm not going to bother with an example since this question is a bit old.
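As a rough sketch of the two-halves approach described above (not taken from the original answer), a 128-bit rotate left might look like this. The class name Rotate128 and the tuple return type are assumptions made for illustration. The bits shifted out of one half are the "carry" that is ORed into the other half, and n == 0 is handled separately because C# masks 64-bit shift counts to 6 bits.

```csharp
using System;

public static class Rotate128
{
    // Rotates the 128-bit value (high:low) left by n bits; n is taken modulo 128.
    public static (ulong High, ulong Low) RotateLeft(ulong high, ulong low, int n)
    {
        n &= 127;
        if (n >= 64)
        {
            // A rotate by 64 simply swaps the halves.
            ulong tmp = high;
            high = low;
            low = tmp;
            n -= 64;
        }
        if (n == 0) return (high, low);

        // The top n bits of each half carry over into the bottom of the other half.
        ulong newHigh = (high << n) | (low >> (64 - n));
        ulong newLow = (low << n) | (high >> (64 - n));
        return (newHigh, newLow);
    }
}
```

For example, Rotate128.RotateLeft(0, 1, 64) moves the single set bit from the low half into the high half.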
For example, is every 4th bit set?
1000.1000 true
1010.1000 true
0010.1000 false
with offset of 1
0100.0100 true
0101.0100 true
0001.0100 false
Currently I am doing this by looping through every 4 bits:
int num = 170; //1010.1010
int N = 4;
int offset = 0; //[0, N-1]
int intervals = 8 / N; // number of N-bit blocks; only 8 bits are examined in this example
bool everyNth = true;
for (int i = 0; i < intervals; i++){
    if(((num >> (N*i)) & ((1 << (N - 1)) >> offset)) == 0){
        everyNth = false;
        break;
    }
}
return everyNth;
EXPLANATION OF CODE:
num = 1010.1010
The loop looks at each group of 4 bits as a block by right-shifting num by N*i.
num >> 4 = 0000.1010
Then an & tests a specific bit of that block, which can be offset.
To look only at that bit of the chunk, a mask is created by ((1 << (N - 1)) >> offset):
0000.1010
1000 (mask >> offset0)
OR 0100 (mask >> offset1)
OR 0010 (mask >> offset2)
OR 0001 (mask >> offset3)
Is there a purely computational way to do this? Like how you can XOR your way through to figure out parity. I am working with 64 bit integers for my case, but I am wondering this in a more general case.
Additionally, I am under the assumption that bit operators are one of the fastest methods for calculations or math in general. If this is not true, please feel free to correct me on what the time and place is for bit operators.
If we had a mask M in which every Nth bit is set, then testing whether every Nth bit in a given integer x is set could be calculated as (x & M) == M. Or with offset, you could use ((x << offset) & M) == M. Shifting M right is fine too.
If N is constant, that's all there is to it, just use the right M.
If N is variable, the question becomes: how do we get a mask in which every Nth bit is set?
Here is a simple way to do that:
Start by setting the Nth bit
"Double" the mask until done
For example,
ulong M = 1UL << (N - 1);
do
{
M |= M << N;
N += N;
} while (N < 64);
That is clearly still a loop. But it's not a bit-by-bit loop, it makes only a logarithmic number of iterations.
You could precompute the masks and store them in a small array, the range of N is necessarily small.
There may also be a way based on ulong.MaxValue / ((1UL << N) - 1) but that needs something more to "align" the mask and 64-bit division is not so great anyway. Perhaps there is a smarter way to get the mask.
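Putting the mask-doubling steps and the (x & M) == M test together, a minimal sketch for 64-bit values could look like the following. The class and method names are hypothetical, and the offset follows the question's convention of shifting the mask right.

```csharp
using System;

public static class BitPatterns
{
    // Builds a 64-bit mask with bits n-1, 2n-1, 3n-1, ... set (1 <= n <= 64).
    public static ulong EveryNthMask(int n)
    {
        ulong mask = 1UL << (n - 1);
        // Each pass doubles the number of set bits, so the loop runs O(log(64 / n)) times.
        for (int step = n; step < 64; step += step)
        {
            mask |= mask << step;
        }
        return mask;
    }

    // True when every nth bit of x (moved down by offset, 0 <= offset < n) is set.
    public static bool EveryNthBitSet(ulong x, int n, int offset)
    {
        ulong mask = EveryNthMask(n) >> offset;
        return (x & mask) == mask;
    }
}
```

With n = 4 the mask is 0x8888888888888888, so 0xA8A8A8A8A8A8A8A8 passes and 0x2828282828282828 does not.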
I am under the assumption that bit operators are one of the fastest methods for calculations or math in general
Bitwise operations are some of the fastest operations, but addition is equally fast, and multiplication is not that far behind (and a multiplication can do a lot more work at once, compared to how much more it costs).
Ok, so let's start with a 32 bit integer:
int big = 536855551; // 00011111111111111100001111111111
Now, I want to set the last 10 bits within this integer to:
int little = 69; // 0001000101
So, my approach was this:
big = (big & 4294966272) & (little)
where 4294966272 is the first 22 bits, or 11111111111111111111110000000000.
But of course this isn't supported because 4294966272 is outside of the int range of 0x7FFFFFFF. Also, this isn't going to be my only operation. I also need to be able to set bits 11 through 14. My approach for that (with the same problem) was:
big = (big & 4294951935) | (little << 10)
So with the explanation out of the way, here is what I'm doing as alternatives for the above:
1: ((big >> 10) << 10) | (little)
2: (big & 1023) | ((big >> 14) << 14) | (little << 10)
I don't feel like my alternatives are the best or most efficient way I could go. Are there any better ways to do this?
Sidenote: If C# supported binary literals, '0b', this would be a lot prettier.
Thanks.
4294966272 should actually be -1024, which is represented as 11111111111111111111110000000000.
For example:
int big = 536855551;
int little = 69;
var thing = Convert.ToInt32("11111111111111111111110000000000", 2);
var res = (big & thing) & (little);
Though the result will always be 0, because little only has bits below bit 10 set and the mask only has bits from bit 10 upward set:
00011111111111111100001111111111
&
00000000000000000000000001101001
&
11111111111111111111110000000000
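For comparison, here is a minimal sketch of the clear-then-OR update the question appears to be after, assuming the intent was to replace a bit field rather than AND everything together. The class and method names are hypothetical. Writing the mask as ~0x3FF gives the same bit pattern as -1024 without needing a literal outside the int range.

```csharp
using System;

public static class BitFields
{
    // Replaces the low 10 bits of value with the low 10 bits of field.
    public static int SetLow10(int value, int field)
    {
        const int mask = 0x3FF;                // ten low bits; ~mask == -1024
        return (value & ~mask) | (field & mask);
    }

    // Replaces bits 10..13 (zero-based; "bits 11 through 14" in the question)
    // of value with the low 4 bits of field.
    public static int SetBits10To13(int value, int field)
    {
        const int mask = 0xF << 10;            // 0x3C00
        return (value & ~mask) | ((field << 10) & mask);
    }
}
```

With big = 536855551 and little = 69, SetLow10(big, little) keeps the upper 22 bits of big and stores 69 in the low 10 bits.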
Bit shift is usually faster compared to bit-shift + mask (that is, &). I have a test case for it.
You should go with your first alternative.
1: ((big >> 10) << 10) | (little)
Just beware of a little difference between unsigned and signed int when it comes to bit-shifting.
Alternatively, you could define big and little as unsigned. Use uint instead of int.
The popcount function returns the number of 1's in an input. 0010 1101 has a popcount of 4.
Currently, I am using this algorithm to get the popcount:
private int PopCount(int x)
{
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
This works fine and the only reason I ask for more is because this operation is run awfully often and I am looking for additional performance gains.
I'm looking for a way to simplify the algorithm based on the fact that my 1's will always be right aligned. That is, the input will be something like 00000 11111 (returns 5) or 00000 11111 11111 (returns 10).
Is there a way to make a more efficient popcount based on this constraint? If the input was 01011 11101 10011, it would just return 2 because it only cares about the right-most ones. It seems any kind of looping is slower than the existing solution.
Here's a C# implementation that performs "find highest set" (binary logarithm). For right-aligned input, the position of the highest set bit (counted from 1) equals the popcount. It may or may not be faster than your current PopCount; it is surely slower than using the real clz and/or popcnt CPU instructions:
static int FindMSB( uint input )
{
if (input == 0) return 0;
return (int)(BitConverter.DoubleToInt64Bits(input) >> 52) - 1022;
}
Test: http://rextester.com/AOXD85351
And a slight variation without a conditional branch:
/* precondition: ones are right-justified, e.g. 00000111 or 00111111 */
static int FindMSB( uint input )
{
return (int)(input & (int)(BitConverter.DoubleToInt64Bits(input) >> 52) - 1022);
}
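On newer runtimes the hardware instructions mentioned above are reachable from C#. The sketch below assumes .NET Core 3.0 or later, where System.Numerics.BitOperations is available; the wrapper class and method names are hypothetical. CountTrailingOnes covers the case from the question where higher bits may contain unrelated ones.

```csharp
using System;
using System.Numerics;

public static class RightAlignedBits
{
    // Precondition: all set bits are right-aligned (e.g. 00011111).
    // The count of ones then equals the bit length of the value.
    public static int CountRightAlignedOnes(uint x)
    {
        return 32 - BitOperations.LeadingZeroCount(x);
    }

    // Counts only the run of ones at the low end, ignoring anything above it.
    public static int CountTrailingOnes(uint x)
    {
        // The trailing ones of x are the trailing zeros of ~x.
        return BitOperations.TrailingZeroCount(~x);
    }

    // General population count, for comparison with the hand-written PopCount.
    public static int PopCount(uint x)
    {
        return BitOperations.PopCount(x);
    }
}
```

Whether this beats the bit-twiddling version on a given machine is something to measure rather than assume.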
I've been working to optimize the Lucas-Lehmer primality test using C# code (yes, I'm doing something with Mersenne primes to calculate perfect numbers). I was wondering whether it is possible to make further improvements in speed with the current code. I use the System.Numerics.BigInteger class to hold the numbers; perhaps that is not the wisest choice, but we'll see.
This code is actually based on the information found at: http://en.wikipedia.org/wiki/Lucas%E2%80%93Lehmer_primality_test
A section of that page (as of this writing) gives a proof that allows the division to be optimized away.
The code for the LucasTest is:
public bool LucasLehmerTest(int num)
{
if (num % 2 == 0)
return num == 2;
else
{
BigInteger ss = new BigInteger(4);
for (int i = 3; i <= num; i++)
{
ss = KaratsubaSquare(ss) - 2;
ss = LucasLehmerMod(ss, num);
}
return ss == BigInteger.Zero;
}
}
Edit:
The version above is faster than using ModPow from the BigInteger class, as suggested by Mare Infinitus below. That implementation is:
public bool LucasLehmerTest(int num)
{
if (num % 2 == 0)
return num == 2;
else
{
BigInteger m = (BigInteger.One << num) - 1;
BigInteger ss = new BigInteger(4);
for (int i = 3; i <= num; i++)
ss = (BigInteger.ModPow(ss, 2, m) - 2) % m;
return ss == BigInteger.Zero;
}
}
The LucasLehmerMod method is implemented as follows:
public BigInteger LucasLehmerMod(BigInteger divident, int divisor)
{
BigInteger mask = (BigInteger.One << divisor) - 1; //Mask
BigInteger remainder = BigInteger.Zero;
BigInteger temporaryResult = divident;
do
{
remainder = temporaryResult & mask;
temporaryResult >>= divisor;
temporaryResult += remainder;
} while ( (temporaryResult >> divisor ) != 0 );
return (temporaryResult == mask ? BigInteger.Zero : temporaryResult);
}
What I am afraid of is that when using the BigInteger class from the .NET Framework, I am bound to its calculations. Would it mean I have to create my own BigInteger class to improve it? Or can I get by with a KaratsubaSquare (derived from the Karatsuba algorithm) like this one, which I found in Optimizing Karatsuba Implementation:
public BigInteger KaratsubaSquare(BigInteger x)
{
    int n = BitLength(x);
    if (n <= LOW_DIGITS) return BigInteger.Pow(x,2); //Standard square
    n = (n + 1) / 2; //Split at half the bit length; without this b is always 0 and the recursion never ends
    BigInteger b = x >> n; //Higher half
    BigInteger a = x - (b << n); //Lower half
    BigInteger ac = KaratsubaSquare(a); // lower half * lower half
    BigInteger bd = KaratsubaSquare(b); // higher half * higher half
    BigInteger c = Karatsuba(a, b); // lower half * higher half
    return ac + (c << (n + 1)) + (bd << (2 * n));
}
So basically, I want to look if it is possible to improve the Lucas-Lehmer test method by optimizing the for loop. However, I am a bit stuck there... Is it even possible?
Any thoughts are welcome of course.
Some extra thoughts:
I could use several threads to speed up the calculation on finding Perfect numbers. However, I have no experience (yet) with good partitioning.
I'll try to explain my thoughts (no code yet):
First I'll be generating a prime table using the sieve of Eratosthenes. It takes about 25 ms, single-threaded, to find the primes in the range from 2 to 1 million.
What C# offers is quite astonishing. Using PLINQ or the Parallel.For method, I could run several calculations almost simultaneously; however, it splits the primeTable array into chunks that do not suit the search.
I already figured out that the automatic load balancing of the threads is not sufficient for this task. Hence I need to try a different approach: dividing the load according to the Mersenne numbers that have to be tested and used to calculate a perfect number. Does anyone have experience with this? This page seems to be a bit helpful: http://www.drdobbs.com/windows/custom-parallel-partitioning-with-net-4/224600406
I'll be looking into it further.
As for now, my results are as follows.
My current algorithm (using the standard BigInteger class from C#) can find the first 17 perfect numbers (see http://en.wikipedia.org/wiki/List_of_perfect_numbers) within 5 seconds on my laptop (an Intel i5 with 4 cores and 8GB of RAM). However, then it gets stuck and finds nothing within 10 minutes.
This is something I cannot match yet... My gut feeling (and common sense) tells me that I should look into the Lucas-Lehmer test, since the for-loop for the 18th perfect number (using the Mersenne exponent 3217) would run 3215 times. There is room for improvement, I guess...
What Dinony posted below is a suggestion to rewrite it completely in C. I agree that would boost my performance; however, I chose C# to find out its limitations and benefits. Since it is widely used and allows rapid application development, it seemed worth trying.
Could unsafe code provide benefits here as well?
One possible optimization is to use BigInteger.ModPow.
It really increases performance significantly.
Just a note for info...
In python, this
ss = KaratsubaSquare(ss) - 2
has worse performance than this:
ss = ss*ss - 2
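Translated back to C#, that suggestion amounts to letting BigInteger do the squaring itself and keeping only the shift-and-add reduction from the question. The following is a minimal sketch under that assumption, not a tuned implementation: the class and method names are hypothetical, p is assumed to be a prime exponent, and adding the mask before subtracting 2 is my own choice to keep the value non-negative.

```csharp
using System;
using System.Numerics;

public static class LucasLehmer
{
    // Reduces a non-negative value modulo 2^p - 1 using only shifts, masks and adds.
    static BigInteger ModMersenne(BigInteger value, int p, BigInteger mask)
    {
        while ((value >> p) != 0)
        {
            value = (value & mask) + (value >> p);
        }
        return value == mask ? BigInteger.Zero : value;
    }

    // Lucas-Lehmer test for 2^p - 1, where p is assumed to be prime.
    public static bool IsMersennePrime(int p)
    {
        if (p == 2) return true;
        BigInteger mask = (BigInteger.One << p) - 1;   // the Mersenne number itself
        BigInteger s = 4;
        for (int i = 0; i < p - 2; i++)
        {
            // s*s - 2, with the modulus added so the value stays non-negative.
            s = ModMersenne(s * s + mask - 2, p, mask);
        }
        return s.IsZero;
    }
}
```

Calling LucasLehmer.IsMersennePrime(13) returns true, while exponent 11 is rejected because 2047 = 23 * 89.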
What about adapting the code to C? I have no idea about the algorithm, but it is not that much code, so the biggest run-time improvement could come from porting it to C.
I'm trying to optimise the following C# code, which sets bytes to 0x00 or 0xFF based on a threshold.
for (int i = 0; i < veryLargeNumber; i++)
{
data[i] = (byte)(data[i] < threshold ? 0 : 255);
}
Visual Studio's performance profiler shows that the above code is rather expensive, taking nearly 8 seconds to compute - 98% of my total processing expense. I'm processing just under a thousand items, so that adds up to over two hours.
I think the issue is to do with the ternary conditional operator, since it causes a branch. I'd imagine a pure-math operation of some sort could be significantly faster, since it's CPU-cache friendly.
Is there a way to optimise this? It's possible for me to fix the threshold value, if that helps. I'd consider anything above a ~7% performance increase a win, since that's a whole 10 minutes shaved off the total processing time.
If you are using .NET Framework 4.0, you could make use of the Task Parallel Library described at the following link:
http://msdn.microsoft.com/en-us/library/dd460717
In your case, you have to check the threshold for every element anyway, and that takes time. So make use of threads or lambda expressions.
Just a suggestion: use bitwise operators for this purpose because they are faster, together with the parallel approach.
0x00 = 0000 0000
0xFF = 1111 1111
Try the OR operator (i.e. 0 | 1 = 1, where | stands for the OR operator).
EDIT:
This is how you could compare which number is bigger:
let a,b be numbers:
int temp= a ^ b;
temp|= temp>> 1;
temp|= temp>> 2;
temp|= temp>> 4;
temp|= temp>> 8;
temp|= temp>> 16;
temp&= ~(temp>> 1) | 0x80000000;
temp&= (a ^ 0x80000000) & (b ^ 0x7fffffff);
If you want a bit-wise solution:
int intSize = sizeof(int) * 8 - 1;
byte t = (byte)(threshold - 1);
for (....)
{
    // (t - data[i]) >> intSize is -1 when data[i] >= threshold, otherwise 0
    data[i] = (byte)((255 + 1) ^ ((t - data[i]) >> intSize));
}
Note: This won't work for the corner case of a threshold of 0. Sorry about that.
Also, try using an int array instead of byte and see if it is faster.
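A variation on the same sign-bit idea that also handles a threshold of 0 is sketched below. It is my own reformulation rather than the code above, and the class and method names are hypothetical. Whether it actually beats the ternary version has to be measured, since the JIT may already compile the ternary without a branch.

```csharp
using System;

public static class Thresholding
{
    // Sets each byte to 0x00 if it is below threshold and to 0xFF otherwise,
    // without a conditional in the loop body.
    public static void Apply(byte[] data, byte threshold)
    {
        for (int i = 0; i < data.Length; i++)
        {
            // The difference is negative exactly when data[i] < threshold;
            // the arithmetic shift turns that into -1 (all ones) or 0.
            int below = (data[i] - threshold) >> 31;
            // Inverting gives 0 for "below" and all ones (0xFF after the cast) otherwise.
            data[i] = unchecked((byte)~below);
        }
    }
}
```

For example, applying a threshold of 128 to { 0, 127, 128, 255 } yields { 0, 0, 255, 255 }.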