Help with optimizing C# function via C and/or Assembly

I have this C# method which I'm trying to optimize:
// assume arrays are same dimensions
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    int data1;
    byte A1, B1, C1, D1;
    int data2;
    byte A2, B2, C2, D2;
    for (int i = 0; i < bigArray1.Length; i++)
    {
        data1 = bigArray1[i];
        data2 = bigArray2[i];
        A1 = (byte)(data1 >> 0);
        B1 = (byte)(data1 >> 8);
        C1 = (byte)(data1 >> 16);
        D1 = (byte)(data1 >> 24);
        A2 = (byte)(data2 >> 0);
        B2 = (byte)(data2 >> 8);
        C2 = (byte)(data2 >> 16);
        D2 = (byte)(data2 >> 24);
        A1 = A1 > A2 ? A1 : A2;
        B1 = B1 > B2 ? B1 : B2;
        C1 = C1 > C2 ? C1 : C2;
        D1 = D1 > D2 ? D1 : D2;
        bigArray1[i] = (A1 << 0) | (B1 << 8) | (C1 << 16) | (D1 << 24);
    }
}
The function basically compares two int arrays. For each pair of matching elements, the method compares each individual byte value and takes the larger of the two. The element in the first array is then assigned a new int value constructed from the 4 largest byte values (irrespective of source).
I think I have optimized this method as much as possible in C# (probably I haven't, of course - suggestions on that score are welcome as well). My question is, is it worth it for me to move this method to an unmanaged C DLL? Would the resulting method execute faster (and how much faster), taking into account the overhead of marshalling my managed int arrays so they can be passed to the method?
If doing this would get me, say, a 10% speed improvement, then it would not be worth my time for sure. If it was 2 or 3 times faster, then I would probably have to do it.
Note: please, no "premature optimization" comments, thanks in advance. This is simply "optimization".
Update: I realized that my code sample didn't capture everything I'm trying to do in this function, so here is an updated version:
private void DoSomethingElse(int[] dest, int[] src, double pos,
                             double srcMultiplier)
{
    int rdr;
    byte destA, destB, destC, destD;
    double rem = pos - Math.Floor(pos);
    double recipRem = 1.0 - rem;
    byte srcA1, srcA2, srcB1, srcB2, srcC1, srcC2, srcD1, srcD2;
    // stop one element short, since src[i + 1] is read below
    for (int i = 0; i < src.Length - 1; i++)
    {
        // get destination values
        rdr = dest[(int)pos + i];
        destA = (byte)(rdr >> 0);
        destB = (byte)(rdr >> 8);
        destC = (byte)(rdr >> 16);
        destD = (byte)(rdr >> 24);
        // get bracketing source values
        rdr = src[i];
        srcA1 = (byte)(rdr >> 0);
        srcB1 = (byte)(rdr >> 8);
        srcC1 = (byte)(rdr >> 16);
        srcD1 = (byte)(rdr >> 24);
        rdr = src[i + 1];
        srcA2 = (byte)(rdr >> 0);
        srcB2 = (byte)(rdr >> 8);
        srcC2 = (byte)(rdr >> 16);
        srcD2 = (byte)(rdr >> 24);
        // interpolate (simple linear) and multiply; the parentheses ensure
        // srcMultiplier scales the whole interpolated value
        srcA1 = (byte)((((double)srcA1 * recipRem) +
                        ((double)srcA2 * rem)) * srcMultiplier);
        srcB1 = (byte)((((double)srcB1 * recipRem) +
                        ((double)srcB2 * rem)) * srcMultiplier);
        srcC1 = (byte)((((double)srcC1 * recipRem) +
                        ((double)srcC2 * rem)) * srcMultiplier);
        srcD1 = (byte)((((double)srcD1 * recipRem) +
                        ((double)srcD2 * rem)) * srcMultiplier);
        // bytewise best-of
        destA = srcA1 > destA ? srcA1 : destA;
        destB = srcB1 > destB ? srcB1 : destB;
        destC = srcC1 > destC ? srcC1 : destC;
        destD = srcD1 > destD ? srcD1 : destD;
        // convert bytes back to int, writing to the same element we read
        dest[(int)pos + i] = (destA << 0) | (destB << 8) |
                             (destC << 16) | (destD << 24);
    }
}
Essentially this does the same thing as the first method, except in this one the second array (src) is always smaller than the first (dest), and the second array is positioned fractionally relative to the first (meaning that instead of being positioned at, say, 10 relative to dest, it can be positioned at 10.682791).
To achieve this, I have to interpolate between two bracketing values in the source (say, 10 and 11 in the above example, for the first element) and then compare the interpolated bytes with the destination bytes.
I suspect here that the multiplication involved in this function is substantially more costly than the byte comparisons, so that part may be a red herring (sorry). Also, even if the comparisons are still somewhat expensive relative to the multiplications, I still have the problem that this system can actually be multi-dimensional, meaning that instead of comparing 1-dimensional arrays, the arrays could be 2-, 5- or whatever-dimensional, so that eventually the time taken to calculate interpolated values would dwarf the time taken by the final bytewise comparison of 4 bytes (I'm assuming that's the case).
How expensive is the multiplication here relative to the bit-shifting, and is this the kind of operation that could be sped up by being offloaded to a C DLL (or even an assembly DLL, although I'd have to hire somebody to create that for me)?

Yes, the _mm_max_epu8() intrinsic does what you want. Chews through 16 bytes at a time. The pain point is the arrays. SSE2 instructions require their arguments to be aligned at 16-byte addresses. You cannot get that from the garbage-collected heap; it only promises 4-byte alignment. Even if you trick it by calculating an offset into the array that's 16-byte aligned, you'll lose when the garbage collector kicks in and moves the array.
You'll have to declare the arrays in the C/C++ code, using the __declspec(align(#)) declarator. Now you need to copy your managed arrays into those unmanaged ones. And the results back. Whether you are still ahead depends on details not easily seen in your question.
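For illustration, here is a minimal sketch of that copy-in/copy-out approach from the C# side (the DLL name and the MaxBytes entry point are hypothetical, not an existing library; requires using System.Runtime.InteropServices):
// Hypothetical P/Invoke into a native SSE2 routine.
[DllImport("nativemax.dll", CallingConvention = CallingConvention.Cdecl)]
static extern void MaxBytes(IntPtr dst, IntPtr src, int byteCount);

static void CallNativeMax(int[] bigArray1, int[] bigArray2)
{
    int byteCount = bigArray1.Length * sizeof(int);
    // over-allocate and round up so the buffers are 16-byte aligned
    IntPtr raw1 = Marshal.AllocHGlobal(byteCount + 15);
    IntPtr raw2 = Marshal.AllocHGlobal(byteCount + 15);
    try
    {
        IntPtr p1 = (IntPtr)((raw1.ToInt64() + 15) & ~15L);
        IntPtr p2 = (IntPtr)((raw2.ToInt64() + 15) & ~15L);
        Marshal.Copy(bigArray1, 0, p1, bigArray1.Length);  // copy in
        Marshal.Copy(bigArray2, 0, p2, bigArray2.Length);
        MaxBytes(p1, p2, byteCount);                       // native SSE2 work
        Marshal.Copy(p1, bigArray1, 0, bigArray1.Length);  // copy results out
    }
    finally
    {
        Marshal.FreeHGlobal(raw1);
        Marshal.FreeHGlobal(raw2);
    }
}
Those two copies are exactly the overhead that the SSE2 speedup has to amortize.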

The function below uses unsafe code to treat the integer arrays as arrays of bytes so that there's no need for bit twiddling.
private static void DoOtherThing(int[] bigArray1, int[] bigArray2)
{
    unsafe
    {
        fixed (int* p1 = bigArray1, p2 = bigArray2)
        {
            byte* b1 = (byte*)p1;
            byte* b2 = (byte*)p2;
            byte* bend = (byte*)(&p1[bigArray1.Length]);
            while (b1 < bend)
            {
                if (*b1 < *b2)
                {
                    *b1 = *b2;
                }
                ++b1;
                ++b2;
            }
        }
    }
}
On my machine running under the debugger in Release mode against arrays of 25 million ints, this code is about 29% faster than your original. However, running standalone, there is almost no difference in runtime. Sometimes your original code is faster, and sometimes the new code is faster.
Approximate numbers:
             Debugger    Standalone
Original     1,400 ms      700 ms
My code        975 ms      700 ms
And, yes, I did compare the results to ensure that the functions do the same thing.
I'm at a loss to explain why my code isn't faster, since it's doing significantly less work.
Given these results, I doubt that you could improve things by going to native code. As you say, the overhead of marshaling the arrays would likely eat up any savings you might realize in the processing.
The following modification to your original code, though, is 10% to 20% faster.
private static void DoSomething(int[] bigArray1, int[] bigArray2)
{
    for (int i = 0; i < bigArray1.Length; i++)
    {
        var data1 = (uint)bigArray1[i];
        var data2 = (uint)bigArray2[i];
        var A1 = data1 & 0xff;
        var B1 = data1 & 0xff00;
        var C1 = data1 & 0xff0000;
        var D1 = data1 & 0xff000000;
        var A2 = data2 & 0xff;
        var B2 = data2 & 0xff00;
        var C2 = data2 & 0xff0000;
        var D2 = data2 & 0xff000000;
        if (A2 > A1) A1 = A2;
        if (B2 > B1) B1 = B2;
        if (C2 > C1) C1 = C2;
        if (D2 > D1) D1 = D2;
        bigArray1[i] = (int)(A1 | B1 | C1 | D1);
    }
}

What about this?
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    for (int i = 0; i < bigArray1.Length; i++)
    {
        var data1 = (uint)bigArray1[i];
        var data2 = (uint)bigArray2[i];
        bigArray1[i] = (int)(
            Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
            Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
            Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
            Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
    }
}
It has a lot less bit shifting in it. You might find the calls to Math.Max aren't inlined if you profile it. In such a case, you'd just make the method more verbose.
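If profiling does show that, a hand-expanded loop body might look like this (a sketch, untested, along the lines of the masked version in the previous answer; data1 and data2 are the uint locals from the loop above):
// Math.Max replaced by explicit per-lane comparisons
var a1 = data1 & 0x000000FFu; var a2 = data2 & 0x000000FFu;
var b1 = data1 & 0x0000FF00u; var b2 = data2 & 0x0000FF00u;
var c1 = data1 & 0x00FF0000u; var c2 = data2 & 0x00FF0000u;
var d1 = data1 & 0xFF000000u; var d2 = data2 & 0xFF000000u;
bigArray1[i] = (int)((a1 > a2 ? a1 : a2) | (b1 > b2 ? b1 : b2) |
                     (c1 > c2 ? c1 : c2) | (d1 > d2 ? d1 : d2));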
I haven't tested this code as I don't have an IDE with me. I reckon it does what you want though.
If this still doesn't perform as you'd expect, you could try using pointer arithmetic in an unsafe block, but I seriously doubt that you'd see a gain. From everything I've read, code like this is unlikely to be faster when moved out to a native (extern) call. But don't take my word for it. Measure, measure, measure.
Good luck.

I don't see any way of speeding up this code by means of clever bit tricks.
If you really want this code to be faster, the only way I see of significantly (>2x or so) speeding it up on the x86 platform is to go for an assembler/intrinsics implementation. SSE has the instruction PCMPGTB, which:
"Performs a SIMD compare for the greater value of the packed bytes, words, or doublewords in the destination operand (first operand) and the source operand (second operand). If a data element in the destination operand is greater than the corresponding date element in the source operand, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s."
An XMM register fits four 32-bit ints, so you could loop over your arrays reading the values, get the mask, and then AND the first input with the mask and the second one with the inverted mask.
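As a sketch of that mask-and-blend idea in modern C# (System.Runtime.Intrinsics, .NET Core 3.0+; note that PCMPGTB compares signed bytes, so the inputs are biased by 0x80 first):
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> MaxBytes(Vector128<byte> a, Vector128<byte> b)
{
    // PCMPGTB is a signed compare; XOR with 0x80 turns the unsigned
    // comparison we want into an equivalent signed one.
    Vector128<byte> bias = Vector128.Create((byte)0x80);
    Vector128<sbyte> sa = Sse2.Xor(a, bias).AsSByte();
    Vector128<sbyte> sb = Sse2.Xor(b, bias).AsSByte();
    Vector128<byte> mask = Sse2.CompareGreaterThan(sa, sb).AsByte();
    // blend: take bytes from a where the mask is set, from b elsewhere
    return Sse2.Or(Sse2.And(a, mask), Sse2.AndNot(mask, b));
}
In practice Sse2.Max(a, b) compiles to a single PMAXUB and replaces all of the above; that is the _mm_max_epu8 route from the earlier answer.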
On the other hand, maybe you can reformulate your algorithm so that you don't need to pick the larger bytes, but maybe, for example, take the AND of the operands? Just a thought; it's hard to tell whether that can work without seeing the actual algorithm.

Another option, if you're able to run Mono, is to use the Mono.Simd package. This provides access to the SIMD instruction set from within .NET. Unfortunately you can't just take the assembly and run it on MS's CLR, as the Mono runtime treats it in a special way at JIT time. The actual assembly contains regular IL (non-SIMD) 'simulations' of the SIMD operations as a fall-back, in case the hardware does not support SIMD instructions.
You also need to be able to express your problem using the types that the API consumes, as far as I can make out.
Here is the blog post in which Miguel de Icaza announced the capability back in November 2008. Pretty cool stuff. Hopefully it will be added to the ECMA standard and MS can add it to their CLR.

You might like to look at the BitConverter class - can't remember if it is the right endianness for the particular conversion you're trying to do, but worth knowing about anyway.
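For reference, a quick sketch of what BitConverter gives you (the byte order follows the machine's endianness; check BitConverter.IsLittleEndian):
byte[] bytes = BitConverter.GetBytes(0x04030201);
// on a little-endian machine: bytes == { 0x01, 0x02, 0x03, 0x04 }
int roundTrip = BitConverter.ToInt32(bytes, 0);  // 0x04030201 again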

Related

Bit manipulation on large integers out of 'int' range

Ok, so let's start with a 32 bit integer:
int big = 536855551; // 00011111111111111100001111111111
Now, I want to set the last 10 bits of this integer to:
int little = 69; // 0001000101
So, my approach was this:
big = (big & 4294966272) & (little)
where 4294966272 is the first 22 bits, or 11111111111111111111110000000000.
But of course this isn't supported because 4294966272 is outside of the int range of 0x7FFFFFFF. Also, this isn't going to be my only operation. I also need to be able to set bits 11 through 14. My approach for that (with the same problem) was:
big = (big & 4294951935) | (little << 10)
So, with the explanation out of the way, here is what I'm doing as alternatives to the above:
1: ((big >> 10) << 10) | (little)
2: (big & 1023) | ((big >> 14) << 14) | (little << 10)
I don't feel like my alternatives are the best or most efficient way to go. Are there any better ways to do this?
Sidenote: If C# supported binary literals, '0b', this would be a lot prettier.
Thanks.
4294966272 should actually be -1024, which is represented as 11111111111111111111110000000000.
For example:
int big = 536855551;
int little = 69;
var thing = Convert.ToInt32("11111111111111111111110000000000", 2);
var res = (big & thing) & (little);
Note, though, that the result will always be 0:
00011111111111111100001111111111
&
00000000000000000000000001101001
&
11111111111111111111110000000000
Bit shifting is usually faster than bit shifting plus masking (that is, &); I have a test case for it.
You should go with your first alternative:
1: ((big >> 10) << 10) | (little)
Just beware of a subtle difference between unsigned and signed int when it comes to right-shifting: a signed shift is arithmetic (the sign bit is replicated), while an unsigned shift is logical.
Alternatively, you could define big and little as unsigned: use uint instead of int.
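For completeness, here is a sketch of how the out-of-range mask can be written as a signed constant (note that the combine with little must be | rather than &, as discussed above):
// 0xFFFFFC00 has the top 22 bits set; unchecked reinterprets it as -1024
const int mask = unchecked((int)0xFFFFFC00);
big = (big & mask) | little;  // replace the low 10 bits with 'little'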

C#: How to concatenate bits to create an UInt64?

I'm trying to create a hashing function for images in order to find similar ones in a database.
The hash is simply a series of bits (101110010...) where each bit stands for one pixel. As there are about 60 pixels per image, I assume it would be best to save this as a UInt64.
Now, when looping through each pixel and calculating each bit, how can I concatenate those bits and save them as a UInt64?
Thanks for your help.
Use some bit twiddling:
long mask = 0;
// For each bit that is set, given its position (0-63):
mask |= 1L << position;  // note the L suffix: a plain 1 is an int, and an int shift wraps at 32 bits
You use bitwise operators like this:
ulong it1 = 0;
byte b1 = 0x24;
byte b2 = 0x36;
...
it1 = ((ulong)b1 << 48) | ((ulong)b2 << 40) | ((ulong)b3 << 32) .. ;
Each byte must be widened to ulong before shifting; otherwise the shift happens in 32-bit int arithmetic (and note that ubyte isn't a C# type; byte is).
Alternatively you can use the BitConverter.ToUInt64() function to quickly convert a byte array to a UInt64. But are you sure the target is 8 bytes long?
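Putting it together for the pixel hash, a minimal sketch (PixelIsSet is a hypothetical stand-in for your per-pixel test):
ulong hash = 0;
for (int i = 0; i < pixelCount && i < 64; i++)
{
    if (PixelIsSet(i))        // hypothetical per-pixel predicate
        hash |= 1UL << i;     // set bit i of the 64-bit hash
}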

Setting all low order bits to 0 until two 1s remain (for a number stored as a byte array)

I need to set all the low order bits of a given BigInteger to 0 until only two 1 bits are left. In other words leave the highest and second-highest bits set while unsetting all others.
The number could be any combination of bits. It may even be all 1s or all 0s. Example:
MSB 0000 0000
1101 1010
0010 0111
...
...
...
LSB 0100 1010
We can easily take out corner cases such as 0, 1, PowerOf2, etc. I'm not sure how to apply the popular bit manipulation algorithms to an array of bytes representing one number.
I have already looked at bithacks but have the following constraints. The BigInteger structure only exposes underlying data through the ToByteArray method which itself is expensive and unnecessary. Since there is no way around this, I don't want to slow things down further by implementing a bit counting algorithm optimized for 32/64 bit integers (which most are).
In short, I have a byte [] representing an arbitrarily large number. Speed is the key factor here.
NOTE: In case it helps, the numbers I am dealing with have around 5,000,000 bits. They keep on decreasing with each iteration of the algorithm so I could probably switch techniques as the magnitude of the number decreases.
Why I need to do this: I am working with a 2D graph and am particularly interested in coordinates whose x and y values are powers of 2. So (x+y) will always have two bits set and (x-y) will always have consecutive bits set. Given an arbitrary coordinate (x, y), I need to transform an intersection by getting values with all bits unset except the two most significant set bits.
Try the following (originally written as pseudo-C#, massaged here into valid C#):
// find the index of the next non-zero byte at or below i
// (I'm assuming little endian: most significant byte last), or -1 if none
static int find_next_byte(byte[] data, int i)
{
    while (i >= 0 && data[i] == 0) --i;
    return i;
}

// find a bit mask of the next set bit at or below mask b, or 0 if none
static int find_next_bit(int value, int b)
{
    while (b > 0 && (value & b) == 0) b >>= 1;
    return b;
}

byte[] data;
int i = find_next_byte(data, data.Length - 1);
// find the first 1 bit
int b = find_next_bit(data[i], 1 << 7);
// try to find the second 1 bit
b = find_next_bit(data[i], b >> 1);
if (b > 0)
{
    // found both bits in the same byte; clear everything below the second
    if (b > 1) data[i] &= (byte)~(b - 1);
}
else
{
    // only one bit here; the second is in the next non-zero byte
    i = find_next_byte(data, i - 1);
    b = find_next_bit(data[i], 1 << 7);
    if (b > 1) data[i] &= (byte)~(b - 1);
}
// zero out the rest (Array.Clear(data, 0, i) would do this in one call)
for (--i; i >= 0; --i) data[i] = 0;
Untested.
Probably this would be a bit more performant if compiled as unmanaged code or even with a C or C++ compiler.
As harold noted correctly, if you have no a priori knowledge about your number, this O(n) method is the best you can do. If you can, you should keep the position of the highest two non-zero bytes, which would drastically reduce the time needed to perform your transformation.
I'm not sure if this is getting optimised out or not, but this code appears to be 16x faster than ToByteArray. It also avoids the memory copy, and it means you get the results as uint instead of byte, so you should see further improvements there.
// create a delegate to read BigInteger's private _bits field
// (requires using System.Linq.Expressions and System.Numerics)
var par = Expression.Parameter(typeof(BigInteger));
var bits = Expression.Field(par, "_bits");
var lambda = Expression.Lambda(bits, par);
var func = (Func<BigInteger, uint[]>)lambda.Compile();

// test-call our delegate
var bigint = BigInteger.Parse("3498574578238348969856895698745697868975687978");
int time = Environment.TickCount;
for (int y = 0; y < 10000000; y++)
{
    var x = func(bigint);
}
Console.WriteLine(Environment.TickCount - time);

// compare against ToByteArray
time = Environment.TickCount;
for (int y = 0; y < 10000000; y++)
{
    var x = bigint.ToByteArray();
}
Console.WriteLine(Environment.TickCount - time);
From there, finding the top two bits should be pretty easy. The first bit will be in the last non-zero uint (the least significant uint comes first), then it is just a matter of searching for the second topmost bit: if it is in the same integer, just clear the first bit and find the topmost remaining bit; otherwise search for the next non-zero uint and find its topmost bit. A sketch follows.
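Here is a sketch of that search (assuming, as with BigInteger's _bits, a uint[] with the least significant uint first):
// position (0-31) of the highest set bit in v, or -1 if v == 0
static int HighestBit(uint v)
{
    int p = -1;
    while (v != 0) { v >>= 1; p++; }
    return p;
}

// zero every bit below the two highest set bits, in place
static void KeepTopTwoBits(uint[] bits)
{
    int i = bits.Length - 1;
    while (i >= 0 && bits[i] == 0) i--;   // highest non-zero uint
    if (i < 0) return;                    // the number is zero
    int top = HighestBit(bits[i]);
    uint rest = bits[i] & ~(1u << top);   // clear the top bit, look again
    if (rest != 0)
    {
        // both bits live in the same uint
        bits[i] = (1u << top) | (1u << HighestBit(rest));
    }
    else
    {
        // the second bit is in the next non-zero uint below, if any
        bits[i] = 1u << top;
        int j = i - 1;
        while (j >= 0 && bits[j] == 0) j--;
        if (j >= 0) bits[j] = 1u << HighestBit(bits[j]);
        i = j;
    }
    for (--i; i >= 0; --i) bits[i] = 0;   // clear everything below
}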
EDIT: to make things simple, just copy/paste this class into your project. It creates extension methods, which means you can just call mybigint.GetUnderlyingBitsArray(). I added a method to get the sign as well and, to make it more generic, created a function that allows accessing any private field of any object. I found this to be slower than my original code in debug mode but the same speed in release mode. I would advise performance-testing this yourself.
static class BigIntegerEx
{
    private static Func<BigInteger, uint[]> getUnderlyingBitsArray;
    private static Func<BigInteger, int> getUnderlyingSign;

    static BigIntegerEx()
    {
        getUnderlyingBitsArray = CompileFuncToGetPrivateField<BigInteger, uint[]>("_bits");
        getUnderlyingSign = CompileFuncToGetPrivateField<BigInteger, int>("_sign");
    }

    private static Func<TObject, TField> CompileFuncToGetPrivateField<TObject, TField>(string fieldName)
    {
        var par = Expression.Parameter(typeof(TObject));
        var field = Expression.Field(par, fieldName);
        var lambda = Expression.Lambda(field, par);
        return (Func<TObject, TField>)lambda.Compile();
    }

    public static uint[] GetUnderlyingBitsArray(this BigInteger source)
    {
        return getUnderlyingBitsArray(source);
    }

    public static int GetUnderlyingSign(this BigInteger source)
    {
        return getUnderlyingSign(source);
    }
}
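Usage then looks like this (the extension class above must be in scope; note that _bits may be null for values small enough to fit in the _sign field, which this sketch does not handle):
var n = BigInteger.Parse("3498574578238348969856895698745697868975687978");
uint[] bits = n.GetUnderlyingBitsArray();  // magnitude, least significant uint first
int sign = n.GetUnderlyingSign();          // the internal sign field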

Optimisation of threshold computation

I'm trying to optimise the following C# code, which sets bytes to 0x00 or 0xFF based on a threshold.
for (int i = 0; i < veryLargeNumber; i++)
{
    data[i] = (byte)(data[i] < threshold ? 0 : 255);
}
Visual Studio's performance profiler shows that the above code is rather expensive, taking nearly 8 seconds to compute - 98% of my total processing expense. I'm processing just under a thousand items, so that adds up to over two hours.
I think the issue is to do with the ternary conditional operator, since it causes a branch. I'd imagine a pure-math operation of some sort could be significantly faster, since it's CPU-cache friendly.
Is there a way to optimise this? It's possible for me to fix the threshold value, if that helps. I'd consider anything above a ~7% performance increase a win, since that's a whole 10 minutes shaved off the total processing time.
If you are using .NET Framework 4.0, you could make use of the Task Parallel Library:
http://msdn.microsoft.com/en-us/library/dd460717
In your case you still have to test every byte against the threshold, and that will take time regardless, so the win comes from spreading the work across threads with parallel loops; see the sketch below.
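A minimal sketch of that approach, using a range partitioner so the per-delegate overhead doesn't swamp the cheap loop body:
using System.Collections.Concurrent;
using System.Threading.Tasks;

// chunk the index space; each chunk runs a tight sequential loop
Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        data[i] = (byte)(data[i] < threshold ? 0 : 255);
});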
As a suggestion: use bitwise operators for this purpose, because they are fast, together with the parallel approach.
0x00 = 0000 0000
0xFF = 1111 1111
Try the OR operator (i.e. 0 | 1 = 1, where | stands for bitwise OR).
EDIT:
This is how you could compare which number is bigger; let a, b be ints (the sign-bit constants need unchecked casts to compile in C#):
int temp = a ^ b;
temp |= temp >> 1;
temp |= temp >> 2;
temp |= temp >> 4;
temp |= temp >> 8;
temp |= temp >> 16;
temp &= ~(temp >> 1) | unchecked((int)0x80000000);
temp &= (a ^ unchecked((int)0x80000000)) & (b ^ 0x7fffffff);
If you want a bit-wise solution:
int intSize = sizeof(int) * 8 - 1;  // 31
byte t = (byte)(threshold - 1);
for (int i = 0; i < data.Length; i++)
{
    // (t - data[i]) >> 31 is 0 when data[i] < threshold and -1 otherwise;
    // XOR with 256 and the byte cast then yield 0 or 255 with no branch
    data[i] = (byte)(255 + 1 ^ ((t - data[i]) >> intSize));
}
Note: this won't work for the corner case threshold == 0. Sorry about that.
Also, try using an int array instead of byte and see if it is faster.

Fibonacci LFSRs calculation optimisation

The Fibonacci LFSR is described on wiki; it's pretty simple.
I'd like to calculate the period of some Fibonacci LFSRs and use the generated sequences for ciphering later.
Let's take an example from wiki:
x^16 + x^14 + x^13 + x^11 + 1
// code from wiki:
#include <stdint.h>

uint16_t lfsr = 0xACE1u;
unsigned bit;
unsigned period = 0;

do {
    /* taps: 16 14 13 11; characteristic polynomial: x^16 + x^14 + x^13 + x^11 + 1 */
    bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1;
    lfsr = (lfsr >> 1) | (bit << 15);
    ++period;
} while (lfsr != 0xACE1u);
My weak attempt so far, in PHP:
function getPeriod() {
    $polynoms = array(16, 14, 13, 11);
    $input = $polynoms[0] - 1;
    $n = sizeof($polynoms);
    for ($i = 1; $i < $n; $i++)
        $polynoms[$i] = $polynoms[0] - $polynoms[$i];
    $polynoms[0] = 0;
    // reversed polynoms == array(0, 2, 3, 5);
    $lfsr = 0x1; // beginning state
    $period = 0;
    // gmp -- php library for long numbers
    $lfsr = gmp_init($lfsr, 16);
    do {
        $bit = $lfsr; // bit = lfsr >> 0;
        for ($i = 1; $i < $n; $i++) {
            // bit ^= lfsr >> 2 ^ lfsr >> 3 ^ lfsr >> 5;
            $bit = gmp_xor($bit, gmp_div_q($lfsr, gmp_pow(2, $polynoms[$i])));
        }
        // bit &= 1;
        $bit = gmp_and($bit, 1);
        // lfsr = lfsr >> 1 | bit << (16 - 1);
        $lfsr = gmp_or(gmp_div_q($lfsr, 2), gmp_mul($bit, gmp_pow(2, $input)));
        $period++;
    } while (gmp_cmp($lfsr, 0x1) != 0);
    echo '<br />period = '.$period;
    // period == 65535 == 2^16 - 1 -- and that's correct
    // (I hope, at least)
    return $period;
}
Problem:
If I try to simulate, for example,
x^321 + x^14 + x^13 + x^11 + 1
I get the error: "Fatal error: Maximum execution time of 30 seconds exceeded in /var/www/Dx02/test.php".
Can I somehow optimize (accelerate :) ) the calculation?
Any help is appreciated. Thank you, and excuse me for my English.
You simply can't do it this way with a polynomial like x^321 + ...
If the polynomial is chosen well, you get a period length of 2^321 - 1,
which is approximately 4.27 * 10^96. If I'm not mistaken, this number is
believed to exceed the number of atoms in the universe...
(Strictly speaking, I'm referring to the posted C code, since I do not know PHP, but that certainly makes no difference.)
However, there is a mathematical method to calculate the length of the period without doing a brute-force search. Unfortunately, this can't be explained in a few lines. If you have a solid background in math (especially calculations in finite fields), I'll be glad to look for a helpful reference for you.
EDIT:
The first step in calculating the period of the LFSR obtained by using a polynomial p(x) is to obtain a factorization of p(x) mod 2, i.e. in GF(2). To do this, I recommend using software like Mathematica or Maple if available. You could also try the freely available Sage; see e.g. http://www.sagemath.org/doc/constructions/polynomials.html for usage details.
The period of p(x) is given by its order e, that is, the smallest number e such that p(x) divides x^e + 1. Unfortunately, I can't provide more information at the moment; it will take me several days to look for the lecture notes of a course I took several years ago...
A small example: p(x) = x^5+x^4+1 = (x^3+x+1)*(x^2+x+1); the individual periods are 2^3-1=7 and 2^2-1=3, and since all polynomial factors are distinct, the period of p(x) is 3*7=21, which I also verified in C++.
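To make the order computation concrete, here is a small brute-force sketch (in C# for consistency with the rest of the page; it represents a GF(2) polynomial as bits, so it is only practical for small degrees, which is exactly why you factor first):
// smallest e with x^e = 1 (mod p), i.e. p(x) divides x^e + 1 over GF(2);
// p is the polynomial's bit pattern (bit k set means the x^k term is present)
static long OrderOfPolynomial(ulong p, int degree)
{
    ulong r = 1;                                   // x^0 mod p
    for (long e = 1; e <= (1L << degree); e++)
    {
        r <<= 1;                                   // multiply by x
        if (((r >> degree) & 1UL) == 1UL) r ^= p;  // reduce modulo p(x)
        if (r == 1UL) return e;                    // x^e = 1 (mod p)
    }
    return -1;  // no order: p(x) has a zero constant term
}

// x^16 + x^14 + x^13 + x^11 + 1 -> 65535, matching the period in the question
ulong poly = (1UL << 16) | (1UL << 14) | (1UL << 13) | (1UL << 11) | 1UL;
Console.WriteLine(OrderOfPolynomial(poly, 16));
For p(x) = x^3+x+1 and x^2+x+1 this returns 7 and 3, matching the example above.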
To optimize this a bit, we need to remember that PHP has great overhead in parsing code, as it is not compiled, so we need to do as much of that work for it as we can. You should always profile your CPU/memory-sensitive code with xdebug + KCachegrind (for example) to see where PHP wastes most of its time. With your code, only 12% is spent on gmp_* calculations; most of the time is spent on code parsing.
On my notebook (it is rather slow) my code runs in 2.4 sec instead of 3.5 sec for your code, but for greater degrees the difference should be more noticeable (for example, degree 19 gives 19 vs 28 sec). It is not much, but it is something.
I left comments inside the code, but if you have some questions, feel free to ask. I used function creation to replace that 'for ($i = 1; $i < $n; $i++)' loop inside your main loop.
Also, I think you should change the type of your $period variable to GMP (and $period++ to a gmp_* call), as it can grow larger than the maximum integer on your system.
function getPeriod() {
    $polynoms = array(16, 14, 13, 11);
    $highest = $polynoms[0];
    $input = $highest - 1;
    // Delete the first element of the array - we don't need it anyway
    array_shift($polynoms);
    $polynoms_count = count($polynoms);
    // You always repeat gmp_pow(2, $input) and its result is constant,
    // so better to precalculate it once.
    $input_pow = gmp_pow(2, $input);
    // Start function creation.
    // If you don't use PHP accelerators, then shorter variable names
    // work slightly faster, so I replaced some of the names:
    // $period -> $r, $bit -> $b, $lfsr -> $l, $polynoms -> $p
    $function_str = '$r=0;';
    $function_str .= 'do{';
    // Now we need to get rid of your loop inside the loop; we can generate
    // a static function chain to replace it.
    // Also, PHP parses all PHP tokens, even ';', and that takes some time,
    // so we should write as many one-liners as we can.
    $function_str .= '$b=gmp_xor($b=$l';
    foreach ($polynoms AS $id => &$polynom) {
        // You always repeat gmp_pow(2, $polynoms[$i]) and its result is
        // constant, so better to precalculate it once.
        $polynom = gmp_pow(2, $highest - $polynom);
        // We create our function chain here
        if ($id < $polynoms_count - 1) {
            $function_str .= ',gmp_xor(gmp_div_q($l, $p[' . $id . '])';
        } else {
            $function_str .= ',gmp_div_q($l, $p[' . $id . '])';
        }
    }
    // Close all brackets
    $function_str .= str_repeat(')', $polynoms_count);
    // I don't know how to optimize the following, so I left it unchanged
    $function_str .= ';';
    $function_str .= '$l = gmp_or((gmp_div_q($l, 2)), (gmp_mul(gmp_and($b, 1), $i_p)));';
    $function_str .= '$r++;';
    $function_str .= '} while (gmp_cmp($l, 0x1));';
    $function_str .= 'return $r;';
    // Now, create our function
    $function = create_function('$l,$p,$i_p', $function_str);
    // Set beginning states
    $lfsr = 0x1;
    $lfsr = gmp_init($lfsr, 16);
    // Run the function
    $period = $function($lfsr, $polynoms, $input_pow);
    // Use the result
    echo '<br />period = ' . $period;
    return $period;
}
