I am porting a library from C++ to C# but have come across a scenario I am unsure of how to resolve, which involves casting an unsigned char * to an unsigned int *.
C++
unsigned int c4;
unsigned int c2;
unsigned int h4;
int pos(unsigned char *p)
{
c4 = *(reinterpret_cast<unsigned int *>(p - 4));
c2 = *(reinterpret_cast<unsigned short *>(p - 2));
h4 = ((c4 >> 11) ^ c4) & (N4 - 1);
if ((tab4[h4][0] != 0) && (tab4[h4][1] == c4))
{
c = 256;
return (tab4[h4][0]);
}
c = 257;
return (tab2[c2]);
}
C# (It's wrong):
public uint pos(byte p)
{
c4 = (uint)(p - 4);
c2 = (ushort)(p - 2);
h4 = ((c4 >> 11) ^ c4) & (1 << 20 - 1);
if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4)) {
c = 256;
return (tab4[h4, 0]);
}
c = 257;
return (tab2[c2]);
}
I believe in the C# example you could change byte p to byte[], but I am clueless when it comes to casting a byte[] to a single uint value.
Additionally, could anyone please explain to me why you would cast an unsigned char * to an unsigned int *? What purpose does it have?
Any help or push in the right direction would be very useful.
Translation of the problematic lines would be:
int pos(byte[] a, int offset)
{
// Read the four bytes immediately preceding offset
c4 = BitConverter.ToUInt32(a, offset - 4);
// Read the two bytes immediately preceding offset
c2 = BitConverter.ToUInt16(a, offset - 2);
and change the call from x = pos(&buf[i]) (which even in C++ is the same as x = pos(buf + i)) to
x = pos(buf, i);
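Putting those pieces together, a minimal sketch of the whole method could look like this. The tab2, tab4, c and N4 declarations here are assumptions mirroring the C++ globals, and N4 is taken as 1 << 20 to match your own translation:
// Sketch only - the surrounding declarations are assumed, not taken from the original code.
const uint N4 = 1u << 20;
uint[] tab2;
uint[,] tab4;
int c;
uint pos(byte[] a, int offset)
{
    // Read the four bytes immediately preceding offset
    uint c4 = BitConverter.ToUInt32(a, offset - 4);
    // Read the two bytes immediately preceding offset
    ushort c2 = BitConverter.ToUInt16(a, offset - 2);
    uint h4 = ((c4 >> 11) ^ c4) & (N4 - 1);
    if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4))
    {
        c = 256;
        return tab4[h4, 0];
    }
    c = 257;
    return tab2[c2];
}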
An important note is that the existing C++ code is wrong as it violates the strict aliasing rule.
Implementing analogous functionality in C# does not need to involve code that replicates the C++ version statement by statement, especially when the original uses pointers.
Assuming an architecture where int is 32 bits, you could simplify the C# version like this:
uint[] tab2;
uint[,] tab4;
ushort c;
public uint pos(uint c4)
{
var h4 = ((c4 >> 11) ^ c4) & ((1 << 20) - 1);
if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4))
{
c = 256;
return (tab4[h4, 0]);
}
else
{
c = 257;
var c2 = (c4 >> 16) & 0xffff; // HIWORD
return (tab2[c2]);
}
}
This simplification is possible because c4 and c2 overlap: c2 is the high word of c4, and is needed only when the lookup in tab4 does not match.
(The identifier N4 was present in the original code but was replaced in your own translation by the expression 1 << 20.)
The calling code would have to loop over an array of int, which according to comments is possible. While the original C++ code starts at offset 4 and looks back, the C# equivalent would start at offset 0, which seems a more natural thing to do.
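For illustration, a sketch of that calling loop, assuming the data has already been loaded into a uint[] (the ReadInputAsUInts helper is hypothetical, standing in for however you fill the array):
uint[] buf = ReadInputAsUInts(); // hypothetical helper - load the data as 32-bit words
for (int i = 0; i < buf.Length; i++)
{
    uint x = pos(buf[i]); // the simplified pos above takes the 32-bit value directly
    // ... use x ...
}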
In the C++ code you are passing a pointer to char, but C# does not normally work with memory this way; you need an array instead of a pointer.
But you can use the unsafe keyword to work directly with memory.
https://msdn.microsoft.com/en-us/library/chfa2zb8.aspx
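For illustration, a minimal sketch of that unsafe route, reproducing the two reads the C++ code performs via reinterpret_cast (the method name is mine, and the project must be compiled with /unsafe):
public static unsafe void ReadBack(byte[] buf, int offset, out uint c4, out ushort c2)
{
    fixed (byte* p = &buf[offset])
    {
        c4 = *(uint*)(p - 4);   // the four bytes immediately before offset
        c2 = *(ushort*)(p - 2); // the two bytes immediately before offset
    }
}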
Related
I have a byte array of data and I have to extract values from it in the following manner.
data[0] has to extract
id(5 bit)
Sequence(2 bit)
HashAppData(1 bit)
data[1] has to extract
id(6 bit)
offset(2 bit)
The required function is below; the byte array length is 2 and I have to extract the values in the manner above.
public static int ParseData(byte[] data)
{
// All code goes here
}
I couldn't find any suitable solution for how to do this. Can you please show me how to extract it?
EDIT: The extracted values should be integers.
Something like this?
int id = (data[0] >> 3) & 31;        // top 5 bits of data[0]
int sequence = (data[0] >> 1) & 3;   // next 2 bits
int hashAppData = data[0] & 1;       // lowest bit
int id2 = (data[1] >> 2) & 63;       // top 6 bits of data[1]
int offset = data[1] & 3;            // lowest 2 bits
This is how I'd do it for the first byte:
byte value = 155;
byte maskForHighest5 = 128+64+32+16+8;
byte maskForNext2 = 4+2;
byte maskForLast = 1;
byte result1 = (byte)((value & maskForHighest5) >> 3); // shift right 3 bits
byte result2 = (byte)((value & maskForNext2) >> 1); // shift right 1 bit
byte result3 = (byte)(value & maskForLast);
Working demo (.NET Fiddle):
https://dotnetfiddle.net/lNZ9TR
Code for the 2nd byte will be very similar.
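Spelled out for the second byte, using the 6-bit id / 2-bit offset layout from the question, it could be:
byte value2 = data[1]; // assuming the question's data array
byte maskForHighest6 = 128 + 64 + 32 + 16 + 8 + 4;
byte maskForLast2 = 2 + 1;
byte id2 = (byte)((value2 & maskForHighest6) >> 2); // shift right 2 bits
byte offset = (byte)(value2 & maskForLast2);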
If you're uncomfortable with bit manipulation, use an extension method to keep the intent of ParseData clear. This extension can be adapted for other integer types by replacing both uses of byte with the necessary type (a sketch of a uint version follows the usage example below).
public static int GetBitValue(this byte b, int offset, int length)
{
const int ByteWidth = sizeof(byte) * 8;
// System.Diagnostics validation - Excluded in release builds
Debug.Assert(offset >= 0);
Debug.Assert(offset < ByteWidth);
Debug.Assert(length > 0);
Debug.Assert(length <= ByteWidth);
Debug.Assert(offset + length <= ByteWidth);
var shift = ByteWidth - offset - length;
var mask = (1 << length) - 1;
return (b >> shift) & mask;
}
Usage in this case:
public static int ParseData(byte[] data)
{
{ // data[0]
var id = data[0].GetBitValue(0, 5);
var sequence = data[0].GetBitValue(5, 2);
var hashAppData = data[0].GetBitValue(7, 1);
}
{ // data[1]
var id = data[1].GetBitValue(0, 6);
var offset = data[1].GetBitValue(6, 2);
}
// ... return necessary data
}
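As a sketch of the adaptation mentioned above, here is the same method written for uint; only the two uses of the type and the width constant change, and it returns uint so a full 32-bit extraction cannot overflow:
public static uint GetBitValue(this uint value, int offset, int length)
{
    const int UIntWidth = sizeof(uint) * 8;
    Debug.Assert(offset >= 0 && offset < UIntWidth);
    Debug.Assert(length > 0 && length <= UIntWidth);
    Debug.Assert(offset + length <= UIntWidth);
    var shift = UIntWidth - offset - length;
    var mask = (uint)((1UL << length) - 1); // 1UL avoids overflow when length == 32
    return (value >> shift) & mask;
}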
I want to compare a stream of bits of arbitrary length to a mask in c# and return a ratio of how many bits were the same.
The mask to check against is anywhere between 2 bits and 8k bits long (with 90% of the masks being 5 bits long); the input can be anywhere from 2 bits up to ~500k bits, with an average input of 12k bits (but yeah, most of the time it will be comparing 5 bits with the first 5 bits of that 12k).
Now my naive implementation would be something like this:
bool[] mask = new[] { true, true, false, true };
float dendrite(bool[] input) {
int correct = 0;
for ( int i = 0; i<mask.Length; i++ ) {
if ( input[i] == mask[i] )
correct++;
}
return (float)correct/(float)mask.Length;
}
but I expect this is better handled (more efficient) with some kind of binary operator magic?
Anyone got any pointers?
EDIT: the datatype is not fixed at this point in my design, so if ints or bytearrays work better, I'd also be a happy camper, trying to optimize for efficiency here, the faster the computation, the better.
eg if you can make it work like this:
int[] mask = new[] { 1, 1, 0, 1 };
float dendrite(int[] input) {
int correct = 0;
for ( int i = 0; i<mask.Length; i++ ) {
if ( input[i] == mask[i] )
correct++;
}
return (float)correct/(float)mask.Length;
}
or this:
int mask = 13; //1101
float dendrite(int input) {
return // your magic here;
} // would return 0.75 for an input
// of 101 (1100101 in binary), which
// matches 3 bits of the 4-bit mask == .75
ANSWER:
I ran each proposed answer against the others; Fredou's and Marten's solutions ran neck and neck, but Fredou submitted the fastest, leanest implementation in the end. Of course, since the average result varies quite wildly between implementations, I might have to revisit this post later on. :) But that's probably just me messing up in my test script. (I hope; too late now, going to bed =)
sparse1.Cyclone
1317ms 3467107ticks 10000iterations
result: 0,7851563
sparse1.Marten
288ms 759362ticks 10000iterations
result: 0,05066964
sparse1.Fredou
216ms 568747ticks 10000iterations
result: 0,8925781
sparse1.Marten
296ms 778862ticks 10000iterations
result: 0,05066964
sparse1.Fredou
216ms 568601ticks 10000iterations
result: 0,8925781
sparse1.Marten
300ms 789901ticks 10000iterations
result: 0,05066964
sparse1.Cyclone
1314ms 3457988ticks 10000iterations
result: 0,7851563
sparse1.Fredou
207ms 546606ticks 10000iterations
result: 0,8925781
sparse1.Marten
298ms 786352ticks 10000iterations
result: 0,05066964
sparse1.Cyclone
1301ms 3422611ticks 10000iterations
result: 0,7851563
sparse1.Marten
292ms 769850ticks 10000iterations
result: 0,05066964
sparse1.Cyclone
1305ms 3433320ticks 10000iterations
result: 0,7851563
sparse1.Fredou
209ms 551178ticks 10000iterations
result: 0,8925781
(Test script copied here; if I destroyed yours while modifying it, let me know. https://dotnetfiddle.net/h9nFSa )
How about this one - dotnetfiddle example
using System;
namespace ConsoleApplication1
{
public class Program
{
public static void Main(string[] args)
{
int a = Convert.ToInt32("0001101", 2);
int b = Convert.ToInt32("1100101", 2);
Console.WriteLine(dendrite(a, 4, b));
}
private static float dendrite(int mask, int len, int input)
{
return 1 - getBitCount(mask ^ (input & (int.MaxValue >> 32 - len))) / (float)len;
}
private static int getBitCount(int bits)
{
bits = bits - ((bits >> 1) & 0x55555555);
bits = (bits & 0x33333333) + ((bits >> 2) & 0x33333333);
return ((bits + (bits >> 4) & 0xf0f0f0f) * 0x1010101) >> 24;
}
}
}
A 64-bit one here - dotnetfiddle
using System;
namespace ConsoleApplication1
{
public class Program
{
public static void Main(string[] args)
{
// 1
ulong a = Convert.ToUInt64("0000000000000000000000000000000000000000000000000000000000001101", 2);
ulong b = Convert.ToUInt64("1110010101100101011001010110110101100101011001010110010101100101", 2);
Console.WriteLine(dendrite(a, 4, b));
}
private static float dendrite(ulong mask, int len, ulong input)
{
return 1 - getBitCount(mask ^ (input & (ulong.MaxValue >> (64 - len)))) / (float)len;
}
private static ulong getBitCount(ulong bits)
{
bits = bits - ((bits >> 1) & 0x5555555555555555UL);
bits = (bits & 0x3333333333333333UL) + ((bits >> 2) & 0x3333333333333333UL);
return unchecked(((bits + (bits >> 4)) & 0xF0F0F0F0F0F0F0FUL) * 0x101010101010101UL) >> 56;
}
}
}
I came up with this code:
static float dendrite(ulong input, ulong mask)
{
// get bits that are same (0 or 1) in input and mask
ulong samebits = mask & ~(input ^ mask);
// count number of same bits
int correct = cardinality(samebits);
// count number of bits in mask
int inmask = cardinality(mask);
// compute fraction (0.0 to 1.0)
return inmask == 0 ? 0f : correct / (float)inmask;
}
// this is a little hack to count the number of bits set to one in a 64-bit word
static int cardinality(ulong word)
{
const ulong mult = 0x0101010101010101;
const ulong mask1h = (~0UL) / 3 << 1;
const ulong mask2l = (~0UL) / 5;
const ulong mask4l = (~0UL) / 17;
word -= (mask1h & word) >> 1;
word = (word & mask2l) + ((word >> 2) & mask2l);
word += word >> 4;
word &= mask4l;
return (int)((word * mult) >> 56);
}
This will check 64 bits at a time. If you need more than that you can just split the input data into 64-bit words, compare them one by one, and combine the counts into a single ratio (a sketch follows the fiddle link below).
Here's a .NET fiddle with the code and a working test case:
https://dotnetfiddle.net/5hYFtE
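If your data is longer than 64 bits, one way to do the splitting mentioned above is to accumulate the per-word counts and divide once at the end; this gives the same fraction the single-word version computes. A sketch, assuming the input and mask arrays have the same length and word layout:
static float dendrite(ulong[] input, ulong[] mask)
{
    int correct = 0, inmask = 0;
    for (int i = 0; i < mask.Length; i++)
    {
        // same per-word logic as above
        ulong samebits = mask[i] & ~(input[i] ^ mask[i]);
        correct += cardinality(samebits);
        inmask += cardinality(mask[i]);
    }
    return inmask == 0 ? 0f : correct / (float)inmask;
}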
I would change the code to something along these lines:
// hardcoded bitmask
byte mask = 255;
float dendrite(byte input) {
int correct = 0;
// store the xor:ed result
byte xored = (byte)(input ^ mask);
// loop through each bit
for(int i = 0; i < 8; i++) {
// if the bit is 0 then it was correct
if((xored & (1 << i)) == 0)
correct++;
}
return (float)correct/8f; // 8 bits were checked
}
The above uses a mask and input of 8 bits, but of course you could modify this to use a 4 byte integer and so on.
Not sure if this will work as expected, but it might give you some clues on how to proceed.
For example if you only would like to check the first 4 bits you could change the code to something like:
float dendrite(byte input) {
// hardcoded bitmask i.e 1101
byte mask = 13;
// number of bits to check
byte bits = 4;
int correct = 0;
// store the xor:ed result
byte xored = (byte)(input ^ mask);
// loop through each bit, notice that we only checking the first 4 bits
for(int i = 0; i < bits; i++) {
// if the bit is 0 then it was correct
if((xored & (1 << i)) == 0)
correct++;
}
return (float)correct/(float)bits;
}
Of course it might be faster to actually use an int instead of a byte.
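As a sketch of that, the same idea with a 32-bit uint and a configurable bit count (the parameter names are mine):
static float dendrite(uint input, uint mask, int bits)
{
    int correct = 0;
    // bits set in xored mark positions where input and mask differ
    uint xored = input ^ mask;
    // only the first 'bits' positions are checked
    for (int i = 0; i < bits; i++)
    {
        if ((xored & (1u << i)) == 0)
            correct++;
    }
    return (float)correct / bits;
}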
I'm just curious because I guess it will have an impact on performance. Does it consider the full string? If yes, it will be slow on long strings. If it only considers part of the string, it will have bad performance (e.g. if it only considers the beginning of the string, it will perform badly when a HashSet contains mostly strings with the same beginning).
Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target; the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs
public override int GetHashCode() {
#if FEATURE_RANDOMIZED_STRING_HASHING
if(HashHelpers.s_UseRandomizedStringHashing)
{
return InternalMarvin32HashString(this, this.Length, 0);
}
#endif // FEATURE_RANDOMIZED_STRING_HASHING
unsafe {
fixed (char *src = this) {
Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
#if WIN32
int hash1 = (5381<<16) + 5381;
#else
int hash1 = 5381;
#endif
int hash2 = hash1;
#if WIN32
// 32 bit machines.
int* pint = (int *)src;
int len = this.Length;
while (len > 2)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
pint += 2;
len -= 4;
}
if (len > 0)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
}
#else
int c;
char *s = src;
while ((c = s[0]) != 0) {
hash1 = ((hash1 << 5) + hash1) ^ c;
c = s[1];
if (c == 0)
break;
hash2 = ((hash2 << 5) + hash2) ^ c;
s += 2;
}
#endif
#if DEBUG
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
return hash1 + (hash2 * 1566083941);
}
}
}
This is possibly more than you bargained for; I'll annotate the code a bit:
The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off wholesale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework; the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers who do something unwise, like using GetHashCode() to generate hashes for passwords or encryption, out of trouble. It is enabled by an entry in the app.exe.config file.
The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
The loop is unrolled by hand, consuming 4 characters per iteration in the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in an int (32 bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash; it won't be if the length is even. It looks at all the characters in the string, answering your question.
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so it doesn't look at all the characters; embedded zeros are otherwise very uncommon. That's pretty odd; I can only guess that this has something to do with strings potentially being very large, but I can't think of a practical example.
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.
// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
IntPtr arg_0F_0;
IntPtr expr_06 = arg_0F_0 = this;
if (expr_06 != 0)
{
arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
}
char* ptr = arg_0F_0;
int num = 352654597;
int num2 = num;
int* ptr2 = (int*)ptr;
for (int i = this.Length; i > 0; i -= 4)
{
num = ((num << 5) + num + (num >> 27) ^ *ptr2);
if (i <= 2)
{
break;
}
num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
ptr2 += (IntPtr)8 / 4;
}
return num + num2 * 1566083941;
}
String.GetHashCode's behavior depends on the program architecture, so it will return one value on x86 and another value on x64. I have a test application which must run as x86, and it must predict the hash code output of an application which must run as x64.
Below is the disassembly of the String.GetHashCode implementation from mscorwks.
public override unsafe int GetHashCode()
{
fixed (char* text1 = ((char*) this))
{
char* chPtr1 = text1;
int num1 = 0x15051505;
int num2 = num1;
int* numPtr1 = (int*) chPtr1;
for (int num3 = this.Length; num3 > 0; num3 -= 4)
{
num1 = (((num1 << 5) + num1) + (num1 >> 0x1b)) ^ numPtr1[0];
if (num3 <= 2)
{
break;
}
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr1[1];
numPtr1 += 2;
}
return (num1 + (num2 * 0x5d588b65));
}
}
Can anybody port this function to a safe implementation??
Hash codes are not intended to be repeatable across platforms, or even multiple runs of the same program on the same system. You are going the wrong way. If you don't change course, your path will be difficult and one day it may end in tears.
What is the real problem you want to solve? Would it be possible to write your own hash function, either as an extension method or as the GetHashCode implementation of a wrapper class and use that one instead?
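For example, if a stable hash is acceptable, a simple FNV-1a over the characters works as an extension method. This is a sketch of the idea, not the framework's algorithm, and its values will not match String.GetHashCode():
public static int GetStableHashCode(this string s)
{
    unchecked
    {
        // 32-bit FNV-1a over the UTF-16 code units; stable across runs and architectures
        uint hash = 2166136261;
        foreach (char ch in s)
        {
            hash = (hash ^ ch) * 16777619;
        }
        return (int)hash;
    }
}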
First off, Jon is correct; this is a fool's errand. The internal debug builds of the framework that we use to "eat our own dogfood" change the hash algorithm every day precisely to prevent people from building systems -- even test systems -- that rely on unreliable implementation details that are documented as subject to change at any time.
Rather than enshrining an emulation of a system that is documented as being not suitable for emulation, my recommendation would be to take a step back and ask yourself why you're trying to do something this dangerous. Is it really a requirement?
Second, StackOverflow is a technical question and answer site, not a "do my job for me for free" site. If you are hell bent on doing this dangerous thing and you need someone who can rewrite unsafe code into equivalent safe code then I recommend that you hire someone who can do that for you.
While all of the warnings given here are valid, they don't answer the question. I had a situation in which GetHashCode() was unfortunately already being used for a persisted value in production, and I had no choice but to re-implement using the default .NET 2.0 32-bit x86 (little-endian) algorithm. I re-coded without unsafe as shown below, and this appears to be working. Hope this helps someone.
// The GetStringHashCode() extension method is equivalent to the Microsoft .NET Framework 2.0
// String.GetHashCode() method executed on 32 bit systems.
public static int GetStringHashCode(this string value)
{
int hash1 = (5381 << 16) + 5381;
int hash2 = hash1;
int len = value.Length;
int intval;
int c0, c1;
int i = 0;
while (len > 0)
{
c0 = (int)value[i];
c1 = len > 1 ? (int)value[i + 1] : 0;
intval = c0 | (c1 << 16);
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ intval;
if (len <= 2)
{
break;
}
i += 2;
c0 = (int)value[i];
c1 = len > 3 ? (int)value[i + 1] : 0;
intval = c0 | (c1 << 16);
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ intval;
len -= 4;
i += 2;
}
return hash1 + (hash2 * 1566083941);
}
The following exactly reproduces the default String hash codes on .NET 4.7 (and probably earlier). This is the hash code given by:
Default on a String instance: "abc".GetHashCode()
StringComparer.Ordinal.GetHashCode("abc")
Various String methods that take StringComparison.Ordinal enumeration.
System.Globalization.CompareInfo.GetStringComparer(CompareOptions.Ordinal)
Testing on release builds with full JIT optimization, these versions modestly outperform the built-in .NET code, and have also been heavily unit-tested for exact equivalence with .NET behavior. Notice there are separate versions for x86 versus x64. Your program should generally include both; below the respective code listings is a calling harness which selects the appropriate version at runtime.
x86 - (.NET running in 32-bit mode)
static unsafe int GetHashCode_x86_NET(int* p, int c)
{
int h1, h2 = h1 = 0x15051505;
while (c > 2)
{
h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;
h2 = ((h2 << 5) + h2 + (h2 >> 27)) ^ *p++;
c -= 4;
}
if (c > 0)
h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;
return h1 + (h2 * 0x5d588b65);
}
x64 - (.NET running in 64-bit mode)
static unsafe int GetHashCode_x64_NET(Char* p)
{
int h1, h2 = h1 = 5381;
while (*p != 0)
{
h1 = ((h1 << 5) + h1) ^ *p++;
if (*p == 0)
break;
h2 = ((h2 << 5) + h2) ^ *p++;
}
return h1 + (h2 * 0x5d588b65);
}
Calling harness / extension method for either platform (x86/x64):
readonly static int _hash_sz = IntPtr.Size == 4 ? 0x2d2816fe : 0x162a16fe;
public static unsafe int GetStringHashCode(this String s)
{
// Note: the x64 string hash ignores the remainder after an embedded '\0' char (unlike x86)
if (s.Length == 0 || (IntPtr.Size == 8 && s[0] == '\0'))
return _hash_sz;
fixed (char* p = s)
return IntPtr.Size == 4 ?
GetHashCode_x86_NET((int*)p, s.Length) :
GetHashCode_x64_NET(p);
}
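A minimal equivalence check could look like the following, assuming randomized string hashing is not enabled for the process (otherwise the built-in values will differ by design):
static void CheckAgainstBuiltIn(string s)
{
    int builtIn = s.GetHashCode();
    int reimplemented = s.GetStringHashCode();
    Console.WriteLine("{0}: built-in {1}, reimplemented {2}, match: {3}",
        s, builtIn, reimplemented, builtIn == reimplemented);
}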
I have two bytes. I need to turn them into two integers where the first 12 bits make one int and the last 4 make the other. I figure I can AND the 2nd byte with 0x0f to get the 4 bits, but I'm not sure how to make that into a byte with the correct sign.
update:
just to clarify I have 2 bytes
byte1 = 0xab
byte2 = 0xcd
and I need to do something like this with it
var value = 0xabc * 10 ^ 0xd;
sorry for the confusion.
thanks for all of the help.
int a = 10;
int a1 = a&0x000F;
int a2 = a&0xFFF0;
Try to use this code.
For kicks:
public static partial class Levitate
{
public static Tuple<int, int> UnPack(this int value)
{
uint sign = (uint)value & 0x80000000;
int small = ((int)sign >> 28) | (value & 0x0F);
int big = value & 0xFFF0;
return new Tuple<int, int>(small, big);
}
}
int a = 10;
a.UnPack();
OK, let's try this again knowing what we're shooting for. I tried the following out in VS2008 and it seems to work fine; that is, both outOne and outTwo are -1 at the end. Is that what you're looking for?
byte b1 = 0xff;
byte b2 = 0xff;
ushort total = (ushort)((b1 << 8) + b2);
short outOne = (short)((short)(total & 0xFFF0) >> 4);
sbyte outTwo = (sbyte)((sbyte)((total & 0xF) << 4) >> 4);
Assuming you have the following two bytes:
byte a = 0xab;
byte b = 0xcd;
and consider 0xab the first 8 bits and 0xcd the second 8 bits, or 0xabc the first 12 bits and 0xd the last four bits. Then you can get these bits as follows:
int x = (a << 4) | (b >> 4); // x == 0x0abc
int y = b & 0x0f; // y == 0x000d
Edited to take into account the clarification of the "signing" rules:
public void unpack( byte[] octets , out int hiNibbles , out int loNibble )
{
if ( octets == null ) throw new ArgumentNullException("octets");
if ( octets.Length != 2 ) throw new ArgumentException("octets") ;
int value = (int) BitConverter.ToInt16( octets , 0 ) ;
// since the value is signed, right shifts sign-extend
hiNibbles = value >> 4 ;
loNibble = ( value << 28 ) >> 28 ;
return ;
}
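For example (note that this routine reads the pair as a little-endian 16-bit value):
byte[] octets = { 0xab, 0xcd };
int hiNibbles, loNibble;
unpack(octets, out hiNibbles, out loNibble);
// hiNibbles now holds the top 12 bits of the 16-bit value, sign-extended;
// loNibble holds the low 4 bits, sign-extended.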