I've implemented a method for parsing an unsigned integer string of length <= 8 using SIMD intrinsics available in .NET as follows:
public unsafe static uint ParseUint(string text)
{
fixed (char* c = text)
{
var parsed = Sse3.LoadDquVector128((byte*) c);
var shift = (8 - text.Length) * 2;
var shifted = Sse2.ShiftLeftLogical128BitLane(parsed,
(byte) (shift));
Vector128<byte> digit0 = Vector128.Create((byte) '0');
var reduced = Sse2.SubtractSaturate(shifted, digit0);
var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);
var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);
var e1 = collapsed3.GetElement(2);
var e2 = collapsed3.GetElement(3);
return (uint) (e1 * 10000 + e2);
}
}
Sadly, a comparison with a baseline uint.Parse() gives the following, rather unimpressive, result:
| Method    |      Mean |     Error |    StdDev |
|-----------|----------:|----------:|----------:|
| Baseline  | 15.157 ns | 0.0325 ns | 0.0304 ns |
| ParseSimd |  3.269 ns | 0.0115 ns | 0.0102 ns |
What are some of the ways the above code can be improved? My particular areas of concern are:
The way a bit shift of the SIMD register happens with a calculation involving text.Length
~~The unpacking of UTF-16 data using a MultiplyAddAdjacent involving a vector of 0s and 1~~
The way elements are extracted using GetElement() -- maybe there's some ToScalar() call that can happen somewhere?
I've made some optimizations.
public unsafe uint ParseUint2(string text)
{
fixed (char* c = text)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
raw = Sse2.ShiftLeftLogical128BitLane(raw, (byte)((8 - text.Length) << 1));
Vector128<ushort> digit0 = Vector128.Create('0');
raw = Sse2.SubtractSaturate(raw, digit0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
res = Ssse3.HorizontalAdd(res, res);
res = Ssse3.HorizontalAdd(res, res);
return (uint)res.GetElement(0);
}
}
Reduced the number of type conversions, and the final calculation is done with vphaddd. As a result it's ~10% faster.
But... imm8 must be a compile-time constant. That means you can't pass a variable where an imm8 argument is expected; otherwise the JIT compiler won't emit the intrinsic instruction for the operation and instead makes an external method call at that point (maybe there's a workaround). Thanks @PeterCordes for the help.
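To illustrate the pitfall (a hypothetical helper, not part of the code above): if the count reaching an imm8 parameter is only known at run time, the single-instruction form is lost.
// Hypothetical example: 'count' is a run-time value, not a constant.
static Vector128<ushort> ShiftLeftBytes(Vector128<ushort> v, int count)
{
    // imm8 is not a JIT-time constant here, so instead of a single pslldq
    // the JIT emits a fallback (an external method call), which is slow.
    return Sse2.ShiftLeftLogical128BitLane(v, (byte)count);
}
That is exactly why the switch over text.Length below helps, ugly as it is: every case passes a literal constant.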
This monster is faster than the one above, though not significantly, regardless of text.Length.
public unsafe uint ParseUint3(string text)
{
fixed (char* c = text)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
switch (text.Length)
{
case 0: raw = Vector128<ushort>.Zero; break;
case 1: raw = Sse2.ShiftLeftLogical128BitLane(raw, 14); break;
case 2: raw = Sse2.ShiftLeftLogical128BitLane(raw, 12); break;
case 3: raw = Sse2.ShiftLeftLogical128BitLane(raw, 10); break;
case 4: raw = Sse2.ShiftLeftLogical128BitLane(raw, 8); break;
case 5: raw = Sse2.ShiftLeftLogical128BitLane(raw, 6); break;
case 6: raw = Sse2.ShiftLeftLogical128BitLane(raw, 4); break;
case 7: raw = Sse2.ShiftLeftLogical128BitLane(raw, 2); break;
};
Vector128<ushort> digit0 = Vector128.Create('0');
raw = Sse2.SubtractSaturate(raw, digit0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
res = Ssse3.HorizontalAdd(res, res);
res = Ssse3.HorizontalAdd(res, res);
return (uint)res.GetElement(0);
}
}
Again, @PeterCordes doesn't let me write slow code. The following version has two improvements. First, the string is now loaded already shifted into place, and a mask shifted by the same offset is subtracted from it; this avoids the slow fallback for ShiftLeftLogical128BitLane with a variable count.
The second improvement is replacing vphaddd with pshufd + paddd.
// Note that this loads up to 14 bytes before the data part of the string. (Or 16 for an empty string)
// This might or might not make it possible to read from an unmapped page and fault, beware.
public unsafe uint ParseUint4(string text)
{
const string mask = "\xffff\xffff\xffff\xffff\xffff\xffff\xffff\xffff00000000";
fixed (char* c = text, m = mask)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c - 8 + text.Length);
Vector128<ushort> mask0 = Sse3.LoadDquVector128((ushort*)m + text.Length);
raw = Sse2.SubtractSaturate(raw, mask0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
Vector128<int> shuf = Sse2.Shuffle(res, 0x1b); // 0 1 2 3 => 3 2 1 0
res = Sse2.Add(shuf, res);
shuf = Sse2.Shuffle(res, 0x41); // 0 1 2 3 => 1 0 3 2
res = Sse2.Add(shuf, res);
return (uint)res.GetElement(0);
}
}
~Twice as fast as the initial solution. (o_O) At least on my Haswell i7.
C# (thanks @aepot)
public unsafe uint ParseUint(string text)
{
fixed (char* c = text)
{
Vector128<byte> mul1 = Vector128.Create(0x14C814C8, 0x010A0A64, 0, 0).AsByte();
Vector128<short> mul2 = Vector128.Create(0x00FA61A8, 0x0001000A, 0, 0).AsInt16();
Vector128<long> shift_amount = Sse2.ConvertScalarToVector128Int32((8 - text.Length) << 3).AsInt64();
Vector128<short> vs = Sse2.LoadVector128((short*)c);
Vector128<byte> vb = Sse2.PackUnsignedSaturate(vs, vs);
vb = Sse2.SubtractSaturate(vb, Vector128.Create((byte)'0'));
vb = Sse2.ShiftLeftLogical(vb.AsInt64(), shift_amount).AsByte();
Vector128<int> v = Sse2.MultiplyAddAdjacent(Ssse3.MultiplyAddAdjacent(mul1, vb.AsSByte()), mul2);
v = Sse2.Add(Sse2.Add(v, v), Sse2.Shuffle(v, 1));
return (uint)v.GetElement(0);
}
}
C solution using SSSE3:
#include <uchar.h> // char16_t
#include <tmmintrin.h> // pmaddubsw
unsigned ParseUint(char16_t* ptr, size_t len) {
const __m128i mul1 = _mm_set_epi32(0, 0, 0x010A0A64, 0x14C814C8);
const __m128i mul2 = _mm_set_epi32(0, 0, 0x0001000A, 0x00FA61A8);
const __m128i shift_amount = _mm_cvtsi32_si128((8 - len) * 8);
__m128i v = _mm_loadu_si128((__m128i*)ptr); // unsafe chunking
v = _mm_packus_epi16(v,v); // convert digits from UTF16-LE to ASCII
v = _mm_subs_epu8(v, _mm_set1_epi8('0'));
v = _mm_sll_epi64(v, shift_amount); // shift off non-digit trash
// convert
v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
v = _mm_add_epi32(_mm_add_epi32(v,v), _mm_shuffle_epi32(v, 1));
return (unsigned)_mm_cvtsi128_si32(v);
}
Regardless of how one shifts/aligns the string (see aepot's answer), we want to stay away from pmulld: SSE basically only has 16-bit integer multiplication, and the 32-bit multiply has double the latency and uops. However, care must be taken around the sign-extension behavior of pmaddubsw and pmaddwd.
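For reference, here is how the two multiplier constants decompose, assuming the (leading-zero-padded) digits d0 (most significant) through d7 (ones place) sit in bytes 0 through 7 after the shift:
// pmaddubsw(mul1, v) pairs adjacent digit bytes:
//   word0 = 200*d0 + 20*d1        word1 = 200*d2 + 20*d3
//   word2 = 100*d4 + 10*d5        word3 =  10*d6 +  1*d7
// pmaddwd(..., mul2) then pairs adjacent words:
//   dword0 = 25000*word0 + 250*word1 = 5000000*d0 + 500000*d1 + 50000*d2 + 5000*d3
//   dword1 =    10*word2 +   1*word3 =    1000*d4 +    100*d5 +    10*d6 +       d7
// The final 2*dword0 + dword1 yields d0*10^7 + ... + d7*10^0; the doubling is what
// keeps the largest word weight at 25000, inside the range pmaddwd treats as positive.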
using scalar x64:
// untested && I don't know C#
public unsafe static uint ParseUint(string text)
{
fixed (char* c = text)
{
var xmm = Sse2.LoadVector128((ushort*)c); // unsafe chunking
var packed = Sse2.PackSignedSaturate(xmm,xmm); // convert digits from UTF16-LE to ASCII
ulong val = Sse2.X64.ConvertToUInt64(packed.AsUInt64()); // extract low 64 bits to scalar
val -= 0x3030303030303030; // subtract '0' from each digit
val <<= ((8 - text.Length) * 8); // shift off non-digit trash
// convert
const ulong mask = 0x000000FF000000FF;
const ulong mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
const ulong mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
val = (val * 10) + (val >> 8);
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
return (uint)val;
}
}
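To see how the scalar masks and multipliers compose the digits, here is a quick trace for "12345678" (after the pack, subtract and shift, the bytes are 01 02 ... 08 from the low byte up, i.e. val = 0x0807060504030201):
// val = (val * 10) + (val >> 8)
//   -> each byte now holds a two-digit pair: 12, 23, 34, 45, 56, 67, 78, 80
// (val & mask)         = 12 + 56 * 2^32   (bytes 0 and 4: pairs "12" and "56")
// ((val >> 16) & mask) = 34 + 78 * 2^32   (bytes 2 and 6: pairs "34" and "78")
// (val & mask) * mul1          = 1200 + 12005600 * 2^32
// ((val >> 16) & mask) * mul2  =   34 +   340078 * 2^32
// sum = 1234 + 12345678 * 2^32, and the final >> 32 leaves 12345678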
What if we don't know the length of the number ahead of time?
// C pseudocode & assumes ascii text
uint64_t v, m, len;
v = unaligned_load_little_endian_u64(p);
m = v + 0x4646464646464646; // roll '9' to 0x7F
v -= 0x3030303030303030; // unpacked binary coded decimal
m = (m | v) & 0x8080808080808080; // detect first non-digit
m = _tzcnt_u64(m >> 7); // count run of digits
if (((uint8_t)v) > 9) return error_not_a_number;
v <<= 64 - m; // shift off any "trailing" chars that are not digits
p += m >> 3; // consume bytes
v = parse_8_chars(v);
Or if we have a list of strings to process:
// assumes ascii text
__m256i parse_uint_x4(void* base_addr, __m256i offsets_64x4)
{
const __m256i x00 = _mm256_setzero_si256();
const __m256i x0A = _mm256_set1_epi8(0x0A);
const __m256i x30 = _mm256_set1_epi8(0x30);
const __m256i x08 = _mm256_and_si256(_mm256_srli_epi32(x30, 1), x0A);
const __m256i mul1 = _mm256_set1_epi64x(0x010A0A6414C814C8);
const __m256i mul2 = _mm256_set1_epi64x(0x0001000A00FA61A8);
__m256i v, m;
// process 4 strings at once, up to 8 digits in each string...
// (the 64-bit chunks could be manually loaded using 3 shuffles)
v = _mm256_i64gather_epi64((long long*)base_addr, offsets_64x4, 1);
// rebase digits from 0x30..0x39 to 0x00..0x09
v = _mm256_xor_si256(v, x30);
// range check
// (unsigned gte compare)
v = _mm256_min_epu8(v, x0A);
m = _mm256_cmpeq_epi8(x0A, v);
// mask of lowest non-digit and above
m = _mm256_or_si256(m, _mm256_sub_epi64(x00, m));
// align the end of the digit-string to the top of the u64 lane
// (shift off masked bytes and insert leading zeros)
m = _mm256_sad_epu8(_mm256_and_si256(m, x08), x00);
v = _mm256_sllv_epi64(v, m);
// convert to binary
// (the `add(v,v)` allow us to keep `mul2` unsigned)
v = _mm256_madd_epi16(_mm256_maddubs_epi16(mul1, v), mul2);
v = _mm256_add_epi32(_mm256_shuffle_epi32(v, 0x31), _mm256_add_epi32(v,v));
// zero the hi-dwords of each qword
v = _mm256_blend_epi32(v, x00, 0xAA);
return v;
}
First of all, 5x improvement is not “rather unimpressive”.
I would not do the last step with scalar code, here’s an alternative:
// _mm_shuffle_epi32( x, _MM_SHUFFLE( 3, 3, 2, 2 ) )
collapsed3 = Sse2.Shuffle( collapsed3, 0xFA );
// _mm_mul_epu32
var collapsed4 = Sse2.Multiply( collapsed3.As<int, uint>(), Vector128.Create( 10000u, 0, 1, 0 ) ).As<ulong, uint>();
// _mm_add_epi32( x, _mm_srli_si128( x, 8 ) )
collapsed4 = Sse2.Add( collapsed4, Sse2.ShiftRightLogical128BitLane( collapsed4, 8 ) );
return collapsed4.GetElement( 0 );
A C++ version is going to be way faster than what happens on my PC (.NET Core 3.1): the generated code is not good. The JIT initializes constants like this:
00007FFAD10B11B6 xor ecx,ecx
00007FFAD10B11B8 mov dword ptr [rsp+20h],ecx
00007FFAD10B11BC mov dword ptr [rsp+28h],64h
00007FFAD10B11C4 mov dword ptr [rsp+30h],1
00007FFAD10B11CC mov dword ptr [rsp+38h],64h
00007FFAD10B11D4 mov dword ptr [rsp+40h],1
It uses stack memory instead of another vector register. It looks like the JIT developers forgot there are 16 vector registers there; the complete function only uses xmm0.
00007FFAD10B1230 vmovapd xmmword ptr [rbp-0C0h],xmm0
00007FFAD10B1238 vmovapd xmm0,xmmword ptr [rbp-0C0h]
00007FFAD10B1240 vpsrldq xmm0,xmm0,8
00007FFAD10B1245 vpaddd xmm0,xmm0,xmmword ptr [rbp-0C0h]
Related question:
I am porting a library from C++ to C# but have come across a scenario I am unsure of how to resolve, which involves casting an unsigned char * to an unsigned int *.
C++
unsigned int c4;
unsigned int c2;
unsigned int h4;
int pos(unsigned char *p)
{
c4 = *(reinterpret_cast<unsigned int *>(p - 4));
c2 = *(reinterpret_cast<unsigned short *>(p - 2));
h4 = ((c4 >> 11) ^ c4) & (N4 - 1);
if ((tab4[h4][0] != 0) && (tab4[h4][1] == c4))
{
c = 256;
return (tab4[h4][0]);
}
c = 257;
return (tab2[c2]);
}
C# (It's wrong):
public uint pos(byte p)
{
c4 = (uint)(p - 4);
c2 = (ushort)(p - 2);
h4 = ((c4 >> 11) ^ c4) & (1 << 20 - 1);
if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4)) {
c = 256;
return (tab4[h4, 0]);
}
c = 257;
return (tab2[c2]);
}
I believe in the C# example you could change byte p to byte[], but I am clueless when it comes to casting byte[] to a single uint value.
Additionally, could anyone please explain why you would cast an unsigned char * to an unsigned int *? What purpose does it have?
Any help or push in the right direction would be very useful.
Translation of the problematic lines would be:
int pos(byte[] a, int offset)
{
// Read the four bytes immediately preceding offset
c4 = BitConverter.ToUInt32(a, offset - 4);
// Read the two bytes immediately preceding offset
c2 = BitConverter.ToUInt16(a, offset - 2);
and change the call from x = pos(&buf[i]) (which even in C++ is the same as x = pos(buf + i)) to
x = pos(buf, i);
An important note is that the existing C++ code is wrong as it violates the strict aliasing rule.
Implementing analogous functionality in C# does not need to involve code that replicates the C version on a statement-by-statement basis, especially when the original is using pointers.
Assuming an architecture where int is 32 bits, you could simplify the C# version like this:
uint[] tab2;
uint[,] tab4;
ushort c;
public uint pos(uint c4)
{
var h4 = ((c4 >> 11) ^ c4) & ((1 << 20) - 1); // N4 - 1, with N4 = 1 << 20
if ((tab4[h4, 0] != 0) && (tab4[h4, 1] == c4))
{
c = 256;
return (tab4[h4, 0]);
}
else
{
c = 257;
var c2 = (c4 >> 16) & 0xffff; // HIWORD
return (tab2[c2]);
}
}
This simplification is possible because c4 and c2 overlap: c2 is the high word of c4, and is needed only when the lookup in tab4 does not match.
(The identifier N4 was present in the original code but was replaced in your own translation by the expression 1 << 20.)
The calling code would have to loop over an array of int, which according to comments is possible. While the original C++ code starts at offset 4 and looks back, the C# equivalent would start at offset 0, which seems a more natural thing to do.
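To make the c4/c2 overlap mentioned above concrete, a small sketch (assuming a little-endian machine, which is what the original pointer casts rely on anyway):
byte[] a = { 0x11, 0x22, 0x33, 0x44 };
uint   c4 = BitConverter.ToUInt32(a, 0);       // 0x44332211
ushort c2 = BitConverter.ToUInt16(a, 2);       // 0x4433
Console.WriteLine(c2 == (ushort)(c4 >> 16));   // True: c2 is the high word of c4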
In the C++ code you are passing a pointer to char, but C# does not normally work with memory this way; you need an array instead of a pointer.
But you can use unsafe keyword to work directly with memory.
https://msdn.microsoft.com/en-us/library/chfa2zb8.aspx
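For completeness, a minimal unsafe sketch of what the C++ cast does (assuming offset >= 4 and a little-endian host; the method and parameter names are made up for the example):
public static unsafe uint ReadUInt32Before(byte[] buf, int offset)
{
    fixed (byte* p = buf)
    {
        // Equivalent of *(unsigned int*)(p - 4) in the C++ code: reinterpret the
        // four bytes immediately preceding 'offset' as a little-endian uint.
        return *(uint*)(p + offset - 4);
    }
}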
I have an array of audio data, which is a lot of Int32 numbers represented as an array of bytes (each 4-byte group represents an Int32), and I want to do some manipulation on the data (for example, add 10 to each Int32).
I converted the bytes to Int32, did the manipulation, and converted back to bytes, as in this example:
//byte[] buffer;
for (int i=0; i<buffer.Length; i+=4)
{
Int32 temp0 = BitConverter.ToInt32(buffer, i);
temp0 += 10;
byte[] temp1 = BitConverter.GetBytes(temp0);
for (int j=0;j<4;j++)
{
buffer[i + j] = temp1[j];
}
}
But I would like to know if there is a better way to do such manipulation.
You can check the .NET Reference Source for pointers (grin) on how to convert from/to big endian.
class intFromBigEndianByteArray {
public byte[] b;
public int this[int i] {
get {
i <<= 2; // i *= 4; // optional
return (int)b[i] << 24 | (int)b[i + 1] << 16 | (int)b[i + 2] << 8 | b[i + 3];
}
set {
i <<= 2; // i *= 4; // optional
b[i ] = (byte)(value >> 24);
b[i + 1] = (byte)(value >> 16);
b[i + 2] = (byte)(value >> 8);
b[i + 3] = (byte)value;
}
}
}
and sample use:
byte[] buffer = { 127, 255, 255, 255, 255, 255, 255, 255 };//big endian { int.MaxValue, -1 }
//bool check = BitConverter.IsLittleEndian; // true
//int test = BitConverter.ToInt32(buffer, 0); // -129 (incorrect because little endian)
var fakeIntBuffer = new intFromBigEndianByteArray() { b = buffer };
fakeIntBuffer[0] += 2; // { 128, 0, 0, 1 } = big endian int.MinValue + 1
fakeIntBuffer[1] += 2; // { 0, 0, 0, 1 } = big endian 1
Debug.Print(string.Join(", ", buffer)); // "128, 0, 0, 1, 0, 0, 0, 1"
For better performance you can look into parallel processing and SIMD instructions - Using SSE in C#
For even better performance, you can look into Utilizing the GPU with c#
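As a rough sketch of the SIMD route (assuming the buffer length is a multiple of 4 and little-endian samples; the method name is made up for the example), the byte array can be reinterpreted as ints and processed with System.Numerics.Vector<int>:
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static void AddToAllInt32(byte[] buffer, int delta)
{
    // View the raw bytes as little-endian Int32 values, no copying involved.
    Span<int> ints = MemoryMarshal.Cast<byte, int>(buffer);

    var vDelta = new Vector<int>(delta);
    int i = 0;
    // Process Vector<int>.Count integers per iteration (typically 4 or 8).
    for (; i <= ints.Length - Vector<int>.Count; i += Vector<int>.Count)
    {
        var v = new Vector<int>(ints.Slice(i));
        (v + vDelta).CopyTo(ints.Slice(i));
    }
    // Scalar tail for any remaining elements.
    for (; i < ints.Length; i++)
        ints[i] += delta;
}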
How about the following approach:
struct My
{
public int Int;
}
var bytes = Enumerable.Range(0, 20).Select(n => (byte)(n + 240)).ToArray();
foreach (var b in bytes) Console.Write("{0,-4}", b);
// Pin the managed memory
GCHandle handle = GCHandle.Alloc(bytes, GCHandleType.Pinned);
for (int i = 0; i < bytes.Length; i += 4)
{
// Copy the data
My my = (My)Marshal.PtrToStructure<My>(handle.AddrOfPinnedObject() + i);
my.Int += 10;
// Copy back
Marshal.StructureToPtr(my, handle.AddrOfPinnedObject() + i, true);
}
// Unpin
handle.Free();
foreach (var b in bytes) Console.Write("{0,-4}", b);
I made it just for fun.
Not sure it's any less ugly. I don't know whether it will be faster; test it.
I have a byte array of size 16384 (SourceArray). I want to split it into 4 parts: 4 arrays of size 4096. Each 32 bits from SourceArray will be divided into 10 bits, 10 bits, 11 bits and 1 bit (10+10+11+1 = 32 bits), taking the first 10 bits, then the next 10 bits, then 11 bits and then 1 bit.
Let's say, for example, that from the first 4 bytes of SourceArray I get the value 9856325. Now 9856325 (dec) = 00000000100101100110010101000101 (binary), which splits into 0000000010 (10 bits), 0101100110 (10 bits), 01010100010 (11 bits), 1 (1 bit). I cannot store 10, 10 or 11 bits in a byte array, so I'm storing those in Int16 arrays; the 1 bit goes into a byte array.
I hope the problem statement is clear. I've created some logic:
Int16[] Array1 = new Int16[4096];
Int16[] Array2 = new Int16[4096];
Int16[] Array3 = new Int16[4096];
byte[] Array4 = new byte[4096];
byte[] SourceArray = File.ReadAllBytes(@"Reading an image file from directory");
//Size of source Array = 16384
for (int i = 0; i < 4096; i++)
{
uint temp = BitConverter.ToUInt32(SourceArray, i * 4);
uint t = temp;
Array1[i] = (Int16)(t >> 22);
t = temp << 10;
Array2[i] = (Int16)(t >> 22);
t = temp << 20;
Array3[i] = (Int16)(t >> 21);
t = temp << 31;
Array4[i] = (byte)(t>>31);
}
But I'm not getting the expected results. I might be missing something in my logic; please check what's wrong, or if you have better logic, please suggest it.
In my opinion it's much easier to process the bits starting from the last (least significant) bit:
uint t = BitConverter.ToUInt32(SourceArray, i * 4);
Array4[i] = (byte) (t & 0x1);
t >>= 1;
Array3[i] = (Int16) (t & 0x7ff); // 0x7ff - mask for 11 bits
t >>= 11;
Array2[i] = (Int16) (t & 0x3ff); // 0x3ff - mask for 10 bits
t >>= 10;
Array1[i] = (Int16) t; // we don't need mask here
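Checking this against your example value 9856325 (a quick trace of the block above):
// t = 9856325 = 00000000 10010110 01100101 01000101 (binary)
// Array4[i] = t & 0x1           = 1                  (the final 1-bit group)
// Array3[i] = (t >> 1)  & 0x7ff = 674 = 01010100010  (11 bits)
// Array2[i] = (t >> 12) & 0x3ff = 358 = 0101100110   (10 bits)
// Array1[i] =  t >> 22          = 2   = 0000000010   (10 bits)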
Or in reverse order of bit groups:
uint t = BitConverter.ToUInt32(SourceArray, i * 4);
Array1[i] = (Int16) (t & 0x3ff); // 0x3ff - mask for 10 bits
t >>= 10;
Array2[i] = (Int16) (t & 0x3ff); // 0x3ff - mask for 10 bits
t >>= 10;
Array3[i] = (Int16) (t & 0x7ff); // 0x7ff - mask for 11 bits
t >>= 11;
Array4[i] = (byte) t; // we don't need a mask here
I'm having some difficulty reversing the following calculation, which is done before storing data to a device.
ss = enum + 16
u32ts = number << 8
u32timestamp = ss+u32ts
enum and number are the two values I'm trying to get back, but I'm unaware of what either of them is when I start with u32timestamp.
What I have tried to do is
uint temp = u32timestamp;
number = 0;
if (u32timestamp > 100)
{
number = (u32timestamp >> 8 & 8 );
temp = u32timestamp - number ;
}
enum = temp - 16;
But I keep getting incorrect values back. Please help me fix this. enum is always between 16 and 21 but number can be any positive value.
// sample a and b
int a = 5, b = 7;
int ss = a + 16;
int u32ts = b << 8;
int u32timestamp = ss + u32ts;
// reversing...
int rev_b = u32timestamp >> 8;
int rev_ss = u32timestamp - (rev_b << 8);
int rev_a = rev_ss - 16;
var a = u32timestamp & 0xFF;
var number = u32timestamp >> 8;
var en = a - 16;
The encoding places "number" shifted eight bits to the left, so a regular right shift by 8 recovers it by discarding the low bits. Then we mask off the low byte and subtract 16 to recover en.
public void Decode()
{
uint u32timestamp = 31778;
var number = u32timestamp >> 8;
var temp = u32timestamp - (number << 8);
var en = temp - 16;
}
I need to set and clear some bits in bytes of a byte array.
Setting works fine but clearing doesn't compile (I guess because the negation has int as its result, which is negative by then, and ...).
public const byte BIT_1 = 0x01;
public const byte BIT_2 = 0x02;
byte[] bytes = new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
// set 1st bit
for (int i = 0; i < bytes.Length; i++)
bytes[i] |= BIT_1;
// clear 2nd bit (set it to 0)
for (int i = 0; i < bytes.Length; i++)
bytes[i] &= (~BIT_2);
How can I clear bits in a byte in a preferably simple way ?
Try this:
int i = 0;
// Set 5th bit
i |= 1 << 5;
Console.WriteLine(i); // 32
// Clear 5th bit
i &= ~(1 << 5);
Console.WriteLine(i); // 0
Version on bytes (and on bytes only in this case):
byte b = 0;
// Set 5th bit
b |= 1 << 5;
Console.WriteLine(b);
// Clear 5th bit
b &= byte.MaxValue ^ (1 << 5);
Console.WriteLine(b);
It appears that you can't do this without converting the byte to an int first.
byte ClearBit(byte b, int i)
{
int x = Convert.ToInt32(b);
x &= ~(1 << i);
return Convert.ToByte(x);
}
This can be avoided by converting to an int and back again:
bytes[i] = (byte)(bytes[i] & ~(BIT_2));
Bitwise operations are not defined on bytes, and using them with bytes will always return an int. So when you use the bitwise NOT on your byte, it will return an int, and as such yield an error when you are trying to AND it with your other byte.
To solve this, you need to explicitly cast the result to a byte again.
byte bit = 0x02;
byte x = 0x0;
// set
x |= bit;
Console.WriteLine(x); // 2
// clear
x &= (byte)~bit;
Console.WriteLine(x); // 0