HashSet for byte arrays [duplicate] - c#

This question already has answers here:
How to create a HashSet<List<Int>> with distinct elements?
(5 answers)
Closed 4 years ago.
I need a HashSet for byte arrays in order to check if a given byte array exists in the collection. But it seems like this doesn't work for byte arrays (or perhaps any array).
Here is my test code:
void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };
    HashSet<byte[]> set = new HashSet<byte[]>();
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // returns 2 instead of the expected 1
}
Is there a way to make a HashSet for byte arrays?

Construct the HashSet with an IEqualityComparer<byte[]>. You don't want to use an interface as the element type here. While byte[] does in fact implement interfaces such as IEnumerable<byte> and IList<byte>, using them is a bad idea due to the overhead involved. You don't rely on the fact that string implements IEnumerable<char> much at all, so don't for byte[] either.
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] a)
    {
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };
    HashSet<byte[]> set = new HashSet<byte[]>(new ByteArrayComparer());
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // now 1, as expected
}
https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx
If you were to use the answers in the proposed duplicate question, you would end up with one function call and one array-bounds check per byte processed. You don't want that. Expressed in the simplest way, as above, the jitter will inline the fetches, then notice that the bounds checks cannot fail (arrays can't be resized) and omit them. One function call for the entire array. Yay.
Lists tend to have only a few elements compared to a byte array, so the dirt-simple hash function foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); is often good enough. If you use that kind of hash function for byte arrays, you will have problems: the multiply-by-a-small-odd-number trick becomes badly biased too quickly for comfort here. The particular hash function given above is probably not optimal, but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd hash got into trouble too quickly, producing numerous collisions between arrays that were only two bytes long or two bytes different. If you avoid the degenerate rotation amounts, this family has no collisions on two-byte inputs, and most members have none on three-byte inputs.
Considering actual use cases: by far the two most likely inputs here are byte strings and actual files being checked for sameness. In either case, taking a hash code of only the first few bytes is most likely a bad idea: String's hash code uses the whole string, so byte strings should do the same, and most duplicated files don't have a unique prefix in the first few bytes. For N entries, if you start getting hash collisions around the square root of N, you might as well have walked the entire array when generating the hash code, to say nothing of the fact that compares are slower than hashes.
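To make the bias claim concrete, here is a quick sketch (my illustration, not part of the original answer) that hashes all 65,536 two-byte arrays with both a multiply-by-5 hash and the rotate-by-23 xor hash used in the comparer above, and counts the distinct results:

```csharp
using System;
using System.Collections.Generic;

static class TwoByteCollisionDemo
{
    // Dirt-simple list-style hash: multiply by a small odd number.
    static int MulHash(byte[] a)
    {
        int h = 0;
        foreach (var x in a) h = h * 5 + x;
        return h;
    }

    // The rotate-by-23 xor hash from the comparer above.
    static int RotHash(byte[] a)
    {
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }

    static void Main()
    {
        var mul = new HashSet<int>();
        var rot = new HashSet<int>();
        for (int x = 0; x < 256; x++)
        {
            for (int y = 0; y < 256; y++)
            {
                var arr = new byte[] { (byte)x, (byte)y };
                mul.Add(MulHash(arr));
                rot.Add(RotHash(arr));
            }
        }
        // Of 65,536 two-byte arrays, multiply-by-5 yields only 1,531 distinct
        // hashes (5*a + b never exceeds 1530), while the rotate-xor hash keeps
        // all 65,536 distinct (a<<23 and b occupy disjoint bit ranges).
        Console.WriteLine($"multiply-by-5: {mul.Count} distinct");
        Console.WriteLine($"rotate-xor: {rot.Count} distinct");
    }
}
```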

Related

How to take array segments out of a byte array after every X step?

I've got a big byte array (around 50 KB) and I need to extract numeric values from it. Every three bytes represent one value.
What I tried was to work with LINQ's Skip & Take, but it's really slow given the large size of the array.
This is my very slow routine:
List<int> ints = new List<int>();
for (int i = 0; i <= fullFile.Count(); i += 3)
{
    ints.Add(BitConverter.ToInt16(fullFile.Skip(i).Take(i + 3).ToArray(), 0));
}
I think my approach to this is wrong.
Your code
First of all, ToInt16 only uses two bytes, so your third byte will be discarded.
You can't use ToInt32 either, as it would include one extra byte.
Let's review this:
fullFile.Skip(i).Take(i + 3).ToArray()
...and take a careful look at Take(i + 3). It says that you want to copy a larger and larger buffer. For instance, when i is at index 32000, you copy 32003 bytes into the new buffer.
That's why the code is quite slow.
The code is also slow because you allocate thousands of temporary buffers of growing size, all of which need to be garbage collected.
You could also have done like this:
List<int> ints = new List<int>();
var workBuffer = new byte[4];
for (int i = 0; i + 3 <= fullFile.Length; i += 3)
{
    // Copy the three bytes into the beginning of the temp buffer
    Buffer.BlockCopy(fullFile, i, workBuffer, 0, 3);

    // Now we can use ToInt32, as the last byte is always zero
    var value = BitConverter.ToInt32(workBuffer, 0);
    ints.Add(value);
}
Quite easy to understand, but not the fastest code.
A better solution
So the most efficient way is to do the conversion by yourself (bit shifting).
Something like:
List<int> ints = new List<int>();
for (int i = 0; i + 3 <= fullFile.Length; i += 3)
{
    // This code assumes the data is little-endian
    var value = (fullFile[i + 2] << 16)
              + (fullFile[i + 1] << 8)
              + fullFile[i];
    ints.Add(value);
}
This code does not allocate anything extra (except the ints list) and should be quite fast.
You can read more about shift operators and about endianness on MSDN.

Find nearest safe split in byte array containing UTF-8 data

I want to split a large array of UTF-8 encoded data, so that decoding it into chars can be parallelized.
It seems that there's no way to find out how many bytes Encoding.GetCharCount reads. I also can't use GetByteCount(GetChars(...)), since it decodes the entire array anyway, which is exactly what I'm trying to avoid.
UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning that given any position in bytes you can find where the character at that position begins.
The UTF-8 spec (Wikipedia is the easiest link) defines the following byte sequences:
0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
So, the following method (or something similar) should get your result:
Get the byte count for bytes (bytes.Length, et al.)
Determine how many sections to split into
Select the byte at byteCount / sectionCount
Test that byte against the table:
If byte & 0x80 == 0x00 then this byte can go in either section
If byte & 0xE0 == 0xC0 then you need to seek ahead one byte, and keep it with the current section
If byte & 0xF0 == 0xE0 then you need to seek ahead two bytes, and keep them with the current section
If byte & 0xF8 == 0xF0 then you need to seek ahead three bytes, and keep them with the current section
If byte & 0xC0 == 0x80 then you are in a continuation, and should seek ahead until the first byte that does not match val & 0xC0 == 0x80, then keep everything up to (but not including) that value in the current section
Select byteStart through byteCount + offset, where offset is determined by the test above
Repeat for each section.
Of course, if we redefine our test as returning the current char's start position, we have two cases:
1. If (byte[i] & 0xC0) == 0x80, we need to walk backwards through the array
2. Otherwise, return the current i (since it's not a continuation)
This gives us the following method:
public static int GetCharStart(ref byte[] arr, int index) =>
(arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
Next, we want to get each section. The easiest way is to use a state-machine (or abuse, depending on how you look at it) to return the sections:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    for (var i = 0; i < sectionCount; i++)
    {
        sectionEnd = i == (sectionCount - 1)
            ? utf8Array.Length
            : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
        yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        sectionStart = sectionEnd;
    }
}
I built this in this manner because I want to use Parallel.ForEach to demonstrate the result, which is super easy if we have an IEnumerable. It also allows me to be extremely lazy with the processing: we only gather sections when needed, on demand, which is a good thing, no?
Lastly, we need to be able to get a section of bytes, so we have the GetSection method:
public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = array[i + start];
    }
    return result;
}
Finally, the demonstration:
var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);
var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
new ParallelOptions { MaxDegreeOfParallelism = 1 },
x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });
Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
the way I demonstrate is
that it doesn't take order into account, but oh-well.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Not perfect, but it does the job. If we change MaxDegreeOfParallelism to a higher value, our string gets jumbled:
Some test ???, ??
e decoded in parallel, thi
flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
the way I demonstrate is
So, as you can see, super easy. You'll want to make modifications to allow for correct order-reassembly, but this should demonstrate the trick.
If we modify the GetByteSections method as follows, the last section is no longer ~2x the size of the remaining ones:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);
    for (var i = 0; i < sectionCount; i++)
    {
        if (i == (sectionCount - 1))
        {
            // Split the final stretch in two: up to the last char boundary,
            // then the remainder of the array.
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
            sectionEnd = utf8Array.Length;
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        }
        else
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
        }
    }
}
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
And finally, if for some reason you split into an abnormally large number of sections compared to the input size (my input of ~578 bytes across 250 chars demonstrates this), you'll hit an IndexOutOfRangeException in GetCharStart; the following version fixes that:
public static int GetCharStart(ref byte[] arr, int index)
{
    if (index >= arr.Length)
    {
        index = arr.Length - 1;
    }
    return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}
Of course this leaves you with a bunch of empty results, but the reassembled string doesn't change, so I'm not even going to bother posting the full scenario test here. (I leave it up to you to experiment.)
Great answer, Mathieu and Der. Here is a Python variant based 100% on your answer, which works great:
def find_utf8_split(data, bytes=None):
    bytes = bytes or len(data)
    while bytes > 0 and data[bytes - 1] & 0xC0 == 0x80:
        bytes -= 1
    if bytes > 0:
        if data[bytes - 1] & 0xE0 == 0xC0: bytes = bytes - 1
        if data[bytes - 1] & 0xF0 == 0xE0: bytes = bytes - 1
        if data[bytes - 1] & 0xF8 == 0xF0: bytes = bytes - 1
    return bytes
This code finds a UTF-8-compatible split point in a given byte string. It does not perform the split itself, as that would take more memory; that is left to the rest of the code.
For example you could:
position = find_utf8_split(data)
leftovers = data[position:]
text = data[:position].decode('utf-8')

Bitarray VS bool[]

I expected to find an existing question here on SO about this, but I didn't.
What is the advantage of using a BitArray when you can store your bool values in a bool[]?
System.Collections.BitArray biArray = new System.Collections.BitArray(8);
biArray[4] = true;
bool[] boArray = new bool[8];
boArray[4] = true;
The bool[] seems a little more handy to me, because there are more (extension) methods for working with an array than with a BitArray.
There's a memory/performance tradeoff. BitArray stores eight entries per byte, but accessing a single entry requires a handful of logical operations under the hood. A bool array stores each entry as a whole byte, taking up more memory but requiring fewer CPU cycles to access.
Essentially, BitArray is a memory optimization over bool[]; there's no point in using it unless memory is scarce.
Edit:
Created a simple performance test.
Inverting 500M elements using BitArray takes about 6 seconds on my machine:
const int limit = 500000000;
var bitarray = new BitArray(limit);
for (var i = 0; i < limit; ++i)
{
    bitarray[i] = !bitarray[i];
}
The same test using a bool array takes about 1.5 seconds:
const int limit = 500000000;
var boolarray = new bool[limit];
for (var i = 0; i < limit; ++i)
{
    boolarray[i] = !boolarray[i];
}
BitArray is compact and allows you to perform bitwise operations. From the MSDN forum:
A BitArray uses one bit for each value, while a bool[] uses one byte
for each value. You can pass to the BitArray constructor either an
array of bools, an array of bytes or an array of integers. You can
also pass an integer value specifying the desired length and
(optionally) a boolean argument that specifies if the individual bits
should be set or not.
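To make the "bitwise operations" point concrete, here is a small sketch (my addition, not from the quoted post): And, Or and Xor operate on a whole BitArray at a time, modifying the instance they are called on:

```csharp
using System;
using System.Collections;

class BitArrayOpsDemo
{
    static void Main()
    {
        var a = new BitArray(new[] { true, true, false, false });
        var b = new BitArray(new[] { true, false, true, false });

        // And/Or/Xor modify the left-hand instance in place (and return it),
        // so we clone first to keep the original intact.
        var and = ((BitArray)a.Clone()).And(b); // 1000
        var or  = ((BitArray)a.Clone()).Or(b);  // 1110
        var xor = ((BitArray)a.Clone()).Xor(b); // 0110

        for (int i = 0; i < 4; i++)
            Console.WriteLine($"{and[i]} {or[i]} {xor[i]}");
    }
}
```

A bool[] has no equivalent whole-array operations; you would have to loop element by element.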

faster code to remove first elements from byte array [duplicate]

This question already has answers here:
How do you remove and add bytes from a byte array in C#
(2 answers)
Closed 7 years ago.
So I have a byte array, and I need to remove the first 5 elements from it. I looked online and couldn't find anything that suited what I was looking for, so I made this, and it is horribly slow; in essence, unusable.
private byte[] fR(byte[] tb)
{
    string b = "";
    byte[] m = new byte[tb.Length - 5];
    for (int a = 5; a < tb.Length; a++)
    {
        b = b + " " + tb.GetValue(a);
    }
    b = b.Remove(0, 1);
    string[] rd = Regex.Split(b, " ");
    for (int c = 0; c < rd.Length; c++)
    {
        byte curr = Convert.ToByte(rd[c]);
        m.SetValue(curr, c);
    }
    return m;
}
What I am asking is: is there a way to make this faster, or another method by which I can remove the first 5 elements from a byte array?
Much easier and quicker:
byte[] src = ...;
byte[] dst = new byte[src.Length - 5];
Array.Copy(src, 5, dst, 0, dst.Length);
This is as fast as you'll be able to get.
If you're using C# 8, you can use ranges to copy a slice of the array very concisely:
byte[] src = ...;
byte[] dst = src[5..];
LINQ, used in other answers, is a bit easier to understand and is what I'd do 90% of the time. But LINQ has its own overheads, especially for simple problems like this, and I'd avoid it if performance is critical.
Your code is slow because you're packing the byte array into a string and then unpacking it. Get rid of the string manipulation and it will be fast.
You can use Linq:
tb.Skip(5).ToArray();

A workaround for a big multidimensional array (Jagged Array) C#?

I'm trying to initialize an array in three dimensions to load a voxel world.
The total size of the map should be 2048 x 1024 x 2048. I tried to initialize a jagged array of int, but it throws a memory exception. What is the size limit?
Size of my table: 2048 * 1024 * 2048 = 4,294,967,296
Does anyone know a way around this problem?
// System.OutOfMemoryException here!
int[][][] matrice = CreateJaggedArray<int[][][]>(2048, 1024, 2048);

// normal initialization also throws the exception
int[,,] matrice = new int[2048, 1024, 2048];

static T CreateJaggedArray<T>(params int[] lengths)
{
    return (T)InitializeJaggedArray(typeof(T).GetElementType(), 0, lengths);
}

static object InitializeJaggedArray(Type type, int index, int[] lengths)
{
    Array array = Array.CreateInstance(type, lengths[index]);
    Type elementType = type.GetElementType();
    if (elementType != null)
    {
        for (int i = 0; i < lengths[index]; i++)
        {
            array.SetValue(InitializeJaggedArray(elementType, index + 1, lengths), i);
        }
    }
    return array;
}
The maximum size of a single object in C# is 2 GB. Since you are creating a multi-dimensional array rather than a jagged array (despite the name of your method), it is a single object that needs to contain all of those items, not several. If you actually used a jagged array, you wouldn't have a single item holding all of that data (even though the total memory footprint would be a tad larger, not smaller; it's just spread out more).
Thank you so much to everyone who tried to help me understand and solve my problem.
I tried several solutions for loading that much data and storing it in a table.
After two days, here are my tests and, finally, the solution that can store 4,294,967,296 entries in one array.
I'm adding my final solution, hoping it can help someone.
The goal
As a reminder, the goal: initialize an integer array [2048][1024][2048] to store 4,294,967,296 values.
Test 1: the jagged array method (failure)
System.OutOfMemoryException thrown
/* ******************** */
/* Jagged Array method  */
/* ******************** */

// allocate the first dimension
bigData = new int[2048][][];
for (int x = 0; x < 2048; x++)
{
    // allocate the second dimension
    bigData[x] = new int[1024][];
    for (int y = 0; y < 1024; y++)
    {
        // the last dimension allocation
        bigData[x][y] = new int[2048];
    }
}
Test 2: the List method (failure)
System.OutOfMemoryException thrown. (Dividing the big array into several smaller arrays does not work, because List<> is subject to the same 2 GB allocation limit as a simple array, unfortunately.)
/* ******************** */
/* List method          */
/* ******************** */

List<int[,,]> bigData = new List<int[,,]>(512);
for (int a = 0; a < 512; a++)
{
    bigData.Add(new int[256, 128, 256]);
}
Test 3: MemoryMappedFile (the solution)
I finally found the solution!
Use the MemoryMappedFile class, which maps the contents of a file into virtual memory.
MemoryMappedFile MSDN
I used it with a custom class that I found on CodeProject here. The initialization is long, but it works well!
/* ************************ */
/* MemoryMappedFile method  */
/* ************************ */

string path = AppDomain.CurrentDomain.BaseDirectory;
var myList = new GenericMemoryMappedArray<int>(2048L * 1024L * 2048L, path);
using (myList)
{
    myList.AutoGrow = false;
    /*
    for (int a = 0; a < (2048L * 1024L * 2048L); a++)
    {
        myList[a] = a;
    }
    */
    myList[12456] = 8;
    myList[1939848234] = 1;
    // etc...
}
From the MSDN documentation on Arrays (emphasis added)
By default, the maximum size of an Array is 2 gigabytes (GB). In a
64-bit environment, you can avoid the size restriction by setting the
enabled attribute of the gcAllowVeryLargeObjects configuration element
to true in the run-time environment. However, the array will still be
limited to a total of 4 billion elements, and to a maximum index of
0X7FEFFFFF in any given dimension (0X7FFFFFC7 for byte arrays and
arrays of single-byte structures).
So despite the above answers, even if you set the flag to allow a larger object size, the array is still limited to the 32-bit limit on the number of elements.
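For reference, the flag mentioned in that quote goes in the application's configuration file; per the documentation it looks like this (and only takes effect in a 64-bit process):

```xml
<configuration>
  <runtime>
    <!-- Allow objects (including arrays) larger than 2 GB in total size -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```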
EDIT: You'll likely have to redesign so you don't need a multidimensional array of the kind you're currently using (as others have suggested, there are a few ways to do this, from actual jagged arrays to some other collection of dimensions). Given the scale of the number of elements, it may be best to use a design that dynamically allocates objects/memory as they are used, instead of arrays that have to pre-allocate everything (unless you don't mind using many gigabytes of memory). That is, perhaps you can define data structures that describe the filled content rather than every possible voxel in the world, even the "empty" ones. (I'm assuming the vast majority of voxels are "empty" rather than "filled".)
EDIT: Although not trivial, especially if most of the space is considered "empty", your best bet would be to introduce some sort of spatial tree, such as an octree (as Eric suggested) or an R-tree, that lets you efficiently query your world to see what objects are in a particular area.
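As a minimal sketch of that "store only the filled content" idea (my own illustration, not from the answer): a dictionary keyed by coordinates holds only the non-empty voxels, so memory scales with what's filled rather than with the 2048 x 1024 x 2048 bounds:

```csharp
using System;
using System.Collections.Generic;

class SparseVoxels
{
    // Only non-empty voxels are stored; any unset coordinate reads as 0 ("empty").
    readonly Dictionary<(int x, int y, int z), int> cells =
        new Dictionary<(int x, int y, int z), int>();

    public int this[int x, int y, int z]
    {
        get { return cells.TryGetValue((x, y, z), out var v) ? v : 0; }
        set
        {
            if (value == 0) cells.Remove((x, y, z)); // writing "empty" keeps the map sparse
            else cells[(x, y, z)] = value;
        }
    }

    public int FilledCount => cells.Count;

    static void Main()
    {
        var world = new SparseVoxels();
        world[12, 456, 2047] = 8; // far-apart voxels cost nothing extra
        world[2000, 1000, 3] = 1;
        Console.WriteLine(world[12, 456, 2047]); // 8
        Console.WriteLine(world[0, 0, 0]);       // 0 (never set, so "empty")
        Console.WriteLine(world.FilledCount);    // 2
    }
}
```

Lookups are O(1) on average; for very dense regions you would switch those regions over to flat chunk arrays, which is essentially what an octree does adaptively.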
Creating this object as described, either as a standard array or as a jagged array, is going to destroy the locality of reference that allows your CPU to be performant. I recommend you use a structure like this instead:
class BigArray
{
    // 32 x 16 x 32 cells of 64^3 ints each covers the full 2048 x 1024 x 2048 world.
    ArrayCell[,,] arrayCell = new ArrayCell[32, 16, 32];
    public int this[int i, int j, int k]
    {
        get { return (arrayCell[i / 64, j / 64, k / 64])[i % 64, j % 64, k % 64]; }
    }
}

class ArrayCell
{
    int[,,] cell = new int[64, 64, 64];
    public int this[int i, int j, int k]
    {
        get { return cell[i, j, k]; }
    }
}
