Read/Write compressed binary data - c#

I read all over the place about people compressing objects on a bit-by-bit scale. Things like "The first three bits represent such and such, then the next two represent this, and twelve bits for that."
I understand why it would be desirable to minimize memory usage, but I cannot think of a good way to implement this. I know I would pack it into one or more integers (or longs, whatever), but I cannot envision an easy way to work with it. It would be pretty cool if there were a class where I could get/set arbitrary bits in an arbitrary-length binary field, and it would take care of things for me, so I wouldn't have to go mucking about with &'s and |'s and masks and such.
Is there a standard pattern for this kind of thing?

From MSDN:
BitArray Class
Manages a compact array of bit values, which are represented as Booleans, where true indicates that the bit is on (1) and false indicates the bit is off (0).
Example:
BitArray myBitArray = new BitArray(5);
myBitArray[3] = true; // set bit at offset 3 to 1
BitArray allows you to set only individual bits, though. If you want to encode values with more bits, there's probably no way around mucking about with &'s and |'s and masks and stuff :-)

You might want to check out the BitVector32 structure in the .NET Framework. It lets you define "sections" which are ranges of bits within an int, then read and write values to those sections.
The main limitation is that it's limited to a single 32-bit integer; this may or may not be a problem depending on what you're trying to do. As dtb mentioned, BitArray can handle bit fields of any size, but you can only get and set a single bit at a time--there is no support for sections as in BitVector32.
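A minimal sketch of those sections, using the 3/2/12-bit split from the question at the top (section names are mine):
// requires: using System.Collections.Specialized;
BitVector32.Section first = BitVector32.CreateSection(7);            // 3 bits (max value 7)
BitVector32.Section second = BitVector32.CreateSection(3, first);    // the next 2 bits
BitVector32.Section third = BitVector32.CreateSection(4095, second); // the next 12 bits

BitVector32 vector = new BitVector32(0);
vector[first] = 5;            // write a 3-bit value
vector[second] = 2;           // write a 2-bit value
vector[third] = 1234;         // write a 12-bit value
int readBack = vector[third]; // 1234 -- no manual masking needed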

What you're looking for are called bitwise operations.
For example, let's say we're going to represent an RGB value in the least significant 24 bits of an integer, with R being bits 23-16, G being bits 15-8, and B being bits 7-0.
You can set R to any value between 0 and 255 without affecting the other bits like this:
void setR(ref int RGBValue, int newR)
{
    int newRValue = newR << 16;          // shift it left 16 bits so that the 8-bit value now sits in bits 23-16
    RGBValue = RGBValue & ~(0xFF << 16); // AND with a mask that zeroes bits 23-16 only, leaving G and B intact
    RGBValue = RGBValue | newRValue;     // now OR it with the new R value so that the new value is set
}
By using bitwise ANDs and ORs (and occasionally more exotic operations) you can easily set and clear any individual bit of a larger value.
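For completeness, the single-bit cases look like this (my own example values):
int flags = 0;
flags |= 1 << 4;                      // set bit 4
flags &= ~(1 << 4);                   // clear bit 4
flags ^= 1 << 4;                      // toggle bit 4
bool isSet = (flags & (1 << 4)) != 0; // test bit 4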

Rather than using toolkit- or platform-specific wrapper classes, I think you are better off biting the bullet and learning your &s and |s and 0x04s and how all the bitwise operators work. By and large that's how it's done on most projects, and the operations are extremely fast. The operations are pretty much identical in most languages, so you won't be stuck depending on some specific toolkit.

Related

Why does everyone use 2^n numbers for allocation? -> new StringBuilder(256)

15 years ago, while programming in Pascal, I understood why to use powers of two for memory allocation. But this still seems to be state of the art.
C# Examples:
new StringBuilder(256);
new byte[1024];
int bufferSize = 1 << 12;
I still see this thousands of times, I use it myself, and I'm still questioning:
Do we need this in modern programming languages and modern hardware?
I guess it's good practice, but what's the reason?
EDIT
For example, for a byte[] array, as stated by answers here, a power of 2 can make no sense: the array itself uses 16 bytes of overhead (?), so does it make sense to use 240 (= 256 - 16) for the size, so that the total fits in 256 bytes?
Do we need this in modern programming languages and modern hardware? I guess it's good practice, but what's the reason?
It depends. There are two things to consider here:
For sizes less than the memory page size, there's no appreciable difference between a power-of-two and an arbitrary number to allocate space;
You mostly use managed data structures with C#, so you won't even know how many bytes are really allocated underneath.
Assuming you're doing low-level allocation with malloc(), using multiples of the page size would be considered a good idea, e.g. 4096 or 8192; this is because it allows for more efficient memory management.
My advice would be to just allocate what you need and let C# handle the memory management and allocation for you.
Sadly, it's quite counterproductive if you want to keep a block of memory within a single 4K memory page... and people don't even know it :-) (I didn't until 10 minutes ago... I only had a hunch)... An example... it's unsafe code and implementation-dependent (using .NET 4.5 at 32/64 bits):
byte[] arr = new byte[4096];
fixed (byte* p = arr)  // requires an unsafe context
{
    // read the array's Length field, which the CLR stores just before the data
    int size = ((int*)p)[IntPtr.Size == 4 ? -1 : -2];
}
So the CLR has allocated at least 4096 + (1 or 2) × sizeof(int) bytes... so it has gone over one 4K memory page. This is logical... it has to keep the size of the array somewhere, and keeping it together with the array is the most intelligent choice (for those who know what Pascal strings and BSTRs are: yes, it's the same principle).
I'll add that all objects in .NET have a sync block index and a RuntimeType reference... each is at least an int, if not an IntPtr, so that's a total of 8 to 16 bytes per object (this is explained in various places... try searching for ".NET object header" if you're interested).
It still makes sense in certain cases, but I would prefer to analyze case by case whether I need that kind of specification or not, rather than blindly using it as good practice.
For example, there might be cases where you want to use exactly 8 bits of information (1 byte) to address a table.
In that case, I would let the table have a size of 2^8:
Object[] table = new Object[256];
This way, you will be able to address any element of the table using a single byte.
Even if the table is actually smaller and doesn't use all 256 places, you still have the guarantee of a bidirectional mapping from table to index and from index to table, which could prevent errors that would appear if, for example, you had:
Object[] table = new Object[100];
and then someone (probably someone else) accessed it with a byte value outside the table's range.
Maybe this kind of bijective behavior is good, or maybe you have other ways to guarantee your constraints.
Probably, given the increasing smartness of current compilers, it is not the only good practice anymore.
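As a small illustration of the bijection point above (hypothetical usage, not from the original answer):
Object[] table = new Object[256];
byte index = 200;            // any byte value 0-255 is a valid index
Object item = table[index];  // can never throw IndexOutOfRangeException
table[index] = "some value"; // likewise, every byte value maps to a slot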
IMHO, arithmetic that lands on exact powers of two is like a fast track: low-level arithmetic on a power of two takes fewer cycles and bit manipulations, while other numbers need extra work from the CPU.
I also found this possible duplicate: Is it better to allocate memory in the power of two?
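A small illustration of that fast track (hypothetical values): when a size is a power of two, modulo and division reduce to a single AND or shift.
int position = 123456;                    // hypothetical input
int bufferSize = 1 << 12;                 // 4096, a power of two
int offset = position & (bufferSize - 1); // same result as position % bufferSize, a single AND
int block = position >> 12;               // same result as position / bufferSize, a single shift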
Yes, it's good practice, and it has at least one reason.
Modern processors have a 64-byte L1 cache line, and if you use a buffer size of 2^n (for example 1024 or 4096), you fill whole cache lines with no wasted space.
In some cases, this will help prevent false sharing problem (http://en.wikipedia.org/wiki/False_sharing).

Effectively storing bit indexes

I want to store certain indexes in a sequence of bits. The number of bits is a power of two, and it is safe to assume the maximum index is 255 (the starting index is 0). Earlier I was storing every index as an integer, but that occupies too much memory.
I want to use something like masks.
Ex: if I want to store the indexes 0, 3, 5 then I store 101001, i.e. 41, as an integer.
The problem is that the maximum index I have is 255, and using the above technique I can store indexes only up to 63 (using a 64-bit integer). Is there any other way I can do this?
Thanks :-)
.NET has a built-in class for this kind of thing: BitArray
This will let you store a string of bits (in the form of bools) efficiently.
You can perform bitwise operations on BitArrays with the And, Or, Xor, and Not methods.
You can initialize a BitArray with bools, bytes, or ints. For example, you can initialize it with 00101100 like this:
BitArray bits = new BitArray(new byte[] { 0x2C }); // 0x2C == 00101100
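For the 0-255 range in the question above, a short sketch (variable names are mine):
BitArray indexes = new BitArray(256); // one bit per possible index, 0-255
indexes[0] = true;
indexes[3] = true;
indexes[5] = true;
indexes[255] = true;                  // no 64-bit ceiling here
bool stored = indexes[5];             // true -- is index 5 in the set?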
You can use two (or more) shorter bit indexes in place of one long bit index.
Practically, with 10 bits (indexes 0 to 9; bits 0, 4 and 8 set):
Single index:
i = 0100010001
Index1 = i
Two combined indexes:
i1 = 01000
i2 = 10001
Index2 = [i1, i2];
Index2.fragment_length = 5
With the index stored as an array of fragments, in C# (the fragment_length from the example becomes a parameter):
static void Set(int[] index, int bit, int fragmentLength)
{
    int fragment = bit / fragmentLength; // integer division: which fragment holds the bit
    int bitIndex = bit % fragmentLength; // remainder: position of the bit within that fragment
    index[fragment] |= 1 << bitIndex;    // set the bit by OR-ing the fragment with the bitmask
}

static bool Get(int[] index, int bit, int fragmentLength)
{
    int fragment = bit / fragmentLength;
    int bitIndex = bit % fragmentLength;
    return (index[fragment] & (1 << bitIndex)) != 0; // test the bit by AND-ing with the bitmask
}
I hope I have understood your requirements!
In Java, bit indexing is made easy by the BitSet class, as the following excerpt from its documentation explains.
This class implements a vector of bits that grows as needed. Each component of the bit set has a boolean value. The bits of a BitSet are indexed by nonnegative integers. Individual indexed bits can be examined, set, or cleared. One BitSet may be used to modify the contents of another BitSet through logical AND, logical inclusive OR, and logical exclusive OR operations.
By default, all bits in the set initially have the value false.
Every bit set has a current size, which is the number of bits of space currently in use by the bit set. Note that the size is related to the implementation of a bit set, so it may change with implementation. The length of a bit set relates to logical length of a bit set and is defined independently of implementation.
Unless otherwise noted, passing a null parameter to any of the methods in a BitSet will result in a NullPointerException. A BitSet is not safe for multithreaded use without external synchronization.

Manipulating very large binary strings in c#

I'm working on a genetic algorithm project where I encode my chromosome in a binary string, where each 32 bits represents a value. The problem is that the function I'm optimizing has over 3000 parameters, which implies that I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class where I create a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts, moves and such.
My question is, is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for well over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly, you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bit operations on a whole machine word where possible.
For example, this would make shifts far more efficient, since each operation moves 32 or 64 bits at once instead of a single element.
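As a sketch of what that buys you (an assumed helper, not part of the original answer), here is a whole-array left shift by one bit over a ulong[], where words[0] holds the least significant bits:
static void ShiftLeftOne(ulong[] words)
{
    ulong carry = 0;
    for (int i = 0; i < words.Length; i++)
    {
        ulong overflow = words[i] >> 63;    // the bit that spills into the next word
        words[i] = (words[i] << 1) | carry; // shift this word, bring in the previous spill
        carry = overflow;
    }
}
Each iteration moves 64 bits at once, instead of moving one bool at a time across a 96000-element array.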
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with 4 values: A, C, G, T. Those can be coded in 2 bits, making a byte able to hold 4 markers. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bit stream. I would recommend an enum backed by byte:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]); // first marker ends up in the two highest bits
    return output;
}
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x03); // undo the shift order of ToByteCode
    return result;
}
You might see a slight performance increase using four parameters (out parameters if necessary) instead of allocating and using an array on the heap, but you lose the iteration that keeps the code compact.
Now, because you're packing four markers into each byte, if your sequence length isn't always an exact multiple of four, you'll end up padding the end of your block with zero values (A). Working around this is messy, but if you have a 32-bit integer that tells you the exact number of markers, you can simply discard anything extra you find in the byte stream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
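For example (a hypothetical round trip, assuming the DnaMarker enum above):
DnaMarker[] markers = { DnaMarker.A, DnaMarker.C, DnaMarker.G, DnaMarker.T };
string s = string.Concat(markers);                           // "ACGT" -- ToString() on each value
DnaMarker m = (DnaMarker)Enum.Parse(typeof(DnaMarker), "G"); // and back from a string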
And always remember: unless memory is at a premium (usually it isn't), it's almost always faster to deal with data in an easily usable format rather than the most compact one. The one big exception is network transmission; if you had to send 750 bytes vs. 12 KB over the Internet, there's an obvious advantage in the smaller size.

How to "Add" two bytes together

I have an odd scenario (see this answer for more details), where I need to add two bytes of data together. Obviously this is not normal adding. Here is the scenario:
I am trying to get a coordinate out of a control. When the control is less than 256 in width, the x coordinate takes one byte; otherwise it takes two bytes.
So, I now have an instance of that control that is larger than 256 in width. How do I add these two numbers together?
So for example:
2 + 0 is not 2 because the 2 is the high byte (or maybe it is the low byte and it is 2...)
Am I making sense? If so, how can I do this kind of addition in C#?
Update: Sorry for the confusing question. I think I got it figured out. See my answer below.
The approach with multiplying is clear enough, but it's not common in the bitwise world, and your approach with BitConverter takes a byte array, which is not convenient in many cases.
The most common (and easiest) way to perform this is to use bitwise operators:
var r = (high << 8) | low;
And remember byte ordering, because it's not always obvious which byte is high and which is low.
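For the asker's example, where the 2 sits in the high byte (my own variable names):
byte high = 2, low = 0;
int value = (high << 8) | low;   // 512, not 2 -- each unit of the high byte is worth 256
int swapped = (low << 8) | high; // 2, if the bytes were actually the other way around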
You mean something like
256 * high + low
?
Just in case anyone else needs this, I was looking for:
BitConverter.ToInt16
It takes two bytes (from a byte array, at a given offset) and converts them to a 16-bit integer (short).
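A short sketch (my own variable names); note that BitConverter assumes the platform's byte order, which is little-endian on x86/x64:
byte low = 0, high = 2;
short value = BitConverter.ToInt16(new byte[] { low, high }, 0); // 512 on a little-endian machine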

(C#) Is there a way to set the value of a single bit through pointers?

What I mean is: imagine we have an 8-byte variable with a high value and a low value. I can make one pointer point to the upper 4 bytes and another point to the lower 4 bytes, and set/retrieve their values without problems. Now, is there a way to get/set values for anything smaller than a byte? If instead of dividing it into two 4-byte "variables" I wanted to consider eight 1-byte variables, I could use a bool; but there is no smaller variable defined in C#. Would it be possible to divide it into 16 just with pointers? Or even into 32, or 64? It wouldn't, right?
This is a pretty academic question; I know this can be achieved otherwise with bit-shifting, unions (LayoutKind.Explicit), etc. Thanks!
No, C# does not support bit fields, and a byte is the minimum amount of addressable memory. You can manually provide properties that change one or several specific bits, but you have to provide the packing/unpacking logic yourself:
private int field; // backing storage for the packed bits

public bool Bit5 {
    get { return (field & 32) != 0; }                  // 32 == 1 << 5, the mask for bit 5
    set { if (value) field |= 32; else field &= ~32; } // set or clear bit 5 only
}
By the way, I don't know how you would achieve it using LayoutKind.Explicit, as the minimum FieldOffset you can specify is one byte.
As a side note, even C++, which can do this with bit fields, just hides the bitwise tricks and makes the compiler do them instead of you. There's no way to grab less than a byte from memory into a register, at least on the x86 architecture.
