I have a string of 512 characters that contains only 0s and 1s. I'm trying to represent it in a data structure that saves space. Is BitArray the most efficient way?
I'm also thinking about using 16 Int32 values to store the number, which would be 16 * 4 = 64 bytes.
Most efficient can mean many different things...
1. Most efficient from a memory management perspective?
2. Most efficient from a CPU calculation perspective?
3. Most efficient from a usage perspective? (With respect to writing code that uses the numbers for calculations.)
For 1 - use byte[64] or long[8] - if you aren't doing calculations or don't mind writing your own calculations.
For 3, BigInteger is definitely the way to go. You have your math functions already defined; you just need to turn your binary number into a decimal representation.
EDIT: It sounds like you don't want BigInteger due to size concerns... however, I think you will find that you end up parsing this with an enumerable/yield combination, processing it a chunk at a time so that you never hold the entire data structure in memory at once.
That being said... I can help you somewhat with parsing your string into an array of UInt64s. Thanks to King King for part of this LINQ statement.
// Convert the string into an array of UInt64s.
// Note that the MSB ends up in result[0].
var result = input.Select((x, i) => i)
                  .Where(i => i % 64 == 0)
                  .Select(i => input.Substring(i, input.Length - i >= 64 ? 64 : input.Length - i))
                  .Select(x => Convert.ToUInt64(x, 2))
                  .ToArray();
If you decide you want a different array structure (byte[64] or whatever), it should be easy to modify.
EDIT 2: OK I got bored so I wrote an EditDifference function for fun... here you go...
static public int GetEditDistance(ulong[] first, ulong[] second)
{
    int editDifference = 0;
    var smallestArraySize = Math.Min(first.Length, second.Length);
    for (var i = 0; i < smallestArraySize; i++)
    {
        // XOR leaves a 1 in every bit position where the two values differ,
        // so counting the set bits gives the number of differing bits.
        var difference = first[i] ^ second[i];
        while (difference != 0)
        {
            editDifference += (int)(difference & 1);
            difference >>= 1;
        }
    }
    // if the arrays are different sizes, every bit in the extra elements is considered different
    var differenceOfArraySize =
        Math.Max(first.Length, second.Length) - smallestArraySize;
    if (differenceOfArraySize > 0)
        editDifference += differenceOfArraySize * 64;
    return editDifference;
}
Use BigInteger from .NET. It can easily support 512-bit numbers as well as operations on those numbers.
BigInteger.Parse("your huge number");
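Note that BigInteger.Parse expects a decimal (or hex) representation, so a string of 0/1 characters has to be converted first. A minimal sketch of one way to do that (the helper name is mine, not from the answer):
using System.Numerics;

// Build a BigInteger from a string of '0'/'1' characters, most significant bit first.
static BigInteger FromBinaryString(string bits)
{
    var value = BigInteger.Zero;
    foreach (char c in bits)
        value = (value << 1) | (c == '1' ? BigInteger.One : BigInteger.Zero);
    return value;
}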
BitArray (with 512 bits), byte[64], int[16], long[8] (or List<> variants of those), or BigInteger will all be much more efficient than your String. I'd say that byte[] is the most idiomatic/typical way of representing data such as this, in general. For example, ComputeHash uses byte[] and Streams deal with byte[]s, and if you store this data as a BLOB in a DB, byte[] will be the most natural way to work with that data. For that reason, it'd probably make sense to use this.
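For instance, a rough sketch (the helper name is mine, not from the answer) of packing the 512-character 0/1 string into a byte[64], eight bits per byte with the first character of the string in the high bit:
static byte[] ToBytes(string bits)
{
    var bytes = new byte[bits.Length / 8];
    for (int i = 0; i < bits.Length; i++)
        if (bits[i] == '1')
            bytes[i / 8] |= (byte)(1 << (7 - (i % 8)));  // MSB-first within each byte
    return bytes;
}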
On the other hand, if this data represents a number that you might do numeric things to like addition and subtraction, you probably want to use a BigInteger.
These approaches all have roughly the same performance, so choose between them primarily based on what makes sense for your code, and secondarily on performance benchmarked in your actual usage.
The most efficient would be having eight UInt64/ulong or Int64/long typed variables (or a single array), although this might not be optimal for querying/setting. One way to get around this is, indeed, to use a BitArray (which is basically a wrapper around the former method, including additional overhead [1]). It's a matter of choice either for easy use or efficient storage.
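As a rough illustration of that trade-off, these are the kind of helpers you end up writing yourself for the ulong[8] approach (the names and the bit ordering are assumptions, not from the answer):
// Bit i is taken to be bit (i % 64) of word (i / 64).
static bool GetBit(ulong[] words, int i) =>
    ((words[i / 64] >> (i % 64)) & 1UL) != 0;

static void SetBit(ulong[] words, int i, bool value)
{
    if (value) words[i / 64] |= 1UL << (i % 64);
    else       words[i / 64] &= ~(1UL << (i % 64));
}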
If this isn't sufficient, you can always choose to apply compression, such as RLE-encoding or various other widely available encoding methods (gzip/bzip/etc...). This will require additional processing power though.
It depends on your definition of efficient.
[1] Additional overhead, as in storage overhead. BitArray internally uses an Int32 array to store the values. In addition, BitArray stores its current mutation version, the number of ints 'allocated', and a sync root. Even though the overhead is negligible for small numbers of values, it can be an issue if you keep a lot of these in memory.
Related
I have a struct with some properties (like int A1, int A2,...). I store a list of struct as binary in a file.
Now, I'm reading the bytes from the file into Buffer using a BinaryReader, and I want to apply a filter based on the struct's properties (like .A1 = 100 & .A2 = 12).
The performance is very important in my scenario, so I convert the filter criteria to byte array (Filter) and then I want to mask Buffer with Filter. If the result of masking is equal to Filter, the Buffer will be converted to the struct.
The question: What is the fastest way to mask and compare two byte arrays?
Update: The Buffer size is more than 256 bytes. I'm wondering if there is a better way than iterating over each byte of Buffer and Filter.
The way I would usually approach this is with unsafe code. You can use the fixed keyword to get a byte[] as a long*, which you can then iterate in 1/8th of the iterations - but using the same bit operations. You will typically have a few bytes left over (from it not being an exact multiple of 8 bytes) - just clean those up manually afterwards.
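A minimal sketch of that idea, assuming the filter means "these bits must be set" (the method name and the trailing-byte cleanup are mine, not from the answer):
static unsafe bool MatchesFilter(byte[] buffer, byte[] filter)
{
    int longCount = filter.Length / 8;
    fixed (byte* pBuf = buffer, pFil = filter)
    {
        long* b = (long*)pBuf;
        long* f = (long*)pFil;
        // Mask and compare eight bytes per iteration.
        for (int i = 0; i < longCount; i++)
            if ((b[i] & f[i]) != f[i])
                return false;
    }
    // Clean up the few leftover bytes when the length is not a multiple of 8.
    for (int i = longCount * 8; i < filter.Length; i++)
        if ((buffer[i] & filter[i]) != filter[i])
            return false;
    return true;
}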
Try a simple loop with System.BitConverter.ToInt64(). Something like this:
byte[] arr1;   // assumed to be filled in elsewhere; length a multiple of 8
byte[] arr2;   // same length as arr1
for (int i = 0; i < arr1.Length; i += 8)
{
    var P1 = System.BitConverter.ToInt64(arr1, i);
    var P2 = System.BitConverter.ToInt64(arr2, i);
    if ((P1 & P2) != P1) // or whatever comparison you need
    {
        // break the loop here if you need to.
    }
}
My assumption is that comparing/masking two Int64s will be much faster (especially on 64-bit machines) than masking one byte at a time.
Once you've got the two arrays - one from reading the file and one from the filter - all you then need is a fast comparison of the arrays. Check out the following posts, which use unsafe or P/Invoke methods.
What is the fastest way to compare two byte arrays?
Comparing two byte arrays in .NET
I'm working on a genetic algorithm project where I encode my chromosome in a binary string, where each 32 bits represents a value. The problem is that the functions I'm optimizing have over 3000 parameters, which means I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class where I create a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts, moves, and so on.
My question is: is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for well over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly, you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to an int[] (or even a long[]) and take advantage of bit operations on a whole machine word where possible.
For example, this would make shifts more efficient by ~ a factor 4.
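As a sketch of what a word-level operation looks like (the convention that word 0 holds the most significant bits is my assumption, not from the answer), a left shift by one bit over a ulong[] becomes:
static void ShiftLeftByOne(ulong[] words)
{
    for (int i = 0; i < words.Length; i++)
    {
        // Carry in the top bit of the next (less significant) word.
        ulong carry = (i + 1 < words.Length) ? words[i + 1] >> 63 : 0UL;
        words[i] = (words[i] << 1) | carry;
    }
}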
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with 4 values: A, C, G, T. Each can be coded in 2 bits, so a single byte can hold 4 markers. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bit stream. I would recommend an enum backed by a byte:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    // Pack four 2-bit markers into one byte, markers[0] in the high bits.
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]);
    return output;
}
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    // Unpack in the same order: the high two bits are the first marker.
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x3);
    return result;
}
You might see a slight performance increase by using four parameters (out parameters if necessary) instead of allocating and returning an array on the heap, but you lose the loop that keeps the code compact.
Now, because you're packing markers into blocks of four per byte, if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of the last block with zero values (A). Working around this is messy, but if you store a 32-bit integer that tells you the exact number of markers, you can simply discard anything extra you find in the byte stream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
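For example, a possible round trip (sketch only; markers is assumed to be a DnaMarker[] and System.Linq is in scope):
string text = string.Concat(markers.Select(m => m.ToString()));            // e.g. "ACGT..."
DnaMarker[] parsed = text
    .Select(c => (DnaMarker)Enum.Parse(typeof(DnaMarker), c.ToString()))
    .ToArray();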
And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.
I have an array of uint types in C#. After checking whether the program is running on a little-endian machine, I want to convert the data to big-endian. Because the amount of data can become very large but its length is always even, I was thinking of treating two uint values as one ulong for better performance and programming it in ASM, so I am searching for a very fast (the fastest, if possible) assembler algorithm to convert little-endian to big-endian.
For a large amount of data, the bswap instruction (available in Visual C++ under the _byteswap_ushort, _byteswap_ulong, and _byteswap_uint64 intrinsics) is the way to go. This will even outperform handwritten assembly. These are not available in pure C# without P/Invoke, so:
Only use this if you have a lot of data to byte swap.
You should seriously consider writing your lowest level application I/O in managed C++ so you can do your swapping before ever bringing the data into a managed array. You already have to write a C++ library, so there's not much to lose and you sidestep all the P/Invoke-related performance issues for low-complexity algorithms operating on large datasets.
PS: Many people are unaware of the byte swap intrinsics. Their performance is astonishing, doubly so for floating-point data, because they process it as integers. There is no way to beat them without hand-coding your register loads for every single byte swap use case, and should you try that, you'll probably take a bigger hit in the optimizer than you'll ever gain back.
You may want to simply rethink the problem; this should not be a bottleneck. Take the naive algorithm (written in CIL assembly, just for fun). Let's assume the number we want is in local variable 0:
ldloc.0
ldc.i4   24
shl                      // x << 24
ldloc.0
ldc.i4   0x0000ff00
and
ldc.i4   8
shl                      // (x & 0x0000ff00) << 8
or
ldloc.0
ldc.i4   8
shr.un
ldc.i4   0x0000ff00
and                      // (x >> 8) & 0x0000ff00
or
ldloc.0
ldc.i4   24
shr.un                   // x >> 24
or
At most that's 13 (x86) assembly instructions per number (and most likely the JIT will be even smarter about using registers). And it doesn't get more naive than that.
Now, compare that to the costs of
Getting the data loaded in (including whatever peripherals you are working with!)
Manipulation of the data (doing comparisons, for instance)
Outputting the result (whatever it is)
If 13 instructions per number is a significant chunk of your execution time, then you are doing a VERY high performance task and should have your input in the correct format! You also probably would not be using a managed language because you would want far more control over buffers of data and what-not, and no extra array bounds checking.
If that array of data comes across a network, I would expect there to be much greater costs from managing the sockets than from a mere byte-order flip; if it's from disk, consider pre-flipping the data before executing this program.
"I was thinking of treating two uint values as one ulong"
Well, that would also swap the two uint values, which might not be desirable...
You could try some C# code in unsafe mode, which may actually perform well enough. Like:
public static unsafe void SwapInts(uint[] data) {
    int cnt = data.Length;
    fixed (uint* d = data) {
        byte* p = (byte*)d;
        // Reverse the four bytes of each uint in place.
        while (cnt-- > 0) {
            byte a = *p;        // save byte 0
            p++;
            byte b = *p;        // save byte 1
            *p = *(p + 1);      // byte 1 = byte 2
            p++;
            *p = b;             // byte 2 = old byte 1
            p++;
            *(p - 3) = *p;      // byte 0 = byte 3
            *p = a;             // byte 3 = old byte 0
            p++;                // advance to the next uint
        }
    }
}
On my computer the throughput is around 2 GB per second.
I have a 200-character-long string A and a 5-character-long string B.
If I do
int Al = A.Length;
int Bl = B.Length;
and compare them, all seems fine, BUT if I do this a few million times to calculate something, it's too expensive for what I need.
A much simpler and neater way would be some function that can compare two strings and tell me whether one is AT LEAST as long as the other.
Something like compare_string_lengths(stringA, stringB), where string B must be at least as long (in characters) as string A for the function to return TRUE.
Yes, I know that the function wouldn't know in advance which string is shorter, but if the lengths of the two strings were counted in parallel, then as soon as one exceeds the other the function would know what to "answer".
Thanks for any hints.
If you only need to know whether the strings differ in length (or if you wish to check whether the lengths are equal before comparing), I don't think you can do it faster than comparing the Length property. Retrieving the length of a string is an O(1) operation.
To actually compare the strings you need to look at each character, which makes it an O(n) operation.
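In other words, the check the question asks for is just a direct Length comparison (variable names taken from the question):
// O(1): Length is stored on the string object, not recomputed from the characters.
bool bAtLeastAsLongAsA = B.Length >= A.Length;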
Edit:
If things run too slowly, try having a look in a profiler to see what the slowest parts are. Perhaps it is the construction of your strings that takes the time?
There are few things cheaper than comparing the length of two strings.
If you want to find a string in a list of strings, use a hash table such as a Dictionary, like:
var x = new System.Collections.Generic.Dictionary<string, bool>();
x.Add("string", true);
if (x.ContainsKey("string"))
Console.WriteLine("Found string.");
This is amazingly fast.
I am randomly generating a grid of characters and storing it in a char[,] array ...
I need a way to ensure that I haven't already generated a grid before serializing it to a database in binary format... What is the best way to compare two grids based on their bytes? The last thing I want to do is loop through their contents, as I am already pulling one of them from the DB in byte form.
I was thinking of a checksum, but I'm not sure whether this would work.
char[,] grid = new char[8,8];
char[,] secondgrid = new char[8,8];//gets its data from db
From what I can see, you are going to have to loop over the contents (or at least a portion of them); there is no other way of talking about an array's contents.
Well, as a fast "definitely not the same" you could compute a hash over the array - i.e. something like:
int hash = 7;
foreach (char c in data) {
hash = (hash * 17) + c.GetHashCode();
}
This has the risk of some false positives (reporting a duplicate when it is actually unique), but is otherwise quite cheap. Any use? You could store the hash alongside the data in the database to allow fast checks. But if you do that, you should pick your own hash algorithm for char (since char.GetHashCode() isn't guaranteed to stay the same across framework versions) - perhaps just convert to an int, for example - or re-use the existing implementation:
int hash = 7;
foreach (char c in data) {
hash = (hash * 17) + (c | (c << 0x10));
}
As an aside - for 8x8, you could always just think in terms of a 64-character string, and just check ==. This would work equally well in the database and in the application.
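A minimal sketch of that idea (assuming System.Linq is in scope): flatten the grid into a string that can be compared with == or stored directly in the database:
char[,] grid = new char[8, 8];
// ... fill the grid ...
string key = new string(grid.Cast<char>().ToArray());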
Can't you get the database to do it? Make the grid column UNIQUE. Then, if you need to detect that you've generated a duplicate grid, the method for doing this might involve checking the number of rows affected by your operation, or perhaps testing for errors.
Also, if each byte is simply picked at random from [0, 255], then performing a hash to get a 4-byte number is no better than taking the first four bytes out of the grid. The chance of collisions is the same.
I'd go with a checksum/hash mechanism to catch a large percentage of the matches, then do a full comparison if you get a match.
What is the range of characters used to fill in your grid? If you're using just letters (not mixed case, or case not important) and an 8x8 grid, you're only talking about 7 or so possible collisions per item within your problem space (a very rare occurrence), assuming a good hashing function. You could do something like:
1. Generate a grid.
2. Load any matching grids from the DB.
3. If a match was found in step 2, go to step 1.
4. Use your new grid.
Try this (invoke ComputeHash for every matrix and compare the guids):
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;

private static MD5 md5 = MD5.Create();
public static Guid ComputeHash(object value)
{
Guid g = Guid.Empty;
BinaryFormatter bf = new BinaryFormatter();
using (MemoryStream stm = new MemoryStream())
{
bf.Serialize(stm, value);
g = new Guid(md5.ComputeHash(stm.ToArray()));
stm.Close();
}
return g;
}
Note: generating the byte array could be done much more simply, since you already have a char array.
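For instance, a rough sketch of building the byte[] straight from the char grid instead of going through BinaryFormatter (this assumes the grid characters fit in a single byte, e.g. ASCII):
static byte[] GridToBytes(char[,] grid)
{
    var bytes = new byte[grid.Length];
    int i = 0;
    foreach (char c in grid)
        bytes[i++] = (byte)c;   // assumes single-byte (ASCII-range) characters
    return bytes;
}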