I have a bit pattern of 100 bits. The program will set the bits in the pattern to true or false. At any given time I need to find the positions of the first "n" true values. For example, if the pattern is as follows
10011001000
The first 3 indexes where bits are true are 0, 3, 4
The first 4 indexes where bits are true are 0, 3, 4, 7
I can use a List, but the complexity of firstntrue(int) will be O(n). Is there any way to improve the performance?
I'm assuming the list isn't changing while you are searching, but that it changes up until you decide to search, and then you do your thing.
For each byte there are 2^8 = 256 combinations of 0 and 1. Here you have 100 bits, which is 100/8 = 12.5, rounded up to 13 bytes to examine.
So you can build a lookup table of 256 entries. The key is the current real value of the byte you're examining in the bit stream, and the value is the data you seek (a tuple containing the positions of the 1 bits). So, if you gave it 5 it would return {0,2}. The cost of this lookup is constant and the memory usage is very small.
Now as you go through the bit stream you can process the data a byte at a time (instead of a bit at a time): just keep track of the current byte number (starting at 0, of course) and add 8*current-byte-number to the values in the tuple returned. So you've essentially reduced the number of steps from n to n/8 by using the precomputed lookup table. That is a constant-factor speedup; the complexity is still O(n).
You can build a larger look-up table to get more speed but that will use more memory.
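To make the lookup-table idea concrete, here is a minimal C# sketch. It assumes the 100 bits are packed into a byte[] with index 0 being the least significant bit of byte 0; the names ByteLookup and FirstNTrue are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;

public static class ByteLookup
{
    // Table[b] lists the positions (0 = least significant bit) of the 1 bits in byte value b.
    private static readonly int[][] Table = BuildTable();

    private static int[][] BuildTable()
    {
        var table = new int[256][];
        for (int value = 0; value < 256; value++)
        {
            var positions = new List<int>();
            for (int bit = 0; bit < 8; bit++)
                if ((value & (1 << bit)) != 0) positions.Add(bit);
            table[value] = positions.ToArray();
        }
        return table;
    }

    // Returns up to n indexes of set bits, scanning one byte at a time.
    public static List<int> FirstNTrue(byte[] bits, int n)
    {
        var result = new List<int>();
        for (int byteIndex = 0; byteIndex < bits.Length && result.Count < n; byteIndex++)
        {
            foreach (int position in Table[bits[byteIndex]])
            {
                result.Add(8 * byteIndex + position);
                if (result.Count == n) break;
            }
        }
        return result;
    }
}
```

With the pattern 10011001000 packed that way, the first byte is 153 and the second is 0, so ByteLookup.FirstNTrue(new byte[] { 153, 0 }, 3) yields 0, 3, 4.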
Though I can't imagine that an O(n) algorithm where n=100 is really the source of some performance issue for you. Unless you're calling it a lot inside some inner loop?
No, there is no way to improve on O(n). That can be proven mathematically.
No.
Well, not unless you intercept the changes as they occur, and maintain a "first 100" list.
The complexity cannot be reduced without additional data structures, because in the worst case you need to scan the whole list.
For "n" items you have to check at most "n" times, i.e. O(n).
How can you expect to reduce that without any interception and any knowledge of how they've changed?!
No, you can not improve the complexity if you just have a plain array.
If you have few 1s relative to 0s you can improve the performance by a constant factor, but it will still be O(n).
If you can treat your bit array as a byte array (or even an int32 array), you can check whether each byte is nonzero before checking its individual bits.
If fewer than 1 in 8 bits are set, you could implement it as a sparse array instead: a List<byte> where you store the index of every 1.
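One way to read the "intercept the changes" and sparse-array suggestions together is to store only the indexes of the 1 bits and update that collection on every change. Below is a rough sketch of that idea, assuming a SortedSet<int> is acceptable instead of the List<byte> mentioned above; the class name SparseBits is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SparseBits
{
    // Holds only the indexes whose bit is currently true, kept in ascending order.
    private readonly SortedSet<int> _setIndexes = new SortedSet<int>();

    // Called every time the program flips a bit.
    public void Set(int index, bool value)
    {
        if (value) _setIndexes.Add(index);
        else _setIndexes.Remove(index);
    }

    // Enumerates the set in ascending order and stops after n elements.
    public List<int> FirstNTrue(int n) => _setIndexes.Take(n).ToList();
}
```

The query then only touches the bits that are set, at the price of extra work on every update, so whether it pays off depends on how often the pattern changes compared to how often it is queried.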
As others have said, to find the n lowest set bits in the absence of further structures is an O(n) operation.
If you're looking to improve performance, have you looked at the implementation side of the problem?
Off the top of my head, q & ~(q-1) will leave only the lowest bit set in the number q, since subtracting 1 from any binary number fills in 1s to the right up to the first digit that was set, changes that digit into a 0 and leaves the rest alone. In a number with 1 bit set, shifting to the right and testing against zero gives a simple test to distinguish whether a potential answer is less than the real answer or is greater than or equal. So you can binary search from there.
To find the next one, remove the lowest digit and use a smaller initial binary search window. There are probably better solutions, but this should be better than going through every bit from least to most and checking if it's set.
That's implementation stuff that doesn't affect the complexity, but it may help with performance.
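Here is a small sketch of the q & ~(q-1) trick, under the assumption that the bits live in a 64-bit word (100 bits would need two such words); the names are illustrative. The position of the isolated bit is found by a binary search over shift widths, and q & (q-1) is used to remove the lowest bit before looking for the next one. Unlike the suggestion above, this sketch does not shrink the search window for later bits.

```csharp
using System;
using System.Collections.Generic;

public static class LowestBits
{
    // Position of the single set bit in 'isolated', found by binary search on shift widths.
    private static int IndexOfSingleBit(ulong isolated)
    {
        int index = 0;
        for (int shift = 32; shift > 0; shift >>= 1)
        {
            if ((isolated >> shift) != 0) // the bit sits in the upper half of the window
            {
                isolated >>= shift;
                index += shift;
            }
        }
        return index;
    }

    // Returns up to n set-bit positions of q, lowest first.
    public static List<int> FirstNSet(ulong q, int n)
    {
        var result = new List<int>();
        while (q != 0 && result.Count < n)
        {
            ulong lowest = q & ~(q - 1); // isolates the lowest set bit
            result.Add(IndexOfSingleBit(lowest));
            q &= q - 1;                  // clears that bit
        }
        return result;
    }
}
```

The loop runs once per set bit found rather than once per bit position, which is where the practical gain comes from when the 1s are sparse.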
The other day I was reading that article on CodeProject
And I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation here without all the optimizations in .NET Core):
Note: If will add more items than the maximum number in the table
(i.e 7199369), the resize method will manually search the next prime
number that is larger than twice the old size.
Note: The reason that the sizes are being doubled while resizing the
array is to make the inner-hash table operations to have asymptotic
complexity. The prime numbers are being used to support
double-hashing.
So I tried to remember my old CS classes back a decade ago with my good friend wikipedia:
Open Addressing
Separate Chaining
Double Hashing
But I still don't really see how this relates to double hashing (which is a collision resolution technique for open-addressed hash tables), apart from the fact that the Resize() method doubles the number of entries, rounded up to the next prime number (based on the current/old size). And to be honest I don't really see the benefit of "doubling" the size for "asymptotic complexity" (I guess the article meant the O(n) cost incurred when the underlying array (entries) is full and has to be resized).
First, if you double the size with or without using a prime, is it not really the same?
Second, it seems to me that the .NET hash table uses a separate chaining technique for collision resolution.
I guess I must have missed a few things, and I would like someone to shed some light on those two points.
I got my answer on Reddit, so I am gonna try to summarize here:
Collision Resolution Technique
First off, it seems that the collision resolution is using Separate Chaining technique and not Open addressing technique and therefore there is no Double Hashing strategy:
The code goes as follows:
private struct Entry
{
    public int hashCode;    // Lower 31 bits of hash code, -1 if unused
    public int next;        // Index of next entry, -1 if last
    public TKey key;        // Key of entry
    public TValue value;    // Value of entry
}
It's just that instead of having dedicated storage per bucket (a list or whatnot) for all the entries sharing the same hash code / bucket index, everything is stored in the same entries array.
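To illustrate how chains can live inside one shared array, here is a heavily simplified sketch. It is not the actual .NET source: it is a hypothetical fixed-capacity int-to-int map (TinyIntMap) with no resizing, no removal and no duplicate-key check, and it only mirrors the buckets/entries/next layout described above.

```csharp
using System;

public class TinyIntMap
{
    private struct Entry
    {
        public int Key;
        public int Value;
        public int Next; // index of the next entry in the same bucket, -1 if last
    }

    private readonly int[] _buckets;   // _buckets[b] = index of the first entry of bucket b, -1 if empty
    private readonly Entry[] _entries; // every chain lives in this one array
    private int _count;

    public TinyIntMap(int capacity)
    {
        _buckets = new int[capacity];
        for (int i = 0; i < capacity; i++) _buckets[i] = -1;
        _entries = new Entry[capacity];
    }

    private int BucketOf(int key) => (key & 0x7FFFFFFF) % _buckets.Length;

    public void Add(int key, int value)
    {
        if (_count == _entries.Length)
            throw new InvalidOperationException("Full (this sketch does not resize).");
        int bucket = BucketOf(key);
        // The new entry becomes the head of the chain and links to the previous head.
        _entries[_count] = new Entry { Key = key, Value = value, Next = _buckets[bucket] };
        _buckets[bucket] = _count++;
    }

    public bool TryGetValue(int key, out int value)
    {
        // Walk the chain by following Next indexes inside the shared entries array.
        for (int i = _buckets[BucketOf(key)]; i >= 0; i = _entries[i].Next)
        {
            if (_entries[i].Key == key) { value = _entries[i].Value; return true; }
        }
        value = 0;
        return false;
    }
}
```

With a capacity of 7, the keys 1 and 8 land in the same bucket; the second one is simply stored in the next free slot of the entries array and linked to the first through Next.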
Prime Number
About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about multiples:
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this
be achieved? By choosing m to be a number that has very few factors: a
prime number.
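A tiny experiment makes the common-factor argument tangible. The sketch below (the helper name DistinctBuckets is made up) counts how many distinct buckets are hit when keys that are all multiples of some step are reduced modulo the table size m.

```csharp
using System;
using System.Collections.Generic;

public static class BucketSpread
{
    // Counts the distinct buckets hit when 'count' multiples of 'step' are reduced modulo m.
    public static int DistinctBuckets(int step, int count, int m)
    {
        var used = new HashSet<int>();
        for (int i = 0; i < count; i++)
            used.Add((i * step) % m);
        return used.Count;
    }
}
```

For keys that are multiples of 4, DistinctBuckets(4, 100, 8) gives 2 (only buckets 0 and 4 are ever used), while DistinctBuckets(4, 100, 7) gives 7, i.e. every bucket of the prime-sized table is used.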
Doubling the underlying entries array size
It helps to avoid too many resize operations (i.e. copies) by growing the array by a large enough number of slots each time.
See that answer: https://stackoverflow.com/a/2369504/4636721
Hash-tables could not claim "amortized constant time insertion" if,
for instance, the resizing was by a constant increment. In that case
the cost of resizing (which grows with the size of the hash-table)
would make the cost of one insertion linear in the total number of
elements to insert. Because resizing becomes more and more expensive
with the size of the table, it has to happen "less and less often" to
keep the amortized cost of insertion constant.
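The quoted argument can be checked with a quick count of how many element copies the resizes cause. This is only a sketch under simplifying assumptions: it ignores the prime rounding, and the helper name CopiesWhenGrowing and the growth rules passed to it are hypothetical.

```csharp
using System;

public static class ResizeCost
{
    // Total number of element copies performed while appending 'items' elements,
    // starting at 'initialCapacity' and applying 'grow' whenever the array is full.
    public static long CopiesWhenGrowing(int items, int initialCapacity, Func<int, int> grow)
    {
        long copies = 0;
        int capacity = initialCapacity;
        for (int count = 0; count < items; count++)
        {
            if (count == capacity)
            {
                copies += count;           // a resize copies every existing element
                capacity = grow(capacity);
            }
        }
        return copies;
    }
}
```

For 1024 insertions starting at capacity 4, doubling (c => c * 2) costs 1020 copies in total, less than one per insertion, whereas growing by a constant 4 slots (c => c + 4) costs 130560 copies, which grows quadratically with the number of items.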
I have read on Wikipedia that hash tables have O(1) search time on average.
So let's say I have a very large dictionary that contains maybe tens of millions of records.
If I use Dictionary.ContainsKey to look up a given key, will its lookup time really be O(1), or would it be something like O(log n) or something else due to some different internal implementation by .NET?
Big Oh notation doesn't tell you how long something takes. It tells you how it scales.
Easiest one to envision is searching for an item in a List<>, it has O(n) complexity. If it takes, on average, 2 milliseconds to find an item in a list with a million elements then you can expect it to take 4 milliseconds if the list has two million elements. It scales linearly with the size of the list.
O(1) predicts constant time for finding an element in a dictionary. In other words, it doesn't depend on the size of the dictionary. If the dictionary is twice as big, it doesn't take twice as long to find the element, it takes (roughly) as much time. The "roughly" means that it can actually take a bit longer: lookup is O(1) on average, and the exact cost depends on how many collisions the key runs into.
It would still be close to O(1), because it would still not depend on the number of the entries, but on the numbers of the collisions you have. Indexing an array is still O(1), no matter how many items you have.
Also, there seems to be a top limit on size of Dictionary caused by the implementation: How is the c#/.net 3.5 dictionary implemented?
Once we pass this size, the next step falls outside the internal array, and it will manually search for larger primes. This will be quite slow. You could initialize with 7199369 (the largest value in the array), or consider if having more than about 5 million entries in a Dictionary might mean that you should reconsider your design.
What is the key? If the key is Int32 then yes it will be close to order 1.
You only get worse than order 1 if there are hash collisions.
Int32 as a key will have zero hash collisions but that does not guarantee zero hash bucket collisions.
Be careful of keys that produce hash collisions.
KVP and tuple can create a lot of hash collisions and are not good candidates for key.
I am using C# and have a list of int numbers which contains different numbers such as {34,36,40,35,37,38,39,4,5,3}. Now I need a script to find the different ranges in the list and write them to a file. For this example it would be: (34-40) and (3-5). What is a quick way to do it?
thanks for the help in advance;
The easiest way would be to sort the array and then do a single sequential pass to capture the ranges. That will most likely be fast enough for your purposes.
Two techniques come to mind: histogramming and sorting. Histogramming will be good for dense number sets (where you have most of the numbers between min and max) and sorting will be good if you have sparse number sets (very few of the numbers between min and max are actually used).
For histogramming, simply walk the array and set a Boolean flag to True in the corresponding position histogram, then walk the histogram looking for runs of True (default should be false).
For sorting, simply sort the array using the best applicable sorting technique, then walk the sorted array looking for contiguous runs.
EDIT: some examples.
Let's say you have an array with the first 1,000,000 positive integers, but all even multiples of 191 are removed (you don't know this ahead of time). Histogramming will be a better approach here.
Let's say you have an array containing powers of 2 (2, 4, 8, 16, ...) and 3 (3, 9, 27, 81, ...). For large lists, the list will be fairly sparse and sorting should be expected to do better.
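For the histogramming variant, a minimal sketch could look like the following. It assumes the spread between the smallest and largest value is small enough for a bool[] of that size to fit in memory; the class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramRanges
{
    public static List<(int Start, int End)> Find(IReadOnlyList<int> numbers)
    {
        var ranges = new List<(int Start, int End)>();
        if (numbers.Count == 0) return ranges;

        int min = numbers.Min(), max = numbers.Max();
        var present = new bool[max - min + 1]; // every flag defaults to false
        foreach (int n in numbers) present[n - min] = true;

        int runStart = -1; // -1 means "not inside a run of true flags"
        for (int i = 0; i <= present.Length; i++)
        {
            bool on = i < present.Length && present[i];
            if (on && runStart < 0)
            {
                runStart = i;                              // a run begins here
            }
            else if (!on && runStart >= 0)
            {
                ranges.Add((min + runStart, min + i - 1)); // the run ended just before i
                runStart = -1;
            }
        }
        return ranges;
    }
}
```

For {34,36,40,35,37,38,39,4,5,3} this produces (3-5) followed by (34-40), in ascending order.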
As Mike said, first sort the list. Now, starting with the first element, remember that element, then compare it with the next one. If the next element is 1 greater than the current one, you have a contiguous series. Continue this until the next number is NOT contiguous. When you reach that point, you have a range from the first remembered value to the current value. Remember/output that range, then start again with the next value as the first element of a new series. Once the list is sorted, this pass executes in roughly 2N time (linear); the sort itself is typically O(N log N) and dominates the total.
I would sort them and then check for consecutive numbers. If the difference > 1 you have a new range.
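The sort-then-scan approach described in these answers can be sketched as follows. The names are illustrative, and treating duplicates as part of the current range is an assumption, since the question does not say whether duplicates can occur.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedRanges
{
    public static List<(int Start, int End)> Find(IEnumerable<int> numbers)
    {
        int[] sorted = numbers.ToArray();
        Array.Sort(sorted);

        var ranges = new List<(int Start, int End)>();
        if (sorted.Length == 0) return ranges;

        int start = sorted[0], previous = sorted[0];
        for (int i = 1; i < sorted.Length; i++)
        {
            // A gap larger than 1 closes the current range; duplicates (gap 0) extend it.
            if ((long)sorted[i] - previous > 1)
            {
                ranges.Add((start, previous));
                start = sorted[i];
            }
            previous = sorted[i];
        }
        ranges.Add((start, previous)); // close the final range
        return ranges;
    }
}
```

Writing the result to a file could then be done with something like File.WriteAllLines("ranges.txt", ranges.Select(r => $"({r.Start}-{r.End})")), where the file name is just a placeholder.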
I'm working on a genetic algorithm project where I encode my chromosome in a binary string where each 32 bits represents a value. The problem is that the function I'm optimizing has over 3000 parameters, which implies that I have over 96000 bits in my bit string, and the manipulations I do on this are simply too slow...
I have proceeded as follows: I have a binary class where I'm creating a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts and moves and such.
My question is, is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for way over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bit operations on a whole machine word, where possible.
For example, this would make shifts more efficient by roughly a factor of 4.
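As an illustration of working on whole machine words, here is a sketch of shifting a bit string stored in a ulong[]. The layout (bit i of the string is bit i % 64 of words[i / 64]) and the names are assumptions made for the example, and the shift is limited to fewer than 64 positions to keep it short.

```csharp
using System;

public static class PackedBits
{
    // Shifts the whole bit string towards higher indexes by 'count' bits (0 <= count < 64).
    // Bits pushed past the end of the last word are dropped.
    public static void ShiftLeft(ulong[] words, int count)
    {
        if (count < 0 || count >= 64) throw new ArgumentOutOfRangeException(nameof(count));
        if (count == 0) return; // also avoids a shift by 64, which C# would treat as a shift by 0

        // Go from the highest word down so each word can still read its unshifted lower neighbour.
        for (int i = words.Length - 1; i >= 0; i--)
        {
            ulong carry = i > 0 ? words[i - 1] >> (64 - count) : 0UL;
            words[i] = (words[i] << count) | carry;
        }
    }
}
```

A 96000-bit string fits in 1500 words, so one shift is a loop of 1500 iterations instead of 96000 element moves in a bool[].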
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with 4 values; A,C,G,T. That can be coded in 2 bits, making a byte able to hold 4 values. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are to and from the compressed bitstream. I would recommend a byte-keyed enum:
public enum DnaMarker : byte { A, C, G, T };
Then, you go from 4 of these to a byte with one operation:
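// Extension methods like this one must be declared inside a static class.
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    // Pack four 2-bit markers; markers[0] ends up in the two highest bits.
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]);
    return output;
}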
... and parse them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    // Unpack in the order ToByteCode packed them: index 0 is the two highest bits.
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x3);
    return result;
}
You might see a slight performance increase using four parameters (output if necessary) versus allocating and using an array in the heap. But, you lose the iteration which makes the code more compact.
Now, because you're packing them into four-marker "blocks" (one byte each), if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of your last block with zero values (A). Working around this is messy, but if you stored a 32-bit integer that told you the exact number of markers, you could simply discard anything beyond that count in the bytestream.
From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().
And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.
I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:
Add a new value
Iterate over all of the values.
There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom filter would be great, if only I could tolerate the false positives.
I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.
EDIT: To answer your questions:
No, the items don't need to be sorted
By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
There could be a decent number of these sets in memory at once (~100).
Use the BitArray. It uses only some 6MB of memory; the only real problem is that iteration is Theta(N), i.e. you have to walk the entire range. Locality of reference is good though and you can allocate the entire structure in one operation.
As for wasting space: you waste 6MB in the worst case.
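A minimal sketch of the BitArray suggestion, wrapped in a hypothetical IdSet class that offers just the two operations from the question (add and iterate):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class IdSet
{
    private readonly BitArray _bits;

    // One bit per possible id, so 0..50,000,000 takes roughly 6 MB regardless of how many ids are added.
    public IdSet(int maxIdInclusive)
    {
        _bits = new BitArray(maxIdInclusive + 1);
    }

    public void Add(int id)
    {
        _bits[id] = true;
    }

    // Walks the entire range, yielding the ids that are present in ascending order.
    public IEnumerable<int> Values()
    {
        for (int id = 0; id < _bits.Length; id++)
            if (_bits[id]) yield return id;
    }
}
```

Adding the same id twice is harmless, and iteration comes out sorted as a side effect, which is convenient if ranges are to be sent over the wire.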
EDIT: ok, you've lots of sets and you're serializing. For serializing on disk, I suggest 6MB files :)
For sending over the wire, just iterate and consider sending ranges instead of individual elements. That does require a sorted structure.
You need lots of these sets. Consider if you have 600MB to spare. Otherwise, check out:
Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
A custom hash table, perhaps Google sparsehash through C++/CLI
BSTs storing ranges/intervals
Supernode BSTs
It would depend on the distribution of the sizes of your sets. Unless you expect most of the sets to be (close to) the minimum you've specified, I'd probably use a bitset. To cover a range up to 50,000,000, a bitset ends up ~6 megabytes.
Compared to storing the numbers directly, this is marginally larger for the minimum size set you've specified (~6 megabytes instead of ~4), but considerably smaller for the maximum size set (1/32nd the size).
The second possibility would be to use a delta encoding. For example, instead of storing each number directly, store the difference between that number and the previous number that was included. Given a maximum magnitude of 50,000,000 and a minimum size of 1,000,000 items, the average difference between one number and the next is ~50. This means you can theoretically store the difference in <6 bits on average. I'd probably use the 7 least significant bits directly, and if you need to encode a larger gap, set the msb and (for example) store the size of the gap in the lower 7 bits plus the next three bytes. That can't happen very often, so in most cases you're using only one byte per number, for about 4:1 compression compared to storing numbers directly. In the best case this would use ~1 megabyte for a set, and in the worst about 50 megabytes -- 4:1 compression compared to storing numbers directly.
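Here is one possible reading of that delta scheme as code. It is a sketch under assumptions: the values are non-negative and already sorted ascending, the first gap is measured from 0, and a gap of 128 or more is written as four bytes whose first byte has the msb set and carries the top 7 bits of a 31-bit gap. The DeltaCodec name and the exact byte layout are illustrative choices.

```csharp
using System;
using System.Collections.Generic;

public static class DeltaCodec
{
    // Encodes ascending, non-negative values as gaps from the previous value.
    public static byte[] Encode(IEnumerable<int> sortedValues)
    {
        var output = new List<byte>();
        int previous = 0;
        foreach (int value in sortedValues)
        {
            int gap = value - previous;
            if (gap < 0) throw new ArgumentException("Values must be non-negative and sorted ascending.");
            if (gap < 0x80)
            {
                output.Add((byte)gap);                           // msb clear: gap fits in 7 bits
            }
            else
            {
                output.Add((byte)(0x80 | (gap >> 24)));          // msb set: top 7 bits of the gap
                output.Add((byte)((gap >> 16) & 0xFF));
                output.Add((byte)((gap >> 8) & 0xFF));
                output.Add((byte)(gap & 0xFF));
            }
            previous = value;
        }
        return output.ToArray();
    }

    public static List<int> Decode(byte[] data)
    {
        var values = new List<int>();
        int previous = 0;
        for (int i = 0; i < data.Length; )
        {
            int gap = data[i++];
            if ((gap & 0x80) != 0) // long form: 7 bits here plus the next three bytes
            {
                gap = ((gap & 0x7F) << 24) | (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
                i += 3;
            }
            previous += gap;
            values.Add(previous);
        }
        return values;
    }
}
```

Encoding {3, 50, 1000, 50000000} takes 10 bytes (two short gaps and two long ones) instead of 16 bytes as raw Int32 values. Note that adding a new value to an encoded set means re-encoding from the insertion point onwards, so this format fits serialization better than frequent in-memory updates.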
If you don't mind a little bit of extra code, you could use an adaptive scheme -- delta encoding for small sets (up to 6,000,000 numbers), and a bitmap for larger sets.
I think the answer depends on what you mean by "passing around" and what you're trying to accomplish. You say you are only adding to the list: how often do you add? How fast will the list grow? What is an acceptable overhead for memory use, versus the time to reallocate memory?
In your worst case, 50,000,000 32-bit numbers = 200 megabytes when the values are stored directly, even in the most compact array. Assuming you may end up with this much use in your worst case scenario, is it OK to use this much memory all the time? Is that better than having to reallocate memory frequently? What's the distribution of typical usage patterns? You could always just use an int[] that's pre-allocated to the whole 50 million.
As far as access speed for your operations, nothing is faster than iterating and adding to a pre-allocated chunk of memory.
From OP edit: There could be a decent number of these sets in memory at once (~100).
Hey now. You need to store 100 sets of 1 to 50 million numbers in memory at once? I think the bitset method is the only possible way this could work.
That would be 600 megabytes. Not insignificant, but unless they are (typically) mostly empty, it seems very unlikely that you would find a more efficient storage mechanism.
Now, if you don't use bitsets, but rather use dynamically sized constructs, and they could somehow take up less space to begin with, you're talking about a real ugly memory allocation/deallocation/garbage collection scenario.
Let's assume you really need to do this, though I can only imagine why. So your server's got a ton of memory, just allocate as many of these 6 megabyte bitsets as you need and recycle them. Allocation and garbage collection are no longer a problem. Yeah, you're using a ton of memory, but that seems inevitable.