Efficiently finding new max element in a set after removing some elements - c#

We have some graphs (as in visual graphs with axes) with n data points, where n can be quite large. The data can be represented as parallel lists or arrays of x- and y-coordinates in double format, and the x-values are sorted. There may be duplicates among both the x- and y-values, and both may contain negative values. Furthermore, the y-values may contain NaN values.
Each time the data are updated, we need to recalculate the max value of the y-values to update the max value on the axis of the graphs. This is easy if data points are inserted, since we can just compare the new value with the current max value and see if this is exceeded. But when removing data points, we need to check a lot more data.
Often, a range of values is removed, say m data points starting at index i (we always receive this information as an index and a range in the data lists). Our current strategy is to find the max value of the removed m data points and compare it with the current max value for the entire data set. If they match, the max is recalculated from the remaining n - m data points and updated. This means that we only rarely need to check all n data points.
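In code, the current strategy looks roughly like this (a minimal sketch with illustrative names, not our actual code; ys is the list of y-values and currentMax is the cached axis maximum):

// Sketch of the current strategy: only rescan when the removed block held the max.
void RemoveRange(List<double> ys, int index, int count, ref double currentMax)
{
    // Max of the points about to be removed (ignoring NaN).
    double removedMax = double.NegativeInfinity;
    for (int i = index; i < index + count; i++)
        if (!double.IsNaN(ys[i]) && ys[i] > removedMax)
            removedMax = ys[i];

    ys.RemoveRange(index, count);

    // Only if the removed block contained the current max do we rescan: O(n) worst case.
    if (removedMax == currentMax)
    {
        currentMax = double.NegativeInfinity;
        foreach (double y in ys)
            if (!double.IsNaN(y) && y > currentMax)
                currentMax = y;
    }
}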
...but we would rather avoid this completely. The current remove operation has an average running time of O(1) (I think), but a worst-case of O(n). Is there some way to remove an element from a set and find the new max of the set in something like O(log n) which would be unnoticeable for our users? We can create and store additional lists and arrays of equivalent sizes if needed.
We have considered things like partitioning the data in segments, each with their own max value, but since the remove operation changes the indices of the underlying data, we need an efficient way of linking them without recalculating all indices. We also considered using a SortedSet, but sets don't allow duplicates.
I hope someone can point us to a solution or unexpectedly tell us that this method is already maximally efficient.

Related

How double hashing works in case of the .NET Dictionary?

The other day I was reading that article on CodeProject, and I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation here, without all the optimizations in .NET Core):
Note: If you add more items than the maximum number in the table
(i.e. 7199369), the resize method will manually search for the next prime
number that is larger than twice the old size.
Note: The reason that the sizes are being doubled while resizing the
array is to make the inner hash-table operations have asymptotic
complexity. The prime numbers are being used to support
double-hashing.
So I tried to remember my old CS classes from a decade ago, with the help of my good friend Wikipedia:
Open Addressing
Separate Chaining
Double Hashing
But I still don't really see how this relates to double hashing (which is a collision resolution technique for open-addressed hash tables), except for the fact that the Resize() method doubles the number of entries, using the minimum prime number chosen from the current/old size. And to be honest, I don't really see the benefit of "doubling" the size for the "asymptotic complexity" (I guess the article meant O(n) when the underlying array (entries) is full and subject to resize).
First, if you double the size, with or without using a prime, isn't it really the same?
Second, to me the .NET hash table uses a separate chaining technique when it comes to collision resolution.
I guess I must have missed a few things, and I would appreciate it if someone could shed some light on these two points.
I got my answer on Reddit, so I am going to try to summarize it here:
Collision Resolution Technique
First off, it seems that collision resolution uses the separate chaining technique, not open addressing, and therefore there is no double hashing strategy:
The code goes as follows:
private struct Entry
{
    public int hashCode;    // Lower 31 bits of hash code, -1 if unused
    public int next;        // Index of next entry, -1 if last
    public TKey key;        // Key of entry
    public TValue value;    // Value of entry
}
It's just that instead of having dedicated storage per bucket (a list or whatnot) for all the entries sharing the same hash code / index, everything is stored in the same entries array.
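For illustration, a lookup then walks the chain through that shared entries array via the next indices. A rough sketch (it assumes the surrounding buckets, entries, and comparer fields of a Dictionary-like class, and is simplified, not the actual BCL code):

private int FindEntry(TKey key)
{
    int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;  // lower 31 bits, matching the struct above
    int bucket = hashCode % buckets.Length;

    // Walk the chain through the shared entries array instead of a per-bucket list.
    for (int i = buckets[bucket]; i >= 0; i = entries[i].next)
    {
        if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
            return i;
    }
    return -1;  // not found
}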
Prime Number
About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about common factors:
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this
be achieved? By choosing m to be a number that has very few factors: a
prime number.
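A tiny made-up example of why the common factors matter: keys that are all multiples of 6 pile into two buckets when the table size shares factors with them, but spread out under a prime size.

using System;
using System.Linq;

int[] keys = { 0, 6, 12, 18, 24, 30, 36, 42, 48, 54 };

// Table size 12 shares the factors 2 and 3 with every key: only buckets 0 and 6 are ever used.
int usedComposite = keys.Select(k => k % 12).Distinct().Count();   // 2

// A prime table size of 13 spreads the same keys over 10 distinct buckets.
int usedPrime = keys.Select(k => k % 13).Distinct().Count();       // 10

Console.WriteLine($"size 12 -> {usedComposite} buckets, size 13 -> {usedPrime} buckets");

So doubling and the prime serve different purposes: doubling controls how often resizes happen, while the prime controls how evenly keys with common factors spread over the buckets.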
Doubling the underlying entries array size
Doubling helps avoid calling too many resize operations (i.e. copies) by growing the array by a large enough number of slots each time.
See that answer: https://stackoverflow.com/a/2369504/4636721
Hash-tables could not claim "amortized constant time insertion" if,
for instance, the resizing was by a constant increment. In that case
the cost of resizing (which grows with the size of the hash-table)
would make the cost of one insertion linear in the total number of
elements to insert. Because resizing becomes more and more expensive
with the size of the table, it has to happen "less and less often" to
keep the amortized cost of insertion constant.
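To make the resize policy concrete, here is a rough sketch of what "next prime larger than twice the old size" could look like; NextResizeSize and IsPrime are hypothetical helpers, not the actual HashHelpers code:

// Hypothetical helper: the smallest prime that is at least twice the old size.
static int NextResizeSize(int oldSize)
{
    int candidate = checked(oldSize * 2);
    while (!IsPrime(candidate))
        candidate++;
    return candidate;
}

static bool IsPrime(int n)
{
    if (n < 2) return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

Because each resize at least doubles the capacity, n insertions trigger only O(log n) resizes, which is what keeps insertion amortized O(1) despite the occasional O(n) copy.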

Optimizing array that has many elements and different standards

I have a function that takes in X as an argument and randomly picks an element from a 2D array.
The 2D array has thousands of elements, each of them has a different requirement on X, stored in arr[Y][1].
For example,
arr[0] should only be chosen when X is larger than 4. (arr[0][1] = 4+)
Then arr[33] should only be chosen when X is between 37 and 59. (arr[33][1] = 37!59)
And arr[490] should only be chosen when X is less than 79. (arr[490][1] = 79-)
And there are many more, most with a different X requirement.
What is the best way to tackle this problem using the least space and the least repetition of elements?
The worst way would be storing possible choices for each X in a 2D array. But that would cause a lot of repetition, costing too much memory.
I have also thought about using three arrays, separating the X+ requirements, the X- requirements, and the X ranges. But that still sounds too basic to me; is there a better way?
One option here would be what's called "accept/reject sampling": you pick a random index i and check if the condition on X is satisfied for that index. If so, you return arr[i]. If not, you pick another index at random and repeat until you find something.
Performance will be good so long as most conditions are satisfied for most values of i. If this isn't the case -- if there are a lot of values of X for which only a tiny number of conditions are satisfied -- then it might make sense to try and precompute something that lets you find (or narrow down) the indices that are allowable for a given X.
How to do this depends on what you allow as a condition on each index. For instance, if every condition is given by an interval like in the examples you give, you could sort the list twice, first by left endpoints and then by right endpoints. Then determining the valid indices for a particular value of X comes down to intersecting the intervals whose left endpoint is less than or equal to X with those whose right endpoint is greater than or equal to X.
Of course if you allow conditions other than "X is in this interval" then you'd need a different algorithm.
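A minimal sketch of the accept/reject idea in C#, assuming each element carries a predicate over X (the Entry record and its fields are just illustrative names):

using System;

// Illustrative element type: a payload plus its condition on X.
record Entry(string Value, Func<int, bool> Accepts);

static class Sampler
{
    // Accept/reject: draw random indices until the condition on X is satisfied.
    // Assumes at least one entry accepts this X; otherwise it loops forever.
    public static Entry PickRandom(Entry[] arr, int x, Random rng)
    {
        while (true)
        {
            Entry candidate = arr[rng.Next(arr.Length)];
            if (candidate.Accepts(x))
                return candidate;
        }
    }
}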
While I believe that re-sampling will be the optimal solution in your case (dozens of resamplings are a very cheap price to pay), here is an algorithm I would never implement in practice (since it uses very complicated data structures and is less efficient than resampling), but which has provable bounds. It requires O(n log n) preprocessing time, O(n log n) memory and O(log n) time for each query, where n is the number of elements you can potentially sample.
You store all the ends of all the ranges in one array (call it ends). E.g. in your case you have the array [-infty, 4, 37, 59, 79, +infty] (it may require some tuning, like adding +1 to the right ends of ranges; not important now). The idea is that for any X we only have to determine between which ends it is located. E.g. X=62 is in the range [59; 79] (I'll call such a pair an interval). Then for each interval you store the set of all possible ranges. For your input X you just find the interval (using binary search) and then output a random range corresponding to this interval.
How do you compute the corresponding set of ranges for each interval? We go from left to right in the ends array. Let's assume we have computed the set for the current interval and go to the next one. There is some endpoint between these intervals. If it's the left end of some range, we add the corresponding range to the new set (since we enter this range). If it's a right end, we remove the range. How do we do this in O(log n) time instead of O(n)? Immutable balanced tree sets can do this (essentially, they create new trees instead of modifying the old one).
How do you return a uniformly random range from a set? You should augment the tree sets: each node should know how many nodes its subtree contains. First you sample an integer in the range [0; size(tree)). Then you look at your root node and its children. For example, assume that you sampled the integer 15, and your left child's subtree has size 10, while the right child's has size 20. Then you go to the right child (since 15 >= 10) and process it with the integer 5 (since 15 - 10 = 5). You will eventually reach a leaf, corresponding to a single range. Return this range.
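Purely to make the structure concrete, here is a much-simplified sketch of the same idea: it keeps the O(log n) query via binary search over the endpoints, but stores a plain list of candidate indices per elementary interval instead of the persistent tree sets described above, so it does not achieve the stated preprocessing/memory bounds, and it assumes queries fall strictly between endpoints (exact-endpoint hits need the careful handling mentioned below). All names are illustrative.

using System;
using System.Collections.Generic;
using System.Linq;

class IntervalSampler
{
    private readonly double[] ends;           // sorted distinct endpoints
    private readonly List<int>[] candidates;  // valid element indices per elementary interval

    // ranges[i] = (Lo, Hi) means element i is valid when Lo <= X <= Hi.
    public IntervalSampler((double Lo, double Hi)[] ranges)
    {
        ends = ranges.SelectMany(r => new[] { r.Lo, r.Hi }).Distinct().OrderBy(v => v).ToArray();
        candidates = new List<int>[ends.Length + 1];
        for (int j = 0; j < candidates.Length; j++) candidates[j] = new List<int>();

        // Naive fill: for each elementary interval, record every range that covers it.
        // The answer above does this incrementally with persistent sets instead.
        for (int i = 0; i < ranges.Length; i++)
            for (int j = 0; j + 1 < ends.Length; j++)
                if (ranges[i].Lo <= ends[j] && ends[j + 1] <= ranges[i].Hi)
                    candidates[j + 1].Add(i);
    }

    // Binary search for the interval containing x, then a uniform random pick.
    public int? Sample(double x, Random rng)
    {
        int pos = Array.BinarySearch(ends, x);
        if (pos < 0) pos = ~pos;              // index of the first endpoint greater than x
        var list = candidates[pos];
        return list.Count == 0 ? (int?)null : list[rng.Next(list.Count)];
    }
}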
Sorry if it's hard to understand. Like I said, it's not a trivial approach, and you would only need it for worst-case upper bounds (the other approaches discussed before require linear time in the worst case; resampling may run for an indefinite time if there is no element satisfying the restrictions). It also requires some careful handling (e.g. when some ranges have coinciding endpoints).

Find the kth smallest/biggest element in a 2D sorted array

Given a 2D array whose rows and columns are sorted, find the kth largest element from the array in the most efficient way. Can it be done in place?
A slightly brute-force in-place solution: try to guess the value by binary search. You know the max and min values (they are in the corners). For every candidate, count the number of elements that are smaller while you follow the boundary between smaller and greater elements. Since the array is sorted, this boundary is a reasonably short path across it. Keep track of the position of the maximum among the smaller elements. This value might appear several times, so count them. Assuming an NxN array, this would take O(N*B), where B is the number of bits in the values.
I'm just thinking out loud... I vaguely remember reading about an incredibly optimal solution, but I don't know where.
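A sketch of that value-space binary search, phrased here for the kth smallest in an ascending row- and column-sorted matrix (for the kth largest, use k' = N*M - k + 1). The staircase walk counts the elements <= the candidate in O(N + M) per step, and the value range halves each step:

// kth smallest (1-based k) in a matrix whose rows and columns are sorted ascending.
static int KthSmallest(int[,] a, int k)
{
    int n = a.GetLength(0), m = a.GetLength(1);
    int lo = a[0, 0], hi = a[n - 1, m - 1];

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        // Staircase walk along the boundary: count elements <= mid, starting at the top-right corner.
        int count = 0, col = m - 1;
        for (int row = 0; row < n; row++)
        {
            while (col >= 0 && a[row, col] > mid) col--;
            count += col + 1;
        }

        if (count < k) lo = mid + 1;   // too few elements <= mid: the answer is larger
        else hi = mid;                 // at least k elements <= mid: the answer is <= mid
    }
    return lo;   // the smallest value with at least k elements <= it
}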

Generate Number Range in a List of Numbers

I am using C# and have a list of int numbers which contains different numbers, such as {34,36,40,35,37,38,39,4,5,3}. Now I need a script to find the different ranges in the list and write them to a file; for this example they would be (34-40) and (3-5). What is the quickest way to do it?
Thanks for the help in advance.
The easiest way would be to sort the array and then do a single sequential pass to capture the ranges. That will most likely be fast enough for your purposes.
Two techniques come to mind: histogramming and sorting. Histogramming will be good for dense number sets (where you have most of the numbers between min and max) and sorting will be good if you have sparse number sets (very few of the numbers between min and max are actually used).
For histogramming, simply walk the array and set a Boolean flag to true in the corresponding position of the histogram, then walk the histogram looking for runs of true (the default should be false).
For sorting, simply sort the array using the best applicable sorting technique, then walk the sorted array looking for contiguous runs.
EDIT: some examples.
Let's say you have an array with the first 1,000,000 positive integers, but all even multiples of 191 are removed (you don't know this ahead of time). Histogramming will be a better approach here.
Let's say you have an array containing powers of 2 (2, 4, 8, 16, ...) and 3 (3, 9, 27, 81, ...). For large lists, the list will be fairly sparse and sorting should be expected to do better.
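A minimal sketch of the histogramming approach; it assumes the span max - min fits comfortably in memory (needs System.Linq and System.Collections.Generic):

// Mark which values occur, then scan the histogram for runs of true.
static List<(int Start, int End)> FindRangesByHistogram(int[] numbers)
{
    int min = numbers.Min(), max = numbers.Max();
    var present = new bool[max - min + 1];
    foreach (int n in numbers) present[n - min] = true;

    var ranges = new List<(int, int)>();
    for (int i = 0; i < present.Length; )
    {
        if (!present[i]) { i++; continue; }
        int start = i;
        while (i < present.Length && present[i]) i++;
        ranges.Add((start + min, i - 1 + min));
    }
    return ranges;
}

For the example list this yields (3-5) and (34-40).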
As Mike said, first sort the list. Now, starting with the first element, remember that element, then compare it with the next one. If the next element is 1 greater than the current one, you have a contiguous series. Continue this until the next number is NOT contiguous. When you reach that point, you have a range from the first remembered value to the current value. Remember/output that range, then start again with the next value as the first element of a new series. This will execute in roughly 2N time (linear).
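That sort-and-walk might look roughly like this (duplicates are collapsed with Distinct, a non-empty list is assumed, and System.Linq plus System.Collections.Generic are needed):

// Sort the distinct values, then close a range whenever the next value is not prev + 1.
static List<(int Start, int End)> FindRangesBySorting(List<int> numbers)
{
    var sorted = numbers.Distinct().OrderBy(n => n).ToList();
    var ranges = new List<(int, int)>();

    int first = sorted[0], prev = sorted[0];
    foreach (int current in sorted.Skip(1))
    {
        if (current != prev + 1)      // gap found: close the current run
        {
            ranges.Add((first, prev));
            first = current;
        }
        prev = current;
    }
    ranges.Add((first, prev));        // close the final run
    return ranges;
}

For {34,36,40,35,37,38,39,4,5,3} this returns (3, 5) and (34, 40), which can then be formatted and written to the file.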
I would sort them and then check for consecutive numbers. If the difference > 1 you have a new range.

First n positions of true values from a bit pattern

I have a bit pattern of 100 bits. The program will change the bits in the pattern to true or false. At any given time I need to find the positions of the first "n" true values. For example, if the pattern is as follows:
10011001000
The first 3 indexes where bits are true are 0, 3, 4
The first 4 indexes where bits are true are 0, 3, 4, 7
I can have a List, but the complexity of firstntrue(int) will be O(n). Is there any way to improve the performance?
I'm assuming the list isn't changing while you are searching, but that it changes up until you decide to search, and then you do your thing.
For each byte there are 2^8 = 256 combinations of 0 and 1. Here you have ceil(100/8) = 13 bytes to examine.
So you can build a lookup table of 256 entries. The key is the current real value of the byte you're examining in the bit stream, and the value is the data you seek (a tuple containing the positions of the 1 bits). So, if you gave it 5 it would return {0,2}. The cost of this lookup is constant and the memory usage is very small.
Now as you go through the bit stream you can process the data a byte at a time (instead of a bit at a time) and just keep track of the current byte number (starting at 0, of course) and add 8 * current-byte-number to the values in the tuple returned. So now you've essentially reduced the problem to O(n/8) by using the precomputed lookup table.
You can build a larger look-up table to get more speed but that will use more memory.
Though I can't imagine that an O(n) algorithm where n=100 is really the source of some performance issue for you. Unless you're calling it a lot inside some inner loop?
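A sketch of that lookup table, assuming the 100-bit pattern is stored as a byte[] with bit 0 being the lowest bit of byte 0 (so a byte value of 5 maps to positions {0, 2}, as in the example above); BitScanner and its members are illustrative names:

using System.Collections.Generic;
using System.Linq;

static class BitScanner
{
    // For each possible byte value, the positions (0-7) of its set bits.
    static readonly int[][] BitPositions = Enumerable.Range(0, 256)
        .Select(b => Enumerable.Range(0, 8).Where(i => (b & (1 << i)) != 0).ToArray())
        .ToArray();

    // Positions of the first n set bits, processing the pattern a byte at a time.
    public static List<int> FirstNTrue(byte[] pattern, int n)
    {
        var result = new List<int>(n);
        for (int byteIndex = 0; byteIndex < pattern.Length && result.Count < n; byteIndex++)
        {
            foreach (int bit in BitPositions[pattern[byteIndex]])
            {
                result.Add(8 * byteIndex + bit);
                if (result.Count == n) break;
            }
        }
        return result;
    }
}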
No, there is no way to improve on O(n). That can be proven mathematically.
No.
Well, not unless you intercept the changes as they occur, and maintain a "first 100" list.
The complexity cannot be reduced without additional data structures, because in the worst case you need to scan the whole list.
For "n" Items you have to check at most "n" times I.E O(n)!
How can you expect to reduce that without any interception and any knowledge of how they've changed?!
No, you cannot improve the complexity if you just have a plain array.
If you have few 1s compared to many 0s, you can improve the performance by a constant factor, but it will still be O(n).
If you can treat your bit array as a byte array (or even an int32 array), you can check whether each byte > 0 before checking its individual bits.
If you have fewer than one 1-bit per 8 bits, you could implement it as a sparse structure instead, e.g. a List<byte> where you store the indices of all the 1s.
As others have said, to find the n lowest set bits in the absence of further structures is an O(n) operation.
If you're looking to improve performance, have you looked at the implementation side of the problem?
Off the top of my head, q & ~(q-1) will leave only the lowest set bit of the number q, since subtracting 1 from any binary number fills in 1s to the right up to the first digit that was set, changes that digit into a 0, and leaves the rest alone. In a number with one bit set, shifting to the right and testing against zero gives a simple test to distinguish whether a potential answer is less than the real answer or is greater than or equal to it. So you can binary search from there.
To find the next one, remove the lowest digit and use a smaller initial binary search window. There are probably better solutions, but this should be better than going through every bit from least to most and checking if it's set.
That's implementation-level stuff that doesn't affect the complexity, but it may help with performance.
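For what it's worth, a small sketch of that trick on a single 64-bit word: isolate the lowest set bit, record its position, clear it, and repeat. The inner loop here finds the position by simple shifting; the answer's binary-search refinement, or System.Numerics.BitOperations.TrailingZeroCount on .NET Core 3.0+, could replace it.

// Positions of the first n set bits in a 64-bit word, least significant first.
static List<int> FirstNTrue(ulong word, int n)
{
    var positions = new List<int>(n);
    while (word != 0 && positions.Count < n)
    {
        ulong lowest = word & (~word + 1);   // isolates the lowest set bit (same as q & ~(q - 1))
        int pos = 0;
        while ((lowest >> pos) != 1) pos++;  // which bit is it?
        positions.Add(pos);
        word &= word - 1;                    // clear that bit and continue
    }
    return positions;
}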
