Optimizing an array that has many elements and different requirements - C#

I have a function that takes in X as an argument and randomly picks an element from a 2D array.
The 2D array has thousands of elements, each of them has a different requirement on X, stored in arr[Y][1].
For example,
arr[0] should only be chosen when X is larger than 4. (arr[0][1] = 4+)
Then arr[33] should only be chosen when X is between 37 and 59. (arr[33][1] = 37!59)
And arr[490] should only be chosen when X is less than 79. (arr[490][1] = 79-)
And there are many more, most with a different X requirement.
What is the best way to tackle this problem using the least space and the least repetition of elements?
The worst way would be storing possible choices for each X in a 2D array. But that would cause a lot of repetition, costing too much memory.
I have also thought about using three arrays, separating the X+ requirements, the X- requirements, and the X ranges, but that still sounds too basic to me. Is there a better way?

One option here would be what's called "accept/reject sampling": you pick a random index i and check if the condition on X is satisfied for that index. If so, you return arr[i]. If not, you pick another index at random and repeat until you find something.
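A minimal sketch of that loop, assuming each element carries its own predicate on X (the Entry type and its Accepts delegate are made up for the example):

using System;

class Entry
{
    public string Value;
    public Func<int, bool> Accepts;   // e.g. x => x > 4, or x => x >= 37 && x <= 59
}

static Entry PickRandom(Entry[] entries, int x, Random rng)
{
    // Accept/reject sampling: draw random indices until the condition on X
    // holds for the drawn element. Consider a retry cap if X may match nothing.
    while (true)
    {
        Entry candidate = entries[rng.Next(entries.Length)];
        if (candidate.Accepts(x))
            return candidate;
    }
}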
Performance will be good so long as most conditions are satisfied for most values of i. If this isn't the case -- if there are a lot of values of X for which only a tiny number of conditions are satisfied -- then it might make sense to try and precompute something that lets you find (or narrow down) the indices that are allowable for a given X.
How to do this depends on what you allow as a condition on each index. For instance, if every condition is given by an interval like in the examples you give, you could sort the list twice, first by left endpoints and then by right endpoints. Then determining the valid indices for a particular value of X comes down to intersecting the intervals whose left endpoint is less than or equal to X with those whose right endpoint is greater than or equal to X.
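A rough sketch of that double-sort idea, assuming every condition has been normalized to an inclusive interval [Lo, Hi], with one-sided conditions using int.MinValue or int.MaxValue as the missing bound (the Interval record and method name are illustrative):

using System.Collections.Generic;
using System.Linq;

record Interval(int Lo, int Hi, int Index);   // Index points back into arr

static List<int> ValidIndices(List<Interval> byLo, List<Interval> byHi, int x)
{
    // byLo is pre-sorted by Lo, byHi by Hi (done once, up front).
    // Elements with Lo <= x form a prefix of byLo; elements with Hi >= x
    // form a suffix of byHi. The valid indices are the intersection of the two.
    var leftOk = new HashSet<int>(byLo.TakeWhile(iv => iv.Lo <= x).Select(iv => iv.Index));
    var valid = new List<int>();
    for (int k = byHi.Count - 1; k >= 0 && byHi[k].Hi >= x; k--)
        if (leftOk.Contains(byHi[k].Index))
            valid.Add(byHi[k].Index);
    return valid;
}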
Of course if you allow conditions other than "X is in this interval" then you'd need a different algorithm.

While I believe that re-sampling will be the optimal solution in your case (dozens of resamplings is a very cheap price to pay), here is an algorithm I would never implement in practice (since it uses fairly complicated data structures and is less efficient than resampling), but which has provable bounds. It requires O(n log n) preprocessing time, O(n log n) memory and O(log n) time for each query, where n is the number of elements you can potentially sample.
You store all ends of all ranges in one array (call it ends). E.g. in your case you have the array [-infty, 4, 37, 59, 79, +infty] (it may require some tuning, like adding +1 to the right ends of ranges; not important now). The idea is that for any X we only have to determine between which ends it is located. E.g. X=62 lies in [59; 79] (I'll call such a pair an interval). Then for each interval you store a set of all possible ranges. For your input X you just find the interval (using binary search) and then output a random range corresponding to that interval.
How do you compute the corresponding set of ranges for each interval? We go from left to right in the ends array. Let's assume we have computed the set for the current interval and go to the next one. There is some endpoint between these intervals. If it's the left end of some range, we add that range to the new set (since we enter this range). If it's a right end, we remove the range. How do we do this in O(log n) time instead of O(n)? Immutable balanced tree sets can do this (essentially, they create new trees instead of modifying the old one).
How do you return a uniformly random range from a set? You should augment the tree sets: each node should know how many nodes its subtree contains. First you sample an integer in the range [0; size(tree)). Then you look at your root node and its children. For example, assume that you sampled the integer 15, and your left child's subtree has size 10 while the right child's has size 20. Then you go to the right child (since 15 >= 10) and process it with the integer 5 (since 15 - 10 = 5). You will eventually visit a leaf corresponding to a single range. Return this range.
Sorry if it's hard to understand. Like I said, it's not a trivial approach, and you would only need it for worst-case upper bounds (the other approaches discussed above require linear time in the worst case; resampling may run for an indefinite time if there is no element satisfying the restrictions). It also requires some careful handling (e.g. when some ranges have coinciding endpoints).
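For illustration only, here is a much-simplified version of that idea: it keeps the binary search over endpoints and the O(log n) query, but stores an explicit list of candidates per interval instead of persistent augmented trees, so the preprocessing memory can degrade to O(n^2) in the worst case (all names are made up):

using System;
using System.Collections.Generic;
using System.Linq;

class IntervalSampler
{
    private readonly int[] ends;          // sorted distinct endpoints
    private readonly List<int>[] choices; // choices[i] = element indices valid on [ends[i], ends[i+1])
    private readonly Random rng = new Random();

    // ranges[j] is the inclusive (Lo, Hi) condition for element j.
    public IntervalSampler((int Lo, int Hi)[] ranges)
    {
        ends = ranges.SelectMany(r => new[] { r.Lo, r.Hi + 1 }).Distinct().OrderBy(e => e).ToArray();
        choices = new List<int>[ends.Length];
        for (int i = 0; i < ends.Length; i++)
        {
            int x = ends[i];
            choices[i] = Enumerable.Range(0, ranges.Length)
                                   .Where(j => ranges[j].Lo <= x && x <= ranges[j].Hi)
                                   .ToList();
        }
    }

    public int SampleIndex(int x)
    {
        // Binary search for the interval containing x, then pick uniformly from its list.
        int i = Array.BinarySearch(ends, x);
        if (i < 0) i = ~i - 1;   // index of the last endpoint <= x
        if (i < 0 || choices[i].Count == 0)
            throw new InvalidOperationException("No element satisfies this X.");
        return choices[i][rng.Next(choices[i].Count)];
    }
}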

Related

Efficiently finding new max element in a set after removing some elements

We have some graphs (as in visual graphs with axes) with n data points, where n can be quite large. The data can be represented as coherent lists or arrays of x- and y-coordinates in double format, and the x-values are sorted. There may be duplicates among both the x- and y-values, and both may contain negative values. Furthermore, the y-values may contain NaN values.
Each time the data are updated, we need to recalculate the max value of the y-values to update the max value on the axis of the graphs. This is easy if data points are inserted, since we can just compare the new value with the current max value and see if this is exceeded. But when removing data points, we need to check a lot more data.
Often, a range of values is removed, say from index i and the following m indices (we always receive this information as an index and a range in the data lists). Our current strategy is to find the max value of the removed m data points and compare it with the current max value for the entire data set. If the two match, the max is recalculated from the remaining n - m data points and updated. This means that we only rarely need to check against all n data points.
...but we would rather avoid this completely. The current remove operation has an average running time of O(1) (I think), but a worst-case of O(n). Is there some way to remove an element from a set and find the new max of the set in something like O(log n) which would be unnoticeable for our users? We can create and store additional lists and arrays of equivalent sizes if needed.
We have considered things like partitioning the data in segments, each with their own max value, but since the remove operation changes the indices of the underlying data, we need an efficient way of linking them without recalculating all indices. We also considered using a SortedSet, but sets don't allow duplicates.
I hope someone can point us to a solution or unexpectedly tell us that this method is already maximally efficient.
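For concreteness, the current remove-and-compare strategy looks roughly like this (an illustrative sketch; the list and method names are made up, and NaN values are skipped when taking the max):

using System.Collections.Generic;
using System.Linq;

static double RemoveRangeAndGetMax(List<double> ys, int start, int count, double currentMax)
{
    // Max of the points being removed (ignoring NaN).
    double removedMax = ys.Skip(start).Take(count)
                          .Where(y => !double.IsNaN(y))
                          .DefaultIfEmpty(double.NegativeInfinity)
                          .Max();
    ys.RemoveRange(start, count);

    // Only when the removed block contained the current max do we pay for the O(n) rescan.
    if (removedMax < currentMax)
        return currentMax;
    return ys.Where(y => !double.IsNaN(y)).DefaultIfEmpty(double.NegativeInfinity).Max();
}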

Find the kth smallest/biggest element in a 2D sorted array

Given a 2D array whose rows and columns are sorted.
Find the kth largest element of the 2D array in the most efficient way. Can it be done in-place?
A slightly brute-force in-place solution: try to guess the value by binary search. You know the max and min values (they are in the corners). For every candidate, count the number of elements that are smaller while you follow the boundary between smaller and greater elements. Since the array is sorted, this boundary is a reasonably short path crossing it. Keep track of the position of the maximum among the smaller elements. This value might appear several times; count them. Assuming an NxN array, this would take O(N*B), where B is the number of bits in the values.
I'm just thinking out loud... I vaguely remember reading about an incredibly optimal solution, but I don't know where.
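A sketch along those lines for the kth smallest (the kth largest is symmetric), assuming rows are sorted left-to-right and columns top-to-bottom; it binary-searches over the value range and counts with a boundary walk:

// Counts entries <= value by walking the boundary between "smaller or equal"
// and "greater" cells: start at the top-right corner, move left or down.
// O(rows + cols) per call because the boundary is a monotone staircase.
static int CountLessOrEqual(int[,] a, int value)
{
    int rows = a.GetLength(0), cols = a.GetLength(1);
    int r = 0, c = cols - 1, count = 0;
    while (r < rows && c >= 0)
    {
        if (a[r, c] <= value)
        {
            count += c + 1;   // everything to the left in this row is also <= value
            r++;
        }
        else
        {
            c--;
        }
    }
    return count;
}

// Binary search over the value range for the kth smallest (1-based k).
static int KthSmallest(int[,] a, int k)
{
    int lo = a[0, 0], hi = a[a.GetLength(0) - 1, a.GetLength(1) - 1];
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (CountLessOrEqual(a, mid) < k) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}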

Generate Number Range in a List of Numbers

I am using C# and have a list of int numbers which contains different numbers such as {34,36,40,35,37,38,39,4,5,3}. Now I need a script to find the different ranges in the list and write them to a file. For this example the result would be: (34-40) and (3-5). What is the quickest way to do it?
Thanks for the help in advance.
The easiest way would be to sort the array and then do a single sequential pass to capture the ranges. That will most likely be fast enough for your purposes.
Two techniques come to mind: histogramming and sorting. Histogramming will be good for dense number sets (where you have most of the numbers between min and max) and sorting will be good if you have sparse number sets (very few of the numbers between min and max are actually used).
For histogramming, simply walk the array and set a Boolean flag to True at the corresponding position in the histogram, then walk the histogram looking for runs of True (the default should be False).
For sorting, simply sort the array using the best applicable sorting technique, then walk the sorted array looking for contiguous runs.
EDIT: some examples.
Let's say you have an array with the first 1,000,000 positive integers, but all even multiples of 191 are removed (you don't know this ahead of time). Histogramming will be a better approach here.
Let's say you have an array containing powers of 2 (2, 4, 8, 16, ...) and 3 (3, 9, 27, 81, ...). For large lists, the list will be fairly sparse and sorting should be expected to do better.
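A minimal histogram sketch along those lines, assuming non-negative ints with a known upper bound (the method name is illustrative):

using System.Collections.Generic;

static List<(int Start, int End)> RangesByHistogram(IEnumerable<int> numbers, int maxValue)
{
    bool[] seen = new bool[maxValue + 1];      // defaults to false
    foreach (int n in numbers) seen[n] = true;

    var ranges = new List<(int Start, int End)>();
    for (int i = 0; i <= maxValue; i++)
    {
        if (!seen[i]) continue;
        int start = i;
        while (i + 1 <= maxValue && seen[i + 1]) i++;   // extend the run of trues
        ranges.Add((start, i));
    }
    return ranges;
}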
As Mike said, first sort the list. Now, starting with the first element, remember that element, then compare it with the next one. If the next element is 1 greater than the current one, you have a contiguous series. Continue this until the next number is NOT contiguous. When you reach that point, you have a range from the first remembered value to the current value. Remember/output that range, then start again with the next value as the first element of a new series. This will execute in roughly 2N time (linear).
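A sketch of that sort-then-scan pass; for the example list it produces (3-5) and (34-40) (the method name is illustrative):

using System.Collections.Generic;
using System.Linq;

static List<(int Start, int End)> RangesBySorting(List<int> numbers)
{
    var sorted = numbers.Distinct().OrderBy(n => n).ToList();
    var ranges = new List<(int Start, int End)>();
    if (sorted.Count == 0) return ranges;

    int start = sorted[0], prev = sorted[0];
    foreach (int n in sorted.Skip(1))
    {
        if (n != prev + 1)                // gap found: close the current run
        {
            ranges.Add((start, prev));
            start = n;
        }
        prev = n;
    }
    ranges.Add((start, prev));
    return ranges;
}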
I would sort them and then check for consecutive numbers. If the difference > 1 you have a new range.

Random.Next() - finding the Nth .Next()

Given a consistently seeded Random:
Random r = new Random(0);
Calling r.Next() consistently produces the same series; so is there a way to quickly discover the N-th value in that series, without calling r.Next() N times?
My scenario is a huge array of values created via r.Next(). The app occasionally reads a value from the array at arbitrary indexes. I'd like to optimize memory usage by eliminating the array and instead, generating the values on demand. But brute-forcing r.Next() 5 million times to simulate the 5 millionth index of the array is more expensive than storing the array. Is it possible to short-cut your way to the Nth .Next() value, without / with less looping?
I don't know the details of the PRNG used in the BCL, but my guess is that you will find it extremely difficult / impossible to find a nice, closed-form solution for N-th value of the series.
How about this workaround:
Make the desired index the seed of the random-number generator, and then pick the first generated number. This is equally 'deterministic', and gives you a wide range to play with in O(1) space.
static int GetRandomNumber(int index)
{
    return new Random(index).Next();
}
In theory, if you knew the exact algorithm and the initial state, you'd be able to duplicate the series, but the end result would just be identical to calling r.Next().
Depending on how 'good' you need your random numbers to be, you might consider creating your own PRNG based on a linear congruential generator, which is relatively easy/fast to generate numbers for. If you can live with a "bad" PRNG there are likely other algorithms that may be better for your purpose. Whether this would be faster/better than just storing a large array of numbers from r.Next() is another question.
No, I don't believe there is. For some RNG algorithms (such as linear congruential generators) it's possible in principle to get the n'th value without iterating through n steps, but the Random class doesn't provide a way of doing that.
I'm not sure whether the algorithm it uses makes it possible in principle -- it's a variant (details not disclosed in documentation) of Knuth's subtractive RNG, and it seems like the original Knuth RNG should be equivalent to some sort of polynomial-arithmetic thing that would allow access to the n'th value, but (1) I haven't actually checked that and (2) whatever tweaks Microsoft have made might break that.
If you have a good enough "scrambling" function f then you can use f(0), f(1), f(2), ... as your sequence of random numbers, instead of f(0), f(f(0)), f(f(f(0))), etc. (the latter being roughly what most RNGs do) and then of course it's trivial to start the sequence at any point you please. But you'll need to choose a good f, and it'll probably be slower than a standard RNG.
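A sketch of that counter-plus-scrambling idea, using SplitMix64-style mixing constants purely for illustration (this is not the BCL's algorithm, and the statistical quality is only as good as the mixer):

// Deterministic "nth random value": mix the index through an integer hash,
// so f(0), f(1), f(2), ... can be evaluated at any index directly.
static ulong NthValue(ulong seed, ulong index)
{
    ulong z = seed + index * 0x9E3779B97F4A7C15UL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
    return z ^ (z >> 31);
}

NthValue(seed, n) then plays the role of the nth generated value and costs O(1) regardless of n.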
You could build your own on-demand dictionary of 'indexes' & 'random values'. This assumes that you will always 'demand' indexes in the same order each time the program runs or that you don't care if the results are the same each time the program runs.
Random rnd = new Random(0);
Dictionary<int, int> randomNumbers = new Dictionary<int, int>();

int getRandomNumber(int index)
{
    // Generate the value lazily the first time an index is requested, then cache it.
    // Note: the values depend on the order in which indexes are first requested.
    if (!randomNumbers.ContainsKey(index))
        randomNumbers[index] = rnd.Next();
    return randomNumbers[index];
}

First n positions of true values from a bit pattern

I have a bit pattern of 100 bits. The program will change the bits in the pattern to true or false. At any given time I need to find the positions of the first "n" true values. For example, if the pattern is as follows:
10011001000
The first 3 indexes where bits are true are 0, 3, 4
The first 4 indexes where bits are true are 0, 3, 4, 7
I can have a List, but the complexity of firstntrue(int) will be O(n). Is there any way to improve the performance?
I'm assuming the list isn't changing while you are searching, but that it changes up until you decide to search, and then you do your thing.
For each byte there are 2^8 = 256 combinations of 0 and 1. Here you have ceil(100/8) = 13 bytes to examine.
So you can build a lookup table of 256 entries. The key is the current real value of the byte you're examining in the bit stream, and the value is the data you seek (a tuple containing the positions of the 1 bits). So, if you gave it 5 it would return {0,2}. The cost of this lookup is constant and the memory usage is very small.
Now as you go through the bit stream you can process the data a byte at a time (instead of a bit at a time) and just keep track of the current byte number (starting at 0, of course), adding 8*current-byte-number to the values in the returned tuple. So now you've essentially reduced the problem to O(n/8) by using the precomputed lookup table.
You can build a larger look-up table to get more speed but that will use more memory.
Though I can't imagine that an O(n) algorithm where n=100 is really the source of some performance issue for you. Unless you're calling it a lot inside some inner loop?
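A sketch of that lookup-table idea, assuming the pattern is stored as a byte[] with bit 0 of byte 0 being position 0 (the table layout and method names are made up):

using System.Collections.Generic;

// Precompute, for every possible byte value, the positions (0-7) of its set bits.
static readonly int[][] BitPositions = BuildTable();

static int[][] BuildTable()
{
    var table = new int[256][];
    for (int b = 0; b < 256; b++)
    {
        var positions = new List<int>();
        for (int bit = 0; bit < 8; bit++)
            if ((b & (1 << bit)) != 0) positions.Add(bit);
        table[b] = positions.ToArray();
    }
    return table;
}

// First n set-bit positions in a bit pattern stored as a byte array.
static List<int> FirstNTrue(byte[] pattern, int n)
{
    var result = new List<int>();
    for (int i = 0; i < pattern.Length && result.Count < n; i++)
        foreach (int pos in BitPositions[pattern[i]])
        {
            result.Add(8 * i + pos);
            if (result.Count == n) break;
        }
    return result;
}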
No, there is no way to improve on O(n); that can be proven mathematically.
No.
Well, not unless you intercept the changes as they occur, and maintain a "first 100" list.
The complexity cannot be reduced without additional data structures, because in the worst case you need to scan the whole list.
For "n" Items you have to check at most "n" times I.E O(n)!
How can you expect to reduce that without any interception and any knowledge of how they've changed?!
No, you cannot improve the complexity if you just have a plain array.
If you have few 1s relative to 0s you can improve the performance by a constant factor, but it will still be O(n).
If you can treat your bit array as a byte array (or even an int32 array) you can check whether each byte is > 0 before checking its individual bits.
If fewer than 1 in 8 of the bits are set, you could instead implement it as a sparse array, e.g. a List<byte> storing the indexes of all the 1s.
As others have said, to find the n lowest set bits in the absence of further structures is an O(n) operation.
If you're looking to improve performance, have you looked at the implementation side of the problem?
Off the top of my head, q & ~(q-1) will leave only the lowest set bit of the number q, since subtracting 1 from any binary number fills in 1s to the right up to the first digit that was set, changes that digit into a 0, and leaves the rest alone. In a number with one bit set, shifting to the right and testing against zero gives a simple test to distinguish whether a potential answer is less than the real answer or is greater than or equal to it. So you can binary search from there.
To find the next one, remove the lowest digit and use a smaller initial binary search window. There are probably better solutions, but this should be better than going through every bit from least to most and checking if it's set.
That's implementation-level stuff that doesn't affect the complexity, but it may help with performance.
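As a small illustration of that lowest-set-bit trick, assuming the 100-bit pattern is packed into ulong words (BitOperations.TrailingZeroCount, available in .NET Core 3.0+, stands in for the hand-rolled binary search):

using System.Collections.Generic;
using System.Numerics;

// Yields set-bit positions in ascending order by repeatedly isolating and
// clearing the lowest set bit of each 64-bit word.
static IEnumerable<int> SetBitPositions(ulong[] words)
{
    for (int w = 0; w < words.Length; w++)
    {
        ulong q = words[w];
        while (q != 0)
        {
            ulong lowest = q & ~(q - 1);   // same trick as above: only the lowest set bit survives
            yield return 64 * w + BitOperations.TrailingZeroCount(lowest);
            q ^= lowest;                   // clear it and continue with the next one
        }
    }
}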
