How to find the top several values from an array?

How to find the top several values from an array? - c#

I have an array of float values and want the value and more importantly the position of the maximum four values.
I built the system originally to walk through the array and find the max the usual way, by comparing the value at the current position to a recorded max-so-far, and updating a position variable when the max-so-far changes. This worked well, an O(n) algo that was very simple. I later learned that I need to keep not only the top value, but the top three or four. I extended the same procedure and complicated the max-so-far into an array of four max-so-fars and now the code is ugly.
It still works and is still sufficiently fast because only a trivial amount of computations have been added to the procedure. it still effectively walks across the array and checks each value once.
I do this in MATLAB with a sort function that returns two arrays, the sorted list and the accompanying original position list. By looking at the first few values I have exactly what I need. I am replicating this functionality into a C# .NET 2.0 program.
I know that I could do something similar with a List object, and that the List object has a built in sort routine, but I do not believe that it can tell me the original positions, and those are really what I am after.
It has been working well, but now I find myself wanting the fifth max value and see that rewriting the max-so-far checker that is currently an ugly mess of if statements would only compound the ugliness. It would work fine and be no slower to add a fifth level, but I want to ask the SO community if there is a better way.
Sorting the entire list takes many more computations than my current method, but I don't think it would be a problem, as the list is 'only' one or two thousand floats; so if there is a sort routine that can give back the original positions, that would be ideal.
As background, this array is the result of a Fourier Transform on a kilobyte of wave file, so the max values' positions correspond to the sample data's peak frequencies. I had been content with the top four, but see a need to really gather the top five or six for more accurate sample classification.

I can suggest an alternative algorithm which you'll have to code :)
Use a heap of size K where K denotes the count of top elements you want to save. Initialize this to the first K elements of your original array. For all N - K elements walk the array, inserting as and when required.
proc top_k (array<n>, heap<k>)
heap <- array<1..k-1>
for each (array<k..n-1>)
if array[i] > heap.min
heap.erase(heap.min)
heap.insert(array[i])
end if
end for

You could still use your list idea - the elements you put in the list could be a structure which stores both the index and the value; but sorts only on the value, for instance:
class IndexAndValue : IComparable<IndexAndValue>
{
public int index;
public double value;
public int CompareTo(IndexAndValue other)
{
return value.CompareTo(other.value);
}
}
Then you can stick them in the list, while retaining the information about the index. If you keep only the largest m items in the list, then your efficiency should be O(mn).

I don't know which algorithm you're currently using, but I'll suggest a simple one.
Admitting that you have an array of floats f and a maximum of capacity
numbers, you could do the following:
int capacity = 4; // number of floats you want to retrieve
float [] f; // your float list
float [] max_so_far = new float[capacity]; // max so far
// say that the first 'capacity' elements are the biggest, for now
for (int i = 0; i < capacity; i++)
max_so_far[i] = i;
// for each number not processed
for (int i = capacity; i < f.length; i++)
{
// find out the smallest 'max so far' number
int m = 0;
for (int j = 0; j < capacity; j++)
if (f[max_so_far[j]] < f[max_so_far[m]])
m = j;
// if our current number is bigger than the smallest stored, replace it
if (f[i] > f[max_so_far[m]])
max_so_far[m] = i;
}
By the end of the algorithm, you'll have the indices of the greatest elements stored
in max_so_far.
Do note that if the capacity value grows, it will become slightly slower than the
alternative, which is sorting the list while keeping track of the initial positions.
Remember that sorting takes O(nlog n) comparisons, while this algorithm takes O(ncapacity).

Another option is to use quick-select.
Quick-select returns the position of the k-th element in a list. After you have the position and the value of the k-th element, go over the list and take every element whose value is smaller/larger than the k-th element.
I found a c# implementation of quick-select here: link text
Pros:
O(n+k) average running time.
Cons:
The k elements found are not sorted. If you sort them the running time is O(n + logk)
I haven't checked this, but I think that for a very small k the best option is to do k runs over the array, each time finding the next smallest/largest element.

Related

Interview - Write a program to remove even elements

I was asked this today and i know the answer is damn sure simple but he kept me the twist to the last.
Question
Write a program to remove even numbers stored in ArrayList containing 1 - 100.
I just said wow
Here you go this is how i have implemented it.
ArrayList source = new ArrayList(100);
for (int i = 1; i < 100; i++)
{
source.Add(i);
}
for (int i = 0; i < source.Count; i++)
{
if (Convert.ToInt32(source[i]) % 2 ==0)
{
source.RemoveAt(i);
}
}
//source contains only Odd elements
The twist
He asked me what is the computational complexity of this give him a equation. I just did and said this is Linear directly proportional to N (Input).
he said : hmmm.. so that means i need to wait longer to get results when the input size increases am i right? Yes sirr you are
Tune it for me, make it Log(N) try as much as you can he said. I failed miserably in this part.
Hence come here for the right logic, answer or algorithm to do this.
note: He wanted no Linq, No extra bells and whistles. Just plain loops or other logic to do it

I dare say that the complexity is in fact O(N^2), since removal in arrays is O(N) and it can potentially be called for each item.
So you have O(N) for the traversal of the array(list) and O(N) for each removal => O(N) * O(N).
Since it does not seem clear, I'll explain the reasoning. At each step a removal of an item may take place (assuming the worst case in which every item must be removed). In an array the removal is done by shifting. Hence, to remove the first item, I need to shift all the following N-1 items by one position to the left:
1 2 3 4 5 6...
<---
2 3 4 5 6...
Now, at each iteration I need to shift, so I'm doing N-1 + N-2 + ... + 1 + 0 shifts, which gives a result of (N) * (N-1) / 2 (arithmetic series) giving a final complexity of O(N^2).

Let's think it this way:
The number of delete actions you are doing is, forcely, the half of array lenght (if the elements are stored in array). So the complexity is at least O(N) .
The question you received let me suppose that your professor wanted you to reason about different ways of storing the numbers.
Usually when you have log complexity you are working with different structures, like graphs or trees.
The only way I can think of having logartmic complexity is having the numbers stored in a tree (ordered tree, b-tree... we colud elaborate on this), but it is actually out of the constraints of your exam (sotring numbers in array).
Does it make sense to you?

You can get noticeably better performance if you keep two indexes, one to the current read position and one to the current write position.
int read = 0
int write = 0;
The idea is that read looks at each member of the array in turn; write keeps track of the current end of the list. When we find a member we want to delete, we move read forwards, but not write.
for (int read = 0; read < source.Count; read++) {
if (source[read] % 2 != 0) {
source[write] = source[read];
write += 1;
}
}
Then at the end, tell the ArrayList that its new length is the current value of `write'.
This takes you from your original O(n^2) down to O(n).
(note: I haven't tested this)

Without changing the data structure or making some assumption on the way items are stores inside the ArrayList, I can't see how you'll avoid checking the parity of each and every member (hence at least O(n) complexity). Perhaps the interviewer simply wanted you to tell him it's impossible.

If you really have to use an ArrayList and actively have to remove the entries (instead if not adding them in the first place)
Not incrementing by i + 1 but i + 2 will remove your need to check if it is odd.
for (int i = source.Count - 1 ; i > 0; i = i i 2)
{
source.RemoveAt(i);
}
Edit: I know this will only work if source contains the entries from 1-100 in sequential order.

The problem with the given solution is that it starts from the beginning, so the entire list must be shifted each time an item is removed:
Initial List: 1, 2, 3, 4, 5, ..., 98, 99
/ / / /// /
After 1st removal: 1, 3, 4, 5, ..., 98, 99, <empty>
/ /// / /
After 2nd removal: 1, 3, 5, ..., 98, 99, <empty>, <empty>
I've used the slashes to try to show how the list shifts after each removal.
You can reduce the complexity (and eliminate the bug I mentioned in the comments) simply by reversing the order of removal:
for (int i = source.Count-1; i >= 0; --i) {
if (Convert.ToInt32(source[i]) % 2 == 0) {
// No need to re-check the same element during the next iteration.
source.RemoveAt(--i);
}
}

It is possible IF you have unlimited parallel threads available to you.
Suppose that we have an array with n elements. Assign one thread per element. Assume all threads act in perfect sync.
Each thread decides whether its element is even or odd. (Time O(1).)
Determine how many elements below it in the array are odd. (Time O(log(n)).)
Mark a 0 or 1 in an second array depending whether you are even or odd at the same index. So each one is a count of odds at that spot.
If your index is odd, add the previous number. Now each entry is a count of odds in the current block of 2 up to yourself
If your index mod 4 is 2, add the value at the index below, if it is 3, add the answer 2 indexes below. Now each entry is a count of odds in the current block of 4 up to yourself.
Continue this pattern with blocks of 2**i (if you're in the top half add the count for the bottom half) log2(n) times - now each entry in this array is the count of odds below.
Each CPU inserts its value into the correct slot.
Truncate the array to the right size.
I am willing to bet that something like this is the answer your friend has in mind.

What is the fastest way to calculate frequency distribution for array in C#?

I am just wondering what is the best approach for that calculation. Let's assume I have an input array of values and array of boundaries - I wanted to calculate/bucketize frequency distribution for each segment in boundaries array.
Is it good idea to use bucket search for that?
Actually I found that question Calculating frequency distribution of a collection with .Net/C#
But I do not understand how to use buckets for that purpose cause the size of each bucket can be different in my situation.
EDIT:
After all discussions I have inner/outer loop solution, but still I want to eliminate the inner loop with a Dictionary to get O(n) performance in that case, if I understood correctly I need to hash input values into a bucket index. So we need some sort of hash function with O(1) complexity? Any ideas how to do it?

Bucket Sort is already O(n^2) worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it on the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.
int[] freq = new int[buckets.length - 1];
foreach(int d in input)
{
for(int i = 0; i < buckets.length - 1; i++)
{
if(d >= buckets[i] && d < buckets[i+1])
{
freq[i]++;
break;
}
}
}
It's also O(n^2) worst case but you can't beat the code simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But, since frequency distributions are typically < 100 elements, I doubt you'd see a lot of real-world performance benefit.

If your input array represents real world data (with its patterns) and array of boundaries is large to iterate it again and again in inner loop you can consider the following approach:
First of all sort your input array. If you work with real-world data
I would recommend to consider Timsort - Wiki for this. It
provides very good performance guarantees for a patterns that can be seen in
real-world data.
Traverse through sorted array and compare it with the first value in the array of boundaries:
If value in input array is less then boundary - increment frequency counter for this boundary
If value in input array is bigger then boundary - go to the next value in array of boundaries and increment the counter for new boundary.
In a code it can look like this:
Timsort(myArray);
int boundPos;
boundaries = GetBoundaries(); //assume the boundaries is a Dictionary<int,int>()
for (int i = 0; i<myArray.Lenght; i++) {
if (myArray[i]<boundaries[boundPos]) {
boundaries[boubdPos]++;
}
else {
boundPos++;
boundaries[boubdPos]++;
}
}

Algorithm for generating a nearly sorted list on predefined data

Note: This is part 1 of a 2 part question.
Part 2 here
I'm wanting to more about sorting algorithms and what better way to do than then to code! So I figure I need some data to work with.
My approach to creating some "standard" data will be as follows: create a set number of items, not sure how large to make it but I want to have fun and make my computer groan a little bit :D
Once I have that list, I'll push it into a text file and just read off that to run my algorithms against. I should have a total of 4 text files filled with the same data but just sorted differently to run my algorithms against (see below).
Correct me if I'm wrong but I believe I need 4 different types of scenarios to profile my algorithms.
Randomly sorted data (for this I'm going to use the knuth shuffle)
Reversed data (easy enough)
Nearly sorted (not sure how to implement this)
Few unique (once again not sure how to approach this)
This question is for generating a nearly sorted list.
Which approach is best to generate a nearly sorted list on predefined data?

To "shuffle" a sorted list to make it "almost sorted":
Create a list of functions you can think of which you can apply to parts of the array, like:
Negate(array, startIndex, endIndex);
Reverse(array, startIndex, endIndex);
Swap(array, startIndex, endIndex);
For i from zero to some function of the array's length (e.g. Log(array.Length):
Randomly choose 2 integers*
Randomly choose a function from the functions you thought of
Apply that function to those indices of the array
*Note: The integers should not be constricted to the array size. Rather, choose random integers and "wrap" around the array -- that way the elements near the ends will have the same chance of being modified as the elements in the middle.

Answering my own question here. All this does is taking a sorted list and shuffling up small sections of it.
public static T[] ShuffleBagSort<T>(T[] array, int shuffleSize)
{
Random r = _random;
for (int i = 0; i < array.Length; i += shuffleSize)
{
//Prevents us from getting index out of bounds, while still getting a shuffle of the
//last set of un shuffled array, but breaks for loop if the number of unshuffled array is 1
if (i + shuffleSize > array.Length)
{
shuffleSize = array.Length - i;
if (shuffleSize <= 1) // should never be less than 1, don't think that's possible lol
continue;
}
if (i % shuffleSize == 0)
{
for (int j = i; j < i + shuffleSize; j++)
{
// Pick random element to swap from our small section of the array.
int k = r.Next(i, i + shuffleSize);
// Swap.
T tmp = array[k];
array[k] = array[j];
array[j] = tmp;
}
}
}
return array;
}

Sort the array.
Start sorting it in descending order with bubble sort
Stop after a few iterations (depending how much 'dis-sorted' you want the array to be
Add some randomness (each time when bubblesort wants to swap two elements toss a coin and perform that operation or not depending on the result, or use a different probability than 50/50 for that)
This will give you an array which will be roughly equally modified across its whole range, preserving most of the order (the begining will hold the least elements, the end the greatest). That's because the changes performed by bubblesort with a random test will be rather local. It won't mix the whole array at once so much that it wouldn't resemble the original.
If you want to you can also completely randomly shuffle whole parts of the array (but keep the parts not to big because, you'll completely loose the ordering).
Or you may also randomly swap whole sorted parts of the array. That would be an interesing test case, for example:
[1,2,3,4,5,6,7,8] -> [1,2,6,7,8,3,4,5]

The almost sorted list is the reason why Timsort (python) is so efficient in the real world is because data is typically "almost sorted" . There is an article about it explaining the math behind the entropy of data.

How do I insert an int into a sorted array quickly?

I'd like to insert an int into a sorted array. This operation is going to be performed very often, so it needs to be as fast as possible.
It is possible and even preferred to use a List or any other class instead of an array
All values are in the 1 to 34 range
The array typically contains exactly 14 values
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide. Also, I felt like I missed an idea. Do you have experiences on this topic or any new ideas to consider?

I will use an int array whose length is 35(because you said range 1-34) to record the status of the numbers.
int[] status = Enumerable.Repeat(0, 35).ToArray();
//an array contains 35 zeros
//which means currently there is no elements in the array
status[10] = 1; // now the array have only one number: 10
status[11] ++; // a new number 11 is added to the list
So if you want to add a number i to the list:
status[i]++; // O(1) to add a number
To remove an i from the list:
status[i]--; // O(1) to remove a number
Want to know all the numebrs in the list?
for (int i = 0; i < status.Length; i++)
{
if (status[i] > 0)
{
for (int j = 0; j < status[i]; j++)
Console.WriteLine(i);
}
}
//or more easier using LINQ
var result = status.SelectMany((i, index) => Enumerable.Repeat(index, i));
The following example may help you understand my code better:
the real number array: 1 12 12 15 9 34 // i don't care if it's sorted
the status array: status[1]=1,status[12]=2,status[15]=1,status[9]=1,status[34]=1
all others are 0

At 14 values this is a pretty small array, I don't think switching to a smarter data structure such as a list will win you much, especially if you fast good random access. Even binary search may actually be slower than linear search at this scale. Are you sure that, say, insert-on-copy does not satisfy your performance requirements?

This operation is going to be performed very often, so it needs to be as fast as possible.
The things that you notice happen "very often" are frequently not the bottlenecks in the program - it's often surprising what the actual bottlenecks are. You should code something simple and measure the actual performance of your program before performing any optimizations.
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide.
Assuming that this is the bottleneck, the big-O performance of the different methods is not going to be relevant here because of the small size of your array. It is easier to just try a few different approaches, measure the results, see which performs best and choose that method. If you have followed the advice from the first paragraph you already have a profiler setup that you can use for this step too.

For inserting into the middle, a LinkedList<int> would be the fastest option - anything else involves copying data. At 14 elements, don't stress over binary search etc - just walk forwards to the item you want:
using System;
using System.Collections.Generic;
static class Program
{
static void Main()
{
LinkedList<int> data = new LinkedList<int>();
Random rand = new Random(12345);
for (int i = 0; i < 20; i++)
{
data.InsertSortedValue(rand.Next(300));
}
foreach (int i in data) Console.WriteLine(i);
}
}
static class LinkedListExtensions {
public static void InsertSortedValue(this LinkedList<int> list, int value)
{
LinkedListNode<int> node = list.First, next;
if (node == null || node.Value > value)
{
list.AddFirst(value);
}
else
{
while ((next = node.Next) != null && next.Value < value)
node = next;
list.AddAfter(node, value);
}
}
}

Doing the brute-force approach is the best decision here because 14 isn't a number :). However, this is not a scalable decision, since should 14 become 14000 one day that will cause problems

What is the most common operation with your array?
Insert? Read?
Heap data structure will give you O(log(14)) for both of them. SortedDictionary may hit your performance.
Using a simple array will give you O(1) for reading and O(14) for insert.
By the way, have you tried System.Collections.Generic.SortedDictionary ot System.Collections.Generic.SortedList?

If you're on .Net 4 you should take a look at the SortedSet<T>. Otherwise take a look at SortedDictionary<TKey, TValue> where you make TValue as object and just put null into it, cause you're just interested into the keys.

If there is no repeated value on the array and the possible values won´t change maybe a fixed size array where the value is equal to the index is a good choice
Both insert and read are O(1)

You have a range of possible values from 1-34 which is rather narrow. So the fastest way would likely be using an array with 34 slots. To insert a number n just do array[n-1]++ and to remove it do array[n.1]-- (if n>0).
To check if a value exists in your collection you do array[n-1]>0.
edit: Damn...Danny was faster. :)

Write a method takes an array of integers and sorts them in place using Bubble Sort. The method is not allowed to create any additional arrays. Bubble Sort is a simple sorting algorithm that works by looping through the array to be sorted, comparing each pair of adjacent elements and swapping them if they are in the wrong order.

Quickest way to determine if a 2D array contains an element?

Let's assume that I've got 2d array like :
int[,] my_array = new int[100, 100];
The array is filled with ints. What would be the quickest way to check if a target-value element is contained within the array ?
(* this is not homework, I'm trying to come up with most efficient solution for this case)

If the array isn't sorted in some fashion, I don't see how anything would be faster than checking every single value using two for statements. If it is sorted you can use a binary search.
Edit:
If you need to do this repeatedly, your approach would depend on the data. If the integers within this array range only up to 256, you can have a boolean array of that length, and go through the values in your data flipping the bits inside the boolean array. If the integers can range higher you can use a HashSet. The first call to your contains function would be a little slow because it would have to index the data. But subsequent calls would be O(1).
Edit1:
This will index the data on the first run, benchmarking found that the Contains takes 0 milliseconds to run after the first run, 13 to index. If I had more time I might multithread it and have it return the result, while asynchronously continuing indexing on the first call. Also since arrays are reference types, changing the value of data passed before or after it has been indexed will provide strange functionality, so this is just a sample and should be refactored prior to use.
private class DataContainer
{
private readonly int[,] _data;
private HashSet<int> _index;
public DataContainer(int[,] data)
{
_data = data;
}
public bool Contains(int value)
{
if (_index == null)
{
_index = new HashSet<int>();
for (int i = 0; i < _data.GetLength(0); i++)
{
for (int j = 0; j < _data.GetLength(1); j++)
{
_index.Add(_data[i, j]);
}
}
}
return _index.Contains(value);
}
}

Assumptions:
There is no kind of ordering in the arrays we can take advantage of
You are going to check for existence in the array several times
I think some kind of index might work nicely. If you want a yes/no answer if a given number is in the array. A hash table could be used for this, giving you a constant O(k) for lookups.
Also don't forget that realistically, for small MxN array sizes, it might actually be faster just to do a linear O(n) lookup.

create a hash out of the 2d array, where
1 --> 1st row
2 --> 2nd row
...
n --> nth row
O(n) to check the presence of a given element, assuming each hash check gives O(1).
This data structure gives you an opportunity to preserve your 2d array.
upd: ignore the above, it does not give any value. See comments

You could encapsulate the data itself, and keep a Dictionary along with it that gets modified as the data gets modified.
The key of the Dictionary would be the target element value, and the value would be the number of entries of the element. To test if an element exists, simply check the dictionary for a count > 0, which is somewhere between O(1) and O(n). You could also get other statistics on the data much quicker with this construct, particularly if the data is sparse.
The biggest drawback to this solution is that data modifications have more operations involved (still should be O(1), though), so if you're mostly doing data manipulation, then this might not be suitable.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.