In C# we have ArrayList to store various types of data and objects.
In ArrayList we have the IndexOf method, which returns the index of the first occurrence within the ArrayList.
Also in ArrayList we have the BinarySearch method, which searches a sorted ArrayList for an element and returns its index.
My query:
To my understanding, BinarySearch and IndexOf are doing the same task. In what scenario will the BinarySearch method be useful? I understand that BinarySearch needs a sorted ArrayList. So can we say that when the ArrayList is sorted one should use BinarySearch, and when it is not sorted one should use the IndexOf method? Also, in a sorted ArrayList, which method gives higher performance: BinarySearch or IndexOf?
Definitions
BinarySearch works by progressively halving the size of the search space. It relies on the data first being ordered by the search key to accomplish this. Its performance is logarithmic, O(log n).
IndexOf will perform a linear search to find the index of the item - O(n).
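For illustration, a minimal sketch of the two calls on an ArrayList (the values here are arbitrary):
ArrayList list = new ArrayList { 5, 3, 9, 1 };
int linear = list.IndexOf(9);      // linear scan; works on unsorted data
list.Sort();                       // BinarySearch requires the list to be sorted first
int binary = list.BinarySearch(9); // halving search; returns a negative number if the item is not found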
Consequences
This effectively means that in an ArrayList of 1000 values, where the item being sought is at the end of the list, IndexOf would have to examine 1000 values to find the result. BinarySearch would first check the middle value, then the middle of the remaining values and so on, effectively only examining 10 items in total before returning the correct result.
Of course, in practice it is unlikely that the sought item will always be at the end of the list, so 1000 comparisons is only the worst-case scenario for a linear search. If the item were the first one in the list, IndexOf would out-perform BinarySearch.
As with all algorithms, which to use depends heavily on what you are trying to accomplish and the nature of your data.
If your data is unsorted and you do not want to change the order of the items in your ArrayList, or if comparing items is an expensive operation, BinarySearch could be far more computationally expensive than IndexOf despite performing fewer comparisons on average, because of the need to make a copy of the ArrayList and sort that copy.
If the item you need to find generally tends to be one of the first items in your ArrayList (on average) then IndexOf would probably be the best option to use.
Similarly if you have a very small array (in the order of 10 items), BinarySearch will not yield significantly better results.
Code relying on BinarySearch may also be more difficult to maintain: your code must document that keeping the data ordered is essential to the correct behaviour of the application; otherwise another developer might later alter the code in a way that re-orders the data, invalidating the binary search and breaking the application.
If your data is already sorted, (ie. it doesn't need to be sorted just to make it ordered for the purposes of searching), then a BinarySearch will almost always outperform IndexOf when searching for an item in a list of more than a handful of values... But the level of performance gain might be completely insignificant in an application that is also performing any other non-trivial tasks (such as I/O bound activities).
Recommendation
In general, one should favour the simpler operation which has no requirements or side-effects (ie. IndexOf) until it becomes apparent (through profiling) that BinarySearch would significantly improve the application's usability or efficiency.
Whenever you choose an algorithm, always document the reason why. It will help other developers understand the code, and sometimes that "other" developer will be you, reviewing code years after you've forgotten why you chose one algorithm over another in the first place.
Related
I have a list of about 500 strings "joe" "john" "jack" ... "jan"
I only need to find the ordinal.
In my example, the list will never be changed.
One could just put them in a list and IndexOf
ll.Add("joe")
ll.Add("john")
...
ll.Add("jan")
ll.IndexOf("jib") is 315
or you can put them in a dictionary, using the ordinal integers as the values,
dd.Add("joe", 1)
dd.Add("john", 2)
dd.Add("jack", 3)
...
dd.Add("jan", 571)
dd["jib"] is 315
FTR, the strings are 3 to 8 characters long. Also FTR, this is in a Unity (hence Mono) milieu.
Purely for performance, is one approach generally preferable?
1b) Indeed, I found a number of analyses of this nature: http://www.dotnetperls.com/dictionary-time (google for a number of similar analyses). Does this apply to the situation I describe, or am I off here?
It's a shame there isn't a "HashSetLikeThingWithOrdinality" type of facility; if I'm missing something obvious, please let us know. Indeed, this seems like a fairly common, basic collections use case ("get the ordinal of some strings"), so perhaps I am completely missing something obvious.
Here's a small overview of the difference between using a Dictionary<string,int> and a (sorted) List<string> for this:
Observations:
1) In my micro benchmarks, once the dictionary is created, the dictionary is much faster. (Explanations as to why will follow shortly)
2) In my opinion, mapping in some way (e.g. a Dictionary or Hashtable) will be significantly less awkward.
Performance:
For the List<string>, a binary search starts at the middle of the sorted list, then repeatedly halves the search space, stepping into the middle of the upper or lower half depending on whether the sought value is greater or smaller than the value at the current index, in a typical divide-and-conquer pattern. This is O(log n) growth. It assumes the data is already sorted in some manner (the same idea applies to structures like SortedDictionary, which use data structures that allow for binary-search-style lookups).
Alternately, you'd do IndexOf, which is O(n) complexity because you have to walk every element.
For the Dictionary<string,int>, it uses a hash lookup: it generates a hash of the key by calling .GetHashCode() on the TKey (string in this case), uses that to find a slot in a hash table, does a comparison to ensure it is an exact match, and gets the value out. This is roughly O(1) growth (i.e. the complexity doesn't grow meaningfully with the number of elements), not counting worst-case scenarios involving hash collisions.
Because of this, Dictionary<string,int> takes a (relatively) constant amount of time to do lookups, while List<string> grows according to the number of elements (albeit at a logarithmic (slow) rate).
Testing:
I did a few micro benchmarks, where I took the top 500 female names and did lookups against them. The lookups looked something like this:
var searchItems = new[] { "Maci", "Daria", "Michelle", "Amber", "Henrietta"};
foreach (var item in searchItems)
{
sortedList.BinarySearch(item); //You'd store the output here. Just looking at performance
}
And compared it to a dictionary lookup:
foreach (var item in searchItems)
{
var output = dictionary.ContainsKey(item) ? dictionary[item] : -1; //Presumably, output would be declared outside of this, just getting rid of a compiler error
}
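As an aside, a single TryGetValue call avoids the double lookup that ContainsKey plus the indexer performs; a minimal equivalent (not what was benchmarked above):
int value;
foreach (var item in searchItems)
{
    var output = dictionary.TryGetValue(item, out value) ? value : -1; // one hash lookup instead of two
}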
So, here's the thing: even for a small number of elements, with short strings as lookup keys, a sorted List<string> isn't any faster (on my machine, in my admittedly simplistic tests) than a Dictionary<string,int>. Once again, this is a microbenchmark, but, for 500 elements, the 5 lookups are roughly 3x faster with the dictionary.
Keep in mind, however, that the list was 6.3 microseconds, and the dictionary was 1.8 microseconds.
Syntax:
Using a list as a lookup to find indexes is slightly awkward. A mapping type (like Dictionary) implies intent much better than your lookup list does, which should make for more maintainable code in the end.
That said, given the syntax and performance considerations above, I'd say go with the Dictionary. However, if you don't like Dictionaries for whatever reason, the performance difference is on such a small scale that it's pointless to worry about anyway.
Edit: Bonus points: you will probably want to use a case-insensitive comparer for either method. Dictionary's constructor accepts an IEqualityComparer<string> (e.g. StringComparer.OrdinalIgnoreCase), and List<T>.BinarySearch has an overload that takes an IComparer<T>.
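A minimal sketch of both, reusing the dictionary and sorted list from the benchmarks above:
var dict = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase); // comparer applies to all key lookups
sortedList.Sort(StringComparer.OrdinalIgnoreCase);                        // sort and search must use the same comparer
int index = sortedList.BinarySearch("MACI", StringComparer.OrdinalIgnoreCase);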
I suspect that there might be a twist somewhere, as such a simple question has had no answer for 2 hours. I'll risk being down-voted, but here are my answers:
1) Dictionary (hash table-based) is clearly a better choice for a fast lookup. List, on the other hand, is the worst choice.
1.b) Yes, it applies here. Search in the List has linear complexity, while Dictionary provides constant time lookup.
2) You are trying to map a string to an ordinal; any kind of map will be natural here (while any kind of list is awkward).
Dictionary is the natural approach for a lookup.
A list would be an optimisation for lower memory use at the cost of decreased speed. An array would do better still (same time, but slightly less memory again).
If you already had a list or array around for some other reason, the memory saving would be greater still, because no memory would be used beyond what you were using anyway; that makes it a better optimisation for space at the same cost in speed. (If the keys happened to be stored in sorted order, the lookup could be O(log n); otherwise it's O(n).)
Creating the dictionary itself takes time, so although it's the fastest approach for lookups, if the number of lookups is small the setup might cost as much as it saves and so not be worth it.
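For completeness, a rough sketch of paying that one-time cost up front (names is a hypothetical List<string> holding the strings in their original order):
var ordinals = new Dictionary<string, int>();
for (int i = 0; i < names.Count; i++)
    ordinals.Add(names[i], i + 1);  // build once: O(n); every later lookup is O(1)
int ordinal = ordinals["jib"];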
So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that gives me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any ToUpper or case-insensitive handling, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
from item1 in array1
join item2 in array2 on item1.Name equals item2.Name
select new { item1, item2 };
foreach(var pair in pairs)
{
// Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
Sort the second array using the Array.Sort method, then for each object in the first array find its match in the second array using binary search (Array.BinarySearch).
Generally, for 30-50 items this would be a little faster than brute force x*y.
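A rough sketch of that approach, assuming a hypothetical Item type exposing the .Name property from the question (Comparer<T>.Create requires .NET 4.5; an explicit IComparer<Item> class does the same job on older frameworks):
var byName = Comparer<Item>.Create((x, y) => string.CompareOrdinal(x.Name, y.Name));
Array.Sort(array2, byName);                        // O(n log n), once per array pair

var pairs = new List<Tuple<Item, Item>>();
foreach (var a in array1)
{
    int i = Array.BinarySearch(array2, a, byName); // O(log n) per lookup; must use the same comparer as the sort
    if (i >= 0)                                    // a negative result is the 0-match case
        pairs.Add(Tuple.Create(a, array2[i]));
}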
I have a large list of integers that are sent to my webservice. Our business rules state that these values must be unique. What is the most performant way to figure out if there are any duplicates? I dont need to know the values, I only need to know if 2 of the values are equal.
At first I was thinking about using a generic List of integers and the list.Exists() method, but that is O(n);
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet<T> class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
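A minimal sketch of the count comparison that constructor enables (numbers is a hypothetical List<int> holding the incoming values):
var set = new HashSet<int>(numbers);          // duplicates collapse during construction
bool allUnique = set.Count == numbers.Count;  // any difference in count means a duplicate existed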
Sounds like a job for a HashSet...
If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
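HashSet<T>.Add returns false when the item is already present, which gives exactly that early exit; a minimal sketch:
static bool AllUnique(IEnumerable<int> numbers)
{
    var seen = new HashSet<int>();
    foreach (int n in numbers)
        if (!seen.Add(n))   // Add returns false on a duplicate
            return false;   // stop at the first duplicate found
    return true;
}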
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly sequential with occasional gaps, it would be a lot better to store the number set as a sorted array or binary tree of (begin, end) pairs. Then you could search for the pair with the largest begin value that is smaller than or equal to your search key and compare the key with that pair's end value to see whether it exists in the set.
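A rough sketch of that idea, assuming the set is kept as an array of (begin, end) pairs sorted by begin:
// ranges must be sorted by Key (begin) and non-overlapping
static bool Contains(KeyValuePair<int, int>[] ranges, int key)
{
    int lo = 0, hi = ranges.Length - 1;
    while (lo <= hi)                            // binary search for the largest begin <= key
    {
        int mid = lo + (hi - lo) / 2;
        if (ranges[mid].Key <= key) lo = mid + 1;
        else hi = mid - 1;
    }
    return hi >= 0 && key <= ranges[hi].Value;  // compare against that pair's end
}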
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. Distinct() is hash-based, so I think it would be as good as O(n), but with less code and still easily readable. Note that, unlike checking while adding, it always enumerates the whole list, so there is no early exit on the first duplicate.
I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs, this seems like overkill, and I wonder if I could do faster.
Conditions:
The structure is initialised only once, and does never change (no insert/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set value just below the reference may be returned, rather than the actual nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more important than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList
IEnumerable<double> nearest = from item in SortedValues.Keys
where item <= suggestion
select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also I am not 100% sure whether this code is really safe: IEnumerable<double>, the return type of my query, is not by definition sorted any more. However, a unit test with a large set of test data has shown that it is in practice, so this works for me. Do you have any hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. Especially there is this one C# Data Structure Like Dictionary But Without A Value, but the questioner does just want to check the existence not find anything.
The way you are doing it is incredibly slow as it must search from the beginning of the list each time giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value: The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.
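A hedged sketch of that lookup, assuming the 100k doubles have been copied into a List<double> and sorted once:
static double Nearest(List<double> sorted, double value)
{
    int i = sorted.BinarySearch(value);
    if (i >= 0) return sorted[i];                            // exact match
    i = ~i;                                                  // bitwise complement: index of the next larger element
    if (i == 0) return sorted[0];                            // value is below the smallest element
    if (i == sorted.Count) return sorted[sorted.Count - 1];  // value is above the largest element
    double below = sorted[i - 1], above = sorted[i];
    return value - below <= above - value ? below : above;   // pick the closer neighbour
}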
Might not be useful to you right now, but .NET 4 has a SortedSet<T> class in the BCL.
I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue).Key;
In case your items are sorted, you may omit the OrderBy call...
I'm looking for a kind of array data-type that can easily have items added, without a performance hit.
System.Array - ReDim Preserve copies the entire contents from the old array to the new one, so it is as slow as the number of existing elements
System.Collections.ArrayList - good enough?
System.Collections.IList - good enough?
Just to summarize a few data structures:
System.Collections.ArrayList: untyped data structures are obsolete. Use List(Of T) instead.
System.Collections.Generic.List(Of T): this represents a resizable array. This data structure uses an internal array behind the scenes. Adding items to a List is O(1) as long as the underlying array hasn't been filled; otherwise it's O(n) to resize the internal array and copy the elements over.
List<int> nums = new List<int>(3); // creates a resizable array
// which can hold 3 elements
nums.Add(1);
// adds item in O(1). nums.Capacity = 3, nums.Count = 1
nums.Add(2);
// adds item in O(1). nums.Capacity = 3, nums.Count = 2
nums.Add(3);
// adds item in O(1). nums.Capacity = 3, nums.Count = 3
nums.Add(4);
// adds item in O(n): the List doubles the size of its internal array, so
// nums.Capacity = 6, nums.Count = 4
Adding items is only efficient when adding to the back of the list. Inserting in the middle forces the array to shift all items forward, which is an O(n) operation. Deleting items is also O(n), since the array needs to shift items backward.
System.Collections.Generic.LinkedList(of t): if you don't need random or indexed access to items in your list, for example you only plan to add items and iterate from first to last, then a LinkedList is your friend. Inserts and removals are O(1), lookup is O(n).
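A minimal sketch of that append-and-iterate pattern:
var list = new LinkedList<int>();
list.AddLast(1);            // O(1) append at the tail
list.AddLast(2);
list.AddFirst(0);           // O(1) insert at the head
foreach (int n in list)     // iteration from first to last is O(n)
    Console.WriteLine(n);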
You should use the generic List<T> (System.Collections.Generic.List) for this. Appending to it runs in amortized constant time.
It also shares the following features with Arrays.
Fast random access (you can access any element in the list in O(1))
It's quick to loop over
Slow to insert and remove objects in the start or middle (since it has to do a copy of the entire list, I believe)
If you need quick insertions and deletions in the beginning or end, use either linked-list or queues
Would the LinkedList<T> structure work for you? It's not (in some cases) as intuitive as a straight array, but it is very quick.
AddLast to append to the end
AddBefore/AddAfter to insert into list
AddFirst to insert at the beginning
It's not so quick for random access, however, as you have to iterate over the structure to reach your items. However, the LINQ .ToList() and .ToArray() extension methods will grab a copy of the structure in list/array form, so for read access you could do that in a pinch. The performance increase of the inserts may outweigh the performance decrease of the need for random access, or it may not; it will depend entirely on your situation.
There's also this reference which will help you decide which is the right way to go:
When to use a linked list over an array/array list?
What is "good enough" for you? What exactly do you want to do with that data structure?
No array structure (i.e. O(1) access) allows insertion in the middle without an O(n) runtime; insertion at the end is O(n) worst case and O(1) amortized for self-resizing arrays like ArrayList.
Maybe hashtables (amortized O(1) access and insertion anywhere, but O(n) worst case for insertion) or trees (O(log(n)) for access and insertion anywhere, guaranteed) are better suited.
If speed is your problem, I don't see how the selected answer is any better than using a raw array. Although a List resizes itself, it uses the exact same mechanism you would use to resize an array (and should take just a touch longer), UNLESS you are always adding to the end, in which case the List is a bit smarter because it allocates a chunk at a time instead of just one element.
If you often add near the beginning/middle of your collection and don't index into the middle/end very often, you probably want a Linked List. That will have the fastest insert time and will have great iteration time, it just sucks at indexing (such as looking at the 3rd element from the end, or the 72nd element).