String array comparison and sorting - C#

First off, I'd like to say that my programming knowledge is very basic and I have a learn-as-you-go style, so please bear with me if I sound stupid.
So I have a multi-dimensional string array, part of which is:
X Y
4,1 Adelaide
4,2 Interlagos
4,3 Sakhir
4,4 Hungaroring
4,5 Estoril
4,6 Barcelona
4,7 Silverstone
4,8 Mugello
4,9 Hockenheim
4,10 Monte Carlo
In the above table, X and Y are the two dimensions of the array.
Now I have another string array with elements from the X dimension of the above array, in unsorted order. For example:
4,6
5,15
3,7
10,12
etc...
Now what I want to do is write code which looks into array #2 and assigns the corresponding element from dimension Y of array #1.
For example, when the code encounters 4,6 in array #2, I want it to assign the corresponding value, which is Barcelona.
Just the basic snippet or algorithm is what I'm looking for. I'll do the rest myself.
Thanks in advance!

Sounds like table 1 should really be a Dictionary<string, string>, mapping "4,6" to "Barcelona". Then you can just do:
// However you want to populate your data
Dictionary<string, string> mapping = ...;
List<string> values = keys.Select(key => mapping[key]).ToList();
Note that this will throw an exception if any of the keys isn't mapped - if that's not what you want, please clarify the requirements.
It's not clear how you're getting this data, or whether your "multi-dimensional string array" is a string[,] or a string[][]. If you have to receive it as a string array, give us more details and we can explain how to convert that into the dictionary.

You should rather use a Dictionary for that. A dictionary is internally an array. If you hand over a key/value pair (to insert it), a so-called hash function is applied to the key. This function returns an integer i. The value is then stored at array[i]. If you want to get a value from the Dictionary, you hand over just the key. Internally the hash function is applied, i is computed and array[i] is returned. This sounds like a lot of overhead, but searching an array for the key is slow for large arrays (O(log n) if it is sorted by keys and O(n) if it is not sorted at all, if you know O notation), whereas the hash function can be very fast in most applications. So even with large dictionaries, accessing a value is fast. (There are some more tricks inside a dictionary which handle the case where two keys result in the same integer i, but you don't have to care much about that unless you want to implement a dictionary yourself.)
Dictionaries are also called maps or hashmaps in other languages.
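To make the description above concrete, here is a deliberately tiny sketch of the idea. It is not how Dictionary<TKey, TValue> is actually implemented: ToyMap and its members are hypothetical names, the table never resizes, collisions are handled with a simple per-bucket list, and Add does not check for duplicate keys.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only: a fixed number of buckets, each holding the
// key/value pairs whose hash maps to that bucket.
public class ToyMap
{
    private readonly List<KeyValuePair<string, string>>[] buckets =
        new List<KeyValuePair<string, string>>[16];

    // The "hash function" step: turn the key into an array index i.
    private int IndexFor(string key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(string key, string value)
    {
        int i = IndexFor(key);
        if (buckets[i] == null)
            buckets[i] = new List<KeyValuePair<string, string>>();
        buckets[i].Add(new KeyValuePair<string, string>(key, value));
    }

    public bool TryGet(string key, out string value)
    {
        int i = IndexFor(key);
        if (buckets[i] != null)
        {
            // Only the few entries sharing this bucket are compared.
            foreach (var pair in buckets[i])
            {
                if (pair.Key == key) { value = pair.Value; return true; }
            }
        }
        value = null;
        return false;
    }
}
```

The point of the sketch is that a lookup touches one bucket rather than scanning every stored pair.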

Not sure if I'm interpreting your question correctly here...
Your array #2: are you saying you want to replace its elements (say "4,6") with "Barcelona"? If this is the case then:
Loop through array #2, and for each element use string.Split(',') to get the two numerical parts from it (e.g. "4" and "6"). Then use int.Parse() to convert them from strings to ints (call them a and b) and use those ints as indexes into array #1, like array1[a][b] for a jagged array or array1[a, b] for a rectangular one, to get the Y value.
I assume you really want to use an array because those numbers are small and bounded; otherwise use a dictionary as suggested by the other answers.

If you must receive your first set of data as a 2D array, here's how you can turn it into a dictionary:
Dictionary<string, string> dic = new Dictionary<string, string>();
for (int i = 0; i < firstArray.GetLength(0); i++)
{
    dic.Add(firstArray[i, 0], firstArray[i, 1]);
}
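Putting the pieces from the answers together, a minimal end-to-end sketch might look like the following. It assumes the first table arrives as a string[,] with two columns (key, then name); CircuitLookup, ToDictionary and Resolve are hypothetical names, and unknown keys are reported as null rather than throwing.

```csharp
using System;
using System.Collections.Generic;

public static class CircuitLookup
{
    // Builds a key -> name dictionary from an N x 2 array of { key, name } rows.
    public static Dictionary<string, string> ToDictionary(string[,] table)
    {
        var map = new Dictionary<string, string>();
        for (int i = 0; i < table.GetLength(0); i++)
        {
            // The indexer overwrites a duplicate key instead of throwing.
            map[table[i, 0]] = table[i, 1];
        }
        return map;
    }

    // Returns the mapped name for each key, or null when a key is unknown.
    public static List<string> Resolve(Dictionary<string, string> map, IEnumerable<string> keys)
    {
        var result = new List<string>();
        foreach (string key in keys)
        {
            string name;
            result.Add(map.TryGetValue(key, out name) ? name : null);
        }
        return result;
    }
}
```

With a table containing the row { "4,6", "Barcelona" }, resolving the keys "4,6" and "5,15" yields "Barcelona" followed by null.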

Related

Dictionary element accessing

Question
Must looping through all C# Dictionary elements be done only through foreach, and if so why?
Or, I could ask my question as: can Dictionary elements be accessed by position within the Dictionary object (i.e., first element, last element, 3rd from last element, etc)?
Background to this question
I'm trying to learn more about how Dictionary objects work, so I'd appreciate help wrapping my mind around this. I'm learning about this, so I have several thoughts that are all tied into this question. I'll try to present in a way that is appropriate for SO format.
Research
In a C# array, elements are referenced by position. In a Dictionary, values are referenced by keys.
Looking through the documentation on MSDN, there are the statements
"For purposes of enumeration, each item in the dictionary is treated as a
KeyValuePair structure representing a value and its key. The
order in which the items are returned is undefined."
So, it would seem that since the order items are returned in is undefined, there is no way to access elements by position. I also read:
"Retrieving a value by using its key is very fast, close to O(1),
because the Dictionary class is implemented as a hash table."
Looking at the documentation for the HashTable .NET 4.5 class, there is reference to using a foreach statement to loop through and return elements. But there is no reference to using a for statement, or for that matter while or any other looping statement.
Also, I've noticed Dictionary elements use the IEnumerable interface, which seems to use foreach as the only type of statement for looping functions.
Thoughts
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach? Can I not tell, without mapping elements to a collection, if a key is the first key in a Dictionary, or the last key?
This SO question and the excellent answers touch on this, but I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Here's an example. Please note I'm not looking for a way to specifically solve this example; it's for illustration purposes only. Suppose I want to add all the keys in a Dictionary<string, string> object to a comma-separated list, with no comma at the end. With an array I could do:
string[] arrayStr = new string[2] { "Test1", "Test2" };
string outStr = "";
for (int i = 0; i < arrayStr.Length; i++)
{
    outStr += arrayStr[i];
    if (i < arrayStr.Length - 1)
    {
        outStr += ", ";
    }
}
With Dictionary<string, string>, how would I copy each key to outStr using the above method? It appears I would have to use foreach. But what Dictionary methods or properties exist that would let me identify where an element is located at, within a dictionary?
If you're still reading this, I also want to point out I'm not trying to say there's something wrong with Dictionary... I'm simply trying to understand how this tool in the .NET framework works, and how to best use it myself.

Say you have four cars of different colors. And you want to be able to quickly find the key to a car by its color. So you make 4 envelopes labelled "red", "blue", "black", and "white" and place the key to each car in the right envelope. Which is the "first" car? Which is the "third"? You're not concerned about the order of the envelopes; you're concerned about being able to quickly get the key by the color.
So, does this mean that Dictionary elements cannot be accessed by "position," as arrays or lists can?
Not directly, no. You can use Skip and Take but all they will do is iterate until you get to the "nth" item.
If this is so, why is there a .Count property that returns the number of key/value pairs, yet nothing that lets me reference these by nearness to the total? For example, .Count is 5, why can't I request key/value pair .Count minus 1?
You can still measure the number of items even though there's no order. In my example you know there are 4 envelopes, but there's no concept of the "third" envelope.
How is foreach able to loop over each element, yet I have no access to individual elements in the same way?
Because foreach uses IEnumerable, which just asks for the "next" element each time; the underlying collection determines what order the elements are returned in. You can pick up the envelopes one by one, but the order is irrelevant.
Is there no way to determine the position of an element (key or value) in a Dictionary object, without utilizing foreach?
You can infer it by using foreach and counting how many elements you have before reaching the one you want, but as soon as you add or remove an item, that position may change. If I buy a green car and add the envelope, where in the "order" would it go?
I'm specifically looking to see if I must copy elements to an array or other enumerable type, to access specific elements by position.
Well, no, you can use Skip and Take, but there's no way to predict what item is at that location. You can pick up two envelopes, ignore them, pick up another one and call it the "third" envelope, but so what?
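For the comma-separated example in the question, no positional access is needed at all. The sketch below (KeyListing, JoinKeys and NthKey are hypothetical names) uses string.Join for the list, and LINQ's ElementAt to show what "position" means for a dictionary: whatever item enumeration happens to yield n-th.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyListing
{
    // Comma-separated keys with no trailing comma; no index bookkeeping needed.
    public static string JoinKeys(Dictionary<string, string> dict)
    {
        return string.Join(", ", dict.Keys);
    }

    // "Positional" access by enumeration: walks the keys until it reaches
    // the n-th one yielded. The order of a Dictionary is unspecified.
    public static string NthKey(Dictionary<string, string> dict, int n)
    {
        return dict.Keys.ElementAt(n);
    }
}
```

NthKey is the same idea as Skip and Take: it works, but the result is only as meaningful as the enumeration order.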

Several correct answers here, but I thought you might like a short version :)
Under the hood, the Dictionary class has a private field called buckets. It's just an ordinary array of integers, where each slot points at the entries you've added to the Dictionary whose hash falls into that slot.
When you add a key/value pair to the Dictionary, it calculates a hash value for your key. The hash value, reduced to the length of the buckets array, is used as the index into that array. Keys that land in the same bucket are chained together, and the arrays are resized as the Dictionary grows.
Yes, it's possible via reflection (which allows you to extract private data fields) to get the 3rd, or 4th, or Nth member of the buckets array. But the array could be resized at any time and you're not even guaranteed that the implementation details of Dictionary won't change.

In addition to D Stanley's answer, I'd like to add that you should check out SortedDictionary<TKey, TValue>. It stores the key/value pairs in a data structure that does keep the keys ordered.
var d = new SortedDictionary<int, string>();
d.Add(4, "banana");
d.Add(2, "apple");
d.Add(7, "pineapple");
Console.WriteLine(d.ElementAt(1).Value); // banana

Looping through a dictionary does not need to be done using foreach, but the terms 'first,' and 'last' are meaningless in terms of a dictionary, because order is not guaranteed and is in no way related to the order items are added to your dictionary.
Think of it this way. You have a bag that you are using to store blocks, and each block has a unique label on it. Throw in blocks with the labels "Foo," "Bar," and "Baz." Now, if you ask me what the count of my bag is, I can say I have 3 blocks in it, and if you ask me for the block labeled "Bar" I can get it for you. However, if you ask me for the first block, I don't know what you mean. The blocks are just a jumble inside my bag. If, instead, you say "foreach block, I'd like to take a photo of it," I'll hand you each block, one by one. Again, the order isn't guaranteed. I'm just reaching into my bag and pulling out each block until I've gotten each one.
You can also ask for a collection of all the keys in a dictionary, then use each key to access the items in a dictionary. However, once again, the order of the keys is not guaranteed and, in theory, could change every time you access it (in practice, the .NET key order is normally pretty stable).
There are a lot of reasons why a dictionary is stored like this, but the key thing is that dictionaries have to have unspecified order in order to have both O(1) insertion and O(1) access. An array, which has a specified order, has O(1) access (you can get the n'th item in one step), but insertions are O(n).

There is a large number of collections available in the .Net framework. You have to analyse your requirements and decide which collection to use:
Do you need Key/Value pairs or just Items?
Is it important that items are sorted?
Do you need fast insertion or just add at start/end of collection?
Do you need fast retrieval: O(1) or O(log n)?
Do you need an index, i.e. access to items by an integer position?
For most combinations of these requirements there exists a specialized collection.
In your case: Key/Value pairs and access through an index: SortedList
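As a small sketch of that suggestion: SortedList<TKey, TValue> keeps its entries in key order and exposes Keys and Values as indexable lists, so positions are meaningful. SortedListDemo, Build and At are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SortedListDemo
{
    public static SortedList<int, string> Build()
    {
        var list = new SortedList<int, string>();
        list.Add(4, "banana");
        list.Add(2, "apple");
        list.Add(7, "pineapple");
        return list;
    }

    // Key and value at a given position in key order.
    public static KeyValuePair<int, string> At(SortedList<int, string> list, int position)
    {
        return new KeyValuePair<int, string>(list.Keys[position], list.Values[position]);
    }
}
```

Here position 0 is the entry with the smallest key (2, "apple") and position Count - 1 is the one with the largest (7, "pineapple"), regardless of insertion order.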

Array with negative indexes

I have an array which I'm using to store map data for a game I'm working on.
MyMapType[,,] map;
The reason I'm using a fixed array instead of a Collection is that fixed arrays are much faster.
Now my problem is, I'd like to have support for negative z levels in the game. So I'd like to be able to access a negative index.
If this is not possible, I thought of a pair of other solutions.
One possible solution is to treat ground level as some arbitrary number (say 10), so that anything less than 10 is considered negative. But wouldn't this make the array 10 levels larger for nothing if they're not in use?
Another solution I considered was to 'roll my own', where you have a Dictionary of 2D arrays with the Z level as the key. But this is a lot more work and I'm not sure whether it's slow or not.
So to summarise - any way of creating an array which supports a negative index? And if there's not - is there a clean way of 'emulating' such behaviour without sacrificing too much CPU time or RAM - noting that these are game maps which could end up large AND need to be accessed constantly.

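Replace your arrays with a class:
class MyArray {
    private readonly MyMapType[] myArray;
    private readonly int offset; // added to every index, so -offset is the lowest valid index

    public MyArray(int size, int offset) {
        myArray = new MyMapType[size];
        this.offset = offset;
    }

    public MyMapType this[int index] {
        get { return myArray[index + offset]; }
    }
}
You can set the size and the offset in the constructor or even change them at will.
Building on this example, here is another version:
class MyArray {
    private readonly MyMapType[] positives; // indexes 0 .. size-1
    private readonly MyMapType[] negatives; // indexes -1 .. -(size-1)

    public MyArray(int size) {
        positives = new MyMapType[size];
        negatives = new MyMapType[size - 1];
    }

    public MyMapType this[int index] {
        // -1 maps to negatives[0], -2 to negatives[1], and so on
        get { return index >= 0 ? positives[index] : negatives[-index - 1]; }
    }
}
It does not change the fact that you need to set the size for both of them. Honestly, I like the first one better.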

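If you want "negative" indexes, C# 8 now supports indexing from the end of an array with the ^ operator. Note that this counts backwards from the end; it is not a true negative index.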
var words = new string[]
{
// index from start index from end
"The", // 0 ^9
"quick", // 1 ^8
"brown", // 2 ^7
"fox", // 3 ^6
"jumps", // 4 ^5
"over", // 5 ^4
"the", // 6 ^3
"lazy", // 7 ^2
"dog" // 8 ^1
}; // 9 (or words.Length) ^0
So accessing the last element (the "negative one") would look like this:
words[^1]
So in your case the middle element could be the zero Z.

Use the Dictionary class, since you can assign whatever values you want for either the key or value. While I'm not sure how this would work for the 3-dimensional array that you showed above, I can show how this would work if this were a 1-dimensional array, and you can infer how to best make use of it:
MyMapType[] map;
// map is filled with whatever data
Dictionary<int, MyMapType> x = new Dictionary<int, MyMapType>();
x[-1] = /* map data for the negative level */;
x[0] = map[0];
// (etc...)

Could you try to store the MyMapType[,] layers in two lists:
one for z values greater than or equal to 0,
and a second for negative z values.
The index into each list would be derived from the value of z.
Having this would let you quickly access the xy-values for a specific z-level.
Of course the question is: what are your z-values? Are they sparse or dense?
Even for sparse values you would end up with a list holding null values for the unused [,] layers.

I'd like to note here that dictionaries allow negative indexes,
and a 2D dictionary can solve problems like these too; just think about the data structure and whether you can live with a dictionary.
Note that dictionaries and lists are used in different scenarios,
and their speed depends on which operations are used on them.
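One way to combine these suggestions is a dictionary keyed by z level whose values are ordinary 2D arrays. The sketch below is an assumption-laden illustration rather than a tuned implementation: LayeredMap is a hypothetical name, layers are allocated on first write, and levels that were never written read as default(T).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: one 2D layer per z level, created on first write.
public class LayeredMap<T>
{
    private readonly Dictionary<int, T[,]> layers = new Dictionary<int, T[,]>();
    private readonly int width, height;

    public LayeredMap(int width, int height)
    {
        this.width = width;
        this.height = height;
    }

    public T this[int x, int y, int z]
    {
        get
        {
            T[,] layer;
            // Levels that were never written read as default(T).
            return layers.TryGetValue(z, out layer) ? layer[x, y] : default(T);
        }
        set
        {
            T[,] layer;
            if (!layers.TryGetValue(z, out layer))
            {
                layer = new T[width, height];
                layers[z] = layer;
            }
            layer[x, y] = value;
        }
    }
}
```

Each access costs one hash lookup on z plus a normal array access, and unused levels (negative or positive) take no memory. Whether that is fast enough for a constantly accessed game map would need measuring.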

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.

The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort); .NET 4.5 and later use an introspective sort that is O(n log n) even in the worst case. Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };
foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
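For comparison, the "convert to a dictionary and look up" option from the question can be sketched by hand as below. This is roughly the same hash-based idea as the join, written out explicitly. Pairing and PairByName are hypothetical names, and getName is an assumed accessor standing in for the objects' .Name property.

```csharp
using System;
using System.Collections.Generic;

public static class Pairing
{
    // Pairs items from two arrays whose names match exactly (ordinal comparison).
    public static List<KeyValuePair<T, T>> PairByName<T>(T[] first, T[] second, Func<T, string> getName)
    {
        // Index the second array once: O(m).
        var byName = new Dictionary<string, T>(second.Length, StringComparer.Ordinal);
        foreach (T item in second)
        {
            byName[getName(item)] = item;
        }

        // One hash lookup per item in the first array: O(n).
        var pairs = new List<KeyValuePair<T, T>>();
        foreach (T item in first)
        {
            T match;
            if (byName.TryGetValue(getName(item), out match))
            {
                pairs.Add(new KeyValuePair<T, T>(item, match));
            }
        }
        return pairs;
    }
}
```

Items without a partner in the other array are simply skipped, which matches the "zero or one match" constraint in the question.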

Sort the second array using Array.Sort method, then match objects in the second Array using Binary Search Algorithm.
Generally, for 30-50 items this would be a little faster than brute force x*y.
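A minimal sketch of the sort-then-binary-search approach, assuming names are compared ordinally. SortedPairing, SortByName and Find are hypothetical names, and getName again stands in for the .Name property. Sorting a parallel array of names avoids having to write a custom IComparer<T> for the objects themselves.

```csharp
using System;

public static class SortedPairing
{
    // Sorts 'items' by name in place and returns the parallel, sorted name array.
    public static string[] SortByName<T>(T[] items, Func<T, string> getName)
    {
        var names = new string[items.Length];
        for (int i = 0; i < items.Length; i++) names[i] = getName(items[i]);
        // Sorts names and reorders items to match, so index i refers to the same object in both.
        Array.Sort<string, T>(names, items, StringComparer.Ordinal);
        return names;
    }

    // Returns the index of 'name' in the sorted array, or a negative number if absent.
    public static int Find(string[] sortedNames, string name)
    {
        return Array.BinarySearch<string>(sortedNames, name, StringComparer.Ordinal);
    }
}
```

After calling SortByName on the second array, each name from the first array is looked up with Find; a non-negative result is the index of the matching object in the (now sorted) second array.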

Quickest way to determine if a 2D array contains an element?

Let's assume that I've got 2d array like :
int[,] my_array = new int[100, 100];
The array is filled with ints. What would be the quickest way to check if a target-value element is contained within the array ?
(* this is not homework, I'm trying to come up with most efficient solution for this case)

If the array isn't sorted in some fashion, I don't see how anything would be faster than checking every single value using two for statements. If it is sorted you can use a binary search.
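For reference, the plain linear scan can be written compactly, because foreach visits every element of a multi-dimensional array. GridSearch is a hypothetical name; this is the baseline that any indexing scheme has to beat.

```csharp
using System;

public static class GridSearch
{
    // Linear scan: visits each of the rows * columns cells at most once.
    public static bool Contains(int[,] grid, int target)
    {
        foreach (int value in grid)
        {
            if (value == target) return true; // stop at the first hit
        }
        return false;
    }
}
```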
Edit:
If you need to do this repeatedly, your approach would depend on the data. If the integers within this array range only up to 256, you can have a boolean array of that length, and go through the values in your data flipping the bits inside the boolean array. If the integers can range higher you can use a HashSet. The first call to your contains function would be a little slow because it would have to index the data. But subsequent calls would be O(1).
Edit1:
This will index the data on the first run. Benchmarking found that Contains takes 0 milliseconds to run after the first run, and 13 to index. If I had more time I might multithread it and have it return the result while asynchronously continuing to index on the first call. Also, since arrays are reference types, changing the values in the array after it has been indexed will leave the index stale and produce strange behavior, so this is just a sample and should be refactored prior to use.
private class DataContainer
{
    private readonly int[,] _data;
    private HashSet<int> _index;

    public DataContainer(int[,] data)
    {
        _data = data;
    }

    public bool Contains(int value)
    {
        if (_index == null)
        {
            _index = new HashSet<int>();
            for (int i = 0; i < _data.GetLength(0); i++)
            {
                for (int j = 0; j < _data.GetLength(1); j++)
                {
                    _index.Add(_data[i, j]);
                }
            }
        }
        return _index.Contains(value);
    }
}

Assumptions:
There is no kind of ordering in the arrays we can take advantage of
You are going to check for existence in the array several times
I think some kind of index might work nicely if you want a yes/no answer to whether a given number is in the array. A hash table could be used for this, giving you constant-time, O(1), lookups on average.
Also don't forget that realistically, for small MxN array sizes, it might actually be faster just to do a linear O(n) lookup.

create a hash out of the 2d array, where
1 --> 1st row
2 --> 2nd row
...
n --> nth row
O(n) to check the presence of a given element, assuming each hash check gives O(1).
This data structure gives you an opportunity to preserve your 2d array.
upd: ignore the above, it does not give any value. See comments

You could encapsulate the data itself, and keep a Dictionary along with it that gets modified as the data gets modified.
The key of the Dictionary would be the target element value, and the value would be the number of entries of the element. To test if an element exists, simply check the dictionary for a count > 0, which is somewhere between O(1) and O(n). You could also get other statistics on the data much quicker with this construct, particularly if the data is sparse.
The biggest drawback to this solution is that data modifications have more operations involved (still should be O(1), though), so if you're mostly doing data manipulation, then this might not be suitable.
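A minimal sketch of that encapsulation, under the assumption that all writes go through the wrapper. CountedGrid is a hypothetical name, and the value-to-count dictionary is updated on every write so that Contains never has to scan the grid.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper that keeps a value -> occurrence-count index in sync with the grid.
public class CountedGrid
{
    private readonly int[,] data;
    private readonly Dictionary<int, int> counts = new Dictionary<int, int>();

    public CountedGrid(int rows, int columns)
    {
        data = new int[rows, columns];
        // A new int array is all zeros, so 0 occurs rows * columns times.
        if (rows * columns > 0) counts[0] = rows * columns;
    }

    public int this[int row, int column]
    {
        get { return data[row, column]; }
        set
        {
            int old = data[row, column];
            if (old == value) return;
            // Decrement the old value's count, dropping it when it reaches zero.
            if (--counts[old] == 0) counts.Remove(old);
            int current;
            counts.TryGetValue(value, out current); // current stays 0 if absent
            counts[value] = current + 1;
            data[row, column] = value;
        }
    }

    public bool Contains(int value)
    {
        return counts.ContainsKey(value);
    }
}
```

Each write does a couple of extra dictionary operations, which is the trade-off the answer describes.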

Algorithm for matching lists of integers

For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:
DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers
This is probably not relevant but the list values is a list of distinct integers with the property that for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.
The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.
Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.
We are using the following algorithm. Convert the list values to a string via
string signature = String.Join("|", values.OrderBy(n => n).ToArray());
then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
Is there a better method?

You could probably just hash the list itself, instead of going through String.
Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)
You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) that you plug the hashes in. You can compare the two bucket arrays in O(max(n, m)) assuming a length dependent on the number of entries (to get a reasonable load factor).
It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
You cannot do better than O(n + m), because every entry needs to be visited at least once.
Edit: misread, fixed.

On top of the other answers you could make the process faster by creating a low-cost hash simply constructed of a XOR amongst all the elements of each List.
You wouldn't have to order your list and all you would get is an int which is easier and faster to store than strings.
Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it.
If there is already an existing key, only then do you sort the corresponding Lists and compare them.
You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.
If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
Code
Below is a first-attempt at implementing this.
Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList) {
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;
    List<List<int>> existingLists;
    if (checkHash.TryGetValue(xorkey, out existingLists)) {
        // Key already in the dictionary. Check each stored list.
        foreach (List<int> li in existingLists) {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical) {
                // Check all elements (relies on the values in a list being distinct)
                foreach (int v in theList) {
                    if (!li.Contains(v)) {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    } else {
        // First list with this XOR key: create its bucket
        existingLists = new List<List<int>>();
        checkHash.Add(xorkey, existingLists);
    }
    if (!isIdentical) {
        // Never seen this list before: store it under its key.
        // (Calling checkHash.Add again for an existing key would throw.)
        existingLists.Add(theList);
    }
    return isIdentical;
}
It's not the most elegant or easiest to read at first sight, it's rather hacky, and I'm not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing Lists of List<int> in the Dictionary.
If a duplicate key is found, we loop through each previously stored List until we find a match.
The good point about the code is that it should probably be as fast as you could get in most cases, and still faster than building strings when there is a collision.

Implement an IEqualityComparer for List, then you can use the list as a key in a dictionary.
If the lists are sorted, it could be as simple as this:
class IntListEqualityComparer : IEqualityComparer<List<int>> {
    public int GetHashCode(List<int> list) {
        int code = 0;
        foreach (int value in list) code ^= value;
        return code;
    }
    public bool Equals(List<int> list1, List<int> list2) {
        if (list1.Count != list2.Count) return false;
        for (int i = 0; i < list1.Count; i++) {
            if (list1[i] != list2[i]) return false;
        }
        return true;
    }
}
Now you can create a dictionary that uses the IEqualityComparer:
Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());
Add all the items from the first day to the dictionary, then loop through the items from the second day and check if the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much what ever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).
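The two-day matching described above might be sketched as follows. Entry, SortedIntListComparer, DayMatcher and FindMoved are hypothetical names; the sketch assumes each Values list is already sorted ascending and that no two entries on the same day share the same list, as stated in the question. The comparer is repeated here only so that the block is self-contained.

```csharp
using System;
using System.Collections.Generic;

public class Entry
{
    public int Key;
    public List<int> Values; // assumed sorted ascending
}

public class SortedIntListComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> a, List<int> b)
    {
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(List<int> list)
    {
        int code = 0;
        foreach (int v in list) code ^= v;
        return code;
    }
}

public static class DayMatcher
{
    // Returns (day 1 entry, day 2 entry) pairs with equal value lists but different keys.
    public static List<KeyValuePair<Entry, Entry>> FindMoved(IEnumerable<Entry> day1, IEnumerable<Entry> day2)
    {
        var byValues = new Dictionary<List<int>, Entry>(new SortedIntListComparer());
        foreach (Entry e in day1) byValues[e.Values] = e;

        var moved = new List<KeyValuePair<Entry, Entry>>();
        foreach (Entry e in day2)
        {
            Entry previous;
            if (byValues.TryGetValue(e.Values, out previous) && previous.Key != e.Key)
                moved.Add(new KeyValuePair<Entry, Entry>(previous, e));
        }
        return moved;
    }
}
```

This does one pass over each day with no sorting of hash codes, so the expected cost is linear in the number of entries, apart from hash collisions.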

Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal?
If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.
Also, what kind of keys is it? Does it have a definite range (e.g. 0-63)? You might be able to concatenate them into large integer (may require precision beyond 64-bits), and hash, instead of converting to string, because that might take a while.

It might be worthwhile to place this in a SQL database. If you don't want to have a full blown DBMS you could use sqlite.
This would make uniqueness checks, unions and similar operations very simple queries, and would be very efficient. It would also allow you to easily store the information if it is ever needed again.

Would you consider summing up the list of values to obtain an integer which can be used as a precheck of whether different lists contain the same set of values?
There will be many more collisions (the same sum doesn't necessarily mean the same set of values), but I think it can reduce the set of comparisons required by a large amount as a first step.
