I have a 200 MiB binary file containing pairs of Int64 and Int32 values. The file is already sorted by the Int64 values.
I'm reading it like this:
private SortedDictionary<Int64, int> edgeID = new SortedDictionary<Int64, int>();

// Load the edge ID mapping file
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 keyID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();
    for (int i = 0; i < numArchi; i++)
    {
        keyID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[keyID] = grafoID;
    }
}
But it's super slow. I noticed that if I use a normal Dictionary I can speed things up a bit by passing the size to the constructor, but apparently SortedDictionary doesn't support that. Also, I think the bigger problem is that the program doesn't know I'm passing data that is already ordered, so it checks the key position on every insert.
Your second assumption is partially correct: SortedDictionary, in general, has no way of telling that the data is ordered, so it needs to check where each key belongs on every insert.
Furthermore, SortedDictionary is a red-black tree internally in the current Mono and Microsoft implementations of the BCL.
Insertion of already ordered data is actually the worst-case scenario for a red-black tree (see the insert section of any red-black tree description).
Also, the implementation might change in the future, so optimizing for it (or against it) is a risky move!
In comparison, Dictionary is a hash table internally (that's why you can give it an initial size: hash tables are backed by arrays, of different shapes depending on the implementation), so both insertion and lookup are faster (amortized O(1)).
You have to consider what you need it for. If you just want to search by key and the data is already ordered, probably the best way is to load the data into a simple contiguous array (a .NET array or a List) and use binary search on it. You will need the overload that takes a comparer, as your array will probably hold a Tuple; or you may just use two parallel arrays, one of keys and one of values. Your choice.
This is much faster than SortedDictionary and Dictionary on loading (and it should be faster than SortedDictionary at searching, too).
It might be that Dictionary is still faster at searching for a single item; hash tables have O(1) access, while binary search on an ordered array is O(log N), but locality plays an important role at these "intermediate" data dimensions, so the real result might differ from the theoretical one. You will need to profile and measure to decide which is best in your case!
(By intermediate dimensions I mean: something for which the big-O notation does not play a dominant role, yet)
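For illustration, here is a minimal sketch of the two-parallel-arrays approach (the file name, someKey and the other names are just placeholders for your own):
// Load the pre-sorted pairs into two parallel arrays; the file layout is the
// same as in the question (count, then Int64/Int32 pairs).
long[] keys;
int[] values;
using (var b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    int numArchi = b.ReadInt32();
    keys = new long[numArchi];
    values = new int[numArchi];
    for (int i = 0; i < numArchi; i++)
    {
        keys[i] = b.ReadInt64();
        values[i] = b.ReadInt32();
    }
}

// Lookup: Array.BinarySearch returns the index of the key, or a negative
// number when the key is not present.
long someKey = 42;
int idx = Array.BinarySearch(keys, someKey);
int grafoID = idx >= 0 ? values[idx] : -1; // -1 used here only to mean "not found"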
Using a Dictionary with its capacity set to the number of elements, like this:
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 tomtomID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();
    edgeID = new Dictionary<long, int>(numArchi);
    for (int i = 0; i < numArchi; i++)
    {
        tomtomID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[tomtomID] = grafoID;
    }
}
This solved the issue.
Related
In general, I know that hashtables have better performance than dynamic arrays when searching for a particular value. Does this still hold when we fetch the value directly by passing in an index?
E.g.:
Hashtable d = new Hashtable(); // let us consider it is initialized with values
for (int i = 0; i < largeValue; i++)
{
    int doNothing = (int)d[i];
}
versus:
List<int> l = new List<int>(); // let us consider it is initialized with values
for (int i = 0; i < largeValue; i++)
{
    int doNothing = l[i];
}
Which one has better performance?
Edit 1: I have edited the question based on the comments.
Dictionary lookups will be slower even though both operations are O(1). Indexing into a predefined memory location in an array (the data structure Lists are built on) involves fewer operations; most likely it is just a matter of finding the beginning of the memory block and adding the index offset.
Hashtable lookups, on the other hand, are amortized O(1). Depending on how the hashtable is implemented, you'll need to compute a potentially complex hashcode, index into a bucket, walk part of a linked list pointed to by the bucket, and check equality to ensure that the item you've located matches the one you're after. This is a lot more work than a list index operation.
Two operations may be in the same time-complexity category (in this case O(1)), but they may have radically different constant factors in the work they perform. Such is the case here.
Take a look at Dictionary part 2, .NET implementation if you'd like to explore the workings of C# dictionaries further.
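If you want to see the constant-factor difference for yourself, a rough (non-rigorous) Stopwatch sketch like the following will do; exact numbers depend entirely on your machine:
const int n = 10_000_000;
var list = new List<int>(n);
var dict = new Dictionary<int, int>(n);
for (int i = 0; i < n; i++) { list.Add(i); dict[i] = i; }

var sw = System.Diagnostics.Stopwatch.StartNew();
long sum1 = 0;
for (int i = 0; i < n; i++) sum1 += list[i];       // direct index: base address + offset
Console.WriteLine($"List indexing:     {sw.ElapsedMilliseconds} ms (checksum {sum1})");

sw.Restart();
long sum2 = 0;
for (int i = 0; i < n; i++) sum2 += dict[i];       // hash, bucket, equality check
Console.WriteLine($"Dictionary lookup: {sw.ElapsedMilliseconds} ms (checksum {sum2})");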
I have a ConcurrentDictionary of arrays, where each array has the same fixed size. It looks like this: ConcurrentDictionary<int, double[]> ItemFeatures
I want to normalize the values in the list by dividing all the values by the maximum of the values in that column. For example, if my lists are of size 5, I want every element in the first position to be divided by the maximum of all the values in that position, and so on for position 2 onwards.
The naive way that I can think of doing this, is by first iterating over every list and every element in the list, and storing the max value per position. Then iterating over them again and dividing them by the previously found maximum values.
Is there a more elegant way to do this in Linq perhaps? These dictionaries would be large, so the more efficient/least time consuming, the better.
No, that will actually be the most efficient way. In the end, this is what you need to do anyway, you can't skip anything. You can probably write it in LINQ somehow, but the performance will be worse because it will have a lot of function calls and memory allocations. LINQ doesn't perform miracles, it's just a (sometimes) shorter way of writing things.
What can speed this up is if your algorithm has a good "cache locality" - in other words, if you access the computer memory in a sequential way. That's pretty hard to guarantee in an environment like .NET, but a loop like you described probably has the best chances of getting close to it.
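For reference, the plain two-pass loop version of what the question describes looks like this (assuming every array has the same fixed length, as stated):
int cols = ItemFeatures.First().Value.Length;
var max = new double[cols];
for (int j = 0; j < cols; j++) max[j] = double.MinValue;

// First pass: find the maximum of each column.
foreach (var pair in ItemFeatures)
    for (int j = 0; j < cols; j++)
        if (pair.Value[j] > max[j]) max[j] = pair.Value[j];

// Second pass: divide every element by its column maximum.
foreach (var pair in ItemFeatures)
    for (int j = 0; j < cols; j++)
        pair.Value[j] /= max[j];   // assumes the maximum in each column is non-zero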
LINQ is designed for querying data, not modifying data. You can use a little LINQ to compute the maximums, but that is about it:
var cols = ItemFeatures.First().Value.Length;
var maxv = new double[cols];

// Compute the per-column maximums with LINQ...
for (var j1 = 0; j1 < cols; ++j1)
    maxv[j1] = ItemFeatures.Values.Select(vs => vs[j1]).Max();

// ...then normalize in place with plain loops.
foreach (var kvp in ItemFeatures)
    for (var j1 = 0; j1 < cols; ++j1)
        kvp.Value[j1] /= maxv[j1];
I have a SortedDictionary
SortedDictionary<int, CPUOptimizationObject> myDict;
Now I want to find the first value above X.
I can do something like this
foreach (var iKey in myDict.Keys)
{
    if (iKey >= thresholdKey)
    {
        foundKey = iKey;
        break;
    }
}
but this isn't good performance-wise.
Any better suggestion?
(Is there a method for this in the collections, something like binary search for SortedDictionary?)
While, in theory, finding the smallest item that is larger than a given value is an operation that can be performed efficiently on a binary search tree (which is what a SortedDictionary is implemented as), SortedDictionary does not expose any means of performing such a search.
You would need to use a different implementation of a binary search tree in order to perform such a search efficiently while still using the same type of data structure. There are no suitable built-in .NET types; you would need to use a third-party implementation (of which there are quite a few out there).
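One partial workaround using only the BCL, assuming you can afford to keep the keys in a parallel SortedSet<int> next to the dictionary, is GetViewBetween; a sketch:
// Keep the keys in a SortedSet<int> alongside myDict and keep the two in sync.
var keySet = new SortedSet<int>(myDict.Keys);

// GetViewBetween returns a view of the keys in [thresholdKey, int.MaxValue].
var view = keySet.GetViewBetween(thresholdKey, int.MaxValue);
if (view.Count > 0)
{
    int foundKey = view.Min;            // smallest key >= thresholdKey
    var found = myDict[foundKey];
}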
You could try whether this is faster. But I guess it will only be faster if you are performing the search multiple times.
var keys = new List<int>(myDict.Keys);
int index = keys.BinarySearch(thresholdKey);
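Note that List<T>.BinarySearch returns the bitwise complement of the index of the next larger element when the key is not found, so you need one extra step to get "the first key at or above the threshold"; a sketch:
var keys = new List<int>(myDict.Keys);   // SortedDictionary enumerates keys in sorted order
int index = keys.BinarySearch(thresholdKey);
if (index < 0) index = ~index;           // not found: ~index is the first key greater than thresholdKey
if (index < keys.Count)
{
    int foundKey = keys[index];          // first key >= thresholdKey
}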
Create a temporary list n by calling .ToList() on your SortedDictionary's Keys.
Since that gives you a List<int>, you could then do
n.Find(item => item > 20)
to retrieve the first key that matches.
I don't know if this has better performance than the foreach, but this should work:
var foo = myDict.FirstOrDefault(i => i.Key > thresholdKey);
Let's assume that I've got a 2D array like:
int[,] my_array = new int[100, 100];
The array is filled with ints. What would be the quickest way to check whether a target value is contained within the array?
(* This is not homework; I'm trying to come up with the most efficient solution for this case.)
If the array isn't sorted in some fashion, I don't see how anything would be faster than checking every single value using two for statements. If it is sorted you can use a binary search.
Edit:
If you need to do this repeatedly, your approach would depend on the data. If the integers within this array only range up to 256, you can have a boolean array of that length and set the corresponding entries while going through the values in your data. If the integers can range higher, you can use a HashSet. The first call to your Contains function would be a little slow because it would have to index the data, but subsequent calls would be O(1).
Edit1:
This will index the data on the first run. Benchmarking found that Contains takes 0 milliseconds after the first run and 13 milliseconds to index. If I had more time I might multithread it and have it return the result while asynchronously continuing to index on the first call. Also, since arrays are reference types, changing the values of the data passed in before or after it has been indexed will produce strange behaviour, so this is just a sample and should be refactored prior to use.
private class DataContainer
{
    private readonly int[,] _data;
    private HashSet<int> _index;

    public DataContainer(int[,] data)
    {
        _data = data;
    }

    public bool Contains(int value)
    {
        // Build the index lazily on the first call.
        if (_index == null)
        {
            _index = new HashSet<int>();
            for (int i = 0; i < _data.GetLength(0); i++)
            {
                for (int j = 0; j < _data.GetLength(1); j++)
                {
                    _index.Add(_data[i, j]);
                }
            }
        }
        return _index.Contains(value);
    }
}
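A short usage sketch of the container above (the array contents and the probe value are arbitrary):
int[,] my_array = new int[100, 100];
// ... fill my_array ...
var container = new DataContainer(my_array);
bool first = container.Contains(42);   // first call pays the one-time indexing cost
bool later = container.Contains(42);   // later calls are plain O(1) HashSet lookups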
Assumptions:
There is no kind of ordering in the arrays we can take advantage of
You are going to check for existence in the array several times
I think some kind of index might work nicely if you want a yes/no answer as to whether a given number is in the array. A hash table could be used for this, giving you constant-time (O(1)) lookups.
Also, don't forget that, realistically, for small MxN array sizes it might actually be faster just to do a linear O(n) lookup.
create a hash out of the 2d array, where
1 --> 1st row
2 --> 2nd row
...
n --> nth row
O(n) to check the presence of a given element, assuming each hash check gives O(1).
This data structure gives you an opportunity to preserve your 2d array.
Update: ignore the above, it does not add any value. See comments.
You could encapsulate the data itself, and keep a Dictionary along with it that gets modified as the data gets modified.
The key of the Dictionary would be the target element value, and the value would be the number of occurrences of that element. To test whether an element exists, simply check the dictionary for a count > 0, which is somewhere between O(1) and O(n). You could also get other statistics on the data much more quickly with this construct, particularly if the data is sparse.
The biggest drawback to this solution is that data modifications involve more operations (still O(1), though), so if you're mostly doing data manipulation, this might not be suitable.
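A rough sketch of that idea, assuming a wrapper type (here called CountedGrid, purely illustrative) that keeps the counts in sync with every write:
class CountedGrid
{
    private readonly int[,] _data;
    private readonly Dictionary<int, int> _counts = new Dictionary<int, int>();

    public CountedGrid(int[,] data)
    {
        _data = data;
        foreach (int v in data)          // foreach enumerates every cell of the 2D array
        {
            _counts.TryGetValue(v, out int c);
            _counts[v] = c + 1;
        }
    }

    public bool Contains(int value)
    {
        return _counts.TryGetValue(value, out int c) && c > 0;
    }

    public void Set(int i, int j, int value)
    {
        int old = _data[i, j];
        _counts[old]--;                  // old is always present: it was counted in the constructor
        _counts.TryGetValue(value, out int c);
        _counts[value] = c + 1;
        _data[i, j] = value;
    }
}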
For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:
DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers
This is probably not relevant but the list values is a list of distinct integers with the property that for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.
The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.
Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.
We are using the following algorithm. Convert the list values to a string via
string signature = String.Join("|", values.OrderBy(n => n).ToArray());
then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
Is there a better method?
You could probably just hash the list itself, instead of going through String.
Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)
You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) that you plug the hashes in. You can compare the two bucket arrays in O(max(n, m)) assuming a length dependent on the number of entries (to get a reasonable load factor).
It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
You cannot do better than O(n + m), because every entry needs to be visited at least once.
Edit: misread, fixed.
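For illustration, a sketch of the set-based comparison; it reuses the string signature from the question for brevity (hashing the list directly, as suggested above, would avoid the string allocations), and day1Items / day2Items are hypothetical inputs holding (key, values) pairs:
// Map signature -> key for each day, then intersect the signature sets.
Dictionary<string, int> BuildIndex(IEnumerable<(int key, List<int> values)> day)
{
    var index = new Dictionary<string, int>();
    foreach (var (key, values) in day)
    {
        // Same signature as in the question: sorted values joined by '|'.
        string signature = string.Join("|", values.OrderBy(n => n));
        index[signature] = key;          // unique per day, since no integer appears in two lists
    }
    return index;
}

var day1 = BuildIndex(day1Items);
var day2 = BuildIndex(day2Items);

var sharedSignatures = new HashSet<string>(day1.Keys);
sharedSignatures.IntersectWith(day2.Keys);   // O(n + m) on average

foreach (string sig in sharedSignatures)
{
    if (day1[sig] != day2[sig])
    {
        // Same list of values on both days, but under a different key.
    }
}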
On top of the other answers, you could make the process faster by creating a low-cost hash constructed simply as an XOR of all the elements of each List.
You wouldn't have to order your list, and all you would get is an int, which is easier and faster to store than a string.
Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it.
If there is already an existing key, only then do you sort the corresponding Lists and compare them.
You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.
If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
Code
Below is a first-attempt at implementing this.
Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList)
{
    bool isIdentical = false;

    // Low-cost hash: XOR of all the elements (order-independent).
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    checkHash.TryGetValue(xorkey, out existingLists);

    if (existingLists != null)
    {
        // Key already in the dictionary: check each stored list for a real match.
        foreach (List<int> li in existingLists)
        {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical)
            {
                // Check all elements
                foreach (int v in theList)
                {
                    if (!li.Contains(v))
                    {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    }

    if (!isIdentical)
    {
        // Never seen this exact list before: remember it under its XOR key.
        if (existingLists == null)
        {
            existingLists = new List<List<int>>();
            checkHash.Add(xorkey, existingLists);
        }
        existingLists.Add(theList);
    }

    return isIdentical;
}
This is not the most elegant or easiest to read at first sight; it's rather 'hacky', and I'm not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing Lists of List<int> in the Dictionary.
If a duplicate key is found, we loop through each previously stored List until we find a match (or run out of candidates).
The good point about the code is that it should be about as fast as you can get in most cases, and still faster than building strings when there is a collision.
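A brief usage sketch of CheckDuplicate for the two-day scenario (dayOneLists and dayTwoLists are hypothetical inputs):
// Feed day-one lists first, then probe with day-two lists.
foreach (List<int> values in dayOneLists)
    CheckDuplicate(values);

foreach (List<int> values in dayTwoLists)
{
    if (CheckDuplicate(values))
    {
        // An identical list of values was seen on day one; compare the associated keys here.
    }
}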
Implement an IEqualityComparer for List, then you can use the list as a key in a dictionary.
If the lists are sorted, it could be as simple as this:
class IntListEqualityComparer : IEqualityComparer<List<int>>
{
    public int GetHashCode(List<int> list)
    {
        int code = 0;
        foreach (int value in list) code ^= value;
        return code;
    }

    public bool Equals(List<int> list1, List<int> list2)
    {
        if (list1.Count != list2.Count) return false;
        for (int i = 0; i < list1.Count; i++)
        {
            if (list1[i] != list2[i]) return false;
        }
        return true;
    }
}
Now you can create a dictionary that uses the IEqualityComparer:
Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());
Add all the items from the first day to the dictionary, then loop through the items from the second day and check whether the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much whatever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (i.e., with the same hash code).
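For illustration, the day-to-day comparison described above might look like this; dayOneItems / dayTwoItems are hypothetical collections of the question's data structure, and the lists are assumed to be kept sorted (as required by the Equals above):
var day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());

foreach (var item in dayOneItems)
    day1[item.values] = item;

foreach (var item in dayTwoItems)
{
    if (day1.TryGetValue(item.values, out YourClass match) && match.key != item.key)
    {
        // Same list of values on both days, but under a different key.
    }
}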
Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal?
If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.
Also, what kind of keys are they? Do they have a definite range (e.g. 0-63)? You might be able to concatenate them into a large integer (which may require precision beyond 64 bits) and hash that, instead of converting to a string, because that might take a while.
It might be worthwhile to put this in a SQL database. If you don't want a full-blown DBMS, you could use SQLite.
This would make uniqueness checks, unions, and these kinds of operations very simple queries, and it would be very efficient. It would also allow you to easily store the information if it is ever needed again.
Would you consider summing up the list of values to obtain an integer that can be used as a precheck of whether different lists contain the same set of values?
Though there will be many more collisions (the same sum doesn't necessarily mean the same set of values), I think it can reduce the number of comparisons required by a large amount.
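A sketch of that precheck, assuming the full comparison from the earlier answers only runs when the sums collide (dayOneLists / dayTwoLists are hypothetical inputs):
// Group day-one lists by their sum; only lists whose sums match get the full comparison.
var dayOneBySum = dayOneLists.GroupBy(values => values.Sum())
                             .ToDictionary(g => g.Key, g => g.ToList());

foreach (List<int> values in dayTwoLists)
{
    if (dayOneBySum.TryGetValue(values.Sum(), out var candidates))
    {
        foreach (var candidate in candidates)
        {
            // Full comparison (sort + element-by-element, or an IEqualityComparer) goes here.
        }
    }
}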