What would be the best data structure for a merge sort? - c#

I want to do a merge sort in c#. I know the premise of the sort, but I can't think how to store the values when the data is split in halves. Obviously the data will be split again and again until the individual values are reached, and then they are stored. After that comes the comparison. I haven't started coding yet, so I am open to suggestions. For my previous bubble and binary insertion sorts I used arrays. The data sets will be variable in size, so I want something that accommodates this.
MergeString MergeSort = new MergeString();
int StringCount = MergeSort.GetNumberOfStrings();
string[] UnsortedStringList = MergeSort.GetListOfStrings(StringCount);
public class MergeInt
{
    public int GetNumberOfNumbers()
    {
        bool isParsable = false;
        int NumberCount = 0;
        do
        {
            Console.WriteLine("How many numbers are in the unsorted list?");
            string NumberOfNumbers = Console.ReadLine();
            isParsable = int.TryParse(NumberOfNumbers, out NumberCount);
        } while (isParsable == false);
        return NumberCount;
    }

    public int[] GetListOfNumbers(int NumberCount)
    {
        bool isParsable = false;
        int[] UnsortedNumberList = new int[NumberCount];
        for (int i = 0; i < NumberCount; i++)
        {
            int Number = 0;
            do
            {
                Console.WriteLine("What number is in the list?");
                string NumberString = Console.ReadLine();
                isParsable = int.TryParse(NumberString, out Number);
            } while (isParsable == false);
            UnsortedNumberList[i] = Number;
        }
        return UnsortedNumberList;
    }

    public void SortList(int[] a)
    {
        int split = a.Length / 2;
        Console.WriteLine(split);
    }
}

Merge sort requires an extra buffer to store temporary data. To keep it simple you should probably just use arrays for everything. I'm not sure I would call this a "data structure", since it is just a plain memory buffer.
For the actual merge code I would probably recommend using Span<T>; this lets you refer to a section of memory inside a buffer by Slicing it. An array is implicitly convertible to a span, but you can also call AsSpan.
For this kind of problem I would probably go for a recursive solution, and I would expect the major difficulties to be avoiding off-by-one errors and making sure you are swapping and merging buffers correctly.
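As a rough illustration of that approach (my own sketch, not a finished implementation): the recursion sorts each half, merges into the scratch buffer, and copies back, with Span<T> slices addressing the two halves.
using System;

public static class Merge
{
    public static void Sort(int[] items)
    {
        var scratch = new int[items.Length];          // the extra buffer
        SortRecursive(items.AsSpan(), scratch.AsSpan());
    }

    private static void SortRecursive(Span<int> data, Span<int> scratch)
    {
        if (data.Length <= 1) return;

        int split = data.Length / 2;
        SortRecursive(data.Slice(0, split), scratch.Slice(0, split));
        SortRecursive(data.Slice(split), scratch.Slice(split));

        // Merge the two sorted halves into the scratch buffer...
        int left = 0, right = split, write = 0;
        while (left < split && right < data.Length)
            scratch[write++] = data[left] <= data[right] ? data[left++] : data[right++];
        while (left < split) scratch[write++] = data[left++];
        while (right < data.Length) scratch[write++] = data[right++];

        // ...then copy the merged result back into place.
        scratch.Slice(0, data.Length).CopyTo(data);
    }
}
Since everything is sized from items.Length, this also handles the variable-size data sets mentioned in the question.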

Related

Vectorize ND-Array to 1D-Array as fast as possible

I'm trying to vectorize an n-dimensional array into a 1-dimensional array in C# to make it easier to work with linear indexing later (whatever the type of the elements).
So far I was using Buffer.BlockCopy to do that (and even to reshape from n dimensions to m dimensions, as long as the number of elements did not change), but unfortunately I came across having to reshape arrays whose elements are not primitive types (double, single, int), and in this case Buffer.BlockCopy does not work (for example, an array of string or any other non-primitive type).
Currently the solution I have is to make a special case for non-primitive types:
/// <summary>Vectorize ND-array</summary>
/// <param name="arrayNd">ND-Array to vectorize.</param>
/// <returns>Surface copy as 1D array.</returns>
public static Array Vectorize(Array arrayNd)
{
    // Check arguments
    if (arrayNd == null) { return null; }
    var elementCount = arrayNd.Length;

    // Create 1D array
    var tarray = arrayNd.GetType();
    var telem = tarray.GetElementType();
    var array1D = Array.CreateInstance(telem, elementCount);

    // Surface copy
    if (telem.IsPrimitive)
    {
        // Block copy only works for array whose elements are primitive types (double, single, ...)
        var numberOfBytes = Buffer.ByteLength(arrayNd);
        Buffer.BlockCopy(arrayNd, 0, array1D, 0, numberOfBytes);
    }
    else
    {
        // Slow version for other element types
        // NB: arrayNd.GetValue(...) does not support linear indexing so need to compute indices for each dimension (very slow !!)
        var indices = new int[arrayNd.Rank];
        for (var i = 0; i < elementCount; i++)
        {
            var idx = i;
            for (var d = arrayNd.Rank - 1; d >= 0; d--)
            {
                var l = arrayNd.GetLength(d);
                indices[d] = idx % l;
                idx /= l;
            }
            array1D.SetValue(arrayNd.GetValue(indices), i);
        }
    }

    // Return as 1D
    return array1D;
}
So this now works for all types:
var double1D = Vectorize(new double[3, 2, 5]); // Fast BlockCopy
var string1D = Vectorize(new string[3, 2, 5]); // Slow solution
I already have an NEnumerator class of my own to speed up computing the indices (instead of using modulo as above), but maybe there is a really fast way of just making this sort of "surface memcpy"?
NB1: I'd like to avoid unsafe code, but if it's the only way ...
NB2: I really want to work with System.Array (eventually I'll do a bunch of T[] Vectorize(T[,,,,] array) overloads, but that's not the issue).
In my experience, multidimensional arrays are kind of a pain to work with, in large part because it is so difficult to access the backing data. As far as I know there is no direct way to just copy all the elements for arbitrary types.
Because of this I tend to prefer a custom type for my 2D data that uses a linear array as backing storage and indexes it like myArray[y * width + x]. With this model the whole exercise becomes a no-op, you can get a pointer to pass to native code, it works better with serialization, etc.
For 3D/4D arrays you could use the same model, but it seems like the best option for performance is to allocate slices independently, i.e. myArray[z][y * width + x], at least for large arrays. I have not worked with 4D arrays, but in general I would avoid multidimensional arrays if performance is a concern. There might also be libraries out there that suit your needs, but I'm not aware of any specific one.
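A minimal sketch of such a 2D wrapper (my own illustration of the idea, not code from this answer):
using System;

// 2D "array" backed by a flat 1D array, indexed as data[y * width + x].
public sealed class Array2D<T>
{
    private readonly T[] _data;
    public int Width { get; }
    public int Height { get; }

    public Array2D(int width, int height)
    {
        Width = width;
        Height = height;
        _data = new T[width * height];
    }

    public T this[int x, int y]
    {
        get => _data[y * Width + x];
        set => _data[y * Width + x] = value;
    }

    // The "vectorize" operation is now free: just expose the backing storage.
    public T[] Flat => _data;
}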
However, looking at your code I would expect some possible improvements. You are currently doing Rank calls to GetLength, plus a modulus and a division, for each element. So I would expect something like this to be a bit faster:
public static Array MultidimensionalToLinear(Array arr)
{
    var rank = arr.Rank;
    var lengths = new int[rank];
    for (int i = 0; i < rank; i++)
    {
        lengths[i] = arr.GetLength(i);
    }
    var linearLength = arr.Length;
    var result = Array.CreateInstance(arr.GetType().GetElementType(), linearLength);
    var index = new int[rank];
    var linearIndex = 0;
    CopyRecursive(0, index, result, ref linearIndex);

    void CopyRecursive(int rank, int[] index, Array result, ref int linearIndex)
    {
        var lastIndex = index.Length - 1;
        if (rank == lastIndex)
        {
            for (int i = 0; i < lengths[lastIndex]; i++)
            {
                index[lastIndex] = i;
                result.SetValue(arr.GetValue(index), linearIndex);
                linearIndex++;
            }
        }
        else
        {
            for (int i = 0; i < lengths[rank]; i++)
            {
                index[rank] = i;
                CopyRecursive(rank + 1, index, result, ref linearIndex);
            }
        }
    }

    return result;
}
However, when measuring, it seems like the performance improvement is fairly small, probably because the code in GetValue dominates the runtime.
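Another option (my addition, not part of this answer; it requires .NET 6 or later and that T exactly matches the array's element type) is to view the backing storage of the multidimensional array directly as a flat Span<T>, since the elements are laid out contiguously in row-major order:
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class ArrayFlatten
{
    // Works for primitive and reference element types alike; for reference types
    // it copies only the references, i.e. the same "surface copy" as above.
    public static T[] VectorizeFast<T>(Array arrayNd)
    {
        ref byte first = ref MemoryMarshal.GetArrayDataReference(arrayNd);
        Span<T> source = MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref first), arrayNd.Length);

        var result = new T[arrayNd.Length];
        source.CopyTo(result);
        return result;
    }
}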

How to speed up the process of looping a million values array?

So I was taking an online test where I had to implement a piece of code to simply check whether a value was in an array. I wrote the following code:
using System;
using System.IO;
using System.Linq;

public class Check
{
    public static bool ExistsInArray(int[] ints, int val)
    {
        if (ints.Contains(val)) return true;
        else return false;
    }
}
Now I didn't see any problems here because the code works fine, but somehow I still failed the test because this is "not fast enough" once the array contains a million values.
The only code I wrote myself is:
if (ints.Contains(val)) return true;
else return false;
The other code I was given to work with.
Is there a way to speed up this process?
Thanks in advance.
EDIT:
I came across a page where someone apparently took the same test as I did, and it seems to come down to saving CPU cycles.
Reference: How to save CPU cycles when searching for a value in a sorted list?
Now his solution within the method is:
var lower = 0;
var upper = ints.Length - 1;
if (k < ints[lower] || k > ints[upper]) return false;
if (k == ints[lower]) return true;
if (k == ints[upper]) return true;
do
{
    var middle = lower + (upper - lower) / 2;
    if (ints[middle] == k) return true;
    if (lower == upper) return false;
    if (k < ints[middle])
        upper = Math.Max(lower, middle - 1);
    else
        lower = Math.Min(upper, middle + 1);
} while (true);
Now I see how this code works, but it's not clear to me why it is supposed to be faster. It would be nice if someone could elaborate.
If it's a sorted array you can use BinarySearch to speed up the process:
public static bool ExistsInArray(int[] ints, int val)
{
    return Array.BinarySearch(ints, val) >= 0;
}
You can use Parallel, something like the code below:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelDemo
{
    class Program
    {
        static void Main()
        {
            var options = new ParallelOptions()
            {
                MaxDegreeOfParallelism = 2
            };
            List<int> integerList = Enumerable.Range(0, 10).ToList();
            Parallel.ForEach(integerList, options, i =>
            {
                Console.WriteLine(@"value of i = {0}, thread = {1}",
                    i, Thread.CurrentThread.ManagedThreadId);
            });
            Console.WriteLine("Press any key to exit");
            Console.ReadLine();
        }
    }
}
Note: it will speed things up, but you're going to use more memory.
The correct answer is: it depends.
Is the list sorted?
How big is the list?
How many cores can you throw at the problem?
The simplest answer is that Linq, for all its wonder, is actually quite slow: it adds delegate and iterator overhead and generally performs a lot of work under the covers. It's great when readability is your main goal. But for performance? No.
For a single-threaded search of an unsorted list, the old-fashioned for loop will give you the best results. If the list is sorted, then a binary search or some variation of it will work best.
As for going parallel, C# has the Parallel class. But beware: if the list is small enough, the overhead of creating threads can outweigh your search time.
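For example (my own sketch, not part of this answer), a parallel membership test over an unsorted array can be written as a PLINQ query; it only pays off for large inputs:
using System.Linq;

public static bool ExistsInArrayParallel(int[] ints, int val)
{
    // PLINQ partitions the array across the available cores.
    return ints.AsParallel().Contains(val);
}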
Simple, single threaded, unsorted answer:
public static bool ExistsInArray(int[] ints, int val)
{
    for (int index = 0, count = ints.Length; index < count; ++index)
    {
        if (ints[index] == val) return true;
    }
    return false;
}
It's possible that the site you're using wants this instead, but it only works if the array is sorted:
public static bool ExistsInArray(int[] ints, int val)
{
    return Array.BinarySearch(ints, val) >= 0;
}
Supporting posts that demonstrate that Linq is not so fast.
For vs. Linq - Performance vs. Future
https://www.anujvarma.com/linq-versus-loopingperformance/
If the input array is already sorted, then using BinarySearch is the best approach.
.NET has built-in support for binary search via the Array.BinarySearch method.
I just did a quick experiment comparing Contains and BinarySearch on a sorted array of 1 million integer values, as follows.
public static void Main()
{
    var collection = Enumerable.Range(0, 1000000).ToArray();
    var st = new Stopwatch();
    var val = 999999;
    st.Start();
    var isExist = collection.Contains(val);
    st.Stop();
    Console.WriteLine("Time taken for Contains : {0}", st.Elapsed.TotalMilliseconds);
    st.Restart();
    var p = BinarySearchArray(collection, 0, collection.Length - 1, val);
    st.Stop();
    if (p == -1)
    {
        Console.WriteLine("Not Found");
    }
    else
    {
        Console.WriteLine("Item found at position : {0}", p);
    }
    Console.WriteLine("Time taken for binary search {0}", st.Elapsed.TotalMilliseconds);
}
}
private static int BinarySearchArray(int[] inputArray, int lower, int upper, int val)
{
    if (lower > upper)
        return -1;
    var midpoint = (upper + lower) / 2;
    if (inputArray[midpoint] == val)
    {
        return midpoint;
    }
    else if (inputArray[midpoint] > val)
    {
        upper = midpoint - 1;
    }
    else if (inputArray[midpoint] < val)
    {
        lower = midpoint + 1;
    }
    return BinarySearchArray(inputArray, lower, upper, val);
}
Following is the output.
Time taken for Contains : 1.0518
Item found at position : 999999
Time taken for binary search 0.1522
It is clear that the BinarySearch has the upper hand here.
.NET's Contains method does not use BinarySearch internally. Contains is fine for small collections, but for bigger sorted arrays BinarySearch is the better approach.

How to assert uniqueness of a huge collection of strings?

Let's say I have an algorithm which takes an unsigned 64-bit integer as input and yields a string as a result. The string's alphabet is limited to [a-z, A-Z, 0-9] and its maximum length is 16. So that's 62^16, or 47,672,401,706,823,533,450,263,330,816, possible results.
I would like to assert the uniqueness of the algorithm's output. Read: I want to verify there are no collisions.
Is there an easy/quick 'n dirty way to do this, without having to fall back to (e.g.) some kind of database?
[EDIT]
Some clarification: the concerns raised in the comments are legit, but no worries, I wasn't really planning on iterating over all possible combinations, my lifespan will probably be sub-1 century ;) Nor did I write my own algorithm to generate unique IDs. I just saw this and started wondering how one would go about asserting uniqueness for algorithms with very large result sets that can't be handled in memory.
[/EDIT]
As said in the comments, it would take a very long time to compute every possible entry, but just for fun, here is a try:
var workspace = new DirectoryInfo("MyWorkspace");
if (workspace.Exists)
{
    workspace.Delete(true);
}
workspace.Create();
var limit = 23997907;
var buffer = new HashSet<string>();
ulong i = 0;
int j = 0;
var stopWatch = Stopwatch.StartNew();
while (i <= ulong.MaxValue)
{
    var result = YourSuperAlgorythm(i);
    // Check the result with current results
    if (buffer.Contains(result))
    {
        throw new Exception("Failure !");
    }
    // Check the result with older results
    foreach (var file in workspace.GetFiles())
    {
        var content = new HashSet<string>(File.ReadAllText(file.FullName).Split(';'));
        if (content.Contains(result))
        {
            throw new Exception("Failure !");
        }
    }
    buffer.Add(result);
    i++;
    j++;
    if (j == limit)
    {
        stopWatch.Stop();
        Console.WriteLine("Resetting. This loop takes " + stopWatch.Elapsed.TotalMilliseconds + "ms");
        j = 0;
        var file = Path.GetRandomFileName();
        File.WriteAllText(Path.Combine(workspace.FullName, file), String.Join(";", buffer));
        buffer = new HashSet<string>();
        stopWatch.Restart();
    }
}
You could probably optimize it, but you won't have enough of a lifetime to check the results. For now, it has not even created a file to store the first set of entries :D. I will edit this post when one loop is done!
Your only real option is to prove your algorithm mathematically. Good luck with that...
EDIT1: for my test, I use this function:
private static string YourSuperAlgorythm(ulong i)
{
    return i.ToString("x");
}
EDIT2: One loop takes 1477221.4261ms (~25min). And then the String.Join(";", buffer) line failed (OutOfMemory). So 23997907 is not the max value for my try. It must be decreased!

Dice Sorensen Distance error calculating Bigrams without using Intersect method

I have been programming an object to calculate the Dice–Sørensen distance between two strings. The logic of the operation is not so difficult: you count how many two-letter pairs (bigrams) exist in a string, compare them with those of a second string, and then apply this equation
2|X ∩ Y| / (|X| + |Y|)
where |X| and |Y| are the numbers of bigram elements in X and Y. A reference can be found here for further clarity: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
So I have tried looking up how to write this code online in various spots, but every method I have come across uses the 'Intersect' method between two lists, and as far as I am aware this won't work, because if a bigram already exists in the set it won't be added again. For example, if I had the string
'aaaa'
I would like there to be 3 'aa' bigrams, but the Intersect method will only produce one. If I am incorrect in this assumption please tell me, because I wondered why so many people used the Intersect method. My assumption is based on the MSDN documentation: https://msdn.microsoft.com/en-us/library/bb460136(v=vs.90).aspx
So here is the code I have written:
public static double SorensenDiceDistance(this string source, string target)
{
    // formula 2|X intersection Y|
    //         --------------------
    //              |X| + |Y|

    //create variables needed
    List<string> bigrams_source = new List<string>();
    List<string> bigrams_target = new List<string>();
    int source_length;
    int target_length;
    double intersect_count = 0;
    double result = 0;

    Console.WriteLine("DEBUG: string length source is " + source.Length);

    //base case
    if (source.Length == 0 || target.Length == 0)
    {
        return 0;
    }

    //extract bigrams from string 1
    bigrams_source = source.ListBiGrams();
    //extract bigrams from string 2
    bigrams_target = target.ListBiGrams();

    source_length = bigrams_source.Count();
    target_length = bigrams_target.Count();

    Console.WriteLine("DEBUG: bigram counts are source: " + source_length + " . target length : " + target_length);

    //now we have two sets of bigrams compare them in a non distinct loop
    for (int i = 0; i < bigrams_source.Count(); i++)
    {
        for (int y = 0; y < bigrams_target.Count(); y++)
        {
            if (bigrams_source.ElementAt(i) == bigrams_target.ElementAt(y))
            {
                intersect_count++;
                //Console.WriteLine("intersect count is :" + intersect_count);
            }
        }
    }

    Console.WriteLine("intersect line value : " + intersect_count);
    result = (2 * intersect_count) / (source_length + target_length);

    if (result < 0)
    {
        result = Math.Abs(result);
    }
    return result;
}
In the code you can see I call a method called ListBiGrams, and this is how it looks:
public static List<string> ListBiGrams(this string source)
{
    return ListNGrams(source, 2);
}

public static List<string> ListTriGrams(this string source)
{
    return ListNGrams(source, 3);
}

public static List<string> ListNGrams(this string source, int n)
{
    List<string> nGrams = new List<string>();
    if (n > source.Length)
    {
        return null;
    }
    else if (n == source.Length)
    {
        nGrams.Add(source);
        return nGrams;
    }
    else
    {
        for (int i = 0; i < source.Length - n; i++)
        {
            nGrams.Add(source.Substring(i, n));
        }
        return nGrams;
    }
}
So my understanding of the code step by step is
1) pass in strings
2) 0 length check
3) create list and pass up bigrams into them
4) get the lengths of each bigram list
5) nested loop to check in source position[i] against every bigram in target string and then increment i until no more source list to check against
6) perform equation mentioned above taken from wikipedia
7) if result is negative, Math.Abs it to return a positive result (however, I know the result should already be between 0 and 1; this is what keyed me into knowing I was doing something wrong)
The source string I used is source = "this is not a correct string" and the target string was target = "this is a correct string".
The result I got was -0.090909090908.
I'm SURE (99%) that what I'm missing is something small like a miscalculated length somewhere or a miscount. If anyone could point out what I'm doing wrong I'd be really grateful. Thank you for your time!
This looks like homework, yet this similarity metric on strings is new to me so I took a look.
Algorithm implementations in various languages
As you may notice the C# version uses HashSet and takes advantage of the IntersectWith method.
A set is a collection that contains no duplicate elements, and whose
elements are in no particular order.
This solves your string 'aaaa' puzzle. Only one bigram there.
My naive implementation on Rextester
If you prefer Linq then I'd suggest Enumerable.Distinct, Enumerable.Union and Enumerable.Intersect. These should mimic very well the duplicate removal capabilities of the HashSet.
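For reference, a minimal sketch of the HashSet-based approach (my own illustration, using distinct bigrams as described above, not the linked Rextester code):
using System;
using System.Collections.Generic;

public static class DiceExtensions
{
    public static double SorensenDiceCoefficient(this string source, string target)
    {
        var x = Bigrams(source);
        var y = Bigrams(target);
        if (x.Count == 0 || y.Count == 0) return 0;

        int xCount = x.Count;
        int yCount = y.Count;
        x.IntersectWith(y);                       // x now holds only the shared bigrams
        return 2.0 * x.Count / (xCount + yCount); // 2|X ∩ Y| / (|X| + |Y|)
    }

    private static HashSet<string> Bigrams(string s)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + 2 <= s.Length; i++)
            set.Add(s.Substring(i, 2));
        return set;
    }
}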
Also found this nice StringMetric framework written in Scala.

What's an appropriate search/retrieval method for a VERY long list of strings?

This is not a terribly uncommon question, but I still couldn't seem to find an answer that really explained the choice.
I have a very large list of strings (ASCII representations of SHA-256 hashes, to be exact), and I need to query for the presence of a string within that list.
There will be what is likely in excess of 100 million entries in this list, and I will need to repeatably query for the presence of an entry many times.
Given the size, I doubt I can stuff it all into a HashSet<string>. What would be an appropriate retrieval system to maximize performance?
I CAN pre-sort the list, I CAN put it into a SQL table, I CAN put it into a text file, but I'm not sure what really makes the most sense given my application.
Is there a clear winner in terms of performance among these, or other methods of retrieval?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
namespace HashsetTest
{
abstract class HashLookupBase
{
protected const int BucketCount = 16;
private readonly HashAlgorithm _hasher;
protected HashLookupBase()
{
_hasher = SHA256.Create();
}
public abstract void AddHash(byte[] data);
public abstract bool Contains(byte[] data);
private byte[] ComputeHash(byte[] data)
{
return _hasher.ComputeHash(data);
}
protected Data256Bit GetHashObject(byte[] data)
{
var hash = ComputeHash(data);
return Data256Bit.FromBytes(hash);
}
public virtual void CompleteAdding() { }
}
class HashsetHashLookup : HashLookupBase
{
private readonly HashSet<Data256Bit>[] _hashSets;
public HashsetHashLookup()
{
_hashSets = new HashSet<Data256Bit>[BucketCount];
for(int i = 0; i < _hashSets.Length; i++)
_hashSets[i] = new HashSet<Data256Bit>();
}
public override void AddHash(byte[] data)
{
var item = GetHashObject(data);
var offset = item.GetHashCode() & 0xF;
_hashSets[offset].Add(item);
}
public override bool Contains(byte[] data)
{
var target = GetHashObject(data);
var offset = target.GetHashCode() & 0xF;
return _hashSets[offset].Contains(target);
}
}
class ArrayHashLookup : HashLookupBase
{
private Data256Bit[][] _objects;
private int[] _offsets;
private int _bucketCounter;
public ArrayHashLookup(int size)
{
size /= BucketCount;
_objects = new Data256Bit[BucketCount][];
_offsets = new int[BucketCount];
for(var i = 0; i < BucketCount; i++) _objects[i] = new Data256Bit[size + 1];
_bucketCounter = 0;
}
public override void CompleteAdding()
{
for(int i = 0; i < BucketCount; i++) Array.Sort(_objects[i]);
}
public override void AddHash(byte[] data)
{
var hashObject = GetHashObject(data);
_objects[_bucketCounter][_offsets[_bucketCounter]++] = hashObject;
_bucketCounter++;
_bucketCounter %= BucketCount;
}
public override bool Contains(byte[] data)
{
var hashObject = GetHashObject(data);
return _objects.Any(o => Array.BinarySearch(o, hashObject) >= 0);
}
}
struct Data256Bit : IEquatable<Data256Bit>, IComparable<Data256Bit>
{
public bool Equals(Data256Bit other)
{
return _u1 == other._u1 && _u2 == other._u2 && _u3 == other._u3 && _u4 == other._u4;
}
public int CompareTo(Data256Bit other)
{
var rslt = _u1.CompareTo(other._u1); if (rslt != 0) return rslt;
rslt = _u2.CompareTo(other._u2); if (rslt != 0) return rslt;
rslt = _u3.CompareTo(other._u3); if (rslt != 0) return rslt;
return _u4.CompareTo(other._u4);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj))
return false;
return obj is Data256Bit && Equals((Data256Bit) obj);
}
public override int GetHashCode()
{
unchecked
{
var hashCode = _u1.GetHashCode();
hashCode = (hashCode * 397) ^ _u2.GetHashCode();
hashCode = (hashCode * 397) ^ _u3.GetHashCode();
hashCode = (hashCode * 397) ^ _u4.GetHashCode();
return hashCode;
}
}
public static bool operator ==(Data256Bit left, Data256Bit right)
{
return left.Equals(right);
}
public static bool operator !=(Data256Bit left, Data256Bit right)
{
return !left.Equals(right);
}
private readonly long _u1;
private readonly long _u2;
private readonly long _u3;
private readonly long _u4;
private Data256Bit(long u1, long u2, long u3, long u4)
{
_u1 = u1;
_u2 = u2;
_u3 = u3;
_u4 = u4;
}
public static Data256Bit FromBytes(byte[] data)
{
return new Data256Bit(
BitConverter.ToInt64(data, 0),
BitConverter.ToInt64(data, 8),
BitConverter.ToInt64(data, 16),
BitConverter.ToInt64(data, 24)
);
}
}
class Program
{
private const int TestSize = 150000000;
static void Main(string[] args)
{
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var arrayHashLookup = new ArrayHashLookup(TestSize);
PerformBenchmark(arrayHashLookup, TestSize);
}
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var hashsetHashLookup = new HashsetHashLookup();
PerformBenchmark(hashsetHashLookup, TestSize);
}
Console.ReadLine();
}
private static void PerformBenchmark(HashLookupBase hashClass, int size)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < size; i++)
hashClass.AddHash(BitConverter.GetBytes(i * 2));
Console.WriteLine("Hashing and addition took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
hashClass.CompleteAdding();
Console.WriteLine("Hash cleanup (sorting, usually) took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
var found = 0;
for (int i = 0; i < size * 2; i += 10)
{
found += hashClass.Contains(BitConverter.GetBytes(i)) ? 1 : 0;
}
Console.WriteLine("Found " + found + " elements (expected " + (size / 5) + ") in " + sw.ElapsedMilliseconds + "ms");
}
}
}
Results are pretty promising. They run single-threaded. The hashset version can hit a little over 1 million lookups per second at 7.9GB RAM usage. The array-based version uses less RAM (4.6GB). Startup times between the two are nearly identical (388 vs 391 seconds). The hashset trades RAM for lookup performance. Both had to be bucketized because of memory allocation constraints.
Array performance:
Hashing and addition took 307408ms
Hash cleanup (sorting, usually) took 81892ms
Found 30000000 elements (expected 30000000) in 562585ms [53k searches per second]
======================================
Hashset performance:
Hashing and addition took 391105ms
Hash cleanup (sorting, usually) took 0ms
Found 30000000 elements (expected 30000000) in 74864ms [400k searches per second]
If the list changes over time, I would put it in a database.
If the list doesn't change, I would put it in a sorted file and do a binary search for every query.
In both cases, I would use a Bloom filter to minimize I/O. And I would stop using strings and use the binary representation with four ulongs (to avoid the object reference cost).
If you have more than 16 GB (2*64*4/3*100M, assuming Base64 encoding) to spare, an option is to make a Set<string> and be happy. Of course it would fit in less than 7 GB if you use the binary representation.
David Haney's answer shows us that the memory cost is not so easily calculated.
With <gcAllowVeryLargeObjects>, you can have arrays that are much larger. Why not convert those ASCII representations of 256-bit hash codes to a custom struct that implements IComparable<T>? It would look like this:
struct MyHashCode : IComparable<MyHashCode>
{
    // make these readonly and provide a constructor
    ulong h1, h2, h3, h4;

    public int CompareTo(MyHashCode other)
    {
        var rslt = h1.CompareTo(other.h1);
        if (rslt != 0) return rslt;
        rslt = h2.CompareTo(other.h2);
        if (rslt != 0) return rslt;
        rslt = h3.CompareTo(other.h3);
        if (rslt != 0) return rslt;
        return h4.CompareTo(other.h4);
    }
}
You can then create an array of these, which would occupy approximately 3.2 GB. You can search it easily enough with Array.BinarySearch.
Of course, you'll need to convert the user's input from ASCII to one of those hash code structures, but that's easy enough.
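A minimal sketch of that conversion (my addition; it assumes the struct above is given a constructor taking the four parts):
static MyHashCode ParseHash(string hex)
{
    // 64 hex characters -> four 64-bit pieces of 16 hex characters each.
    ulong h1 = ulong.Parse(hex.Substring(0, 16), System.Globalization.NumberStyles.HexNumber);
    ulong h2 = ulong.Parse(hex.Substring(16, 16), System.Globalization.NumberStyles.HexNumber);
    ulong h3 = ulong.Parse(hex.Substring(32, 16), System.Globalization.NumberStyles.HexNumber);
    ulong h4 = ulong.Parse(hex.Substring(48, 16), System.Globalization.NumberStyles.HexNumber);
    return new MyHashCode(h1, h2, h3, h4); // hypothetical constructor
}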
As for performance, this isn't going to be as fast as a hash table, but it's certainly going to be faster than a database lookup or file operations.
Come to think of it, you could create a HashSet<MyHashCode>. You'd have to override the Equals method on MyHashCode, but that's really easy. As I recall, the HashSet costs something like 24 bytes per entry, and you'd have the added cost of the larger struct. Figure five or six gigabytes, total, if you were to use a HashSet. More memory, but still doable, and you get O(1) lookup.
These answers don't factor the string memory into the application. Strings are not 1 char == 1 byte in .NET. Each string object requires a constant 20 bytes for the object data. And the buffer requires 2 bytes per character. Therefore: the memory usage estimate for a string instance is 20 + (2 * Length) bytes.
Let's do some math.
100,000,000 UNIQUE strings
SHA256 = 32 bytes (256 bits)
size of each string = 20 + (2 * 32 bytes) = 84 bytes
Total required memory: 8,400,000,000 bytes = 8.01 gigabytes
It is possible to do so, but this will not store well in .NET memory. Your goal should be to load all of this data into a form that can be accessed/paged without holding it all in memory at once. For that I'd use Lucene.net which will store your data on disk and intelligently search it. Write each string as searchable to an index and then search the index for the string. Now you have a scalable app that can handle this problem; your only limitation will be disk space (and it would take a lot of string to fill up a terabyte drive). Alternatively, put these records in a database and query against it. That's why databases exist: to persist things outside of RAM. :)
For maximum speed, keep them in RAM. It's only ~3GB worth of data, plus whatever overhead your data structure needs. A HashSet<byte[]> should work just fine (with an array-content comparer; see the sketch after this list). If you want to lower overhead and GC pressure, turn on <gcAllowVeryLargeObjects>, use a single byte[], and a HashSet<int> with a custom comparer to index into it.
For speed and low memory usage, store them in a disk-based hash table.
For simplicity, store them in a database.
Whatever you do, you should store them as plain binary data, not strings.
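One note on the HashSet<byte[]> option above (my addition): byte[] uses reference equality by default, so it needs a custom IEqualityComparer<byte[]> to compare contents. A minimal version, assuming 32-byte values:
using System;
using System.Collections.Generic;

sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        // The first four bytes of a SHA-256 value are already well distributed.
        return BitConverter.ToInt32(obj, 0);
    }
}

// Usage: var set = new HashSet<byte[]>(new ByteArrayComparer());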
A hashset splits your data into buckets (arrays). On a 64-bit system, the size limit for an array is 2 GB, which is roughly 2,000,000,000 bytes.
Since a string is a reference type, and since a reference takes eight bytes (assuming a 64-bit system), each bucket can hold approximately 250,000,000 (250 million) references to strings. It seems to be way more than what you need.
That being said, as Tim S. pointed out, it's highly unlikely you'll have the necessary memory to hold the strings themselves, even though the references would fit into the hashset. A database would be a much better fit for this.
You need to be careful in this sort of situation, as most collections in most languages are not really designed or optimized for that sort of scale. As you have already identified, memory usage will be a problem too.
The clear winner here is to use some form of database, either a SQL database or one of the number of NoSQL ones that would be appropriate.
A SQL server is already designed and optimized for keeping track of large amounts of data, indexing it, and searching and querying across those indexes. It's designed for doing exactly what you are trying to do, so it really would be the best way to go.
For performance you could consider using an embedded database that runs within your process and saves the resulting communication overhead. For Java I could recommend a Derby database for that purpose; I'm not familiar enough with the C# equivalents to make a recommendation there, but I imagine suitable databases exist.
It might take a while (1) to dump all the records in a (clustered indexed) table (preferably use their values, not their string representation (2)) and let SQL do the searching. It will handle binary searching for you, it will handle caching for you and it's probably the easiest thing to work with if you need to make changes to the list. And I'm pretty sure that querying things will be just as fast (or faster) than building your own.
(1): For loading the data have a look at the SqlBulkCopy object, things like ADO.NET or Entity Framework are going to be too slow as they load the data row by row.
(2): SHA-256 = 256 bits, so a binary(32) will do; which is only half of the 64 characters you're using now. (Or a quarter of it if you're using Unicode numbers =P) Then again, if you currently have the information in a plain text-file you could still go the char(64) way and simply dump the data in the table using bcp.exe. The database will be bigger, the queries slightly slower (as more I/O is needed + the cache holds only half of the information for the same amount of RAM), etc... But it's quite straightforward to do, and if you're not happy with the result you can still write your own database-loader.
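As a rough sketch of the bulk-load step (my addition; it assumes a table dbo.Hashes with a single binary(32) column and a valid connection string):
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class HashLoader
{
    public static void BulkLoad(IEnumerable<byte[]> hashes, string connectionString)
    {
        // Stage the raw 32-byte values in a DataTable and stream them to SQL Server.
        var table = new DataTable();
        table.Columns.Add("Hash", typeof(byte[]));
        foreach (var h in hashes) table.Rows.Add(h);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Hashes";
            bulk.WriteToServer(table);
        }
    }
}
For 100 million rows you would feed this in batches (or via the IDataReader overload of WriteToServer) rather than one giant DataTable, but the shape of the call stays the same.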
If the set is constant then just make a big sorted hash list (in raw format, 32 bytes each). Store all hashes so that they fit to disk sectors (4KB), and that the beginning of each sector is also the beginning of a hash. Save the first hash in every Nth sector in a special index list, which will easily fit into memory. Use binary search on this index list to determine the starting sector of a sector cluster where the hash should be, and then use another binary search within this sector cluster to find your hash. Value N should be determined based on measuring with test data.
EDIT: an alternative would be to implement your own hash table on disk. The table should use an open addressing strategy, and the probe sequence should be restricted to the same disk sector as much as possible. Empty slots have to be marked with a special value (all zeroes, for instance), so this special value has to be handled specially when querying for existence. To avoid collisions the table should not be more than 80% full, so in your case, with 100 million entries of 32 bytes each, that means the table should have at least 100M / 80% = 125 million slots and a size of 125M * 32 = 4 GB. You only need to create the hashing function that converts the 2^256 domain to 125M, and some nice probe sequence.
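As a rough sketch of the sector-index idea from the first paragraph (my addition; it assumes the hashes are stored sorted as raw 32-byte records in a file, and for simplicity it keeps the first hash of every sector in memory rather than every Nth sector):
using System;
using System.IO;

class SectorIndexLookup
{
    const int HashSize = 32;
    const int SectorSize = 4096; // 128 hashes per 4 KB disk sector

    readonly byte[][] _sectorIndex; // first hash of each sector, kept in memory
    readonly FileStream _file;

    public SectorIndexLookup(string path)
    {
        _file = File.OpenRead(path);
        int sectorCount = (int)((_file.Length + SectorSize - 1) / SectorSize);
        _sectorIndex = new byte[sectorCount][];
        var buffer = new byte[HashSize];
        for (int s = 0; s < sectorCount; s++)
        {
            _file.Seek((long)s * SectorSize, SeekOrigin.Begin);
            _file.Read(buffer, 0, HashSize);
            _sectorIndex[s] = (byte[])buffer.Clone();
        }
    }

    public bool Contains(byte[] hash)
    {
        // Binary search the in-memory index for the last sector whose first hash <= target...
        int lo = 0, hi = _sectorIndex.Length - 1, sector = 0;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (Compare(_sectorIndex[mid], 0, hash) <= 0) { sector = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        // ...then read that single sector and scan it (it is sorted as well).
        var sectorData = new byte[SectorSize];
        _file.Seek((long)sector * SectorSize, SeekOrigin.Begin);
        int read = _file.Read(sectorData, 0, SectorSize);
        for (int offset = 0; offset + HashSize <= read; offset += HashSize)
        {
            int cmp = Compare(sectorData, offset, hash);
            if (cmp == 0) return true;
            if (cmp > 0) return false; // passed the insertion point, so it is absent
        }
        return false;
    }

    static int Compare(byte[] a, int offset, byte[] b)
    {
        for (int i = 0; i < HashSize; i++)
        {
            int d = a[offset + i].CompareTo(b[i]);
            if (d != 0) return d;
        }
        return 0;
    }
}
Each lookup then touches only the small in-memory index plus one 4 KB read, which is the point of the scheme.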
You can try a suffix tree; this question goes over how to do it in C#.
Or you can try a search like so:
var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();
AsParallel helps speed things up by parallelizing the query.
1. Store your hashes as UInt32[8].
2a. Use a sorted list. To compare two hashes, first compare their first elements; if they are equal, then compare the second ones, and so on.
2b. Use a prefix tree.
First of all I would really recommend that you use data compression in order to minimize resource consumption. Cache and memory bandwidth are usually the most limited resources in a modern computer. No matter how you implement this, the biggest bottleneck will be waiting for data.
Also I would recommend using an existing database engine. Many of them have built-in compression, and any database will make use of the RAM you have available. If you have a decent operating system, the system cache will store as much of the file as it can, but most databases have their own caching subsystem anyway.
I can't really tell which DB engine will be best for you; you have to try them out. Personally I often use H2, which has decent performance, can be used as both an in-memory and a file-based database, and has built-in transparent compression.
I see that some have stated that importing your data into a database and building the search index may take longer than some custom solution. That may be true, but importing is usually something quite rare. I am going to assume that you are more interested in fast searches, as they are probably the most common operation.
Also, while SQL databases are both reliable and quite fast, you may want to consider NoSQL databases. Try out a few alternatives; the only way to know which solution will give you the best performance is by benchmarking them.
Also, you should consider whether storing your list as text makes sense. Perhaps you should convert the list to numeric values; that will use less space and therefore give you faster queries. Database import may be significantly slower, but queries may become significantly faster.
If you want it really fast, and the elements are more or less immutable and require exact matches, you can build something that operates like a virus scanner: set the scope to collect the minimum number of potential elements using whatever algorithms are relevant to your entries and search criteria, then iterate through those items, testing against the search item using RtlCompareMemory. You can pull the items from disk if they are fairly contiguous and compare them using something like this:
private Boolean CompareRegions(IntPtr hFile, long nPosition, IntPtr pCompare, UInt32 pSize)
{
    IntPtr pBuffer = IntPtr.Zero;
    UInt32 iRead = 0;
    try
    {
        pBuffer = VirtualAlloc(IntPtr.Zero, pSize, MEM_COMMIT, PAGE_READWRITE);
        SetFilePointerEx(hFile, nPosition, IntPtr.Zero, FILE_BEGIN);
        if (ReadFile(hFile, pBuffer, pSize, ref iRead, IntPtr.Zero) == 0)
            return false;
        if (RtlCompareMemory(pCompare, pBuffer, pSize) == pSize)
            return true; // equal
        return false;
    }
    finally
    {
        if (pBuffer != IntPtr.Zero)
            VirtualFree(pBuffer, pSize, MEM_RELEASE);
    }
}
I would modify this example to grab a large buffer full of entries and loop through those. But managed code may not be the way to go. Fastest is always closer to the calls that do the actual work, so a driver with kernel-mode access built on straight C would be much faster.
Firstly, you say the strings are really SHA256 hashes. Observe that 100 million * 256 bits = 3.2 gigabytes, so it is possible to fit the entire list in memory, assuming you use a memory-efficient data structure.
If you forgive occasional false positives, you can actually use less memory than that. See bloom filters http://billmill.org/bloomfilter-tutorial/
Otherwise, use a sorted data structure to achieve fast querying (time complexity O(log n)).
If you really do want to store the data in memory (because you're querying frequently and need fast results), try Redis. http://redis.io/
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
It has a set datatype http://redis.io/topics/data-types#sets
Redis Sets are an unordered collection of Strings. It is possible to add, remove, and test for existence of members in O(1) (constant time regardless of the number of elements contained inside the Set).
Otherwise, use a database that saves the data on disk.
A plain vanilla binary search tree will give excellent lookup performance on large lists. However, if you don't really need to store the strings and simple membership is what you want to know, a Bloom filter may be a terrific solution. Bloom filters are a compact data structure that you train with all the strings. Once trained, it can quickly tell you whether it has seen a string before. It rarely reports false positives, but never reports false negatives. Depending on the application, they can produce amazing results quickly and with relatively little memory.
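A tiny Bloom-filter sketch to illustrate the idea (my own addition; bit positions are derived by double hashing over a SHA-256 digest of the item, and the sizes are illustrative, not tuned):
using System;
using System.Collections;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int pos in Positions(item)) _bits[pos] = true;
    }

    // false = definitely not present; true = probably present (may be a false positive).
    public bool MightContain(string item)
    {
        foreach (int pos in Positions(item))
            if (!_bits[pos]) return false;
        return true;
    }

    // Derive the k bit positions via double hashing over a SHA-256 digest of the item.
    private IEnumerable<int> Positions(string item)
    {
        byte[] digest;
        using (var sha = SHA256.Create())
            digest = sha.ComputeHash(Encoding.ASCII.GetBytes(item));
        ulong h1 = BitConverter.ToUInt64(digest, 0);
        ulong h2 = BitConverter.ToUInt64(digest, 8);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((h1 + (ulong)i * h2) % (ulong)_bits.Length);
    }
}
A negative answer from MightContain is definitive, so only the rare "probably present" cases need to be confirmed against the full list on disk.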
I developed a solution similar to Insta's approach, but with some differences. In effect, it looks a lot like his chunked array solution. However, instead of just simply splitting the data, my approach builds an index of chunks and directs the search only to the appropriate chunk.
The way the index is built is very similar to a hashtable, with each bucket being a sorted array that can be searched with a binary search. However, I figured that there's little point in computing a hash of an SHA256 hash, so instead I simply take a prefix of the value.
The interesting thing about this technique is that you can tune it by extending the length of the index keys. A longer key means a larger index and smaller buckets. My test case of 8 bits is probably on the small side; 10-12 bits would probably be more effective.
I attempted to benchmark this approach, but it quickly ran out of memory so I wasn't able to see anything interesting in terms of performance.
I also wrote a C implementation. The C implementation wasn't able to deal with a data set of the specified size either (the test machine has only 4GB of RAM), but it did manage somewhat more. (The target data set actually wasn't so much of a problem in that case, it was the test data that filled up the RAM.) I wasn't able to figure out a good way to throw data at it fast enough to really see its performance tested.
While I enjoyed writing this, I'd say overall it mostly provides evidence in favor of the argument that you shouldn't be trying to do this in memory with C#.
public interface IKeyed
{
int ExtractKey();
}
struct Sha256_Long : IComparable<Sha256_Long>, IKeyed
{
private UInt64 _piece1;
private UInt64 _piece2;
private UInt64 _piece3;
private UInt64 _piece4;
public Sha256_Long(string hex)
{
if (hex.Length != 64)
{
throw new ArgumentException("Hex string must contain exactly 64 digits.");
}
UInt64[] pieces = new UInt64[4];
for (int i = 0; i < 4; i++)
{
pieces[i] = UInt64.Parse(hex.Substring(i * 16, 16), NumberStyles.HexNumber);
}
_piece1 = pieces[0];
_piece2 = pieces[1];
_piece3 = pieces[2];
_piece4 = pieces[3];
}
public Sha256_Long(byte[] bytes)
{
if (bytes.Length != 32)
{
throw new ArgumentException("Sha256 values must be exactly 32 bytes.");
}
_piece1 = BitConverter.ToUInt64(bytes, 0);
_piece2 = BitConverter.ToUInt64(bytes, 8);
_piece3 = BitConverter.ToUInt64(bytes, 16);
_piece4 = BitConverter.ToUInt64(bytes, 24);
}
public override string ToString()
{
return String.Format("{0:X}{1:X}{2:X}{3:X}", _piece1, _piece2, _piece3, _piece4);
}
public int CompareTo(Sha256_Long other)
{
if (this._piece1 < other._piece1) return -1;
if (this._piece1 > other._piece1) return 1;
if (this._piece2 < other._piece2) return -1;
if (this._piece2 > other._piece2) return 1;
if (this._piece3 < other._piece3) return -1;
if (this._piece3 > other._piece3) return 1;
if (this._piece4 < other._piece4) return -1;
if (this._piece4 > other._piece4) return 1;
return 0;
}
//-------------------------------------------------------------------
// Implementation of key extraction
public const int KeyBits = 8;
private static UInt64 _keyMask;
private static int _shiftBits;
static Sha256_Long()
{
_keyMask = 0;
for (int i = 0; i < KeyBits; i++)
{
_keyMask |= (UInt64)1 << i;
}
_shiftBits = 64 - KeyBits;
}
public int ExtractKey()
{
// take the top KeyBits bits of _piece1 as the bucket key
return (int)((_piece1 >> _shiftBits) & _keyMask);
}
}
class IndexedSet<T> where T : IComparable<T>, IKeyed
{
private T[][] _keyedSets;
public IndexedSet(IEnumerable<T> source, int keyBits)
{
// Arrange elements into groups by key
var keyedSetsInit = new Dictionary<int, List<T>>();
foreach (T item in source)
{
int key = item.ExtractKey();
List<T> vals;
if (!keyedSetsInit.TryGetValue(key, out vals))
{
vals = new List<T>();
keyedSetsInit.Add(key, vals);
}
vals.Add(item);
}
// Transform the above structure into a more efficient array-based structure
int nKeys = 1 << keyBits;
_keyedSets = new T[nKeys][];
for (int key = 0; key < nKeys; key++)
{
List<T> vals;
if (keyedSetsInit.TryGetValue(key, out vals))
{
_keyedSets[key] = vals.OrderBy(x => x).ToArray();
}
}
}
public bool Contains(T item)
{
int key = item.ExtractKey();
if (_keyedSets[key] == null)
{
return false;
}
else
{
return Search(item, _keyedSets[key]);
}
}
private bool Search(T item, T[] set)
{
int first = 0;
int last = set.Length - 1;
while (first <= last)
{
int midpoint = (first + last) / 2;
int cmp = item.CompareTo(set[midpoint]);
if (cmp == 0)
{
return true;
}
else if (cmp < 0)
{
last = midpoint - 1;
}
else
{
first = midpoint + 1;
}
}
return false;
}
}
class Program
{
//private const int NTestItems = 100 * 1000 * 1000;
private const int NTestItems = 1 * 1000 * 1000;
private static Sha256_Long RandomHash(Random rand)
{
var bytes = new byte[32];
rand.NextBytes(bytes);
return new Sha256_Long(bytes);
}
static IEnumerable<Sha256_Long> GenerateRandomHashes(
Random rand, int nToGenerate)
{
for (int i = 0; i < nToGenerate; i++)
{
yield return RandomHash(rand);
}
}
static void Main(string[] args)
{
Console.WriteLine("Generating test set.");
var rand = new Random();
IndexedSet<Sha256_Long> set =
new IndexedSet<Sha256_Long>(
GenerateRandomHashes(rand, NTestItems),
Sha256_Long.KeyBits);
Console.WriteLine("Testing with random input.");
int nFound = 0;
int nItems = NTestItems;
int waypointDistance = 100000;
int waypoint = 0;
for (int i = 0; i < nItems; i++)
{
if (++waypoint == waypointDistance)
{
Console.WriteLine("Test lookups complete: " + (i + 1));
waypoint = 0;
}
var item = RandomHash(rand);
nFound += set.Contains(item) ? 1 : 0;
}
Console.WriteLine("Testing complete.");
Console.WriteLine(String.Format("Found: {0} / {1}", nFound, nItems));
Console.ReadKey();
}
}
