Listing Coin Toss Possibilities - c#

Good day to all. I'm learning programming and I'm trying to put all possible outcomes of a series of coin tosses into a list of strings. The code below works for a small number of tosses, but at around 20 I get an out-of-memory exception, and I want something that could handle higher numbers (possibly around 40).
I tried searching for similar problems with solutions, but nothing seems to match my issue or my level.
Can anyone please tell me what I'm doing wrong or nudge me in the right direction?
public static class CoinToss
{
    private const int numberTosses = 11;
    private static List<string> allPossibilities = new List<string>();

    public static void WriteAllPossibilities()
    {
        List<string> temp = new List<string>();
        allPossibilities.Add("H");
        allPossibilities.Add("T");
        int tossIndex = 1;
        while (tossIndex++ < numberTosses)
        {
            /* Clear the temporary */
            temp.Clear();
            foreach (string outcome in allPossibilities)
            {
                temp.Add(outcome + "H");
                temp.Add(outcome + "T");
            }
            /* Remove all items in all possibilities */
            allPossibilities.Clear();
            /* Copy allPaths from temp */
            foreach (string path in temp)
            {
                allPossibilities.Add(path);
            }
        }
        foreach (string path in allPossibilities)
        {
            Console.WriteLine(path);
        }
        Console.WriteLine(allPossibilities.Count);
    }
}

I am surprised that there's not an example of this problem, among all the "coin flip" questions already on Stack Overflow. But, I couldn't find a suitable duplicate. You may be the first to post a question about this specific issue in the specific context of coin flips.
In a comment, you wrote:
I need to process the list afterwards thats why I'm keeping it in a list
That is never going to work. As has been pointed out in the comments, representing all possible combinations of 40 coin flips requires 2^40 different sequences. Even if you're conservative and store each sequence packed into five bytes (one bit per flip), that's still 2^40 times five bytes of memory, which is nearly 5.5 terabytes. That is technically within the limits of a 64-bit process, but not within any reasonable practical limit in terms of what Windows or a typical computer running Windows will provide.
Furthermore, even if you could store all those sequences, it would take an exceptional amount of time to do anything useful with them. And no human would be able to read fast enough to see the entire list; even assuming a super-fast reader who gets through 10 sequences per second, that's still roughly 3,500 years.
That said, the basic logic to generate them is easy enough, even if there's not enough time for the program to complete. One approach would look something like this:
class Program
{
    const int flipCount = 40;
    const long counterMax = (long)1 << flipCount;

    static void Main(string[] args)
    {
        for (long flipSequence = 0; flipSequence < counterMax; flipSequence++)
        {
            DisplayFlips(flipSequence);
        }
    }

    private static readonly StringBuilder sb = new StringBuilder(flipCount);

    private static void DisplayFlips(long flipSequence)
    {
        sb.Clear();
        for (int i = 0; i < flipCount; i++)
        {
            sb.Append((flipSequence & 1) != 0 ? 'T' : 'H');
            flipSequence >>= 1;
        }
        Console.WriteLine(sb);
    }
}

Related

Optimize recursive function that generates a sequence of chars

I have a function that needs to generate a sequence of num chars to test a security algorithm.
For instance, patternLength of 4 it would generate:
["0000", "0001", "0002", ... , "9999"]
and 3 would generate:
["000", "001", ... "999"] and so on.
As we know, recursion can be pretty expensive and begins to slow to a crawl at higher lengths, so I'm hoping to speed it up with some caching or DP. Is this at all possible?
Current function with recursion:
private static List<char> PossibleCharacters = new List<char>()
{
    '0','1','2','3','4','5','6','7','8','9'
};

public static List<string> SequenceGenerator(int patternLength)
{
    List<string> result = new List<string>();
    if (patternLength > 0)
    {
        List<string> prev = SequenceGenerator(patternLength - 1);
        foreach (string entry in prev)
        {
            foreach (char ch in PossibleCharacters)
            {
                result.Add(entry + ch);
            }
        }
    }
    else
    {
        result.Add("");
    }
    return result;
}
My messy attempt. I'm building the list starting with length 1, then 2, and so on. It gets to 999 -> 1000, where it becomes wrong; it should be 999 -> 0000. Of course, I'll need to clear the contents of the cache where the lengths aren't what I want.
// patternLength = 4
string[] result = new string[10000];
string[] cache = new string[10000];
result.Append("");
dp[0] = "0";
int i = 1;
int j = 0;
foreach (string entry in result)
{
    foreach (char ch in PossibleCharacters)
    {
        cache[i] = entry + ch;
        i++;
    }
    result[j + 1] = j.ToString();
    j++;
}
return cache.ToList();
Thanks all for your time.
Your code is inherently bounded by an O(10^n) minimum time complexity (where n is the length of the tested string), and there is no way to bypass this limit without changing its functionality. In other words, your code is likely slow mostly because you are doing a lot of work, not because you implemented that work in a particularly inefficient way. Optimizing the string generation is very unlikely to provide any significant improvement in run time. Multithreading is more likely to provide real-world performance improvements in your scenario, so personally I would start there. Additionally, if you only need to access each tested string once, it is preferable to generate the strings one by one as they are used, in order to reduce your program's space complexity. Currently you are storing every string for the entire duration of the function's execution, which is a potentially wasteful use of memory if each string is only needed once.
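For illustration, a minimal sketch of that one-by-one generation with an iterator; nothing is stored, so memory stays constant regardless of patternLength (TestSecurityAlgorithm is a hypothetical stand-in for whatever consumes each candidate):
using System;
using System.Collections.Generic;

static class LazySequenceGenerator
{
    // Yields "0000", "0001", ..., "9999" (for patternLength = 4) one at a time.
    public static IEnumerable<string> Candidates(int patternLength)
    {
        long limit = (long)Math.Pow(10, patternLength); // fine for the small lengths discussed here
        for (long i = 0; i < limit; i++)
            yield return i.ToString().PadLeft(patternLength, '0');
    }
}

// Usage: each string exists only while it is being tested.
// foreach (string candidate in LazySequenceGenerator.Candidates(4))
//     TestSecurityAlgorithm(candidate);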

What is the best way to use Parallel.For in a recursive algorithm?

I am building software to evaluate many possible solutions and am trying to introduce parallel processing to speed up the calculations. My first attempt was to build a datatable with each row being a solution to evaluate but building the datatable takes quite some time and I am running into memory issues when the number of possible solutions goes into the millions.
The problem which warrants these solutions is structured as follows:
There is a range of dates for x number of events which must be done in order. The solutions to evaluate could look as follows, with each solution being a row, the events being the columns, and the day numbers being the values.
Given 3 days (0 to 2) and three events:
0 0 0
0 0 1
0 0 2
0 1 1
0 1 2
0 2 2
1 1 1
1 1 2
1 2 2
2 2 2
My new plan was to use recursion and evaluate the solutions as I go rather than build a solution set to then evaluate.
for (int day = 0; day < maxdays; day++)
{
    List<int> mydays = new List<int>();
    mydays.Add(day);
    EvalEvent(0, day, mydays);
}

private void EvalEvent(int eventnum, int day, List<int> mydays)
{
    // events must be on same day or after previous events
    Parallel.For(day, maxdays, day2 =>
    {
        List<int> mydays2 = new List<int>();
        for (int a = 0; a < mydays.Count; a++)
        {
            mydays2.Add(mydays[a]);
        }
        mydays2.Add(day2);
        if (eventnum < eventcount - 1) // proceed to next event
        {
            EvalEvent(eventnum + 1, day2, mydays2);
        }
        else
        {
            EvalSolution(mydays2);
        }
    });
}
My question is whether this is actually an efficient use of parallel processing, or will too many threads be spawned and slow it down? Should the parallel loop only be done on the last (or maybe the last few) values of eventnum, or is there a better way to approach the problem?
The requested old code is pretty much as follows:
private int daterange;
private int events;

private void ScheduleIt()
{
    daterange = 10;
    events = 6;
    CreateSolutions();
    int best = GetBest();
}

private DataTable Options;

private bool CreateSolutions()
{
    Options = new DataTable();
    Options.Columns.Add();
    for (int day1 = 0; day1 <= daterange; day1++)
    {
        Options.Rows.Add(day1);
    }
    for (int ev = 1; ev < events; ev++)
    {
        Options.Columns.Add();
        foreach (DataRow dr in Options.Rows)
        {
            dr[Options.Columns.Count - 1] = dr[Options.Columns.Count - 2];
        }
        int rows = Options.Rows.Count;
        for (int day1 = 1; day1 <= daterange; day1++)
        {
            for (int i = 0; i < rows; i++)
            {
                if (day1 > Convert.ToInt32(Options.Rows[i][Options.Columns.Count - 2]))
                {
                    try
                    {
                        Options.Rows.Add();
                        for (int col = 0; col < Options.Columns.Count - 1; col++)
                        {
                            Options.Rows[Options.Rows.Count - 1][col] = Options.Rows[i][col];
                        }
                        Options.Rows[Options.Rows.Count - 1][Options.Columns.Count - 1] = day1;
                    }
                    catch (Exception ex)
                    {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}

private int GetBest()
{
    int bestopt = 0;
    double bestscore = 999999999;
    Parallel.For(0, Options.Rows.Count, opt =>
    {
        double score = 0;
        for (int i = 0; i < Options.Columns.Count; i++)
        {
            score += Convert.ToDouble(Options.Rows[opt][i]); // just a stand-in calc for a score
        }
        if (score < bestscore)
        {
            bestscore = score;
            bestopt = opt;
        }
    });
    return bestopt;
}
Even if done without errors, this cannot significantly speed up your solution.
It looks like each level of recursion starts multiple (say up to k) next-level calls across n levels. That essentially means the code is O(k^n), which grows very fast. A non-algorithmic speedup of such an O(k^n) solution is essentially useless (unless both k and n are very small). In particular, running the code in parallel only gives you a constant factor of speedup (roughly the number of threads supported by your CPUs), which does not change the complexity at all.
Indeed, creating an exponentially large number of requests for new threads will likely cause problems by itself, just from managing the threads.
In addition to not significantly improving performance, parallel code is harder to write, as it needs proper synchronization or clever data partitioning - neither of which seems to be present in your case.
Parallelization works best when the workload is bulky and balanced. Ideally you would like your work split into as many independent partitions as there are logical processors on your machine, ensuring that all partitions are approximately the same size. This way all available processors work at maximum efficiency for approximately the same duration, and you get the results in the shortest possible time.
Of course you should start with a working and bug-free serial implementation, and then think about ways to partition your work. The easiest way is usually not optimal. For example, an easy path is to convert your work to a LINQ query and then parallelize it with AsParallel() (making it PLINQ). This usually results in partitioning that is too granular and introduces too much overhead. If you can't find ways to improve it, you can then go the way of Parallel.For or Parallel.ForEach, which is a bit more complex.
A LINQ implementation should probably start by creating an iterator that produces all your units of work (Events or Solutions, it's not very clear to me).
public static IEnumerable<Solution> GetAllSolutions()
{
    for (int day = 0; day < 3; day++)
    {
        for (int ev = 0; ev < 3; ev++)
        {
            yield return new Solution(); // ???
        }
    }
}
It will certainly be helpful if you have created concrete classes to represent the entities you are dealing with.
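As a rough sketch only (it assumes the GetAllSolutions() iterator above and a hypothetical EvalSolution(Solution) that returns a numeric score), the PLINQ step could then look like this:
using System.Linq;

static Solution FindBestSolution()
{
    return GetAllSolutions()
        .AsParallel()                                    // PLINQ partitions the work across cores
        .Select(s => (Solution: s, Score: EvalSolution(s)))
        .OrderBy(x => x.Score)                           // lower score wins, as in the question's GetBest()
        .First()
        .Solution;
}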

Deleting from array, mirrored (strange) behavior

The title may seem a little odd, because I have no idea how to describe this in one sentence.
For the course Algorithms we have to micro-optimize some stuff; one part is finding out how deleting from an array works. The assignment is to delete something from an array and re-align the contents so that there are no gaps. I think it is quite similar to how std::vector::erase works in C++.
Because I like the idea of understanding everything low-level, I went a little further and tried to bench my solutions. This presented some weird results.
At first, here is a little code that I used:
class Test
{
    Stopwatch sw;
    Obj[] objs;

    public Test()
    {
        this.sw = new Stopwatch();
        this.objs = new Obj[1000000];
        // Fill objs
        for (int i = 0; i < objs.Length; i++)
        {
            objs[i] = new Obj(i);
        }
    }

    public void test()
    {
        // Time deletion
        sw.Restart();
        deleteValue(400000, objs);
        sw.Stop();
        // Show timings
        Console.WriteLine(sw.Elapsed);
    }

    // Delete function
    // value is the to-search-for item in the list of objects
    private static void deleteValue(int value, Obj[] list)
    {
        for (int i = 0; i < list.Length; i++)
        {
            if (list[i].Value == value)
            {
                for (int j = i; j < list.Length - 1; j++)
                {
                    list[j] = list[j + 1];
                    //if (list[j + 1] == null) {
                    //    break;
                    //}
                }
                list[list.Length - 1] = null;
                break;
            }
        }
    }
}
I would just create this class and call the test() method. I did this in a loop for 25 times.
My findings:
The first round takes a lot longer than the other 24; I think this is because of caching, but I am not sure.
When I use a value that is in the start of the list, it has to move more items in memory than when I use a value at the end, though it still seems to take less time.
Benchtimes differ quite a bit.
When I enable the commented if, performance goes up (10-20%) even if the value I search for is almost at the end of the list (which means the if goes off a lot of times without actually being useful).
I have no idea why these things happen; can someone explain (some of) them? And if someone who is a pro at this sees this, where can I find more info on doing this the most efficient way?
Edit after testing:
I did some testing and found some interesting results. I run the test on an array with a size of a million items, filled with a million objects. I run that 25 times and report the cumulative time in milliseconds. I do that 10 times and take the average of that as a final value.
When I run the test with my function described just above here I get a score of:
362,1
When I run it with the answer of dbc I get a score of:
846,4
So mine was faster, but then I started to experiment with a half-empty array and things started to get weird. To get rid of the inevitable NullReferenceExceptions I added an extra check to the if (thinking it would hurt performance a bit more), like so:
if (fromItem != null && fromItem.Value != value)
list[to++] = fromItem;
This seemed to not only work, but improve performance dramatically! Now I get a score of:
247,9
The weird thing is, the scores seem too low to be true, but they sometimes spike; this is the set I took the average from:
94, 26, 966, 36, 632, 95, 47, 35, 109, 439
So the extra evaluation seems to improve my performance, despite doing an extra check. How is this possible?
You are using Stopwatch to time your method. This calculates the total clock time taken during your method call, which could include the time required for .Net to initially JIT your method, interruptions for garbage collection, or slowdowns caused by system loads from other processes. Noise from these sources will likely dominate noise due to cache misses.
This answer gives some suggestions as to how you can minimize some of the noise from garbage collection or other processes. To eliminate JIT noise, you should call your method once without timing it -- or show the time taken by the first call in a separate column in your results table since it will be so different. You might also consider using a proper profiler which will report exactly how much time your code used exclusive of "noise" from other threads or processes.
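For example, a minimal sketch of that warm-up-then-measure idea, reusing the Test class from the question (the run count is arbitrary):
static void RunBenchmark()
{
    var warmup = new Test();
    warmup.test();                     // first call includes JIT cost; report it separately

    GC.Collect();
    GC.WaitForPendingFinalizers();     // reduce GC noise before the timed runs

    for (int run = 0; run < 25; run++)
    {
        var t = new Test();            // fresh array each run, so every run does the same work
        t.test();                      // test() already prints sw.Elapsed for the run
    }
}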
Finally, I'll note that your algorithm to remove matching items from an array and shift everything else down uses a nested loop, which is not necessary and will access items in the array after the matching index twice. The standard algorithm looks like this:
public static void RemoveFromArray(this Obj[] array, int value)
{
    int to = 0;
    for (int from = 0; from < array.Length; from++)
    {
        var fromItem = array[from];
        if (fromItem.Value != value)
            array[to++] = fromItem;
    }
    for (; to < array.Length; to++)
    {
        array[to] = default(Obj);
    }
}
However, instead of hand-rolling the shift, you might experiment with using Array.Copy() to move the tail of the array down in one call, since (I believe) internally it does the copy in unmanaged code.
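A sketch of that variant, with the same semantics as deleteValue above but with the shift done by a single bulk copy:
private static void DeleteValueWithCopy(int value, Obj[] list)
{
    int index = Array.FindIndex(list, o => o.Value == value);
    if (index < 0)
        return;                                              // value not present

    // Shift the tail down by one position in a single call (Array.Copy handles the overlap correctly).
    Array.Copy(list, index + 1, list, index, list.Length - index - 1);
    list[list.Length - 1] = null;                            // clear the now-duplicated last slot
}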

What's an appropriate search/retrieval method for a VERY long list of strings?

This is not a terribly uncommon question, but I still couldn't seem to find an answer that really explained the choice.
I have a very large list of strings (ASCII representations of SHA-256 hashes, to be exact), and I need to query for the presence of a string within that list.
There will be what is likely in excess of 100 million entries in this list, and I will need to repeatedly query for the presence of an entry many times.
Given the size, I doubt I can stuff it all into a HashSet<string>. What would be an appropriate retrieval system to maximize performance?
I CAN pre-sort the list, I CAN put it into a SQL table, I CAN put it into a text file, but I'm not sure what really makes the most sense given my application.
Is there a clear winner in terms of performance among these, or other methods of retrieval?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
namespace HashsetTest
{
abstract class HashLookupBase
{
protected const int BucketCount = 16;
private readonly HashAlgorithm _hasher;
protected HashLookupBase()
{
_hasher = SHA256.Create();
}
public abstract void AddHash(byte[] data);
public abstract bool Contains(byte[] data);
private byte[] ComputeHash(byte[] data)
{
return _hasher.ComputeHash(data);
}
protected Data256Bit GetHashObject(byte[] data)
{
var hash = ComputeHash(data);
return Data256Bit.FromBytes(hash);
}
public virtual void CompleteAdding() { }
}
class HashsetHashLookup : HashLookupBase
{
private readonly HashSet<Data256Bit>[] _hashSets;
public HashsetHashLookup()
{
_hashSets = new HashSet<Data256Bit>[BucketCount];
for(int i = 0; i < _hashSets.Length; i++)
_hashSets[i] = new HashSet<Data256Bit>();
}
public override void AddHash(byte[] data)
{
var item = GetHashObject(data);
var offset = item.GetHashCode() & 0xF;
_hashSets[offset].Add(item);
}
public override bool Contains(byte[] data)
{
var target = GetHashObject(data);
var offset = target.GetHashCode() & 0xF;
return _hashSets[offset].Contains(target);
}
}
class ArrayHashLookup : HashLookupBase
{
private Data256Bit[][] _objects;
private int[] _offsets;
private int _bucketCounter;
public ArrayHashLookup(int size)
{
size /= BucketCount;
_objects = new Data256Bit[BucketCount][];
_offsets = new int[BucketCount];
for(var i = 0; i < BucketCount; i++) _objects[i] = new Data256Bit[size + 1];
_bucketCounter = 0;
}
public override void CompleteAdding()
{
for(int i = 0; i < BucketCount; i++) Array.Sort(_objects[i]);
}
public override void AddHash(byte[] data)
{
var hashObject = GetHashObject(data);
_objects[_bucketCounter][_offsets[_bucketCounter]++] = hashObject;
_bucketCounter++;
_bucketCounter %= BucketCount;
}
public override bool Contains(byte[] data)
{
var hashObject = GetHashObject(data);
return _objects.Any(o => Array.BinarySearch(o, hashObject) >= 0);
}
}
struct Data256Bit : IEquatable<Data256Bit>, IComparable<Data256Bit>
{
public bool Equals(Data256Bit other)
{
return _u1 == other._u1 && _u2 == other._u2 && _u3 == other._u3 && _u4 == other._u4;
}
public int CompareTo(Data256Bit other)
{
var rslt = _u1.CompareTo(other._u1); if (rslt != 0) return rslt;
rslt = _u2.CompareTo(other._u2); if (rslt != 0) return rslt;
rslt = _u3.CompareTo(other._u3); if (rslt != 0) return rslt;
return _u4.CompareTo(other._u4);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj))
return false;
return obj is Data256Bit && Equals((Data256Bit) obj);
}
public override int GetHashCode()
{
unchecked
{
var hashCode = _u1.GetHashCode();
hashCode = (hashCode * 397) ^ _u2.GetHashCode();
hashCode = (hashCode * 397) ^ _u3.GetHashCode();
hashCode = (hashCode * 397) ^ _u4.GetHashCode();
return hashCode;
}
}
public static bool operator ==(Data256Bit left, Data256Bit right)
{
return left.Equals(right);
}
public static bool operator !=(Data256Bit left, Data256Bit right)
{
return !left.Equals(right);
}
private readonly long _u1;
private readonly long _u2;
private readonly long _u3;
private readonly long _u4;
private Data256Bit(long u1, long u2, long u3, long u4)
{
_u1 = u1;
_u2 = u2;
_u3 = u3;
_u4 = u4;
}
public static Data256Bit FromBytes(byte[] data)
{
return new Data256Bit(
BitConverter.ToInt64(data, 0),
BitConverter.ToInt64(data, 8),
BitConverter.ToInt64(data, 16),
BitConverter.ToInt64(data, 24)
);
}
}
class Program
{
private const int TestSize = 150000000;
static void Main(string[] args)
{
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var arrayHashLookup = new ArrayHashLookup(TestSize);
PerformBenchmark(arrayHashLookup, TestSize);
}
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var hashsetHashLookup = new HashsetHashLookup();
PerformBenchmark(hashsetHashLookup, TestSize);
}
Console.ReadLine();
}
private static void PerformBenchmark(HashLookupBase hashClass, int size)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < size; i++)
hashClass.AddHash(BitConverter.GetBytes(i * 2));
Console.WriteLine("Hashing and addition took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
hashClass.CompleteAdding();
Console.WriteLine("Hash cleanup (sorting, usually) took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
var found = 0;
for (int i = 0; i < size * 2; i += 10)
{
found += hashClass.Contains(BitConverter.GetBytes(i)) ? 1 : 0;
}
Console.WriteLine("Found " + found + " elements (expected " + (size / 5) + ") in " + sw.ElapsedMilliseconds + "ms");
}
}
}
Results are pretty promising. They run single-threaded. The hashset version can hit a little over 1 million lookups per second at 7.9GB RAM usage. The array-based version uses less RAM (4.6GB). Startup times between the two are nearly identical (388 vs 391 seconds). The hashset trades RAM for lookup performance. Both had to be bucketized because of memory allocation constraints.
Array performance:
Hashing and addition took 307408ms
Hash cleanup (sorting, usually) took 81892ms
Found 30000000 elements (expected 30000000) in 562585ms [53k searches per second]
======================================
Hashset performance:
Hashing and addition took 391105ms
Hash cleanup (sorting, usually) took 0ms
Found 30000000 elements (expected 30000000) in 74864ms [400k searches per second]
If the list changes over time, I would put it in a database.
If the list doesn't change, I would put it in a sorted file and do a binary search for every query.
In both cases, I would use a Bloom filter to minimize I/O. And I would stop using strings and use the binary representation with four ulongs (to avoid the object reference cost).
If you have more than 16 GB (2*64*4/3*100M, assuming Base64 encoding) to spare, an option is to make a HashSet<string> and be happy. Of course it would fit in less than 7 GB if you use the binary representation.
David Haney's answer shows us that the memory cost is not so easily calculated.
With <gcAllowVeryLargeObjects>, you can have arrays that are much larger. Why not convert those ASCII representations of 256-bit hash codes to a custom struct that implements IComparable<T>? It would look like this:
struct MyHashCode : IComparable<MyHashCode>
{
    // make these readonly and provide a constructor
    ulong h1, h2, h3, h4;

    public int CompareTo(MyHashCode other)
    {
        var rslt = h1.CompareTo(other.h1);
        if (rslt != 0) return rslt;
        rslt = h2.CompareTo(other.h2);
        if (rslt != 0) return rslt;
        rslt = h3.CompareTo(other.h3);
        if (rslt != 0) return rslt;
        return h4.CompareTo(other.h4);
    }
}
You can then create an array of these, which would occupy approximately 3.2 GB. You can search it easily enough with Array.BinarySearch.
Of course, you'll need to convert the user's input from ASCII to one of those hash code structures, but that's easy enough.
As for performance, this isn't going to be as fast as a hash table, but it's certainly going to be faster than a database lookup or file operations.
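A usage sketch of that sorted-array approach, assuming MyHashCode has been given the suggested constructor taking four ulongs (loading the ~100M entries is left out):
using System;
using System.Globalization;

static class SortedHashLookup
{
    private static MyHashCode[] _table;        // ~100M entries, roughly 3.2 GB

    public static void Initialize(MyHashCode[] hashes)
    {
        Array.Sort(hashes);                    // one-time O(n log n) cost after loading
        _table = hashes;
    }

    public static bool Contains(string hex)    // 64-character ASCII hash from the user
    {
        var parts = new ulong[4];
        for (int i = 0; i < 4; i++)
            parts[i] = ulong.Parse(hex.Substring(i * 16, 16), NumberStyles.HexNumber);

        var key = new MyHashCode(parts[0], parts[1], parts[2], parts[3]);
        return Array.BinarySearch(_table, key) >= 0;   // O(log n) per query
    }
}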
Come to think of it, you could create a HashSet<MyHashCode>. You'd have to override the Equals method on MyHashCode, but that's really easy. As I recall, the HashSet costs something like 24 bytes per entry, and you'd have the added cost of the larger struct. Figure five or six gigabytes, total, if you were to use a HashSet. More memory, but still doable, and you get O(1) lookup.
These answers don't factor the string memory into the application. Strings are not 1 char == 1 byte in .NET. Each string object requires a constant 20 bytes for the object data. And the buffer requires 2 bytes per character. Therefore: the memory usage estimate for a string instance is 20 + (2 * Length) bytes.
Let's do some math.
100,000,000 UNIQUE strings
SHA256 = 32 bytes (256 bits)
size of each string = 20 + (2 * 32 bytes) = 84 bytes
Total required memory: 8,400,000,000 bytes, or roughly 7.8 gigabytes
It is possible to do so, but this will not store well in .NET memory. Your goal should be to load all of this data into a form that can be accessed/paged without holding it all in memory at once. For that I'd use Lucene.net, which will store your data on disk and intelligently search it. Write each string as searchable to an index and then search the index for the string. Now you have a scalable app that can handle this problem; your only limitation will be disk space (and it would take a lot of strings to fill up a terabyte drive). Alternatively, put these records in a database and query against it. That's why databases exist: to persist things outside of RAM. :)
For maximum speed, keep them in RAM. It's only ~3GB worth of data, plus whatever overhead your data structure needs. A HashSet<byte[]> should work just fine. If you want to lower overhead and GC pressure, turn on <gcAllowVeryLargeObjects>, use a single byte[], and a HashSet<int> with a custom comparer to index into it.
For speed and low memory usage, store them in a disk-based hash table.
For simplicity, store them in a database.
Whatever you do, you should store them as plain binary data, not strings.
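One caveat with the simple in-RAM option: HashSet<byte[]> compares arrays by reference by default, so it needs a content-based comparer. A minimal sketch (the hash-code choice assumes well-mixed SHA-256 bytes):
using System;
using System.Collections.Generic;
using System.Linq;

sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y) =>
        x != null && y != null && x.SequenceEqual(y);      // content comparison

    public int GetHashCode(byte[] obj) =>
        BitConverter.ToInt32(obj, 0);                      // first 4 bytes of a SHA-256 are already well mixed
}

static class BinaryHashSetExample
{
    public static HashSet<byte[]> Build(IEnumerable<byte[]> hashes)  // 32-byte values
    {
        var set = new HashSet<byte[]>(new ByteArrayComparer());
        foreach (var h in hashes)
            set.Add(h);
        return set;
    }
    // membership test: set.Contains(queryBytes)
}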
A hashset splits your data into buckets (arrays). On a 64-bit system, the size limit for an array is 2 GB, which is roughly 2,000,000,000 bytes.
Since a string is a reference type, and since a reference takes eight bytes (assuming a 64-bit system), each bucket can hold approximately 250,000,000 (250 million) references to strings. It seems to be way more than what you need.
That being said, as Tim S. pointed out, it's highly unlikely you'll have the necessary memory to hold the strings themselves, even though the references would fit into the hashset. A database would be a much better fit for this.
You need to be careful in this sort of situation, as most collections in most languages are not really designed or optimized for that sort of scale. As you have already identified, memory usage will be a problem too.
The clear winner here is to use some form of database - either a SQL database or one of the many NoSQL ones that would be appropriate.
A SQL server is already designed and optimized for keeping track of large amounts of data, indexing it, and searching and querying across those indexes. It's designed for doing exactly what you are trying to do, so it really would be the best way to go.
For performance you could consider using an embedded database that runs within your process and saves the resulting communications overhead. For Java I could recommend Derby for that purpose; I'm not familiar enough with the C# equivalents to make a recommendation there, but I imagine suitable databases exist.
It might take a while (1) to dump all the records in a (clustered indexed) table (preferably use their values, not their string representation (2)) and let SQL do the searching. It will handle binary searching for you, it will handle caching for you and it's probably the easiest thing to work with if you need to make changes to the list. And I'm pretty sure that querying things will be just as fast (or faster) than building your own.
(1): For loading the data have a look at the SqlBulkCopy object, things like ADO.NET or Entity Framework are going to be too slow as they load the data row by row.
(2): SHA-256 = 256 bits, so a binary(32) will do; which is only half of the 64 characters you're using now. (Or a quarter of it if you're using Unicode numbers =P) Then again, if you currently have the information in a plain text-file you could still go the char(64) way and simply dump the data in the table using bcp.exe. The database will be bigger, the queries slightly slower (as more I/O is needed + the cache holds only half of the information for the same amount of RAM), etc... But it's quite straightforward to do, and if you're not happy with the result you can still write your own database-loader.
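A hedged sketch of the bulk-load step, assuming a table named Hashes with a single BINARY(32) column named HashValue (both names are illustrative); in practice you would feed SqlBulkCopy in batches rather than building one giant DataTable:
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class HashLoader
{
    public static void BulkLoad(string connectionString, IEnumerable<byte[]> hashes)
    {
        var table = new DataTable();
        table.Columns.Add("HashValue", typeof(byte[]));
        foreach (var hash in hashes)
            table.Rows.Add(hash);                  // one 32-byte value per row

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "Hashes";
            bulk.WriteToServer(table);             // far faster than row-by-row inserts
        }
    }
}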
If the set is constant, then just make a big sorted hash list (in raw format, 32 bytes each). Store all hashes so that they align with disk sectors (4 KB), and so that the beginning of each sector is also the beginning of a hash. Save the first hash of every Nth sector in a special index list, which will easily fit into memory. Use binary search on this index list to determine the starting sector of the sector cluster where the hash should be, and then use another binary search within this sector cluster to find your hash. The value N should be determined based on measurements with test data.
EDIT: an alternative would be to implement your own hash table on disk. The table should use an open addressing strategy, and the probe sequence should be restricted to the same disk sector as much as possible. Empty slots have to be marked with a special value (all zeroes, for instance), so this special value has to be handled specially when queried for existence. To avoid collisions the table should not be more than 80% full, so in your case, with 100 million entries of 32 bytes each, the table should have at least 100M / 80% = 125 million slots and a size of 125M * 32 = 4 GB. You only need to create the hashing function that converts the 2^256 domain to 125M, and some nice probe sequence.
You can try a suffix tree; this question goes over how to do it in C#.
Or you can try a search like so
var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();
AsParallel will help speed things up as it creates a parallelization of a query.
1. Store your hashes as UInt32[8].
2a. Use a sorted list. To compare two hashes, first compare their first elements; if they are equal, then compare the second ones, and so on.
2b. Use a prefix tree.
First of all, I would really recommend that you use data compression in order to minimize resource consumption. Cache and memory bandwidth are usually the most limited resources in a modern computer. No matter how you implement this, the biggest bottleneck will be waiting for data.
Also I would recommend using an existing database engine. Many of them have built-in compression, and any database will make use of the RAM you have available. If you have a decent operating system, the system cache will store as much of the file as it can, but most databases have their own caching subsystem as well.
I can't really tell which database engine will be best for you; you have to try them out. Personally I often use H2, which has decent performance, can be used as both an in-memory and a file-based database, and has built-in transparent compression.
I see that some have stated that importing your data into a database and building the search index may take longer than some custom solution. That may be true, but importing is usually something that's quite rare. I am going to assume that you are more interested in fast searches, as they are probably the most common operation.
Also, while SQL databases are both reliable and quite fast, you may want to consider NoSQL databases. Try out a few alternatives. The only way to know which solution gives you the best performance is by benchmarking them.
Also, you should consider whether storing your list as text makes sense. Perhaps you should convert the list to numeric values. That will use less space and therefore give you faster queries. The database import may be significantly slower, but queries may become significantly faster.
If you want it really fast, and the elements are more or less immutable and require exact matches, you can build something that operates like a virus scanner: set the scope to collect the minimum number of potential elements using whatever algorithms are relevant to your entries and search criteria, then iterate through those items, testing against the search item using RtlCompareMemory. You can pull the items from disk if they are fairly contiguous and compare using something like this:
private Boolean CompareRegions(IntPtr hFile, long nPosition, IntPtr pCompare, UInt32 pSize)
{
    IntPtr pBuffer = IntPtr.Zero;
    UInt32 iRead = 0;
    try
    {
        pBuffer = VirtualAlloc(IntPtr.Zero, pSize, MEM_COMMIT, PAGE_READWRITE);
        SetFilePointerEx(hFile, nPosition, IntPtr.Zero, FILE_BEGIN);
        if (ReadFile(hFile, pBuffer, pSize, ref iRead, IntPtr.Zero) == 0)
            return false;
        if (RtlCompareMemory(pCompare, pBuffer, pSize) == pSize)
            return true; // equal
        return false;
    }
    finally
    {
        if (pBuffer != IntPtr.Zero)
            VirtualFree(pBuffer, pSize, MEM_RELEASE);
    }
}
I would modify this example to grab a large buffer full of entries and loop through those. But managed code may not be the way to go; the fastest approach is always closer to the calls that do the actual work, so a driver with kernel-mode access built in straight C would be much faster.
Firstly, you say the strings are really SHA256 hashes. Observe that 100 million * 256 bits = 3.2 gigabytes, so it is possible to fit the entire list in memory, assuming you use a memory-efficient data structure.
If you forgive occasional false positives, you can actually use less memory than that. See bloom filters http://billmill.org/bloomfilter-tutorial/
Otherwise, use a sorted data structure to achieve fast querying (time complexity O(log n)).
If you really do want to store the data in memory (because you're querying frequently and need fast results), try Redis. http://redis.io/
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
It has a set datatype http://redis.io/topics/data-types#sets
Redis Sets are an unordered collection of Strings. It is possible to add, remove, and test for existence of members in O(1) (constant time regardless of the number of elements contained inside the Set).
Otherwise, use a database that saves the data on disk.
A plain vanilla binary search tree will give excellent lookup performance on large lists. However, if you don't really need to store the strings and simple membership is all you want to know, a Bloom filter may be a terrific solution. Bloom filters are a compact data structure that you train with all the strings. Once trained, a filter can quickly tell you whether it has seen a string before. It rarely reports false positives and never reports false negatives. Depending on the application, they can produce amazing results quickly and with relatively little memory.
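For illustration, a minimal (untuned) Bloom filter sketch over the raw 32-byte hashes; bit count and probe count would need to be sized for the target false-positive rate:
using System.Collections;

sealed class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _probeCount;

    public BloomFilter(int bitCount, int probeCount)
    {
        _bits = new BitArray(bitCount);
        _probeCount = probeCount;
    }

    // Simple seeded hash over the bytes; a real implementation would use stronger mixing.
    private int Probe(byte[] item, int seed)
    {
        unchecked
        {
            int h = 17 + seed * 486187739;
            foreach (byte b in item) h = h * 31 + b;
            return (int)((uint)h % (uint)_bits.Length);
        }
    }

    public void Add(byte[] item)
    {
        for (int i = 0; i < _probeCount; i++)
            _bits[Probe(item, i)] = true;
    }

    public bool MightContain(byte[] item)
    {
        for (int i = 0; i < _probeCount; i++)
            if (!_bits[Probe(item, i)]) return false;
        return true;   // "probably present" - confirm against the authoritative store if needed
    }
}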
I developed a solution similar to Insta's approach, but with some differences. In effect, it looks a lot like his chunked array solution. However, instead of just simply splitting the data, my approach builds an index of chunks and directs the search only to the appropriate chunk.
The way the index is built is very similar to a hashtable, with each bucket being a sorted array that can be searched with a binary search. However, I figured that there's little point in computing a hash of an SHA-256 hash, so instead I simply take a prefix of the value.
The interesting thing about this technique is that you can tune it by extending the length of the index keys. A longer key means a larger index and smaller buckets. My test case of 8 bits is probably on the small side; 10-12 bits would probably be more effective.
I attempted to benchmark this approach, but it quickly ran out of memory so I wasn't able to see anything interesting in terms of performance.
I also wrote a C implementation. The C implementation wasn't able to deal with a data set of the specified size either (the test machine has only 4GB of RAM), but it did manage somewhat more. (The target data set actually wasn't so much of a problem in that case, it was the test data that filled up the RAM.) I wasn't able to figure out a good way to throw data at it fast enough to really see its performance tested.
While I enjoyed writing this, I'd say overall it mostly provides evidence in favor of the argument that you shouldn't be trying to do this in memory with C#.
public interface IKeyed
{
int ExtractKey();
}
struct Sha256_Long : IComparable<Sha256_Long>, IKeyed
{
private UInt64 _piece1;
private UInt64 _piece2;
private UInt64 _piece3;
private UInt64 _piece4;
public Sha256_Long(string hex)
{
if (hex.Length != 64)
{
throw new ArgumentException("Hex string must contain exactly 64 digits.");
}
UInt64[] pieces = new UInt64[4];
for (int i = 0; i < 4; i++)
{
pieces[i] = UInt64.Parse(hex.Substring(i * 16, 16), NumberStyles.HexNumber);
}
_piece1 = pieces[0];
_piece2 = pieces[1];
_piece3 = pieces[2];
_piece4 = pieces[3];
}
public Sha256_Long(byte[] bytes)
{
if (bytes.Length != 32)
{
throw new ArgumentException("Sha256 values must be exactly 32 bytes.");
}
_piece1 = BitConverter.ToUInt64(bytes, 0);
_piece2 = BitConverter.ToUInt64(bytes, 8);
_piece3 = BitConverter.ToUInt64(bytes, 16);
_piece4 = BitConverter.ToUInt64(bytes, 24);
}
public override string ToString()
{
return String.Format("{0:X}{1:X}{2:X}{3:X}", _piece1, _piece2, _piece3, _piece4);
}
public int CompareTo(Sha256_Long other)
{
if (this._piece1 < other._piece1) return -1;
if (this._piece1 > other._piece1) return 1;
if (this._piece2 < other._piece2) return -1;
if (this._piece2 > other._piece2) return 1;
if (this._piece3 < other._piece3) return -1;
if (this._piece3 > other._piece3) return 1;
if (this._piece4 < other._piece4) return -1;
if (this._piece4 > other._piece4) return 1;
return 0;
}
//-------------------------------------------------------------------
// Implementation of key extraction
public const int KeyBits = 8;
private static UInt64 _keyMask;
private static int _shiftBits;
static Sha256_Long()
{
_keyMask = 0;
for (int i = 0; i < KeyBits; i++)
{
_keyMask |= (UInt64)1 << i;
}
_shiftBits = 64 - KeyBits;
}
public int ExtractKey()
{
UInt64 keyRaw = _piece1 & _keyMask;
return (int)(keyRaw >> _shiftBits);
}
}
class IndexedSet<T> where T : IComparable<T>, IKeyed
{
private T[][] _keyedSets;
public IndexedSet(IEnumerable<T> source, int keyBits)
{
// Arrange elements into groups by key
var keyedSetsInit = new Dictionary<int, List<T>>();
foreach (T item in source)
{
int key = item.ExtractKey();
List<T> vals;
if (!keyedSetsInit.TryGetValue(key, out vals))
{
vals = new List<T>();
keyedSetsInit.Add(key, vals);
}
vals.Add(item);
}
// Transform the above structure into a more efficient array-based structure
int nKeys = 1 << keyBits;
_keyedSets = new T[nKeys][];
for (int key = 0; key < nKeys; key++)
{
List<T> vals;
if (keyedSetsInit.TryGetValue(key, out vals))
{
_keyedSets[key] = vals.OrderBy(x => x).ToArray();
}
}
}
public bool Contains(T item)
{
int key = item.ExtractKey();
if (_keyedSets[key] == null)
{
return false;
}
else
{
return Search(item, _keyedSets[key]);
}
}
private bool Search(T item, T[] set)
{
int first = 0;
int last = set.Length - 1;
while (first <= last)
{
int midpoint = (first + last) / 2;
int cmp = item.CompareTo(set[midpoint]);
if (cmp == 0)
{
return true;
}
else if (cmp < 0)
{
last = midpoint - 1;
}
else
{
first = midpoint + 1;
}
}
return false;
}
}
class Program
{
//private const int NTestItems = 100 * 1000 * 1000;
private const int NTestItems = 1 * 1000 * 1000;
private static Sha256_Long RandomHash(Random rand)
{
var bytes = new byte[32];
rand.NextBytes(bytes);
return new Sha256_Long(bytes);
}
static IEnumerable<Sha256_Long> GenerateRandomHashes(
Random rand, int nToGenerate)
{
for (int i = 0; i < nToGenerate; i++)
{
yield return RandomHash(rand);
}
}
static void Main(string[] args)
{
Console.WriteLine("Generating test set.");
var rand = new Random();
IndexedSet<Sha256_Long> set =
new IndexedSet<Sha256_Long>(
GenerateRandomHashes(rand, NTestItems),
Sha256_Long.KeyBits);
Console.WriteLine("Testing with random input.");
int nFound = 0;
int nItems = NTestItems;
int waypointDistance = 100000;
int waypoint = 0;
for (int i = 0; i < nItems; i++)
{
if (++waypoint == waypointDistance)
{
Console.WriteLine("Test lookups complete: " + (i + 1));
waypoint = 0;
}
var item = RandomHash(rand);
nFound += set.Contains(item) ? 1 : 0;
}
Console.WriteLine("Testing complete.");
Console.WriteLine(String.Format("Found: {0} / {1}", nFound, nItems));
Console.ReadKey();
}
}

Parsing one terabyte of text and efficiently counting the number of occurrences of each word

Recently I came across an interview question to create an algorithm in any language which should do the following:
Read 1 terabyte of content
Make a count for each recurring word in that content
List the top 10 most frequently occurring words
Could you let me know the best possible way to create an algorithm for this?
Edit:
OK, let's say the content is in English. How can we find the top 10 words that occur most frequently in that content? My other doubt is: if they purposely supply only unique data, then our structure will blow up the heap. We need to handle that as well.
Interview Answer
This task is interesting without being too complex, so a great way to start a good technical discussion. My plan to tackle this task would be:
Split input data in words, using white space and punctuation as delimiters
Feed every word found into a Trie structure, with counter updated in nodes representing a word's last letter
Traverse the fully populated tree to find nodes with highest counts
In the context of an interview ... I would demonstrate the idea of Trie by drawing the tree on a board or paper. Start from empty, then build the tree based on a single sentence containing at least one recurring word. Say "the cat can catch the mouse". Finally show how the tree can then be traversed to find highest counts. I would then justify how this tree provides good memory usage, good word lookup speed (especially in the case of natural language for which many words derive from each other), and is suitable for parallel processing.
Draw on the board
Demo
The C# program below goes through 2 GB of text in 75 seconds on a 4-core Xeon W3520, maxing out 8 threads. Performance is around 4.3 million words per second with less-than-optimal input parsing code. With the Trie structure storing the words, memory is not an issue when processing natural language input.
Notes:
test text obtained from the Gutenberg project
input parsing code assumes line breaks and is pretty sub-optimal
removal of punctuation and other non-word is not done very well
handling one large file instead of several smaller ones would require a small amount of code to start reader threads at specified offsets within the file.
using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
TrieNode root = new TrieNode(null, '?');
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
if (args.Length > 0)
{
foreach (string path in args)
{
DataReader new_reader = new DataReader(path, ref root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
int distinct_word_count = 0;
int total_word_count = 0;
root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 1;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ref TrieNode root)
{
m_root = root;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private ConcurrentDictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new ConcurrentDictionary<char, TrieNode>();
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.TryAdd(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
lock (this)
{
m_word_count++;
}
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
if (m_parent == null) return "";
else return m_parent.ToString() + m_char;
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
Here is the output from processing the same 20 MB of text 100 times across 8 threads.
Counting words...
Input data processed in 75.2879952 secs
Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times
323618000 words counted
60896 distinct words found
A great deal here depends on some things that haven't been specified. For example, are we trying to do this once, or are we trying to build a system that will do this on a regular and ongoing basis? Do we have any control over the input? Are we dealing with text that's all in a single language (e.g., English) or are many languages represented (and if so, how many)?
These matter because:
If the data starts out on a single hard drive, parallel counting (e.g., map-reduce) isn't going to do any real good -- the bottleneck is going to be the transfer speed from the disk. Making copies to more disks so we can count faster will be slower than just counting directly from the one disk.
If we're designing a system to do this on a regular basis, most of our emphasis is really on the hardware -- specifically, have lots of disks in parallel to increase our bandwidth and at least get a little closer to keeping up with the CPU.
No matter how much text you're reading, there's a limit on the number of discrete words you need to deal with -- whether you have a terabyte or even a petabyte of English text, you're not going to see anything like billions of different words in English. Doing a quick check, the Oxford English Dictionary lists approximately 600,000 words in English.
Although the actual words are obviously different between languages, the number of words per language is roughly constant, so the size of the map we build will depend heavily on the number of languages represented.
That mostly leaves the question of how many languages could be represented. For the moment, let's assume the worst case. ISO 639-2 has codes for 485 human languages. Let's assume an average of 700,000 words per language, and an average word length of, say, 10 bytes of UTF-8 per word.
Just stored as simple linear list, that means we can store every word in every language on earth along with an 8-byte frequency count in a little less than 6 gigabytes. If we use something like a Patricia trie instead, we can probably plan on that shrinking at least somewhat -- quite possibly to 3 gigabytes or less, though I don't know enough about all those languages to be at all sure.
Now, the reality is that we've almost certainly overestimated the numbers in a number of places there -- quite a few languages share a fair number of words, many (especially older) languages probably have fewer words than English, and glancing through the list, it looks like some are included that probably don't have written forms at all.
Summary: Almost any reasonably new desktop/server has enough memory to hold the map entirely in RAM -- and more data won't change that. For one (or a few) disks in parallel, we're going to be I/O-bound anyway, so parallel counting (and such) will probably be a net loss. We probably need tens of disks in parallel before any other optimization means much.
You can try a map-reduce approach for this task. The advantage of map-reduce is scalability, so even for 1 TB, 10 TB or 1 PB the same approach will work, and you will not need to do a lot of work to adapt your algorithm to the new scale. The framework will also take care of distributing the work among all the machines (and cores) you have in your cluster.
First - create the (word, occurrences) pairs.
The pseudo-code for this will be something like this:
map(document):
    for each word w:
        EmitIntermediate(w, "1")

reduce(word, list<val>):
    Emit(word, size(list))
Second, you can easily find the top-K highest occurrences with a single iteration over the pairs; this thread explains the concept. The main idea is to hold a min-heap of the top K elements; while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
A more scalable (though slower, if you have few machines) alternative is to use the map-reduce sorting functionality, sort the data according to the occurrences, and just grep the top K.
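A sketch of that min-heap step in C# (this assumes the counts have already been aggregated into a dictionary, and uses the PriorityQueue type available in .NET 6+):
using System.Collections.Generic;

static class TopWords
{
    public static List<KeyValuePair<string, long>> TopK(Dictionary<string, long> counts, int k)
    {
        // Priority = count, so the smallest count among the kept K is always at the root.
        var heap = new PriorityQueue<KeyValuePair<string, long>, long>();
        foreach (var pair in counts)
        {
            heap.Enqueue(pair, pair.Value);
            if (heap.Count > k)
                heap.Dequeue();              // evict the current minimum
        }

        var result = new List<KeyValuePair<string, long>>();
        while (heap.Count > 0)
            result.Add(heap.Dequeue());      // ascending by count; reverse for "top first"
        return result;
    }
}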
Three things of note for this.
Specifically: the file is too large to hold in memory, the word list is (potentially) too large to hold in memory, and the word count can be too large for a 32-bit int.
Once you get through those caveats, it should be straightforward. The game is managing the potentially large word list.
If it's any easier (to keep your head from spinning), imagine:
"You're running a Z-80 8 bit machine, with 65K of RAM and have a 1MB file..."
Same exact problem.
It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time and space efficient and quite simple to implement, for instance:
Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
Count-min sketch, to get an estimated count for any word
Those data structures require only a very small, constant amount of space (the exact amount depends on the error you can tolerate).
See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
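As a flavour of how small these structures are, here is a minimal count-min sketch (illustrative only; the width and depth parameters control the error bound):
using System;

sealed class CountMinSketch
{
    private readonly long[,] _counts;
    private readonly int _width;
    private readonly int _depth;

    public CountMinSketch(int width, int depth)
    {
        _width = width;
        _depth = depth;
        _counts = new long[depth, width];
    }

    // One simple seeded hash per row; real implementations use pairwise-independent hashes.
    private int Bucket(string word, int row)
    {
        unchecked
        {
            int h = 17 + row * 486187739;
            foreach (char c in word) h = h * 31 + c;
            return (int)((uint)h % (uint)_width);
        }
    }

    public void Add(string word)
    {
        for (int row = 0; row < _depth; row++)
            _counts[row, Bucket(word, row)]++;
    }

    public long Estimate(string word)            // never underestimates the true count
    {
        long best = long.MaxValue;
        for (int row = 0; row < _depth; row++)
            best = Math.Min(best, _counts[row, Bucket(word, row)]);
        return best;
    }
}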
I'd be quite tempted to use a DAWG (Wikipedia, and a C# write-up with more details). It's simple enough to add a count field on the leaf nodes; it's memory efficient and performs very well for lookups.
EDIT: Though have you tried simply using a Dictionary<string, int>? Where <string, int> represents word and count? Perhaps you're trying to optimize too early?
editor's note: This post originally linked to this wikipedia article, which appears to be about another meaning of the term DAWG: A way of storing all substrings of one word, for efficient approximate string-matching.
A different solution could be to use an SQL table and let the system handle the data as well as it can. First create a table with a single field, word, with one row per word in the collection.
Then use this query (sorry for any syntax issues; my SQL is rusty - this is actually pseudo-code):
SELECT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC
The general idea is to first generate a table (which is stored on disk) with all the words, and then use a query to count and sort the (word, occurrences) pairs for you. You can then just take the top K from the retrieved list.
To all: If I indeed have any syntax or other issues in the SQL statement: feel free to edit
First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.
I did see in the comments several people making suggestions on how to improve its performance, but those were only minor tweaks, so I thought I'd share what I found to be the real bottleneck... the ConcurrentDictionary.
I'd wanted to play around with thread-local storage, and your sample gave me a great opportunity to do that. After some minor changes to use a Dictionary per thread and then combine the dictionaries after the Join(), I saw performance improve by ~30% (processing 20 MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).
The code is pasted below and you'll notice not much changed from the approved answer.
P.S. I don't have more than 50 reputation points so I could not put this in a comment.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
List<ThreadLocal<TrieNode>> roots;
if (args.Length == 0)
{
roots = new List<ThreadLocal<TrieNode>>(1);
}
else
{
roots = new List<ThreadLocal<TrieNode>>(args.Length);
foreach (string path in args)
{
ThreadLocal<TrieNode> root = new ThreadLocal<TrieNode>(() =>
{
return new TrieNode(null, '?');
});
roots.Add(root);
DataReader new_reader = new DataReader(path, root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
{
roots[0].Value.CombineNode(root.Value);
root.Dispose();
}
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
int distinct_word_count = 0;
int total_word_count = 0;
roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
roots[0].Dispose();
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
Console.ReadLine();
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 100;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ThreadLocal<TrieNode> root)
{
m_root = root.Value;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private Dictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new Dictionary<char, TrieNode>();
}
public void CombineNode(TrieNode from)
{
foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
{
char keyChar = fromChild.Key;
if (!m_children.ContainsKey(keyChar))
{
m_children.Add(keyChar, new TrieNode(this, keyChar));
}
m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
m_children[keyChar].CombineNode(fromChild.Value);
}
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.Add(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
m_word_count++;
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
return BuildString(new StringBuilder()).ToString();
}
private StringBuilder BuildString(StringBuilder builder)
{
if (m_parent == null)
{
return builder;
}
else
{
return m_parent.BuildString(builder).Append(m_char);
}
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
As a quick general algorithm I would do this.
Create a map whose keys are the words themselves and whose values are the counts for those words.
for each string in content:
if string is a valid key for the map:
increment the value associated with that key
else
add a new key/value pair to the map with the key being the word and the count being one
done
Then you could just pull the ten largest counts out of the map:
create an array size 10 with data pairs of (word, count)
for each value in the map
if current pair has a count larger than the smallest count in the array
replace that pair with the current one
print all pairs in array
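Here is a minimal C# sketch of that map-and-scan approach, assuming the input is a plain text file split on whitespace (the default file name below is just a stand-in). It uses LINQ to pull the top ten at the end instead of the fixed-size array from the pseudocode, purely to keep the sketch short:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class TopTenByDictionary
{
    static void Main(string[] args)
    {
        // Count occurrences: key = word, value = how many times it was seen.
        var counts = new Dictionary<string, long>();
        foreach (string line in File.ReadLines(args.Length > 0 ? args[0] : "war-and-peace.txt"))
        {
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                counts.TryGetValue(word, out long count);
                counts[word] = count + 1;
            }
        }

        // Ten largest counts; a fixed-size "top 10" array as in the pseudocode would
        // avoid sorting the whole dictionary, but this keeps the sketch readable.
        foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
        {
            Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }
}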
Well, personally, I'd split the file into chunks of, say, 128 MB, maintaining two in memory at all times while scanning; any discovered word is added to a hash list along with a list of counts, and then I'd iterate the list of lists at the end to find the top 10...
Well, the first thought is to manage a database in the form of a hashtable/array or whatever to save each word's occurrence count, but given the data size I would rather:
Get the first 10 words found whose occurrence count is >= 2.
Count how many times these words occur in the entire string, and delete them while counting.
Start again; once you have two sets of 10 words each, keep the 10 most frequent words of both sets.
Do the same for the rest of the string (which doesn't contain these words anymore).
You can even try to be more efficient and start with the first 10 words found whose occurrence count is >= 5, for example, or more; if none are found, reduce this value until 10 words are found. This way you have a good chance of avoiding the memory-intensive storage of all word occurrences, which is a huge amount of data, and you can save scanning rounds (in a good case).
But in the worst case you may have more rounds than in a conventional algorithm.
By the way, it's a problem I would try to solve with a functional programming language rather than OOP.
The method below will only read your data once and can be tuned for memory sizes.
Read the file in chunks of say 1GB
For each chunk make a list of say the 5000 most occurring words with their frequency
Merge the lists based on frequency (1000 lists with 5000 words each)
Return the top 10 of the merged list
Theoretically you might miss words, although I think that chance is very, very small.
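A rough C# sketch of that chunk-and-merge idea, under a few assumptions of mine: a "chunk" is approximated by a fixed number of lines rather than 1 GB of bytes, only the 5000 most frequent words of each chunk survive the merge (which is where the small chance of missing a word comes from), and the file name and constants are placeholders:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ChunkedTopWords
{
    const int WordsPerChunk = 5000;    // "the 5000 most occurring words" kept per chunk
    const int LinesPerChunk = 1000000; // stand-in for "chunks of say 1GB"

    static void Main()
    {
        var merged = new Dictionary<string, long>();
        var chunkCounts = new Dictionary<string, long>();
        int linesInChunk = 0;

        foreach (string line in File.ReadLines("big-input.txt")) // placeholder file name
        {
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                chunkCounts.TryGetValue(word, out long count);
                chunkCounts[word] = count + 1;
            }

            if (++linesInChunk == LinesPerChunk)
            {
                MergeTopWords(merged, chunkCounts);
                chunkCounts.Clear();
                linesInChunk = 0;
            }
        }
        MergeTopWords(merged, chunkCounts); // don't forget the last, partial chunk

        foreach (var pair in merged.OrderByDescending(p => p.Value).Take(10))
        {
            Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }

    // Keep only the most frequent words of a chunk and fold them into the running totals.
    static void MergeTopWords(Dictionary<string, long> merged, Dictionary<string, long> chunk)
    {
        foreach (var pair in chunk.OrderByDescending(p => p.Value).Take(WordsPerChunk))
        {
            merged.TryGetValue(pair.Key, out long count);
            merged[pair.Key] = count + pair.Value;
        }
    }
}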
Storm is the technology to look at. It separates the role of data input (spouts) from processors (bolts). Chapter 2 of the Storm book solves your exact problem and describes the system architecture very well:
http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
Storm is more about real-time processing, as opposed to batch processing with Hadoop. If your data is pre-existing, you can distribute the load to different spouts and spread it out for processing to different bolts.
This approach will also support data growing beyond terabytes, as the data will be analysed as it arrives in real time.
Very interesting question. It relates more to logic analysis than coding. With the assumption of the English language and valid sentences, it becomes easier.
You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English it is 5.1). Therefore you will not have problems with memory.
As for reading the file, you should choose a parallel mode, reading chunks (of a size of your choice) and adjusting the chunk positions to fall on whitespace. If you decide to read chunks of 1 MB, for example, all chunks except the first one should be a bit wider (+22 bytes on the left and +22 bytes on the right, where 22 is the length of the longest English word, if I'm right). For parallel processing you will need a concurrent dictionary, or local collections that you will merge.
Keep in mind that normally you will end up with a top ten that is part of a valid stop-word list (this is probably another, reverse approach which is also valid as long as the content of the file is ordinary).
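This is not the byte-offset chunking with the ±22-byte overlap described above, but a simplified C# sketch of the same general idea: several workers counting in parallel into a concurrent dictionary, with lines rather than raw byte ranges as the unit of work (the file name is a placeholder):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelWordCount
{
    static void Main()
    {
        var counts = new ConcurrentDictionary<string, long>();

        // One thread streams the lines; Parallel.ForEach hands them out to worker threads.
        Parallel.ForEach(File.ReadLines("big-input.txt"), line => // placeholder file name
        {
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                // The answer above suggests skipping words longer than the average
                // English word length (~5 letters); uncomment to apply that filter.
                // if (word.Length > 5) continue;
                counts.AddOrUpdate(word, 1, (key, count) => count + 1);
            }
        });

        foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
        {
            Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }
}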
Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, like a trie, stores strings in a specific way and is very efficient. A second way would be to build your own solution, like counting words. I guess this TB of data would be in English; there are around 600,000 words in general use, so it would be possible to store only those words and count which strings are repeated, and this solution would need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
Here is an implementation of a trie in Java:
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
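For what it's worth, the C# TrieNode class from the thread-local answer earlier in this thread already does exactly this; a tiny usage sketch, assuming that class is in scope:
// Assumes the TrieNode class from the thread-local answer above is compiled into the same project.
var root = new TrieNode(null, '?');
root.AddWord("storm");
root.AddWord("storm");
root.AddWord("hadoop");
Console.WriteLine(root.GetCount("storm"));  // prints 2
Console.WriteLine(root.GetCount("hadoop")); // prints 1
Console.WriteLine(root.GetCount("trie"));   // prints -1 (never added)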
MapReduce
WordCount can be achieved efficiently through MapReduce using Hadoop.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Large files can be parsed with it. It uses multiple nodes in the cluster to perform this operation.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
