Streaming int as text into a file without creating string objects - c#

I'm streaming some text into a file. The following code is an MCVE to demonstrate the issue:
using (var fileStream = File.OpenWrite(#"d:\test.txt"))
{
using (var streamWriter = new StreamWriter(fileStream))
{
for (int i = 0; i < 20000000; i++)
{
streamWriter.Write("Just ");
streamWriter.Write(i);
streamWriter.Write(" example.\r\n");
}
}
}
As you can see, it does not only stream strings into a file but also ints.
Unfortunately, the Write(int) method seems to call ToString() on the numbers, so above code creates 20M string objects. This can be proven with a memory profiler like dotMemory. These strings will be garbage collected, but they cause a 577 MB throughput on the garbage collector.
Is there a way to directly stream the ints (in a human readable format, not binary) into the file without creating temporary strings?

You can do it one digit at a time using recursion:
void writeInt(StreamWriter w, int i)
{
if (i<0)
{
w.Write('-');
writeInt(w, -i);
}
else if (i>=10)
{
writeInt(w, i/10); // Write all but the last digit
w.Write((char)('0' + i%10)); // Write the last digit
}
else
{
w.Write((char)('0' + i)); // Write single digit number
}
}
Call it like this for a given integer:
int i = 42;
streamWriter.Write("Just ");
writeInt(streamWriter, i);
streamWriter.Write(" example.\r\n");

Related

What would be the best data structure for a Merge Search?

I want to do a merge sort in c#. I know the premise of the sort but I can't think how to store the values when the data is split in halves. Obviously the data will be split and again and again until the individual values are found and then they are stored. After that comes the comparison. I haven't started coding yet so I am open to suggestions. For my previous bubble and binary insertion sorts I used arrays. The data sets will be variable in size so I want something to accommodate this.
MergeString MergeSort = new MergeString();
int StringCount = MergeSort.GetNumberOfStrings();
string[] UnsortedStringList = MergeSort.GetListOfStrings(StringCount);
public class MergeInt
{
public int GetNumberOfNumbers()
{
bool isParsable = false;
int NumberCount = 0;
do
{
Console.WriteLine("How many numbers are in the unsorted list?");
string NumberOfNumbers = Console.ReadLine();
isParsable = int.TryParse(NumberOfNumbers, out NumberCount);
} while (isParsable == false);
return NumberCount;
}
public int[] GetListOfNumbers(int NumberCount)
{
bool isParsable = false;
int[] UnsortedNumberList = new int[NumberCount];
for (int i = 0; i < NumberCount; i++)
{
int Number = 0;
do
{
Console.WriteLine("What number is in the list?");
string NumberString = Console.ReadLine();
isParsable = int.TryParse(NumberString, out Number);
} while (isParsable == false);
UnsortedNumberList[i] = Number;
}
return UnsortedNumberList;
}
public void SortList(int[] a)
{
int split = a.Length / 2;
Console.WriteLine(split);
}
}
Merge sort requires an extra buffer to store temporary data. To keep it simple you should probably just use arrays for everything. I'm not sure I would call this a "data structure", since it is just a plain memory buffer.
For the actual merge code I would probably recommend using Span<T>, this lets you refer to a section of memory inside a buffer by Sliceing the data. An array should be convertible to a span, but you can also call AsSpan.
For this kind of problem I would probably go for recursive solution, and I would expect the major difficulty will be to avoid of by one errors, and ensuring you are swapping and merging buffers correctly.

How to read the from a text file then calculate an average

I plan on reading the marks from a text file and then calculating what the average mark is based upon data written in previous code. I haven't been able to read the marks though or calculate how many marks there are as BinaryReader doesn't let you use .Length.
I have tried using an array to hold each mark but it doesn't like each mark being an integer
public static int CalculateAverage()
{
int count = 0;
int total = 0;
float average;
BinaryReader markFile;
markFile = new BinaryReader(new FileStream("studentMarks.txt", FileMode.Open));
//A loop to read each line of the file and add it to the total
{
//total = total + eachMark;
//count++;
}
//average = total / count;
//markFile.Close();
//Console.WriteLine("Average mark:", average);
return 0;
}
This is my studentMark.txt file in VS
First of all, don't use BinerayRead you can use StreamReader for example.
Also with using statement is not necessary implement the close().
There is an answer using a while loop, so using Linq you can do in one line:
var avg = File.ReadAllLines("file.txt").ToArray().Average(a => Int32.Parse(a));
Console.WriteLine("avg = "+avg); //5
Also using File.ReadAllLines() according too docs the file is loaded into memory and then close, so there is no leak memory problem or whatever.
Opens a text file, reads all lines of the file into a string array, and then closes the file.
Edit to add the way to read using BinaryReader.
First thing to know is you are reading a txt file. Unless you have created the file using BinaryWriter, the binary reader will not work. And, if you are creating a binary file, there is not a good practice name as .txt.
So, assuming your file is binary, you need to loop and read every integer, so this code shoul work.
var fileName = "file.txt";
if (File.Exists(fileName))
{
using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
{
while (reader.BaseStream.Position < reader.BaseStream.Length)
{
total +=reader.ReadInt32();
count++;
}
}
average = total/count;
Console.WriteLine("Average = "+average); // 5
}
I've used using to ensure file is close at the end.
If your file only contains numbers, you only have to use ReadInt32() and it will work.
Also, if your file is not binary, obviously, binary writer will not work. By the way, my binary file.txt created using BinaryWriter looks like this:
So I'm assuming you dont have a binary file...

How to assert uniqueness of a huge collection of strings?

Let's say I have an algorithm which takes an unsigned 64-bit integer as input, and yields a string as a result. The string's alphabet is limited to [a-z, A-Z, 0-9] and its' maximum length is 16. So that's or 47,672,401,706,823,533,450,263,330,816 possible results.
I would like to assert the uniqueness of the algorithm's output. Read: I want to verify there are no collisions.
Is there an easy/quick 'n dirty way to do this, without having to fall back to (e.g.) some kind of database?
[EDIT]
Some clarification: the concerns uttered in the comments are legit, but no worries, I wasn't really planning on iterating over all possible combinations, my lifespan will probably be sub-1 century ;) Nor did I write my own algorithm to generate unique ID's. I just saw this and started wondering how one would go about asserting uniqueness for algorithms with very large result sets that can't be handled in-memory
[/EDIT]
As said in the comments, It would take a very long time to compute every possible entries, but just for fun, here is a try:
var workspace = new DirectoryInfo("MyWorkspace");
if (workspace.Exists)
{
workspace.Delete();
}
workspace.Create();
var limit = 23997907;
var buffer = new HashSet<string>();
ulong i = 0;
int j = 0;
var stopWatch = Stopwatch.StartNew();
while (i <= ulong.MaxValue)
{
var result = YourSuperAlgorythm(i);
// Check the result with current results
if (buffer.Contains(result))
{
throw new Exception("Failure !");
}
// Check the result with older results
foreach (var file in workspace.GetFiles())
{
var content = new HashSet<string>(File.ReadAllText(file.FullName).Split(';'));
if (content.Contains(result))
{
throw new Exception("Failure !");
}
}
buffer[j] = result;
i++;
j++;
if (j == arrayLimit)
{
stopWatch.Stop();
Console.WriteLine("Resetting. This loop takes " + stopWatch.Elapsed.TotalMilliseconds + "ms");
j = 0;
var file = Path.GetRandomFileName();
File.WriteAllText(Path.Combine(workspace.FullName, file), String.Join(";", buffer));
buffer = new HashSet<string>();
stopWatch.Restart();
}
}
You could probably optimize it but you won't have enought of a lifetime to check the results. For now, it did not even create a file to store the first set of entries :D. I will edit this post when one loop will be done!
Your only option is to prove mathematically your algorithm. Good luck with that...
EDIT1: for my test, I use this function:
private static string YourSuperAlgorythm(ulong i)
{
return i.ToString("x");
}
EDIT2: One loop takes 1477221.4261ms (~25min). And then the String.Join(";", buffer) line failed (OutOfMemory). So 23997907 is not the max value for my try. It must be decreased!

Need a fast method of deserializing 1 million Strings & Guids in c#

I want to deserialize a list of 1 million pairs of (String,Guid) for a performance critical app. The format can be anything I choose, and serialization does not have the same performance requirements.
What sort of approach is best? Text or binary? Write each pair (string,guid) consecutively, or write all strings followed by all guids?
I started playing with LinqPad, (and the simpler example of deserializing strings only) and found that (slightly counter-intuitively), using a TextReader and ReadLine() was a fair bit faster than using a BinaryReader and ReadString(). (Is the filesystem cache playing tricks on me?)
public string[] DeSerializeBinary()
{
var tmr = System.Diagnostics.Stopwatch.StartNew();
long ms = 0;
string[] arr = null;
using (var rdr = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read)))
{
var num = rdr.ReadInt32();
arr = new String[num];
for (int i = 0; i < num; i++)
{
arr[i] = rdr.ReadString();
}
tmr.Stop();
ms = tmr.ElapsedMilliseconds;
Console.WriteLine("DeSerializeBinary took {0}ms", ms);
}
return arr;
}
public string[] DeserializeText()
{
var tmr = System.Diagnostics.Stopwatch.StartNew();
long ms = 0;
string[] arr = null;
using (var rdr = File.OpenText(file))
{
var num = Int32.Parse(rdr.ReadLine());
arr = new String[num];
for (int i = 0; i < num; i++)
{
arr[i] = rdr.ReadLine();
}
tmr.Stop();
ms = tmr.ElapsedMilliseconds;
Console.WriteLine("DeserializeText took {0}ms", ms);
}
return arr;
}
Some Edits:
I used RamMap to clear the file system cache, and it turns out there was very little difference to Text & Binary reader for strings only.
I have a fairly simple class that holds the string and guid. It also holds an int index which corresponds to its position in the list. Obviously there's no need to include this in serialization.
In a test for (binary) deSerializing Strings and Guids alternately, I get around 500ms.
Ideal timing is 50ms, or as close as I can get. However, a simple experiment showed it takes at least 120ms to read the (compressed) file into memory from a reasonably fast SSD drive, without any sort of parsing at all. So 50ms seems unlikely.
Our strings have no theoretical length restrictions. However, we can assume that the performance target only applies if they are all 20 characters or less.
Timings include opening the file.
Reading the Strings is the clear bottleneck now (hence my experiments with serializing strings only). The JIT_NewFast took 30% before I preallocated an array of 16bytes for reading GUIDs.
It's not surprising that reading a bunch of strings is faster with StreamReader than with BinaryReader. StreamReader reads in blocks from the underlying stream, and parses the strings from that buffer. BinaryReader doesn't have a buffer like that. It reads the string length from the underlying stream, and then reads that many characters. So BinaryReader makes more calls to the base stream's Read method.
But there's more to deserializing a (String, Guid) pair than just reading. You also have to parse the Guid. If you write the file in binary then the Guid is written in binary, which makes it much easier and faster to create a Guid structure. If it's a string, then you have to call new Guid(string) to parse the text and create a Guid, after you split the line into its two fields.
Hard to say which of those will be faster.
I can't imagine that we're talking about a whole lot of time here. Certainly reading a file with a million lines will take around a second. Unless the string is really long. A GUID is only 36 characters if you count the separators, right?
With BinaryWriter, you can write the file like this:
writer.Write(count); // integer number of records
foreach (var pair in pairs)
{
writer.Write(pair.theString);
writer.Write(pair.theGuid.ToByteArray());
}
And to read it, you have:
count = reader.ReadInt32();
byte[] guidBytes = new byte[16];
for (int i = 0; i < count; ++i)
{
string s = reader.ReadString();
reader.Read(guidBytes, 0, guidBytes.Length);
pairs.Add(new Pair(s, new Guid(guidBytes));
}
Whether that's faster than splitting a string and calling the Guid constructor that takes a string parameter, I don't know.
I suspect that any difference is going to be pretty slight. I'd probably go with the simplest method: a text file.
If you want to get really crazy, you can write a custom format that you can easily slurp up in just a couple of large reads (a header, an index, and two arrays for strings and GUIDs), and do everything else in memory. That would almost certainly be faster. But faster enough to warrant the extra work? Doubtful.
Update
Or maybe not doubtful. Here's some code that writes and reads a custom binary format. The format is:
count (int32)
guids (count * 16 bytes)
strings (one big concatenated string)
index (index of each string's starting character in the big string)
I assume you're using a Dictionary<string, Guid> to hold these things. But your data structure doesn't really matter. The code would be substantially the same.
Note that I tested this very briefly. I won't say that the code is 100% bug free, but I think you can get the idea of what I'm doing.
private void WriteGuidFile(string filename, Dictionary<string, Guid>guids)
{
using (var fs = File.Create(filename))
{
using (var writer = new BinaryWriter(fs, Encoding.UTF8))
{
List<int> stringIndex = new List<int>(guids.Count);
StringBuilder bigString = new StringBuilder();
// write count
writer.Write(guids.Count);
// Write the GUIDs and build the string index
foreach (var pair in guids)
{
writer.Write(pair.Value.ToByteArray(), 0, 16);
stringIndex.Add(bigString.Length);
bigString.Append(pair.Key);
}
// Add one more entry to the string index.
// makes deserializing easier
stringIndex.Add(bigString.Length);
// Write the string that contains all of the strings, combined
writer.Write(bigString.ToString());
// write the index
foreach (var ix in stringIndex)
{
writer.Write(ix);
}
}
}
}
Reading is just slightly more involved:
private Dictionary<string, Guid> ReadGuidFile(string filename)
{
using (var fs = File.OpenRead(filename))
{
using (var reader = new BinaryReader(fs, Encoding.UTF8))
{
// read the count
int count = reader.ReadInt32();
// The guids are in a huge byte array sized 16*count
byte[] guidsBuffer = new byte[16*count];
reader.Read(guidsBuffer, 0, guidsBuffer.Length);
// Strings are all concatenated into one
var bigString = reader.ReadString();
// Index is an array of int. We can read it as an array of
// ((count+1) * 4) bytes.
byte[] indexBuffer = new byte[4*(count+1)];
reader.Read(indexBuffer, 0, indexBuffer.Length);
var guids = new Dictionary<string, Guid>(count);
byte[] guidBytes = new byte[16];
int startix = 0;
int endix = 0;
for (int i = 0; i < count; ++i)
{
endix = BitConverter.ToInt32(indexBuffer, 4*(i+1));
string key = bigString.Substring(startix, endix - startix);
Buffer.BlockCopy(guidsBuffer, (i*16),
guidBytes, 0, 16);
guids.Add(key, new Guid(guidBytes));
startix = endix;
}
return guids;
}
}
}
A couple of notes here. First, I'm using BitConverter to convert the data in the byte arrays to integers. It would be faster to use unsafe code and just index into the arrays using an int32*.
You might gain some speed by using pointers to index into the guidBuffer and calling Guid Constructor (Int32, Int16, Int16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte) rather than using Buffer.BlockCopy to copy the GUID into the temporary array.
You could make the string index an index of lengths rather than the starting positions. That would eliminate the need for the extra value at the end of the array, but it's unlikely that it'd make any difference in the speed.
There might be other optimization opportunities, but I think you get the general idea here.

Parsing one terabyte of text and efficiently counting the number of occurrences of each word

Recently I came across an interview question to create a algorithm in any language which should do the following
Read 1 terabyte of content
Make a count for each reoccuring word in that content
List the top 10 most frequently occurring words
Could you let me know the best possible way to create an algorithm for this?
Edit:
OK, let's say the content is in English. How we can find the top 10 words that occur most frequently in that content? My other doubt is, if purposely they are giving unique data then our buffer will expire with heap size overflow. We need to handle that as well.
Interview Answer
This task is interesting without being too complex, so a great way to start a good technical discussion. My plan to tackle this task would be:
Split input data in words, using white space and punctuation as delimiters
Feed every word found into a Trie structure, with counter updated in nodes representing a word's last letter
Traverse the fully populated tree to find nodes with highest counts
In the context of an interview ... I would demonstrate the idea of Trie by drawing the tree on a board or paper. Start from empty, then build the tree based on a single sentence containing at least one recurring word. Say "the cat can catch the mouse". Finally show how the tree can then be traversed to find highest counts. I would then justify how this tree provides good memory usage, good word lookup speed (especially in the case of natural language for which many words derive from each other), and is suitable for parallel processing.
Draw on the board
Demo
The C# program below goes through 2GB of text in 75secs on an 4 core xeon W3520, maxing out 8 threads. Performance is around 4.3 million words per second with less than optimal input parsing code. With the Trie structure to store words, memory is not an issue when processing natural language input.
Notes:
test text obtained from the Gutenberg project
input parsing code assumes line breaks and is pretty sub-optimal
removal of punctuation and other non-word is not done very well
handling one large file instead of several smaller one would require a small amount of code to start reading threads between specified offset within the file.
using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
TrieNode root = new TrieNode(null, '?');
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
if (args.Length > 0)
{
foreach (string path in args)
{
DataReader new_reader = new DataReader(path, ref root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
int distinct_word_count = 0;
int total_word_count = 0;
root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 1;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ref TrieNode root)
{
m_root = root;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private ConcurrentDictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new ConcurrentDictionary<char, TrieNode>();
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.TryAdd(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
lock (this)
{
m_word_count++;
}
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
if (m_parent == null) return "";
else return m_parent.ToString() + m_char;
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
Here the output from processing the same 20MB of text 100 times across 8 threads.
Counting words...
Input data processed in 75.2879952 secs
Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times
323618000 words counted
60896 distinct words found
A great deal here depends on some things that haven't been specified. For example, are we trying to do this once, or are we trying to build a system that will do this on a regular and ongoing basis? Do we have any control over the input? Are we dealing with text that's all in a single language (e.g., English) or are many languages represented (and if so, how many)?
These matter because:
If the data starts out on a single hard drive, parallel counting (e.g., map-reduce) isn't going to do any real good -- the bottleneck is going to be the transfer speed from the disk. Making copies to more disks so we can count faster will be slower than just counting directly from the one disk.
If we're designing a system to do this on a regular basis, most of our emphasis is really on the hardware -- specifically, have lots of disks in parallel to increase our bandwidth and at least get a little closer to keeping up with the CPU.
No matter how much text you're reading, there's a limit on the number of discrete words you need to deal with -- whether you have a terabyte of even a petabyte of English text, you're not going to see anything like billions of different words in English. Doing a quick check, the Oxford English Dictionary lists approximately 600,000 words in English.
Although the actual words are obviously different between languages, the number of words per language is roughly constant, so the size of the map we build will depend heavily on the number of languages represented.
That mostly leaves the question of how many languages could be represented. For the moment, let's assume the worst case. ISO 639-2 has codes for 485 human languages. Let's assume an average of 700,000 words per language, and an average word length of, say, 10 bytes of UTF-8 per word.
Just stored as simple linear list, that means we can store every word in every language on earth along with an 8-byte frequency count in a little less than 6 gigabytes. If we use something like a Patricia trie instead, we can probably plan on that shrinking at least somewhat -- quite possibly to 3 gigabytes or less, though I don't know enough about all those languages to be at all sure.
Now, the reality is that we've almost certainly overestimated the numbers in a number of places there -- quite a few languages share a fair number of words, many (especially older) languages probably have fewer words than English, and glancing through the list, it looks like some are included that probably don't have written forms at all.
Summary: Almost any reasonably new desktop/server has enough memory to hold the map entirely in RAM -- and more data won't change that. For one (or a few) disks in parallel, we're going to be I/O-bound anyway, so parallel counting (and such) will probably be a net loss. We probably need tens of disks in parallel before any other optimization means much.
You can try a map-reduce approach for this task. The advantage of map-reduce is scalability, so even for 1TB, or 10TB or 1PB - the same approach will work, and you will not need to do a lot of work in order to modify your algorithm for the new scale. The framework will also take care for distributing the work among all machines (and cores) you have in your cluster.
First - Create the (word,occurances) pairs.
The pseudo code for this will be something like that:
map(document):
for each word w:
EmitIntermediate(w,"1")
reduce(word,list<val>):
Emit(word,size(list))
Second you can find the ones with the topK highest occurances easily with a single iteration over the pairs, This thread explains this concept. The main idea is to hold a min-heap of top K elements, and while iterating - make sure the heap always contains the top K elements seen so far. When you are done - the heap contains the top K elements.
A more scalable (though slower if you have few machines) alternative is you use the map-reduce sorting functionality, and sort the data according to the occurances, and just grep the top K.
Three things of note for this.
Specifically: File to large to hold in memory, word list (potentially) too large to hold in memory, word count can be too large for a 32 bit int.
Once you get through those caveats, it should be straight forward. The game is managing the potentially large word list.
If it's any easier (to keep your head from spinning).
"You're running a Z-80 8 bit machine, with 65K of RAM and have a 1MB file..."
Same exact problem.
It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time and space efficient and quite simple to implement, for instance:
Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
Count-min sketch, to get an estimated count for any word
Those data structures require only very little constant space (exact amount depends on error you can tolerate).
See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
I'd be quite tempted to use a DAWG (wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, efficient memory wise and performs very well for lookups.
EDIT: Though have you tried simply using a Dictionary<string, int>? Where <string, int> represents word and count? Perhaps you're trying to optimize too early?
editor's note: This post originally linked to this wikipedia article, which appears to be about another meaning of the term DAWG: A way of storing all substrings of one word, for efficient approximate string-matching.
A different solution could be using an SQL table, and let the system handle the data as good as it can. First create the table with the single field word, for each word in the collection.
Then use the query (sorry for syntax issue, my SQL is rusty - this is a pseudo-code actually):
SELECT DISTINCT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC
The general idea is to first generate a table (which is stored on disk) with all words, and then use a query to count and sort (word,occurances) for you. You can then just take the top K from the retrieved list.
To all: If I indeed have any syntax or other issues in the SQL statement: feel free to edit
First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.
I did see in the comments several people making suggestions on how to improve its performance, but these were only minor tweaks so I'd thought I'd share with you what I found to be the real bottle neck... the ConcurrentDictionary.
I'd wanted to play around with thread local storage and your sample gave me a great opportunity to do that and after some minor changes to use a Dictionary per thread and then combine the dictionaries after the Join() saw the performance improve ~30% (processing 20MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).
The code is pasted below and you'll notice not much changed from the approved answer.
P.S. I don't have more than 50 reputation points so I could not put this in a comment.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
List<ThreadLocal<TrieNode>> roots;
if (args.Length == 0)
{
roots = new List<ThreadLocal<TrieNode>>(1);
}
else
{
roots = new List<ThreadLocal<TrieNode>>(args.Length);
foreach (string path in args)
{
ThreadLocal<TrieNode> root = new ThreadLocal<TrieNode>(() =>
{
return new TrieNode(null, '?');
});
roots.Add(root);
DataReader new_reader = new DataReader(path, root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
{
roots[0].Value.CombineNode(root.Value);
root.Dispose();
}
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
int distinct_word_count = 0;
int total_word_count = 0;
roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
roots[0].Dispose();
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
Console.ReadLine();
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 100;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ThreadLocal<TrieNode> root)
{
m_root = root.Value;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private Dictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new Dictionary<char, TrieNode>();
}
public void CombineNode(TrieNode from)
{
foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
{
char keyChar = fromChild.Key;
if (!m_children.ContainsKey(keyChar))
{
m_children.Add(keyChar, new TrieNode(this, keyChar));
}
m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
m_children[keyChar].CombineNode(fromChild.Value);
}
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.Add(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
m_word_count++;
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
return BuildString(new StringBuilder()).ToString();
}
private StringBuilder BuildString(StringBuilder builder)
{
if (m_parent == null)
{
return builder;
}
else
{
return m_parent.BuildString(builder).Append(m_char);
}
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
As a quick general algorithm I would do this.
Create a map with entries being the count for a specific word and the key being the actual string.
for each string in content:
if string is a valid key for the map:
increment the value associated with that key
else
add a new key/value pair to the map with the key being the word and the count being one
done
Then you could just find the largest value in the map
create an array size 10 with data pairs of (word, count)
for each value in the map
if current pair has a count larger than the smallest count in the array
replace that pair with the current one
print all pairs in array
Well, personally, I'd split the file into different sizes of say 128mb, maintaining two in memory all the time while scannng, any discovered word is added to a Hash list, and List of List count, then I'd iterate the list of list at the end to find the top 10...
Well the 1st thought is to manage a dtabase in form of hashtable /Array or whatever to save each words occurence, but according to the data size i would rather:
Get the 1st 10 found words where occurence >= 2
Get how many times these words occure in the entire string and delete them while counting
Start again, once you have two sets of 10 words each you get the most occured 10 words of both sets
Do the same for the rest of the string(which dosnt contain these
words anymore).
You can even try to be more effecient and start with 1st found 10 words where occurence >= 5 for example or more, if not found reduce this value until 10 words found. Throuh this you have a good chance to avoid using memory intensivly saving all words occurences which is a huge amount of data, and you can save scaning rounds (in a good case)
But in the worst case you may have more rounds than in a conventional algorithm.
By the way its a problem i would try to solve with a functional programing language rather than OOP.
The method below will only read your data once and can be tuned for memory sizes.
Read the file in chunks of say 1GB
For each chunk make a list of say the 5000 most occurring words with their frequency
Merge the lists based on frequency (1000 lists with 5000 words each)
Return the top 10 of the merged list
Theoretically you might miss words, althoug I think that chance is very very small.
Storm is the technogy to look at. It separates the role of data input (spouts ) from processors (bolts). The chapter 2 of the storm book solves your exact problem and describes the system architecture very well -
http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
Storm is more real time processing as opposed to batch processing with Hadoop. If your data is per existing then you can distribute loads to different spouts and spread them for processing to different bolts .
This algorithm also will enable support for data growing beyond terabytes as the date will be analysed as it arrives in real time.
Very interesting question. It relates more to logic analysis than coding. With the assumption of English language and valid sentences it comes easier.
You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English is 5.1). Therefore you will not have problems with memory.
As for reading the file you should choose a parallel mode, reading chunks (size of your choice) by manipulating file positions for white spaces. If you decide to read chunks of 1MB for example all chunks except the first one should be a bit wider (+22 bytes from left and +22 bytes from right where 22 represents the longest English word - if I'm right). For parallel processing you will need a concurrent dictionary or local collections that you will merge.
Keep in mind that normally you will end up with a top ten as part of a valid stop word list (this is probably another reverse approach which is also valid as long as the content of the file is ordinary).
Try to think of special data structure to approach this kind of problems. In this case special kind of tree like trie to store strings in specific way, very efficient. Or second way to build your own solution like counting words. I guess this TB of data would be in English then we do have around 600,000 words in general so it'll be possible to store only those words and counting which strings would be repeated + this solution will need regex to eliminate some special characters. First solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
here is implementation of tire in java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
MapReduce
WordCount can be acheived effciently through mapreduce using hadoop.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Large files can be parsed through it.It uses multiple nodes in cluster to perform this operation.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}

Categories

Resources