Huge files(~1Gb) into array to manipulate - c#

In this weekend, I have wanted to enhance my Apple's right click dictionaries. It has to be in some determined format(has nothing to do with this part of the question). For example, Oxford dictionary shows, when clicked, -ed, -ing, -s forms as bare verb form, but not same for Collin's. To integrate it, I need to add -ed, -ing, -s forms of verb words.
Example word brace in Oxford dict is defined like
<d:entry id="_4ma" d:title="brace">
<d:index d:value="brace" d:title="brace"/><d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/><a name="49f1b05b3013431b8dbb9ff989c2755a_brace" nattr="head-jump-entry-flag"></a>
As you can see, there are <d:index d:value=" multiple entries.
In Collin's,
<d:entry id="_379" d:title="brace">
<d:index d:value="brace" d:title="brace"/><div id="collins_english_dictionary">
As you can see again, there are no the following entries unlike that of Oxford,
<d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/>
What the following code snippet does is to find missing entries of every word which are in the destination dictionary(Collin's) and make the following format ultimately. (Of no importance of being duplicated.)
<d:entry id="_379" d:title="brace">
<d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/>
<d:index d:value="brace" d:title="brace"/><div id="collins_english_dictionary">
The problem is that the files are too big one of them is of almost 500MB size while the other is of 1Gb size(almost 4 million lines Oxford, 120 thousand lines in Collins due to being collapsed tags). The IDE or the compiler leaves in the lurch before completion successfully. How can we succeed?
using System.Text.RegularExpressions;
var source = "/Users/soner/Desktop/Oxford Advanced Learner Dictionary/Oxford Advanced Learner's Dictionary.xml";
var lines = File.ReadLines(source);
var destination = "/Users/soner/Desktop/Collins COBUILD Advanced English Dictionary/Collins COBUILD Advanced English Dictionary.xml";
var destinationLines = File.ReadLines(destination).ToList();
var sourceLines = lines.ToList();
int destIndex = 0;
for (int i = 0; i < sourceLines.Count - 1; i++)
{
if (sourceLines[i].Contains("<d:entry id="))
{
++i; // we are almost sure searched part is immediately in the next line, if not q==-1 cotinue;
var aTagLine = sourceLines[i];
var q = aTagLine.IndexOf("<a name", StringComparison.Ordinal);
if (q == -1) continue;
var wantedPart = aTagLine[..q];
// if there is no form of the word being searched -ing -s or -ed tags
if(wantedPart.Split("d:index").Length <= 2) continue;
// the word to be searched for
var word = Regex.Matches(wantedPart, "(?<=d:title=\").*?(?=\")")[0].ToString();
int memoriedIndex = destIndex;
// from now on we are in destination dictionary to add missing entities -ed,-ing etc. forms
for (; destIndex < destinationLines.Count; destIndex++)
{
if (destinationLines[destIndex].Contains("<d:entry id=") && destinationLines[destIndex].Contains($"{word}\">"))
{
destinationLines[destIndex] = $"{destinationLines[destIndex]}\n{wantedPart}";
break;
}
// I need this one because
// what if destination dictionary doesn't have the word searched in source dictionary
// it needs go some steps backward to go on
if (destIndex - memoriedIndex > 100)
{
destIndex = memoriedIndex;
break;
}
}
}
}
System.IO.File.WriteAllLines("/Users/soner/Desktop/soner.xml", destinationLines);

First, you should definitely work with filestreams/stringstreams and ensure that async file IO is enabled. Do not read the whole file at once. Additionally, you can do all text-processing and datastructure building on separate threads.
You should definitely transform the raw xml data into a more usable data structure (in memory, or possibly something else) optimized for your use case, before working with it, as was already mentioned. I'm not gonna go into detail which one, I didn't look at your use case too closely.
You would have to experiment a little which one's faster, processing using XmlReader or regex. Usually, I would always recommend XmlReader, except in very simple and performance critical situations, when something else might be more efficient. Make sure to instantiate the regex only once (don't use the statig methods) and set the Compiled option.
These are just some thoughts to get you started, not a complete answer.

Related

Improve performance of TryGetValue

I am creating an Excel file using Open XML SDK. In this process, I have a scenario like below.
I need to add data into a Dictionary<uint, string> if key is not exists. For that I am using below code.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = 0; i < dataLines.Count; i++)
{
var x = dataLines[i];
if (!dataDictionary.TryGetValue(x.RowIndex.Value, out var res)) // 700 seconds, 1,279,999,998 Hit counts
{
dataDictionary.Add(x.RowIndex.Value, x.OuterXml);
}
}
When I am trying to create an Excel sheet which has rows around 90,000 - 92,000, the line with the IF condition in above code consume 700 seconds to complete. (checked with a performance profiler, also this line has 1,279,999,998 Hit counts).
How could I reduce the time the line with the IF condition in above code consumes?
Is there any better way to achive this with less time?
If the if statement is slow, one option you have is to eliminate it entirely and use the indexer of the dictionary to set the value. This means that the "last match will win". If you want the "first match to win", all you have to do is reverse the order you are iterating the list.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = dataLines.Count - 1; i >= 0; i--)
{
var x = dataLines[i];
dataDictionary[x.RowIndex.Value] = x.OuterXml;
}
If x.RowIndex.Value is unique, it doesn't matter which direction you iterate.
If it is important that the key is sorted in ascending order, you can use a SortedDictionary<TKey, TValue>.
But as others have pointed out, it seems odd that you have so many hit counts. There is probably recursion going on in your application that you need to track down.

Retrieve data from array and display in textbox, 5 strings at a time

I want to display specific arrays from a very large text file.
Below the coding is part of the file.
What I want to do is display specific strings from the text file.
For example the example shows the Footlocker page. On the Footlocker Shop page I want to retrieve the last 5 updates in the text file beginning with "footlocker" posting only Footlocker's most recent posts. I have tried many ways including array.sort I am not sure how you would do this. Thanks for your help.
Footlocker's page
//declaring string
string footlockerPosts =
sr.ReadToEnd();
//initialising string
string[] footlockerArray = footlockerPosts.Split('\n');
string[] sort = footlockerArray;
var target = "F";
var results = Array.FindAll(sort, f => f.Equals(target));
for (int i = footlockerArray.Length - 1; i > footlockerArray.Length - 7; i--)
{
footlockerArray.Reverse();
footlockerExistingBlogTextBox.Text += footlockerArray[i];
}
sr.Close();
return;
}
This is a small snippet of my file.
File
Footlocker,Rick,What a fabulous shop.
Footlocker,Ioela,Fantastic and incredible service.
Footlocker,Fisi,Can't wait to go back to shop!
Footlocker,Allui,Lovin' the new design and layout!
Footlocker,Rich,Can't wait for next season clothing range.
Hypebeast,Johnny,I didn’t get proper service from the shop assistant.
Hypebeast,Dalas,Awesome range of goods, great service.
Hypebeast,King,Cool music great staff.
Hypebeast,Nelson,Overated shop.
Hypebeast,Rick,Lovely place lovely people.
Hypebeast,Rick,What a fabulous shop.
Hypebeast,Ioela,Fantastic and incredible service.
Hypebeast,Fisi,Can't wait to go back to shop!
Hypebeast,Allui,Lovin' the new design and layout!
Hypebeast,Rich,Can't wait for next season clothing range.
Lonestar,Johnny,I didn’t get proper service from the waiter.
Lonestar,Dalas,Awesome range of food, great service.
Lonestar,King,Cool music great staff.
Lonestar,Nelson,Overated restaurant.
Lonestar,Rick,Lovely place lovely people.
Try the following and let me know if it gets you the results you want
var results = sort.Where(r => r.StartsWith(target)).Reverse();
foreach (string result in results)
{
footlockerExistingBlogTextBox.Text += result;
}
Hope that helps!
Also just a note: There is not reason to assign another variable equal to the original array. You can remove this line string[] sort = footlockerArray; and point to the original footlockerArray.

regex performance degrades

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem because my regexes start out fast and slow down.
For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it's slowed down to the point where its taking about 600ms. Does anyone know why?
It was worse, but I improved it by using instances of RegEx instead of the cached version and compiling the expressions that I could.
Some of my regexes need to vary e.g. depending on the user's name it might be
mike said (\w*) or john said (\w*)
My understanding is that it is not possible to compile those regexes and pass in parameters (e.g saidRegex.Match(inputString, userName)).
Does anyone have any suggestions?
[Edited to accurately reflect speed - was per 1000 strings, not per string]
This may not be a direct answer to your question about RegEx performance degradation - which is somewhat fascinating. However - after reading all of the commentary and discussion above - I'd suggest the following:
Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:
Player_Name | Monetary_Value
If you were to create a database table containing these values per-row, and then catch each new row as it is being created - parse it - and append to the data table - you could easily do any kind of analysis / calculation against the data - without having to parse 25M rows again and again (which is a waste).
Additionally - on the first run, if you were to break the 25M records down into 100,000 record blocks, then run the algorithm 250 times (100,000 x 250 = 25,000,000) - you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.
In other words - consider the following:
Create a database table as follows:
CREATE TABLE PlayerActions (
RowID INT PRIMARY KEY IDENTITY,
Player_Name VARCHAR(50) NOT NULL,
Monetary_Value MONEY NOT NULL
)
Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.
public void ParseFullDataSet(IEnumerable<String> dataSource) {
var rowCount = dataSource.Count();
var setCount = Math.Floor(rowCount / 100000) + 1;
if (rowCount % 100000 != 0)
setCount++;
for (int i = 0; i < setCount; i++) {
var set = dataSource.Skip(i * 100000).Take(100000);
ParseSet(set);
}
}
public void ParseSet(IEnumerable<String> dataSource) {
String playerName = String.Empty;
decimal monetaryValue = 0.0m;
// Assume here that the method reflects your RegEx generator.
String regex = RegexFactory.Generate();
for (String data in dataSource) {
Match match = Regex.Match(data, regex);
if (match.Success) {
playerName = match.Groups[1].Value;
// Might want to add error handling here.
monetaryValue = Convert.ToDecimal(match.Groups[2].Value);
db.PlayerActions.Add(new PlayerAction() {
// ID = ..., // Set at DB layer using Auto_Increment
Player_Name = playerName,
Monetary_Value = monetaryValue
});
db.SaveChanges();
// If not using Entity Framework, use another method to insert
// a row to your database table.
}
}
}
Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
ParseSet(new List<String>() { newValue });
or if multiples are created at once, call:
ParseSet(newValues); // Where newValues is an IEnumerable<String>
Now you can do whatever computational analysis or data mining you want from the data, without having to worry about performance over 25m rows on-the-fly.
Regex does takes time to compute. However, U can make it compact using some tricks.
You can also use string functions in C# to avoid regex function.
The code would be lengthy but might improve performance.
String has several functions to cut and extract characters and do pattern matching as u need.
like eg: IndeOfAny, LastIndexOf, Contains....
string str= "mon";
string[] str2= new string[] {"mon","tue","wed"};
if(str2.IndexOfAny(str) >= 0)
{
//success code//
}

Parsing one terabyte of text and efficiently counting the number of occurrences of each word

Recently I came across an interview question to create a algorithm in any language which should do the following
Read 1 terabyte of content
Make a count for each reoccuring word in that content
List the top 10 most frequently occurring words
Could you let me know the best possible way to create an algorithm for this?
Edit:
OK, let's say the content is in English. How we can find the top 10 words that occur most frequently in that content? My other doubt is, if purposely they are giving unique data then our buffer will expire with heap size overflow. We need to handle that as well.
Interview Answer
This task is interesting without being too complex, so a great way to start a good technical discussion. My plan to tackle this task would be:
Split input data in words, using white space and punctuation as delimiters
Feed every word found into a Trie structure, with counter updated in nodes representing a word's last letter
Traverse the fully populated tree to find nodes with highest counts
In the context of an interview ... I would demonstrate the idea of Trie by drawing the tree on a board or paper. Start from empty, then build the tree based on a single sentence containing at least one recurring word. Say "the cat can catch the mouse". Finally show how the tree can then be traversed to find highest counts. I would then justify how this tree provides good memory usage, good word lookup speed (especially in the case of natural language for which many words derive from each other), and is suitable for parallel processing.
Draw on the board
Demo
The C# program below goes through 2GB of text in 75secs on an 4 core xeon W3520, maxing out 8 threads. Performance is around 4.3 million words per second with less than optimal input parsing code. With the Trie structure to store words, memory is not an issue when processing natural language input.
Notes:
test text obtained from the Gutenberg project
input parsing code assumes line breaks and is pretty sub-optimal
removal of punctuation and other non-word is not done very well
handling one large file instead of several smaller one would require a small amount of code to start reading threads between specified offset within the file.
using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
TrieNode root = new TrieNode(null, '?');
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
if (args.Length > 0)
{
foreach (string path in args)
{
DataReader new_reader = new DataReader(path, ref root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
int distinct_word_count = 0;
int total_word_count = 0;
root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 1;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ref TrieNode root)
{
m_root = root;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private ConcurrentDictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new ConcurrentDictionary<char, TrieNode>();
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.TryAdd(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
lock (this)
{
m_word_count++;
}
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
if (m_parent == null) return "";
else return m_parent.ToString() + m_char;
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
Here the output from processing the same 20MB of text 100 times across 8 threads.
Counting words...
Input data processed in 75.2879952 secs
Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times
323618000 words counted
60896 distinct words found
A great deal here depends on some things that haven't been specified. For example, are we trying to do this once, or are we trying to build a system that will do this on a regular and ongoing basis? Do we have any control over the input? Are we dealing with text that's all in a single language (e.g., English) or are many languages represented (and if so, how many)?
These matter because:
If the data starts out on a single hard drive, parallel counting (e.g., map-reduce) isn't going to do any real good -- the bottleneck is going to be the transfer speed from the disk. Making copies to more disks so we can count faster will be slower than just counting directly from the one disk.
If we're designing a system to do this on a regular basis, most of our emphasis is really on the hardware -- specifically, have lots of disks in parallel to increase our bandwidth and at least get a little closer to keeping up with the CPU.
No matter how much text you're reading, there's a limit on the number of discrete words you need to deal with -- whether you have a terabyte of even a petabyte of English text, you're not going to see anything like billions of different words in English. Doing a quick check, the Oxford English Dictionary lists approximately 600,000 words in English.
Although the actual words are obviously different between languages, the number of words per language is roughly constant, so the size of the map we build will depend heavily on the number of languages represented.
That mostly leaves the question of how many languages could be represented. For the moment, let's assume the worst case. ISO 639-2 has codes for 485 human languages. Let's assume an average of 700,000 words per language, and an average word length of, say, 10 bytes of UTF-8 per word.
Just stored as simple linear list, that means we can store every word in every language on earth along with an 8-byte frequency count in a little less than 6 gigabytes. If we use something like a Patricia trie instead, we can probably plan on that shrinking at least somewhat -- quite possibly to 3 gigabytes or less, though I don't know enough about all those languages to be at all sure.
Now, the reality is that we've almost certainly overestimated the numbers in a number of places there -- quite a few languages share a fair number of words, many (especially older) languages probably have fewer words than English, and glancing through the list, it looks like some are included that probably don't have written forms at all.
Summary: Almost any reasonably new desktop/server has enough memory to hold the map entirely in RAM -- and more data won't change that. For one (or a few) disks in parallel, we're going to be I/O-bound anyway, so parallel counting (and such) will probably be a net loss. We probably need tens of disks in parallel before any other optimization means much.
You can try a map-reduce approach for this task. The advantage of map-reduce is scalability, so even for 1TB, or 10TB or 1PB - the same approach will work, and you will not need to do a lot of work in order to modify your algorithm for the new scale. The framework will also take care for distributing the work among all machines (and cores) you have in your cluster.
First - Create the (word,occurances) pairs.
The pseudo code for this will be something like that:
map(document):
for each word w:
EmitIntermediate(w,"1")
reduce(word,list<val>):
Emit(word,size(list))
Second you can find the ones with the topK highest occurances easily with a single iteration over the pairs, This thread explains this concept. The main idea is to hold a min-heap of top K elements, and while iterating - make sure the heap always contains the top K elements seen so far. When you are done - the heap contains the top K elements.
A more scalable (though slower if you have few machines) alternative is you use the map-reduce sorting functionality, and sort the data according to the occurances, and just grep the top K.
Three things of note for this.
Specifically: File to large to hold in memory, word list (potentially) too large to hold in memory, word count can be too large for a 32 bit int.
Once you get through those caveats, it should be straight forward. The game is managing the potentially large word list.
If it's any easier (to keep your head from spinning).
"You're running a Z-80 8 bit machine, with 65K of RAM and have a 1MB file..."
Same exact problem.
It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time and space efficient and quite simple to implement, for instance:
Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
Count-min sketch, to get an estimated count for any word
Those data structures require only very little constant space (exact amount depends on error you can tolerate).
See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
I'd be quite tempted to use a DAWG (wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, efficient memory wise and performs very well for lookups.
EDIT: Though have you tried simply using a Dictionary<string, int>? Where <string, int> represents word and count? Perhaps you're trying to optimize too early?
editor's note: This post originally linked to this wikipedia article, which appears to be about another meaning of the term DAWG: A way of storing all substrings of one word, for efficient approximate string-matching.
A different solution could be using an SQL table, and let the system handle the data as good as it can. First create the table with the single field word, for each word in the collection.
Then use the query (sorry for syntax issue, my SQL is rusty - this is a pseudo-code actually):
SELECT DISTINCT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC
The general idea is to first generate a table (which is stored on disk) with all words, and then use a query to count and sort (word,occurances) for you. You can then just take the top K from the retrieved list.
To all: If I indeed have any syntax or other issues in the SQL statement: feel free to edit
First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.
I did see in the comments several people making suggestions on how to improve its performance, but these were only minor tweaks so I'd thought I'd share with you what I found to be the real bottle neck... the ConcurrentDictionary.
I'd wanted to play around with thread local storage and your sample gave me a great opportunity to do that and after some minor changes to use a Dictionary per thread and then combine the dictionaries after the Join() saw the performance improve ~30% (processing 20MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).
The code is pasted below and you'll notice not much changed from the approved answer.
P.S. I don't have more than 50 reputation points so I could not put this in a comment.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
List<ThreadLocal<TrieNode>> roots;
if (args.Length == 0)
{
roots = new List<ThreadLocal<TrieNode>>(1);
}
else
{
roots = new List<ThreadLocal<TrieNode>>(args.Length);
foreach (string path in args)
{
ThreadLocal<TrieNode> root = new ThreadLocal<TrieNode>(() =>
{
return new TrieNode(null, '?');
});
roots.Add(root);
DataReader new_reader = new DataReader(path, root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
{
roots[0].Value.CombineNode(root.Value);
root.Dispose();
}
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
int distinct_word_count = 0;
int total_word_count = 0;
roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
roots[0].Dispose();
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
Console.ReadLine();
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 100;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ThreadLocal<TrieNode> root)
{
m_root = root.Value;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private Dictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new Dictionary<char, TrieNode>();
}
public void CombineNode(TrieNode from)
{
foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
{
char keyChar = fromChild.Key;
if (!m_children.ContainsKey(keyChar))
{
m_children.Add(keyChar, new TrieNode(this, keyChar));
}
m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
m_children[keyChar].CombineNode(fromChild.Value);
}
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.Add(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
m_word_count++;
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
return BuildString(new StringBuilder()).ToString();
}
private StringBuilder BuildString(StringBuilder builder)
{
if (m_parent == null)
{
return builder;
}
else
{
return m_parent.BuildString(builder).Append(m_char);
}
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
As a quick general algorithm I would do this.
Create a map with entries being the count for a specific word and the key being the actual string.
for each string in content:
if string is a valid key for the map:
increment the value associated with that key
else
add a new key/value pair to the map with the key being the word and the count being one
done
Then you could just find the largest value in the map
create an array size 10 with data pairs of (word, count)
for each value in the map
if current pair has a count larger than the smallest count in the array
replace that pair with the current one
print all pairs in array
Well, personally, I'd split the file into different sizes of say 128mb, maintaining two in memory all the time while scannng, any discovered word is added to a Hash list, and List of List count, then I'd iterate the list of list at the end to find the top 10...
Well the 1st thought is to manage a dtabase in form of hashtable /Array or whatever to save each words occurence, but according to the data size i would rather:
Get the 1st 10 found words where occurence >= 2
Get how many times these words occure in the entire string and delete them while counting
Start again, once you have two sets of 10 words each you get the most occured 10 words of both sets
Do the same for the rest of the string(which dosnt contain these
words anymore).
You can even try to be more effecient and start with 1st found 10 words where occurence >= 5 for example or more, if not found reduce this value until 10 words found. Throuh this you have a good chance to avoid using memory intensivly saving all words occurences which is a huge amount of data, and you can save scaning rounds (in a good case)
But in the worst case you may have more rounds than in a conventional algorithm.
By the way its a problem i would try to solve with a functional programing language rather than OOP.
The method below will only read your data once and can be tuned for memory sizes.
Read the file in chunks of say 1GB
For each chunk make a list of say the 5000 most occurring words with their frequency
Merge the lists based on frequency (1000 lists with 5000 words each)
Return the top 10 of the merged list
Theoretically you might miss words, althoug I think that chance is very very small.
Storm is the technogy to look at. It separates the role of data input (spouts ) from processors (bolts). The chapter 2 of the storm book solves your exact problem and describes the system architecture very well -
http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
Storm is more real time processing as opposed to batch processing with Hadoop. If your data is per existing then you can distribute loads to different spouts and spread them for processing to different bolts .
This algorithm also will enable support for data growing beyond terabytes as the date will be analysed as it arrives in real time.
Very interesting question. It relates more to logic analysis than coding. With the assumption of English language and valid sentences it comes easier.
You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English is 5.1). Therefore you will not have problems with memory.
As for reading the file you should choose a parallel mode, reading chunks (size of your choice) by manipulating file positions for white spaces. If you decide to read chunks of 1MB for example all chunks except the first one should be a bit wider (+22 bytes from left and +22 bytes from right where 22 represents the longest English word - if I'm right). For parallel processing you will need a concurrent dictionary or local collections that you will merge.
Keep in mind that normally you will end up with a top ten as part of a valid stop word list (this is probably another reverse approach which is also valid as long as the content of the file is ordinary).
Try to think of special data structure to approach this kind of problems. In this case special kind of tree like trie to store strings in specific way, very efficient. Or second way to build your own solution like counting words. I guess this TB of data would be in English then we do have around 600,000 words in general so it'll be possible to store only those words and counting which strings would be repeated + this solution will need regex to eliminate some special characters. First solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
here is implementation of tire in java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
MapReduce
WordCount can be acheived effciently through mapreduce using hadoop.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Large files can be parsed through it.It uses multiple nodes in cluster to perform this operation.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}

find word and score based on positions

hey guys i have a textfile i have divided it into 4 parts. i want to search each part for the words that appear in each part and score that word
exmaple
welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball.
i will want to return national = 1 as it appears only in one part etc
am working on determining text context using word position.
am working with c# and not very good in text processing
basically
if a word appears in the 4 sections it scores 4
if a word appears in the 3 sections it scores 3
if a word appears in the 2 sections it scores 2
if a word appears in the 1 section it scores 1
thanks in advance
so far i have this
var s = "welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball. ";
var numberOfParts = 4;
var eachPartLength = s.Length / numberOfParts;
var parts = new List<string>();
var words = Regex.Split(s, #"\W").Where(w => w.Length > 0); // this splits all words, removes empty strings
var wordsIndex = 0;
for (int i = 0; i < numberOfParts; i++)
{
var sb = new StringBuilder();
while (sb.Length < eachPartLength && wordsIndex < words.Count())
{
sb.AppendFormat("{0} ", words.ElementAt(wordsIndex));
wordsIndex++;
}
// here you have the part
Response.Write("[{0}]"+ sb);
parts.Add(sb.ToString());
var allwords = parts.SelectMany(p => p.Split(' ').Distinct());
var wordsInAllParts = allwords.Where(w => parts.All(p => p.Contains(w))).Distinct();
This question is very difficult to interpret. I don't fully understand your goal and it is my suspicion that you might not either.
In the absence of a clear requirement, there is no way to give a specific answer, so I will give a generic one:
Try writing a test that clearly specifies the exact behavior you want. You've got the beginnings of one with your sample string and the result you want but it's not unambiguous what you are looking for.
Make a test that, when it passes, demonstrates that one of the required behaviors is there. If that doesn't help you get a solution to the problem, come back and edit this question or make a new one that includes the test.
At the very least, you will be able to harvest better answers from this site.

Categories

Resources