I have been given an algorithm problem that requires me to implement the quicksort algorithm for both a linked list and an array.
I have done both parts and the algorithms work, but there seems to be a bug in my quicksort linked list implementation.
Here is my quicksort linked list implementation:
public static void SortLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
    if (low != null && low != high)
    {
        DataList.Node p = _PartitionLinkedList(items, low, high);
        SortLinkedList(items, low, p);
        SortLinkedList(items, p.Next(), null);
    }
}
private static DataList.Node _PartitionLinkedList(DataList items, DataList.Node low, DataList.Node high)
{
    DataList.Node pivot = low;
    DataList.Node i = low;
    for (DataList.Node j = i.Next(); j != high; j = j.Next())
    {
        if (j.Value().CompareTo(pivot.Value()) <= 0)
        {
            items.Swap(i.Next(), j);
            i = i.Next();
        }
    }
    items.Swap(pivot, i);
    return i;
}
Here is my quicksort array implementation:
public static void SortData(DataArray items, int low, int high)
{
    if (low < high)
    {
        int pi = _PartitionData(items, low, high);
        SortData(items, low, pi - 1);
        SortData(items, pi + 1, high);
    }
}
static int _PartitionData(DataArray arr, int low, int high)
{
    double pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++)
    {
        if (arr[j].CompareTo(pivot) <= 0)
        {
            i++;
            arr.Swap(i, j);
        }
    }
    arr.Swap(i + 1, high);
    return i + 1;
}
Here is the quicksort performance for the array and the linked list (left column: n, right column: time).
Picture
As you can see, the quicksort on the linked list took 10 minutes to sort 6400 elements. I don't think that's normal.
I also don't think it's because of the data structure itself, because I used the same structure for selection sort, and the performance for the linked list and the array was similar there.
GitHub repo in case I forgot to provide some code: Repo
10 minutes is a very long time for 6400 elements. It would normally require 2 or 3 horrible mistakes, not just one.
Unfortunately, I only see one horrible mistake: Your second recursive call to SortLinkedList(items, p.Next(), null); goes all the way to the end of the list. You meant for it to stop at high.
That might account for the 10 minutes, but it seems a little unlikely.
It also looks to me like your sort is incorrect, even after you fix the above bug -- be sure to test the output!
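For reference, a minimal sketch of the corrected recursion (everything else unchanged):

DataList.Node p = _PartitionLinkedList(items, low, high);
SortLinkedList(items, low, p);
SortLinkedList(items, p.Next(), high);   // stop at high instead of running to the end of the list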
I would look at your linked list, particularly the Swap method. Without seeing the implementation of the linked list, I think the problem area is there.
Is there a reason why you're using linked lists? They have O(n) access by index, which turns quicksort into an O(n² log n) sort.
A different way to do it is to add all the items in your linked list to a List, sort that list, and rebuild your linked list from it. List<T>.Sort() uses an introspective sort built around quicksort.
public static void SortLinkedList(DataList items)
{
    // assumes DataList exposes its first node, e.g. via a First() method
    List<object> actualList = new List<object>();
    for (DataList.Node j = items.First(); j != null; j = j.Next())
    {
        actualList.Add(j.Value());
    }
    actualList.Sort();
    items.Clear();
    for (int i = 0; i < actualList.Count; i++)
    {
        items.Add(actualList[i]);
    }
}
Quick sort for a linked list is normally slightly different than quick sort for arrays. Use the first node's data value as the pivot value. The code then creates 3 lists: one for values < pivot, one for values == pivot, and one for values > pivot. It then makes recursive calls on the < pivot and > pivot lists. When the recursive calls return, those 3 lists are sorted, so the code only needs to concatenate them (a short sketch follows after these notes).
To speed up concatenation of lists, keep track of a pointer to the last node. To simplify this, use circular lists, and use a pointer to the last node as the main way to access a list. This makes appending (joining) lists simpler (no scanning). Once inside a function, use last->next to get a pointer to the first node of a list.
Two of the worst case data patterns are already sorted data or already reverse sorted data. If the circular list with pointer to last node method is used, then the average of last and first nodes could be used as a median of 2 which could help (note the list for nodes == pivot could end up empty).
Worst case time complexity is O(n^2). Worst case stack usage is O(n). The stack usage could be reduced by using recursion on the smaller of the list < pivot and list > pivot. After return, the now sorted smaller list would be concatenated with the list == pivot and saved in a 4th list. Then the sort process would iterate on the remaining unsorted list, then merging (or perhaps joining) with the saved list.
Sorting a linked list, using any method, including bottom-up merge sort, will be slower than moving the linked list to an array, sorting the array, then creating a linked list from the sorted array. However, the quick sort method I describe will be much faster than using an array-oriented algorithm on a linked list.
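For illustration, here is a minimal sketch of the three-list approach described above. It uses a bare singly linked Node type as a stand-in (not the OP's DataList) and omits the circular-list/tail-pointer optimizations:

// Three-way quicksort of a singly linked list: partition into
// < pivot, == pivot and > pivot lists, recurse, then concatenate.
class Node
{
    public int Value;
    public Node Next;
}

static Node QuickSortList(Node head)
{
    if (head == null || head.Next == null) return head;

    int pivot = head.Value;                 // first node's value as the pivot
    Node less = null, equal = null, greater = null;

    for (Node n = head; n != null; )
    {
        Node next = n.Next;
        if (n.Value < pivot)      { n.Next = less;    less = n; }
        else if (n.Value > pivot) { n.Next = greater; greater = n; }
        else                      { n.Next = equal;   equal = n; }
        n = next;
    }

    less = QuickSortList(less);
    greater = QuickSortList(greater);

    // Concatenate: less + equal + greater (equal always holds at least the pivot node).
    return Concat(less, Concat(equal, greater));
}

static Node Concat(Node a, Node b)
{
    if (a == null) return b;
    Node tail = a;
    while (tail.Next != null) tail = tail.Next;   // O(n) scan; a tail pointer or circular list avoids this
    tail.Next = b;
    return a;
}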
Currently I have a list of integers. This list contains index values that point to "active" objects in another, much larger list. If the smaller list of "active" values becomes too large, it triggers a loop that iterates through the small list and removes values that have become inactive. Currently, it removes them by simply ignoring the inactive values and adding the still-active ones to a second list (and when the second list gets full the same process is repeated, placing them back into the first list, and so on).
After this trigger occurs, the list is then sorted using a Quicksort implementation. This is all fine and dandy.
-------Question---------
However, I see a potential speed gain. I am imagining combining the removal of inactive values with the sorting. Unfortunately, I cannot find a way to implement quicksort this way, simply because quicksort works with pivots, which means that if values are removed from the list, the pivot will eventually try to access a slot in the list that does not exist, and so on (unless I'm just thinking about it wrong).
So, any ideas on how to combine the two operations? I can't seem to find any sorting algorithms as fast as quicksort that could handle this, or perhaps I'm just not seeing how to implement it into a quicksort... any hints are appreciated!
Code for a better understanding of what's currently going on:
(Current Conditions: values can range from 0 to 2 million, no 2 values are the same, and in general they are mostly sorted, since they are sorted every so often)
if (deactive > 50000) // if the number of inactive coordinates is greater than 50k
{
    for (int i = 0; i < activeCoords1.Count; i++)
    {
        if (largeArray[activeCoords1[i]].active == true) // if coordinate is active, re-add it to the list
        {
            activeCoords2.Add(activeCoords1[i]);
        }
    }
    // clears the old list for future use
    activeCoords1.Clear();
    deactive = 0;
    // sorts the new list
    Quicksort(activeCoords2, 0, activeCoords2.Count - 1);
}
static void Quicksort(List<int> elements, int left, int right)
{
    int i = left, j = right;
    int pivot = elements[(left + right) / 2];
    while (i <= j)
    {
        // p < pivot
        while (elements[i].CompareTo(pivot) < 0)
        {
            i++;
        }
        while (elements[j].CompareTo(pivot) > 0)
        {
            j--;
        }
        if (i <= j)
        {
            // Swap
            int tmp = elements[i];
            elements[i] = elements[j];
            elements[j] = tmp;
            i++;
            j--;
        }
    }
    // Recursive calls
    if (left < j)
    {
        Quicksort(elements, left, j);
    }
    if (i < right)
    {
        Quicksort(elements, i, right);
    }
}
It sounds like you might benefit from using a red-black tree (or another balanced binary tree); your search, insert and delete times will be O(log n). The tree is always sorted, so there will be no one-off big hits incurred to re-sort.
What is your split in terms of types of access (search, insert, delete), and what are your constraints for each?
I would use a List<T> or a SortedDictionary<TKey, TValue> as your data structure.
As your reason for sorting ("micro-optimization based on feelings") is not a good one, I would refrain from it. A good reason would be "it has a measurable impact on performance".
In that case (or if you just want to do it), I recommend a SortedDictionary. All the sorting is already done for you; there is no reason to reinvent the wheel.
There is no need to juggle two Lists if one appropriate data structure suffices. A red-black tree seems appropriate, and it is apparently what SortedDictionary uses according to this.
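If it helps, here is a minimal sketch of what the SortedDictionary approach could look like; the names activeCoords and largeArray only echo the question, and the values are arbitrary:

using System;
using System.Collections.Generic;

class ActiveIndexDemo
{
    static void Main()
    {
        // keys stay sorted; insert, remove and lookup are O(log n)
        var activeCoords = new SortedDictionary<int, bool>();

        activeCoords[12345] = true;     // mark index 12345 as active
        activeCoords[42] = true;
        activeCoords.Remove(12345);     // deactivate: no separate re-sort pass needed

        foreach (int index in activeCoords.Keys)   // always iterated in ascending order
        {
            Console.WriteLine(index);   // e.g. process largeArray[index] here
        }
    }
}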
Recently I came across an interview question to create an algorithm in any language which should do the following:
Read 1 terabyte of content
Make a count of each recurring word in that content
List the top 10 most frequently occurring words
Could you let me know the best possible way to create an algorithm for this?
Edit:
OK, let's say the content is in English. How can we find the top 10 words that occur most frequently in that content? My other doubt is: if they purposely supply unique data, the buffer will overflow the heap. We need to handle that as well.
Interview Answer
This task is interesting without being too complex, so a great way to start a good technical discussion. My plan to tackle this task would be:
Split input data in words, using white space and punctuation as delimiters
Feed every word found into a Trie structure, with counter updated in nodes representing a word's last letter
Traverse the fully populated tree to find nodes with highest counts
In the context of an interview I would demonstrate the idea of the Trie by drawing the tree on a board or paper. Start from empty, then build the tree based on a single sentence containing at least one recurring word, say "the cat can catch the mouse". Finally, show how the tree can be traversed to find the highest counts. I would then justify how this tree provides good memory usage, good word-lookup speed (especially for natural language, in which many words derive from each other), and is suitable for parallel processing.
Draw on the board
Demo
The C# program below goes through 2 GB of text in 75 secs on a 4-core Xeon W3520, maxing out 8 threads. Performance is around 4.3 million words per second with less-than-optimal input parsing code. With the Trie structure storing the words, memory is not an issue when processing natural language input.
Notes:
test text obtained from the Gutenberg Project
input parsing code assumes line breaks and is pretty sub-optimal
removal of punctuation and other non-word characters is not done very well
handling one large file instead of several smaller ones would require a small amount of code to start the reading threads at specified offsets within the file
using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
TrieNode root = new TrieNode(null, '?');
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
if (args.Length > 0)
{
foreach (string path in args)
{
DataReader new_reader = new DataReader(path, ref root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
int distinct_word_count = 0;
int total_word_count = 0;
root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 1;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ref TrieNode root)
{
m_root = root;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private ConcurrentDictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new ConcurrentDictionary<char, TrieNode>();
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.TryAdd(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
lock (this)
{
m_word_count++;
}
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
if (m_parent == null) return "";
else return m_parent.ToString() + m_char;
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
Here is the output from processing the same 20 MB of text 100 times across 8 threads.
Counting words...
Input data processed in 75.2879952 secs
Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times
323618000 words counted
60896 distinct words found
A great deal here depends on some things that haven't been specified. For example, are we trying to do this once, or are we trying to build a system that will do this on a regular and ongoing basis? Do we have any control over the input? Are we dealing with text that's all in a single language (e.g., English) or are many languages represented (and if so, how many)?
These matter because:
If the data starts out on a single hard drive, parallel counting (e.g., map-reduce) isn't going to do any real good -- the bottleneck is going to be the transfer speed from the disk. Making copies to more disks so we can count faster will be slower than just counting directly from the one disk.
If we're designing a system to do this on a regular basis, most of our emphasis is really on the hardware -- specifically, have lots of disks in parallel to increase our bandwidth and at least get a little closer to keeping up with the CPU.
No matter how much text you're reading, there's a limit on the number of discrete words you need to deal with -- whether you have a terabyte or even a petabyte of English text, you're not going to see anything like billions of different words in English. Doing a quick check, the Oxford English Dictionary lists approximately 600,000 words in English.
Although the actual words are obviously different between languages, the number of words per language is roughly constant, so the size of the map we build will depend heavily on the number of languages represented.
That mostly leaves the question of how many languages could be represented. For the moment, let's assume the worst case. ISO 639-2 has codes for 485 human languages. Let's assume an average of 700,000 words per language, and an average word length of, say, 10 bytes of UTF-8 per word.
Just stored as simple linear list, that means we can store every word in every language on earth along with an 8-byte frequency count in a little less than 6 gigabytes. If we use something like a Patricia trie instead, we can probably plan on that shrinking at least somewhat -- quite possibly to 3 gigabytes or less, though I don't know enough about all those languages to be at all sure.
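For reference, the arithmetic behind that estimate: 485 languages × 700,000 words ≈ 3.4 × 10^8 entries, and at about 10 bytes of UTF-8 per word plus an 8-byte count that is roughly 3.4 × 10^8 × 18 bytes ≈ 6.1 × 10^9 bytes, i.e. a little under 6 GiB.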
Now, the reality is that we've almost certainly overestimated the numbers in a number of places there -- quite a few languages share a fair number of words, many (especially older) languages probably have fewer words than English, and glancing through the list, it looks like some are included that probably don't have written forms at all.
Summary: Almost any reasonably new desktop/server has enough memory to hold the map entirely in RAM -- and more data won't change that. For one (or a few) disks in parallel, we're going to be I/O-bound anyway, so parallel counting (and such) will probably be a net loss. We probably need tens of disks in parallel before any other optimization means much.
You can try a map-reduce approach for this task. The advantage of map-reduce is scalability, so even for 1 TB, 10 TB or 1 PB the same approach will work, and you will not need to do a lot of work to adapt your algorithm to the new scale. The framework will also take care of distributing the work among all the machines (and cores) you have in your cluster.
First - create the (word, occurrences) pairs.
The pseudo-code for this will be something like this:
map(document):
    for each word w:
        EmitIntermediate(w, "1")

reduce(word, list<val>):
    Emit(word, size(list))
Second, you can find the ones with the top-K highest occurrences easily with a single iteration over the pairs. This thread explains the concept. The main idea is to hold a min-heap of the top K elements and, while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
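As a rough sketch of that top-K pass (assuming the (word, count) pairs are already materialized; PriorityQueue requires .NET 6 or later):

using System.Collections.Generic;

static class TopKHelper
{
    // Keep a min-heap of size k; anything larger than its minimum displaces it.
    public static List<KeyValuePair<string, long>> TopK(
        IEnumerable<KeyValuePair<string, long>> counts, int k)
    {
        var heap = new PriorityQueue<KeyValuePair<string, long>, long>();
        foreach (var pair in counts)
        {
            if (heap.Count < k)
            {
                heap.Enqueue(pair, pair.Value);
            }
            else if (pair.Value > heap.Peek().Value)
            {
                heap.Dequeue();                   // evict the smallest of the current top k
                heap.Enqueue(pair, pair.Value);
            }
            // otherwise the pair cannot be in the top k; skip it
        }
        var result = new List<KeyValuePair<string, long>>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse();                          // largest count first
        return result;
    }
}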
A more scalable (though slower if you have few machines) alternative is to use the map-reduce sorting functionality, sort the data according to the occurrences, and just grep the top K.
Three things of note for this.
Specifically: the file is too large to hold in memory, the word list is (potentially) too large to hold in memory, and the word count can be too large for a 32-bit int.
Once you get through those caveats, it should be straightforward. The game is managing the potentially large word list.
If it's any easier (to keep your head from spinning).
"You're running a Z-80 8 bit machine, with 65K of RAM and have a 1MB file..."
Same exact problem.
It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time and space efficient and quite simple to implement, for instance:
Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
Count-min sketch, to get an estimated count for any word
Those data structures require only a very small, constant amount of space (the exact amount depends on the error you can tolerate).
See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
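As a flavour of how compact these structures are, here is a toy count-min sketch; the width, depth and per-row hash below are simplistic placeholders rather than tuned, production-quality choices:

using System;

// Toy count-min sketch: Add() increments one counter per row,
// EstimateCount() returns the minimum across rows (an upper bound on the true count).
class CountMinSketch
{
    private readonly long[,] table;
    private readonly int width;
    private readonly int depth;

    public CountMinSketch(int width = 2048, int depth = 4)
    {
        this.width = width;
        this.depth = depth;
        table = new long[depth, width];
    }

    // simple polynomial hash, seeded differently per row
    private int Bucket(string word, int row)
    {
        unchecked
        {
            int h = 17 + row * 31;
            foreach (char c in word) h = h * 31 + c;
            return (h & 0x7fffffff) % width;
        }
    }

    public void Add(string word)
    {
        for (int row = 0; row < depth; row++)
            table[row, Bucket(word, row)]++;
    }

    public long EstimateCount(string word)
    {
        long min = long.MaxValue;
        for (int row = 0; row < depth; row++)
            min = Math.Min(min, table[row, Bucket(word, row)]);
        return min;   // never underestimates, may overestimate due to collisions
    }
}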
I'd be quite tempted to use a DAWG (wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, efficient memory wise and performs very well for lookups.
EDIT: Though have you tried simply using a Dictionary<string, int>? Where <string, int> represents word and count? Perhaps you're trying to optimize too early?
editor's note: This post originally linked to this wikipedia article, which appears to be about another meaning of the term DAWG: A way of storing all substrings of one word, for efficient approximate string-matching.
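A minimal sketch of the Dictionary<string, int> suggestion above, counting whitespace-separated words and printing the top 10 ("input.txt" is just a placeholder path):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class WordCountSketch
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in File.ReadLines("input.txt"))
        {
            // split on whitespace, skipping empty entries
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                counts.TryGetValue(word, out int n);
                counts[word] = n + 1;
            }
        }
        foreach (var pair in counts.OrderByDescending(kv => kv.Value).Take(10))
        {
            Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }
}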
A different solution could be using an SQL table and letting the system handle the data as well as it can. First create the table with a single field, word, holding each word in the collection.
Then use the query (sorry for any syntax issues, my SQL is rusty - this is really pseudo-code):
SELECT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC
The general idea is to first generate a table (which is stored on disk) with all the words, and then use a query to count and sort the (word, occurrences) pairs for you. You can then just take the top K from the retrieved list.
To all: If I indeed have any syntax or other issues in the SQL statement: feel free to edit
First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.
I did see in the comments several people making suggestions on how to improve its performance, but these were only minor tweaks, so I thought I'd share what I found to be the real bottleneck... the ConcurrentDictionary.
I'd wanted to play around with thread-local storage, and your sample gave me a great opportunity to do that. After some minor changes to use a Dictionary per thread and then combine the dictionaries after the Join(), I saw performance improve by ~30% (processing 20 MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).
The code is pasted below, and you'll notice not much changed from the accepted answer.
P.S. I don't have more than 50 reputation points so I could not put this in a comment.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
namespace WordCount
{
class MainClass
{
public static void Main(string[] args)
{
Console.WriteLine("Counting words...");
DateTime start_at = DateTime.Now;
Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
if (args.Length == 0)
{
args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
"war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
}
List<ThreadLocal<TrieNode>> roots;
if (args.Length == 0)
{
roots = new List<ThreadLocal<TrieNode>>(1);
}
else
{
roots = new List<ThreadLocal<TrieNode>>(args.Length);
foreach (string path in args)
{
ThreadLocal<TrieNode> root = new ThreadLocal<TrieNode>(() =>
{
return new TrieNode(null, '?');
});
roots.Add(root);
DataReader new_reader = new DataReader(path, root);
Thread new_thread = new Thread(new_reader.ThreadRun);
readers.Add(new_reader, new_thread);
new_thread.Start();
}
}
foreach (Thread t in readers.Values) t.Join();
foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
{
roots[0].Value.CombineNode(root.Value);
root.Dispose();
}
DateTime stop_at = DateTime.Now;
Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
Console.WriteLine();
Console.WriteLine("Most commonly found words:");
List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
int distinct_word_count = 0;
int total_word_count = 0;
roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);
top10_nodes.Reverse();
foreach (TrieNode node in top10_nodes)
{
Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
}
roots[0].Dispose();
Console.WriteLine();
Console.WriteLine("{0} words counted", total_word_count);
Console.WriteLine("{0} distinct words found", distinct_word_count);
Console.WriteLine();
Console.WriteLine("done.");
Console.ReadLine();
}
}
#region Input data reader
public class DataReader
{
static int LOOP_COUNT = 100;
private TrieNode m_root;
private string m_path;
public DataReader(string path, ThreadLocal<TrieNode> root)
{
m_root = root.Value;
m_path = path;
}
public void ThreadRun()
{
for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
{
using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
using (StreamReader sreader = new StreamReader(fstream))
{
string line;
while ((line = sreader.ReadLine()) != null)
{
string[] chunks = line.Split(null);
foreach (string chunk in chunks)
{
m_root.AddWord(chunk.Trim());
}
}
}
}
}
}
#endregion
#region TRIE implementation
public class TrieNode : IComparable<TrieNode>
{
private char m_char;
public int m_word_count;
private TrieNode m_parent = null;
private Dictionary<char, TrieNode> m_children = null;
public TrieNode(TrieNode parent, char c)
{
m_char = c;
m_word_count = 0;
m_parent = parent;
m_children = new Dictionary<char, TrieNode>();
}
public void CombineNode(TrieNode from)
{
foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
{
char keyChar = fromChild.Key;
if (!m_children.ContainsKey(keyChar))
{
m_children.Add(keyChar, new TrieNode(this, keyChar));
}
m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
m_children[keyChar].CombineNode(fromChild.Value);
}
}
public void AddWord(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
{
if (!m_children.ContainsKey(key))
{
m_children.Add(key, new TrieNode(this, key));
}
m_children[key].AddWord(word, index + 1);
}
else
{
// not a letter! retry with next char
AddWord(word, index + 1);
}
}
else
{
if (m_parent != null) // empty words should never be counted
{
m_word_count++;
}
}
}
public int GetCount(string word, int index = 0)
{
if (index < word.Length)
{
char key = word[index];
if (!m_children.ContainsKey(key))
{
return -1;
}
return m_children[key].GetCount(word, index + 1);
}
else
{
return m_word_count;
}
}
public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
{
if (m_word_count > 0)
{
distinct_word_count++;
total_word_count += m_word_count;
}
if (m_word_count > most_counted[0].m_word_count)
{
most_counted[0] = this;
most_counted.Sort();
}
foreach (char key in m_children.Keys)
{
m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
}
}
public override string ToString()
{
return BuildString(new StringBuilder()).ToString();
}
private StringBuilder BuildString(StringBuilder builder)
{
if (m_parent == null)
{
return builder;
}
else
{
return m_parent.BuildString(builder).Append(m_char);
}
}
public int CompareTo(TrieNode other)
{
return this.m_word_count.CompareTo(other.m_word_count);
}
}
#endregion
}
As a quick general algorithm I would do this:
Create a map whose values are the count for a specific word and whose keys are the actual strings.
for each string in content:
    if string is a valid key for the map:
        increment the value associated with that key
    else
        add a new key/value pair to the map with the key being the word and the count being one
done
Then you could just find the largest values in the map:
create an array of size 10 holding (word, count) pairs
for each value in the map:
    if the current pair has a count larger than the smallest count in the array:
        replace that pair with the current one
print all pairs in the array
Well, personally, I'd split the file into chunks of some size, say 128 MB, maintaining two in memory at all times while scanning. Any discovered word is added to a hash list along with a list of counts; then I'd iterate over the counts at the end to find the top 10...
Well, the first thought is to manage a database in the form of a hashtable/array or whatever to save each word's occurrence, but given the data size I would rather:
Get the first 10 found words where occurrence >= 2
Get how many times these words occur in the entire string and delete them while counting
Start again; once you have two sets of 10 words each, you take the 10 most-occurring words from both sets
Do the same for the rest of the string (which doesn't contain these words anymore).
You can even try to be more efficient and start with the first 10 found words where occurrence >= 5, for example, or more; if none are found, reduce this value until 10 words are found. This way you have a good chance of avoiding the memory-intensive saving of all word occurrences, which is a huge amount of data, and you can save scanning rounds (in a good case).
But in the worst case you may have more rounds than in a conventional algorithm.
By the way, it's a problem I would try to solve with a functional programming language rather than OOP.
The method below will only read your data once and can be tuned for memory sizes.
Read the file in chunks of say 1GB
For each chunk make a list of say the 5000 most occurring words with their frequency
Merge the lists based on frequency (1000 lists with 5000 words each)
Return the top 10 of the merged list
Theoretically you might miss words, although I think that chance is very, very small.
Storm is the technology to look at. It separates the role of data input (spouts) from processors (bolts). Chapter 2 of the Storm book solves your exact problem and describes the system architecture very well -
http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
Storm is more real-time processing as opposed to batch processing with Hadoop. If your data is pre-existing, you can distribute loads to different spouts and spread them for processing to different bolts.
This approach will also support data growing beyond terabytes, as the data will be analysed as it arrives in real time.
Very interesting question. It relates more to logic analysis than coding. With the assumption of English language and valid sentences, it becomes easier.
You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English it is 5.1). Therefore you will not have problems with memory.
As for reading the file, you should choose a parallel mode, reading chunks (a size of your choice) by manipulating file positions at white spaces. If you decide to read chunks of 1 MB, for example, all chunks except the first one should be a bit wider (+22 bytes from the left and +22 bytes from the right, where 22 represents the longest English word - if I'm right). For parallel processing you will need a concurrent dictionary, or local collections that you will merge.
Keep in mind that normally you will end up with a top ten that is part of a valid stop-word list (this is probably another, reverse approach, which is also valid as long as the content of the file is ordinary).
Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, like a trie, stores strings in a specific way, very efficiently. Or a second way would be to build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; this solution would also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
here is an implementation of a trie in Java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
MapReduce
Word count can be achieved efficiently through MapReduce using Hadoop.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Large files can be parsed through it. It uses multiple nodes in the cluster to perform this operation.
// Mapper from the Hadoop WordCount v1.0 example (class wrapper and fields added for completeness)
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Is there a method to build a balanced binary search tree?
Example:
1 2 3 4 5 6 7 8 9
        5
       / \
      3   etc
     / \
    2   4
   /
  1
I'm thinking there is a method to do this without using the more complex self-balancing trees. Otherwise I can do it on my own, but someone has probably done this already :)
Thanks for the answers! This is the final python code:
def _buildTree(self, keys):
    if not keys:
        return None
    middle = len(keys) // 2
    return Node(
        key=keys[middle],
        left=self._buildTree(keys[:middle]),
        right=self._buildTree(keys[middle + 1:])
    )
For each subtree:
Find the middle element of the subtree and put that at the top of the tree.
Find all the elements before the middle element and use this algorithm recursively to get the left subtree.
Find all the elements after the middle element and use this algorithm recursively to get the right subtree.
If you sort your elements first (as in your example) finding the middle element of a subtree can be done in constant time.
This is a simple algorithm for constructing a one-off balanced tree. It is not an algorithm for a self-balancing tree.
Here is some source code in C# that you can try for yourself:
using System.Collections.Generic;
using System.Linq;

public class Program
{
class TreeNode
{
public int Value;
public TreeNode Left;
public TreeNode Right;
}
TreeNode constructBalancedTree(List<int> values, int min, int max)
{
if (min == max)
return null;
int median = min + (max - min) / 2;
return new TreeNode
{
Value = values[median],
Left = constructBalancedTree(values, min, median),
Right = constructBalancedTree(values, median + 1, max)
};
}
TreeNode constructBalancedTree(IEnumerable<int> values)
{
return constructBalancedTree(
values.OrderBy(x => x).ToList(), 0, values.Count());
}
void Run()
{
TreeNode balancedTree = constructBalancedTree(Enumerable.Range(1, 9));
// displayTree(balancedTree); // TODO: implement this!
}
static void Main(string[] args)
{
new Program().Run();
}
}
This paper explains in detail:
Tree Rebalancing in Optimal Time and Space
http://www.eecs.umich.edu/~qstout/abs/CACM86.html
Also here:
One-Time Binary Search Tree Balancing:
The Day/Stout/Warren (DSW) Algorithm
http://penguin.ewu.edu/~trolfe/DSWpaper/
If you really want to do it on-the-fly, you need a self-balancing tree.
If you just want to build a simple tree, without having to go to the trouble of balancing it, just randomize the elements before inserting them into the tree.
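A minimal sketch of that shuffle step (Fisher-Yates); you would then insert the keys into your plain, unbalanced BST in this order, which keeps the expected depth at O(log n):

using System;
using System.Linq;

class ShuffleDemo
{
    static void Main()
    {
        var rng = new Random();
        int[] keys = Enumerable.Range(1, 9).ToArray();

        // Fisher-Yates shuffle: swap each element with a random earlier position
        for (int i = keys.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (keys[i], keys[j]) = (keys[j], keys[i]);
        }

        // insert keys into your BST in this order
        Console.WriteLine(string.Join(" ", keys));   // e.g. "3 7 1 9 4 2 8 5 6"
    }
}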
Make the median of your data (or more precisely, the nearest element in your array to the median) the root of the tree. And so on recursively.
I'm a LINQ beginner, so I'm just looking for someone to let me know if the following is possible to implement with LINQ and, if so, some pointers on how it could be achieved.
I want to transform one financial time-series list into another, where the second list will be the same length as or shorter than the first (usually shorter, i.e., it becomes a new list where each element represents an aggregation of information from one or more elements of the 1st list). How it collapses one list into the other depends on the data in the first list. The algorithm needs to track a calculation that gets reset when new elements are added to the second list. It may be easier to describe via an example:
List 1 (time ordered from beginning to end series of closing prices and volume):
{P=7,V=1}, {P=10,V=2}, {P=10,V=1}, {P=10,V=3}, {P=11,V=5}, {P=12,V=1}, {P=13,V=2}, {P=17,V=1}, {P=15,V=4}, {P=14,V=10}, {P=14,V=8}, {P=10,V=2}, {P=9,V=3}, {P=8,V=1}
List 2 (series of open/close price ranges and summation of volume for each range period, using these 2 parameter settings to transform list 1 into list 2: param 1: Price Range Step Size = 3, param 2: Price Range Reversal Step Size = 6):
{O=7,C=10,V=1+2+1}, {O=10,C=13,V=3+5+1+2}, {O=13,C=16,V=0}, {O=16,C=10,V=1+4+10+8+2}, {O=10,C=8,V=3+1}
In list 2, I explicitly show the summation of the V attributes from list 1, but V is just a long, so it would be a single number in reality. Here is how it works: the opening time-series price is 7. Then we look for the first price whose delta from this initial starting price of 7 is 3 (via the param 1 setting). In list 1, as we move through the list, the next step is an upward move to 10, and thus we've established an "up trend". So now we build our first element in list 2 with Open=7, Close=10 and sum up the Volume of all bars used in list 1 to get to this first step. The next element's starting point is 10. To build another up step, we need to advance another 3 upwards, or we could reverse and go downwards 6 (param 2). With the data from list 1, we reach 13 first, so that builds our second element in list 2 and sums up all the V attributes used to get to this step. We continue this process until list 1 is fully processed.
Note the gap jump that happens in list 1. We still want to create a step element of {O=13,C=16,V=0}. The V of 0 simply states that we have a range move that passed through this step but had a volume of 0 (no actual prices from list 1 occurred here - the price was above it, but we want to build the set of steps that lead to the price above it).
The second-to-last entry in list 2 represents the reversal from up to down.
The final entry in list 2 just uses the final Close from list 1, even though it hasn't really finished establishing a full range step yet.
Thanks for any pointers on how this could potentially be done via LINQ, if at all.
My first thought is: why try to use LINQ for this? It seems like a better fit for building a new enumerable using the yield keyword to partially process the input and then spit out an answer.
Something along the lines of this:
public struct PricePoint
{
    public ulong price;
    public ulong volume;
}

public struct RangePoint
{
    public ulong open;
    public ulong close;
    public ulong volume;
}

// STEP and REVERSAL_STEP correspond to the question's two parameters
// ("Price Range Step Size" and "Price Range Reversal Step Size").
public static IEnumerable<RangePoint> calculateRanges(IEnumerable<PricePoint> pricePoints)
{
    if (pricePoints.Count() > 0)
    {
        ulong open = pricePoints.First().price;
        ulong volume = pricePoints.First().volume;
        foreach (PricePoint pricePoint in pricePoints.Skip(1))
        {
            volume += pricePoint.volume;
            if (pricePoint.price > open)
            {
                if ((pricePoint.price - open) >= STEP)
                {
                    // We have established an up-trend.
                    RangePoint rangePoint;
                    rangePoint.open = open;
                    rangePoint.close = pricePoint.price;
                    rangePoint.volume = volume;
                    open = pricePoint.price;
                    volume = 0;
                    yield return rangePoint;
                }
            }
            else
            {
                if ((open - pricePoint.price) >= REVERSAL_STEP)
                {
                    // We have established a reversal.
                    RangePoint rangePoint;
                    rangePoint.open = open;
                    rangePoint.close = pricePoint.price;
                    rangePoint.volume = volume;
                    open = pricePoint.price;
                    volume = 0;
                    yield return rangePoint;
                }
            }
        }
        RangePoint lastPoint;
        lastPoint.open = open;
        lastPoint.close = pricePoints.Last().price;
        lastPoint.volume = volume;
        yield return lastPoint;
    }
}
This isn't yet complete. For instance, it doesn't handle gapping, and there is an unhandled edge case where the last data point might be consumed, but it will still process a "lastPoint". But it should be enough to get started.