I have a text file with 100,000 word-frequency pairs.
The test.in file layout:
line 1 - total count of all word-frequency pairs
lines 2 to ~100,001 - the word-frequency pairs
line 100,002 - total count of user input words
lines 100,003 to the end - the user input words
I parse this file and put the words into
Dictionary<string, double> dictionary;
and I want to execute some search + order logic in the following code:
for (int i = 0; i < 15000; i++)
{
    string tempInputWord = ...; //take data from file (or other sources)
    var adviceWords = dictionary
        .Where(p => p.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
        .OrderByDescending(ks => ks.Value)
        .ThenBy(ks => ks.Key, StringComparer.Ordinal)
        .Take(10)
        .ToList();
    //some output
}
The problem: This code must run in less than 10 seconds.
On my computer (Core i5 2400, 8 GB RAM) it takes about 91 seconds, even with Parallel.For().
Can you give me some advice on how to increase performance?
UPDATE:
Hooray! We did it!
Thank you @CodesInChaos, @usr, @T_D and everyone who was involved in solving the problem.
The final code:
var kvList = dictionary.OrderBy(ks => ks.Key, StringComparer.Ordinal).ToList();

var strComparer = new MyStringComparer();
var intComparer = new MyIntComparer();

var kvListSize = kvList.Count;
var allUserWords = new List<string>();
for (int i = 0; i < userWordQuantity; i++)
{
    var searchWord = Console.ReadLine();
    allUserWords.Add(searchWord);
}

var result = allUserWords
    .AsParallel()
    .AsOrdered()
    .Select(searchWord =>
    {
        int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), strComparer);
        if (startIndex < 0)
            startIndex = ~startIndex;

        var matches = new List<KeyValuePair<string, int>>();
        bool isNotEnd = true;
        for (int j = startIndex; j < kvListSize; j++)
        {
            isNotEnd = kvList[j].Key.StartsWith(searchWord, StringComparison.Ordinal);
            if (isNotEnd) matches.Add(kvList[j]);
            else break;
        }

        matches.Sort(intComparer);
        var res = matches.Select(s => s.Key).Take(10).ToList();
        return res;
    });

foreach (var adviceWords in result)
{
    foreach (var adviceWord in adviceWords)
    {
        Console.WriteLine(adviceWord);
    }
    Console.WriteLine();
}
6 sec (9 sec with LINQ instead of the manual loop)
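Note that MyStringComparer and MyIntComparer are not shown in the question. As an illustration only (the original definitions were not posted), comparers consistent with how the code above uses them could look like this:

using System;
using System.Collections.Generic;

// Hypothetical comparer for BinarySearch: orders pairs by key only, ordinally.
class MyStringComparer : IComparer<KeyValuePair<string, int>>
{
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        return string.CompareOrdinal(x.Key, y.Key);
    }
}

// Hypothetical comparer for matches.Sort: highest frequency first, then key ascending (ordinal),
// mirroring the OrderByDescending(Value).ThenBy(Key) logic this code replaces.
class MyIntComparer : IComparer<KeyValuePair<string, int>>
{
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        int byValue = y.Value.CompareTo(x.Value); // descending by frequency
        return byValue != 0 ? byValue : string.CompareOrdinal(x.Key, y.Key);
    }
}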
You are not using any of the algorithmic strengths of the dictionary at all. Ideally, you'd use a tree structure so that you can perform prefix lookups. On the other hand, you are within 3.7x of your performance goal. I think you can reach it by just optimizing the constant factor in your algorithm.
Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice.
Don't use a dictionary at all. Just use a KeyValuePair<T1, T2>[] and run through it using a foreach loop. This is the fastest possible way to traverse a set of pairs.
Could look like this:
KeyValuePair<T1, T2>[] items;
List<KeyValuePair<T1, T2>> matches = new ...(); //Consider pre-sizing this.

//This could be a parallel loop as well.
//Make sure to not synchronize too much on matches.
//If there tend to be few matches a lock will be fine.
foreach (var item in items) {
    if (IsMatch(item)) {
        matches.Add(item);
    }
}

matches.Sort(...); //Sort in-place

return matches.Take(10); //Maybe matches.RemoveRange(10, matches.Count - 10) is better
That should exceed a 3.7x speedup.
If you need more, try stuffing the items into a dictionary keyed on the first char of Key. That way you can look up all items matching tempInputWord[0]. That should reduce search times by the selectivity that is in the first char of tempInputWord. For English text that would be on the order of 26 or 52. This is a primitive form of prefix lookup that has one level of lookup. Not pretty but maybe it is enough.
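A minimal sketch of that one-level bucket idea, assuming the pairs live in a KeyValuePair<string, int> collection as suggested above (the method and class names here are illustrative, not from the answer):

using System;
using System.Collections.Generic;
using System.Linq;

static class FirstCharIndex
{
    // Group the pairs by the first character of the key, once, up front.
    public static Dictionary<char, List<KeyValuePair<string, int>>> BuildBuckets(
        IEnumerable<KeyValuePair<string, int>> items)
    {
        return items.GroupBy(p => p.Key[0])
                    .ToDictionary(g => g.Key, g => g.ToList());
    }

    // Only the bucket for tempInputWord[0] can contain prefix matches.
    public static List<KeyValuePair<string, int>> FindMatches(
        Dictionary<char, List<KeyValuePair<string, int>>> buckets, string tempInputWord)
    {
        var matches = new List<KeyValuePair<string, int>>();
        List<KeyValuePair<string, int>> candidates;
        if (buckets.TryGetValue(tempInputWord[0], out candidates))
        {
            foreach (var item in candidates)
            {
                if (item.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
                    matches.Add(item);
            }
        }
        return matches;
    }
}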
I think the best way would be to use a Trie data structure instead of a dictionary. A Trie saves all the words in a tree structure, where a node can represent all the words that start with the same letters. So if you look up your search word tempInputWord in a Trie, you get the node that represents all the words starting with tempInputWord, and you only have to traverse its child nodes; that is a single search operation. The Wikipedia article on tries also mentions some other advantages over hash tables (which is essentially what a Dictionary is):
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
And here are some ideas for creating a trie in C#.
This should at least speed up the lookup, however, building the Trie might be slower.
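Since the linked ideas for building a trie are not reproduced here, a very small sketch of the kind of node structure being described (this is only an illustration, not the Trie<T> class used in the test below):

using System.Collections.Generic;
using System.Linq;

// Minimal prefix tree: children are keyed by the next character, and values
// are stored on the node where their key ends.
class TrieNode<TValue>
{
    private readonly Dictionary<char, TrieNode<TValue>> children = new Dictionary<char, TrieNode<TValue>>();
    private readonly List<TValue> values = new List<TValue>();

    public void Add(string key, TValue value, int index = 0)
    {
        if (index == key.Length) { values.Add(value); return; }
        TrieNode<TValue> child;
        if (!children.TryGetValue(key[index], out child))
        {
            child = new TrieNode<TValue>();
            children.Add(key[index], child);
        }
        child.Add(key, value, index + 1);
    }

    // All values whose keys start with the given prefix.
    public IEnumerable<TValue> Retrieve(string prefix, int index = 0)
    {
        if (index == prefix.Length)
            return CollectAll();
        TrieNode<TValue> child;
        return children.TryGetValue(prefix[index], out child)
            ? child.Retrieve(prefix, index + 1)
            : Enumerable.Empty<TValue>();
    }

    private IEnumerable<TValue> CollectAll()
    {
        foreach (var value in values) yield return value;
        foreach (var child in children.Values)
            foreach (var value in child.CollectAll()) yield return value;
    }
}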
Update:
OK, I tested it myself using a file with frequencies of English words that uses the same format as yours. This is my code, which uses the Trie class that you also tried to use.
static void Main(string[] args)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    var trie = new Trie<KeyValuePair<string, int>>();

    //build trie with your value pairs
    var lines = File.ReadLines("en.txt");
    foreach (var line in lines.Take(100000))
    {
        var split = line.Split(' ');
        trie.Add(split[0], new KeyValuePair<string, int>(split[0], int.Parse(split[1])));
    }

    Console.WriteLine("Time needed to read file and build Trie with 100000 words: " + sw.Elapsed);

    sw.Reset();

    //test with 10000 search words
    sw.Start();
    foreach (string line in lines.Take(10000))
    {
        var searchWord = line.Split(' ')[0];
        var allPairs = trie.Retrieve(searchWord);
        var bestWords = allPairs.OrderByDescending(kv => kv.Value).ThenBy(kv => kv.Key).Select(kv => kv.Key).Take(10);
        var output = bestWords.Aggregate("", (s1, s2) => s1 + ", " + s2);
        Console.WriteLine(output);
    }

    Console.WriteLine("Time to process 10000 different searchWords: " + sw.Elapsed);
}
My results on a pretty similar machine:
Time needed to read file and build Trie with 100000 words: 00:00:00.7397839
Time to process 10000 different searchWords: 00:00:03.0181700
So I think you are doing something wrong that we cannot see, for example the way you measure the time or the way you read the file. As my results show, this should be really fast. The 3 seconds are mainly due to the Console output in the loop, which I needed so that the bestWords variable is actually used; otherwise it would have been optimized away.
Replace the dictionary by a List<KeyValuePair<string, decimal>>, sorted by the key.
For the search I use the fact that, with ordinal comparison, a prefix sorts directly before the strings that start with it. So I can use a binary search to find the first candidate. Since the candidates are contiguous, I can replace Where with TakeWhile.
// comparer compares only the keys, using ordinal comparison
int startIndex = dictionary.BinarySearch(new KeyValuePair<string, decimal>(searchWord, 0), comparer);
if (startIndex < 0)
    startIndex = ~startIndex;

var adviceWords = dictionary
    .Skip(startIndex)
    .TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
    .OrderByDescending(ks => ks.Value)
    .ThenBy(ks => ks.Key, StringComparer.Ordinal)
    .Select(s => s.Key)
    .Take(10).ToList();
Make sure to use ordinal comparison for all operations, including the initial sort, the binary search and the StartsWith check.
I would call Console.ReadLine outside the parallel loop. Probably using AsParallel().Select(...) on the collection of search words instead of Parallel.For.
If you want profiling, separate the reading of the file and see how long that takes.
Also, data calculation, collection, and presentation could be separate steps.
If you want concurrency AND a dictionary, look at the ConcurrentDictionary, maybe even more for reliability than for performance, but probably for both:
http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx
Assuming the 10 is constant, why is everyone storing the entire data set? Memory is not free. The fastest solution is to store the first 10 entries in a list and sort it. Then maintain that 10-element sorted list as you traverse the rest of the data set, removing the 11th element every time you insert one.
The above method works best for small cutoffs. If you had to take the first 5000 objects, consider using a binary heap instead of a list.
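A hedged sketch of that bounded top-10 idea, applied to the original question's word-frequency pairs (matchingItems and rankComparer are assumed names; rankComparer is taken to order by descending frequency, then by key):

// Keep only the 10 best entries seen so far while scanning the candidates.
var top = new List<KeyValuePair<string, int>>(11);
foreach (var item in matchingItems)
{
    int pos = top.BinarySearch(item, rankComparer);
    if (pos < 0) pos = ~pos;
    if (pos < 10)                  // only worth inserting if it lands in the top 10
    {
        top.Insert(pos, item);
        if (top.Count > 10)
            top.RemoveAt(10);      // drop the element that fell to 11th place
    }
}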
Related
I'm sure something like this exists, but I don't know what it would be called (or how to find more info on it). If I have an alphabetically sorted list of words and I'm checking to see if and where the word "test" is in that list, it doesn't make sense to start at the beginning, but rather in the T's, right? The same goes for numbers, of course. Is there a way to implement something like this and tailor the start of the search? Or do hash sets and methods like Contains already do this by themselves?
EDIT:
For example, if I have a list of integers like {1,2,3,5,7,8,9,23..}, is there any automatic way to sort it so that when I check the list for the element "9", it doesn't begin from the beginning...?
Sorry, this is a simple example, but I do intend to search thousands of times through a list that potentially contains thousands of elements
EDIT 2:
From the replies, I learned about binary search, but since that apparently starts in the middle of your list, is it possible to implement something manually along the lines of, for example, splitting a list of words into 26 bins, so that when you search for a particular word it can immediately start searching in the best place (or maybe 52 bins if each bin starts to become overpopulated)?
When you say you have a sorted list and you want to search it, the algorithm that immediately jumps to my mind is a binary search. Fortunately List<T> already has that implemented.
The example on that link actually looks to do exactly what you want (it's dealing with finding a word in a list of sorted words too).
In essence, you want something like this:
List<string> words = ...;
words.Sort(); // or not, depending on the source
var index = words.BinarySearch("word");
if (index > -1)
{
    // word was found, and its index is stored in index
}
else // you may or may not want this part
{
    // this will insert the word into the list, so that you don't have to re-sort it
    words.Insert(~index, "word");
}
This, of course, also works with ints. Simply replace List<string> with List<int> and your BinarySearch argument with an int.
Most Contains-type functions simply loop through the collection until coming across the item you're looking for. That works great in that you don't have to sort the collection first, but it's not so nice when you start off with it sorted. So in most cases, if you're searching the same list a lot, sort it and BinarySearch it, but if you're modifying the list a lot and only searching once or twice, a regular IndexOf or Contains will likely be your best bet.
If you're looking to group words by their first letter, I would probably use a Dictionary<char, List<string>> to store them. I chose List over an array for the purposes of mutability, so make that call on your own--there's also Array.BinarySearch if you choose to use an array. You could get into a proprietary tree model, but that may or may not be overkill. To do the dictionary keyed by first character, you'll want something like this:
Dictionary<char, List<string>> GetDict(IEnumerable<string> args)
{
    return args.GroupBy(c => c[0]).ToDictionary(c => c.Key, c => c.OrderBy(x => x).ToList());
}
Then you can use it pretty simply, similarly to before. The only change would reside in the fetching of your list.
Dictionary<char, List<string>> wordsByKey = GetDict(words);

List<string> keyed;
string word = "word";
if (wordsByKey.TryGetValue(word[0], out keyed))
{
    // same as before
}
else
{
    wordsByKey.Add(word[0], new List<string>() { word }); // or not, again depending on
                                                          // whether you want the list to update
}
When the list is sorted, you are looking for BinarySearch: http://msdn.microsoft.com/pl-pl/library/3f90y839%28v=vs.110%29.aspx. Its complexity is O(log n), against O(n) for a simple Contains.
List<string> myList = GetList();
string elementToSearch = "test";

if (myList.Contains(elementToSearch))
{
    // found, O(n), works on unsorted list
}

if (myList.BinarySearch(elementToSearch) >= 0)
{
    // found, O(log n), works only on sorted list
}
To clarify: What is the difference between Linear search and Binary search?
To your edit:
If your input collection is not sorted, you should use Contains or IndexOf, with the O(n) time mentioned above; it will loop over your collection once. Sorting the collection first is less efficient, since sorting takes O(n log n), so it's not worth sorting just to search for one element.
A sample to illustrate the performance:
var r = new Random();
var list = new List<int>();
for (var i = 1; i < 10000000; i++)
{
    list.Add(r.Next());
}

// O(log n) - we assume that the list is sorted, so sorting is performed outside the watch
var sortedList = new List<int>(list);
sortedList.Sort();
var elementToSearch = sortedList.Last();

var watcher = new Stopwatch();
watcher.Start();
sortedList.BinarySearch(elementToSearch);
watcher.Stop();
Console.WriteLine("BinarySearch on already sorted: {0} ms",
    watcher.Elapsed.TotalMilliseconds);

// O(n) - simple search
elementToSearch = list.Last();
watcher.Reset();
watcher.Start();
list.IndexOf(elementToSearch);
watcher.Stop();
Console.WriteLine("IndexOf on unsorted: {0} ms",
    watcher.Elapsed.TotalMilliseconds);

// O(n log n) + O(log n)
watcher.Reset();
watcher.Start();
list.Sort();
elementToSearch = list.Last();
list.BinarySearch(elementToSearch);
watcher.Stop();
Console.WriteLine("Sort + binary search on unsorted: {0} ms",
    watcher.Elapsed.TotalMilliseconds);

Console.ReadKey();
Result:
BinarySearch on already sorted: 0.0248 ms
IndexOf on unsorted: 6.144 ms
Sort + binary search on unsorted: 1157.3298 ms
Edit to edit 2:
I think you are rather looking for BucketSort:
You can implement it on your own, but I think the Dictionary solution from Matthew Haugen is simpler and faster to implement :)
I have the following code. Is there a more efficient way to accomplish the same tasks?
Given a folder, loop over the files within the folder.
Within each file, skip the first four header lines,
After splitting the row based on a space, if the resulting array contains fewer than 7 elements, skip it,
Check if the specified element is already in the dictionary. If it is, increment the count. If not, add it.
It's not a complicated process. Is there a better way to do this? LINQ?
string sourceDirectory = @"d:\TESTDATA\";
string[] files = Directory.GetFiles(sourceDirectory, "*.log",
    SearchOption.TopDirectoryOnly);

var dictionary = new Dictionary<string, int>();

foreach (var file in files)
{
    string[] lines = System.IO.File.ReadLines(file).Skip(4).ToArray();
    foreach (var line in lines)
    {
        var elements = line.Split(' ');
        if (elements.Length > 6)
        {
            if (dictionary.ContainsKey(elements[9]))
            {
                dictionary[elements[9]]++;
            }
            else
            {
                dictionary.Add(elements[9], 1);
            }
        }
    }
}
Something LINQy should do you. I doubt it's any more efficient, and it's almost certainly more of a hassle to debug, but it is very trendy these days:
static Dictionary<string, int> Slurp(string rootDirectory)
{
    Dictionary<string, int> instance = Directory.EnumerateFiles(rootDirectory, "*.log", SearchOption.TopDirectoryOnly)
        .SelectMany(fn => File.ReadAllLines(fn)
            .Skip(4)
            .Select(txt => txt.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
            .Where(x => x.Length > 9)
            .Select(x => x[9])
        )
        .GroupBy(x => x)
        .ToDictionary(x => x.Key, x => x.Count())
        ;
    return instance;
}
A more efficient (performance-wise) way to do this would be to parallelize your outer foreach with the Parallel.ForEach method. You'd also need a ConcurrentDictionary object instead of a standard dictionary.
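A rough sketch of that combination, reusing the layout assumptions from the question (the System.Collections.Concurrent, System.IO, System.Linq and System.Threading.Tasks namespaces are assumed):

// One task per file; a shared thread-safe dictionary of counts.
var counts = new ConcurrentDictionary<string, int>();
Parallel.ForEach(files, file =>
{
    foreach (var line in File.ReadLines(file).Skip(4))
    {
        var elements = line.Split(' ');
        if (elements.Length > 9)   // guard so that elements[9] exists
            counts.AddOrUpdate(elements[9], 1, (key, current) => current + 1);
    }
});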
Not sure if you are looking for better performance or more elegant code.
If you prefer functional-style LINQ, maybe something like this:
var query = from element in
            (
                //go through all file names
                from fileName in files
                //read all lines from every file and skip first 4
                from line in File.ReadAllLines(fileName).Skip(4)
                //split every line into words
                let lineData = line.Split(new[] { ' ' })
                //select only lines with more than 6 words
                where lineData.Count() > 6
                //take the element at index 6 from the line
                select lineData.ElementAt(6)
            )
            //outer query will group by element
            group element by element
            into g
            select new
            {
                Key = g.Key,
                Count = g.Count()
            };

var dictionary = query.ToDictionary(e => e.Key, e => e.Count);
The result is a dictionary with the word as the key and the count of word occurrences as the value.
I would expect that reading the files is going to be the most consuming part of the operation. In many cases, trying to read multiple files at once on different threads will hurt rather than help performance, but it may be helpful to have a thread which does nothing but read files, so that it can keep the drive as busy as possible.
If the files could get big (as seems likely) and if no line will exceed 32K bytes (8,000-32,000 Unicode characters), I'd suggest that you read them in chunks of about 32K or 64K bytes (not characters). Reading a file as bytes and subdividing into lines yourself may be faster than reading it as lines, since the subdivision can happen on a different thread from the physical disk access.
I'd suggest starting with one thread for disk access and one thread for parsing and counting, with a blocking queue between them. The disk-access thread should put on the queue data items which contain a 32K array of bytes, an indication of how many bytes are valid [may be less than 32K at the end of a file], and an indicator of whether it's the last record of a file. The parsing thread should read those items, parse them into lines, and update the appropriate counts.
To improve counting performance, it may be helpful to define
class ExposedFieldHolder<T> { public T Value; }
and then have a Dictionary<string, ExposedFieldHolder<int>>. One would have to create a new ExposedFieldHolder<int> for each dictionary slot, but dictionary[elements[9]].Value++; would likely be faster than dictionary[elements[9]]++; since the latter statement translates as dictionary[elements[9]] = dictionary[elements[9]] + 1;, and has to look up the element once when reading and again when writing.
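For illustration, the counting loop from the question might then look like this (a sketch over one file's lines, as in the question; it also uses TryGetValue so a miss costs only one extra lookup):

var counts = new Dictionary<string, ExposedFieldHolder<int>>();
foreach (var line in lines)
{
    var elements = line.Split(' ');
    if (elements.Length <= 9) continue;           // need index 9 to exist
    ExposedFieldHolder<int> holder;
    if (!counts.TryGetValue(elements[9], out holder))
    {
        holder = new ExposedFieldHolder<int>();
        counts.Add(elements[9], holder);
    }
    holder.Value++;                               // increment in place, no second dictionary lookup
}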
If it's necessary to do the parsing and counting on multiple threads, I would suggest that each thread have its own queue, and the disk-reading thread switch queues after each file [all blocks of a file should be handled by the same thread, since a text line might span two blocks]. Additionally, while it would be possible to use a ConcurrentDictionary, it might be more efficient to have each thread have its own independent Dictionary and consolidate the results at the end.
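A minimal sketch of that final consolidation step, assuming each worker thread produced its own Dictionary<string, int>:

static Dictionary<string, int> MergeCounts(IEnumerable<Dictionary<string, int>> perThreadCounts)
{
    var total = new Dictionary<string, int>();
    foreach (var partial in perThreadCounts)
    {
        foreach (var pair in partial)
        {
            int current;
            total.TryGetValue(pair.Key, out current);   // current stays 0 for a new key
            total[pair.Key] = current + pair.Value;
        }
    }
    return total;
}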
Is the Lookup Time for a HashTable or Dictionary Always O(1) as long as it has a Unique Hash Code?
If a HashTable has 100 Million Rows would it take the same amount of time to look up as something that has 1 Row?
No. It is technically possible but it would be extremely rare to get the exact same amount of overhead. A hash table is organized into buckets. Dictionary<> (and Hashtable) calculate a bucket number for the object with an expression like this:
int bucket = key.GetHashCode() % totalNumberOfBuckets;
So two objects with a different hash code can end up in the same bucket. A bucket is a List<>; the indexer then searches that list for the key, which is O(n) where n is the number of items in the bucket.
Dictionary<> dynamically increases the value of totalNumberOfBuckets to keep the bucket search efficient. When you pump a hundred million items into the dictionary, there will be thousands of buckets. The odds that the bucket is empty when you add an item will be quite small. But if it is by chance, then yes, it will take just as long to retrieve the item.
The amount of overhead increases very slowly as the number of items grows. This is called amortized O(1).
Might be helpful: .NET HashTable Vs Dictionary - Can the Dictionary be as fast?
As long as there are no collisions with the hashes, yes.
var dict = new Dictionary<string, string>();

for (int i = 0; i < 100; i++) {
    dict.Add("" + i, "" + i);
}

long start = DateTime.Now.Ticks;
string s = dict["10"];
Console.WriteLine(DateTime.Now.Ticks - start);

for (int i = 100; i < 100000; i++) {
    dict.Add("" + i, "" + i);
}

start = DateTime.Now.Ticks;
s = dict["10000"];
Console.WriteLine(DateTime.Now.Ticks - start);
This prints 0 in both cases. So it seems the answer would be yes.
[Got modded down so I'll explain better]
It seems that it is constant, but it depends on the hash function giving a different result for all keys. As there is no hash function that can guarantee that, it all boils down to the data that you feed to the Dictionary. So you will have to test with your data to see if it is constant.
Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.
I have several Excel tables with information (all numeric); basically it is "dialed phone numbers" in one column and the number of minutes to each of those numbers in another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1; these codes vary in length, and I need to check every one of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data: sometimes the dialed number lists consist of up to 30,000 numbers, the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers, so that means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or in any other language, to be honest. I need to do this quite often and designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
The lists don't need to be in Excel worksheets; I can export to a CSV file and work from there. I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time in answering my question. I guess in my ignorance I exaggerated with the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day, and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.
UPDATE
You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;

var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);

var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();
So how fast is this? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL    Runtime
------------------
 1     10,198 ms
 2      1,179 ms
 3        205 ms
 4        130 ms
 5        107 ms
Just to give a rough idea - I did just one run and have a lot of other stuff going on.
Original answer
I wouldn't care much about performance - an average desktop PC can quite easily deal with database tables with 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list of 15,000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50,000 numbers with a prefix and an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.
            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();

            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();

            Console.WriteLine("Processing started.");
            Stopwatch stopwatch = new Stopwatch();
            const Int32 minimumPrefixLength = 5;

            stopwatch.Start();

            var groupedPefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);

            var result = calls
                .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();

            stopwatch.Stop();
            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);

            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format(" prefix={0}", prefix));
                }

                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration));
                }

                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }

            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);

            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;
                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }
                String prefix = stringBuilder.ToString();

                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }

            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();

            while (calls.Count < count)
            {
                stringBuilder.Length = 0;
                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }

            return calls;
        }

        private class Call
        {
            public Call(String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}
It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
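A hedged sketch of that walk, with an assumed node type (PrefixNode, Carrier, and the row fields are illustrative names, not from the answer):

// Trie node: one child per digit; Carrier is set on nodes where a prefix ends.
class PrefixNode
{
    public Dictionary<char, PrefixNode> Children = new Dictionary<char, PrefixNode>();
    public string Carrier;
}

static string FindCarrier(PrefixNode root, string dialedNumber)
{
    PrefixNode node = root;
    string carrier = null;
    foreach (char digit in dialedNumber)
    {
        PrefixNode next;
        if (!node.Children.TryGetValue(digit, out next))
            break;                        // no longer prefix exists in the trie
        node = next;
        if (node.Carrier != null)
            carrier = node.Carrier;       // remember the deepest (longest) matching prefix
    }
    return carrier;                       // null if no prefix matched
}

// The running totals can then be kept in a Dictionary<string, decimal> keyed by the
// carrier returned here, adding the current row's minutes when a carrier is found.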
The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
    bool found = false;
    for (int length = 1; length <= 7; length++)
    {
        int prefix = ExtractDigits(callNumber, length);
        if (carrier.Prefixes.Contains(prefix))
        {
            carrier.Calls.Add(callNumber);
            found = true;
            break;
        }
    }
    if (found)
        break;
}
If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
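A sketch of that refinement, keeping one set of prefixes per length for each carrier (PrefixSetsByLength is a hypothetical property; ExtractDigits is the helper used in the snippet above):

// carrier.PrefixSetsByLength: prefix length -> set of this carrier's prefixes of
// that length, built once when the prefix lists are loaded.
bool found = false;
foreach (var entry in carrier.PrefixSetsByLength)
{
    int length = entry.Key;
    if (callNumber.Length < length)
        continue;
    int prefix = ExtractDigits(callNumber, length);
    if (entry.Value.Contains(prefix))
    {
        carrier.Calls.Add(callNumber);
        found = true;
        break;
    }
}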
How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)
Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.
I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
You may want to use a HashTable in C#.
This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.
You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.
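A loose sketch of that accumulation using the generic Dictionary, with the fixed 7-digit prefix the answer mentions (dialedRows, Number, and Minutes are assumed names):

// Sum minutes per 7-digit prefix; assumes every dialed number has at least 7 digits.
var minutesByPrefix = new Dictionary<string, int>();
foreach (var row in dialedRows)
{
    string key = row.Number.Substring(0, 7);
    int current;
    minutesByPrefix.TryGetValue(key, out current);
    minutesByPrefix[key] = current + row.Minutes;
}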
I am fairly new to C# programming and I am stuck on my little ASP.NET project.
My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via a regular expression pattern matching procedure. Clearly more than one person will post a status with a specific URL, so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.
Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1]: if they match, a counter is incremented (count++), and if they don't match, the URL and the count value are added to a new array, assuming that this is the end of the duplicates.
This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).
With the code below, assume that a JSON feed has been searched using a keyword and the results are in srchResponse.results. The results with URLs in them get added to sList, a string list, which contains only the URLs, not the message as a whole.
I want to put one of each URL (no duplicates), a count integer (as a string) for the number of occurrences of a URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit, but generally this is too big. Each 'row' will have 5 elements in it.
The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!
Here is sample code:
var sList = new ArrayList();
string[][] urls = new string[100][];
int ctr = 0;
int j = 1;

foreach (Result res in srchResponse.results)
{
    string content = res.text;
    string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)";
    MatchCollection matches = Regex.Matches(content, pattern);
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        sList.Add(groups[0].Value.ToString());
    }
}

sList.Sort();

foreach (Result res in srchResponse.results)
{
    for (int i = 0; i < 100; i++)
    {
        if (sList[i] == sList[i + 1])
        {
            j++;
        }
        else
        {
            urls[ctr][0] = sList[i].ToString();
            urls[ctr][1] = j.ToString();
            urls[ctr][2] = res.text;
            urls[ctr][3] = res.from_user;
            urls[ctr][4] = res.profile_image_url;
            ctr++;
            j = 1;
        }
    }
}
The code then goes on to add each result into a StringBuilder method with the HTML.
The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)
However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.
Suppose you have a list of strings and you wish to determine the number of occurrences of each:
var notes = new[] { "Do", "Fa", "La", "So", "Mi", "Do", "Re" };
var counts = from note in notes
             group note by note into g
             select new { Note = g.Key, Count = g.Count() };
foreach (var count in counts)
    Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);
Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.
I'd recommend using a more sophisticated data structure than an array. A Set will guarantee that you have no duplicates.
It looks like the C# collections don't include a Set, but there are 3rd-party implementations available, like this one.
Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.
But as others have pointed out, .NET 3.5 has ways of doing what you want more elegantly.
If you don't need to know how many duplicates a specific entry has, you could use the LINQ extension methods Distinct() and Count().
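For example, a small sketch assuming sList is a strongly typed List<string>, as the earlier answer recommends (requires using System.Linq):

int totalUrls = sList.Count();               // total number of captured URLs
int uniqueUrls = sList.Distinct().Count();   // number of unique URLs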