Select a random item from a weighted list - C#

I am trying to write a program to select a random name from the US Census last name list. The list format is
Name      Weight  Cumulative  Line
--------  ------  ----------  ----
SMITH     1.006   1.006       1
JOHNSON   0.810   1.816       2
WILLIAMS  0.699   2.515       3
JONES     0.621   3.136       4
BROWN     0.621   3.757       5
DAVIS     0.480   4.237       6
Assuming I load the data into a structure like
class NameEntry
{
    public string Name { get; set; }
    public decimal Weight { get; set; }
    public decimal Cumulative { get; set; }
}
What data structure would be best to hold the list of names, and what would be the best way to select a random name from the list so that the distribution of names matches the real-world distribution?
I will only be working with the first 10,000 rows if it makes a difference in the data structure.
I have tried looking at some of the other questions about weighted randomness, but I am having a bit of trouble turning theory into code. I do not know much about math theory, so I do not know if this is a "with or without replacement" random selection; I want the same name to be able to show up more than once, whichever one that means.

The "easiest" way to handle this would be to keep this in a list.
You could then just use:
Name GetRandomName(Random random, List<Name> names)
{
double value = random.NextDouble() * names[names.Count-1].Culmitive;
return names.Last(name => name.Culmitive <= value);
}
If speed is a concern, you could store a separate array of just the Cumulative values. With this, you could use Array.BinarySearch to quickly find the appropriate index:
NameEntry GetRandomName(Random random, List<NameEntry> names, double[] cumulativeValues)
{
    double value = random.NextDouble() * (double)names[names.Count - 1].Cumulative;
    int index = Array.BinarySearch(cumulativeValues, value);
    // A negative result is the bitwise complement of the index of the next
    // larger element, which is exactly the entry we want.
    if (index < 0)
        index = ~index;
    return names[index];
}
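For completeness, a minimal usage sketch (assuming the NameEntry list from the question and a using System.Linq directive for Select):
// Build the parallel cumulative array once, then draw as many names as needed.
var random = new Random();
double[] cumulativeValues = names.Select(n => (double)n.Cumulative).ToArray();
NameEntry pick = GetRandomName(random, names, cumulativeValues);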
Another option, which is probably the most efficient, would be to use something like one of the C5 Generic Collection Library's tree classes. You could then use RangeFrom to find the appropriate name. This has the advantage of not requiring a separate collection.

I've created a C# library for randomly selecting weighted items.
It implements both the tree-selection and Walker alias method algorithms to give the best performance for all use cases.
It is unit-tested and optimized.
It has LINQ support.
It's free and open-source, licensed under the MIT license.
Some example code:
IWeightedRandomizer<string> randomizer = new DynamicWeightedRandomizer<string>();
randomizer["Joe"] = 1;
randomizer["Ryan"] = 2;
randomizer["Jason"] = 2;
string name1 = randomizer.RandomWithReplacement();
//name1 has a 20% chance of being "Joe", 40% of "Ryan", 40% of "Jason"
string name2 = randomizer.RandomWithRemoval();
//Same as above, except whichever one was chosen has been removed from the list.

I'd say an array (vectors if you prefer) would be best to hold them. As for the weighted selection, find the sum, pick a random number between zero and the sum, and pick the first name whose cumulative value is greater than that number (e.g. here, < 1.006 = SMITH, 1.006-1.816 = JOHNSON, etc.).
P.S. it's Cumulative.

Just for fun, and in no way optimal:
List<NameEntry> Names = // Load your structure into this
List<String> NameBank = new List<String>();
foreach (NameEntry name in Names)
    for (int i = 0; i < (int)(name.Weight * 1000); i++)
        NameBank.Add(name.Name);
then:
Random random = new Random();
String output = NameBank[random.Next(NameBank.Count)];


Random string collision after using Fisher-Yates algorithm (C#)

I am doing an exercise from exercism.io, in which I have to generate random names for robots. I am able to get through the bulk of the tests until I hit this test:
[Fact]
public void Robot_names_are_unique()
{
var names = new HashSet<string>();
for (int i = 0; i < 10_000; i++) {
var robot = new Robot();
Assert.True(names.Add(robot.Name));
}
}
After some googling around, I stumbled upon a couple of solutions and found out about the Fisher-Yates algorithm. I tried to implement it into my own solution but unfortunately, I haven't been able to pass the final test, and I'm stumped. If anyone could point me in the right direction with this, I'd greatly appreciate it. My code is below:
EDIT: I forgot to mention that the format of the string has to follow this: #"^[A-Z]{2}\d{3}$"
public class Robot
{
string _name;
Random r = new Random();
string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string nums = "0123456789";
public Robot()
{
_name = letter() + num();
}
public string Name
{
get { return _name; }
}
private string letter() => GetString(2 ,alpha.ToCharArray(), r);
private string num() => GetString(3, nums.ToCharArray(), r);
public void Reset() => _name = letter() + num();
public string GetString(int length,char[] chars, Random rnd)
{
Shuffle(chars, rnd);
return new string(chars, 0, length);
}
public void Shuffle(char[] _alpha, Random r)
{
for(int i = _alpha.Length - 1; i > 1; i--)
{
int j = r.Next(i);
char temp = _alpha[i];
_alpha[i] = _alpha[j];
_alpha[j] = temp;
}
}
}
The first rule of any ID is:
It does not matter how big it is or how many possible values it has - if you create enough of them, you will eventually get a collision.
To quote Trillian from the Hitchhiker's Guide: "[A collision] is not impossible. Just really, really unlikely."
However, in this case I think the problem is that you are creating Random instances in a loop. This is a classic beginner's mistake when working with Random. You should not create a new Random instance for each Robot instance; you should have one for the whole application that you re-use. Like all pseudorandom number generators, Random is deterministic. Same inputs - same outputs.
As you did not specify a seed value, it will use the current time in milliseconds, which is going to be the same for at least the first 20+ loop iterations. So those instances get the same seed and the same inputs, and therefore produce the same outputs.
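A minimal sketch of that fix - replace the per-instance Random field in the question's class with a single shared one (everything else stays the same):
public class Robot
{
    // One generator shared by all Robot instances, so each instance continues
    // the same sequence instead of re-seeding with the same tick count.
    private static readonly Random r = new Random();

    // ... rest of the original class unchanged
}
Note that this makes the names properly random, but still not guaranteed unique - see the birthday-paradox answer below for why collisions remain likely.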
The easiest solution for unique names is to use GUIDs. In theory it is possible to generate non-unique GUIDs, but the probability is pretty close to zero.
Here is the sample code:
var newUniqueName = Guid.NewGuid().ToString();
Sure GUIDs do not look pretty but they are really easy to use.
EDIT: Since I missed the additional requirement for the format, I see that the GUID format is not acceptable.
Here is an easy way to do that too. Since the format is two letters (26^2 possible values) and three digits (10^3 possible values), the final number of possible values is 26^2 * 10^3 = 676 * 1000 = 676000. This number is quite small, so Random can be used to generate a random integer in the range 0-675999, and then that number can be converted to the name. Here is the sample code:
var random = new System.Random();
var value = random.Next(676000);
var name = ((char)('A' + (value % 26))).ToString();
value /= 26;
name += (char)('A' + (value % 26));
value /= 26;
name += (char)('0' + (value % 10));
value /= 10;
name += (char)('0' + (value % 10));
value /= 10;
name += (char)('0' + (value % 10));
The usual disclaimer about possible identical names applies here too, since we have 676,000 possible values and 10,000 required names.
EDIT2: I tried the code above, and generating 10,000 names using random numbers produced between 9,915 and 9,950 unique names. That is no good. I would use a simple static class member as a counter instead of a random number generator.
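A rough sketch of that counter idea (my own illustration, reusing the decoding from the code above; names stay unique for the first 676,000 robots):
public class Robot
{
    // Each new Robot takes the next value; Interlocked keeps this safe even
    // if robots are created from several threads.
    private static int counter = -1;

    public string Name { get; }

    public Robot()
    {
        int value = System.Threading.Interlocked.Increment(ref counter);
        var name = ((char)('A' + (value % 26))).ToString();
        value /= 26;
        name += (char)('A' + (value % 26));
        value /= 26;
        name += (char)('0' + (value % 10));
        value /= 10;
        name += (char)('0' + (value % 10));
        value /= 10;
        name += (char)('0' + (value % 10));
        Name = name;
    }
}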
First, let's review the test your code is failing against:
10,000 instances created
Must all have distinct names
So somehow, when creating 10,000 "random" names, your code produces at least two names that are the same.
Now, let's have a look at the naming scheme you're using:
AB123
The maximum number of unique names we could possibly create is 468,000 (26 * 25 * 10 * 9 * 8, since your shuffle-based approach never repeats a letter or a digit within a name).
This seems like it should not be a problem, because 10000 < 468000 - but this is where the birthday paradox comes in!
From wikipedia:
In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.
Rewritten for the purposes of your problem, we end up asking:
What's the probability that, in a set of 10000 randomly chosen people, some pair of them will have the same name.
The Wikipedia article also lists a function for approximating the number of people required to reach a 50% probability that two people will have the same name:
n ≈ 1.1774 * sqrt(m)
where m is the total number of possible distinct values. Applying this with m = 468,000 gives us ~806 - meaning that after creating only 806 randomly named robots, there's already a 50% chance of two of them having the same name.
By the time you reach Robot #10000, the chances of not having generated two names that are the same is basically 0.
As others have noted, you can solve this by using a Guid as the robot name instead.
If you want to retain the naming convention, you might also get around this by implementing an LCG with an appropriate period and using that as a less collision-prone "naming generator".
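For illustration, a minimal full-period LCG over the 676,000-value name space (the constants are my own choice, not something from the exercise):
// x -> (261 * x + 1) mod 676000 visits every value exactly once before the
// sequence repeats (Hull-Dobell: a - 1 = 260 is divisible by 2, 5 and 13, the
// prime factors of 676000, and by 4, and c = 1 is coprime to 676000).
static int state = 123456;                                   // arbitrary seed
static int NextValue() => state = (261 * state + 1) % 676000;
// Decode NextValue() into the "AB123" form with the same letter/digit
// arithmetic shown in the other answers.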
Here's one way you can do it:
Generate the list of all possible names
For each robot, select a name from the list at random
Remove the selected name from the list so it can't be selected again
With this, you don't even need to shuffle. Something like this (note: I stole Optional Option's method of generating names because it's quite clever and I couldn't be bothered thinking of my own):
public class Robot
{
private static List<string> names;
private static Random rnd = new Random();
public string Name { get; private set; }
static Robot()
{
Console.WriteLine("Initializing");
// Generate possible candidates
names = Enumerable.Range(0, 676000).Select(i =>
{
var sb = new StringBuilder(5);
sb.Append((char)('A' + i % 26));
i /= 26;
sb.Append((char)('A' + i % 26));
i /= 26;
sb.Append(i % 10);
i /= 10;
sb.Append(i % 10);
i /= 10;
sb.Append(i % 10);
return sb.ToString();
}).ToList();
}
public Robot()
{
// Note: if this needs to be multithreaded, then you'd need to do some work here
// to avoid two threads trying to take a name at the same time
// Also note: you should probably check that names.Count > 0
// and throw an error if not
var i = rnd.Next(names.Count); // upper bound is exclusive
Name = names[i];
names.RemoveAt(i);
}
}
Here's a fiddle that generates 20 random names. They are guaranteed to be unique because each name is removed from the list once it has been used.
The point about multithreading is very important, however. If you needed to be able to generate robots in parallel, then you'd need to add some code (e.g. locking the critical section of code) to ensure that only one name is being picked and removed from the list of candidates at a time, or else things will get really bad, really quickly. This is why, when people need a random id with a reasonable expectation that it'll be unique, without worrying that some other thread(s) are trying the same thing at the same time, they use GUIDs. The sheer number of possible GUIDs makes collisions very unlikely. But you don't have that luxury with only 676,000 possible values.

Combining two different pieces of information in binary code

I have a Dictionary<string,T> where the string represents the key of a record, and I have two other pieces of information about the record that I need to maintain for each record in the dictionary: the category of the record and its redundancy (how many times it is repeated).
For example: the record XYZ1 is of category 1, and it is repeated 1 time. Therefore the implementation has to be something like this:
"XYZ1", {1,1}
Now moving on, I may encounter the same record in my dataset, therefore the value of the key has to be updated like:
"XYZ1", {1,2}
"XYZ1", {1,3}
...
Since I am processing a big number of records, such as 100K, I tried this approach, but it seems inefficient because of the extra effort of fetching the value from the dictionary, slicing {1,1}, and then converting both slices into integers, which puts a lot of overhead on the execution.
I was thinking of using binary digits to represent both category and repetition, and maybe a bitmask to fetch these pieces.
Edit: I tried to use an object with 2 properties, and then Tuple<int,int>. Complexity got worse!
My question: is it possible to do so?
If not (in terms of complexity), any suggestions?
What is your type T? You could define a custom type which holds the information you need (category and occurrences).
class MyInfo {
public int c { get; set; }
public int o { get; set; }
}
Dictionary<String, MyInfo> data;
Then when traversing your data you can easily check whether some key is already present. If yes, just increment the occurences, else insert a new element.
MyInfo d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        data.Add(e.key, new MyInfo { c = e.cat, o = 1 });
    else
        d.o++;
}
EDIT
You could also combine the category and the number of occurrences into one UInt64. For instance, take the category in the upper 32 bits (i.e. you can have 4 billion categories) and the number of occurrences in the lower 32 bits (i.e. each key can occur 4 billion times).
Dictionary<string, UInt64> data;
UInt64 d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        data[e.key] = ((UInt64)e.cat << 32) + 1;
    else
        data[e.key] = d + 1;
}
And if you want to get the number of occurrences for one specific key you can just inspect the respective part of the value.
var d = data["somekey"];
var occurrences = d & 0xFFFFFFFF;
var category = d >> 32;
It seems like category never changes. So rather than using a simple string for the key of your dictionary, I would instead do something like:
Dictionary<Tuple<string,int>, int>, where the key of the dictionary is a Tuple<string,int> in which the string is the record and the int is the category. Then the value in the dictionary is just a count.
A dictionary is probably going to be the fastest data structure for what you're trying to accomplish as it has near constant time O(1) lookup and entry.
You can speed it up a little bit by using the Tuple, as now the category is part of the key and no longer a bit of information you have to access separately.
At the same time you could also keep the string as the key and store a Tuple<int,int> as the value and simply set Item1 as the category and Item2 as the count.
Either way is going to be roughly equivalent in speed. Processing 100k records in such a manner should be pretty fast either way.
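A minimal sketch of that Tuple-keyed variant (my own illustration; recordId and category are assumed inputs, not names from the question):
var data = new Dictionary<Tuple<string, int>, int>();
// ... for each record seen:
var key = Tuple.Create(recordId, category);
data.TryGetValue(key, out int count);   // count stays 0 when the key is new
data[key] = count + 1;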

Check if int is 10, 100, 1000, ...

I have a part in my application which needs to do something (=> add padding 0 in front of other numbers) when a specified number gains an additional digit, meaning it reaches 10, 100, 1000 and so on...
At the moment I use the following logic for that:
public static bool IsNewDigit(this int number)
{
var numberString = number.ToString();
return numberString.StartsWith("1")
&& numberString.Substring(1).All(c => c == '0');
}
Then I can do:
if (number.IsNewDigit()) { /* add padding 0 to other numbers */ }
This seems like a "hack" to me, using the string conversion.
Is there something better (maybe even built-in) to do this?
UPDATE:
One example where I need this:
I have an item with the following (simplified) structure:
public class Item
{
public int Id { get; set; }
public int ParentId { get; set; }
public int Position { get; set; }
public string HierarchicPosition { get; set; }
}
HierarchicPosition is the item's own position (with the padding) combined with the parent's HierarchicPosition. E.g. an item which is the 3rd child of 12 under an item at position 2 has 2.03 as its HierarchicPosition. This can as well be something more complicated like 011.7.003.2.02.
This value is then used for sorting the items very easily in a "tree-view" like structure.
Now I have an IQueryable<Item> and want to add one item as the last child of another item. To avoid needing to recreate all HierarchicPosition values, I would like to detect (with the logic in question) whether the new position adds a new digit:
Item newItem = GetNewItem();
IQueryable<Item> items = db.Items;
var maxPosition = items.Where(i => i.ParentId == newItem.ParentId)
.Max(i => i.Position);
newItem.Position = maxPosition + 1;
if (newItem.Position.IsNewDigit())
UpdateAllPositions(items.Where(i => i.ParentId == newItem.ParentId));
else
newItem.HierarchicPosition = GetHierarchicPosition(newItem);
UPDATE #2:
I query this position string from the DB like:
var items = db.Items.Where(...)
.OrderBy(i => i.HierarchicPosition)
.Skip(pageSize * pageNumber).Take(pageSize);
Because of this I can not use an IComparer (or something else which sorts "via code").
This will return items with HierarchicPosition like (pageSize = 10):
03.04
03.05
04
04.01
04.01.01
04.01.02
04.02
04.02.01
04.03
05
UPDATE #3:
I like the alternative solution with the double values, but I have some "more complicated cases" like the following, which I am not sure I can solve with that:
I am building (as one part of many) an image gallery, which has Categories and Images. A category can have a parent and multiple children, and each image belongs to a category (I called them Holder and Assets in my logic - so each image has a holder and each category can have multiple assets). The images are sorted first by the category's position and then by their own position. I do this by combining the HierarchicPosition like HolderHierarchicPosition#ItemHierarchicPosition. So in a category which has 02.04 as its position and 120 images, the 3rd image would get 02.04#003.
I even have some cases with "three levels" (or maybe more in the future) like 03.1#02#04.
Can I adapt the "double solution" to support such scenarios?
P.S.: I am also open to other solution for my base problem.
You could check if the base-10 logarithm of the number is an integer. (10 -> 1, 100 -> 2, 1000 -> 3, ...)
This could also simplify your algorithm a bit in general. Instead of adding one 0 of padding every time you find something bigger, simply keep track of the maximum number you see, then take length = floor(log10(number))+1 and make sure everything is padded to length. This part does not suffer from the floating point arithmetic issues like the comparison to integer does.
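A minimal sketch of that padding-to-the-maximum idea (my own illustration; it uses an integer loop instead of Math.Log10 to sidestep floating-point edge cases, and assumes a using System.Linq directive):
// Number of decimal digits in "number": 7 -> 1, 10 -> 2, 1000 -> 4.
static int DigitCount(int number)
{
    int digits = 1;
    while (number >= 10)
    {
        number /= 10;
        digits++;
    }
    return digits;
}

// Pad every sibling position to the width of the largest one.
static IEnumerable<string> PadPositions(IEnumerable<int> positions)
{
    var list = positions.ToList();
    int width = DigitCount(list.Max());
    return list.Select(p => p.ToString().PadLeft(width, '0'));
}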
From what you describe, it looks like your HierarchicPosition should maintain an order of items, and you run into the problem that when you have the ids 1..9 and add a 10, you'll get the order 1, 10, 2, 3, 4, 5, 6... somewhere and therefore want to pad-left to 01, 02, 03..., 10 - correct?
If I'm right, please have a look at this first: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Because what you are trying to do is a workaround that solves the problem in a certain way - but there might be more efficient ways to actually solve it. (Therefore you would have been better off asking about your actual problem rather than the solution you are trying to implement.)
See here for a solution using a custom IComparer to sort strings (that are actually numbers) in a natural way: http://www.codeproject.com/Articles/11016/Numeric-String-Sort-in-C
Update regarding your update:
By providing a sorting "String" like you do, you can insert an element "somewhere" without having ALL subsequent items reindexed, as you would have to for an integer value. (This seems to be the purpose.)
Instead of building up a complex "String", you could use a double value to achieve the very same result rather simply:
If you insert an item somewhere between 2 existing items, all you have to do is this.sortingValue = (prior.sortingValue + next.sortingValue) / 2 and handle the case when you are inserting at the end of the list.
Let's assume you add Elements in the following order:
1 First Element // pick a double value for sorting - 100.00 for example. -> 100.00
2 Next Element // this is the list end - let's just add another 100.00 -> 200.00
1.1 Child // this should go "in between": (100+200)/2 = 150.00
1.2 Another // between 1.1 and 2 : (150+200)/2 = 175
When you now simple sort depending on that double field, the order would be:
100.00 -> 1
150.00 -> 1.1
175.00 -> 1.2
200.00 -> 2
Wanna add 1.1.1? Great: position = (150.00 + 175.00) / 2;
You could simply multiply everything by 10 whenever your NEW value hits x.5* to ensure you are not running out of decimal places (but you don't have to - having .5, .25, .125 ... does not hurt the sorting):
So, after adding the 1.1.1, which would be 162.5, multiply all by 10:
1000.00 -> 1
1500.00 -> 1.1
1625.00 -> 1.1.1
1750.00 -> 1.2
2000.00 -> 2
So, whenever you move an item around, you only need to recalculate the position of n by looking at n-1 and n+1.
Depending on the expected number of children per entry, you could start with "1000.00", "10000.00" or whatever fits best.
What I didn't take into account: when you want to move "2" to the top, you would need to recalculate all children of "2" to have a value somewhere between the sorting value of "2" and the now "next" item... That could cause some headaches :)
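A minimal sketch of that midpoint idea (my own illustration; the Item type with a double SortKey is an assumption, not the asker's class):
class Item
{
    public string Title { get; set; }
    public double SortKey { get; set; }       // the only thing the sort looks at
}

// Give the new item a key halfway between its neighbours; pass null for next
// to append at the end of the list.
static void InsertBetween(Item newItem, Item prior, Item next)
{
    newItem.SortKey = next == null
        ? prior.SortKey + 100.0
        : (prior.SortKey + next.SortKey) / 2;
}
Sorting is then just items.OrderBy(i => i.SortKey).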
The solution with "double" values has some limitations, but it will work for smaller sets of groups. However, you are talking about "groups, subgroups, and pictures with counts of 100" - so another solution would be preferable:
First, you should refactor your database: currently you are trying to "squeeze" a tree into a list (data tables are basically lists).
To really reflect the complex layout of a tree of infinite depth, you should use 2 tables and implement the composite pattern.
Then you can use a recursive approach to get a category, its subcategory, [...] and finally the elements of that category.
With that, you only need to provide a position for each leaf within its current node.
Rearranging leaves will not affect any leaf of another node or any node.
Rearranging nodes will not affect any subnode or leaf of that node.
You could check the sum of the squares of all digits of the input; 10, 100, and 1000 have something in common: the sum of the squares of their digits is one.
10
1^2 + 0^2 = 1
100
1^2 + 0^2 + 0^2 = 1
and so on, so forth.

Pick up two numbers from an array so that the sum is a constant

I came across an algorithm problem. Suppose I receive a credit and would like to buy two items from a local store. I would like to buy two items that add up to the entire value of the credit. The input data has three lines.
The first line is the credit, the second line is the total number of items, and the third line lists all the item prices.
Sample data 1:
200
7
150 24 79 50 88 345 3
Which means I have $200 to buy two items and there are 7 items. I should buy item 1 and item 4, as 200 = 150 + 50.
Sample data 2:
8
8
2 1 9 4 4 56 90 3
Which indicates that I have $8 to pick two items from a total of 8 items. The answer is item 4 and item 5 because 8 = 4 + 4.
My thought is first to create the array, of course, then pick any item, say item x, and create another array, say "remain", which removes x from the original array.
Then subtract the price of x from the credit to get the remainder and check whether "remain" contains that remainder.
Here is my code in C#.
// Read lines from input file and create array price
int index1, index2;
List<string> remain;
foreach (string s in price)
{
    int x = Int32.Parse(s);
    string y = (credit - x).ToString();
    index1 = Array.IndexOf(price, s);
    index2 = Array.IndexOf(price, y);
    remain = price.ToList();
    remain.RemoveAt(index1); // remove the current element
    if (remain.Contains(y))
    {
        break;
    }
}
// return something....
My two questions:
What is the complexity? I think it is O(n^2).
Any improvement to the algorithm? When I use sample 2, I have trouble getting the correct indices. Because there are two "4"s in the array, it always returns the first index, since IndexOf(String) reports the zero-based index of the first occurrence of the specified string.
You can simply sort the array in O(n log n) time. Then, for each element A[i], conduct a binary search for S - A[i], again O(n log n) in total.
EDIT: As pointed out by Heuster, you can solve the 2-SUM problem on the sorted array in linear time by using two pointers (one from the beginning and one from the end).
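A minimal sketch of that two-pointer pass (my own illustration; it returns the two prices, or null if no pair adds up to the target):
static int[] FindPair(int[] prices, int target)
{
    var sorted = (int[])prices.Clone();
    Array.Sort(sorted);

    int lo = 0, hi = sorted.Length - 1;
    while (lo < hi)
    {
        int sum = sorted[lo] + sorted[hi];
        if (sum == target)
            return new[] { sorted[lo], sorted[hi] };
        if (sum < target)
            lo++;      // need a bigger sum: move the left pointer up
        else
            hi--;      // need a smaller sum: move the right pointer down
    }
    return null;       // no two items add up to the target
}
For sample 2, FindPair(new[] { 2, 1, 9, 4, 4, 56, 90, 3 }, 8) finds the 4 + 4 pair; duplicates are handled naturally because both occurrences stay in the array.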
Create a HashSet<int> of the prices. Then go through it sequentially. Something like:
HashSet<int> items = new HashSet<int>(itemsList);
int price1 = -1;
int price2 = -1;
foreach (int price in items)
{
int otherPrice = 200 - price;
if (items.Contains(otherPrice))
{
// found a match.
price1 = price;
price2 = otherPrice;
break;
}
}
if (price2 != -1)
{
// found a match.
// price1 and price2 contain the values that add up to your target.
// now remove the items from the HashSet
items.Remove(price1);
items.Remove(price2);
}
This is O(n) to create the HashSet. Because lookups in the HashSet are O(1), the foreach loop is O(n).
This problem is called 2-sum. See, for example, http://coderevisited.com/2-sum-problem/
Here is an algorithm with O(N) time complexity and O(N) space:
1. Put all numbers in a hash table.
2. For each number Arr[i], look up Sum - Arr[i] in the hash table in O(1).
3. If found, then (Arr[i], Sum - Arr[i]) is your pair that adds up to Sum.
Note: the only failing case is when Arr[i] = Sum/2, where you can get a false positive; but you can always check in O(N) whether there are two occurrences of Sum/2 in the array.
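A minimal sketch of that approach with the Sum/2 check folded in (my own illustration, counting occurrences instead of using a plain set):
static bool TryFindPair(int[] prices, int sum, out int a, out int b)
{
    // Count occurrences so the Arr[i] == Sum/2 case only matches when the
    // value really appears twice.
    var counts = new Dictionary<int, int>();
    foreach (int p in prices)
        counts[p] = counts.TryGetValue(p, out int c) ? c + 1 : 1;

    foreach (int p in prices)
    {
        int other = sum - p;
        if (counts.ContainsKey(other) && (other != p || counts[p] >= 2))
        {
            a = p;
            b = other;
            return true;
        }
    }
    a = b = 0;
    return false;
}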
I know I am posting this a year and a half later, but I just happened to come across this problem and wanted to add my input.
If there exists a solution, then you know that both values in the solution must both be less than the target sum.
Perform a binary search in the array of values, searching for the target sum (which may or may not be there).
The binary search will end with either finding the sum, or the closest value less than sum. That is your starting high value while searching through the array using the previously mentioned solutions. Any value above your new starting high value cannot be in the solution, as it is more than the target value.
At this point, you have eliminated a chunk of data in log(n) time, that would otherwise be eliminated in O(n) time.
Again, this is an optimization that may only be worth implementing if the data set calls for it.

Comparing 2 huge lists using C# multiple times (with a twist)

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.
I have several Excel tables with information (all numeric); basically it is "dialed phone numbers" in one column and the number of minutes to each of those numbers in another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13-digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1; these codes vary in length, and I need to check every one of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data; sometimes the dialed number lists consist of up to 30,000 numbers, the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers, which means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or any other language, to be honest. I need to do this quite often, and designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time answering my question. I guess in my ignorance I exaggerated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day, and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.
UPDATE
You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;
var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);
var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();
So how fast is this? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL   Runtime
---------------
1     10,198 ms
2      1,179 ms
3        205 ms
4        130 ms
5        107 ms
Just to give a rough idea - I did just one run and have a lot of other stuff going on.
Original answer
I wouldn't care much about performance - an average desktop PC can quite easily deal with database tables with 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list of 15,000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50,000 numbers with a prefix and an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my 2.0 GHz Core 2 Duo laptop. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
namespace Test
{
static class Program
{
static void Main()
{
// Set number of prefixes and calls to not more than 50 to get results
// printed to the console.
Console.Write("Generating prefixes");
List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
Console.WriteLine();
Console.Write("Generating calls");
List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
Console.WriteLine();
Console.WriteLine("Processing started.");
Stopwatch stopwatch = new Stopwatch();
const Int32 minimumPrefixLength = 5;
stopwatch.Start();
var groupedPefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var result = calls
.GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
.First(c.Number.StartsWith))
.Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
.ToList();
stopwatch.Stop();
Console.WriteLine("Processing finished.");
Console.WriteLine(stopwatch.Elapsed);
if ((prefixes.Count <= 50) && (calls.Count <= 50))
{
Console.WriteLine("Prefixes");
foreach (String prefix in prefixes.OrderBy(p => p))
{
Console.WriteLine(String.Format(" prefix={0}", prefix));
}
Console.WriteLine("Calls");
foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
{
Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration));
}
Console.WriteLine("Result");
foreach (Call call in result.OrderBy(c => c.Number))
{
Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration));
}
}
Console.ReadLine();
}
private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<String> prefixes = new List<String>(count);
StringBuilder stringBuilder = new StringBuilder(maximumLength);
while (prefixes.Count < count)
{
stringBuilder.Length = 0;
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
String prefix = stringBuilder.ToString();
if (prefixes.Count % 1000 == 0)
{
Console.Write(".");
}
if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
{
prefixes.Add(stringBuilder.ToString());
}
}
return prefixes;
}
private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<Call> calls = new List<Call>(count);
StringBuilder stringBuilder = new StringBuilder();
while (calls.Count < count)
{
stringBuilder.Length = 0;
stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
if (calls.Count % 1000 == 0)
{
Console.Write(".");
}
calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
}
return calls;
}
private class Call
{
public Call (String number, Decimal duration)
{
this.Number = number;
this.Duration = duration;
}
public String Number { get; private set; }
public Decimal Duration { get; private set; }
}
}
}
It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
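A rough sketch of that idea (my own illustration; the node layout, PrefixTrie, and the call/minutes shape are assumptions, not code from the question):
using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string Carrier;   // non-null only on nodes that terminate a prefix
}

class PrefixTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string prefix, string carrier)
    {
        var node = root;
        foreach (char digit in prefix)
        {
            if (!node.Children.TryGetValue(digit, out var next))
            {
                next = new TrieNode();
                node.Children[digit] = next;
            }
            node = next;
        }
        node.Carrier = carrier;
    }

    // Walk down the trie digit by digit and remember the deepest carrier seen,
    // so the longest matching prefix wins.
    public string LongestPrefixCarrier(string number)
    {
        var node = root;
        string carrier = null;
        foreach (char digit in number)
        {
            if (!node.Children.TryGetValue(digit, out node))
                break;
            if (node.Carrier != null)
                carrier = node.Carrier;
        }
        return carrier;
    }
}
Building the totals then looks something like this (trie is a PrefixTrie filled via Add, and each call row is assumed to have a Number and Minutes):
var totals = new Dictionary<string, long>();   // carrier -> accumulated minutes
foreach (var call in calls)
{
    string carrier = trie.LongestPrefixCarrier(call.Number);
    if (carrier == null)
        continue;                              // no prefix matched this number
    totals.TryGetValue(carrier, out long minutes);
    totals[carrier] = minutes + call.Minutes;
}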
The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
bool found = false;
for (int length = 1; length <= 7; length++)
{
int prefix = ExtractDigits(callNumber, length);
if (carrier.Prefixes.Contains(prefix))
{
carrier.Calls.Add(callNumber);
found = true;
break;
}
}
if (found)
break;
}
If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
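A rough sketch of that per-length refinement (my own illustration, keeping the prefixes as strings; PrefixesByLength and PrefixStrings are assumed members, not part of the code above):
// Build once per carrier: prefix length -> the set of prefixes of that length.
carrier.PrefixesByLength = carrier.PrefixStrings
    .GroupBy(p => p.Length)
    .ToDictionary(g => g.Key, g => new HashSet<string>(g));

// Per call: only try the lengths this carrier actually uses.
foreach (var entry in carrier.PrefixesByLength)
{
    int length = entry.Key;
    if (callNumber.Length >= length &&
        entry.Value.Contains(callNumber.Substring(0, length)))
    {
        carrier.Calls.Add(callNumber);
        break;
    }
}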
How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)
Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.
I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
You may want to use a HashTable in C#.
This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.
You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.
