Best Way to Check for Used Key with Nhibernate?


On my site I allow people to buy subscriptions in bulk (I call them vouchers). Once they have these vouchers, they can give them to whoever they like, and the recipient enters the code into their account to upgrade it.
Right now I am thinking of using a 4-character alphanumeric code (upper case, lower case and digits), generated with something like this:
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[4];
var random = new Random();
for (int i = 0; i < stringChars.Length; i++)
{
stringChars[i] = chars[random.Next(chars.Length)];
}
var finalString = new String(stringChars);
For now I think that will give me more than enough combinations, and if I ever do run out I can always increase the length of the code. I want to keep it short because I don't want the user to have to type in a huge code.
I also don't have the time to make a more elegant solution, such as a link in their email that they click to activate their account, which of course would also cut down on people trying to randomly guess a voucher code.
These are things I would deal with if the site ever becomes more popular.
I am wondering, though, how I can handle the possible generation of a duplicate voucher. My first thought was to check the database each time a voucher is created and, if the code already exists, generate a new one.
However, that seems like it could be slow. So I also thought about loading all the keys first, storing them in memory and checking there, but if the list keeps growing I might run into out-of-memory exceptions and all that great stuff.
So does anyone have any ideas? Or am I stuck with one of the two methods I listed above?
I am using nhibernate, asp.net mvc and C#.
Edit
static void Main(string[] args)
{
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
HashAlgorithm sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
hold.Add(hex.Substring(0,3));
Console.WriteLine(hex.Substring(0, 4));
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
Above is my attempt to use hashing. However, I think I am missing something, as it seems to produce quite a few more duplicates than expected.
Edit 2
I think I added what I was missing, but I am not sure if this is exactly what he meant. I am also not sure what to do when I have moved the starting position as far as I can (my hash seems to give me a length of 40 places I can move through).
static void Main(string[] args)
{
int subStringLength = 4;
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
SHA1CryptoServiceProvider sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
int startingPositon = 0;
string possibleVoucherCode = hex.Substring(startingPositon,subStringLength);
string voucherCode = Move(subStringLength, hold, hex, startingPositon, possibleVoucherCode);
hold.Add(voucherCode);
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
private static string Move(int subStringLength, List<string> hold, string hex, int startingPositon, string possibleVoucherCode)
{
if (hold.Contains(possibleVoucherCode))
{
int newPosition = startingPositon + 1;
if (newPosition <= hex.Length)
{
if ((newPosition + subStringLength) > hex.Length)
{
possibleVoucherCode = hex.Substring(newPosition, subStringLength);
return Move(subStringLength, hold, hex, newPosition, possibleVoucherCode);
}
// return something
return "0";
}
else
{
// return something
return "0";
}
}
else
{
return possibleVoucherCode;
}
}
}

It is going to be slow because you want to generate the vouchers randomly and then check the database for every generated code.
I would create a table vouchers with an id, the code and an is_used column. I would fill that table once with enough random codes. Since this can be done in a separate process, the performance won't be such a big problem. Let it run in the evening and the next day you get a fully filled vouchers-table.
If you want to prevent generating duplicate vouchers, that won't be a problem. You can generate them anyway and put them either in a System.Collections.Generic.HashSet (which prevents adding duplicates without throwing an exception) or call the Linq-method Distinct(), before adding them to that vouchers table.
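The pre-generation step described above could look roughly like the following. This is only a sketch: the class name `VoucherPool`, the method `Generate`, and the 62-character alphabet (borrowed from the question) are assumptions for illustration, and writing the result to the vouchers table is left out.

```csharp
using System;
using System.Collections.Generic;

public static class VoucherPool
{
    // Alphabet taken from the question: upper case, lower case and digits.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // Generates `count` distinct random codes of the given length.
    // HashSet<T>.Add silently ignores duplicates, so the loop just keeps
    // going until enough distinct codes have been collected.
    public static HashSet<string> Generate(int count, int length, Random random)
    {
        // Guard against asking for more codes than the key space holds,
        // which would otherwise loop forever.
        double keySpace = Math.Pow(Alphabet.Length, length);
        if (count > keySpace)
            throw new ArgumentOutOfRangeException(nameof(count));

        var codes = new HashSet<string>();
        var buffer = new char[length];
        while (codes.Count < count)
        {
            for (int i = 0; i < buffer.Length; i++)
                buffer[i] = Alphabet[random.Next(Alphabet.Length)];
            codes.Add(new string(buffer));
        }
        return codes;
    }
}
```

Calling `VoucherPool.Generate(10000, 4, new Random())` once in the offline job would give a set that can be inserted into the table in one go. Note that `System.Random` is not a cryptographic generator, so this is only suitable if guessability is not a concern.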

If you insist on short codes:
Use a GUID as the primary key and generate one random number. How you translate this into an alphanumeric code is up to you.
Use the last byte or two of the GUID together with the random number, e.g. 1234-684687. This should make it slightly harder to brute-force coupons. Handle any (rare) collisions with an exception.
An easy way to shorten an int is to change its base (from 10 to 62). (This is in VB, and it is old code.)
This yields "2lkCB1" when given Int32.MaxValue:
''//given intValue as your random integer
Dim result As String = String.Empty
Dim digits as String = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Dim x As Integer
While (intValue > 0)
x = intValue Mod digits.Length
result = digits(x) & result
intValue = intValue - x
intValue = intValue \ digits.Length
End While
Return result
But now we're already answering more than one question.
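For readers working in C#, the base-62 conversion from the VB snippet above could be sketched like this. The class and method names (`Base62`, `Encode`) are made up for illustration; the digit string is the one used in the VB code. One deliberate difference: the VB loop returns an empty string for 0, whereas this sketch returns "0".

```csharp
using System;
using System.Text;

public static class Base62
{
    // Same digit order as the VB snippet: 0-9, then a-z, then A-Z.
    private const string Digits =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(int value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        if (value == 0) return "0";

        var result = new StringBuilder();
        while (value > 0)
        {
            // Prepend the least significant base-62 digit, then shift right.
            result.Insert(0, Digits[value % Digits.Length]);
            value /= Digits.Length;
        }
        return result.ToString();
    }
}
```

`Base62.Encode(int.MaxValue)` gives "2lkCB1", matching the value quoted above.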

For a bulk data operation like this, I would recommend not using NHibernate and just doing straight ADO.NET.
Batch Check
Since you anticipate generating big batches of codes at once, you should batch multiple code checks into a single round-trip to the database. If you're using SQL Server 2008 or higher, you could do this using table-valued parameters, checking a whole list of codes at once.
SELECT DISTINCT b.Code
FROM @batch b
WHERE NOT EXISTS (
SELECT v.Code
FROM dbo.Voucher v
WHERE v.Code = b.Code
);
Concurrency
Now, what about concurrency issues? What if two users generate the same code at roughly the same time? Or simply in-between the time when we check the code for uniqueness and when we insert it into the Voucher table?
We can take care of that by modifying the query as follows:
DECLARE @batchid uniqueidentifier;
SET @batchid = NEWID();
INSERT INTO dbo.Voucher (Code, BatchId)
SELECT DISTINCT b.Code, @batchid
FROM @batch b
WHERE NOT EXISTS (
SELECT Code
FROM dbo.Voucher v
WHERE b.Code = v.Code
);
SELECT Code
FROM dbo.Voucher
WHERE BatchId = @batchid;
Executing via .NET
Assuming that you have defined the following table-valued user type...
CREATE TYPE dbo.VoucherCodeList AS TABLE (
Code nvarchar(8) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
/* !!! Remember to specify the collation on your Voucher.Code column too, since you want upper and lower-case codes. */
);
... you could execute this query via .NET code like this:
public ICollection<string> GenerateCodes(int numberOfCodes)
{
var result = new List<string>(numberOfCodes);
while (result.Count < numberOfCodes)
{
var batchSize = Math.Min(_batchSize, numberOfCodes - result.Count);
var batch = Enumerable.Range(0, batchSize)
.Select(x => GenerateRandomCode());
var oldResultCount = result.Count;
result.AddRange(FilterAndSecureBatch(batch));
var filteredBatchSize = result.Count - oldResultCount;
var collisionRatio = ((double)batchSize - filteredBatchSize) / batchSize;
// Automatically increment length of random codes if collisions begin happening too frequently
if (collisionRatio > _collisionThreshold)
CodeLength++;
}
return result;
}
private IEnumerable<string> FilterAndSecureBatch(IEnumerable<string> batch)
{
using (var command = _connection.CreateCommand())
{
command.CommandText = _sqlQuery; // the concurrency-safe query listed above
var metaData = new[] { new SqlMetaData("Code", SqlDbType.NVarChar, 8) };
var param = command.Parameters.Add("@batch", SqlDbType.Structured);
param.TypeName = "dbo.VoucherCodeList";
param.Value = batch.Select(x =>
{
var record = new SqlDataRecord(metaData);
record.SetString(0, x);
return record;
});
using (var reader = command.ExecuteReader())
while (reader.Read())
yield return reader.GetString(0);
}
}
Performance
After implementing all of this (and moving the command and parameter creation out of the loop so it would be re-used between batches), I was able to insert 10,000 codes using a batch size of 500 consistently in approx. 0.5 to 2 seconds, or 5 to 20 codes per millisecond.
Code Density / Collisions / Guessability
The _collisionThreshold field limits the density of your codes. It's a value between 0 and 1. Actually, it must be less than 1 or else you would wind up in an infinite loop when the 4 digit codes were exhausted (probably should add an assertion for this in code). I would recommend never turning it above 0.5 for performance reasons. More than 50% collisions would mean it's spending more time testing already-used codes than actually generating new ones.
Keeping the collision threshold low is how you would control how hard-to-guess your codes are. Setting _collisionThreshold to 0.01 would generate codes such that there's approximately a 1% chance of someone guessing a code.
If collisions occur too frequently, CodeLength (which is used by the GenerateRandomCode() method) will be incremented. This value needs to be persisted somewhere. After executing GenerateCodes(), check CodeLength to see if it has changed and then save the new value.
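To make the density argument concrete, here is a small sketch of the arithmetic, assuming codes are drawn uniformly from the 62-character alphabet in the question. The names `CodeDensity`, `KeySpace` and `GuessProbability` are illustrative and are not part of the linked gist.

```csharp
using System;

public static class CodeDensity
{
    // Number of possible codes for a 62-character alphabet (A-Z, a-z, 0-9).
    public static double KeySpace(int codeLength) => Math.Pow(62, codeLength);

    // Chance that a single uniformly random guess hits an issued code,
    // which is also roughly the collision rate seen when generating new codes.
    public static double GuessProbability(long issuedCodes, int codeLength) =>
        issuedCodes / KeySpace(codeLength);
}
```

With 4-character codes the key space is 62^4 = 14,776,336, so a density of 1% corresponds to roughly 147,763 issued codes before the length would need to grow.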
Source Code
The full code is available here: https://gist.github.com/3217856. I am the author of this code, and am releasing it under the MIT license. I had fun with this little challenge, and also got to learn how to pass a table-valued parameter to an inline parametrized query. I hadn't ever done that before. I've only ever passed them to full-fledged stored procedures.

A possible solution for you is like this:
Find the maximum ID of a voucher (an integer). Then, run any hash function on it, take the first 32 bits and convert to the string you want to show the user (or use a 32bit hash function such as Jenkins hash function). This will probably work, hash collisions are pretty rare. But this solution is very similar to yours, in the point of randomness.
You could run a test which finds the first 10 or 100 collisions (this should be enough for you) and forces the algorithm to "skip" them and use a different starting value. Then, you don't need to check the database at all (well, at least until you reach about 4294967296 vouchers...)
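A rough sketch of that idea follows. It assumes SHA-256 as the "any hash function" and renders the first 32 bits as 8 hex characters; the names `HashedVoucher`, `FromId` and `FindCollidingIds` are hypothetical. The byte order of `BitConverter.GetBytes` depends on the machine, so the codes are only stable on platforms with the same endianness.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class HashedVoucher
{
    // Maps a voucher ID to 8 hex characters: the first 32 bits of its SHA-256 hash.
    public static string FromId(int id)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(BitConverter.GetBytes(id));
            return BitConverter.ToString(hash, 0, 4).Replace("-", "");
        }
    }

    // Returns the IDs in [0, maxId) whose code repeats the code of a smaller ID,
    // i.e. the IDs that the generator would have to skip.
    public static List<int> FindCollidingIds(int maxId)
    {
        var seen = new HashSet<string>();
        var colliding = new List<int>();
        for (int id = 0; id < maxId; id++)
        {
            if (!seen.Add(FromId(id)))
                colliding.Add(id);
        }
        return colliding;
    }
}
```

Running `FindCollidingIds` once over the range of IDs you expect to issue gives the skip list; the codes are still derived from a predictable counter, so anyone who learns the scheme can reproduce them.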

How about utilizing NHibernate's HiLo algorithm?
Here is an example on how you can get the next value (without DB access).

Related

How to correctly store C#'s Bitarray in Postgres table

Currently I am trying to store bit patterns of variable length in a Postgres table. My use-case is that I want to encode some information about data records being assigned to groups in a compact way.
A simplified schema of my table in Postgres looks like this:
CREATE TABLE axivas.group_records (
id int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
record_id int4 NOT NULL,
group_ids varbit(50) NOT NULL,
CONSTRAINT group_records_pkey PRIMARY KEY (id));
In a C# application, I create entities using Npgsql Entity Framework Core like this:
try
{
var context = new xerxesdevtestsContext();
Random rnd = new Random();
for (int i = 0; i < 1024; i++)
{
BitArray ba = new BitArray(rnd.Next(10, 50));
ba.SetAll(false);
for (int j=rnd.Next(0,5);j<rnd.Next(5,ba.Length-1);j++)
{
ba[j] = true;
}
context.GroupRecords.Add(new GroupRecords()
{
GroupIds = ba,
RecordId = i
});
}
context.SaveChanges();
}
catch (Exception ex)
{
Console.WriteLine("Error: " + ex.Message);
}
Randomness was added on purpose, both for the number of set bits and for the length of the bit array.
When I look at the stored data, I can see that in some cases bit strings are stored with leading zeros, while in other cases leading zeros are omitted (I attached a screenshot to show this). In other words, in some cases DBeaver shows values like '0000110110111110000.....', in other cases values like '1100111...'.
So my question is how this can be explained, and whether there's a way to completely omit leading zeros in the bit strings.
Any idea or extra information would be appreciated.
Update:
I changed the size of the bit string in my table and slightly changed the algorithm which sets individual bits in the bit array, then tested again. My changes have the following effects:
the bit array is filled starting from the highest index.
in the results even more leading zeros can be observed; e.g. this is one of the resulting records:
|3104 |702 |0000000000000000000000000000000000000000000000000000000000000000000000111111111111111111110 |
I think this situation clearly demonstrates, why I want leading zeros to be omitted.
Best regards,
Michael

It seems pretty obvious that the reason your bit arrays have leading zeros is because in your for loop, j is being initialized to some value that is rarely zero. j would always have to be zero in order to start placing 1's at the beginning of the array. Otherwise, you are in most cases going to end up with leading zeros.
So, if you want random binary numbers of a variable length, why not do something like this:
BitArray ba = new BitArray(rnd.Next(10, 50));
ba.SetAll(false);
ba[0] = true;
for (int j = 1; j < ba.Length - 1; j++)
{
ba[j] = Convert.ToBoolean(rnd.Next(2));
}
Example values created:
100110101001000011111100110100010101100110110
111101100110001101001100111101001100011110
1010101111000100
1011001010001000010100
101001101001010100101110000001000111001010

How to assert uniqueness of a huge collection of strings?

Let's say I have an algorithm which takes an unsigned 64-bit integer as input, and yields a string as a result. The string's alphabet is limited to [a-z, A-Z, 0-9] and its maximum length is 16. So that's 62^16, or 47,672,401,706,823,533,450,263,330,816 possible results.
I would like to assert the uniqueness of the algorithm's output. Read: I want to verify there are no collisions.
Is there an easy/quick 'n dirty way to do this, without having to fall back to (e.g.) some kind of database?
[EDIT]
Some clarification: the concerns uttered in the comments are legit, but no worries, I wasn't really planning on iterating over all possible combinations, my lifespan will probably be sub-1 century ;) Nor did I write my own algorithm to generate unique ID's. I just saw this and started wondering how one would go about asserting uniqueness for algorithms with very large result sets that can't be handled in-memory
[/EDIT]

As said in the comments, it would take a very long time to compute every possible entry, but just for fun, here is a try:
var workspace = new DirectoryInfo("MyWorkspace");
if (workspace.Exists)
{
workspace.Delete(true); // recursive: the directory may still hold files from an earlier run
}
workspace.Create();
var limit = 23997907;
var buffer = new HashSet<string>();
ulong i = 0;
int j = 0;
var stopWatch = Stopwatch.StartNew();
while (i <= ulong.MaxValue)
{
var result = YourSuperAlgorythm(i);
// Check the result with current results
if (buffer.Contains(result))
{
throw new Exception("Failure !");
}
// Check the result with older results
foreach (var file in workspace.GetFiles())
{
var content = new HashSet<string>(File.ReadAllText(file.FullName).Split(';'));
if (content.Contains(result))
{
throw new Exception("Failure !");
}
}
buffer.Add(result); // HashSet<T> has no indexer
i++;
j++;
if (j == limit)
{
stopWatch.Stop();
Console.WriteLine("Resetting. This loop takes " + stopWatch.Elapsed.TotalMilliseconds + "ms");
j = 0;
var file = Path.GetRandomFileName();
File.WriteAllText(Path.Combine(workspace.FullName, file), String.Join(";", buffer));
buffer = new HashSet<string>();
stopWatch.Restart();
}
}
You could probably optimize it, but you won't have enough of a lifetime to check the results. For now, it has not even created a file to store the first set of entries :D. I will edit this post when one loop is done!
Your only option is to prove your algorithm mathematically. Good luck with that...
EDIT1: for my test, I use this function:
private static string YourSuperAlgorythm(ulong i)
{
return i.ToString("x");
}
EDIT2: One loop takes 1477221.4261ms (~25min). And then the String.Join(";", buffer) line failed (OutOfMemory). So 23997907 is not the max value for my try. It must be decreased!

regex performance degrades

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem because my regexes start out fast and slow down.
For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it has slowed down to the point where it's taking about 600ms. Does anyone know why?
It was worse, but I improved it by using instances of RegEx instead of the cached version and compiling the expressions that I could.
Some of my regexes need to vary e.g. depending on the user's name it might be
mike said (\w*) or john said (\w*)
My understanding is that it is not possible to compile those regexes and pass in parameters (e.g saidRegex.Match(inputString, userName)).
Does anyone have any suggestions?
[Edited to accurately reflect speed - was per 1000 strings, not per string]

This may not be a direct answer to your question about RegEx performance degradation - which is somewhat fascinating. However - after reading all of the commentary and discussion above - I'd suggest the following:
Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:
Player_Name | Monetary_Value
If you were to create a database table containing these values per-row, and then catch each new row as it is being created - parse it - and append to the data table - you could easily do any kind of analysis / calculation against the data - without having to parse 25M rows again and again (which is a waste).
Additionally - on the first run, if you were to break the 25M records down into 100,000 record blocks, then run the algorithm 250 times (100,000 x 250 = 25,000,000) - you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.
In other words - consider the following:
Create a database table as follows:
CREATE TABLE PlayerActions (
RowID INT PRIMARY KEY IDENTITY,
Player_Name VARCHAR(50) NOT NULL,
Monetary_Value MONEY NOT NULL
)
Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.
public void ParseFullDataSet(IEnumerable<String> dataSource) {
const int chunkSize = 100000;
var rowCount = dataSource.Count();
// Integer division rounds down; add one more set for a partial final chunk.
var setCount = rowCount / chunkSize;
if (rowCount % chunkSize != 0)
setCount++;
for (int i = 0; i < setCount; i++) {
var set = dataSource.Skip(i * chunkSize).Take(chunkSize);
ParseSet(set);
}
}
public void ParseSet(IEnumerable<String> dataSource) {
// Assume here that the method reflects your RegEx generator.
String regex = RegexFactory.Generate();
foreach (String data in dataSource) {
Match match = Regex.Match(data, regex);
if (match.Success) {
String playerName = match.Groups[1].Value;
// Might want to add error handling here.
decimal monetaryValue = Convert.ToDecimal(match.Groups[2].Value);
db.PlayerActions.Add(new PlayerAction() {
// RowID is set at the DB layer by the IDENTITY column
Player_Name = playerName,
Monetary_Value = monetaryValue
});
// If not using Entity Framework, use another method to insert
// a row to your database table.
}
}
// Save once per set rather than once per row.
db.SaveChanges();
}
Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
ParseSet(new List<String>() { newValue });
or if multiples are created at once, call:
ParseSet(newValues); // Where newValues is an IEnumerable<String>
Now you can do whatever computational analysis or data mining you want from the data, without having to worry about performance over 25m rows on-the-fly.

Regex does take time to compute. However, you can make it more compact using some tricks.
You can also use string functions in C# to avoid regex altogether.
The code would be longer but might perform better.
String has several methods to cut and extract characters and do pattern matching as you need,
e.g. IndexOfAny, LastIndexOf, Contains...
string str = "mon";
string[] str2 = new string[] { "mon", "tue", "wed" };
// Array.IndexOf returns -1 when the value is not in the array.
if (Array.IndexOf(str2, str) >= 0)
{
// success code
}
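On the question's point that patterns such as mike said (\w*) cannot be compiled once and parameterized, one possible workaround is to capture the name as a group and compare it in code, so a single compiled Regex serves every user. This is only a sketch under the assumption that user names are single \w+ tokens; `SaidMatcher` and `WordSaidBy` are illustrative names, and whether this helps with the slowdown described in the question is untested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SaidMatcher
{
    // One compiled pattern for every user: the name is captured, not baked in.
    private static readonly Regex Said =
        new Regex(@"\b(?<user>\w+) said (?<word>\w*)", RegexOptions.Compiled);

    // Returns the word following "<userName> said ", or null if there is no such match.
    public static string WordSaidBy(string input, string userName)
    {
        foreach (Match m in Said.Matches(input))
        {
            if (string.Equals(m.Groups["user"].Value, userName, StringComparison.Ordinal))
                return m.Groups["word"].Value;
        }
        return null;
    }
}
```

For example, `SaidMatcher.WordSaidBy("mike said hello", "mike")` returns "hello", while asking for "john" on the same input returns null.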

how to generate a voucher code in c#?

I need to generate a voucher code (5 to 10 digits) for one-time use only. What is the best way to generate it and check whether it has been used?
Edited: I would prefer alphanumeric characters, like Amazon gift voucher codes, and they must be unique.

When generating voucher codes - you should consider whether having a sequence which is predictable is really what you want.
For example, Voucher Codes: ABC101, ABC102, ABC103 etc are fairly predictable. A user could quite easily guess voucher codes.
To protect against this - you need some way of preventing random guesses from working.
Two approaches:
Embed a checksum in your voucher codes.
The last number on a credit card is a checksum (check digit): adding up the other numbers in a certain way lets you verify that someone has entered the number correctly. See: http://www.beachnet.com/~hstiles/cardtype.html (first link out of Google) for how this is done for credit cards.
Have a large key-space, that is only sparsely populated.
For example, if you want to generate 1,000 vouchers - then a key-space of 1,000,000 means you should be able to use random-generation (with duplicate and sequential checking) to ensure it's difficult to guess another voucher code.
Here's a sample app using the large key-space approach:
static Random random = new Random();
static void Main(string[] args)
{
int vouchersToGenerate = 10;
int lengthOfVoucher = 10;
List<string> generatedVouchers = new List<string>();
// Letters plus each digit exactly once.
char[] keys = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".ToCharArray();
Console.WriteLine("Vouchers: ");
while(generatedVouchers.Count < vouchersToGenerate)
{
var voucher = GenerateVoucher(keys, lengthOfVoucher);
if (!generatedVouchers.Contains(voucher))
{
generatedVouchers.Add(voucher);
Console.WriteLine("\t[#{0}] {1}", generatedVouchers.Count, voucher);
}
}
Console.WriteLine("done");
Console.ReadLine();
}
private static string GenerateVoucher(char[] keys, int lengthOfVoucher)
{
return Enumerable
.Range(1, lengthOfVoucher) // for(i.. )
.Select(k => keys[random.Next(keys.Length)]) // upper bound is exclusive, so every key can be picked
.Aggregate("", (e, c) => e + c); // join into a string
}
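The checksum approach mentioned above is not covered by the sample app. A minimal sketch of it, assuming purely numeric voucher codes and the Luhn algorithm used for credit card check digits, might look like this. The names `Luhn`, `CheckDigit` and `IsValid` are illustrative.

```csharp
using System;

public static class Luhn
{
    // Computes the Luhn check digit for a string of decimal digits (the payload).
    public static int CheckDigit(string digits)
    {
        int sum = 0;
        bool doubleIt = true; // the rightmost payload digit is doubled first
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            char c = digits[i];
            if (c < '0' || c > '9')
                throw new ArgumentException("Only decimal digits are supported.", nameof(digits));
            int d = c - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return (10 - sum % 10) % 10;
    }

    // Validates a code whose last character is the check digit.
    public static bool IsValid(string codeWithCheckDigit)
    {
        if (string.IsNullOrEmpty(codeWithCheckDigit) || codeWithCheckDigit.Length < 2)
            return false;
        string payload = codeWithCheckDigit.Substring(0, codeWithCheckDigit.Length - 1);
        char last = codeWithCheckDigit[codeWithCheckDigit.Length - 1];
        if (last < '0' || last > '9') return false;
        return CheckDigit(payload) == last - '0';
    }
}
```

A check digit mainly catches typing mistakes; because the algorithm is public, it filters out only about nine in ten blind guesses and should be combined with a sparse key space rather than replace it.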

Building on the answers from Will Hughes & Shekhar_Pro (and just because I found this question interesting) here's another implementation but I've been a bit liberal with your requirement for the length of the voucher code.
Using a base 32 encoder I found you can use the Tick value to generate alpha-numeric strings. The encoding of a tick count to base 32 produces a 13 character string which can be formatted to make it more readable.
public void GenerateCodes()
{
Random random = new Random();
DateTime timeValue = DateTime.MinValue;
// Create 10 codes just to see the random generation.
for(int i=0; i<10; ++i)
{
int rand = random.Next(3600)+1; // add one to avoid 0 result.
timeValue = timeValue.AddMinutes(rand);
byte[] b = System.BitConverter.GetBytes(timeValue.Ticks);
string voucherCode = Transcoder.Base32Encode(b);
Console.WriteLine(string.Format("{0}-{1}-{2}",
voucherCode.Substring(0,4),
voucherCode.Substring(4,4),
voucherCode.Substring(8,5)));
}
}
Here's the output
AARI-3RCP-AAAAA
ACOM-AAZF-AIAAA
ABIH-LV7W-AIAAA
ADPL-26FL-AMAAA
ABBL-W6LV-AQAAA
ADTP-HFIR-AYAAA
ACDG-JH5K-A4AAA
ADDE-GTST-BEAAA
AAWL-3ZNN-BIAAA
AAGK-4G3Y-BQAAA
If you use a known seed for the Random object and remember how many codes you have already created you can continue to generate codes; e.g. if you need more codes and want to be certain you won't generate duplicates.

Here's one way: Generate a bunch of unique numbers between 10000 and 9999999999 put it in a database. Every time you give one to a user, mark it as used (or delete it if you're trying to save space).
EDIT: Generate the unique alpha-numeric values in the beginning. You'll probably have to keep them around for validation (as others have pointed out).

If your app is limited to numerical digits only, then I think timestamps (DateTime.Now.Ticks) can be a good way to get a unique code every time. You can use random numbers, but that adds the overhead of checking whether each number has already been issued. If you can also use letters, then go with a GUID.
To check whether a code has been used, you need to maintain a database and query it for validity.

If you prefer alphanumerical, you could use Guid.NewGuid() method:
Guid g = Guid.NewGuid();
Random rn = new Random();
string gs = g.ToString();
int randomInt = rn.Next(5,10+1);
Console.WriteLine(gs.Substring(gs.Length - randomInt - 1, randomInt));
To check that a code has not been used, store previously generated codes somewhere and compare.

private void AutoPurchaseVouNo1()
{
try
{
int Num = 0;
con.Close();
con.Open();
string incre = "SELECT MAX(VoucherNoint+1) FROM tbl_PurchaseAllCompany";
SqlCommand command = new SqlCommand(incre, con);
if (Convert.IsDBNull(command.ExecuteScalar()))
{
Num = 100;
txtVoucherNoInt1.Text = Convert.ToString(Num);
txtVoucherNo1.Text = Convert.ToString("ABC" + Num);
}
else
{
Num = (int)(command.ExecuteScalar());
txtVoucherNoInt1.Text = Convert.ToString(Num);
txtVoucherNo1.Text = Convert.ToString("ABC" + Num);
}
con.Close();
}
catch (Exception ex)
{
MessageBox.Show("Error: " + ex, "Error !!", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
}
Try this method for creating Voucher Number like ABC100, ABC101, ABC102, etc.

Try this code
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[15];
var random = new Random(); // the generator used below
for (int i = 0; i < stringChars.Length; i++)
{
stringChars[i] = chars[random.Next(chars.Length)];
}
var finalString = new String(stringChars);

Comparing 2 huge lists using C# multiple times (with a twist)

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.
I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1; these codes vary in length, and I need to check every one of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data. Sometimes the dialed number lists consist of up to 30,000 numbers, and the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers, which means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or any other language, to be honest. I need to do this quite often, and designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time on answering my question. I guess in my ignorance I overstated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.

UPDATE
You can do a simple trick: group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;
var groupedPefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var numberPrefixes = numbers
.Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
.First(n.StartsWith))
.ToList();
So how fast is this? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL    Runtime
-----------------
1      10,198 ms
2       1,179 ms
3         205 ms
4         130 ms
5         107 ms
Just to give a rough idea: I did just one run and had a lot of other stuff going on.
Original answer
I wouldn't care much about performance: an average desktop PC can quite easily deal with database tables with 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list of 15,000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50,000 numbers, each consisting of a prefix and an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my 2.0 GHz Core 2 Duo laptop. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but this would add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
namespace Test
{
static class Program
{
static void Main()
{
// Set number of prefixes and calls to not more than 50 to get results
// printed to the console.
Console.Write("Generating prefixes");
List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
Console.WriteLine();
Console.Write("Generating calls");
List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
Console.WriteLine();
Console.WriteLine("Processing started.");
Stopwatch stopwatch = new Stopwatch();
const Int32 minimumPrefixLength = 5;
stopwatch.Start();
var groupedPefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var result = calls
.GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
.First(c.Number.StartsWith))
.Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
.ToList();
stopwatch.Stop();
Console.WriteLine("Processing finished.");
Console.WriteLine(stopwatch.Elapsed);
if ((prefixes.Count <= 50) && (calls.Count <= 50))
{
Console.WriteLine("Prefixes");
foreach (String prefix in prefixes.OrderBy(p => p))
{
Console.WriteLine(String.Format(" prefix={0}", prefix));
}
Console.WriteLine("Calls");
foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
{
Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration));
}
Console.WriteLine("Result");
foreach (Call call in result.OrderBy(c => c.Number))
{
Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration));
}
}
Console.ReadLine();
}
private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<String> prefixes = new List<String>(count);
StringBuilder stringBuilder = new StringBuilder(maximumLength);
while (prefixes.Count < count)
{
stringBuilder.Length = 0;
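// Pick the length once; calling random.Next in the loop condition would re-roll it on every iteration.
Int32 length = random.Next(minimumLength, maximumLength + 1);
for (int i = 0; i < length; i++)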
{
stringBuilder.Append(random.Next(10));
}
String prefix = stringBuilder.ToString();
if (prefixes.Count % 1000 == 0)
{
Console.Write(".");
}
if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
{
prefixes.Add(stringBuilder.ToString());
}
}
return prefixes;
}
private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<Call> calls = new List<Call>(count);
StringBuilder stringBuilder = new StringBuilder();
while (calls.Count < count)
{
stringBuilder.Length = 0;
stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
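// Pick the suffix length once; calling random.Next in the loop condition would re-roll it on every iteration.
Int32 suffixLength = random.Next(minimumLength, maximumLength + 1);
for (int i = 0; i < suffixLength; i++)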
{
stringBuilder.Append(random.Next(10));
}
if (calls.Count % 1000 == 0)
{
Console.Write(".");
}
calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
}
return calls;
}
private class Call
{
public Call (String number, Decimal duration)
{
this.Number = number;
this.Duration = duration;
}
public String Number { get; private set; }
public Decimal Duration { get; private set; }
}
}
}

It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
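A minimal sketch of that idea follows. The names PrefixTrie, Add, FindCarrier, and CarrierTotals are made up for illustration, and the sketch assumes prefixes contain only the digits 0-9 and that no prefix is itself the start of another prefix (as in the test application above), so the first terminating node reached is the only possible match.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; this is an illustration, not a library API.
public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Node[] Children = new Node[10]; // one slot per digit 0-9
        public string Carrier;                          // non-null marks the end of a prefix
    }

    private readonly Node root = new Node();

    // Registers a digit-only prefix for a carrier.
    public void Add(string prefix, string carrier)
    {
        Node node = root;
        foreach (char c in prefix)
        {
            int digit = c - '0';
            if (digit < 0 || digit > 9)
                throw new ArgumentException("Prefix must contain only digits.", "prefix");
            if (node.Children[digit] == null)
                node.Children[digit] = new Node();
            node = node.Children[digit];
        }
        node.Carrier = carrier;
    }

    // Walks down the trie digit by digit and returns the carrier of the first
    // terminating node reached, or null if no prefix matches.
    public string FindCarrier(string number)
    {
        Node node = root;
        foreach (char c in number)
        {
            int digit = c - '0';
            if (digit < 0 || digit > 9) return null;
            node = node.Children[digit];
            if (node == null) return null;
            if (node.Carrier != null) return node.Carrier;
        }
        return null;
    }
}

public static class CarrierTotals
{
    // Sums minutes per carrier; calls with no matching prefix are skipped.
    // Each call is a (number, minutes) pair.
    public static Dictionary<string, long> Accumulate(
        PrefixTrie trie, IEnumerable<KeyValuePair<string, int>> calls)
    {
        var totals = new Dictionary<string, long>();
        foreach (var call in calls)
        {
            string carrier = trie.FindCarrier(call.Key);
            if (carrier == null) continue;
            long sum;
            totals.TryGetValue(carrier, out sum); // sum stays 0 for a new carrier
            totals[carrier] = sum + call.Value;
        }
        return totals;
    }
}
```

Each lookup touches at most as many nodes as the number has digits, regardless of how many prefixes are stored.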

The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
bool found = false;
for (int length = 1; length <= 7; length++)
{
int prefix = ExtractDigits(callNumber, length);
if (carrier.Prefixes.Contains(prefix))
{
carrier.Calls.Add(callNumber);
found = true;
break;
}
}
if (found)
break;
}
If you have 10 carriers, there will be up to 70 set lookups per call (7 prefix lengths for each carrier). But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speedup over a brute-force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
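The grouping by length could look roughly like the sketch below. The Carrier class shown here and the CarrierLookup helper are hypothetical, and unlike the snippet above the sketch stores prefixes as strings rather than ints, which is my own assumption so that leading zeros are preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a carrier whose prefixes are bucketed by length.
public sealed class Carrier
{
    public string Name { get; private set; }

    // Key: prefix length; value: all prefixes of exactly that length.
    private readonly Dictionary<int, HashSet<string>> prefixesByLength =
        new Dictionary<int, HashSet<string>>();

    public Carrier(string name, IEnumerable<string> prefixes)
    {
        Name = name;
        foreach (string prefix in prefixes)
        {
            HashSet<string> bucket;
            if (!prefixesByLength.TryGetValue(prefix.Length, out bucket))
            {
                bucket = new HashSet<string>(StringComparer.Ordinal);
                prefixesByLength.Add(prefix.Length, bucket);
            }
            bucket.Add(prefix);
        }
    }

    // Only the lengths this carrier actually uses are extracted and looked up.
    public bool Matches(string number)
    {
        foreach (var entry in prefixesByLength)
        {
            if (entry.Key <= number.Length &&
                entry.Value.Contains(number.Substring(0, entry.Key)))
            {
                return true;
            }
        }
        return false;
    }
}

public static class CarrierLookup
{
    // Returns the first carrier with a matching prefix, or null if none match.
    public static Carrier Find(IEnumerable<Carrier> carriers, string number)
    {
        return carriers.FirstOrDefault(c => c.Matches(number));
    }
}
```

A carrier with prefixes of only two distinct lengths then costs at most two set lookups per call instead of seven.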

How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)

Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows into the database and, on insert, determine the carrier and store it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.

I would probably just put the entries in a List, sort it, and then use a binary search to look for matches. Tailor the binary search's match criteria to return the first item that matches, then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
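One way to read this, sketched below under my own assumptions: the dialed numbers are the sorted list, and for each prefix a hand-written lower-bound search finds the first number that is not less than the prefix. Because the list is sorted ordinally, all numbers starting with that prefix sit next to each other from that position on. SortedPrefixSearch, LowerBound, and NumbersWithPrefix are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SortedPrefixSearch
{
    // Returns the index of the first element >= value in an ordinally sorted list
    // (or sorted.Count if every element is smaller).
    private static int LowerBound(List<string> sorted, string value)
    {
        int low = 0, high = sorted.Count;
        while (low < high)
        {
            int mid = low + (high - low) / 2;
            if (string.CompareOrdinal(sorted[mid], value) < 0)
                low = mid + 1;
            else
                high = mid;
        }
        return low;
    }

    // Yields every number in the sorted list that starts with the prefix.
    // The list must have been sorted with StringComparer.Ordinal beforehand.
    public static IEnumerable<string> NumbersWithPrefix(
        List<string> sortedNumbers, string prefix)
    {
        for (int i = LowerBound(sortedNumbers, prefix);
             i < sortedNumbers.Count &&
             sortedNumbers[i].StartsWith(prefix, StringComparison.Ordinal);
             i++)
        {
            yield return sortedNumbers[i];
        }
    }
}
```

Sort once with numbers.Sort(StringComparer.Ordinal), then call NumbersWithPrefix for each prefix. A custom lower bound is used instead of List<T>.BinarySearch because the latter may land on any one of several equal elements rather than the first.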

You may want to use a Hashtable in C# (or the generic Dictionary<TKey, TValue>).
This way you have key-value pairs: your keys could be the phone numbers and your values the total minutes. If a match is found in the key set, update the total minutes; otherwise, add a new key.
You would then just need to modify your search so that it looks only at the first 7 digits of the number rather than the entire key.
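A rough sketch of this, using the generic Dictionary<TKey, TValue> rather than the non-generic Hashtable. MinutesByKey and Accumulate are hypothetical names, the 7 comes from the answer and is passed in as a parameter, and treating shorter numbers as their own key is my assumption.

```csharp
using System;
using System.Collections.Generic;

public static class MinutesByKey
{
    // Totals minutes per key, where the key is the first keyLength digits
    // of each dialed number. Each call is a (number, minutes) pair.
    public static Dictionary<string, int> Accumulate(
        IEnumerable<KeyValuePair<string, int>> calls, int keyLength)
    {
        var totals = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var call in calls)
        {
            // Assumption: numbers shorter than keyLength are used whole as the key.
            string key = call.Key.Length > keyLength
                ? call.Key.Substring(0, keyLength)
                : call.Key;
            int sum;
            totals.TryGetValue(key, out sum); // sum stays 0 when the key is new
            totals[key] = sum + call.Value;
        }
        return totals;
    }
}
```

Note that this groups by a fixed-length key, so it only fits if every prefix of interest has the same length.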
