regex performance degrades - c#

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem because my regexes start out fast and slow down.
For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it's slowed down to the point where it's taking about 600ms. Does anyone know why?
It was worse, but I improved it by using instances of RegEx instead of the cached version and compiling the expressions that I could.
Some of my regexes need to vary; e.g., depending on the user's name it might be
mike said (\w*) or john said (\w*)
My understanding is that it is not possible to compile those regexes and pass in parameters (e.g. saidRegex.Match(inputString, userName)).
Does anyone have any suggestions?
[Edited to accurately reflect speed - was per 1000 strings, not per string]

This may not be a direct answer to your question about RegEx performance degradation - which is somewhat fascinating. However - after reading all of the commentary and discussion above - I'd suggest the following:
Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:
Player_Name | Monetary_Value
If you were to create a database table containing these values per-row, and then catch each new row as it is being created - parse it - and append to the data table - you could easily do any kind of analysis / calculation against the data - without having to parse 25M rows again and again (which is a waste).
Additionally - on the first run, if you were to break the 25M records down into 100,000 record blocks, then run the algorithm 250 times (100,000 x 250 = 25,000,000) - you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.
In other words - consider the following:
Create a database table as follows:
CREATE TABLE PlayerActions (
RowID INT PRIMARY KEY IDENTITY,
Player_Name VARCHAR(50) NOT NULL,
Monetary_Value MONEY NOT NULL
)
Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.
public void ParseFullDataSet(IEnumerable<String> dataSource) {
    var rowCount = dataSource.Count();
    // Number of 100k chunks, rounded up to include a final partial chunk.
    var setCount = rowCount / 100000;
    if (rowCount % 100000 != 0)
        setCount++;
    for (int i = 0; i < setCount; i++) {
        var set = dataSource.Skip(i * 100000).Take(100000);
        ParseSet(set);
    }
}
public void ParseSet(IEnumerable<String> dataSource) {
    String playerName = String.Empty;
    decimal monetaryValue = 0.0m;
    // Assume here that the method reflects your RegEx generator.
    String regex = RegexFactory.Generate();
    foreach (String data in dataSource) {
        Match match = Regex.Match(data, regex);
        if (match.Success) {
            playerName = match.Groups[1].Value;
            // Might want to add error handling here.
            monetaryValue = Convert.ToDecimal(match.Groups[2].Value);
            db.PlayerActions.Add(new PlayerAction() {
                // ID = ..., // Set at DB layer using Auto_Increment
                Player_Name = playerName,
                Monetary_Value = monetaryValue
            });
            db.SaveChanges();
            // (Calling SaveChanges once per set, after the loop, would likely be faster.)
            // If not using Entity Framework, use another method to insert
            // a row to your database table.
        }
    }
}
Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
ParseSet(new List<String>() { newValue });
or if multiples are created at once, call:
ParseSet(newValues); // Where newValues is an IEnumerable<String>
Now you can do whatever computational analysis or data mining you want from the data, without having to worry about performance over 25m rows on-the-fly.

Regex does take time to compute. However, you can make it more compact using some tricks.
You can also use string functions in C# to avoid the regex functions.
The code would be lengthier but might improve performance.
String has several functions to cut and extract characters and do the pattern matching you need,
e.g. IndexOfAny, LastIndexOf, Contains....
string str = "mon";
string[] str2 = new string[] { "mon", "tue", "wed" };
if (Array.IndexOf(str2, str) >= 0)
{
    // success code
}
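For the "mike said (\w*)" case from the original question, a rough sketch of the same extraction with plain string functions (the variable names here are just placeholders):

string input = "mike said hello there";
string userName = "mike";
string marker = userName + " said ";
int start = input.IndexOf(marker, StringComparison.Ordinal);
if (start >= 0)
{
    start += marker.Length;
    int end = input.IndexOf(' ', start);
    // Take everything up to the next space, or to the end of the string.
    string word = end >= 0 ? input.Substring(start, end - start) : input.Substring(start);
    // word == "hello"
}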

Related

Huge files(~1Gb) into array to manipulate

This weekend I wanted to enhance my Apple right-click dictionaries. They have to be in a certain predetermined format (which has nothing to do with this part of the question). For example, when a word is clicked, the Oxford dictionary shows the -ed, -ing and -s forms under the bare verb form, but Collins does not. To integrate this, I need to add the -ed, -ing and -s forms of verbs.
The example word brace in the Oxford dictionary is defined like this:
<d:entry id="_4ma" d:title="brace">
<d:index d:value="brace" d:title="brace"/><d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/><a name="49f1b05b3013431b8dbb9ff989c2755a_brace" nattr="head-jump-entry-flag"></a>
As you can see, there are multiple <d:index d:value="..."/> entries.
In Collins:
<d:entry id="_379" d:title="brace">
<d:index d:value="brace" d:title="brace"/><div id="collins_english_dictionary">
As you can see, the following entries are missing, unlike in Oxford:
<d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/>
What the following code snippet does is find the missing entries for every word that is in the destination dictionary (Collins) and ultimately produce the following format. (It does not matter if entries are duplicated.)
<d:entry id="_379" d:title="brace">
<d:index d:value="braced" d:title="brace"/><d:index d:value="braces" d:title="brace"/><d:index d:value="bracing" d:title="brace"/>
<d:index d:value="brace" d:title="brace"/><div id="collins_english_dictionary">
The problem is that the files are too big: one of them is almost 500 MB while the other is about 1 GB (almost 4 million lines in Oxford, 120 thousand lines in Collins because its tags are collapsed). The IDE or the compiler leaves me in the lurch before completing successfully. How can I make this work?
using System.Text.RegularExpressions;
var source = "/Users/soner/Desktop/Oxford Advanced Learner Dictionary/Oxford Advanced Learner's Dictionary.xml";
var lines = File.ReadLines(source);
var destination = "/Users/soner/Desktop/Collins COBUILD Advanced English Dictionary/Collins COBUILD Advanced English Dictionary.xml";
var destinationLines = File.ReadLines(destination).ToList();
var sourceLines = lines.ToList();
int destIndex = 0;
for (int i = 0; i < sourceLines.Count - 1; i++)
{
if (sourceLines[i].Contains("<d:entry id="))
{
++i; // we are almost sure the searched part is on the very next line; if not (q == -1), continue
var aTagLine = sourceLines[i];
var q = aTagLine.IndexOf("<a name", StringComparison.Ordinal);
if (q == -1) continue;
var wantedPart = aTagLine[..q];
// if there is no form of the word being searched -ing -s or -ed tags
if(wantedPart.Split("d:index").Length <= 2) continue;
// the word to be searched for
var word = Regex.Matches(wantedPart, "(?<=d:title=\").*?(?=\")")[0].ToString();
int memoriedIndex = destIndex;
// from now on we are in destination dictionary to add missing entities -ed,-ing etc. forms
for (; destIndex < destinationLines.Count; destIndex++)
{
if (destinationLines[destIndex].Contains("<d:entry id=") && destinationLines[destIndex].Contains($"{word}\">"))
{
destinationLines[destIndex] = $"{destinationLines[destIndex]}\n{wantedPart}";
break;
}
// I need this one because
// what if destination dictionary doesn't have the word searched in source dictionary
// it needs go some steps backward to go on
if (destIndex - memoriedIndex > 100)
{
destIndex = memoriedIndex;
break;
}
}
}
}
System.IO.File.WriteAllLines("/Users/soner/Desktop/soner.xml", destinationLines);
First, you should definitely work with file streams/string streams and ensure that async file IO is enabled. Do not read the whole file at once. Additionally, you can do all text processing and data-structure building on separate threads.
You should definitely transform the raw XML data into a more usable data structure (in memory, or possibly something else) optimized for your use case before working with it, as was already mentioned. I'm not going to go into detail about which one; I didn't look at your use case too closely.
You would have to experiment a little to see which is faster, processing with XmlReader or with regex. Usually I would recommend XmlReader, except in very simple and performance-critical situations where something else might be more efficient. Make sure to instantiate the regex only once (don't use the static methods) and set the Compiled option.
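For example, a minimal sketch of the "one compiled Regex instance over streamed lines" idea (the pattern is just the d:title lookaround from your snippet; the path is a placeholder):

using System.IO;
using System.Text.RegularExpressions;

// One compiled instance, reused for every line. File.ReadLines streams the file
// lazily instead of materializing all of it in memory with ToList().
var titleRegex = new Regex("(?<=d:title=\").*?(?=\")", RegexOptions.Compiled);
foreach (var line in File.ReadLines("source.xml"))
{
    var match = titleRegex.Match(line);
    if (match.Success)
    {
        // process match.Value here
    }
}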
These are just some thoughts to get you started, not a complete answer.

Improve performance of TryGetValue

I am creating an Excel file using Open XML SDK. In this process, I have a scenario like below.
I need to add data into a Dictionary<uint, string> if the key does not already exist. For that I am using the code below.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = 0; i < dataLines.Count; i++)
{
var x = dataLines[i];
if (!dataDictionary.TryGetValue(x.RowIndex.Value, out var res)) // 700 seconds, 1,279,999,998 Hit counts
{
dataDictionary.Add(x.RowIndex.Value, x.OuterXml);
}
}
When I try to create an Excel sheet with around 90,000 - 92,000 rows, the line with the IF condition in the above code consumes 700 seconds. (Checked with a performance profiler; that line also has 1,279,999,998 hit counts.)
How could I reduce the time that line consumes?
Is there any better way to achieve this in less time?
If the if statement is slow, one option you have is to eliminate it entirely and use the indexer of the dictionary to set the value. This means that the "last match will win". If you want the "first match to win", all you have to do is reverse the order you are iterating the list.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = dataLines.Count - 1; i >= 0; i--)
{
var x = dataLines[i];
dataDictionary[x.RowIndex.Value] = x.OuterXml;
}
If x.RowIndex.Value is unique, it doesn't matter which direction you iterate.
If it is important that the key is sorted in ascending order, you can use a SortedDictionary<TKey, TValue>.
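For illustration, the same indexer loop with a SortedDictionary could look like this (a sketch reusing the Row elements from above; inserts become O(log n), but enumeration returns the keys in ascending order):

var sortedData = new SortedDictionary<uint, string>();
for (int i = dataLines.Count - 1; i >= 0; i--)
{
    var x = dataLines[i];
    // Iterating in reverse means the first occurrence in document order wins.
    sortedData[x.RowIndex.Value] = x.OuterXml;
}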
But as others have pointed out, it seems odd that you have so many hit counts. There is probably recursion going on in your application that you need to track down.

How to improve performance of this algorithm?

I have a text file with 100000 pairs: word and frequency.
test.in file with words:
1 line - total count of all word-frequency pairs
2 line to ~100 001 - word-frequency pairs
100 002 line - total count of user input words
from 100 003 to the end - user input words
I parse this file and put the words in
Dictionary<string,double> dictionary;
And I want to execute some search + order logic in the following code:
for(int i=0;i<15000;i++)
{
tempInputWord = //take data from file(or other sources)
var adviceWords = dictionary
.Where(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
.OrderByDescending(ks => ks.Value)
.ThenBy(ks => ks.Key,StringComparer.Ordinal)
.Take(10)
.ToList();
//some output
}
The problem: This code must run in less than 10 seconds.
On my computer (Core i5 2400, 8 GB RAM) with Parallel.For() - about 91 sec.
Can you give me some advice on how to increase performance?
UPDATE :
Hooray! We did it!
Thank you @CodesInChaos, @usr, @T_D and everyone who was involved in solving the problem.
The final code:
var kvList = dictionary.OrderBy(ks => ks.Key, StringComparer.Ordinal).ToList();
var strComparer = new MyStringComparer();
var intComparer = new MyIntComparer();
var kvListSize = kvList.Count;
var allUserWords = new List<string>();
for (int i = 0; i < userWordQuantity; i++)
{
var searchWord = Console.ReadLine();
allUserWords.Add(searchWord);
}
var result = allUserWords
.AsParallel()
.AsOrdered()
.Select(searchWord =>
{
int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), strComparer);
if (startIndex < 0)
startIndex = ~startIndex;
var matches = new List<KeyValuePair<string, int>>();
bool isNotEnd = true;
for (int j = startIndex; j < kvListSize ; j++)
{
isNotEnd = kvList[j].Key.StartsWith(searchWord, StringComparison.Ordinal);
if (isNotEnd) matches.Add(kvList[j]);
else break;
}
matches.Sort(intComparer);
var res = matches.Select(s => s.Key).Take(10).ToList();
return res;
});
foreach (var adviceWords in result)
{
foreach (var adviceWord in adviceWords)
{
Console.WriteLine(adviceWord);
}
Console.WriteLine();
}
6 sec (9 sec without the manual loop, i.e. using LINQ)
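(MyStringComparer and MyIntComparer are not shown above; a minimal sketch of comparers compatible with this code - an assumption, not the actual implementations - could look like this:)

using System.Collections.Generic;

// Compares pairs by key only (ordinal), matching the sort order of kvList,
// so it can drive the BinarySearch above.
class MyStringComparer : IComparer<KeyValuePair<string, int>>
{
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        return string.CompareOrdinal(x.Key, y.Key);
    }
}

// Orders matches by descending frequency, then by key (ordinal),
// mirroring the OrderByDescending/ThenBy from the original query.
class MyIntComparer : IComparer<KeyValuePair<string, int>>
{
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        int byValue = y.Value.CompareTo(x.Value);
        return byValue != 0 ? byValue : string.CompareOrdinal(x.Key, y.Key);
    }
}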
You are not at all using any algorithmic strength of the dictionary. Ideally, you'd use a tree structure so that you can perform prefix lookups. On the other hand you are within 3.7x of your performance goal. I think you can reach that by just optimizing the constant factor in your algorithm.
Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice.
Don't use a dictionary at all. Just use a KeyValuePair<T1, T2>[] and run through it using a foreach loop. This is the fastest possible way to traverse a set of pairs.
Could look like this:
KeyValuePair<T1, T2>[] items;
List<KeyValuePair<T1, T2>> matches = new ...(); //Consider pre-sizing this.
//This could be a parallel loop as well.
//Make sure to not synchronize too much on matches.
//If there tend to be few matches a lock will be fine.
foreach (var item in items) {
if (IsMatch(item)) {
matches.Add(item);
}
}
matches.Sort(...); //Sort in-place
return matches.Take(10); //Maybe matches.RemoveRange(10, matches.Count - 10) is better
That should exceed a 3.7x speedup.
If you need more, try stuffing the items into a dictionary keyed on the first char of Key. That way you can look up all items matching tempInputWord[0]. That should reduce search times by the selectivity that is in the first char of tempInputWord. For English text that would be on the order of 26 or 52. This is a primitive form of prefix lookup that has one level of lookup. Not pretty but maybe it is enough.
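A rough sketch of that one-level bucket idea, reusing the items array and tempInputWord from above (the string/double types are an assumption based on the question):

// Build the buckets once, keyed on the first character of each word.
var buckets = new Dictionary<char, List<KeyValuePair<string, double>>>();
foreach (var pair in items)
{
    if (!buckets.TryGetValue(pair.Key[0], out var bucket))
        buckets[pair.Key[0]] = bucket = new List<KeyValuePair<string, double>>();
    bucket.Add(pair);
}

// Per search word, only scan the bucket that can possibly contain matches.
var matches = new List<KeyValuePair<string, double>>();
if (buckets.TryGetValue(tempInputWord[0], out var candidates))
{
    foreach (var pair in candidates)
    {
        if (pair.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
            matches.Add(pair);
    }
}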
I think the best way would be to use a Trie data structure instead of a dictionary. A Trie saves all the words in a tree structure, where a node can represent all the words that start with the same letters. So if you look up your search word tempInputWord in a Trie, you will get a node that represents all the words starting with tempInputWord, and you just have to traverse all the child nodes. So you only need one search operation. The linked Wikipedia article also mentions some other advantages over hash tables (which is basically what a Dictionary is):
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
And here are some ideas for creating a trie in C#.
This should at least speed up the lookup, however, building the Trie might be slower.
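If you would rather roll your own than pull in a library, a very small, unoptimized sketch of such a trie (an assumption on my part - not the Trie class used in the update below) might look like this:

using System.Collections.Generic;

// Minimal prefix tree: Add(word, value) stores a value under a word,
// Retrieve(prefix) returns every stored value whose word starts with the prefix.
class SimpleTrie<TValue>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<TValue> Values = new List<TValue>();
    }

    private readonly Node _root = new Node();

    public void Add(string word, TValue value)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
        }
        node.Values.Add(value);
    }

    public IEnumerable<TValue> Retrieve(string prefix)
    {
        // Walk down to the node representing the prefix.
        var node = _root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break;                     // no word starts with this prefix

        // Depth-first walk of the subtree below that node.
        var stack = new Stack<Node>();
        stack.Push(node);
        while (stack.Count > 0)
        {
            var current = stack.Pop();
            foreach (var value in current.Values)
                yield return value;
            foreach (var child in current.Children.Values)
                stack.Push(child);
        }
    }
}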
Update:
Ok, I tested it myself using a file with frequencies of English words that uses the same format as yours. This is my code, which uses the Trie class that you also tried to use.
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
sw.Start();
var trie = new Trie<KeyValuePair<string,int>>();
//build trie with your value pairs
var lines = File.ReadLines("en.txt");
foreach(var line in lines.Take(100000))
{
var split = line.Split(' ');
trie.Add(split[0], new KeyValuePair<string,int>(split[0], int.Parse(split[1])));
}
Console.WriteLine("Time needed to read file and build Trie with 100000 words: " + sw.Elapsed);
sw.Reset();
//test with 10000 search words
sw.Start();
foreach (string line in lines.Take(10000))
{
var searchWord = line.Split(' ')[0];
var allPairs = trie.Retrieve(searchWord);
var bestWords = allPairs.OrderByDescending(kv => kv.Value).ThenBy(kv => kv.Key).Select(kv => kv.Key).Take(10);
var output = bestWords.Aggregate("", (s1, s2) => s1 + ", " + s2);
Console.WriteLine(output);
}
Console.WriteLine("Time to process 10000 different searchWords: " + sw.Elapsed);
}
My results on a pretty similar machine:
Time needed to read file and build Trie with 100000 words: 00:00:00.7397839
Time to process 10000 different searchWords: 00:00:03.0181700
So I think you are doing something wrong that we cannot see. For example the way you measure the time or the way you read the file. As my results show this stuff should be really fast. The 3 seconds are mainly due to the Console output in the loop which I needed so that the bestWords variable is used. Otherwise the variable would have been optimized away.
Replace the dictionary by a List<KeyValuePair<string, decimal>>, sorted by the key.
For the search I use that a substring sorts directly before its prefixes with ordinal comparisons. So I can use a binary search to find the first candidate. Since the candidates are contiguous I can replace Where with TakeWhile.
int startIndex = dictionary.BinarySearch(new KeyValuePair<string, decimal>(searchWord, 0m), comparer);
if(startIndex < 0)
startIndex = ~startIndex;
var adviceWords = dictionary
.Skip(startIndex)
.TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
.OrderByDescending(ks => ks.Value)
.ThenBy(ks => ks.Key)
.Select(s => s.Key)
.Take(10).ToList();
Make sure to use ordinal comparison for all operations, including the initial sort, the binary search and the StartsWith check.
I would call Console.ReadLine outside the parallel loop. Probably using AsParallel().Select(...) on the collection of search words instead of Parallel.For.
If you want profiling, separate the reading of the file and see how long that takes.
Also data calculation, collection, presentation could be different steps.
If you want concurrency AND a dictionary, look at ConcurrentDictionary, maybe even more for reliability than for performance, but probably for both:
http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx
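For example, a tiny sketch with the key/value types from the question (just to show the API, not a drop-in replacement for the code above):

using System.Collections.Concurrent;

// Thread-safe dictionary: AddOrUpdate and GetOrAdd replace explicit locking.
var frequencies = new ConcurrentDictionary<string, double>();
frequencies.AddOrUpdate("word", 1.0, (key, current) => current + 1.0);
double value = frequencies.GetOrAdd("word", 0.0);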
Assuming the 10 is constant, why is everyone storing the entire data set? Memory is not free. The fastest solution is to store the first 10 entries into a list and sort it. Then, maintain the 10-element sorted list as you traverse the rest of the data set, removing the 11th element every time you insert one.
The above method works best for small values. If you had to take the first 5000 objects, consider using a binary heap instead of a list.
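A hedged sketch of that bounded top-10 list, assuming word/frequency pairs and a matches sequence like the ones in the other answers:

// Keep only the 10 best (highest-frequency) matches seen so far,
// sorted descending by value and then ascending by key.
var top = new List<KeyValuePair<string, double>>(11);
foreach (var pair in matches)
{
    int pos = top.FindIndex(t => pair.Value > t.Value
        || (pair.Value == t.Value && string.CompareOrdinal(pair.Key, t.Key) < 0));
    if (pos < 0)
    {
        if (top.Count < 10)
            top.Add(pair);          // worse than everything kept so far, but there is still room
    }
    else
    {
        top.Insert(pos, pair);
        if (top.Count > 10)
            top.RemoveAt(10);       // drop the 11th element
    }
}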

Best Way to Check for Used Key with Nhibernate?

On my site I allow people to buy subscriptions in bulk (I call them vouchers). Once they have these vouchers, they give them to whoever they like, and that person enters the code into their account to upgrade it.
Right now I am thinking of using a 4-character alphanumeric code (upper case, lower case and digits) and will have something like this:
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[4];
var random = new Random();
for (int i = 0; i < stringChars.Length; i++)
{
stringChars[i] = chars[random.Next(chars.Length)];
}
var finalString = new String(stringChars);
For now I think that will give me more than enough combinations, and if I ever do run out I can always increase the length of the code. I want to keep it short because I don't want the user to have to type in a huge number.
I also don't have the time to make a more elegant solution, maybe where they click a link or something in their email and it activates their account; of course this would also cut down on someone trying to randomly guess a voucher number.
These are things I would deal with if the site ever becomes more popular.
I am wondering, though, how I can handle the possible generation of duplicate vouchers. My first thought was to check the database each time a voucher is created and, if it already exists, make a new one.
However, that seems like it could be slow. So I also thought about getting all the keys first, storing them in memory and checking there, but if the list keeps growing I might run into out-of-memory exceptions and all that great stuff.
So does anyone have any ideas? Or am I stuck doing one of the two methods I listed above?
I am using nhibernate, asp.net mvc and C#.
Edit
static void Main(string[] args)
{
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
HashAlgorithm sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
hold.Add(hex.Substring(0,3));
Console.WriteLine(hex.Substring(0, 4));
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
Above is my attempt to use hashing. However, I think I am missing something, as it seems to have quite a few more duplicates than expected.
Edit 2
I think I added what I was missing, but I'm not sure if this is exactly what he meant. I am also not sure what to do in the situation when I have moved it as far as I can move it (my hash seems to give me a length of 40 places I can move it).
static void Main(string[] args)
{
int subStringLength = 4;
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
SHA1CryptoServiceProvider sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
int startingPositon = 0;
string possibleVoucherCode = hex.Substring(startingPositon,subStringLength);
string voucherCode = Move(subStringLength, hold, hex, startingPositon, possibleVoucherCode);
hold.Add(voucherCode);
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
private static string Move(int subStringLength, List<string> hold, string hex, int startingPositon, string possibleVoucherCode)
{
if (hold.Contains(possibleVoucherCode))
{
int newPosition = startingPositon + 1;
if (newPosition <= hex.Length)
{
if ((newPosition + subStringLength) <= hex.Length) // only recurse while the shifted window still fits inside the hash
{
possibleVoucherCode = hex.Substring(newPosition, subStringLength);
return Move(subStringLength, hold, hex, newPosition, possibleVoucherCode);
}
// return something
return "0";
}
else
{
// return something
return "0";
}
}
else
{
return possibleVoucherCode;
}
}
It is going to be slow because you want to generate the vouchers randomly and then check the database for every generated code.
I would create a table vouchers with an id, the code and an is_used column. I would fill that table once with enough random codes. Since this can be done in a separate process, the performance won't be such a big problem. Let it run in the evening and the next day you get a fully filled vouchers-table.
If you want to prevent generating duplicate vouchers, that won't be a problem. You can generate them anyway and put them either in a System.Collections.Generic.HashSet (which prevents adding duplicates without throwing an exception) or call the LINQ method Distinct() before adding them to that vouchers table.
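For example, a small sketch of that pre-generation step with a HashSet (the target count of 100,000 is arbitrary; the alphabet and length are the ones from the question):

using System;
using System.Collections.Generic;

// Pre-generate a pool of unique 4-character codes. HashSet<T>.Add simply
// returns false for duplicates, so no exception handling is needed.
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var random = new Random();
var codes = new HashSet<string>();
while (codes.Count < 100000)
{
    var buffer = new char[4];
    for (int i = 0; i < buffer.Length; i++)
        buffer[i] = chars[random.Next(chars.Length)];
    codes.Add(new string(buffer));
}
// Now bulk-insert the contents of 'codes' into the vouchers table.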
If you insist on short codes:
Use a GUID as a primary key and generate one random number. How you might want to translate this into alphanumerics is up to you.
Use the last byte or two of the GUID plus the random number, e.g. 1234-684687. This should make it slightly less easy to brute-force coupons. And handle any (rare) collisions with an exception.
An easy way to shorten an int is to change its base (from 10 to 62). (This is in VB, and it is old code.)
This yields "2lkCB1" when given Int32.MaxValue
''//given intValue as your random integer
Dim result As String = String.Empty
Dim digits as String = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Dim x As Integer
While (intValue > 0)
x = intValue Mod digits.Length
result = digits(x) & result
intValue = intValue - x
intValue = intValue \ digits.Length
End While
Return result
But now we're already answering more than one question.
For a bulk data operation like this, I would recommend not using NHibernate and just doing straight ADO.NET.
Batch Check
Since you anticipate generating big batches of codes at once, you should batch multiple code checks into a single round-trip to the database. If you're using SQL Server 2008 or higher, you could do this using table-valued parameters, checking a whole list of codes at once.
SELECT DISTINCT b.Code
FROM @batch b
WHERE NOT EXISTS (
    SELECT v.Code
    FROM dbo.Voucher v
    WHERE v.Code = b.Code
);
Concurrency
Now, what about concurrency issues? What if two users generate the same code at roughly the same time? Or simply in-between the time when we check the code for uniqueness and when we insert it into the Voucher table?
We can take care of that by modifying the query as follows:
DECLARE @batchid uniqueidentifier;
SET @batchid = NEWID();
INSERT INTO dbo.Voucher (Code, BatchId)
SELECT DISTINCT b.Code, @batchid
FROM @batch b
WHERE NOT EXISTS (
    SELECT Code
    FROM dbo.Voucher v
    WHERE b.Code = v.Code
);
SELECT Code
FROM dbo.Voucher
WHERE BatchId = @batchid;
Executing via .NET
Assuming that you have defined the following table-valued user type...
CREATE TYPE dbo.VoucherCodeList AS TABLE (
Code nvarchar(8) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
/* !!! Remember to specify the collation on your Voucher.Code column too, since you want upper and lower-case codes. */
);
... you could execute this query via .NET code like this:
public ICollection<string> GenerateCodes(int numberOfCodes)
{
var result = new List<string>(numberOfCodes);
while (result.Count < numberOfCodes)
{
var batchSize = Math.Min(_batchSize, numberOfCodes - result.Count);
var batch = Enumerable.Range(0, batchSize)
.Select(x => GenerateRandomCode());
var oldResultCount = result.Count;
result.AddRange(FilterAndSecureBatch(batch));
var filteredBatchSize = result.Count - oldResultCount;
var collisionRatio = ((double)batchSize - filteredBatchSize) / batchSize;
// Automatically increment length of random codes if collisions begin happening too frequently
if (collisionRatio > _collisionThreshold)
CodeLength++;
}
return result;
}
private IEnumerable<string> FilterAndSecureBatch(IEnumerable<string> batch)
{
using (var command = _connection.CreateCommand())
{
command.CommandText = _sqlQuery; // the concurrency-safe query listed above
var metaData = new[] { new SqlMetaData("Code", SqlDbType.NVarChar, 8) };
var param = command.Parameters.Add("@batch", SqlDbType.Structured);
param.TypeName = "dbo.VoucherCodeList";
param.Value = batch.Select(x =>
{
var record = new SqlDataRecord(metaData);
record.SetString(0, x);
return record;
});
using (var reader = command.ExecuteReader())
while (reader.Read())
yield return reader.GetString(0);
}
}
Performance
After implementing all of this (and moving the command and parameter creation out of the loop so it would be re-used between batches), I was able to insert 10,000 codes using a batch size of 500 consistently in approx. 0.5 to 2 seconds, or 5 to 20 codes per millisecond.
Code Density / Collisions / Guessability
The _collisionThreshold field limits the density of your codes. It's a value between 0 and 1. Actually, it must be less than 1 or else you would wind up in an infinite loop when the 4 digit codes were exhausted (probably should add an assertion for this in code). I would recommend never turning it above 0.5 for performance reasons. More than 50% collisions would mean it's spending more time testing already-used codes than actually generating new ones.
Keeping the collision threshold low is how you would control how hard-to-guess your codes are. Setting _collisionThreshold to 0.01 would generate codes such that there's approximately a 1% chance of someone guessing a code.
If collisions occur too frequently, CodeLength (which is used by the GenerateRandomCode() method) will be incremented. This value needs to be persisted somewhere. After executing GenerateCodes(), check CodeLength to see if it has changed and then save the new value.
Source Code
The full code is available here: https://gist.github.com/3217856. I am the author of this code, and am releasing it under the MIT license. I had fun with this little challenge, and also got to learn how to pass a table-valued parameter to an inline parametrized query. I hadn't ever done that before. I've only ever passed them to full-fledged stored procedures.
A possible solution for you is like this:
Find the maximum ID of a voucher (an integer). Then run any hash function on it, take the first 32 bits and convert them to the string you want to show the user (or use a 32-bit hash function such as the Jenkins hash function). This will probably work; hash collisions are pretty rare. But this solution is very similar to yours in terms of randomness.
You could run a test which finds the first 10 or 100 collisions (this should be enough for you) and forces the algorithm to "skip" them and use a different starting value. Then, you don't need to check the database at all (well, at least until you reach about 4294967296 vouchers...)
How about utilizing NHibernate's HiLo algorithm?
Here is an example of how you can get the next value (without DB access).

Comparing 2 huge lists using C# multiple times (with a twist)

Hey everyone, great community you've got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay the bills. I say this because I want you to take into consideration that I don't have formal Computer Science training, but I have been coding for the past 7 years.
I have several Excel tables with information (all numeric); basically it is "dialed phone numbers" in one column and the number of minutes to each of those numbers in another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13-digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1; these codes vary in length, and I need to check every one of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data: sometimes the dialed-number lists consist of up to 30,000 numbers, the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers, which means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or any other language, to be honest. I need to do this quite often, and designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
The lists don't need to be in Excel worksheets; I can export to a CSV file and work from there. I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time in answering my question. I guess in my ignorance I exaggerated the word "efficient". I don't perform this task every few seconds; it's something I have to do once per day, and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.
UPDATE
You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;
var groupedPefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var numberPrefixes = numbers
.Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
.First(n.StartsWith))
.ToList();
So how fast is this? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL    Runtime
-----------------
  1    10,198 ms
  2     1,179 ms
  3       205 ms
  4       130 ms
  5       107 ms
Just to give a rough idea - I did just one run and had a lot of other stuff going on.
Original answer
I wouldn't care much about performance - an average desktop PC can quite easily deal with database tables with 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list of 15,000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50,000 numbers with a prefix plus an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but this would add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
namespace Test
{
static class Program
{
static void Main()
{
// Set number of prefixes and calls to not more than 50 to get results
// printed to the console.
Console.Write("Generating prefixes");
List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
Console.WriteLine();
Console.Write("Generating calls");
List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
Console.WriteLine();
Console.WriteLine("Processing started.");
Stopwatch stopwatch = new Stopwatch();
const Int32 minimumPrefixLength = 5;
stopwatch.Start();
var groupedPefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var result = calls
.GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
.First(c.Number.StartsWith))
.Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
.ToList();
stopwatch.Stop();
Console.WriteLine("Processing finished.");
Console.WriteLine(stopwatch.Elapsed);
if ((prefixes.Count <= 50) && (calls.Count <= 50))
{
Console.WriteLine("Prefixes");
foreach (String prefix in prefixes.OrderBy(p => p))
{
Console.WriteLine(String.Format(" prefix={0}", prefix));
}
Console.WriteLine("Calls");
foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
{
Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration));
}
Console.WriteLine("Result");
foreach (Call call in result.OrderBy(c => c.Number))
{
Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration));
}
}
Console.ReadLine();
}
private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<String> prefixes = new List<String>(count);
StringBuilder stringBuilder = new StringBuilder(maximumLength);
while (prefixes.Count < count)
{
stringBuilder.Length = 0;
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
String prefix = stringBuilder.ToString();
if (prefixes.Count % 1000 == 0)
{
Console.Write(".");
}
if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
{
prefixes.Add(stringBuilder.ToString());
}
}
return prefixes;
}
private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<Call> calls = new List<Call>(count);
StringBuilder stringBuilder = new StringBuilder();
while (calls.Count < count)
{
stringBuilder.Length = 0;
stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
if (calls.Count % 1000 == 0)
{
Console.Write(".");
}
calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
}
return calls;
}
private class Call
{
public Call (String number, Decimal duration)
{
this.Number = number;
this.Duration = duration;
}
public String Number { get; private set; }
public Decimal Duration { get; private set; }
}
}
}
It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
bool found = false;
for (int length = 1; length <= 7; length++)
{
int prefix = ExtractDigits(callNumber, length);
if (carrier.Prefixes.Contains(prefix))
{
carrier.Calls.Add(callNumber);
found = true;
break;
}
}
if (found)
break;
}
If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
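A small sketch of that refinement (string prefixes here, and names like carrierPrefixes are assumptions rather than the code above):

// Per carrier: bucket the prefixes into sets keyed by prefix length,
// so only the lengths that carrier actually uses are extracted and tested.
var prefixesByLength = new Dictionary<int, HashSet<string>>();
foreach (var prefix in carrierPrefixes)
{
    if (!prefixesByLength.TryGetValue(prefix.Length, out var set))
        prefixesByLength[prefix.Length] = set = new HashSet<string>();
    set.Add(prefix);
}

// Lookup for one dialed number: try only the prefix lengths this carrier has.
bool MatchesCarrier(string callNumber)
{
    foreach (var entry in prefixesByLength)
    {
        if (callNumber.Length >= entry.Key &&
            entry.Value.Contains(callNumber.Substring(0, entry.Key)))
            return true;
    }
    return false;
}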
How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)
Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.
I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
You may want to use a Hashtable in C#.
This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.
You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.
