I have just started experimenting with Cassandra, and I'm using C# and the DataStax driver (v 3.0.8). I wanted to do some performance tests to see how fast Cassandra is handling time series data.
The results are shocking in that it takes an eternity to do a SELECT, so I guess I'm doing something wrong.
I have set up Cassandra on my local computer and I have created a table:
CREATE KEYSPACE dm WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
CREATE TABLE dm.daily_data_by_day (
symbol text,
value_type int,
as_of_day date,
revision_timestamp_utc timestamp,
value decimal,
PRIMARY KEY ((symbol, value_type), as_of_day, revision_timestamp_utc)
) WITH CLUSTERING ORDER BY (as_of_day ASC, revision_timestamp_utc ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have filled this table with about 15 million rows, divided into about 10000 partitions, each containing up to 10000 rows.
Here's the test I'm running (updated on request by phact):
[Test]
public void SelectPerformance()
{
_cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
_stopwatch = new Stopwatch();
var items = new[]
{
// 20 different items...
};
foreach (var item in items)
{
var watch = Stopwatch.StartNew();
var rows = ExecuteQuery(item.Symbol, item.FieldType, item.StartDate, item.EndDate);
watch.Stop();
Console.WriteLine($"{watch.ElapsedMilliseconds}\t{rows.Length}");
}
Console.WriteLine($"Average Execute: {_stopwatch.ElapsedMilliseconds/items.Length}");
_cluster.Dispose();
}
private Row[] ExecuteQuery(string symbol, int fieldType, LocalDate startDate, LocalDate endDate)
{
using (var session = _cluster.Connect("dm"))
{
var ps = session.Prepare(
#"SELECT
symbol,
value_type,
as_of_day,
revision_timestamp_utc,
value
FROM
daily_data_by_day
WHERE
symbol = ? AND
value_type = ? AND
as_of_day >= ? AND as_of_day < ?");
var statement = ps.Bind(symbol, fieldType, startDate, endDate);
statement.EnableTracing();
_stopwatch.Start();
var rowSet = session.Execute(statement);
_stopwatch.Stop();
return rowSet.ToArray();
}
}
The stopwatch tells me that session.Execute() takes 20-30 milliseconds to execute (update: after changing the code to create the cluster only once I'm down to about 15 milliseconds). So I enabled some tracing and got the following result:
activity | source_elapsed
--------------------------------------------------------------------------------------------
Parsing SELECT symbol, value_type, as_of_day, revision_timestamp_utc,...; | 47
Preparing statement | 98
Executing single-partition query on daily_data_by_day | 922
Acquiring sstable references | 939
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 978
Bloom filter allows skipping sstable 74 | 1003
Bloom filter allows skipping sstable 75 | 1015
Bloom filter allows skipping sstable 72 | 1024
Bloom filter allows skipping sstable 73 | 1032
Key cache hit for sstable 63 | 1043
Merged data from memtables and 5 sstables | 1329
Read 100 live and 0 tombstone cells | 1353
If I understand this trace correctly, Cassandra spends less than 1.4 milliseconds executing my query. So what is the DataStax driver doing the rest of the time?
(As a reference, I have done the same performance test against a local SQL Server instance resulting in about 1-2 milliseconds executing the same query from C#.)
Update:
I have attempted to do some profiling, which is not that easy to do with asynchronous code that you don't own...
My conclusion is that most of the time is spent parsing the response. Each response contains between 2000 and 3000 rows, and parsing takes about 9 ms per response. Deserializing takes most of the time, about 6.5 ms, with decimal being the worst, about 3 ms per field. The other fields (text, int, date and timestamp) take about 0.5 ms per field.
Looking at my measured times I ought to have suspected this: the more rows in the response, the longer it takes, almost linearly.
#xmas79 highlighted a great point: you should not create too many session instances (better to use one per keyspace), but there are other guidelines that could help you. Follow the guidelines below and the reference:
Use one Cluster instance per (physical) cluster (per application lifetime)
Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a PreparedStatement
You can reduce the number of network round trips and also have atomic operations by using Batches
http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
EDIT
Also, taking a second look at your code, you are creating a prepared statement for every query you execute. The prepared statement should be created only once, and you should reuse its reference to execute the queries. What a prepared statement does is send the server the CQL that you will execute often, so the server parses the string once and returns an identifier for it. So my suggestion is: don't use one unless you are going to share the PreparedStatement object across queries. Or change your code to something like this:
[Test]
public void SelectPerformance()
{
_cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
var session = _cluster.Connect("dm");
var ps = session.Prepare(@"SELECT symbol, value_type, as_of_day, revision_timestamp_utc, value FROM daily_data_by_day WHERE symbol = ? AND value_type = ? AND as_of_day >= ? AND as_of_day < ?");
var items = new[]
{
// 20 different items...
};
long totalElapsedMs = 0;
foreach (var item in items)
{
var watch = Stopwatch.StartNew();
var rows = ExecuteQuery(session, ps, item.Symbol, item.FieldType, item.StartDate, item.EndDate);
watch.Stop();
totalElapsedMs += watch.ElapsedMilliseconds;
Console.WriteLine($"{watch.ElapsedMilliseconds}\t{rows.Length}");
}
Console.WriteLine($"Average Execute: {totalElapsedMs / items.Length}");
_cluster.Dispose();
}
private Row[] ExecuteQuery(ISession session, PreparedStatement ps, string symbol, int fieldType, LocalDate startDate, LocalDate endDate)
{
var statement = ps.Bind(symbol, fieldType, startDate, endDate);
// Do not enable request tracing for latency benchmarking
// statement.EnableTracing();
var rowSet = session.Execute(statement);
return rowSet.ToArray();
}
Short answer: you want to keep the Cluster object to Cassandra open and reuse it across requests.
The creation of the cluster object itself is costly, but it gives benefits like automatic load balancing, token awareness, automatic failover, etc.
Why do you execute
using (var session = _cluster.Connect("dm"))
on every query? You should build your Cluster instance once, connect to the cluster and get the Session once, and reuse them everywhere. I think the Cluster object configures important parameters like failover, load balancing etc., and the Session object manages them for you. Connecting every time will give you performance penalties.
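As a minimal sketch of that pattern (the class and member names here are just illustrative, not from your code), you could keep a single lazily-created Session for the whole application:

using System;
using Cassandra;

// Minimal sketch: one Cluster/Session pair, created lazily and shared for the
// whole application lifetime, instead of calling Connect() on every query.
public static class CassandraConnection
{
    private static readonly Lazy<ISession> LazySession = new Lazy<ISession>(() =>
        Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build()
            .Connect("dm"));

    // Reuse this everywhere; the driver handles connection pooling internally.
    public static ISession Session
    {
        get { return LazySession.Value; }
    }
}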
EDIT
It seems you are performing SELECTs with a latency of 10-15 ms each. Are you getting the same tracing numbers (e.g. 1.4 ms) for every query? What's your storage I/O system? If you are on spinning disks, it could be a seek-time penalty of your disk subsystem.
Related
I want to insert 40000 rows into Cassandra with a batch. But it always stops at number 32769 and gives me a "System.ArgumentOutOfRangeException". What should I do to insert more than 32769 rows into Cassandra?
Here is my code:
// Create the DCS data
DateTime ToDay = DateTime.Today;
string LotStr = ToDay.ToString("yyMMdd");
DateTime NowTime = DateTime.Now;
List<DCS_Model> DCS_list = new List<DCS_Model>();
Random rnd = new Random();
for (int i = 1; i <= 40000; i++)
{
DCS_list.Add(new DCS_Model(LotStr, String.Format("Tag_{0}", i), rnd.Next(1000) + rnd.NextDouble(), NowTime, NowTime));
}
// Upload to Cassandra
DateTime tt = DateTime.Now;
Cluster cluster = Cluster.Builder().AddContactPoint("192.168.52.182").Build();
ISession session = cluster.Connect("testkeyspace");
//List<PreparedStatement> StatementLs = new List<PreparedStatement>();
var InsertDCS = session.Prepare("INSERT INTO DCS_Test (LOT, NAME, VALUE, CREATETIME, SERVERTIME) VALUES (?, ?, ?, ?, ?)");
var batch = new BatchStatement();
foreach (DCS_Model dcs in DCS_list)
{
batch.Add(InsertDCS.Bind(dcs.LOT,dcs.NAME,dcs.VALUE,dcs.CREATETIME,dcs.SERVERTIME));
}
session.Execute(batch);
//Row result = session.Execute("select * from TestTable").First();
TimeSpan CassandraTime = DateTime.Now - tt;
//Console.WriteLine(CassandraTime);
It stops at batch.Add(InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME)) when batch.Add has been called 32768 times.
Please help me. Thanks!!
Batch functionality in the RDBMS world does not even remotely mirror batch functionality with Cassandra. They might be named the same, but they were designed for different purposes. In fact, Cassandra's should probably be renamed to "atomic" to avoid confusion.
Instead of batching them together all at once, try sending 40k individual requests asynchronously, with futures so that you know when they are all done. In the C# driver, the equivalent of Java's ListenableFuture is the Task returned by ExecuteAsync. You should look into that.
Sending 40k individual transactions might seem counter-intuitive. But it certainly beats hammering one Cassandra node as a coordinator (along with all the network traffic that the node will generate) to process and ensure atomicity for 40k upserts.
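As a rough sketch of that approach with the DataStax C# driver (reusing session, InsertDCS and DCS_list from the question's code; the 1000-requests-in-flight throttle is just an assumption):

// Sketch: many individual async inserts instead of one huge batch.
// Requires using System.Threading.Tasks; ExecuteAsync returns a Task<RowSet>.
var pending = new List<Task>();
foreach (DCS_Model dcs in DCS_list)
{
    var bound = InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME);
    pending.Add(session.ExecuteAsync(bound));

    // Throttle so we don't keep all 40k requests in flight at once.
    if (pending.Count >= 1000)
    {
        Task.WaitAll(pending.ToArray());
        pending.Clear();
    }
}
Task.WaitAll(pending.ToArray()); // wait for the remainder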
Also, make sure to use the TokenAware load balancing policy. That will direct your upserts to the exact node they need to go to (saving you a network hop through a coordinator).
Cluster cluster = Cluster.Builder().AddContactPoint("192.168.52.182")
.WithLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("westDC")))
.Build();
I found that the source code of BatchStatement throws an exception when the add count is more than Int16.MaxValue, so I changed the source code and that solved my problem!!
I'm looking for a solution on how to index a large set of strings - say 100 000 000 (probably more) with an average length of 50 bytes each (= 5 000 000 000 = 5 GB of data; and then in UTF16 and with .NET memory allocation, even more).
I then want to use the index to allow other processes to query if a string exists in the index -- and this as fast as possible.
I've done some simple testing with a large memory based HashSet - about 1 000 000 strings - and looking up e.g. 50 000 strings in that HashSet is only a matter of milliseconds.
Here's some pseudo code for what I want to achieve:
// 1) create huge disk based HashSet / Index / Lookup
using (var hs = new DiskBasedHashSet<string>(@"c:\index.bin", .create)) {
foreach (var s in lotsOfStringsToIndex) {
hs.Add(s);
}
}
// 2) use index to check if items exist - this needs to be fast
public static class Query {
static var hs = new DiskBasedHashSet<string>(@"c:\index.bin", .read);
// callable from anywhere, and really fast
public static bool QueryItem(string s) {
return hs.Contains(s);
}
}
foreach (var s in checkForThese) {
var result = Query.QueryItem(s);
}
I've tried using SQL Server, Lucene.NET, and B+ trees, with and without partitioning the data. Anyhow, these solutions are too slow and, I think, overqualified for this task. Imagine the overhead of creating a SQL query or a Lucene filter just to check for a string in a set.
I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem because my regexes start out fast and slow down.
For the first million or so strings it takes about 60 ms per 1000 strings to run the regular expressions. By the end, it's slowed down to the point where it's taking about 600 ms. Does anyone know why?
It was worse, but I improved it by using instances of RegEx instead of the cached version and compiling the expressions that I could.
Some of my regexes need to vary e.g. depending on the user's name it might be
mike said (\w*) or john said (\w*)
My understanding is that it is not possible to compile those regexes and pass in parameters (e.g saidRegex.Match(inputString, userName)).
Does anyone have any suggestions?
[Edited to accurately reflect speed - was per 1000 strings, not per string]
This may not be a direct answer to your question about RegEx performance degradation - which is somewhat fascinating. However - after reading all of the commentary and discussion above - I'd suggest the following:
Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:
Player_Name | Monetary_Value
If you were to create a database table containing these values per-row, and then catch each new row as it is being created - parse it - and append to the data table - you could easily do any kind of analysis / calculation against the data - without having to parse 25M rows again and again (which is a waste).
Additionally - on the first run, if you were to break the 25M records down into 100,000 record blocks, then run the algorithm 250 times (100,000 x 250 = 25,000,000) - you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.
In other words - consider the following:
Create a database table as follows:
CREATE TABLE PlayerActions (
RowID INT PRIMARY KEY IDENTITY,
Player_Name VARCHAR(50) NOT NULL,
Monetary_Value MONEY NOT NULL
)
Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.
public void ParseFullDataSet(IEnumerable<String> dataSource) {
var rowCount = dataSource.Count();
var setCount = rowCount / 100000;
if (rowCount % 100000 != 0)
setCount++;
for (int i = 0; i < setCount; i++) {
var set = dataSource.Skip(i * 100000).Take(100000);
ParseSet(set);
}
}
public void ParseSet(IEnumerable<String> dataSource) {
String playerName = String.Empty;
decimal monetaryValue = 0.0m;
// Assume here that the method reflects your RegEx generator.
String regex = RegexFactory.Generate();
foreach (String data in dataSource) {
Match match = Regex.Match(data, regex);
if (match.Success) {
playerName = match.Groups[1].Value;
// Might want to add error handling here.
monetaryValue = Convert.ToDecimal(match.Groups[2].Value);
db.PlayerActions.Add(new PlayerAction() {
// ID = ..., // Set at DB layer using Auto_Increment
Player_Name = playerName,
Monetary_Value = monetaryValue
});
db.SaveChanges();
// If not using Entity Framework, use another method to insert
// a row to your database table.
}
}
}
Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
ParseSet(new List<String>() { newValue });
or if multiples are created at once, call:
ParseSet(newValues); // Where newValues is an IEnumerable<String>
Now you can do whatever computational analysis or data mining you want from the data, without having to worry about performance over 25m rows on-the-fly.
Regex does take time to compute. However, you can make it more compact using some tricks.
You can also use string functions in C# to avoid the regex functions.
The code would be lengthier but might improve performance.
String has several functions to cut and extract characters and do pattern matching as you need,
e.g. IndexOfAny, LastIndexOf, Contains...
string str = "mon";
string[] str2 = new string[] { "mon", "tue", "wed" };
// Array.IndexOf returns the position of str in str2, or -1 if it is not found.
if (Array.IndexOf(str2, str) >= 0)
{
//success code//
}
I am running a test where I am comparing fetch time between AppFabric and SQL Server 2008, and it looks like AppFabric is performing 4x slower than SQL Server.
I have a SQL Server 2008 setup which contains only one table with 4 columns (all nvarchar). The table has 6000 rows. I insert the same data (as CLR-serializable objects) into the AppFabric cache. I am running a loop to fetch data x times.
Here is the code
public class AppFabricCache
{
readonly DataCache myDefaultCache;
public AppFabricCache()
{
//-------------------------
// Configure Cache Client
//-------------------------
//Define Array for 1 Cache Host
var servers = new List<DataCacheServerEndpoint>(1);
//Specify Cache Host Details
// Parameter 1 = host name
// Parameter 2 = cache port number
servers.Add(new DataCacheServerEndpoint(#"localhost", 22233));
//Create cache configuration
var configuration = new DataCacheFactoryConfiguration();
//Set the cache host(s)
configuration.Servers = servers;
//Set default properties for local cache (local cache disabled)
configuration.LocalCacheProperties = new DataCacheLocalCacheProperties();
//Disable exception messages since this sample works on a cache aside
DataCacheClientLogManager.ChangeLogLevel(System.Diagnostics.TraceLevel.Off);
//Pass configuration settings to cacheFactory constructor
DataCacheFactory myCacheFactory = new DataCacheFactory(configuration);
//Get reference to named cache called "default"
myDefaultCache = myCacheFactory.GetCache("default");
}
public bool TryGetCachedObject(string key, out object value)
{
value = myDefaultCache.Get(key);
bool result = value != null;
return result;
}
public void PutItemIntoCache(string key, object value)
{
myDefaultCache.Put(key, value, TimeSpan.FromDays(365));
}
}
And here is the loop to fetch data from the cache
public double RunReadStressTest(int numberOfIterations, out int recordReadCount)
{
recordReadCount = 0;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < numberOfIterations; i++)
{
for (int j = 1; j <= 6000; j++)
{
string posId = "PosId-" + j;
try
{
object value;
if (TryGetCachedObject(posId, out value))
recordReadCount++;
}
catch (Exception e)
{
Trace.WriteLine("AS%% - Exception - " + e.Message);
}
}
}
sw.Stop();
return sw.ElapsedMilliseconds;
}
}
I have exactly the same logic to retrieve data from SQL Server. It creates a command like:
SELECT * FROM TableName WHERE posId = 'someId'
Here are the results...
SQL Server 2008 R2 Reading-1(ms) Reading-2(ms) Reading-3(ms) Average Time in Seconds
Iteration Count = 5 2528 2649 2665 2.614
Iteration Count = 10 5280 5445 5343 5.356
Iteration Count = 15 7978 8370 7800 8.049333333
Iteration Count = 20 9277 9643 10220 9.713333333
AppFabric Reading-1 Reading-2 Reading-3 Average Time in Seconds
Iteration Count = 5 10301 10160 10186 10.21566667
Iteration Count = 10 20130 20191 20650 20.32366667
Iteration Count = 15 30747 30571 30647 30.655
Iteration Count = 20 40448 40541 40503 40.49733333
Am I missing something here? Why is it so slow?
The difference is the network overhead. In your SQL example, you hop over the network once and select N rows. In your AppFabric example, you hop over the network PER RECORD instead of in bulk. This is the difference. To prove this, temporarily store your records in AppFabric as a List and get just the list one time, or use the AppFabric Bulk API to select them all in one request - that should account for much of the difference.
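For example, a rough sketch of the bulk variant (assuming the DataCache.BulkGet overload that takes a list of keys is available in your AppFabric client, and reusing myDefaultCache from the question):

// Sketch: fetch all 6000 values in one request instead of 6000 individual Gets.
var keys = new List<string>();
for (int j = 1; j <= 6000; j++)
{
    keys.Add("PosId-" + j);
}

int recordReadCount = 0;
// BulkGet returns the key/value pairs found on the server for the given keys.
foreach (KeyValuePair<string, object> pair in myDefaultCache.BulkGet(keys))
{
    if (pair.Value != null)
        recordReadCount++;
}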
This may be caused by .Net's built in serialisation.
.Net serialisation utilises reflection which in turn has very poor performance. I'd recommend looking into the use of custom written serialisation code.
I think your test is biased and your results are non-optimal.
About Distributed Cache
Local Cache: you have disabled the local cache feature. Cache objects are always retrieved from the server; network transfer and deserialization have a cost.
BulkGet: BulkGet improves performance when used with small objects, for example, when retrieving many objects of 1-5 KB or less in size.
No Data Compression: there is no compression between AppFabric and the cache clients. Check this.
About Your test
Another important thing is that you are not testing the same thing: on one side you test SELECT *, and on the other side you test N x GET ITEM.
Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.
I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1, these codes vary in length, and I need to check everyone of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data; sometimes the dialed-number lists consist of up to 30,000 numbers, the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers, which means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or any other language, to be honest. I need to do this quite often, so designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time answering my question. I guess in my ignorance I over-exaggerated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day, and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.
UPDATE
You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;
var groupedPrefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var numberPrefixes = numbers
.Select(n => groupedPrefixes[n.Substring(0, minimumPrefixLength)]
.First(n.StartsWith))
.ToList();
So how fast is this? 15.000 prefixes and 50.000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL Runtime
-----------------
1 10.198 ms
2 1.179 ms
3 205 ms
4 130 ms
5 107 ms
Just to give a rough idea - I did just one run and have a lot of other stuff going on.
Original answer
I wouldn't care much about performance - an average desktop PC can quite easily deal with database tables with 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list with 15.000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50.000 numbers with a prefix and an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
namespace Test
{
static class Program
{
static void Main()
{
// Set number of prefixes and calls to not more than 50 to get results
// printed to the console.
Console.Write("Generating prefixes");
List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
Console.WriteLine();
Console.Write("Generating calls");
List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
Console.WriteLine();
Console.WriteLine("Processing started.");
Stopwatch stopwatch = new Stopwatch();
const Int32 minimumPrefixLength = 5;
stopwatch.Start();
var groupedPrefixes = prefixes
.GroupBy(p => p.Substring(0, minimumPrefixLength))
.ToDictionary(g => g.Key, g => g);
var result = calls
.GroupBy(c => groupedPrefixes[c.Number.Substring(0, minimumPrefixLength)]
.First(c.Number.StartsWith))
.Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
.ToList();
stopwatch.Stop();
Console.WriteLine("Processing finished.");
Console.WriteLine(stopwatch.Elapsed);
if ((prefixes.Count <= 50) && (calls.Count <= 50))
{
Console.WriteLine("Prefixes");
foreach (String prefix in prefixes.OrderBy(p => p))
{
Console.WriteLine(String.Format(" prefix={0}", prefix));
}
Console.WriteLine("Calls");
foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
{
Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration));
}
Console.WriteLine("Result");
foreach (Call call in result.OrderBy(c => c.Number))
{
Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration));
}
}
Console.ReadLine();
}
private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<String> prefixes = new List<String>(count);
StringBuilder stringBuilder = new StringBuilder(maximumLength);
while (prefixes.Count < count)
{
stringBuilder.Length = 0;
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
String prefix = stringBuilder.ToString();
if (prefixes.Count % 1000 == 0)
{
Console.Write(".");
}
if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
{
prefixes.Add(stringBuilder.ToString());
}
}
return prefixes;
}
private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
{
Random random = new Random();
List<Call> calls = new List<Call>(count);
StringBuilder stringBuilder = new StringBuilder();
while (calls.Count < count)
{
stringBuilder.Length = 0;
stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
{
stringBuilder.Append(random.Next(10));
}
if (calls.Count % 1000 == 0)
{
Console.Write(".");
}
calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
}
return calls;
}
private class Call
{
public Call (String number, Decimal duration)
{
this.Number = number;
this.Duration = duration;
}
public String Number { get; private set; }
public Decimal Duration { get; private set; }
}
}
}
It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
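A minimal sketch of such a digit trie (all names here are illustrative, not from an existing library):

// Sketch of a digit-keyed trie: each node has up to 10 children, and a node
// that terminates a prefix stores the carrier that owns that prefix.
public class PrefixTrie
{
    private class Node
    {
        public readonly Node[] Children = new Node[10];
        public string Carrier; // non-null only on nodes that end a prefix
    }

    private readonly Node _root = new Node();

    public void Add(string prefix, string carrier)
    {
        Node node = _root;
        foreach (char c in prefix)
        {
            int digit = c - '0';
            if (node.Children[digit] == null)
                node.Children[digit] = new Node();
            node = node.Children[digit];
        }
        node.Carrier = carrier;
    }

    // Walk the dialed number digit by digit and return the carrier of the
    // longest matching prefix, or null if no prefix matches.
    public string FindCarrier(string dialedNumber)
    {
        Node node = _root;
        string match = null;
        foreach (char c in dialedNumber)
        {
            int digit = c - '0';
            if (digit < 0 || digit > 9) break; // stop on non-digit characters
            node = node.Children[digit];
            if (node == null) break;
            if (node.Carrier != null) match = node.Carrier;
        }
        return match;
    }
}

With that, processing a row is one FindCarrier call plus one dictionary update of the carrier's running total.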
The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
bool found = false;
for (int length = 1; length <= 7; length++)
{
int prefix = ExtractDigits(callNumber, length);
if (carrier.Prefixes.Contains(prefix))
{
carrier.Calls.Add(callNumber);
found = true;
break;
}
}
if (found)
break;
}
If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
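A rough sketch of that refinement for a single carrier (carrierPrefixes and calls are assumed inputs; prefixes are kept as strings so leading zeros survive):

// Sketch: group this carrier's prefixes into one HashSet per prefix length,
// so each call only tests the lengths the carrier actually uses.
var prefixSetsByLength = carrierPrefixes // IEnumerable<string>, assumed
    .GroupBy(p => p.Length)
    .ToDictionary(g => g.Key, g => new HashSet<string>(g));

double totalMinutes = 0;
foreach (var call in calls) // call.Number and call.Minutes are assumed members
{
    foreach (var entry in prefixSetsByLength)
    {
        int length = entry.Key;
        if (call.Number.Length >= length &&
            entry.Value.Contains(call.Number.Substring(0, length)))
        {
            totalMinutes += call.Minutes;
            break;
        }
    }
}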
How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)
Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.
I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
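As a hedged sketch of that idea (Call with Number and Minutes is an assumed type, and the list must already be sorted by Number with ordinal comparison):

// Sketch: lower-bound binary search for the first number >= the prefix, then
// walk forward while numbers still start with it. All numbers that share the
// prefix form one contiguous block in ordinal sort order.
static double MinutesForPrefix(List<Call> sortedCalls, string prefix)
{
    int lo = 0, hi = sortedCalls.Count;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (string.CompareOrdinal(sortedCalls[mid].Number, prefix) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    double total = 0;
    for (int i = lo; i < sortedCalls.Count &&
         sortedCalls[i].Number.StartsWith(prefix, StringComparison.Ordinal); i++)
    {
        total += sortedCalls[i].Minutes;
    }
    return total;
}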
You may want to use a Hashtable (or the generic Dictionary) in C#.
This way you have key-value pairs, and your keys could be the phone numbers, and your value the total minutes. If a match is found in the key set, then modify the total minutes, else, add a new key.
You would then just need to modify your searching algorithm, to not look at the entire key, but only the first 7 digits of it.
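One way to read that suggestion as code (a sketch with assumed inputs: prefixes is the carrier's prefix list and calls holds number/minutes pairs):

// Sketch: a dictionary of running totals keyed by prefix. For each dialed
// number, try its leading 7, 6, ... 1 digits; the first key present wins.
var totalMinutesByPrefix = new Dictionary<string, double>();
foreach (string prefix in prefixes)
    totalMinutesByPrefix[prefix] = 0;

foreach (var call in calls) // call.Number and call.Minutes are assumed members
{
    for (int length = Math.Min(7, call.Number.Length); length >= 1; length--)
    {
        string key = call.Number.Substring(0, length);
        if (totalMinutesByPrefix.ContainsKey(key))
        {
            totalMinutesByPrefix[key] += call.Minutes;
            break;
        }
    }
}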