I am running a test comparing fetch times between AppFabric and SQL Server 2008, and it looks like AppFabric is performing about 4x slower than SQL Server.
I have a SQL Server 2008 setup that contains only one table with 4 columns (all nvarchar). The table has 6000 rows. I insert the same rows (as CLR-serializable objects) into the AppFabric cache, then run a loop that fetches the data x times.
Here is the code
public class AppFabricCache
{
readonly DataCache myDefaultCache;
public AppFabricCache()
{
//-------------------------
// Configure Cache Client
//-------------------------
//Define Array for 1 Cache Host
var servers = new List<DataCacheServerEndpoint>(1);
//Specify Cache Host Details
// Parameter 1 = host name
// Parameter 2 = cache port number
servers.Add(new DataCacheServerEndpoint("localhost", 22233));
//Create cache configuration
var configuration = new DataCacheFactoryConfiguration();
//Set the cache host(s)
configuration.Servers = servers;
//Set default properties for local cache (local cache disabled)
configuration.LocalCacheProperties = new DataCacheLocalCacheProperties();
//Disable exception messages since this sample works on a cache aside
DataCacheClientLogManager.ChangeLogLevel(System.Diagnostics.TraceLevel.Off);
//Pass configuration settings to cacheFactory constructor
DataCacheFactory myCacheFactory = new DataCacheFactory(configuration);
//Get reference to named cache called "default"
myDefaultCache = myCacheFactory.GetCache("default");
}
public bool TryGetCachedObject(string key, out object value)
{
value = myDefaultCache.Get(key);
bool result = value != null;
return result;
}
public void PutItemIntoCache(string key, object value)
{
myDefaultCache.Put(key, value, TimeSpan.FromDays(365));
}
}
And here is the loop to fetch data from the cache
public double RunReadStressTest(int numberOfIterations, out int recordReadCount)
{
recordReadCount = 0;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < numberOfIterations; i++)
{
for (int j = 1; j <= 6000; j++)
{
string posId = "PosId-" + j;
try
{
object value;
if (TryGetCachedObject(posId, out value))
recordReadCount++;
}
catch (Exception e)
{
Trace.WriteLine("AS%% - Exception - " + e.Message);
}
}
}
sw.Stop();
return sw.ElapsedMilliseconds;
}
}
I have exactly the same logic to retrieve data from SQL Server. It creates a
sqlCommand = 'Select * from TableName where posId = 'someId''
Here are the results...
SQL Server 2008 R2     Reading-1 (ms)   Reading-2 (ms)   Reading-3 (ms)   Average Time (s)
Iteration Count = 5              2528             2649             2665              2.614
Iteration Count = 10             5280             5445             5343              5.356
Iteration Count = 15             7978             8370             7800              8.049
Iteration Count = 20             9277             9643            10220              9.713
AppFabric              Reading-1 (ms)   Reading-2 (ms)   Reading-3 (ms)   Average Time (s)
Iteration Count = 5             10301            10160            10186             10.216
Iteration Count = 10            20130            20191            20650             20.324
Iteration Count = 15            30747            30571            30647             30.655
Iteration Count = 20            40448            40541            40503             40.497
Am I missing something here? Why is it so slow?
The difference is the network overhead. In your SQL example, you hop over the network once and select N rows. In your AppFabric example, you hop over the network PER RECORD instead of in bulk. This is the difference. To prove this, temporarily store your records in AppFabric as a List and get just the list one time, or use the AppFabric Bulk API to select them all in one request - that should account for much of the difference.
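For reference, here is a rough sketch of the bulk approach, assuming the DataCache.BulkGet overload that takes a list of keys, and reusing the "PosId-n" key scheme and myDefaultCache field from the question (the chunk size is an arbitrary choice):
// Hedged sketch: fetch the 6000 items in a few round-trips instead of one Get per key.
// Requires System.Linq; myDefaultCache is the DataCache field from the question's class.
public int ReadAllWithBulkGet()
{
    var keys = Enumerable.Range(1, 6000).Select(j => "PosId-" + j).ToList();
    int recordReadCount = 0;
    const int chunkSize = 500; // arbitrary; very large key lists are better split into smaller requests
    for (int offset = 0; offset < keys.Count; offset += chunkSize)
    {
        var chunk = keys.Skip(offset).Take(chunkSize).ToList();
        foreach (KeyValuePair<string, object> pair in myDefaultCache.BulkGet(chunk))
        {
            if (pair.Value != null)
                recordReadCount++;
        }
    }
    return recordReadCount;
}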
This may be caused by .Net's built in serialisation.
.Net serialisation utilises reflection which in turn has very poor performance. I'd recommend looking into the use of custom written serialisation code.
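As an illustration only (not specific to AppFabric's serializer settings), one common way to take reflection out of the serialization path is to implement ISerializable on the cached type so its fields are written and read explicitly. The PosRecord type and its field names below are hypothetical stand-ins for the four nvarchar columns in the question:
using System;
using System.Runtime.Serialization;
// Hypothetical cached type; the four string fields stand in for the four nvarchar columns.
[Serializable]
public class PosRecord : ISerializable
{
    public string PosId;
    public string Col2;
    public string Col3;
    public string Col4;
    public PosRecord() { }
    // Deserialization constructor: reads each field explicitly, no reflection over members.
    protected PosRecord(SerializationInfo info, StreamingContext context)
    {
        PosId = info.GetString("p");
        Col2 = info.GetString("c2");
        Col3 = info.GetString("c3");
        Col4 = info.GetString("c4");
    }
    // Serialization: writes each field explicitly.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("p", PosId);
        info.AddValue("c2", Col2);
        info.AddValue("c3", Col3);
        info.AddValue("c4", Col4);
    }
}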
I think your test is biased and your results are non optimal.
About Distributed Cache
Local Cache: you have disabled the local cache feature, so cached objects are always retrieved from the server; the network transfer and deserialization have a cost (see the sketch after this list).
BulkGet: BulkGet improves performance when used with small objects, for example when retrieving many objects of 1-5 KB or less in size.
No Data Compression: there is no compression between AppFabric and its cache clients. Check this.
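A minimal sketch of turning the local cache on, assuming the DataCacheLocalCacheProperties constructor that takes an object count, a timeout and an invalidation policy (the numbers are arbitrary); servers is the endpoint list from the question's constructor:
// Hedged sketch: enable the client-side local cache so repeated Gets for the same key
// can be served from the client's own memory instead of always hitting the cache host.
var configuration = new DataCacheFactoryConfiguration();
configuration.Servers = servers;
configuration.LocalCacheProperties = new DataCacheLocalCacheProperties(
    10000,                                               // assumed parameter: max objects held locally
    TimeSpan.FromMinutes(5),                             // assumed parameter: how long a local copy stays valid
    DataCacheLocalCacheInvalidationPolicy.TimeoutBased); // expire local copies on timeout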
About Your test
Another important thing is that you are not testing the same thing: on one side you test a single SELECT *, and on the other side you test N x GET ITEM.
Related
I want to insert 40000 rows into Cassandra with a batch, but it always stops at row 32769 and throws a System.ArgumentOutOfRangeException. What should I do to insert more than 32769 rows into Cassandra?
Here is my code:
// Build the DCS data
DateTime ToDay = DateTime.Today;
string LotStr = ToDay.ToString("yyMMdd");
DateTime NowTime = DateTime.Now;
List<DCS_Model> DCS_list = new List<DCS_Model>();
Random rnd = new Random();
for (int i = 1; i <= 40000; i++)
{
    DCS_list.Add(new DCS_Model(LotStr, String.Format("Tag_{0}", i), rnd.Next(1000) + rnd.NextDouble(), NowTime, NowTime));
}
// Upload to Cassandra
DateTime tt = DateTime.Now;
Cluster cluster = Cluster.Builder().AddContactPoint("192.168.52.182").Build();
ISession session = cluster.Connect("testkeyspace");
//List<PreparedStatement> StatementLs = new List<PreparedStatement>();
var InsertDCS = session.Prepare("INSERT INTO DCS_Test (LOT, NAME, VALUE, CREATETIME, SERVERTIME) VALUES (?, ?, ?, ?, ?)");
var batch = new BatchStatement();
foreach (DCS_Model dcs in DCS_list)
{
    batch.Add(InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME));
}
session.Execute(batch);
//Row result = session.Execute("select * from TestTable").First();
TimeSpan CassandraTime = DateTime.Now - tt;
//Console.WriteLine(CassandraTime);
It stops at batch.Add(InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME)) after the batch has been added to 32768 times.
Please help me. Thanks!!
Batch functionality in the RDBMS world does not even remotely mirror batch functionality in Cassandra. They might share a name, but they were designed for different purposes. In fact, Cassandra's batches should probably be renamed to "atomic" to avoid confusion.
Instead of batching them all together at once, try sending 40k individual requests, asynchronously, with futures so that you know when they are all done. In the C# driver, ExecuteAsync returns a Task, which plays the role of Java's ListenableFuture. You should look into that.
Sending 40k individual transactions might seem counter-intuitive. But it certainly beats hammering one Cassandra node as a coordinator (along with all the network traffic that the node will generate) to process and ensure atomicity for 40k upserts.
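A hedged sketch of that approach with the DataStax C# driver, assuming ISession.ExecuteAsync is available in your driver version and reusing the DCS_Model, DCS_list and prepared INSERT from the question (the concurrency limit of 256 is an arbitrary choice):
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Cassandra;
// Hedged sketch: insert the rows individually and asynchronously, capping the number in flight.
static async Task InsertAllAsync(ISession session, PreparedStatement insertDcs, List<DCS_Model> rows)
{
    var gate = new SemaphoreSlim(256);   // arbitrary concurrency limit; tune for your cluster
    var tasks = new List<Task>();
    foreach (DCS_Model dcs in rows)
    {
        await gate.WaitAsync();
        var bound = insertDcs.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME);
        tasks.Add(ExecuteAndReleaseAsync(session, bound, gate));
    }
    await Task.WhenAll(tasks);           // every insert has completed (or faulted) at this point
}
static async Task ExecuteAndReleaseAsync(ISession session, IStatement statement, SemaphoreSlim gate)
{
    try { await session.ExecuteAsync(statement); }
    finally { gate.Release(); }
}
Awaiting Task.WhenAll gives you the same "all done" point the batch gave you, without funnelling everything through a single coordinator.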
Also, make sure to use the Token Aware load balancing policy. That will direct your upsert to the exact node that it needs to go (saving you a network hop from using a coordinator).
Cluster cluster = Cluster.Builder()
    .AddContactPoint("192.168.52.182")
    .WithLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("westDC")))
    .Build();
I found that the source code of BatchStatement throws an exception when the add count exceeds Int16.MaxValue. So I changed the source code and that solved the problem!!
I have just started experimenting with Cassandra, and I'm using C# and the DataStax driver (v 3.0.8). I wanted to do some performance tests to see how fast Cassandra is handling time series data.
The results are shocking in that it takes an eternity to do a SELECT, so I guess I'm doing something wrong.
I have setup Cassandra on my local computer and I have created a table:
CREATE KEYSPACE dm WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
CREATE TABLE dm.daily_data_by_day (
symbol text,
value_type int,
as_of_day date,
revision_timestamp_utc timestamp,
value decimal,
PRIMARY KEY ((symbol, value_type), as_of_day, revision_timestamp_utc)
) WITH CLUSTERING ORDER BY (as_of_day ASC, revision_timestamp_utc ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have filled this table with about 15 million rows, divided into about 10000 partitions, each containing up to 10000 rows.
Here's the test I'm running (updated on request by phact):
[Test]
public void SelectPerformance()
{
_cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
_stopwatch = new Stopwatch();
var items = new[]
{
// 20 different items...
};
foreach (var item in items)
{
var watch = Stopwatch.StartNew();
var rows = ExecuteQuery(item.Symbol, item.FieldType, item.StartDate, item.EndDate);
watch.Stop();
Console.WriteLine($"{watch.ElapsedMilliseconds}\t{rows.Length}");
}
Console.WriteLine($"Average Execute: {_stopwatch.ElapsedMilliseconds/items.Length}");
_cluster.Dispose();
}
private Row[] ExecuteQuery(string symbol, int fieldType, LocalDate startDate, LocalDate endDate)
{
using (var session = _cluster.Connect("dm"))
{
var ps = session.Prepare(
#"SELECT
symbol,
value_type,
as_of_day,
revision_timestamp_utc,
value
FROM
daily_data_by_day
WHERE
symbol = ? AND
value_type = ? AND
as_of_day >= ? AND as_of_day < ?");
var statement = ps.Bind(symbol, fieldType, startDate, endDate);
statement.EnableTracing();
_stopwatch.Start();
var rowSet = session.Execute(statement);
_stopwatch.Stop();
return rowSet.ToArray();
}
}
The stopwatch tells me that session.Execute() takes 20-30 milliseconds to execute (update: after changing the code to create the cluster only once I'm down to about 15 milliseconds). So I enabled some tracing and got the following result:
activity | source_elapsed
--------------------------------------------------------------------------------------------
Parsing SELECT symbol, value_type, as_of_day, revision_timestamp_utc,...; | 47
Preparing statement | 98
Executing single-partition query on daily_data_by_day | 922
Acquiring sstable references | 939
Skipped 0/5 non-slice-intersecting sstables, included 0 due to tombstones | 978
Bloom filter allows skipping sstable 74 | 1003
Bloom filter allows skipping sstable 75 | 1015
Bloom filter allows skipping sstable 72 | 1024
Bloom filter allows skipping sstable 73 | 1032
Key cache hit for sstable 63 | 1043
Merged data from memtables and 5 sstables | 1329
Read 100 live and 0 tombstone cells | 1353
If I understand this trace correctly, Cassandra spends less than 1.4 milliseconds executing my query. So what is the DataStax driver doing the rest of the time?
(As a reference, I have done the same performance test against a local SQL Server instance resulting in about 1-2 milliseconds executing the same query from C#.)
Update:
I have attempted to do some profiling, which is not that easy to do with asynchronous code that you don't own...
My conclusion is that most of the time is spent parsing the response. Each response contains between 2000 and 3000 rows, and parsing takes about 9 ms per response. Deserializing takes most of that time, about 6.5 ms, with decimal being the worst at about 3 ms per field. The other fields (text, int, date and timestamp) take about 0.5 ms per field.
Looking at my measured times I ought to have suspected this: the more rows in the response, the longer it takes, almost linearly.
@xmas79 highlighted a great point: you should not create too many session instances (better to use one per keyspace). But there are also other guidelines that could help you. Follow the guidelines below and the reference:
Use one Cluster instance per (physical) cluster (per application lifetime)
Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a PreparedStatement
You can reduce the number of network round trips and also have atomic operations by using Batches
http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
EDIT
Also, taking a second look at your code, you are creating a prepared statement for every query you execute. A prepared statement should be created only once, and you should reuse its reference to execute the queries. What a prepared statement does is send the server the CQL you will execute often, so the server parses the string once and returns an identifier for it. So my suggestion is: don't use one unless you are going to share the PreparedStatement object across queries. Or change your code to something like this:
[Test]
public void SelectPerformance()
{
_cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
var session = _cluster.Connect("dm");
var ps = session.Prepare(@"SELECT symbol, value_type, as_of_day, revision_timestamp_utc, value FROM daily_data_by_day WHERE symbol = ? AND value_type = ? AND as_of_day >= ? AND as_of_day < ?");
var items = new[]
{
// 20 different items...
};
long totalElapsedMs = 0;
foreach (var item in items)
{
var watch = Stopwatch.StartNew();
var rows = ExecuteQuery(session, ps, item.Symbol, item.FieldType, item.StartDate, item.EndDate);
watch.Stop();
totalElapsedMs += watch.ElapsedMilliseconds;
Console.WriteLine($"{watch.ElapsedMilliseconds}\t{rows.Length}");
}
Console.WriteLine($"Average Execute: {totalElapsedMs / items.Length}");
_cluster.Dispose();
}
private Row[] ExecuteQuery(ISession session, PreparedStatement ps, string symbol, int fieldType, LocalDate startDate, LocalDate endDate)
{
var statement = ps.Bind(symbol, fieldType, startDate, endDate);
// Do not enable request tracing for latency benchmarking
// statement.EnableTracing();
var rowSet = session.Execute(statement);
return rowSet.ToArray();
}
Short answer: you want to keep the Cassandra cluster object open and reuse it across requests.
Creating the cluster object is itself costly, but it gives you benefits like automatic load balancing, token awareness, automatic failover, etc.
Why do you execute
using (var session = _cluster.Connect("dm"))
on every query? You should build your Cluster instance once, connect to the cluster and get the Session once, and reuse them everywhere. The Cluster object configures important parameters like failover and load balancing, and the Session object manages them for you. Connecting every time will give you a performance penalty.
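A minimal sketch of the "build once, reuse everywhere" pattern, assuming a single keyspace named dm as in the question:
// Build the Cluster and connect the Session once per process, then reuse them for every query.
public static class CassandraSessionHolder
{
    private static readonly Lazy<ISession> _session = new Lazy<ISession>(() =>
        Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build()
            .Connect("dm")); // one session, bound to the dm keyspace
    public static ISession Session
    {
        get { return _session.Value; }
    }
}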
EDIT
It seems you are performing SELECTs with a latency of 10-15 ms each. Are you getting the same tracing numbers (e.g. 1.4 ms) for every query? What is your storage I/O system? If you are on spinning disks it could be a seek-time penalty in your disk subsystem.
We've got a system that seems to be consuming a lot of data. It uses Dapper for database queries and Seq for logging. I was wondering whether, other than with SQL Profiler, there is a way to add logging to Dapper to log the size of the dataset returned in MB, so we can flag large datasets for review?
This question was asked a while ago, but I was wondering whether there is now a way of doing it without Wireshark, and ideally without iterating over the rows/cells?
I would configure Provider Statistics for SQL Server on the connection in the base repository class. You can add a config setting to switch it on and easily save this information off to a log file or wherever you want.
Example code from MSDN
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
namespace CS_Stats_Console_GetValue
{
class Program
{
static void Main(string[] args)
{
string connectionString = GetConnectionString();
using (SqlConnection awConnection =
new SqlConnection(connectionString))
{
// StatisticsEnabled is False by default.
// It must be set to True to start the
// statistic collection process.
awConnection.StatisticsEnabled = true;
string productSQL = "SELECT * FROM Production.Product";
SqlDataAdapter productAdapter =
new SqlDataAdapter(productSQL, awConnection);
DataSet awDataSet = new DataSet();
awConnection.Open();
productAdapter.Fill(awDataSet, "ProductTable");
// Retrieve the current statistics as
// a collection of values at this point
// and time.
IDictionary currentStatistics =
awConnection.RetrieveStatistics();
Console.WriteLine("Total Counters: " +
currentStatistics.Count.ToString());
Console.WriteLine();
// Retrieve a few individual values
// related to the previous command.
long bytesReceived =
(long) currentStatistics["BytesReceived"];
long bytesSent =
(long) currentStatistics["BytesSent"];
long selectCount =
(long) currentStatistics["SelectCount"];
long selectRows =
(long) currentStatistics["SelectRows"];
Console.WriteLine("BytesReceived: " +
bytesReceived.ToString());
Console.WriteLine("BytesSent: " +
bytesSent.ToString());
Console.WriteLine("SelectCount: " +
selectCount.ToString());
Console.WriteLine("SelectRows: " +
selectRows.ToString());
Console.WriteLine();
Console.WriteLine("Press any key to continue");
Console.ReadLine();
}
}
private static string GetConnectionString()
{
// To avoid storing the connection string in your code,
// you can retrieve it from a configuration file.
return "Data Source=localhost;Integrated Security=SSPI;" +
"Initial Catalog=AdventureWorks";
}
}
}
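If it helps, a rough sketch of wiring that into a Dapper-based base repository might look like the following; the repository shape and the Console logging call are placeholders for your own code and Seq logger:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;
public class InstrumentedRepository
{
    private readonly string _connectionString;
    public InstrumentedRepository(string connectionString)
    {
        _connectionString = connectionString;
    }
    public IEnumerable<T> Query<T>(string sql, object param = null)
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.StatisticsEnabled = true;            // switch provider statistics on for this connection
            connection.Open();
            var results = connection.Query<T>(sql, param);  // normal Dapper call (buffered by default)
            IDictionary stats = connection.RetrieveStatistics();
            long bytesReceived = (long)stats["BytesReceived"];
            double megabytes = bytesReceived / (1024.0 * 1024.0);
            // Placeholder: replace with your Seq logger and whatever threshold flags a "large" dataset.
            Console.WriteLine("Query returned approximately {0:F2} MB", megabytes);
            return results;
        }
    }
}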
Not really a complete answer, but might help you out.
sys.dm_exec_query_stats and sys.dm_exec_connections might help you trace large result sets. For example:
SELECT * FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
(Units for the read/write columns are 8 KB pages.)
This sort of gives you what you're using wireshark for at the moment (kinda :)
SELECT * FROM sys.dm_exec_connections
CROSS APPLY sys.dm_exec_sql_text(most_recent_sql_handle)
ORDER BY num_writes DESC
https://msdn.microsoft.com/en-us/library/ms189741.aspx
https://msdn.microsoft.com/en-AU/library/ms181509.aspx
You can estimate the size required by a row by summing up each column's type size, then multiplying by the number of rows. It should be accurate if you don't have TEXT / VARCHAR columns in your query:
int rowSize = 0;
foreach (DataColumn dc in Dataset1.Tables[0].Columns)
{
    // sizeof() only accepts compile-time types; use Marshal.SizeOf for the column's runtime type.
    rowSize += System.Runtime.InteropServices.Marshal.SizeOf(dc.DataType);
}
int dataSize = rowSize * Dataset1.Tables[0].Rows.Count;
In case you need a more accurate figure, sum up the size of each individual value using Marshal.SizeOf:
int dataSize = 0;
foreach(DataRow dr in Dataset1.Tables[0].Rows)
{
int rowSize = 0;
for (int i = 0; i < Dataset1.Tables[0].Columns.Count; i++)
{
rowSize += System.Runtime.InteropServices.Marshal.SizeOf(dr[i]);
}
dataSize += rowSize;
}
Ideas for performance gain if high accuracy is not a concern:
Compute the size of just a sample. Let's say, instead of iterating through all rows, pick 1 in every 100, then multiply your result by 100 in the end.
Use Marshal.SizeOf (https://msdn.microsoft.com/en-us/library/y3ybkfb3.aspx) to compute the size of each DataRow dr instead of iterating through all its values. It will give you a higher number, since a DataRow object has additional properties, but that's something you can tweak by subtracting the size of an empty DataRow.
Know the average size of a single row beforehand, from its columns, and just multiply by the number of rows.
We got around this limitation by just capping/limiting the query size. This prevents us from having to worry about size and doing double queries. The PR we used that you could also use is https://github.com/DapperLib/Dapper/pull/1758/files
On my site I allow people to buy subscriptions in bulk (I call them vouchers). Once they have these vouchers, they give them to whoever they like, and that person enters the code into their account to upgrade it.
Right now I am thinking of a 4-character alphanumeric code (upper case, lower case and digits), generated with something like this:
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[4];
var random = new Random();
for (int i = 0; i < stringChars.Length; i++)
{
stringChars[i] = chars[random.Next(chars.Length)];
}
var finalString = new String(stringChars);
For now I think that will give me more than enough combinations, and if I ever do run out I can always increase the length of the code. I want to keep it short because I don't want the user to have to type in huge strings.
I also don't have the time to build a more elegant solution, maybe one where they click a link in their email and it activates their account; of course that would also cut down on someone trying to randomly guess a voucher number.
These are things I would deal with if the site ever becomes more popular.
I am wondering, though, how I can handle the possible generation of duplicate vouchers. My first thought was to check the database each time a voucher is created and, if it already exists, make a new one.
However, that seems like it could be slow. So I also thought about getting all the keys first and storing them in memory, then checking there, but if the list keeps growing I might run into out-of-memory exceptions and all that great stuff.
So does anyone have any ideas? Or am I stuck with one of the 2 methods I listed above?
I am using nhibernate, asp.net mvc and C#.
Edit
static void Main(string[] args)
{
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
HashAlgorithm sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
hold.Add(hex.Substring(0,3));
Console.WriteLine(hex.Substring(0, 4));
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
Above is my attempt at using hashing. However, I think I am missing something, as it seems to produce quite a few more duplicates than expected.
Edit 2
I think I added what I was missing, but I'm not sure if this is exactly what he meant. I am also not sure what to do when I have moved it as far as I can move it (my hash seems to give me a length of 40, so there are only so many positions I can move to).
static void Main(string[] args)
{
int subStringLength = 4;
List<string> hold = new List<string>();
for (int i = 0; i < 10000; i++)
{
SHA1CryptoServiceProvider sha = new SHA1CryptoServiceProvider();
byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
string hex = null;
foreach (byte x in result)
{
hex += String.Format("{0:x2}", x);
}
int startingPositon = 0;
string possibleVoucherCode = hex.Substring(startingPositon,subStringLength);
string voucherCode = Move(subStringLength, hold, hex, startingPositon, possibleVoucherCode);
hold.Add(voucherCode);
}
Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
private static string Move(int subStringLength, List<string> hold, string hex, int startingPositon, string possibleVoucherCode)
{
if (hold.Contains(possibleVoucherCode))
{
int newPosition = startingPositon + 1;
if (newPosition <= hex.Length)
{
if ((newPosition + subStringLength) <= hex.Length)
{
possibleVoucherCode = hex.Substring(newPosition, subStringLength);
return Move(subStringLength, hold, hex, newPosition, possibleVoucherCode);
}
// return something
return "0";
}
else
{
// return something
return "0";
}
}
else
{
return possibleVoucherCode;
}
}
}
It is going to be slow because you want to generate the vouchers randomly and then check the database for every generated code.
I would create a table vouchers with an id, the code and an is_used column. I would fill that table once with enough random codes. Since this can be done in a separate process, the performance won't be such a big problem. Let it run in the evening and the next day you get a fully filled vouchers-table.
If you want to prevent generating duplicate vouchers, that won't be a problem. You can generate them anyway and put them either in a System.Collections.Generic.HashSet (which prevents adding duplicates without throwing an exception) or call the Linq-method Distinct(), before adding them to that vouchers table.
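A small sketch of that pre-generation step, reusing the character set from the question (the pool size and code length are arbitrary):
// Pre-generate a pool of unique codes; HashSet<string>.Add simply returns false for a duplicate,
// so looping until the set reaches the target size guarantees uniqueness without DB checks.
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var random = new Random();
var codes = new HashSet<string>();
while (codes.Count < 100000)                  // arbitrary pool size; 62^4 is about 14.7 million, so plenty of room
{
    var stringChars = new char[4];
    for (int i = 0; i < stringChars.Length; i++)
        stringChars[i] = chars[random.Next(chars.Length)];
    codes.Add(new string(stringChars));
}
// codes can now be bulk-inserted into the vouchers table with is_used = false.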
If you insist on short codes:
Use a GUID as a primary key, and generate one random number. How you translate this into alphanumerics is up to you.
Use the last byte or two of the GUID plus the random number, e.g. 1234-684687. This should make it slightly less easy to brute-force coupons. And handle any (rare) collisions with an exception.
An easy way to shorten an int is to change its base (from 10 to 62). (This is in VB, and it is old code.)
This yields "2lkCB1" when given Int32.MaxValue:
' Given intValue as your random integer
Dim result As String = String.Empty
Dim digits as String = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Dim x As Integer
While (intValue > 0)
x = intValue Mod digits.Length
result = digits(x) & result
intValue = intValue - x
intValue = intValue \ digits.Length
End While
Return result
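Since the rest of this code base is C#, a direct translation of the same base-62 idea might look like this (same digit alphabet as the VB snippet):
// Convert a non-negative integer to a short base-62 string.
static string ToBase62(int intValue)
{
    const string digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    string result = string.Empty;
    while (intValue > 0)
    {
        int x = intValue % digits.Length;
        result = digits[x] + result;
        intValue = (intValue - x) / digits.Length;
    }
    return result;
}
// ToBase62(int.MaxValue) yields "2lkCB1", matching the VB example.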
But now we're already answering more than one question.
For a bulk data operation like this, I would recommend not using NHibernate and just doing straight ADO.NET.
Batch Check
Since you anticipate generating big batches of codes at once, you should batch multiple code checks into a single round-trip to the database. If you're using SQL Server 2008 or higher, you could do this using table-valued parameters, checking a whole list of codes at once.
SELECT DISTINCT b.Code
FROM @batch b
WHERE NOT EXISTS (
SELECT v.Code
FROM dbo.Voucher v
WHERE v.Code = b.Code
);
Concurrency
Now, what about concurrency issues? What if two users generate the same code at roughly the same time? Or simply in-between the time when we check the code for uniqueness and when we insert it into the Voucher table?
We can take care of that by modifying the query as follows:
DECLARE @batchid uniqueidentifier;
SET @batchid = NEWID();
INSERT INTO dbo.Voucher (Code, BatchId)
SELECT DISTINCT b.Code, @batchid
FROM @batch b
WHERE NOT EXISTS (
SELECT Code
FROM dbo.Voucher v
WHERE b.Code = v.Code
);
SELECT Code
FROM dbo.Voucher
WHERE BatchId = @batchid;
Executing via .NET
Assuming that you have defined the following table-valued user type...
CREATE TYPE dbo.VoucherCodeList AS TABLE (
Code nvarchar(8) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
/* !!! Remember to specify the collation on your Voucher.Code column too, since you want upper and lower-case codes. */
);
... you could execute this query via .NET code like this:
public ICollection<string> GenerateCodes(int numberOfCodes)
{
var result = new List<string>(numberOfCodes);
while (result.Count < numberOfCodes)
{
var batchSize = Math.Min(_batchSize, numberOfCodes - result.Count);
var batch = Enumerable.Range(0, batchSize)
.Select(x => GenerateRandomCode());
var oldResultCount = result.Count;
result.AddRange(FilterAndSecureBatch(batch));
var filteredBatchSize = result.Count - oldResultCount;
var collisionRatio = ((double)batchSize - filteredBatchSize) / batchSize;
// Automatically increment length of random codes if collisions begin happening too frequently
if (collisionRatio > _collisionThreshold)
CodeLength++;
}
return result;
}
private IEnumerable<string> FilterAndSecureBatch(IEnumerable<string> batch)
{
using (var command = _connection.CreateCommand())
{
command.CommandText = _sqlQuery; // the concurrency-safe query listed above
var metaData = new[] { new SqlMetaData("Code", SqlDbType.NVarChar, 8) };
var param = command.Parameters.Add("@batch", SqlDbType.Structured);
param.TypeName = "dbo.VoucherCodeList";
param.Value = batch.Select(x =>
{
var record = new SqlDataRecord(metaData);
record.SetString(0, x);
return record;
});
using (var reader = command.ExecuteReader())
while (reader.Read())
yield return reader.GetString(0);
}
}
Performance
After implementing all of this (and moving the command and parameter creation out of the loop so it would be re-used between batches), I was able to insert 10,000 codes using a batch size of 500 consistently in approx. 0.5 to 2 seconds, or 5 to 20 codes per millisecond.
Code Density / Collisions / Guessability
The _collisionThreshold field limits the density of your codes. It's a value between 0 and 1. Actually, it must be less than 1 or else you would wind up in an infinite loop when the 4 digit codes were exhausted (probably should add an assertion for this in code). I would recommend never turning it above 0.5 for performance reasons. More than 50% collisions would mean it's spending more time testing already-used codes than actually generating new ones.
Keeping the collision threshold low is how you would control how hard-to-guess your codes are. Setting _collisionThreshold to 0.01 would generate codes such that there's approximately a 1% chance of someone guessing a code.
If collisions occur too frequently, CodeLength (which is used by the GenerateRandomCode() method) will be incremented. This value needs to be persisted somewhere. After executing GenerateCodes(), check CodeLength to see if it has changed and then save the new value.
Source Code
The full code is available here: https://gist.github.com/3217856. I am the author of this code, and am releasing it under the MIT license. I had fun with this little challenge, and also got to learn how to pass a table-valued parameter to an inline parametrized query. I hadn't ever done that before. I've only ever passed them to full-fledged stored procedures.
A possible solution for you is this:
Find the maximum ID of a voucher (an integer). Then run any hash function on it, take the first 32 bits and convert them to the string you want to show the user (or use a 32-bit hash function such as the Jenkins hash function). This will probably work; hash collisions are pretty rare. But this solution is very similar to yours in terms of randomness.
You could run a test which finds the first 10 or 100 collisions (this should be enough for you) and forces the algorithm to "skip" them and use a different starting value. Then you don't need to check the database at all (well, at least until you reach about 4294967296 vouchers...)
How about utilizing NHibernate's HiLo algorithm?
Here is an example of how you can get the next value (without DB access).
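The linked example is not reproduced here, but the general shape of a Hi/Lo allocator is roughly the following; this is a generic illustration, not NHibernate's actual implementation, and the database access is left as a stub:
// Generic Hi/Lo sketch: reserve a block of ids ("hi") with one DB round-trip, then hand out
// hi * blockSize + lo values in memory until the block is exhausted.
public class HiLoGenerator
{
    private readonly int _blockSize;
    private readonly object _sync = new object();
    private long _hi = -1;
    private int _lo;
    public HiLoGenerator(int blockSize)
    {
        _blockSize = blockSize;
    }
    public long NextId()
    {
        lock (_sync)
        {
            if (_hi < 0 || _lo >= _blockSize)
            {
                _hi = FetchAndIncrementHiFromDatabase(); // one DB round-trip per block of ids
                _lo = 0;
            }
            return _hi * _blockSize + _lo++;
        }
    }
    // Stub: atomically read and increment the stored "hi" value (e.g. a single-row table).
    private long FetchAndIncrementHiFromDatabase()
    {
        throw new System.NotImplementedException("Replace with your own hi-value storage.");
    }
}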
I'm using MongoDB 1.8.2 (Debian) and mongo-csharp-driver 1.1.0.4184 (IIS 7.5/.Net 4.0 x64).
Multiple items are inserted every second into an existing collection with ~3,000,000 objects (~1.9 GB).
The web server's memory increases by ~1 MB after every insert, which very quickly leads to > 2 GB memory usage.
The memory is never released, and only an application pool recycle can free it.
Any ideas?
MongoServer server = MongoServer.Create(mongoDBConnectionString);
MongoDatabase database = server.GetDatabase(dbName);
MongoCollection<Result> resultCollection = database.GetCollection<Result>("Result");
resultCollection.Insert(result);
result looks like
private class Result
{
public ObjectId _id { get; set; }
public DateTime Timestamp { get; set; }
public int Location { get; set; }
public string Content { get; set; }
}
UPDATE:
My problem is not the insert, it's the select - sorry weird codebase to investigate ;-)
I reproduced it with this sample Code:
Console.WriteLine("Start - " + GC.GetTotalMemory(false).ToString("#,###,##0") + " Bytes");
for (int i = 0; i < 10; i++)
{
MongoServer server = MongoServer.Create(mongoDBConnectionString);
MongoDatabase database = server.GetDatabase(dbName);
MongoCollection<Result> resultCollection = database.GetCollection<Result>("Result");
var query = Query.And ( Query.EQ("Location", 1), Query.GTE("Timestamp", DateTime.Now.AddDays(-90)) );
MongoCursor<Result> cursor = resultCollection.FindAs<Result>(query);
foreach (Result result in cursor)
{
// NOOP
}
Console.WriteLine(i + " - " + GC.GetTotalMemory(false).ToString("#,###,##0") + " Bytes");
}
Output from a .Net 4.0 Console Application with 10.000 results in the cursor:
Start - 193.060 Bytes
0 - 12.736.588 Bytes
1 - 24.331.600 Bytes
2 - 16.180.484 Bytes
3 - 13.223.036 Bytes
4 - 30.974.892 Bytes
5 - 13.335.236 Bytes
6 - 13.439.448 Bytes
7 - 13.942.436 Bytes
8 - 14.026.108 Bytes
9 - 14.113.352 Bytes
Output from a .Net 4.0 Web Application with the same 10.000 results in the cursor:
Start - 5.258.376 Bytes
0 - 20.677.816 Bytes
1 - 29.893.880 Bytes
2 - 43.783.016 Bytes
3 - 20.921.280 Bytes
4 - 34.814.088 Bytes
5 - 48.698.704 Bytes
6 - 62.576.480 Bytes
7 - 76.453.728 Bytes
8 - 90.347.360 Bytes
9 - 104.232.800 Bytes
RESULT:
The bug was reported to 10gen and they will fix it in a version >= 1.4 (they are currently working on 1.2)!!!
According to the documentation you should create one instance of MongoServer per server you connect to:
MongoServer class
This class serves as the root object for working with a MongoDB
server. You will create one instance of this class for each server you
connect to.
So you need only one instance of MongoServer. This should solve your memory leak issue.
I suggest using dependency injection (Unity, StructureMap, ...) to register the MongoServer instance in the container as a singleton, or just use the classic singleton pattern for MongoServer. If you are new to dependency injection you can take a look at Martin Fowler's article.
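A minimal sketch of the singleton approach, assuming the legacy 1.x driver surface used in the question (MongoServer.Create):
// Create the MongoServer once per application and reuse it for every request.
public static class MongoServerHolder
{
    private static readonly Lazy<MongoServer> _server = new Lazy<MongoServer>(
        () => MongoServer.Create("mongodb://localhost:27017")); // replace with your connection string
    public static MongoServer Instance
    {
        get { return _server.Value; }
    }
}
// Per-request usage:
// MongoDatabase database = MongoServerHolder.Instance.GetDatabase(dbName);
// MongoCollection<Result> resultCollection = database.GetCollection<Result>("Result");
// resultCollection.Insert(result);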
Update:
The issue was probably not in the MongoDB C# driver. I did the following test:
Windows 7, IIS 7.5, same C# driver, same MongoDB:
for (int i = 0; i < 3000000; i++)
{
MongoServer server = MongoServer.Create("mongodb://localhost:27020");
MongoDatabase database = server.GetDatabase("mpower_read");
MongoCollection<Result> resultCollection =
database.GetCollection<Result>("results");
resultCollection.Insert(new Result()
{
_id = ObjectId.GenerateNewId(),
Content = i.ToString(),
Location = i,
Timestamp = DateTime.Now
});
}
I even ran this test 3 times and the server did not eat memory at all. So...