Passing a .NET DataTable to MATLAB - C#

I'm building an interface layer for a Matlab component which is used to analyse data maintained by a separate .NET application which I am also building. I'm trying to serialise a .NET datatable as a numeric array to be passed to the MATLAB component (as part of a more generalised serialisation routine).
So far, I've been reasonably successful with passing tables of numeric data but I've hit a snag when trying to add a column of datatype DateTime. What I've been doing up to now is stuffing the values from the DataTable into a double array, because MATLAB only really cares about doubles, and then doing a straight cast to a MWNumericArray, which is essentially a matrix.
Here's the current code:
else if (sourceType == typeof(DataTable))
{
    DataTable dtSource = source as DataTable;
    var rowIdentifiers = new string[dtSource.Rows.Count];

    // I know this looks silly but we need the index of each item
    // in the string array as the actual value in the array as well
    for (int i = 0; i < dtSource.Rows.Count; i++)
    {
        rowIdentifiers[i] = i.ToString();
    }

    // convenience vars
    int rowCount = dtSource.Rows.Count;
    int colCount = dtSource.Columns.Count;
    double[,] values = new double[rowCount, colCount];

    // For each row
    for (int rownum = 0; rownum < rowCount; rownum++)
    {
        // for each column
        for (int colnum = 0; colnum < colCount; colnum++)
        {
            // ASSUMPTION: value is a double
            values[rownum, colnum] = Conversion.ConvertToDouble(dtSource.Rows[rownum][colnum]);
        }
    }
    return (MWNumericArray)values;
}
Conversion.ConvertToDouble is my own routine which caters for nulls and DBNull, returning double.NaN in those cases, again because MATLAB treats all nulls as NaNs.
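For reference, a minimal sketch of what such a helper might look like (the real Conversion.ConvertToDouble isn't shown in the question, so this is purely illustrative):

// Hypothetical stand-in for Conversion.ConvertToDouble: map null/DBNull to NaN,
// since MATLAB has no notion of null, and convert everything else to double.
static double ConvertToDouble(object value)
{
    if (value == null || value == DBNull.Value)
        return double.NaN;

    return Convert.ToDouble(value);
}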
So here's the thing: does anyone know of a MATLAB datatype that would allow me to pass in a contiguous array containing multiple datatypes? The only workaround I can conceive of involves using an MWStructArray of MWStructArrays, but that seems hacky and I'm not sure how well it would work in the MATLAB code, so I'd like to find a more elegant solution if I can. I've had a look at using an MWCellArray, but it gives me a compile error when I try to instantiate it.
I'd like to be able to do something like:
object[,] values = new object[rowCount, colCount];
// fill loosely-typed object array
return (MWCellArray)values;
But as I said, I get a compile error with this, and also when passing an object array to the constructor.
Apologies if I have missed anything silly. I've done some Googling, but information on Matlab to .NET interfaces seems a little light, so that is why I posted it here.
Thanks in advance.
[EDIT]
Thanks to everyone for the suggestions.
Turns out that the quickest and most efficient way for our specific implementation was to convert the Datetime to an int in the SQL code.
However, of the other approaches, I would recommend the MWCellArray one. It involves the least fuss, and it turns out I was just doing it wrong - you can't treat it like any other MWArray type; since it is designed to hold multiple datatypes, you need to iterate over it, inserting MWNumericArrays or whatever takes your fancy as you go. One thing to be aware of is that MWArrays are 1-based, not 0-based. That one keeps catching me out.
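To illustrate the iteration and the 1-based indexing, here is a rough sketch (rowCount, colCount and values are assumed to come from the DataTable conversion shown earlier):

// MWCellArray is 1-based, so shift the .NET 0-based loop indices by one.
var cell = new MWCellArray(rowCount, colCount);
for (int r = 0; r < rowCount; r++)
{
    for (int c = 0; c < colCount; c++)
    {
        cell[r + 1, c + 1] = new MWNumericArray(values[r, c]); // note the +1 on both indices
    }
}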
I'll go into a more detailed discussion later today when I have the time, but right now I don't. Thanks everyone once more for your help.

As @Matt suggested in the comments, if you want to store different datatypes (numeric, strings, structs, etc...), you should use the equivalent of cell-arrays exposed by this managed API, namely the MWCellArray class.
To illustrate, I implemented a simple .NET assembly. It exposes a MATLAB function that receives a cell-array (records from a database table) and simply prints them. This function is called from our C# application, which generates a sample DataTable and converts it into an MWCellArray (filling the table entries cell by cell).
The trick is to map the objects contained in the DataTable to the types supported by the MWArray-derived classes. Here are the ones I used (check the documentation for a complete list):
.NET native type          MWArray class
--------------------------------------------------------------
double, float, int, ...   MWNumericArray
string                    MWCharArray
DateTime                  MWNumericArray (using the Ticks property)
A note about the date/time data: in .NET, the System.DateTime expresses date and time as:
the number of 100-nanosecond intervals that have elapsed since January
1, 0001 at 00:00:00.000
while in MATLAB, this is what the DATENUM function has to say:
A serial date number represents the whole and fractional number of
days from a specific date and time, where datenum('Jan-1-0000
00:00:00') returns the number 1
For this reason, I wrote two helper functions in the C# application to convert the DateTime "ticks" to match the MATLAB definition of serial date numbers.
First, consider this simple MATLAB function. It expects to receive a numRows-by-numCols cell array containing the table data. In my example, the columns are: Name (string), Price (double), Date (DateTime).
function [] = my_cell_function(C)
    names = C(:,1);
    price = cell2mat(C(:,2));
    dt = datevec( cell2mat(C(:,3)) );
    disp(names)
    disp(price)
    disp(dt)
end
Using deploytool from MATLAB Builder NE, we build the above as a .NET assembly. Next, we create a C# console application and add a reference to the MWArray.dll assembly, in addition to the one generated above. This is the program I am using:
using System;
using System.Data;
using MathWorks.MATLAB.NET.Utility;   // MWArray.dll
using MathWorks.MATLAB.NET.Arrays;    // MWArray.dll
using CellExample;                    // CellExample.dll assembly created

namespace CellExampleTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // get data table
            DataTable table = getData();

            // create the MWCellArray
            int numRows = table.Rows.Count;
            int numCols = table.Columns.Count;
            MWCellArray cell = new MWCellArray(numRows, numCols);    // one-based indices

            // fill it cell-by-cell
            for (int r = 0; r < numRows; r++)
            {
                for (int c = 0; c < numCols; c++)
                {
                    // fill based on type
                    Type t = table.Columns[c].DataType;
                    if (t == typeof(DateTime))
                    {
                        //cell[r+1,c+1] = new MWNumericArray( convertToMATLABDateNum((DateTime)table.Rows[r][c]) );
                        cell[r + 1, c + 1] = convertToMATLABDateNum((DateTime)table.Rows[r][c]);
                    }
                    else if (t == typeof(string))
                    {
                        //cell[r+1,c+1] = new MWCharArray( (string)table.Rows[r][c] );
                        cell[r + 1, c + 1] = (string)table.Rows[r][c];
                    }
                    else
                    {
                        //cell[r+1,c+1] = new MWNumericArray( (double)table.Rows[r][c] );
                        cell[r + 1, c + 1] = (double)table.Rows[r][c];
                    }
                }
            }

            // call MATLAB function
            CellClass obj = new CellClass();
            obj.my_cell_function(cell);

            // Wait for user to exit application
            Console.ReadKey();
        }

        // DateTime <-> datenum helper functions
        static double convertToMATLABDateNum(DateTime dt)
        {
            return (double)dt.AddYears(1).AddDays(1).Ticks / (10000000L * 3600L * 24L);
        }
        static DateTime convertFromMATLABDateNum(double datenum)
        {
            DateTime dt = new DateTime((long)(datenum * (10000000L * 3600L * 24L)));
            return dt.AddYears(-1).AddDays(-1);
        }

        // return DataTable data
        static DataTable getData()
        {
            DataTable table = new DataTable();
            table.Columns.Add("Name", typeof(string));
            table.Columns.Add("Price", typeof(double));
            table.Columns.Add("Date", typeof(DateTime));
            table.Rows.Add("Amro", 25, DateTime.Now);
            table.Rows.Add("Bob", 10, DateTime.Now.AddDays(1));
            table.Rows.Add("Alice", 50, DateTime.Now.AddDays(2));
            return table;
        }
    }
}
The output of this C# program as returned by the compiled MATLAB function:
'Amro'
'Bob'
'Alice'
25
10
50
2011 9 26 20 13 8.3906
2011 9 27 20 13 8.3906
2011 9 28 20 13 8.3906

One option is to skip the serialization process you describe entirely: open up your .NET code directly from MATLAB and have MATLAB query the database through your .NET interface. I have done this repeatedly in our environment with great success. In such an endeavor, NET.addAssembly is your biggest friend.
Details are here:
http://www.mathworks.com/help/matlab/ref/net.addassembly.html
A second option would be to go with MATLAB cell arrays. You can set one up so that the columns are different data types, each column forming a cell. That is a trick MATLAB itself uses in its textscan function. I'd recommend reading the documentation for that function here:
http://www.mathworks.com/help/techdoc/ref/textscan.html
A third option is to use textscan completely. Write a text file out from your .NET code, and let textscan handle the parsing of it. textscan is a very powerful mechanism for getting this kind of data into MATLAB. You can point it at a file, or at a bunch of strings.

I have tried the functions written by @Amro, but the result for certain dates is not correct.
What I tried was:
Create a date in C#
Use the function supplied by @Amro to convert it to a MATLAB date number
Use that number in MATLAB to check its correctness
It seems to have problems with dates of 1 Jan 00:00:00 for some years, e.g. 2014 and 2015. For example,
DateTime dt = new DateTime(2014, 1, 1, 0, 0, 0);
double dtmat = convertToMATLABDateNum(dt);
I got dtmat = 735599.0 from this.
I used it in MATLAB as follows:
datestr(datenum(735599.0))
I got this in return:
ans = 31-Dec-2013
When I tried 1 Jan 2012 it was OK. Any suggestion as to why this happens?

I had the same issue as @Johan.
The problem is that leap years are not handled correctly by the conversion.
To fix it, I changed the DateTime conversion code to the following:
private static long MatlabDateConversionFactor = (10000000L * 3600L * 24L); // .NET ticks per day
private static long tickDiference = 367; // MATLAB datenum of 01-Jan-0001, the .NET tick epoch

public static double convertToMATLABDateNum(DateTime dt)
{
    var converted = ((double)dt.Ticks / (double)MatlabDateConversionFactor);
    return converted + tickDiference;
}

public static DateTime convertFromMATLABDateNum(double datenum)
{
    var ticks = (long)((datenum - tickDiference) * MatlabDateConversionFactor);
    return new DateTime(ticks, DateTimeKind.Utc);
}

Related

Export DataTable C# To Access - Slow Issue - 1 Million rows & 29 columns

I am currently working on a C# project to export a DataTable from C# to an Access .accdb file.
For the export, I am using this function, which comes from another post: Writing large number of records (bulk insert) to Access in .NET/C#
public static void InsertDataIntoAccessTable_Version3(DataTable dtOutData, String DBPath, String TableNm)
{
    DAO.DBEngine dbEngine = new DAO.DBEngine();
    Boolean CheckFl = false;
    DateTime start = DateTime.Now;

    try
    {
        DAO.Database db = dbEngine.OpenDatabase(DBPath);
        DAO.Recordset AccesssRecordset = db.OpenRecordset(TableNm);
        DAO.Field[] AccesssFields = new DAO.Field[dtOutData.Columns.Count];

        //Loop on each row of dtOutData
        for (Int32 rowCounter = 0; rowCounter < dtOutData.Rows.Count; rowCounter++)
        {
            AccesssRecordset.AddNew();
            Console.WriteLine(rowCounter);

            //Loop on column
            for (Int32 colCounter = 0; colCounter < dtOutData.Columns.Count; colCounter++)
            {
                // for the first time... setup the field name.
                if (!CheckFl)
                    AccesssFields[colCounter] = AccesssRecordset.Fields[dtOutData.Columns[colCounter].ColumnName];
                AccesssFields[colCounter].Value = dtOutData.Rows[rowCounter][colCounter];
            }

            AccesssRecordset.Update();
            CheckFl = true;
        }

        AccesssRecordset.Close();
        db.Close();

        double elapsedTimeInSeconds = DateTime.Now.Subtract(start).TotalSeconds;
        Console.WriteLine("Append took {0} seconds", elapsedTimeInSeconds);
    }
    finally
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(dbEngine);
        dbEngine = null;
    }
}
According to the other post, it exported 120,000 rows with 20 columns in 4 seconds. However, I cannot achieve such performance in my case, where the DataTable contains 1 million rows and 29 columns. According to my estimation, it would take more than 20 minutes to finish.
On the other hand, I also came across another method using an adapter, but I cannot implement it yet.
I would like to know whether you have other solutions, suggestions, or advice to perform a much faster export from a C# DataTable to an Access table.
For the technical environment, I am using Visual Studio 2017 32bit and Access 365.
Thank you in advance.

In C#, what is the best way to Parse out and sort by time string?

I am reading and loading in files into Excel using C# VSTO and the filenames are something like this:
C:\myfiles\1000AM.csv
C:\myfiles\1100AM.csv
C:\myfiles\1200PM.csv
C:\myfiles\100PM.csv
C:\myfiles\200PM.csv
And then I am putting these in a list and need to sort them by "time".
How can I convert the strings in the format above into time objects that I can use to sort on?
You need to extract the time parts somehow and then compare them to each other.
You could for example do this using a Comparison<string>. Here is an example that uses the Span<T> type to do this without allocating any additional garbage:
List<string> list = new List<string>() { ... };
list.Sort((a, b) =>
{
    // compare AM/PM
    int compareAmAndPm = a.AsSpan().Slice(a.Length - 6, 2)
        .CompareTo(b.AsSpan().Slice(b.Length - 6, 2), StringComparison.Ordinal);
    if (compareAmAndPm != 0)
        return compareAmAndPm;

    // compare the times as integers
    int index = a.LastIndexOf('\\');
    var firstTime = int.Parse(a.AsSpan().Slice(index + 1, a.Length - index - 7));
    index = b.LastIndexOf('\\');
    var secondTime = int.Parse(b.AsSpan().Slice(index + 1, b.Length - index - 7));
    return firstTime.CompareTo(secondTime);
});
It should give you a result like this:
C:\myfiles\1000AM.csv
C:\myfiles\1100AM.csv
C:\myfiles\100PM.csv
C:\myfiles\200PM.csv
C:\myfiles\1200PM.csv
From practice we figured out that a time or a date on its own does not work in 99% of cases. We need both, plus the timezone, to have any hope of processing them meaningfully.
That is why we only have things like DateTime nowadays. Ideally those file names should contain the full DateTime, in UTC and the invariant culture. If you have the option to change how they are created, use it.
However, if you consistently only have one part, that is not an issue: DateTime simply uses default values for the other two. And as those two will be consistent, they will sort correctly. The only issue will be finding a culture setting that accepts that AM/PM format.
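A sketch along those lines: parse the time portion of each filename with an explicit format and the invariant culture, then sort on the resulting TimeSpan (the "hhmmtt" format string and the padding are assumptions based on the filenames shown above):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

class SortByTimeExample
{
    static void Main()
    {
        var files = new List<string>
        {
            @"C:\myfiles\1000AM.csv", @"C:\myfiles\1100AM.csv", @"C:\myfiles\1200PM.csv",
            @"C:\myfiles\100PM.csv",  @"C:\myfiles\200PM.csv"
        };

        var sorted = files.OrderBy(path =>
        {
            // "100PM" -> "0100PM"; "1000AM" is already six characters
            string name = Path.GetFileNameWithoutExtension(path).PadLeft(6, '0');
            return DateTime.ParseExact(name, "hhmmtt", CultureInfo.InvariantCulture).TimeOfDay;
        }).ToList();

        sorted.ForEach(Console.WriteLine); // chronological: 10AM, 11AM, 12PM, 1PM, 2PM
    }
}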

Excel / C# - Time String Being Interpreted as Double?

I am writing a program that pulls data from a CSV file (which, due to the structure, is easier to work with through Excel). There are columns that hold a date and a time. The date column processes correctly, yet the time column (F) is being interpreted as a double. For example, in the following loop, it sees the value as 0.00 on the first iteration, 0.25 on the second, 0.26041666666666669 on the third, 0.27083333333333331 on the fourth, and so on.
for (i = startRow; i <= endRow; i++)
{
    PeriodSales saleRow = new PeriodSales();
    DateTime saleDate = Convert.ToDateTime((sheet.Cells[i, 5] as Excel.Range).Value);
    var timeString = (sheet.Cells[i, 6] as Excel.Range).Value;
    DateTime timeOfSale;
    timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, 0, 0, 0);

    // the lines below were commented out for testing purposes
    // (so I could see the value of timeString in the loop)
    /* if (timeString != "0")
    {
        String[] timeArray = timeString.Split(':');
        timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, Convert.ToInt32(timeArray[0]), Convert.ToInt32(timeArray[1]), 0);
    }
    else
    {
        timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, 0, 0, 0);
    } */
Attached is a screenshot of my spreadsheet/CSV
The underlying CSV (in Notepad++)
Thanks for any guidance.
This discussion might answer your question:
Capturing Time Values from an Excel Cell
The first answer there may be what you're looking for. There may be some additional C# you're going to need to write to get the conversions you're looking for.
You want to use
timeOfSale = saleDate.Date + DateTime.FromOADate((double)timeString).TimeOfDay;
which converts the value using Excel's date format (a fraction of a day) and combines the resulting time of day with the sale date.
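For context, Excel stores times as fractions of a day (the OLE Automation date format), which is why the loop above sees values like 0.25; FromOADate maps them back to clock times:

TimeSpan t1 = DateTime.FromOADate(0.25).TimeOfDay;                // 06:00:00
TimeSpan t2 = DateTime.FromOADate(0.26041666666666669).TimeOfDay; // 06:15:00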

Pitfalls in C# for a new user. (FWHM calculation)

This is my idea to program a simple math module (a function) that can be called from another main program; it calculates the FWHM (full width at half the maximum) of a curve. Since this is my first try at Visual Studio and C#, I would like to know a few basic programming structures I should learn in C#, coming from a Mathematica background.
Does double fwhm(double[] data, int c) indicate that the input arguments to this function fwhm should be a double array and an integer value? Did I get this right?
I find it difficult to express complex mathematical equations (lines 32/33), i.e. grouping terms in parentheses and dividing one expression by another; what's the right way to do that?
How can I perform mathematical operations, like division, on the elements of an array and store the results in the same array?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace DEV_2
{
    class fwhm
    {
        static double fwhm(double[] data, int c) // data as 2d data and c is integer
        {
            double[] datax;
            double[] datay;
            int L;
            int Mag = 4;
            double PP = 2.2;
            int CI;
            int k;
            double Interp;
            double Tlead;
            double Ttrail;
            double fwhm;

            L = datay.Length;

            // Create datax as index for the number of elemts in data from 1-Length(data).
            for (int i = 1; i <= data.Length; i++)
            {
                datax[i] = (i + 1);
            }

            // Find max in datay and divide all elements by maxValue.
            var m = datay.Length; // Find length of datay
            Array.ForEach(datay, (x) => { datay[m++] = x / datay.Max(); }); // Divide all elements of datay by max(datay)
            double maxValue = datay.Max();
            CI = datay.ToList().IndexOf(maxValue); // Push that index to CI

            // Start to search lead
            int k = 2;
            while (Math.Sign(datay[k]) == Math.Sign(datay[k-1]-0.5))
            {
                k = k + 1;
            }
            Interp = (0.5-datay[k-1])/(datay[k]-datay[k-1]);
            Tlead = datax[k-1]+Interp*(datax[k]-datax[k-1]);
            CI = CI+1;

            // Start search for the trail
            while (Math.Sign(datay[k]-0.5) == Math.Sign(datay[k-1]-0.5) && (k<=L-1))
            {
                k = k + 1;
            }
            if (k != L)
            {
                Interp = (0.5-datay[k-1])/(datay[k]-datay[k-1]);
                Ttrail = datax[k-1] + Interp*(datax[k]-datax[k-1]);
                fwhm = ((Ttrail-Tlead)*PP)/Mag;
            }
        } //end main
    } //end class
} //end namespace
There are plenty of pitfalls in C#, but working through problems is a great way to find and learn them!
Yes, when passing parameters to a method the correct syntax is MethodName(varType varName), separated by commas for multiple parameters. Some pitfalls arise here from the differences between passing value types and reference types. If you're interested, here is some reading on the subject.
Edit: As pointed out in the comments, you should write code as clearly as possible so that it requires as few comments as possible (hence the paragraph between #3 and #4); however, if you need to do very specific and slightly complex math, then you should comment to clarify what is occurring.
If you mean difficulties understanding it, make sure you comment your code properly. If you mean difficulties writing it, you can create intermediate variables to simplify reading your code (though that's generally unnecessary), or look up functions or libraries to help you. This is a bit of an open-ended question; if there is particular functionality you are looking for, perhaps we could be of more help.
You can access your array via indexes: array[i] gets the element at index i. You can then manipulate the data at that index any way you wish, e.g. array[i] = Math.Pow(array[i] / 24, 3) or array[i] = doMath(array[i]).
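For instance, a small sketch of question 3 applied to the normalisation the asker attempts with Array.ForEach (dividing every element of datay by its maximum):

double max = datay.Max();            // requires "using System.Linq;"
for (int i = 0; i < datay.Length; i++)
{
    datay[i] /= max;                 // store the result back in the same array
}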
A couple of things you can do to clean up a little (though these are preference-based): don't declare int CI; int k; in your code before you initialize them with int k = 2;, since there is no need (although you can if it helps you). The other thing is to name your variables well; common practice is descriptive camelCase naming, so instead of int CI = datay.ToList().IndexOf(maxValue); you could use int indexMaxValueYData = datay.ToList().IndexOf(maxValue);
As for your comment question "What would this method return?": the method will return a double, as declared (returnType methodName(parameters)). However, you need to add that to your code; as of now I see no return statement. It would be something like return doubleVar;, where doubleVar is a variable of type double.

Best Way to Check for Used Key with Nhibernate?

On my site I allow people to buy subscriptions in bulk (I call them vouchers). Once they have these vouchers, they give them to whomever, and the recipient enters that code into their account to upgrade it.
Right now I am thinking of using a 4-character alphanumeric code (upper case, lower case and digits), generated with something like this:
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[4];
var random = new Random();

for (int i = 0; i < stringChars.Length; i++)
{
    stringChars[i] = chars[random.Next(chars.Length)];
}

var finalString = new String(stringChars);
For now I think that will give me more than enough combinations, and if I ever do run out I can always increase the length of the code. I want to keep it short because I don't want the user to have to type in huge numbers.
I also don't have the time to make a more elegant solution, maybe one where they click a link in their email and it activates their account; of course this would also cut down on someone trying to randomly guess a voucher number.
These are things I would deal with if the site ever becomes more popular.
I am wondering, though, how I can handle the possible generation of duplicate vouchers. My first thought was to check the database each time a voucher is created and, if it already exists, make a new one.
However, that seems like it could be slow. So I also thought about getting all the keys first, storing them in memory and checking there, but if the list keeps growing I might run into out-of-memory exceptions and all that great stuff.
So does anyone have any ideas? Or am I stuck doing one of the two methods I listed above?
I am using NHibernate, ASP.NET MVC and C#.
Edit
static void Main(string[] args)
{
    List<string> hold = new List<string>();
    for (int i = 0; i < 10000; i++)
    {
        HashAlgorithm sha = new SHA1CryptoServiceProvider();
        byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
        string hex = null;
        foreach (byte x in result)
        {
            hex += String.Format("{0:x2}", x);
        }
        hold.Add(hex.Substring(0, 3));
        Console.WriteLine(hex.Substring(0, 4));
    }
    Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}
Above is my attempt at using hashing. However, I think I am missing something, as it seems to produce quite a few more duplicates than expected.
Edit 2
I think I added what I was missing, but I'm not sure if this is exactly what he meant. I am also not sure what to do in the situation where I have moved it as far as I can move it (my hash seems to give me a length of 40, so that's how far I can shift).
static void Main(string[] args)
{
    int subStringLength = 4;
    List<string> hold = new List<string>();
    for (int i = 0; i < 10000; i++)
    {
        SHA1CryptoServiceProvider sha = new SHA1CryptoServiceProvider();
        byte[] result = sha.ComputeHash(BitConverter.GetBytes(i));
        string hex = null;
        foreach (byte x in result)
        {
            hex += String.Format("{0:x2}", x);
        }
        int startingPositon = 0;
        string possibleVoucherCode = hex.Substring(startingPositon, subStringLength);
        string voucherCode = Move(subStringLength, hold, hex, startingPositon, possibleVoucherCode);
        hold.Add(voucherCode);
    }
    Console.WriteLine("Number of Distinct values {0}", hold.Distinct().Count());
}

private static string Move(int subStringLength, List<string> hold, string hex, int startingPositon, string possibleVoucherCode)
{
    if (hold.Contains(possibleVoucherCode))
    {
        int newPosition = startingPositon + 1;
        if (newPosition <= hex.Length)
        {
            if ((newPosition + subStringLength) > hex.Length)
            {
                possibleVoucherCode = hex.Substring(newPosition, subStringLength);
                return Move(subStringLength, hold, hex, newPosition, possibleVoucherCode);
            }
            // return something
            return "0";
        }
        else
        {
            // return something
            return "0";
        }
    }
    else
    {
        return possibleVoucherCode;
    }
}
It is going to be slow because you want to generate the vouchers randomly and then check the database for every generated code.
I would create a table vouchers with an id, the code and an is_used column. I would fill that table once with enough random codes. Since this can be done in a separate process, the performance won't be such a big problem. Let it run in the evening and the next day you get a fully filled vouchers-table.
If you want to prevent generating duplicate vouchers, that won't be a problem. You can generate them anyway and either put them in a System.Collections.Generic.HashSet (which silently prevents adding duplicates, without throwing an exception) or call the LINQ method Distinct() before adding them to that vouchers table.
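A minimal sketch of that pre-generation idea (the target count of 100,000 is arbitrary, and the character set is the one from the question):

// Pre-generate a pool of unique voucher codes; HashSet<T>.Add silently ignores duplicates.
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var random = new Random();
var codes = new HashSet<string>();

while (codes.Count < 100000)
{
    var stringChars = new char[4];
    for (int i = 0; i < stringChars.Length; i++)
        stringChars[i] = chars[random.Next(chars.Length)];

    codes.Add(new string(stringChars)); // returns false for a duplicate, no exception
}
// ...then bulk-insert the contents of `codes` into the vouchers table in a single pass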
If you insist on short codes:
Use a GUID as a primary key, and generate one random number. How you might want to translate this into alphanumerics is up to you.
Use the last byte or two of the GUID plus the random number, e.g. 1234-684687. This should make it slightly less easy to brute-force coupons. And handle any (rare) collisions with an exception.
An easy way to shorten an int is to change its base (from 10 to 62). (In VB, and this is old code.)
This yields "2lkCB1" when given Int32.MaxValue:
''//given intValue as your random integer
Dim result As String = String.Empty
Dim digits As String = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Dim x As Integer

While (intValue > 0)
    x = intValue Mod digits.Length
    result = digits(x) & result
    intValue = intValue - x
    intValue = intValue \ digits.Length
End While

Return result
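For anyone working in C# rather than VB, a rough equivalent of the conversion above might look like this (same digit alphabet, same remainder loop):

static string ToBase62(int intValue)
{
    const string digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    string result = string.Empty;

    while (intValue > 0)
    {
        int x = intValue % digits.Length;      // next base-62 digit
        result = digits[x] + result;
        intValue = (intValue - x) / digits.Length;
    }
    return result;
}

// ToBase62(int.MaxValue) == "2lkCB1"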
But now we're already answering more than one question.
For a bulk data operation like this, I would recommend not using NHibernate and just doing straight ADO.NET.
Batch Check
Since you anticipate generating big batches of codes at once, you should batch multiple code checks into a single round-trip to the database. If you're using SQL Server 2008 or higher, you could do this using table-valued parameters, checking a whole list of codes at once.
SELECT DISTINCT b.Code
FROM @batch b
WHERE NOT EXISTS (
    SELECT v.Code
    FROM dbo.Voucher v
    WHERE v.Code = b.Code
);
Concurrency
Now, what about concurrency issues? What if two users generate the same code at roughly the same time? Or simply in-between the time when we check the code for uniqueness and when we insert it into the Voucher table?
We can take care of that by modifying the query as follows:
DECLARE @batchid uniqueidentifier;
SET @batchid = NEWID();

INSERT INTO dbo.Voucher (Code, BatchId)
SELECT DISTINCT b.Code, @batchid
FROM @batch b
WHERE NOT EXISTS (
    SELECT Code
    FROM dbo.Voucher v
    WHERE b.Code = v.Code
);

SELECT Code
FROM dbo.Voucher
WHERE BatchId = @batchid;
Executing via .NET
Assuming that you have defined the following table-valued user type...
CREATE TYPE dbo.VoucherCodeList AS TABLE (
    Code nvarchar(8) COLLATE SQL_Latin1_General_CP1_CS_AS NOT NULL
    /* !!! Remember to specify the collation on your Voucher.Code column too,
       since you want upper and lower-case codes. */
);
... you could execute this query via .NET code like this:
public ICollection<string> GenerateCodes(int numberOfCodes)
{
    var result = new List<string>(numberOfCodes);
    while (result.Count < numberOfCodes)
    {
        var batchSize = Math.Min(_batchSize, numberOfCodes - result.Count);
        var batch = Enumerable.Range(0, batchSize)
                              .Select(x => GenerateRandomCode());
        var oldResultCount = result.Count;
        result.AddRange(FilterAndSecureBatch(batch));

        var filteredBatchSize = result.Count - oldResultCount;
        var collisionRatio = ((double)batchSize - filteredBatchSize) / batchSize;

        // Automatically increment length of random codes if collisions begin happening too frequently
        if (collisionRatio > _collisionThreshold)
            CodeLength++;
    }
    return result;
}

private IEnumerable<string> FilterAndSecureBatch(IEnumerable<string> batch)
{
    using (var command = _connection.CreateCommand())
    {
        command.CommandText = _sqlQuery; // the concurrency-safe query listed above

        var metaData = new[] { new SqlMetaData("Code", SqlDbType.NVarChar, 8) };
        var param = command.Parameters.Add("@batch", SqlDbType.Structured);
        param.TypeName = "dbo.VoucherCodeList";
        param.Value = batch.Select(x =>
        {
            var record = new SqlDataRecord(metaData);
            record.SetString(0, x);
            return record;
        });

        using (var reader = command.ExecuteReader())
            while (reader.Read())
                yield return reader.GetString(0);
    }
}
Performance
After implementing all of this (and moving the command and parameter creation out of the loop so it would be re-used between batches), I was able to insert 10,000 codes using a batch size of 500 consistently in approx. 0.5 to 2 seconds, or 5 to 20 codes per millisecond.
Code Density / Collisions / Guessability
The _collisionThreshold field limits the density of your codes. It's a value between 0 and 1. Actually, it must be less than 1 or else you would wind up in an infinite loop when the 4 digit codes were exhausted (probably should add an assertion for this in code). I would recommend never turning it above 0.5 for performance reasons. More than 50% collisions would mean it's spending more time testing already-used codes than actually generating new ones.
Keeping the collision threshold low is how you would control how hard-to-guess your codes are. Setting _collisionThreshold to 0.01 would generate codes such that there's approximately a 1% chance of someone guessing a code.
If collisions occur too frequently, CodeLength (which is used by the GenerateRandomCode() method) will be incremented. This value needs to be persisted somewhere. After executing GenerateCodes(), check CodeLength to see if it has changed and then save the new value.
Source Code
The full code is available here: https://gist.github.com/3217856. I am the author of this code, and am releasing it under the MIT license. I had fun with this little challenge, and also got to learn how to pass a table-valued parameter to an inline parametrized query. I hadn't ever done that before. I've only ever passed them to full-fledged stored procedures.
A possible solution for you is this:
Find the maximum ID of a voucher (an integer). Then run any hash function on it, take the first 32 bits, and convert them to the string you want to show the user (or use a 32-bit hash function such as the Jenkins hash function). This will probably work; hash collisions are pretty rare. But this solution is very similar to yours on the question of randomness.
You could run a test which finds the first 10 or 100 collisions (this should be enough for you) and forces the algorithm to "skip" them and use a different starting value. Then you don't need to check the database at all (well, at least until you reach about 4,294,967,296 vouchers...).
How about utilizing NHibernate's HiLo algorithm?
Here is an example of how you can get the next value (without DB access).
