Efficient way to do multi-threaded calls to SQL Server? - C#

I have a SQL stored procedure that returns the TOP 1000 records from a table that functions like a queue; the table holds roughly 30,000-40,000 records. A call to the SP takes ~4 seconds (there's an XML column), so working through all the records takes ~2 minutes.
I thought about using multi-threaded calls and inserting the records into a synchronized dictionary/list.
Has anyone done this before? Is there an efficient way to finish the calls as quickly as possible?
Thanks...

Consider optimizing the query before resorting to threads.
In my experience, when people new to multi-threading add threads, it usually does not improve performance. Worse, it usually introduces subtle errors that can be difficult to debug.
Optimize the query first, and you may find that you don't need threads.
Even if you implemented them, eventually you'll have SQL Server doing too much work, and the threaded requests will simply have to wait.

A basic mistake is to insert into the database from multiple threads, overloading the server with connections and locks and eventually bringing it to its knees.
If you are READING the data, you will do much better by finding a query that performs faster and fetching as much data as you can at once.
To me it seems like your problem is not solvable at this level - maybe if you elaborate on what you want to do, you'll get better advice.
EDIT:
I did use SQL as a queue once - and I just remembered: to dequeue, you'll have to use the result of the first query as input to the second, so threads are out of the question. Or you'll have to MARK your queued data as 'done' in the database, and your READ becomes an UPDATE - resulting in locking.
If you are reading, and you want to react as soon as possible, you can use a DataReader, read ALL of the data, and chunk your processing into threads - read 100 records, fork a thread and pass them to it... then the next records, and so on. That way you'll be able to balance your resource usage.
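To illustrate, here is a minimal sketch of that approach, assuming a made-up connection string, query, and ProcessBatch method (none of these come from the question): it streams rows with a single SqlDataReader and hands them to worker tasks in batches of 100.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

static void ReadAndProcessInChunks(string connectionString)
{
    var tasks = new List<Task>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT TOP 1000 Id, Payload FROM dbo.WorkQueue", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            var batch = new List<Tuple<int, string>>(100);
            while (reader.Read())
            {
                batch.Add(Tuple.Create(reader.GetInt32(0), reader.GetString(1)));
                if (batch.Count == 100)
                {
                    var work = batch;                          // hand the full batch to a worker
                    batch = new List<Tuple<int, string>>(100);
                    tasks.Add(Task.Run(() => ProcessBatch(work)));
                }
            }
            if (batch.Count > 0)
                tasks.Add(Task.Run(() => ProcessBatch(batch)));
        }
    }
    Task.WaitAll(tasks.ToArray());                             // wait for every chunk to finish
}

static void ProcessBatch(IList<Tuple<int, string>> rows)
{
    // application-specific processing of up to 100 rows
}
This keeps one reader streaming from SQL Server while the CPU-bound processing is spread across the thread pool.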

Try reading the data asynchronously using a DataReader; fetch the columns that can uniquely identify the row in the database. Populate a queue holding the returned data (as a custom object) and run worker threads to perform the task against the queue.
You have to decide how many worker threads to use, since threads have their own overhead and, if not implemented correctly, can be a nightmare.

If you really have to, you can start BackgroundWorkers that individually make connections to the server and report back with their progress.
I did the same thing for an elaborate export/import application moving roughly 50 GB of data (4 GB DeflateStream'ed), except I only used the BackgroundWorker to do the work consecutively, not concurrently, without locking up the UI thread.
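For what it's worth, a minimal sketch of such a worker; the query, the connectionString field, and the statusLabel control are illustrative assumptions, not from the original application.
using System.ComponentModel;
using System.Data.SqlClient;

private void StartBackgroundFetch()
{
    var worker = new BackgroundWorker { WorkerReportsProgress = true };

    worker.DoWork += (s, e) =>
    {
        // The worker owns its own connection and does the heavy work off the UI thread.
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT TOP 1000 Id, Payload FROM dbo.WorkQueue", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                int processed = 0;
                while (reader.Read())
                {
                    // ... process the row ...
                    processed++;
                    if (processed % 100 == 0)
                        worker.ReportProgress(processed);      // marshalled back to the UI thread
                }
            }
        }
    };

    worker.ProgressChanged += (s, e) =>
        statusLabel.Text = e.ProgressPercentage + " rows processed";

    worker.RunWorkerAsync();
}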

It isn't clear whether you're selecting the 1000 most recently added rows or the 1000 rows with the highest value in a particular column, nor whether your rows are mutable - i.e. a row might qualify for the top 1000 yesterday but then get updated so that it no longer qualifies today. But if the individual rows are not mutable, you could keep a separate TOP1000 table: when the 1001st row is inserted into it, an after-insert trigger moves the overflow row (however you determine that row) to a HISTORY table. That would make the selection virtually instantaneous: select * from TOP1000. When you need to query TOP1000 and HISTORY as though they were one table, simply combine them with a UNION. Or, instead of a trigger, you could wrap the insert and the 1001st-row delete in a transaction.
It's a different can of worms, though, if the rows mutate and can move in and out of the top 1000.

public struct BillingData
{
public int CustomerTrackID, CustomerID;
public DateTime BillingDate;
}
public Queue<BillingData> customerQueue = new Queue<BillingData>();
volatile static int ThreadProcessCount = 0;
readonly static object threadprocesslock = new object();
readonly static object queuelock = new object();
readonly static object countlock = new object();
AsyncCallback asyncCallback;
// Pulling the data asynchronously from the database
private void StartProcess()
{
SqlCommand command = SQLHelper.GetCommand("GetRecordsByBillingTrackID");
command.Connection = SQLHelper.GetConnection("Con");
SQLHelper.DeriveParameters(command);
command.Parameters["@TrackID"].Value = trackid;
asyncCallback = new AsyncCallback(FetchData);
command.BeginExecuteXmlReader(asyncCallback, command);
}
public void FetchData(IAsyncResult c1)
{
SqlCommand comm1 = (SqlCommand)c1.AsyncState;
System.Xml.XmlReader xr = comm1.EndExecuteXmlReader(c1);
xr.Read();
string data = "";
while (!xr.EOF)
{
data = xr.ReadOuterXml();
XmlDocument dom = new XmlDocument();
dom.LoadXml("<data>" + data + "</data>");
BillingData billingData;
billingData.CustomerTrackID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerTrackID"].Value);
billingData.CustomerID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerID"].Value);
billingData.BillingDate = Convert.ToDateTime(dom.FirstChild.ChildNodes[0].Attributes["BillingDate"].Value);
lock (queuelock)
{
if (!customerQueue.Contains(billingData))
{
customerQueue.Enqueue(billingData);
}
}
AssignThreadProcessToTheCustomer();
}
xr.Close();
}
// Assign the Threads based on the data pulled
private void AssignThreadProcessToTheCustomer()
{
int TotalProcessThreads = 5;
int TotalCustomersPerThread = 5;
if (ThreadProcessCount < TotalProcessThreads)
{
int ThreadsNeeded = (customerQueue.Count % TotalCustomersPerThread == 0) ? (customerQueue.Count / TotalCustomersPerThread) : (customerQueue.Count / TotalCustomersPerThread + 1);
int count = 0;
if (ThreadsNeeded > ThreadProcessCount)
{
count = ThreadsNeeded - ThreadProcessCount;
if ((count + ThreadProcessCount) > TotalProcessThreads)
count = TotalProcessThreads - ThreadProcessCount;
}
for (int i = 0; i < count; i++)
{
ThreadProcess objThreadProcess = new ThreadProcess(this);
ThreadPool.QueueUserWorkItem(objThreadProcess.BillingEngineThreadPoolCallBack, count);
lock (threadprocesslock)
{
ThreadProcessCount++;
}
}
}
}
public void BillingEngineThreadPoolCallBack(object threadContext)
{
BillingData? billingData = null;
while (true)
{
lock (queuelock)
{
billingData = ProcessCustomerQueue();
}
if (billingData != null)
{
StartBilling(billingData.Value);
}
else
break;
}
// More....
}
}

How can I make this run faster?

Can I do threads instead of tasks to make this run faster?
I'm trying to get 114,000 products into the database. As my code is right now, I get about 100 products into the database per minute.
My tasks (producers) each scrape an XML file containing one product's data, package it in the Product class, then queue it for the consumer.
My consumer takes each product from the queue and puts it into the database one at a time. I use Entity Framework, so it's not safe for threading.
public static void GetAllProductsFromIndexes_AndPutInDB(List<IndexModel> indexes, ProductContext context)
{
BlockingCollection<IndexModel> inputQueue = CreateInputQueue(indexes);
BlockingCollection<Product> productsQueue = new BlockingCollection<Product>(5000);
var consumer = Task.Run(() =>
{
foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
{
InsertProductInDB(readyProduct, context);
}
});
var producers = Enumerable.Range(0, 100)
.Select(_ => Task.Run(() =>
{
foreach (IndexModel index in inputQueue.GetConsumingEnumerable())
{
Product product = new Product();
byte[] unconvertedByteArray;
string xml;
string url = @"https://data.Icecat.biz/export/freexml.int/en/";
unconvertedByteArray = DownloadIcecatFile(index.IndexNumber.ToString() + ".xml", url);
xml = Encoding.UTF8.GetString(unconvertedByteArray);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
GetProductDetails(product, xmlDoc, index);
XmlNodeList nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductFeature"));
product.FeaturesLink = GetProductFeatures(product, nodeList);
nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductGallery/ProductPicture"));
product.Images = GetProductImages(nodeList);
productsQueue.Add(product);
}
})).ToArray();
Task.WaitAll(producers);
productsQueue.CompleteAdding();
consumer.Wait();
}
A couple of things you must do.
Detach each Product entity after you insert it, or they will all accumulate in the Change Tracker.
Don't call SaveChanges after every product. Batch up a hundred or so. Like this:
var consumer = Task.Run(() =>
{
var batch = new List<Product>();
foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
{
batch.Add(readyProduct);
if (batch.Count >= 100)
{
context.Products.AddRange(batch);
context.SaveChanges();
foreach (var p in batch)
{
context.Entry(p).State = EntityState.Detached;
}
batch.Clear();
}
}
context.Products.AddRange(batch);
context.SaveChanges();
foreach (var p in batch)
{
context.Entry(p).State = EntityState.Detached;
}
});
If you're on EF Core and your provider supports it (like SQL Server), you'll even get statement batching. You should expect several hundred rows per second using the basic best practices here. If you need more than that, you can switch to a bulk load API (like SqlBulkCopy for SQL Server).
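If you do reach for SqlBulkCopy, a minimal sketch looks roughly like this; the destination table and column names are illustrative:
using System.Data;
using System.Data.SqlClient;

static void BulkInsertProducts(string connectionString, DataTable products)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Products";
        bulk.BatchSize = 5000;
        bulk.ColumnMappings.Add("Name", "Name");       // map source columns to destination columns
        bulk.ColumnMappings.Add("Price", "Price");
        bulk.WriteToServer(products);
    }
}
You would build the DataTable on the producer side instead of going through the EF change tracker at all.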
First, read the speed rant to make sure this is even worth investigating.
Can I do threads instead of tasks to make this run faster?
Extremely unlikely. Multithreading has been used as a cheap way to implement multitasking for a while, but it is only technically useful if the task is CPU-bound. You are doing a DB operation, which will be network-bound - more likely DB-bound (databases add their own bottlenecks as part of their reliability and concurrency-issue prevention).
I'm trying to get 114000 products into the db.
Then your best bet is not trying to do that in code. Every DBMS worth its memory footprint has bulk insert options. Doing that in C# code will just make it slower and less reliable.
At best you add the network load of sending the data to the DB to the whole operation. At worst, you make it even slower than that. It is one of the most common mistakes with DBs: thinking you can beat the DBMS's performance with code. It will not work.

How to read a large IEnumerable<> fast

I have the following code returning 22,000,000 records from the database pretty quickly:
var records = from row in dataContext.LogicalMapTable
select
new
{
row.FwId,
row.LpDefId,
row.FlDefMapID
};
The code following the database call above takes over 60 seconds to run:
var cache = new Dictionary<string, int>();
foreach (var record in records)
{
var tempHashCode = record.FwId + "." + record.LpDefId;
cache.Add(tempHashCode, record.FlDefMapID);
}
return cache;
Is there a better way to do this to improve performance?
The second part of your code is not slow; it just triggers LINQ query evaluation. You can see this by consuming your query earlier, for example:
var records = (from row in dataContext.LogicalMapTable
select
new
{
row.FwId,
row.LpDefId,
row.FlDefMapID
}).ToList();
So it is your LINQ query that is slow, and here is how you can fix it.
You probably don't need 22M records cached in memory. Things you can try:
Pagination (Skip/Take) - see the sketch after this list
Change queries to include specific ids or other filters, e.g. before: select * ..., after: select * ... where id in (1,2,3) ...
Do most of the analytic work in the database; it's fast and doesn't take up your app's memory
Prefer queries that bring back small batches of data fast. You can run several of these concurrently to update different bits of your UI
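For the pagination point, a minimal sketch over the same query, assuming FlDefMapID can serve as a stable sort key (Skip/Take need a deterministic order):
const int pageSize = 10000;
int pageIndex = 0;
while (true)
{
    var page = dataContext.LogicalMapTable
        .OrderBy(row => row.FlDefMapID)            // a stable order is required for paging
        .Select(row => new { row.FwId, row.LpDefId, row.FlDefMapID })
        .Skip(pageIndex * pageSize)
        .Take(pageSize)
        .ToList();

    if (page.Count == 0)
        break;

    // ... process this batch (e.g. add it to the cache dictionary) ...
    pageIndex++;
}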
As others have mentioned in the comments, reading the entire list like that is very inefficient.
Based on the code you posted, I am assuming that after the list is loaded into your "cache", you look up the FirmwareLogicalDefinitionMapID using the key FirmwareVersionID + "." + LogicalParameterDefinitionID.
My suggestion to improve overall performance and memory usage is to implement an actual caching pattern, something like this:
public static class CacheHelper
{
public static readonly object _SyncLock = new object();
public static readonly MemoryCache _MemoryCache = MemoryCache.Default;
public static int GetFirmwareLogicalDefinitionMapID(int firmwareVersionID, int logicalParameterDefinitionID)
{
int result = -1;
// Build up the cache key
string cacheKey = string.Format("{0}.{1}", firmwareVersionID, logicalParameterDefinitionID);
// Check if the object is in the cache already
if(_MemoryCache.Contains(cacheKey))
{
// It is, so read it and type cast it
object cacheObject = _MemoryCache[cacheKey];
if(cacheObject is int)
{
result = (int)cacheObject;
}
}
else
{
// The object is not in the cache, so acquire a sync lock for thread safety
lock(_SyncLock)
{
// Double check that the object hasn't been put into the cache by another thread.
if(!_MemoryCache.Contains(cacheKey))
{
// Still not there, now Query the database
result = (from i in dataContext.LogicalMapTable
where i.FwId == firmwareVersionID && i.LpDefId == logicalParameterDefinitionID
select i.FlDefMapID).FirstOrDefault();
// Add the results to the cache so that the next operation that asks for this object can read it from ram
_MemoryCache.Add(new CacheItem(cacheKey, result), new CacheItemPolicy() { SlidingExpiration = new TimeSpan(0, 5, 0) });
}
else
{
// We lost a concurrency race to read the object from the source; it's in the cache now, so read it from there.
object cacheObject = _MemoryCache[cacheKey];
if(cacheObject is int)
{
result = (int)cacheObject;
}
}
}
}
// return the results
return result;
}
}
You should also read up on the .Net MemoryCache: http://www.codeproject.com/Articles/290935/Using-MemoryCache-in-Net-4-0
Hope this helps!

Most efficient multi-threaded object duplicate filtering (using custom key)?

I'm writing a service client that retrieves huge delimited strings containing individual records from a remote service. Due to the size of these strings, I'm dividing the remote service calls into chunks (date ranges) and looping over the date ranges in parallel to call the remote service and parse the data. The problem is, 50%+ of the records are duplicates, so I want to filter those out...
Here's my current approach:
// We want to filter out duplicate markets by using the MarketId field...
HashSet<ParsedMarketData> exchangeFixtures =
new HashSet<ParsedMarketData>(
new GenericEqualityComparer<ParsedMarketData, int>(pmd => pmd.MarketId));
DateTime[][] splitTimes =
SplitDateRange(startDate, endDate, TimeSpan.FromDays(1));
// Effectively a Tasks.Parallel.ForEach call...
_parallel.ForEach(splitTimes, startEndTime =>
{
DateTime start = startEndTime[0];
DateTime end = startEndTime[1];
string marketDataString = remoteServiceProxy.GetMarketData(start, end);
IEnumerable<ParsedMarketData> rows =
_marketDataParser.ParseMarketData(marketDataString);
foreach (ParsedMarketData marketDataRow in rows)
{
lock (_syncObj)
{
// Ignore the return value as we don't care
// if it gets added or not...
exchangeFixtures.Add(marketDataRow);
}
}
});
Fundamentally, is a locked data structure (that finds duplicates) the most efficient approach to this problem, or can it be improved?
It is probably worth knowing that the majority (95%+) of the 'duplicate' items occur within each time bracket. I.e. if we're retrieving "Day A" and "Day B" in parallel, there won't be many (or any) duplicates between Day A and Day B (but there will be many within each day - and in my solution, within each thread).
You'll need to tune your code to take advantage of the concurrency opportunities in the data and the service. It sounds like one thread per day could be an option.
Actually seeing an improvement ought to be rare. Multiple threads buy you more CPU cycles, not more Internet connections, network cards, or service machines. The odds are high that just two threads are optimal: one to get the data from the service, another to process it, allowing those two operations to overlap, with a thread-safe producer/consumer queue between them. You would only get a benefit from more threads if the processing thread requires more time than the data retrieval thread. That is also a scenario that is easy to profile: you can speed up the processing but not the retrieval. You don't even need a profiler for a first estimate - if the data processing thread doesn't burn 100% of a core, then you're done.
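A minimal sketch of that two-thread pipeline, reusing the names from the question (splitTimes, remoteServiceProxy, _marketDataParser, exchangeFixtures); the bounded capacity of 10 is an arbitrary choice:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

void RetrieveAndProcess(DateTime[][] splitTimes)
{
    var queue = new BlockingCollection<string>(boundedCapacity: 10);

    // Retrieval thread: fetch the raw market-data strings one date range at a time.
    var retriever = Task.Run(() =>
    {
        foreach (var range in splitTimes)
            queue.Add(remoteServiceProxy.GetMarketData(range[0], range[1]));
        queue.CompleteAdding();
    });

    // Processing thread: parse and de-duplicate while the next download is in flight.
    var processor = Task.Run(() =>
    {
        foreach (var raw in queue.GetConsumingEnumerable())
            foreach (var row in _marketDataParser.ParseMarketData(raw))
                exchangeFixtures.Add(row);         // HashSet.Add silently ignores duplicates
    });

    Task.WaitAll(retriever, processor);
}
With only one consumer touching the HashSet, the lock in the original loop goes away entirely.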
Given the data, and the high chance of duplicates within each thread (and the low chance of duplicates between threads), I decided to go with the following solution. It allows each thread to do its work without impediment from locks, and does a little filtering at the end on the caller thread to ensure the filtering is done correctly.
It also has the added benefit that the order in which the objects are returned from the service calls (date order) is maintained across threads, so there is no need to sort at the end.
public IEnumerable<Stuff> GetStuffs(DateTime startDate, DateTime endDate)
{
if (startDate >= endDate)
throw new ArgumentException("startDate must be before endDate", "startDate");
IDateRange dateRange = new DateRange(startDate, endDate);
IDateRange[] dateRanges = _dateRangeSplitter.DivideRange(dateRange, TimeSpan.FromDays(1)).ToArray();
IEnumerable<Stuff>[] resultCollections = new IEnumerable<Stuff>[dateRanges.Length];
_parallel.For(0, dateRanges.Length, i =>
{
IDateRange splitRange = dateRanges[i];
IEnumerable<Stuff> stuffs = GetMarketStuffs(splitRange);
resultCollections[i] = stuffs;
});
Stuff[] marketStuffs = resultCollections.SelectMany(ef => ef).Distinct(new GenericEqualityComparer<Stuff, int>(ef => ef.EventId)).ToArray();
return marketStuffs;
}
private IEnumerable<Stuff> GetMarketStuffs(IDateRange splitRange)
{
IList<Stuff> stuffs = new List<Stuff>();
HashSet<int> uniqueStuffIds = new HashSet<int>();
string marketStuffString = _slowStuffStringProvider.GetMarketStuffs(splitRange.Start, splitRange.End);
IEnumerable<ParsedStuff> rows = _stuffParser.ParseStuffString(marketStuffString);
foreach (ParsedStuff parsedStuff in rows)
{
if (!uniqueStuffIds.Add(parsedStuff.EventId))
{
continue;
}
stuffs.Add(new Stuff(parsedStuff));
}
return stuffs;
}

Would there be any performance difference between looping over every row of a dataset and the same dataset in list form?

I need to loop over every row of a dataset 100k times.
This dataset contains one primary key and one other string column. The dataset has 600k rows.
So at the moment I am looping like this:
for (int i = 0; i < dsProductNameInfo.Tables[0].Rows.Count; i++)
{
for (int k = 0; k < dsFull.Tables[0].Rows.Count; k++)
{
}
}
Now dsProductNameInfo has 100k rows and dsFull has 600k rows. Should I convert dsFull to a key-value-paired string list and loop over that, or would there not be any speed difference?
Which solution would work fastest?
Thank you.
C# 4.0 WPF application
In the exact scenario you mentioned, the performance would be the same, except that converting to the list would take some time and make the list approach slower. You can easily find out by writing a unit test and timing it.
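For example, a quick Stopwatch comparison; the two methods named here are hypothetical stand-ins for the DataSet-based and list-based versions:
using System;
using System.Diagnostics;

var sw = Stopwatch.StartNew();
RunNestedLoopsOverDataSets();          // hypothetical: the DataSet version
sw.Stop();
Console.WriteLine("DataSet loop: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
RunNestedLoopsOverLists();             // hypothetical: the converted-list version
sw.Stop();
Console.WriteLine("List loop:    {0} ms", sw.ElapsedMilliseconds);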
I would think it'd be best to do this:
// create a class for each type of object you're going to be dealing with
public class ProductNameInformation { ... }
public class Product { ... }
// load a list from a SqlDataReader (much faster than loading a DataSet)
List<Product> products = GetProductsUsingSqlDataReader(); // don't actually call it that :)
// The only thing I can think of where DataSets are better is indexing certain columns.
// So if you have indices, just emulate them with a hashtable:
Dictionary<string, Product> index1 = products.ToDictionary( ... );
Here are references to the SqlDataReader and ToDictionary concepts that you may or may not be familiar with.
The real question is, why isn't this kind of heavy processing done at the database layer? SQL servers are much more optimized for this type of work. Also, you may not actually have to do this - why don't you post the original problem, and maybe we can help you optimize deeper?
HTH
There might be quite a few things that could be optimized that are not related to the looping. E.g. reducing the number of iterations would yield a lot: at present the body of the inner loop is executed 100k * 600k times, so eliminating one iteration of the outer loop eliminates 600k iterations of the inner (or you might be able to switch the inner and outer loops if it's easier to remove iterations from the inner loop).
One thing that you can do in any case is index only once for each table:
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullRows = dsFull.Tables[0].Rows;
var fullCount = fullRows.Count;
for (int i = 0; i < productInfoCount; i++)
{
for (int k = 0; k < fullCount; k++)
{
}
}
Inside the loops you'd get at the rows with productNameInfoRows[i] and fullRows[k], which is faster than using the long-hand form. I'm guessing there might be more to gain from optimizing the body than from the way you are looping over the collection - unless of course you have already profiled the code and found the actual looping to be the bottleneck.
EDIT: After reading your comment to Marc about what you are trying to accomplish, here's a go at how you could do this. It's worth noting that the algorithm below is probabilistic: there's a 1 in 2^32 chance of two words being seen as equal without actually being equal. It is, however, a lot faster than comparing strings.
The code assumes that the first column is the one you are comparing.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
var full = new List<int[]>(fullCount);
for (int i = 0; i < productInfoCount; i++){
//we're going to compare hash codes and not strings
var prd = productNameInfoRows[i][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray();
for (int k = 0; k < fullCount; k++){
//caches the calculation for all subsequent iterations of the outer loop
if (i == 0) {
full.Add(fullRows[k][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray());
}
var fl = full[k];
var count = 0;
for(var j = 0;j<fl.Length;j++){
var f = fl[j];
//the values are sorted so we can exit early
for(var m = 0;m<prd.Length && prd[m] <= f;m++){
count += prd[m] == f ? 1 : 0;
}
}
if((double)(fl.Length + prd.Length)/count >= 0.6){
//there's a match
}
}
}
EDIT: Your comment motivated me to give it another try. The code below could have fewer iterations - could, because it depends on the number of matches and the number of unique words. A lot of unique words with a lot of matches for each (which would require a LOT of words per column) could potentially yield more iterations. However, under the assumption that each row has few words, this should yield substantially fewer iterations. Your code has N*M complexity; this has N + M + (productInfoMatches * fullMatches). In other words, the latter term would have to approach 100k * 600k for this to have more iterations than yours.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
//Create a list of the words from the product info
var lists = new Dictionary<int, Tuple<List<int>, List<int>>>(productInfoCount*3);
for(var i = 0;i<productInfoCount;i++){
foreach (var token in productNameInfoRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (!lists.ContainsKey(token)){
lists.Add(token, Tuple.Create(new List<int>(), new List<int>()));
}
lists[token].Item1.Add(i);
}
}
//Pair words from full with those from productinfo
for(var i = 0;i<fullCount;i++){
foreach (var token in fullRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (lists.ContainsKey(token)){
lists[token].Item2.Add(i);
}
}
}
//Count all matches for each pair of rows
var counts = new Dictionary<int, Dictionary<int, int>>();
foreach(var key in lists.Keys){
foreach(var p in lists[key].Item1){
if(!counts.ContainsKey(p)){
counts.Add(p,new Dictionary<int, int>());
}
foreach(var f in lists[key].Item2){
var dic = counts[p];
if(!dic.ContainsKey(f)){
dic.Add(f,0);
}
dic[f]++;
}
}
}
If performance is the critical factor, then I would suggest trying an array of structs; this has minimal indirection (DataSet/DataTable has quite a lot of indirection). You mention KeyValuePair, and that would work, although it might not necessarily be my first choice. Milimetric is right to say that there is an overhead if you create a DataSet first and then build an array/list from that - however, even then the time savings when looping may exceed the build time. If you can restructure the load to remove the DataSet completely, great.
I would also look carefully at the loops to see if anything can reduce the actual work needed; for example, would building a dictionary/grouping allow faster lookups? Would sorting allow a binary search? Can any operations be pre-aggregated and applied at a higher level (with fewer rows)?
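For instance, if the string columns only need an exact-equality match (an assumption - the later edits show the real comparison is fuzzier), indexing dsFull once by that column replaces the 600k-row inner scan with a constant-time lookup:
using System;
using System.Collections.Generic;
using System.Data;

// Index dsFull once by its string column (assumed here to be column 1),
// then look each product-name row up directly instead of scanning all 600k rows.
var fullByKey = new Dictionary<string, List<DataRow>>(StringComparer.Ordinal);
foreach (DataRow row in dsFull.Tables[0].Rows)
{
    string key = (string)row[1];
    List<DataRow> bucket;
    if (!fullByKey.TryGetValue(key, out bucket))
        fullByKey[key] = bucket = new List<DataRow>();
    bucket.Add(row);
}

foreach (DataRow productRow in dsProductNameInfo.Tables[0].Rows)
{
    List<DataRow> matches;
    if (fullByKey.TryGetValue((string)productRow[1], out matches))
    {
        // ... work only with the matching rows ...
    }
}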
What are you doing with the data inside the nested loop?
Is the source of your datasets a SQL database? If so, the best possible performance would come from performing your calculation in SQL using an inner join and returning the result to .NET.
Another alternative would be to use the dataset's built in querying methods that act like SQL, but in-memory.
If neither of those options is appropriate, you would get a performance improvement by retrieving the 'full' dataset as a DataReader and looping over it as the outer loop. A DataSet loads all of the data from SQL into memory in one hit - with 600k rows, this will take up a lot of memory! - whereas a DataReader keeps the connection to the DB open and streams rows as they are read. Once a row has been read, the memory will be reused/reclaimed by the garbage collector.
In your comment reply to my earlier answer you said that both datasets are essentially lists of strings, each string effectively a delimited list of tags. I would first look to normalise the CSV strings in the database, i.e. split the CSVs, add them to a tag table, and link from the product to the tags via a link table.
You can then quite easily create a SQL statement that will do your matching according to the link records rather than by string (which will be more performant in its own right).
The issue you would then have is that if your subset product list needs to be passed into SQL from .NET, you would need to call the SP 100k times. Thankfully SQL Server 2008 introduced table types. You can define a table type in your database with one column to hold your product ID, have your SP accept that as an input parameter, and then perform an inner join between your actual tables and your table parameter. I've used this in my own project with very large datasets and the performance gain was massive.
On the .NET side you can create a DataTable matching the structure of the SQL table type and then pass that as a command parameter when calling your SP (once!) - a sketch follows the link below.
This article shows you how to do both the SQL and .net sides. http://www.mssqltips.com/sqlservertip/2112/table-value-parameters-in-sql-server-2008-and-net-c/
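A minimal sketch of the .NET side; the table type name (dbo.ProductIdTableType), the stored procedure name, and the column name are all illustrative:
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void CallMatchingProc(string connectionString, IEnumerable<int> productIds)
{
    var idTable = new DataTable();
    idTable.Columns.Add("ProductID", typeof(int));
    foreach (int id in productIds)
        idTable.Rows.Add(id);

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.MatchProductsByTags", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter p = cmd.Parameters.AddWithValue("@ProductIds", idTable);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.ProductIdTableType";     // must match the table type defined in SQL
        conn.Open();
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // ... consume the matched rows ...
            }
        }
    }
}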

Optimize code: LINQ and foreach loop over 15k records

This is my code:
void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
NASDataContext _db = new NASDataContext();
List<Instellingen> newOnes = new List<Instellingen>();
List<InstellingGegeven> li = _db.InstellingGegevens.ToList();
foreach (InstellingGegeven i in li) {
if (_db.Instellingens.Count(q => q.INST_LOC_REF == i.INST_LOC_REF && q.INST_LOCNR == i.INST_LOCNR && q.INST_REF == i.INST_REF && q.INST_TYPE == i.INST_TYPE) <= 0) {
// There is no item yet. Create one.
Instellingen newInst = new Instellingen();
newInst.INST_LOC_REF = i.INST_LOC_REF;
newInst.INST_LOCNR = i.INST_LOCNR;
newInst.INST_REF = i.INST_REF;
newInst.INST_TYPE = i.INST_TYPE;
newInst.Opt_KalStandaard = false;
newOnes.Add(newInst);
}
}
_db.Instellingens.InsertAllOnSubmit(newOnes);
_db.SubmitChanges();
}
Basically, the InstellingGegevens table gets filled by some procedure from another server.
What I then need to do is check whether there are new records in this table, and insert the new ones into Instellingens.
This code runs for about 4 minutes on 15k records. How do I optimize it? Or is a stored procedure the only way?
This code runs on a timer, every 6 hours. If a stored procedure is best, how do I use it from the timer?
Timer Tim = new Timer(21600000); // 6 hours
Tim.Elapsed += new ElapsedEventHandler(fixInstellingenTabel);
Tim.Start();
Doing this in a stored procedure would be a lot faster. We do something quite similar, only there are about 100k items in the table, it's updated every five minutes, and it has a lot more fields. Our job takes about two minutes to run, and then it does updates in several tables across three databases, so your job should reasonably take only a couple of seconds.
The query you need would just be something like:
create procedure UpdateInstellingens as
insert into Instellingens (
INST_LOC_REF, INST_LOCNR, INST_REF, INST_TYPE, Opt_KalStandaard
)
select q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE, cast(0 as bit)
from InstellingGegevens q
left join Instellingens i
on q.INST_LOC_REF = i.INST_LOC_REF and q.INST_LOCNR = i.INST_LOCNR
and q.INST_REF = i.INST_REF and q.INST_TYPE = i.INST_TYPE
where i.INST_LOC_REF is null
You can run the procedure from a job in SQL Server, without involving any application at all, or you can use ADO.NET to execute the procedure from your timer.
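If you go the ADO.NET route, the timer handler shrinks to a single stored-procedure call - a minimal sketch (the connection string is an assumption):
using System.Data;
using System.Data.SqlClient;
using System.Timers;

void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
    using (var conn = new SqlConnection(connectionString))     // assumed connection string
    using (var cmd = new SqlCommand("UpdateInstellingens", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}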
One way you could optimise this is by changing the Count(...) <= 0 into !Any(...). However, an even better optimisation is to retrieve this information in a single query outside the loop:
var instellingens = _db.Instellingens
.Select(q => new { q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE })
.Distinct()
.ToDictionary(q => q, q => true);
(On second thought, a HashSet would be most appropriate here, but there is unfortunately no ToHashSet() extension method. You can write one of your own if you like - a sketch appears at the end of this answer.)
And then inside your loop:
if (!instellingens.ContainsKey(new { i.INST_LOC_REF, i.INST_LOCNR,
i.INST_REF, i.INST_TYPE })) {
// There is no item yet. Create one.
// ...
}
Then you can optimise the loop itself by making it lazy-retrieve:
// No need for the List<InstellingGegeven>
foreach (InstellingGegeven i in _db.InstellingGegevens) {
// ...
}
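As an aside, the ToHashSet extension mentioned above is only a few lines - a sketch:
using System.Collections.Generic;

public static class EnumerableExtensions
{
    public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
    {
        return new HashSet<T>(source);
    }
}
With that in place, the Dictionary above could become a HashSet and the ContainsKey call a Contains call.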
What Guffa said - but using LINQ here is not the best course if performance is what you are after. LINQ, like every other ORM, sacrifices performance for usability, which is usually a great tradeoff for typical application execution paths. On the other hand, SQL is very, very good at set-based operations, so that really is the way to fly here.
