LINQ To SQL in a parallel loop: How to prevent duplicate insertions? - c#

I'm running into some trouble in trying to parallelize a computationally expensive API integration.
The integration queries an API in parallel and populates a ConcurrentBag collection. Some processing is done, and then the collection is passed to Parallel.ForEach(), where it is interfaced with the database using LINQ to SQL.
There are three nested loops:
an outer loop which runs in parallel over Courses,
an inner loop through Disciplines,
and, inside that, another loop iterating through Lessons.
The problem I'm running into is: as any one lesson may belong to more than one course, looping over courses in parallel means that sometimes a lesson will be inserted more than once.
The code currently looks like this:
(externalCourseList is the collection of type ConcurrentBag<ExternalCourse>.)
Parallel.ForEach(externalCourseList, externalCourse =>
{
    using (var context = new DataClassesDataContext())
    {
        var dbCourse = context.Courses.Single(
            x => x.externalCourseId == externalCourse.courseCode.ToString());
        dbCourse.ShortDesc = externalCourse.Summary;
        //dbCourse.LongDesc = externalCourse.MoreInfo;
        //(etc)
        foreach (var externalDiscipline in externalCourse.Disciplines)
        {
            var dbDiscipline = context.Disciplines
                .Where(x => x.ExternalDisciplineID == externalDiscipline.DisciplineCode.ToString())
                .SingleOrDefault();
            if (dbDiscipline == null)
                dbDiscipline = new Linq2SQLEntities.Discipline();
            dbDiscipline.Title = externalDiscipline.Name;
            //(etc)
            dbDiscipline.ExternalDisciplineID = externalDiscipline.DisciplineCode.ToString();
            if (!dbDiscipline.IsLoaded)
                context.Disciplines.InsertOnSubmit(dbDiscipline);

            // relational table used as one-to-many relationship for legacy reasons
            var courseDiscipline = dbDiscipline.Course_Disciplines.SingleOrDefault(
                x => x.CourseID == dbCourse.CourseID);
            if (courseDiscipline == null)
            {
                courseDiscipline = new Course_Discipline
                {
                    Course = dbCourse,
                    Discipline = dbDiscipline
                };
                context.Course_Disciplines.InsertOnSubmit(courseDiscipline);
            }

            foreach (var externalLesson in externalDiscipline.Lessons)
            {
                /// The next statement throws an exception
                var dbLesson = context.Lessons
                    .Where(x => x.externalLessonID == externalLesson.LessonCode)
                    .SingleOrDefault();
                if (dbLesson == null)
                    dbLesson = new Linq2SQLEntities.Lesson();
                dbLesson.Title = externalLesson.Title;
                //(etc)
                dbLesson.externalLessonID = externalLesson.LessonCode;
                if (!dbLesson.IsLoaded)
                    context.Lessons.InsertOnSubmit(dbLesson);

                var disciplineLesson = dbLesson.Discipline_Lessons.SingleOrDefault(
                    x => x.DisciplineID == dbDiscipline.DisciplineID
                         && x.LessonID == dbLesson.LessonID);
                if (disciplineLesson == null)
                {
                    disciplineLesson = new Discipline_Lesson
                    {
                        Discipline = dbDiscipline,
                        Lesson = dbLesson
                    };
                    context.Discipline_Lessons.InsertOnSubmit(disciplineLesson);
                }
            }
        }
        context.SubmitChanges();
    }
});
(IsLoaded is implemented as described here.)
An exception is thrown at the line marked with /// because the same lesson is often inserted multiple times, so calling .SingleOrDefault() on context.Lessons.Where(x => x.externalLessonID == externalLesson.LessonCode) fails: the sequence contains more than one element.
What would be the best way to solve this?

One approach could be to separate the insertion of the lessons in the database from the other work that has to be done in parallel. I haven't studied your code deeply, so I am not sure if this approach is feasible, but I'll give an example anyway. The basic idea is to serialize the insertion of the lessons, in order to avoid the problems caused by the parallelization:
IEnumerable<Lesson[]> query = externalCourseList
    .AsParallel()
    .AsOrdered() // Optional
    .Select(externalCourse =>
    {
        using DataClassesDataContext context = new();
        List<Lesson> results = new();
        // Here do the work that adds lessons to the results list.
        return results.ToArray();
    })
    .AsSequential();
This is a parallel query (PLINQ) that does the parallel work while it is being enumerated, so at this point it hasn't started yet. Now let's enumerate it:
using DataClassesDataContext context = new();
foreach (Lesson lesson in query.SelectMany(x => x))
{
    // Here insert the lesson in the DB.
}
The work of inserting the lessons in the DB will be done exclusively on the current thread. This thread will also participate in the work inside the parallel query, along with ThreadPool threads. In case this is a problem, you could offload the enumeration of the query to a ThreadPool thread, freeing the current thread from doing anything other than the lesson-inserting work. I've posted an OffloadEnumeration extension method here, which you could use just before starting the enumeration of the query:
query = OffloadEnumeration(query);
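For reference, here is a minimal sketch of what such an OffloadEnumeration helper could look like (the version linked above is the authoritative one; this only illustrates the idea of enumerating the source on a ThreadPool thread and handing the items over through a BlockingCollection):
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static IEnumerable<T> OffloadEnumeration<T>(IEnumerable<T> source, int boundedCapacity = 1)
{
    // The producer task enumerates the source (and so runs the parallel query)
    // on a ThreadPool thread, while the consumer stays on the current thread.
    // Simplified sketch: assumes the consumer enumerates to completion.
    var buffer = new BlockingCollection<T>(boundedCapacity);
    Task producer = Task.Run(() =>
    {
        try
        {
            foreach (T item in source)
                buffer.Add(item); // blocks while the buffer is full
        }
        finally
        {
            buffer.CompleteAdding();
        }
    });
    foreach (T item in buffer.GetConsumingEnumerable())
        yield return item;
    producer.GetAwaiter().GetResult(); // propagate any producer exception
}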

Related

How to properly cache a table in Entity Framework for this use case

var fdPositions = dbContext.FdPositions.Where(s => s.LastUpdated > DateTime.UtcNow.AddDays(-1));
foreach (JProperty market in markets)
{
    // bunch of logic that is irrelevant here
    var fdPosition = fdPositions.Where(s => s.Id == key).FirstOrDefault();
    if (fdPosition is not null)
    {
        fdPosition.OddsDecimal = oddsDecimal;
        fdPosition.LastUpdated = DateTime.UtcNow;
    }
    else
    {
        // bunch of logic that is irrelevant here
    }
}
await dbContext.SaveChangesAsync();
This block of code will make 1 database call on this line
var fdPosition = fdPositions.Where(s => s.Id == key).FirstOrDefault();
for each iteration of the loop, and there will be around 10,000 markets to loop through.
What I thought would happen, and what I would like to happen, is that one database call is made
var fdPositions = dbContext.FdPositions.Where(s => s.LastUpdated > DateTime.UtcNow.AddDays(-1));
on this line, and then the loop would check against the local data I thought I had pulled on the first line, while still properly updating the tracked DB object in this section:
if (fdPosition is not null)
{
    fdPosition.OddsDecimal = oddsDecimal;
    fdPosition.LastUpdated = DateTime.UtcNow;
}
So my data is properly propagated to the DB when I call
await dbContext.SaveChangesAsync();
How can I update my code to accomplish this so I am making 1 DB call to get my data rather than 10,000 DB calls?
Define your fdPositions variable as a Dictionary<int, T>: in your query, do a GroupBy() on Id, then call .ToDictionary(). Now you have a materialized dictionary that lets you index by key quickly.
var fdPositions = dbContext.FdPositions
    .Where(s => s.LastUpdated > DateTime.UtcNow.AddDays(-1))
    .GroupBy(x => x.Id)
    .ToDictionary(x => x.Key, x => x.First());

// inside the foreach loop:
// bunch of logic that is irrelevant here
bool found = fdPositions.TryGetValue(key, out var item);
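Put together, the loop could then look like this (a sketch; markets, key and oddsDecimal stand in for the question's elided logic, and ToDictionaryAsync assumes Id is unique, otherwise keep the GroupBy from above):
// one database call, materialized into a local dictionary
var fdPositions = await dbContext.FdPositions
    .Where(s => s.LastUpdated > DateTime.UtcNow.AddDays(-1))
    .ToDictionaryAsync(s => s.Id);

foreach (JProperty market in markets)
{
    // bunch of logic that is irrelevant here
    if (fdPositions.TryGetValue(key, out var fdPosition))
    {
        // the entities are still tracked by the context,
        // so these changes are picked up by SaveChangesAsync
        fdPosition.OddsDecimal = oddsDecimal;
        fdPosition.LastUpdated = DateTime.UtcNow;
    }
    else
    {
        // bunch of logic that is irrelevant here
    }
}
await dbContext.SaveChangesAsync();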

This code takes 2 hrs to compare and sort 20,000 items each, is there a better way to write this C# code

I am trying to sort all the updated items in DataTableA by coloring the items that have not been completely updated, and removing the items that have been updated completely from the DataTable. Both the completely updated items and the incompletely updated ones are in the "managed" table in the database; the discharge date will be null if an item has not been completely updated.
The code below works, but it can take all day for the page to run. This is a C# WebForm.
The code below lives in my code-behind file:
foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;
    var _cas = db.managed.Any(b =>
        b.panumber == panum && b.dischargedate != null);
    var casm = db.managed.Any(b =>
        b.panumber == panum && b.dischargedate == null);
    if (_cas == true)
    {
        dataItem.Visible = false;
    }
    if (casm == true)
    {
        dataItem.BackColor = Color.Yellow;
    }
}
As mentioned in the comment, each call to db.managed.Any will create a new SQL query.
There are various improvements you can make to speed this up:
First, you don't need to call db.managed.Any twice inside the loop if it's checking the same unique entity. Call it just once and check dischargedate. This alone will speed up the loop 2x.
// one database call, fetching one column
var dischargedate = db.managed
    .Where(b => b.panumber == panum)
    .Select(x => x.dischargedate)
    .FirstOrDefault();
var _cas = dischargedate != null;
var casm = dischargedate == null; // note: also true when no row exists at all
If panumber is not a unique primary key and you don't have a SQL index for this column, then each db.managed.Any call will scan all rows in the table. This can easily be solved by creating an index on panumber and dischargedate, so if you don't have this index, create it.
Ideally, if the table is not huge, you can just load it all at once. But even if you have tens of millions of records, you can split the loop into several chunks, instead of repeating the same query over and over again.
Consider using better naming for your variables. _cas and casm are a poor choice of variable names.
Pro tip: Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.
So if you don't have hundreds of thousands of items, here is the simplest fix: load the panumber and dischargedate values for all rows from that table into memory, and then use a dictionary to instantly find the items:
// load all into memory
var allDischargeDates = await db.managed
    .Select(x => new { x.panumber, x.dischargedate })
    .ToListAsync(cancellationToken);

// create a dictionary so that you can quickly map panumber -> dischargedate
// (assumes panumber is unique within the table)
var dischargeDateByNumber = allDischargeDates
    .ToDictionary(x => x.panumber, x => x.dischargedate);

foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;
    // this is very fast to check now
    if (!dischargeDateByNumber.TryGetValue(panum, out DateTime? dischargeDate))
    {
        // no such entry - in this case your original code would just skip the item
        continue;
    }
    if (dischargeDate != null)
    {
        dataItem.Visible = false;
    }
    else
    {
        dataItem.BackColor = Color.Yellow;
    }
}
If the table is huge and you only want to load certain items, you would do:
// get the list of numbers to fetch from the database
// (this should not be a large list!)
var someList = RadGrid1
.Items
.Select(x => x["Inumber"].Text)
.ToList();
// load these items into memory
var allDischargeDates = await db.managed
.Where(x => someList.Contains(x.panumber))
.Select(x => new { x.panumber, x.dischargedate })
.ToListAsync(cancellationToken);
But there is a limit on how large someList can be (you don't want to run this query for a list of 200 thousand items).
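If the list is larger than that, you can split it into chunks and run one query per chunk, as a compromise between one query per row and one huge IN (...) query. A sketch (Enumerable.Chunk needs .NET 6+; on older versions, batch manually):
var dischargeDateByNumber = new Dictionary<string, DateTime?>();
foreach (var chunk in someList.Chunk(1000))
{
    // one query per batch of 1000 numbers
    var batch = await db.managed
        .Where(x => chunk.Contains(x.panumber))
        .Select(x => new { x.panumber, x.dischargedate })
        .ToListAsync(cancellationToken);
    foreach (var row in batch)
        dischargeDateByNumber[row.panumber] = row.dischargedate;
}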
Well, 900 items might be worth simply fetching into a list in memory and then processing that. It will definitely be faster, although it consumes more memory.
You can do something like this (assuming the type of managed is Managed):
List<Managed> myList = db.managed.ToList();
That will fetch the whole table.
Now replace your code with:
foreach (GridDataItem dataItem in RadGrid1.Items)
{
    var panu = dataItem["Inumber"];
    var panum = panu.Text;
    var _cas = myList.Any(b =>
        b.panumber == panum && b.dischargedate != null);
    var casm = myList.Any(b =>
        b.panumber == panum && b.dischargedate == null);
    if (_cas == true)
    {
        dataItem.Visible = false;
    }
    if (casm == true)
    {
        dataItem.BackColor = Color.Yellow;
    }
}
You should see a huge performance improvement.
Another thing: you don't mention which database you're using, but you should make sure the panumber column is properly indexed.

Improve EFCore Query for fast operation

I read a SQLite database (the database is about 3 MB, so there is not much information; each table has about 1 or 2 thousand rows) and extract information from it, then add this information to a new database.
The whole operation takes about 40 seconds.
How can I reduce this time and get the operation done as quickly as possible? (Task, Parallel, async,...)
I am currently using this code:
await Task.Run(async () =>
{
    var pkgs = new ManifestTable();
    var mydb = new dbContext();
    await mydb.Database.EnsureDeletedAsync();
    await mydb.Database.EnsureCreatedAsync();
    using (var msixDB = new MSIXContext())
    {
        foreach (var item in await msixDB.IdsMSIXTable.ToListAsync())
        {
            var rowId = item.rowid;
            var manifests = await msixDB.Set<ManifestMSIXTable>()
                .Where((e) => e.id == rowId).ToListAsync();
            foreach (var manifest in manifests)
            {
                pkgs = new ManifestTable();
                pkgs.PackageId = item.id;
                var productMap = await msixDB.ProductCodesMapMSIXTable
                    .FirstOrDefaultAsync((e) => e.manifest == manifest.rowid);
                if (productMap != null)
                {
                    var prdCode = await msixDB.ProductCodesMSIXTable
                        .FirstOrDefaultAsync((e) => e.rowid == productMap.productcode);
                    if (prdCode != null)
                    {
                        pkgs.ProductCode = prdCode.productcode;
                    }
                }
                var publisherMap = await msixDB.Set<PublishersMapMSIXTable>()
                    .FirstOrDefaultAsync((e) => e.manifest == manifest.rowid);
                if (publisherMap != null)
                {
                    var publisher = await msixDB.PublishersMSIXTable
                        .FirstOrDefaultAsync((e) => e.rowid == publisherMap.norm_publisher);
                    if (publisher != null)
                    {
                        pkgs.Publisher = publisher.norm_publisher;
                    }
                }
                var pathPart = manifest.pathpart;
                var yml = await msixDB.PathPartsMSIXTable
                    .FirstOrDefaultAsync((e) => e.rowid == pathPart);
                if (yml != null)
                {
                    pkgs.YamlName = yml.pathpart;
                }
                var version = await msixDB.VersionsMSIXTable
                    .FirstOrDefaultAsync((e) => e.rowid == manifest.version);
                if (version != null)
                {
                    pkgs.Version = version.version;
                }
                await mydb.ManifestTable.AddAsync(pkgs);
            }
        }
        await mydb.SaveChangesAsync();
    }
});
Treating the database as object storage is the worst idea ever. You have to reduce database roundtrips as much as possible; in your case, to just one request. Also, do not play with Task.Run, Parallel, etc. when you do not know which part is slow. In your case it is the database roundtrips.
var mydb = new dbContext();
await mydb.Database.EnsureDeletedAsync();
await mydb.Database.EnsureCreatedAsync();
using (var msixDB = new MSIXContext())
{
    var query =
        from item in msixDB.IdsMSIXTable
        from manifest in msixDB.Set<ManifestMSIXTable>().Where(e => e.id == item.rowid)
        from productMap in msixDB.ProductCodesMapMSIXTable.Where(e => e.manifest == manifest.rowid).Take(1).DefaultIfEmpty()
        from prdCode in msixDB.ProductCodesMSIXTable.Where(e => e.rowid == productMap.productcode).Take(1).DefaultIfEmpty()
        from publisherMap in msixDB.Set<PublishersMapMSIXTable>().Where(e => e.manifest == manifest.rowid).Take(1).DefaultIfEmpty()
        from publisher in msixDB.PublishersMSIXTable.Where(e => e.rowid == publisherMap.norm_publisher).Take(1).DefaultIfEmpty()
        from yml in msixDB.PathPartsMSIXTable.Where(e => e.rowid == manifest.pathpart).Take(1).DefaultIfEmpty()
        from version in msixDB.VersionsMSIXTable.Where(e => e.rowid == manifest.version).Take(1).DefaultIfEmpty()
        select new ManifestTable
        {
            PackageId = item.id,
            ProductCode = prdCode.productcode,
            Publisher = publisher.norm_publisher,
            YamlName = yml.pathpart,
            Version = version.version
        };
    mydb.ManifestTable.AddRange(await query.ToListAsync());
    await mydb.SaveChangesAsync();
}
You should start by seeing if there are any algorithmic improvements to make before trying to do things in parallel etc.
You have two nested loops, so if each table has a few thousand rows, the inner loop body will run on the order of 10^6 times. Not terrible, but a fair amount.
In the inner loop you are then running a whole bunch of FirstOrDefaultAsync calls. If the columns being searched are not indexed, every call will scan all rows, and this will be slow. So to start off, ensure you have appropriate indices on all the tables, so that looking up a specific item does not require a full scan.
You also seem to be doing repeated lookups in PublishersMapMSIXTable with the same parameters. Avoiding unnecessarily repeated operations should be one of the first things to fix, since they are just wasted cycles.
If the whole operation is run on a background thread, it is unlikely that all the async calls will help much; they save a little bit of memory but cause some bouncing between threads. So if performance is important, regular synchronous methods will probably be a little bit faster.
And as always with performance: measure. A good performance profiler should tell you what most of the time is spent in, and adding some stopwatches is easy if you do not have one (see the sketch below). Even very experienced programmers can be completely wrong when they guess at what the slow parts are.
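For example, a quick-and-dirty measurement with System.Diagnostics.Stopwatch (a sketch; wrap whichever sections of the code above you suspect):
using System.Diagnostics;

var sw = Stopwatch.StartNew();
var ids = await msixDB.IdsMSIXTable.ToListAsync();
Console.WriteLine($"Loading ids: {sw.ElapsedMilliseconds} ms");

sw.Restart();
// ... the per-manifest lookups go here ...
Console.WriteLine($"Lookups: {sw.ElapsedMilliseconds} ms");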

How can i optimise these loops

I have the below snippet, which takes a long time to run as the data increases.
OrderEntityColection and samplePriceList are both Lists.
OrderEntityColection = 30k trades
samplePriceList = 1 million prices
It easily takes 10-15 minutes or more to finish.
I have tested this with 1,500 orders and 300k prices, but that also takes around 40-50 seconds, and as the orders increase so do the prices, so it takes even longer.
Can you see how I can improve this? I have already cut it down to these numbers beforehand from a bigger set.
MarketId = int
Audit = string
foreach (var tradeEntity in OrderEntityColection)
{
    Parallel.ForEach(samplePriceList, new ParallelOptions { MaxDegreeOfParallelism = 8 }, (price) =>
    {
        if (price.MarketId == tradeEntity.MarketId)
        {
            if (tradeEntity.InstructionPriceAuditId == price.Audit)
            {
                // OrderExportColection.Enqueue(tradeEntity);
                count++;
            }
        }
    });
}
So you want to do the work in memory. OK, but you need to be smart about the way you structure the data up front. The first thing is that you're looking up prices by MarketId (and then by Audit), so build that lookup first:
var pricesLookupByMarketId = samplePriceList
    .GroupBy(p => p.MarketId)
    .ToDictionary(g => g.Key, g => g.ToDictionary(p => p.Audit));
Now you have a Dictionary<int, Dictionary<string, Price>> (MarketId is an int and Audit a string, per the question; the same approach works for other key types, though the inner ToDictionary assumes Audit is unique within a market).
Now your code becomes super simple and a lot faster
foreach (var tradeEntity in OrderEntityColection)
{
    if (pricesLookupByMarketId.ContainsKey(tradeEntity.MarketId)
        && pricesLookupByMarketId[tradeEntity.MarketId].ContainsKey(tradeEntity.InstructionPriceAuditId))
    {
        count++;
    }
}
Or, if you're a fan of one long line:
var count = OrderEntityColection.Count(tradeEntity =>
    pricesLookupByMarketId.ContainsKey(tradeEntity.MarketId)
    && pricesLookupByMarketId[tradeEntity.MarketId].ContainsKey(tradeEntity.InstructionPriceAuditId));
As pointed out in the comments, this can be further optimized to stop repeated reads of the dictionaries (see the sketch below); the exact implementation depends on how you want to use this data in the end.
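For example, TryGetValue reads the outer dictionary only once per trade (a sketch under the same assumptions as above):
foreach (var tradeEntity in OrderEntityColection)
{
    // one outer lookup instead of ContainsKey followed by the indexer
    if (pricesLookupByMarketId.TryGetValue(tradeEntity.MarketId, out var pricesByAudit)
        && pricesByAudit.ContainsKey(tradeEntity.InstructionPriceAuditId))
    {
        count++;
    }
}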
In the parallel loop you have cases where you skip the processing for certain items. That's quite expensive, as you rely on that check also happening on a separate thread. I'd just filter the items first and only then process them, as follows:
foreach (var tradeEntity in OrderEntityColection)
{
    Parallel.ForEach(
        samplePriceList.Where(item => item.MarketId == tradeEntity.MarketId
                                      && item.Audit == tradeEntity.InstructionPriceAuditId),
        new ParallelOptions { MaxDegreeOfParallelism = 8 },
        (price) =>
        {
            // Do whatever processing is required here
            Interlocked.Increment(ref count);
        });
}
On a side note, it seems like you need to replace count++ with Interlocked.Increment(ref count) to be thread safe.
Managed to do this with the help of my friend:
var samplePriceList = PriceCollection
    .GroupBy(priceEntity => priceEntity.MarketId)
    .ToDictionary(g => g.Key, g => g.ToList());

foreach (var tradeEntity in OrderEntityColection)
{
    var price = samplePriceList[tradeEntity.MarketId]
        .FirstOrDefault(obj => obj.Audit == tradeEntity.Audit);
    if (price != null)
    {
        count += 1;
    }
}

Why did my code work when I changed it from IEnumerable to List?

I'm trying to get my head around this rather than just chalking it up to general voodoo.
I do an EF query and get some data back, and I .ToList() it, like this:
IEnumerable<DatabaseMatch<CatName>> nameMatches = nameLogicMatcher.Match(myIQueryableOfCats).ToList();
Some cats appear twice in the database because they have multiple names, but each cat has a primary name. So in order to filter this down, I get all of the ids of the cats in a list:
List<int> catIds = nameMatches.Select(c => c.Match.CatId).ToList();
I then iterate through all of the distinct ids, get all of the matching cat names, and remove anything that isn't a primary name from the list, like this:
foreach (int catId in catIds.Distinct())
{
    var allCatNameMatches = nameMatches.Where(c => c.Match.CatId == catId);
    var primaryMatch = allCatNameMatches.FirstOrDefault(c => c.Match.NameType == "Primary Name");
    nameMatches = nameMatches.Except(allCatNameMatches.Where(c => c != primaryMatch));
}
Now this code, when I first ran it, just hung, which I thought was odd. I stepped through it, and it seemed to work, but after 10 iterations (it is capped at 100 cats in total) it started to slow down, eventually became glacial, and then hung completely.
I thought maybe it was doing some intensive database work by mistake, but the profiler shows no SQL executed except that which retrieves the initial list of cat names.
I decided to change it from IEnumerable of nameMatches to a List, and put the appropriate .ToList() on the last line. It worked instantly and perfectly after I did this.
The question I'd like to ask is, why?
Without the ToList() you are building up in nameMatches a nested chain of IEnumerables awaiting delayed execution. This might not be so bad, except you are also calling FirstOrDefault on each iteration, which executes the chain. So on iteration number n, you are executing the filter operations contained in the loop n-1 times. If you had 1000 distinct cats, the Linq chain is getting executed 1000 + 999 + ... + 1 times, around half a million executions. (I think you have something that is O(n³)!)
The moral is, if you want to use delayed execution, make very sure that you're only executing your chain once.
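For illustration, here is one way to do the whole filtering in a single pass, so the chain is never re-executed (a sketch using the types from the question; like the original loop, it keeps only the primary-name match per cat and drops cats that have none):
// One pass over the materialized list: group by cat, keep the primary match.
List<DatabaseMatch<CatName>> filtered = nameMatches
    .GroupBy(c => c.Match.CatId)
    .Select(g => g.FirstOrDefault(c => c.Match.NameType == "Primary Name"))
    .Where(primary => primary != null)
    .ToList();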
Let's simplify your code a little:
foreach (int catId in catIds.Distinct())
{
    var allCatNameMatches = nameMatches.Where(c => c.Match.CatId == catId);
    DatabaseMatch<CatName> primaryMatch = null;
    nameMatches = nameMatches.Except(allCatNameMatches.Where(c => c != primaryMatch));
}
And a little more:
foreach (int catId in catIds.Distinct())
{
    nameMatches = nameMatches.Where(c => c.Match.CatId == catId);
    DatabaseMatch<CatName> primaryMatch = null;
    nameMatches = nameMatches.Except(nameMatches.Where(c => c != primaryMatch));
}
In the latter form it is obvious that, due to deferred execution, each pass of the foreach body lengthens the chain of Where and Except. Then remember the var primaryMatch = allCatNameMatches.FirstOrDefault(...) call: it is not deferred, so on each iteration of the foreach it executes the whole chain built so far. That is why it hangs.
