I'm trying to implement bulk insert and delete operations based on some specifications (an IList<Specification>) provided by the user. The data that needs to be queried is roughly 1 million records, based on which I need to insert or delete records in another table, UserSpecifications, which has only two columns, UserId and SpecificationId, with a composite PK of UserId + SpecificationId.
The order in which records get inserted into or deleted from the UserSpecifications table doesn't matter.
Hence I'm trying to create parallel tasks, one per specification, to process the data into UserSpecifications:
await Task.Run(() => Parallel.ForEach(specifications, async specification =>
{
    await insertBulkRecordsAsync(specification); // uses SqlBulkCopy with a transaction scope, on ~1 million records
    await deleteRecordsAsync(specification);     // uses the https://entityframework-plus.net/batch-delete operation
}));
Both insertBulkRecordsAsync and deleteRecordsAsync use userContext.
When I try to run this piece of code I get the following exception:
The context cannot be used while the model is being created. This exception may be thrown if the context is used inside the OnModelCreating method or if the same context instance is accessed by multiple threads concurrently. Note that instance members of DbContext and related classes are not guaranteed to be thread-safe.
I know that EF 6 doesn't support sharing a DbContext across threads, but if I run a normal foreach loop things do work; it just takes a lot of time to process the updates.
The following works without any issue:
foreach (var specification in specifications)
{
    await insertBulkRecordsAsync(specification);
    await deleteRecordsAsync(specification);
}
So is there any way to share the context, or otherwise parallelize this, to reduce the execution time?
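One commonly suggested workaround is to give each parallel unit of work its own context rather than trying to share one. A minimal sketch, assuming a UserContext type that can be newed up per specification and overloads of the two methods that accept the context to use (both are assumptions, not the question's actual signatures):

// Sketch only: limits concurrency with a semaphore and gives every task
// its own context, since EF 6 contexts must not be shared across threads.
var throttler = new SemaphoreSlim(4); // tune to what SQL Server can handle

var tasks = specifications.Select(async specification =>
{
    await throttler.WaitAsync();
    try
    {
        using (var context = new UserContext()) // hypothetical context type
        {
            await insertBulkRecordsAsync(context, specification);
            await deleteRecordsAsync(context, specification);
        }
    }
    finally
    {
        throttler.Release();
    }
}).ToList();

await Task.WhenAll(tasks);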
Related
We have an application using an SDK provided by our provider to integrate easily with them. This SDK connects to an AMQP endpoint and simply distributes, caches, and transforms messages for our consumers. Previously this integration was over HTTP with XML as the data source, and the old integration had two ways of caching the DataContext: per web request and per managed thread id.
Now, however, we do not integrate over HTTP but rather AMQP, which is transparent to us since the SDK does all the connection logic and we are only left with defining our consumers. So there is no option to cache the DataContext "per web request"; only per managed thread id is left.
I implemented the chain of responsibility pattern, so when an update comes in it is put through a pipeline of handlers which use the DataContext to update the database according to the new updates. This is what the invocation method of the pipeline looks like:
public Task Invoke(TInput entity)
{
    object currentInputArgument = entity;

    for (var i = 0; i < _pipeline.Count; ++i)
    {
        var action = _pipeline[i];
        if (typeof(Task).IsAssignableFrom(action.Method.ReturnType)) // covers Task as well as Task<T>
        {
            if (action.Method.ReturnType.IsConstructedGenericType)
            {
                dynamic tmp = action.DynamicInvoke(currentInputArgument);
                currentInputArgument = tmp.GetAwaiter().GetResult();
            }
            else
            {
                (action.DynamicInvoke(currentInputArgument) as Task).GetAwaiter().GetResult();
            }
        }
        else
        {
            currentInputArgument = action.DynamicInvoke(currentInputArgument);
        }
    }

    return Task.CompletedTask;
}
The problem (at least what I think it is) is that this chain of responsibility is a chain of methods returning/starting new tasks. So when an update for entity A comes in, it is handled by, say, managed thread id 1, and some time later the same entity A arrives again, only to be handled by managed thread id 2. This leads to:
System.InvalidOperationException: 'An entity object cannot be referenced by multiple instances of IEntityChangeTracker.'
because the DataContext from managed thread id 1 already tracks entity A (at least that's what I think is happening).
My question is: how can I cache the DataContext in my case? Did you have the same problem? I read this and this answer, and from what I understood, using one static DataContext is not an option either.
Disclaimer: I should have said that we inherited the application, so I cannot answer why it was implemented like that.
Disclaimer 2: I have little to no experience with EF.
Community-asked questions:
What version of EF are we using? 5.0.
Why do entities live longer than the context? They don't, but maybe you are asking why entities need to live longer than the context. I use repositories that use the cached DataContext to get entities from the database and store them in an in-memory collection, which I use as a cache.
This is how entities are "extracted", where DatabaseDataContext is the cached DataContext I am talking about (a blob with all the database sets inside):
protected IQueryable<T> Get<TProperty>(params Expression<Func<T, TProperty>>[] includes)
{
    var query = DatabaseDataContext.Set<T>().AsQueryable();

    if (includes != null && includes.Length > 0)
    {
        foreach (var item in includes)
        {
            query = query.Include(item);
        }
    }

    return query;
}
Then, whenever my consumer application receives an AMQP message, my chain of responsibility pattern starts by checking whether I have already processed this message and its data. So I have a method that looks like this:
public async Task<TEntity> Handle<TEntity>(TEntity sportEvent)
    where TEntity : ISportEvent
{
    // ... some unimportant business logic

    // save the sport
    if (sport.SportID > 0) // this checks whether the so-called sport
                           // was found in the cache or not;
                           // if it was found, we update the entity in the db
                           // and update the cache after that
    {
        _sportRepository.Update(sport); /*
                                         * because a message update for the same sport can come in,
                                         * and since the DataContext is cached by thread id like I said,
                                         * Update can be executed from different threads;
                                         * this is where the aforementioned exception is thrown
                                         */
    }
    else // if not, simply insert the entity into the db and the caches
    {
        _sportRepository.Insert(sport);
    }

    _sportRepository.SaveDbChanges();

    // ... updating caches logic
}
I thought that getting entities from the database with the AsNoTracking() method, or detaching entities every time I "update" or "insert" an entity, would solve this, but it did not.
Whilst there is a certain overhead to newing up a DbContext, and using DI to share a single instance of a DbContext within a web request can save some of this overhead, simple CRUD operations can just new up a new DbContext for each action.
Looking at the code you have posted so far, I would probably have a private instance of the DbContext newed up in the Repository constructor, and then new up a Repository for each method.
Then your method would look something like this:
public async Task<TEntity> Handle<TEntity>(TEntity sportEvent)
    where TEntity : ISportEvent
{
    var sportRepository = new SportsRepository();

    // ... some unimportant business logic

    // save the sport
    if (sport.SportID > 0)
    {
        sportRepository.Update(sport);
    }
    else
    {
        sportRepository.Insert(sport);
    }

    sportRepository.SaveDbChanges();
}
public class SportsRepository : IDisposable
{
    private readonly DbContext _dbContext;

    public SportsRepository()
    {
        _dbContext = new DbContext();
    }

    public void Dispose()
    {
        _dbContext.Dispose(); // the repository owns the context, so it disposes it
    }
}
You might also want to consider the use of Stub Entities as a way around sharing a DbContext with other repository classes.
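For reference, a stub entity is just an instance with only its key populated that you Attach so EF treats it as already existing, without querying for it. A sketch with hypothetical Team/Player entity names:

// Attach a stub so EF knows the row exists without loading it.
var team = new Team { TeamId = teamId };
dbContext.Teams.Attach(team);

// The new entity can now reference the existing row through the stub.
dbContext.Players.Add(new Player { Team = team });
dbContext.SaveChanges();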
Since this is about an existing business application, I will focus on ideas that can help solve the issue rather than lecture about best practices or propose architectural changes.
I know this is kind of obvious, but sometimes rewording error messages helps us better understand what's going on, so bear with me.
The error message indicates that entities are being used by multiple data context instances, i.e. there is more than one DbContext instance and some entities are referenced by more than one of them.
The question then states that there is a data context per thread (which used to be per HTTP request) and that entities are cached.
So it seems safe to assume entities are read from a db context on a cache miss and returned from the cache on a hit. Attempting to update entities loaded from one db context instance using a second db context instance causes the failure. We can conclude that the exact same entity instance was used in both operations and that no serialization/deserialization is in place for accessing the cache.
DbContext instances are themselves entity caches through their internal change tracker, and this error is a safeguard protecting its integrity. Since the idea is to have a long-running process handling simultaneous requests through multiple db contexts (one per thread) plus a shared entity cache, it would be very beneficial performance-wise and memory-wise (change tracking would likely increase memory consumption over time) to either change the db contexts' lifecycle to be per message or to empty their change tracker after each message is processed.
Of course, in order to process entity updates, the entities need to be attached to the current db context right after being retrieved from the cache and before any changes are applied to them.
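A rough sketch of that attach-then-clean-up cycle (Sport, Sports, and DatabaseDataContext follow the question's names; the detach loop is the "empty their change tracker" option):

// Sketch: attach a cached entity to the current thread's context before
// updating it, then detach everything once the message has been handled,
// so the long-lived context does not keep tracking stale entities.
public void UpdateFromCache(Sport cachedSport)
{
    if (DatabaseDataContext.Entry(cachedSport).State == EntityState.Detached)
    {
        DatabaseDataContext.Sports.Attach(cachedSport);
    }

    DatabaseDataContext.Entry(cachedSport).State = EntityState.Modified;
    DatabaseDataContext.SaveChanges();

    // Empty the change tracker after the message is processed.
    foreach (var entry in DatabaseDataContext.ChangeTracker.Entries().ToList())
    {
        entry.State = EntityState.Detached;
    }
}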
I developed a multithreaded application which inserts some data into a database.
Suppose I have the following situation:
public void Foo()
{
    Task.Factory.StartNew(() => AddTeams());
    Task.Factory.StartNew(() => AddTable());
}
As you can see, I call two different methods which insert data into a db. The problem is that each method needs to check if a specific record exists in a particular table:
public void AddTeams()
{
    // Pseudocode:
    // Check whether a team with id 1249 exists in the team table.
    // If it does not exist, insert the team with id 1249,
    // then insert the record attached with an FK to the team table.
}
The same thing happens in AddTable, so sometimes I get this error:
Duplicate entry '1249' for key 'PRIMARY'
because the check fails in the AddTable method. The reason for the failure is the parallelization I used; in short, a timing problem.
How can I manage this?
The only way that comes to mind is to handle the exception, but I don't like this approach.
You simply need to not parallelize the get-or-create of the team. You don't want or need to do it twice; there's only ever a need to do it once to begin with. So get or create the team, get its primary key, and then pass that key to the two methods so each can create its related object; that part can be done in parallel, as shown below.
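A sketch of that shape, with hypothetical GetOrCreateTeam/AddTeamRecords/AddTableRecords helpers standing in for the question's methods:

public async Task Foo()
{
    // Resolve the shared dependency once, on a single thread:
    // inserts team 1249 if it does not exist and returns its primary key.
    int teamId = GetOrCreateTeam(1249);

    // Only the independent inserts run in parallel. Both already have the
    // team key, so neither needs to re-check or re-insert the team.
    await Task.WhenAll(
        Task.Run(() => AddTeamRecords(teamId)),
        Task.Run(() => AddTableRecords(teamId)));
}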
In the system I'm currently building, I have to make web requests to an API which provides a calculation service. This service requires a set of complex parameters which I have to retrieve from my database. Currently I'm using Entity Framework to retrieve these entities, and for each entity I make a request to this API, retrieve the result, and at the end save all results to the database (everything done synchronously).
There will be scaling issues with this approach when the set of entities grows, since I have to call the calculation service every 30 minutes for each entity. Because of this, I would like to make the database retrieval and web request for one entity run in parallel (or async) with the same operations for the other entities; not to reduce the time for data loading, but to do work while waiting for the web requests to complete.
Since the EF 5 context is not thread-safe, what are my best alternatives for achieving this? Should I write specific SQL queries, use LINQ, etc.? Does anyone have code examples for a similar approach (db retrieval for web requests in parallel)?
EDIT
Adding a small code sample (very simplified). Assuming that the call to the web service may take a couple of seconds, this will not scale:
foreach (var entityId in entityIds)
{
    var entity = _repository.Find(entityId);
    _repository.LoadData(entity);
    _validator.ValidateData(entity);
    var result = _webservice.Call(entity);
    entity.State = result.State;
}
_repository.SaveChanges();
What you could do is use a producer-consumer architecture: one thread accesses the database and adds the data to something like a BlockingCollection, while another thread (or multiple threads) reads the data from the collection and performs the web requests.
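A minimal sketch of that shape, reusing the question's _repository/_validator/_webservice (assumed here to be safe to call from the consumer threads) and a hypothetical Entity type; the BlockingCollection is the hand-off between the sides:

var queue = new BlockingCollection<Entity>(boundedCapacity: 100);

// Producer: a single thread owns the EF context and fills the queue.
var producer = Task.Run(() =>
{
    foreach (var entityId in entityIds)
    {
        var entity = _repository.Find(entityId);
        _repository.LoadData(entity);
        queue.Add(entity);
    }
    queue.CompleteAdding(); // tell the consumers no more items are coming
});

// Consumers: a few threads make the slow web calls; they never touch the context.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var entity in queue.GetConsumingEnumerable())
    {
        _validator.ValidateData(entity);
        var result = _webservice.Call(entity);
        entity.State = result.State;
    }
})).ToArray();

await Task.WhenAll(consumers.Concat(new[] { producer }));
_repository.SaveChanges(); // back on a single thread once all work is done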
There are different ways for you to parallelize this. It all depends on what you really want/need.
One way would be to use Parallel.ForEach:
Parallel.ForEach(
    entityIds,
    entityId =>
    {
        var entity = _repository.Find(entityId);
        _repository.LoadData(entity);
        _validator.ValidateData(entity);
        var result = _webservice.Call(entity);
        entity.State = result.State;
    });

_repository.SaveChanges();
I ran into a weird problem with RavenDB:
public ActionResult Save(RandomModel model)
{
    // Do some stuff, validate model etc.
    RavenSession.Store(model);
    RavenSession.SaveChanges();

    var newListOfModels = RavenSession.Query<RandomModel>().ToList();
    return View("randomview", newListOfModels);
}
The newListOfModels does not contain the model I just added with the Store method.
However, if I add a Thread.Sleep(100) after SaveChanges, the stored model is included in the new list.
Am I storing and saving stuff to RavenDB the wrong way?
How should I be doing this?
Of course there is a workaround: just add the incoming model to newListOfModels and run SaveChanges afterwards, for example in a base controller's OnActionExecuted method.
My primary concern is why I need to delay the thread before I can query the document session and find my newly added model there.
RavenDB indexes are stale by their nature. From the documentation:
RavenDB performs data indexing in a background thread, which is executed whenever new data comes in or existing data is updated. Running this as a background thread allows the server to respond quickly even when large amounts of data have changed, however in that case you may query stale indexes.
So you need to tell RavenDB, when querying, to wait for the index to be refreshed.
You can do this with the various WaitFor... customizations; you will most probably want the WaitForNonStaleResultsAsOfLastWrite option:
var newListOfModels = RavenSession
    .Query<RandomModel>()
    .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
    .ToList();
I have an ASP.NET MVC 3 Web Application using Linq-to-SQL for my data access layer. I'm trying to increment a Views field every time the Details action is called, but I'm receiving a "Row not found or changed" error on db.SubmitChanges() if two people happen to hit the action at the same time.
public ActionResult Details(int id)
{
    DataClassesDataContext db = new DataClassesDataContext();
    var idea = db.Ideas.Where(i => i.IdeaPK == id).Single();
    idea.Views++;
    db.SubmitChanges();
    return View(new IdeaViewModel(idea));
}
I could set the UpdateCheck of the Views field to "Never" in my .dbml (data model), which would get rid of the error, but then the idea record could be updated twice with the same Views count, i.e.:
1. First instance of the Details action gets the idea record with a Views count of 1.
2. Second instance of the Details action gets the idea record with a Views count of 1.
3. First instance increments Views to 2.
4. First instance commits.
5. Second instance increments Views to 2.
6. Second instance commits.
Result: Views field is 2.
Expected result: Views field should be 3.
I looked into using a TransactionScope, but I got the following deadlock error from one of the two calls:
Transaction (Process ID 54) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
when I updated my action to look like:
public ActionResult Details(int id)
{
    DataClassesDataContext db = new DataClassesDataContext();
    using (var transaction = new TransactionScope())
    {
        var idea = db.Ideas.Where(i => i.IdeaPK == id).Single();
        idea.Views++;
        db.SubmitChanges();
        return View(new IdeaViewModel(idea));
    }
}
I also tried increasing the TransactionScope timeout using TransactionOptions, and that didn't seem to help (though I may have to set it elsewhere as well). I could probably solve this example by doing the increment in a single SQL command using db.ExecuteQuery, but I'm trying to figure out how to make this work so I'll know what to do in more complex scenarios (where I want to execute multiple commands in a single transaction).
I think you should create a stored procedure which atomically increments the field you want, and call it through LINQ to SQL.
Another option is to wrap your operation in a transaction with an appropriate isolation level.
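To make the second option concrete, here is how an explicit isolation level is passed to a TransactionScope. A sketch only: note that TransactionScope already defaults to Serializable, which is the likely source of the deadlock above, so Snapshot (if it is enabled on the database) is one level worth trying, combined with a retry for conflict errors:

// Snapshot isolation avoids the shared read locks that make two concurrent
// read-then-update transactions deadlock; it requires ALLOW_SNAPSHOT_ISOLATION
// to be enabled on the database, and concurrent writers then surface as
// update-conflict errors that should be retried.
var options = new TransactionOptions
{
    IsolationLevel = System.Transactions.IsolationLevel.Snapshot,
    Timeout = TimeSpan.FromSeconds(30)
};

using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
{
    var idea = db.Ideas.Where(i => i.IdeaPK == id).Single();
    idea.Views++;
    db.SubmitChanges();
    scope.Complete(); // without this the transaction rolls back on dispose
}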
You should not need transactions or stored procedures. Just use DataContext.ExecuteCommand:
db.ExecuteCommand("UPDATE Ideas SET Views = Views + 1 WHERE IdeaPK = {0}", id);
This will execute it as one SQL statement, and is thus atomic.
I would try capturing the "Row not found or changed" exception and retrying the whole operation: re-query the row, update it, and call SubmitChanges again. Make sure you use a counter so you only retry the operation five times or so, so that you are not caught in an infinite loop.
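In LINQ to SQL the "Row not found or changed" error surfaces as a ChangeConflictException, so the retry loop could look roughly like this:

const int maxAttempts = 5;
for (var attempt = 1; ; attempt++)
{
    try
    {
        using (var db = new DataClassesDataContext())
        {
            var idea = db.Ideas.Where(i => i.IdeaPK == id).Single();
            idea.Views++;
            db.SubmitChanges(); // throws ChangeConflictException if the row changed
            return View(new IdeaViewModel(idea));
        }
    }
    catch (ChangeConflictException) when (attempt < maxAttempts)
    {
        // Someone else updated the row between our read and our write;
        // loop around and re-query the fresh Views value.
    }
}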
I strongly suggest that you look at using a stored procedure, as @Dmitry suggested, and wrap your increment and select into one operation. That gives you two benefits: 1) it eliminates your contention issue, and 2) it puts the entire operation into one database call. Here's the basic idea:
CREATE PROCEDURE spIdeasRetrieveAndLog
    @IdeaPK int
AS
BEGIN
    UPDATE Ideas SET Views = Views + 1 WHERE IdeaPK = @IdeaPK;

    SELECT * FROM Ideas WHERE IdeaPK = @IdeaPK;
END
GO
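Once the procedure is added to the .dbml designer, LINQ to SQL generates a method for it on the DataContext. Assuming the default naming and that its return type is mapped to the Idea entity (both assumptions about your mapping), the action becomes a single round trip:

// One database call: increments the counter and returns the fresh row.
var idea = db.spIdeasRetrieveAndLog(id).Single();
return View(new IdeaViewModel(idea));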
I would recommend using one of the messaging frameworks (for example NServiceBus, but there are other options: MassTransit, Rhino Service Bus). They will help you solve this problem in a very simple and elegant way.