Code (notice the order of the values is different, so it alternates between locking rows):
static void Main( string[] args )
{
List<int> list = new List<int>();
for(int i = 0; i < 1000; i++ )
list.Add( i );
Parallel.ForEach( list, i =>
{
using( NamePressDataContext db = new NamePressDataContext() )
{
db.ExecuteCommand( @"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (7276, 20870),(240, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
db.ExecuteCommand( @"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (240, 20870),(7276, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
}
} );
}
Table def:
CREATE TABLE [dbo].[EDescriptionsCategories](
[CategoryId] [int] NOT NULL,
[Id] [int] NOT NULL,
CONSTRAINT [PK_EDescriptionsCategories] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
Exception:
Transaction (Process ID 80) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
The code works only with the WITH (TABLOCK) hint. Is it possible not to lock the whole table to update just those 2 rows in parallel?
Your two statements acquire row locks in different order. That's a classic case for deadlocks. You can fix this by ensuring that the order of locks taken is always in some global order (for example, ordered by ID). You should probably coalesce the two UPDATE statements into one and sort the list of IDs on the client before sending it to SQL Server. For many typical UPDATE plans this actually works fine (not guaranteed, though).
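For illustration, a minimal sketch of that approach (it reuses the data context from the question; the Id/CategoryId pairs are just the two from your statements, and building the VALUES list by string concatenation is only safe here because both fields are ints):
var updates = new List<KeyValuePair<int, int>>   // Id -> CategoryId
{
    new KeyValuePair<int, int>(7276, 20870),
    new KeyValuePair<int, int>(240, 20870)
};
// Sort by Id so every statement requests its row locks in the same (ascending) order.
string rows = string.Join(",", updates
    .OrderBy(u => u.Key)
    .Select(u => string.Format("({0},{1})", u.Key, u.Value)));
using (var db = new NamePressDataContext())
{
    db.ExecuteCommand(
        @"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
          join (values " + rows + @") as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id");
}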
Or, you add retry logic in case you detect a deadlock (SqlException.Number == 1205). This is more elegant because it requires no deeper code changes. But deadlocks have performance implications so only do this for low deadlock rates.
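A rough sketch of that retry logic (updateSql stands in for one of the UPDATE statements above; the retry count and back-off are arbitrary, and 1205 is SQL Server's error number for a deadlock victim):
const int MaxRetries = 3;
for (int attempt = 1; ; attempt++)
{
    try
    {
        using (var db = new NamePressDataContext())
        {
            db.ExecuteCommand(updateSql);    // updateSql = your UPDATE statement
        }
        break;                               // success
    }
    catch (SqlException ex)
    {
        if (ex.Number != 1205 || attempt >= MaxRetries)
            throw;                           // not a deadlock, or out of retries
        Thread.Sleep(50 * attempt);          // chosen as victim: back off briefly and retry
    }
}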
If your parallel processing generates lots of updates, you could INSERT all those updates into a temp table (which can be done concurrently) and when you are done you execute one big UPDATE that copies all the individual update records to the main table. You just change the join source in your sample queries.
Code, notice the order of the values is different. So it alternates between locking rows
No, it doesn't alternate. It acquires the locks in two different orders. Deadlock is guaranteed.
Is it possible not to ... update just those 2 rows in parallel?
Not like that, it isn't. What you're asking for is the definition of a deadlock; something has to give. The solution must come from your business logic: there should be no attempts to process the same set of IDs from distinct transactions. What that means is entirely business-specific. If you cannot achieve that, then you are basically begging for deadlocks. There are some things you can do, but none is bulletproof and all come at great cost. The problem is higher up the chain.
Agree with other answers as regards to the locking.
The more pressing question is what are you hoping to gain from this? There's only one cable those commands are travelling down.
You are probably making the overall performance worse by doing this. Far better to do your computation in parallel but serialize (and possibly batch) your updates.
Related
I have a code block which processes StoreProducts and then adds or updates them in the database in a foreach loop. But this is slow. When I convert the code to a Parallel.ForEach block, the same products get both added and updated at the same time. I could not figure out how to safely parallelize the following functionality; any help would be appreciated.
var validProducts = storeProducts.Where(p => p.Price2 > 0
&& !string.IsNullOrEmpty(p.ProductAtt08Desc.Trim())
&& !string.IsNullOrEmpty(p.Barcode.Trim())
).ToList();
var processedProductCodes = new List<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)), po,
(product) =>
{
lock (_lockThis)
{
processedProductCodes.Add(product.ProductCode);
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
});
The thing here is that the list validProducts may contain more than one entry with the same ProductCode (they are variants), and I have to make sure that once one of them is being processed, the same code is not processed again.
So the Where condition inside the Parallel.ForEach, validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)), is not working as it would in a normal foreach.
The bulk of my answer is less-so an answer to your question and more some guidance - if you were to provide some more technical details, I may be able to assist more precisely.
A Parallel.ForEach is probably not the best solution here -- especially when you have a shared list or a busy server.
You are locking to write but not to read from that shared list, so I'm surprised it's not throwing during the Where. Turn the List<string> into a ConcurrentDictionary<string, bool> (just to get a simple concurrent hash table); then you'll get better write throughput and it won't throw during reads.
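A minimal sketch of that change, keeping the rest of your loop as it is (the bool value is never used; the dictionary only acts as a concurrent set):
var processedProductCodes = new ConcurrentDictionary<string, bool>();

Parallel.ForEach(validProducts, po, (product) =>
{
    // TryAdd is atomic and returns false if another thread already claimed this code.
    if (!processedProductCodes.TryAdd(product.ProductCode, true))
        return;

    // Check if Product Exists in Db
    // if product is not in Db Add to Db
    // if product is in Db Update product in Db
});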
But you're going to have database contention issues (if using multiple connections) because your insert will likely still require locks. Even if you simply split the workload you would run into this. This DB locking could cause blocks/deadlocks so it may end up slower than the original. If using one connection, you generally cannot parallelize commands.
I would try wrapping the inserts in a transaction in batches of, say, 1000 inserts, or placing the entire workload into one bulk insert. Then the database will keep the data in memory and commit the entire thing to disk when finished (instead of one record at a time).
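For illustration only, a rough sketch of the batching idea with plain ADO.NET; the table and column names are invented, and connectionString and productsToSave are assumed to exist in your code:
const int BatchSize = 1000;
using (var cnn = new SqlConnection(connectionString))
{
    cnn.Open();
    for (int i = 0; i < productsToSave.Count; i += BatchSize)
    {
        using (var tx = cnn.BeginTransaction())
        {
            foreach (var p in productsToSave.Skip(i).Take(BatchSize))
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO StoreProducts (ProductCode, Price2) VALUES (@code, @price)", cnn, tx))
                {
                    cmd.Parameters.AddWithValue("@code", p.ProductCode);
                    cmd.Parameters.AddWithValue("@price", p.Price2);
                    cmd.ExecuteNonQuery();
                }
            }
            tx.Commit();   // one commit per batch instead of one per row
        }
    }
}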
Depending on your typical workload, you may want to try different storage solutions. Databases are generally bad for inserting large volumes of records... you will likely see much better performance with alternative solutions (such as Key-Value stores). Or place the data into something like Redis and slowly persist to the database in the background.
Parallel.ForEach buffers items internally for each thread. One option is to switch to a partitioner that does not use buffering:
var pat = Partitioner.Create(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode))
,EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(pat, po, (product) => ...
That will get you closer, but you will still have a race condition where two items with the same code can be processed, because you don't skip the item when you find a duplicate.
The better option is to switch processedProductCodes to a HashSet<string> and change your code to:
var processedProductCodes = new HashSet<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts, po,
(product) =>
{
//You can safely lock on processedProductCodes
lock (processedProductCodes)
{
if(!processedProductCodes.Add(product.ProductCode))
{
//Add returns false if the code is already in the collection.
return;
}
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
});
HashSet<string> has much faster lookups, and the duplicate check is built into Add.
In a heavily multi-threaded scenario, I have problems with a particular EF query. It's generally cheap and fast:
Context.MyEntity
.Any(se => se.SameEntity.Field == someValue
&& se.AnotherEntity.Field == anotherValue
&& se.SimpleField == simpleValue
// few more simple predicates with fields on the main entity
);
This compiles into a very reasonable SQL query:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM (SELECT [Extent1].[Field1] AS [Field1]
FROM [dbo].[MyEntity] AS [Extent1]
INNER JOIN [dbo].[SameEntity] AS [Extent2] ON [Extent1].[SameEntity_Id] = [Extent2].[Id]
WHERE (N'123' = [Extent2].[SimpleField]) AND (123 = [Extent1].[AnotherEntity_Id]) AND -- further simple predicates here -- ) AS [Filter1]
INNER JOIN [dbo].[AnotherEntity] AS [Extent3] ON [Filter1].[AnotherEntity_Id1] = [Extent3].[Id]
WHERE N'123' = [Extent3].[SimpleField]
)) THEN cast(1 as bit) ELSE cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
The query, in general, has an optimal query plan, uses the right indices and returns in tens of milliseconds, which is completely acceptable.
However, when a critical number of threads (up to 40) starts executing this query, its performance drops to tens of seconds.
There are no locks in the database, no queries are writing data to these tables, and it reproduces very well with a database that's practically isolated from any other operations. The DB resides on the same physical machine, and the machine is not overloaded at any point, i.e. it has plenty of spare memory and other resources; as it later turned out (see the update below), the CPU is in fact heavily loaded by this operation.
Now, what's really bizarre is that when I replace the EF Any() call with Context.Database.ExecuteSqlCommand() using the copy-pasted SQL (also with parameters), the problem magically disappears. Again, this reproduces very reliably: replacing the Any() call with the copy-pasted SQL improves performance by 2-3 orders of magnitude.
An attached profiler (dotTrace) sampling session shows that the threads all seem to spend their time in the same method call (profiler screenshot not reproduced here).
Is there anything I've missed or did we hit some ADO.NET / SQL Server cornercase?
MORE CONTEXT
The code running this query is a Hangfire job. For the purpose of the test, a script queues a lot of jobs to be performed, and up to 40 threads keep processing them. Each job uses a separate DbContext instance, and it's not really being used a lot. There are a few more queries before and after the problematic query, and they take the expected time to execute.
We're using many different Hangfire jobs for similar purposes and they behave as expected. Same with this one, except when it gets slow under high concurrency (of exact same jobs). Also, just switching to SQL on this particular query fixes the problem.
The profiling snapshot above is representative, all the threads slow down on this particular method call and spend the vast majority of their time on it.
UPDATE
I'm currently re-running a lot of those checks for sanity and errors. The easy reproduction means it's still on a remote machine to which I can't connect using VS for debugging.
One of the checks showed that my previous statement about free CPU was false: the CPU was not entirely overloaded, but multiple cores were in fact running at full capacity for the whole duration of the long-running jobs.
Re-checking everything again and will come back with updates here.
Can you try it as shown below and see whether there is any performance improvement or not?
Context.MyEntity.AsNoTracking()
.Any(se => se.SameEntity.Field == someValue
&& se.AnotherEntity.Field == anotherValue
&& se.SimpleField == simpleValue
);
Check if you are reusing the context in a loop. Doing so may create many objects during your performance test and give the garbage collector a lot of work to do.
Faulty initial assumptions. The SQL in the question was obtained by pasting the code into LINQPad and having it generate the SQL.
After attaching a SQL profiler to the actual DB in use, it showed slightly different SQL involving outer joins, which were suboptimal and didn't have a proper index in place.
It remains a mystery why LINQPad generated different SQL, even though it's using the same EntityFramework.dll, but the original problem is resolved and all that remains is to optimize the query.
Many thanks to everyone involved.
I have a large DataTable that contains user details. I need to complete the users' details in this table from several tables in the DB. I run through each row in the table, make several calls to different tables in the database using ADO.NET objects and methods, process and reorganize the results, and add them to the main table. It works fine, but too slowly...
My idea was to split the large table into a few small tables, run the CompleteAddressDetails method on a few threads simultaneously, and at the end merge the small tables into one result table. I have implemented this idea using the TPL Task object; the code is below. It works, but without any improvement in execution time.
Several questions:
1. Why is there no improvement in execution time?
2. What do I have to do in order to improve it?
Thank you for any advice!
resultTable1 = data.Clone();
resultTable2 = data.Clone();
resultTable3 = data.Clone();
resultTable4 = data.Clone();
resultTable5 = data.Clone();
DataTable[] tables = new DataTable[] { resultTable1, resultTable2, resultTable3, resultTable4, resultTable5 };
for (int i = 0; i < data.Rows.Count; i += 5)
{
for (int j = 0; j < 5; j++)
{
if (data.Rows.Count > i + j)
{
tables[j].Rows.Add(data.Rows[i + j].ItemArray);
}
}
}
Task[] taskArray = {Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable1)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable2)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable3)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable4)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable5))};
Task.WaitAll(taskArray);
When using multi-threaded parallelism without any performance benefit, there are basically two possibilities:
The code isn't CPU-bound, so throwing more CPUs on the task isn't going to help
The code uses too much synchronization to actually allow realistic parallel execution
In this case, 1 is likely the cause. Your code isn't doing enough CPU work to benefit from multi-threading. Most likely, you're simply waiting for the database to do the work.
It's hard to give any pointers without seeing what the CompleteAddressDetails method does - I assume it goes through all the rows one by one, and executes a couple of separate queries to fill in the details. Even if each individual query is fast enough, doing thousands of separate queries is going to hurt your performance no matter what you do - and especially so if those queries require locking some shared state in the DB.
First, think of a better way to fill in the details. Perhaps you can join some of those queries together, or maybe you can even load all of the rows at once. Second, try profiling the actual queries as they happen on the server. Find out if there's something you can do to improve their performance - say, by adding some indices, or by better using the existing ones.
There is no improvement because you can't code your way around how the SQL Server database handles your calls.
I would recommend using a User-Defined Table Type on SQL Server, a Stored Procedure that accepts this table type, and then just send the DataTable you have through to the Stored Procedure and do your processing in there. You'd then be able to optimize from there going forward.
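A hedged sketch of that approach; the type, procedure and parameter names below are invented for illustration, and the DataTable's column layout would have to match the table type:
// Server side (run once; names are hypothetical):
//   CREATE TYPE dbo.UserDetailsType AS TABLE (Id int, Address nvarchar(200) /* ... */);
//   CREATE PROCEDURE dbo.CompleteAddressDetailsBatch
//       @users dbo.UserDetailsType READONLY
//   AS BEGIN /* set-based joins that fill in the details */ END
using (var cnn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.CompleteAddressDetailsBatch", cnn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@users", data);   // 'data' is the original DataTable
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.UserDetailsType";
    cnn.Open();
    cmd.ExecuteNonQuery();
}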
The application presented a "Sequence Contains More Than One Entity" error. Knowing this is usually the result of a .SingleOrDefault() call in LINQ, I started investigating. I can verify that the production server has many instances of duplicate keywords, so that's where I begin.
I have the following Keyword table:
id INT (NOT NULL, PRIMARY KEY)
text NVARCHAR(512)
active INT
active is just a way to "enable/disable" data if the need strikes me. I'm using LINQ to SQL and have the following method implemented:
public Keyword GetKeyword(String keywordText)
{
return db.Keywords.SingleOrDefault(k => (k.text.ToUpper() == keywordText.ToUpper()));
}
The idea is that I attach keywords through an association table so that multiple objects can reference the same keyword. There should be no duplicate text entries within the Keyword table. This is not enforced on the database, but rather through code. This might not be best practice, but that is the least of my problems at the moment. So when I create my object, I do this:
Keyword keyword = GetKeyword(keywordText);
if(keyword == null)
{
keyword = new Keyword();
keyword.text = keywordText;
keyword.active = Globals.ACTIVE;
db.Keywords.InsertOnSubmit(keyword);
}
KeywordReference reference = new KeywordReference();
reference.keywordId = keyword.id;
myObject.KeywordReferences.Add(reference);
db.SubmitChanges();
This code is actually paraphrased; I use a repository pattern, so the relevant code would be much longer. However, I can assure you the code is working as intended, as I've tested it extensively. The problem seems to be happening at the database level.
So I run a few tests and manual queries on my test database and find that the keyword I passed in my tests is not in the database, yet if I do a JOIN, I see that it is. So I dig a little deeper and opt to manually scan through the whole list:
SELECT * FROM Keyword
returns 737 results. I decided I didn't want to look through them all, so I add ORDER BY text and still get 737 results. I look for the word I recently added, and it does not show up in the list.
Confused, I do a lookup for all keywordIds associated with the object I recently worked on and see a few with ids over 1,000 (the id is set to auto-increment by 1). Knowing that I never actually delete a keyword (hence the "active" column), all ids from 1 up to at least those numbers over 1,000 should be present, so I do another query:
SELECT * FROM Keyword ORDER BY id
returns 737 results, max ID stops at 737. So I try:
SELECT * FROM Keyword ORDER BY id DESC
returns 1308 rows
I've seen disparities like this before if there is no primary key or no unique identifier, but I have confirmed that the id column is, in fact, both. I'm not really sure where to go from here, and I now have the additional problem of about 4,000+ keywords on production that are duplicate, and several more objects that reference different instances of each.
If you can somehow get all your data, then your table is fine but the index is corrupted.
Try to rebuild index on your table:
ALTER INDEX ALL ON Your.Table
REBUILD WITH (FILLFACTOR = 80, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = ON);
In rare situations you can see such errors if your hard drive fails to write data due to power loss. Chkdsk should help.
I'd like to increase performance of very simple select and update queries of .NET & MSSQL 2k8.
My queries always select or update a single row. The DB tables have indexes on the columns I query on.
My test .NET code looks like this:
public static MyData GetMyData(int accountID, string symbol)
{
using (var cnn = new SqlConnection(connectionString))
{
cnn.Open();
var cmd = new SqlCommand("MyData_Get", cnn);
cmd.CommandType = CommandType.StoredProcedure;
cmd.Parameters.Add(CreateInputParam("#AccountID", SqlDbType.Int, accountID));
cmd.Parameters.Add(CreateInputParam("#Symbol", SqlDbType.VarChar, symbol));
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
var MyData = new MyData();
MyData.ID = (int)reader["ID"];
MyData.A = (int)reader["A"];
MyData.B = reader["B"].ToString();
MyData.C = (int)reader["C"];
MyData.D = Convert.ToDouble(reader["D"]);
MyData.E = Convert.ToDouble(reader["E"]);
MyData.F = Convert.ToDouble(reader["F"]);
return MyData;
}
}
return null; // no matching row was found
}
and the according stored procedure looks like this:
PROCEDURE [dbo].[MyData_Get]
@AccountID int,
@Symbol varchar(25)
AS
BEGIN
SET NOCOUNT ON;
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p WHERE p.AccountID = @AccountID AND p.Symbol = @Symbol
END
What I'm seeing is that if I run GetMyData in a loop, querying MyData objects, I'm not exceeding about 310 transactions/sec. I was hoping to achieve well over 1,000 transactions/sec.
On the SQL Server side, not really sure what I can improve for such a simple query.
ANTS profiler shows me that on the .NET side, as expected, the bottlenecks are cnn.Open and cmd.ExecuteReader; however, I have no idea how I could significantly improve my .NET code.
I've seen benchmarks, though, where people seem to easily achieve tens of thousands of transactions/sec.
Any advice on how I can significantly improve the performance for this scenario would be greatly appreciated!
Thanks,
Tom
EDIT:
Per MrLink's recommendation, adding "TOP 1" to the SELECT query improved performance from about 310 to about 585 transactions/sec.
EDIT 2:
Arash N suggested adding "WITH(NOLOCK)" to the SELECT query, and that dramatically improved the performance! I'm now seeing around 2,500 transactions/sec.
EDIT 3:
Another slight optimization that I just did on the .NET side gained me another 150 transactions/sec. Changing while(reader.Read()) to if(reader.Read()) surprisingly made quite a difference. On average I'm now seeing 2,719 transactions/sec.
Try using WITH(NOLOCK) in your SELECT statement to increase the performance. This would select the row without locking it.
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p WITH(NOLOCK) WHERE p.AccountID = @AccountID AND p.Symbol = @Symbol
Some things to consider.
First, you're not closing the server connection (cnn.Close();). Eventually it will get closed by the garbage collector, but until that happens you're creating a brand-new connection to the database every time rather than collecting one from the connection pool.
Second, do you have an index in SQL Server covering the AccountID and Symbol columns?
Third, while AccountID being an int is nice and fast, the Symbol column being varchar(25) is always going to be much slower. Can you change this to an int flag?
Make sure your database connections are actually pooling. If you are seeing a bottleneck in cnn.Open, there would seem to be a good chance they are not getting pooled.
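As a rough illustration of what to check (the connection string values are placeholders): with pooling working, only the first Open() should be expensive.
var connectionString =
    "Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=True;" +
    "Pooling=true;Min Pool Size=5;Max Pool Size=100;";   // pooling is on by default, but be explicit

var sw = System.Diagnostics.Stopwatch.StartNew();
using (var cnn = new SqlConnection(connectionString)) { cnn.Open(); }   // pays the real connect cost
long firstOpen = sw.ElapsedMilliseconds;
sw = System.Diagnostics.Stopwatch.StartNew();
using (var cnn = new SqlConnection(connectionString)) { cnn.Open(); }   // should come from the pool
Console.WriteLine("first open: {0} ms, pooled open: {1} ms", firstOpen, sw.ElapsedMilliseconds);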
I was hoping to achieve well over a 1000 transactions/sec [when running GetMyData in a loop]
What you are asking for is for GetMyData to run in less than 1ms - this is just pointless optimisation! At the bare minimum this method involves a round trip to the database server (possibly involving network access) - you wouldn't be able to make this method much faster if your query was SELECT 1.
If you have a genuine requirement to make more requests per second then the answer is either to use multiple threads or to buy a faster PC.
There is absolutely nothing wrong with your code - I'm not sure where you have seen people managing 10,000+ transactions per second, but I'm sure this must have involved multiple concurrent clients accessing the same database server rather than a single thread managing to execute queries in less than a 10th of a ms!
Is your method called frequently? Could you batch your requests so you can open your connection, create your parameters, get the result and reuse them for several queries before closing the whole thing up again?
If the data is not frequently invalidated (updated) I would implement a cache layer. This is one of the most effective ways (if used correctly) to gain performance.
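A minimal sketch of such a cache, assuming an (AccountID, Symbol) pair identifies exactly one row and that slightly stale data is acceptable (this sketch never invalidates entries):
// Keyed by "accountID|symbol"; each entry is loaded once and then served from memory.
private static readonly ConcurrentDictionary<string, MyData> _cache =
    new ConcurrentDictionary<string, MyData>();

public static MyData GetMyDataCached(int accountID, string symbol)
{
    string key = accountID + "|" + symbol;
    return _cache.GetOrAdd(key, _ => GetMyData(accountID, symbol));
}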
You could use output parameters instead of a select since there's always a single row.
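For example, a hedged sketch: it assumes the stored procedure is rewritten to return each column as an OUTPUT parameter; only ID is shown here, and CreateInputParam is the helper from the question.
var cmd = new SqlCommand("MyData_Get", cnn) { CommandType = CommandType.StoredProcedure };
cmd.Parameters.Add(CreateInputParam("@AccountID", SqlDbType.Int, accountID));
cmd.Parameters.Add(CreateInputParam("@Symbol", SqlDbType.VarChar, symbol));
var idParam = cmd.Parameters.Add("@ID", SqlDbType.Int);
idParam.Direction = ParameterDirection.Output;
// ...add one output parameter per remaining column (A..F)...
cmd.ExecuteNonQuery();                    // no result set, so no SqlDataReader to build
var myData = new MyData { ID = (int)idParam.Value };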
You could also create the SqlCommand ahead of time and re-use it. If you are executing a lot of queries in a single thread you can keep executing it on the same connection. If not, you could create a pool of them or do cmdTemplate.Clone() and set the Connection.
Try re-using the Command and Prepare'ing it first.
I can't say that it will definitely help, but it seems worth a try.
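A sketch of what that might look like, assuming you can keep one connection open across the loop ('requests' is a placeholder for your test loop, and Prepare() requires explicit sizes on variable-length parameters):
using (var cnn = new SqlConnection(connectionString))
{
    cnn.Open();
    var cmd = new SqlCommand(
        "SELECT TOP 1 p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p " +
        "WHERE p.AccountID = @AccountID AND p.Symbol = @Symbol", cnn);
    cmd.Parameters.Add("@AccountID", SqlDbType.Int);
    cmd.Parameters.Add("@Symbol", SqlDbType.VarChar, 25);
    cmd.Prepare();   // compile once, execute many times

    foreach (var request in requests)
    {
        cmd.Parameters["@AccountID"].Value = request.AccountID;
        cmd.Parameters["@Symbol"].Value = request.Symbol;
        using (var reader = cmd.ExecuteReader())
        {
            if (reader.Read()) { /* map to MyData as in the question */ }
        }
    }
}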
In no particular order...
Have you (or your DBAs) examined the execution plan your stored procedure is getting? Has SQL Server cached a bogus execution plan (either due to oddball parameters or old stats)?
How often are statistics updated on the database?
Do you use temp tables in your stored procedure? If so, are they created up front? If not, you'll be doing a lot of recompiles, as creating a temp table invalidates the execution plan.
Are you using connection pooling? Opening/closing a SQL Server connection is an expensive operation.
Is your table clustered on accountID and Symbol?
Finally...
Is there a reason you're hitting this table by account and symbol rather than, say, just retrieving all the data for an entire account in one fell swoop?