Is there a batch size limit to CreateBatchWrite() - c#

When attempting to upload ~30,000 users into a DynamoDB table using the Amazon.DynamoDBv2 wrapper for .NET, not all records made it, yet there was no exception either.
var userBatch = _context.CreateBatchWrite<Authentication_User>();
userBatch.AddPutItems(users);
userBatch.ExecuteAsync();
Approximately 2,500-ish records were written to the table. Has anyone found a limit on the number or size of batch inserts?

From the documentation (emphasis mine):
When using the object persistence model, you can specify any number of operations in a batch.

I exited the process right after the return from ExecuteAsync(); that was the problem. When I let it run, I can see the data slowly build up. In the bulk insert I therefore used a Task.Wait(), because there was nothing else to do until the records had been loaded. Iridium's answer above also assisted me with a second issue, which revolved around ProvisionedThroughput exceptions.
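For reference, a minimal sketch of the working pattern (same types as in the question); the only change is that the returned Task is awaited, so the process stays alive until DynamoDB has accepted every batch:
var userBatch = _context.CreateBatchWrite<Authentication_User>();
userBatch.AddPutItems(users);
await userBatch.ExecuteAsync();   // or userBatch.ExecuteAsync().Wait() from synchronous code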


How to troubleshoot SqlException deadlocked on lock | communication buffer resources

There are varying versions of this question on stackoverflow already, but none of them helped me to get to the bottom of my issue. So, here I go again with more specific details of my problem.
We've been randomly getting "Transaction (Process ID xx) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction." Let me be clear: this is not row or table level locking. I've tried enough guessed/random things; I need an exact step-by-step guide on how to troubleshoot a deadlock on communication buffer resources.
If you are interested in the specific details, read on.
Specific details of the scenario: We have a very simple Dapper ORM based C# .NET Core Web API that takes in requests and performs CRUD operations against a database hosted on this Microsoft SQL Server. To do this, a connection manager (registered as a scoped service) opens a new IDbConnection in request scope; this connection is used to execute deletes, inserts, updates, and gets. For insert/update/delete the C# line looks like this: await connection.ExecuteAsync("<Create update or delete statement>", entity); For GET requests we simply run await connection.QueryFirstOrDefaultAsync<TEntity>("<select statement>", entity); There are 5 types of entity (all representing simple non-relational tables). They are all CRUD'd by ID.
What has been tried so far
MAXDOP=1 query hint on SQL statements
Ensuring only 1 entity CRUD at given point in time for the one kind of entity.
Restarting SQL server/application instance
Ensuring ports/RAM/CPU/network bandwidth are not exhausted
Alter DATABASE XXXXXX SET READ_COMMITTED_SNAPSHOT ON/OFF
Keeping transactions as small as possible
Persistent retry policy as a workaround (to handle random transient nature of the issue)
Single thread per entity type
Server Specifications:
We have Microsoft SQL Server 2016 on Azure, hosted in a virtual machine with 64 cores and 400GB RAM. The usual workload on this server is 10% CPU and 30% RAM; occasionally it goes up to 80% CPU and 350GB RAM. Whenever this issue occurred, CPU usage was under 20% (mostly around 10%, 20% on only one occasion) and RAM was under 30% on all occasions.
Deadlock XML Event as per @Dan Guzman's request
The file size was too large for this post, so I created this Google Drive file. Please click the following link, then in the top right corner click download. It is a zip file.
https://drive.google.com/file/d/1oZ4dT8Yrd2uW2oBqBy9XK_laq7ftGzFJ/view?usp=sharing
@DanGuzman helped, so I upvoted/accepted his answer. But I'd like to summarize what went on here, what I learned, and a step-by-step approach to troubleshooting a deadlock on communication buffer resources (or any deadlock for that matter).
Step - 1
Pull the deadlock report. I used the following query, but you could also use the query @DanGuzman suggested (in the comments on this question).
SELECT
xed.value('@timestamp', 'datetime2(3)') as CreationDate,
xed.query('.') AS XEvent
FROM
(
SELECT CAST([target_data] AS XML) AS TargetData
FROM sys.dm_xe_session_targets AS st
INNER JOIN sys.dm_xe_sessions AS s
ON s.address = st.event_session_address
WHERE s.name = N'system_health'
AND st.target_name = N'ring_buffer'
) AS Data
CROSS APPLY TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData (xed)
ORDER BY CreationDate DESC
Step - 2
Locate the deadlock event corresponding to the timing/data of your SQL exception, then read the report in conjunction with the Detecting and Ending Deadlocks guide to understand the root cause of your deadlock issue. In my case the deadlock was on communication buffer resources, so per that guide, memory (the Memory section of the Detecting and Ending Deadlocks guide) must have been causing the problem. As Dan pointed out, the following query appeared in my deadlock report, and it was using far too much buffer as a result of being an inefficient query. So what is a deadlock on communication buffer? If a query needs a lot of buffer to finish executing, two such queries can start at the same time, each claim part of the buffer they need, and then reach a point where the remaining buffer is not enough, so each has to wait for buffer freed up by the completion of other queries. Both queries end up waiting on each other in the hope that more buffer will be freed, which can lead to a deadlock on buffer resources (as per the Memory section of the guide).
<inputbuf>
@SomeStatus1 nvarchar(4000),@ProductName nvarchar(4000),@ProductNameSide nvarchar(4000),@BayNo nvarchar(4000),@CreatedDateTime datetime,@EffectiveDate datetime,@ForSaleFrom datetime,@ForSaleTo datetime,@SetupInfoNode nvarchar(4000),@LocationNumber nvarchar(4000),@AverageProductPrice decimal(3,2),@NetAverageCost decimal(3,1),@FocustProductType nvarchar(4000),@IsProduceCode nvarchar(4000),@ActivationIndicator nvarchar(4000),@ResourceType nvarchar(4000),@ProductIdentifierNumber nvarchar(4000),@SellingStatus nvarchar(4000),@SectionId nvarchar(4000),@SectionName nvarchar(4000),@SellPriceGroup nvarchar(4000),@ShelfCapacity decimal(1,0),@SellingPriceTaxExclu decimal(2,0),@SellingPriceTaxInclu decimal(2,0),@UnitToSell nvarchar(4000),@VendorNumber nvarchar(4000),@PastDate datetime,@PastPrice decimal(29,0))
UPDATE dbo.ProductPricingTable
SET SellingPriceTaxExclu = @SellingPriceTaxExclu, SellingPriceTaxInclu = @SellingPriceTaxInclu,
SellPriceGroup = @SellPriceGroup,
ActivationIndicator = @ActivationIndicator,
IsProduceCode = @IsProduceCode,
EffectiveDate = @EffectiveDate,
NetCos
</inputbuf>
Step 3 (The Fix)
Wait, but I used Dapper, so how could it turn my statement into such a deadly query? Dapper is great for most situations with its out-of-the-box defaults; however, in my situation the default nvarchar(4000) parameter type clearly killed it (please read Dan's answer to understand how such a query causes the problem). As Dan noted, I had automatic parameter building from the input entity, like this: await connection.ExecuteAsync("<Create update or delete statement>", entity);, where entity is an instance of a C# model class. I changed it to custom parameters as shown below (for the sake of simplicity I only added one parameter, but you could add all that are required).
var parameters = new DynamicParameters();
parameters.Add("Reference", entity.Reference, DbType.AnsiString, size: 18 );
await connection.ExecuteAsync("<Create update or delete statement>", parameters );
I can see in the profiler that the requests now use parameter types that exactly match the column types. That's it; this fix made the problem go away. Thanks Dan.
Conclusion
I could conclude that in my case deadlock on communication buffer occurred because of a bad query that took too much buffer to execute. This was the case because I blindly used default Dapper parameter builder. Using Dapper's custom parameter builder solved the problem.
Deadlocks are often a symptom that query and index tuning is needed. Below is an example query from the deadlock trace that suggests the root cause of the deadlocks:
<inputbuf>
@SomeStatus1 nvarchar(4000),@ProductName nvarchar(4000),@ProductNameSide nvarchar(4000),@BayNo nvarchar(4000),@CreatedDateTime datetime,@EffectiveDate datetime,@ForSaleFrom datetime,@ForSaleTo datetime,@SetupInfoNode nvarchar(4000),@LocationNumber nvarchar(4000),@AverageProductPrice decimal(3,2),@NetAverageCost decimal(3,1),@FocustProductType nvarchar(4000),@IsProduceCode nvarchar(4000),@ActivationIndicator nvarchar(4000),@ResourceType nvarchar(4000),@ProductIdentifierNumber nvarchar(4000),@SellingStatus nvarchar(4000),@SectionId nvarchar(4000),@SectionName nvarchar(4000),@SellPriceGroup nvarchar(4000),@ShelfCapacity decimal(1,0),@SellingPriceTaxExclu decimal(2,0),@SellingPriceTaxInclu decimal(2,0),@UnitToSell nvarchar(4000),@VendorNumber nvarchar(4000),@PastDate datetime,@PastPrice decimal(29,0))
UPDATE dbo.ProductPricingTable
SET SellingPriceTaxExclu = @SellingPriceTaxExclu, SellingPriceTaxInclu = @SellingPriceTaxInclu,
SellPriceGroup = @SellPriceGroup,
ActivationIndicator = @ActivationIndicator,
IsProduceCode = @IsProduceCode,
EffectiveDate = @EffectiveDate,
NetCos
</inputbuf>
Although the SQL statement text is truncated, it does show that all parameter declarations are nvarchar(4000) (a common problem with ORMs). This may prevent indexes from being used efficiently when column types referenced in join/where clauses are different, resulting in full scans that lead to deadlocks during concurrent queries.
Change the parameter types to match that of the referenced columns and check the execution plan for efficiency.
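For illustration, a hedged sketch of what that looks like with Dapper (the column names appear in the trace above, but the sizes and types here are assumptions and must be taken from the real table definition):
using System.Data;
using System.Threading.Tasks;
using Dapper;

public static async Task UpdatePricingAsync(IDbConnection connection, ProductPricing entity) // ProductPricing is a hypothetical model class
{
    var parameters = new DynamicParameters();
    parameters.Add("ProductIdentifierNumber", entity.ProductIdentifierNumber, DbType.AnsiString, size: 18); // e.g. varchar(18)
    parameters.Add("SellPriceGroup", entity.SellPriceGroup, DbType.AnsiString, size: 10);                   // e.g. varchar(10)
    parameters.Add("ActivationIndicator", entity.ActivationIndicator, DbType.AnsiString, size: 1);          // e.g. varchar(1)
    parameters.Add("EffectiveDate", entity.EffectiveDate, DbType.DateTime);

    await connection.ExecuteAsync(
        @"UPDATE dbo.ProductPricingTable
             SET SellPriceGroup = @SellPriceGroup,
                 ActivationIndicator = @ActivationIndicator,
                 EffectiveDate = @EffectiveDate
           WHERE ProductIdentifierNumber = @ProductIdentifierNumber",
        parameters);
}
With matching parameter types the optimizer can seek on the referenced columns instead of scanning through implicit conversions, which is what removes the scans (and the associated memory pressure) described above.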

Figuring out what process is running on my SQL that is being called by my c# code

I am working on a .NET nopCommerce application where I have around 5 million+ rows in the database and I need to query all of that data for extraction. But the data from SQL is never returned to my code, while my GC heap keeps growing (it goes beyond 1 GB); yet when I run the same stored procedure in SQL Server, after providing the respective parameters, it takes less than 2 minutes. I need to somehow figure out why the call from my code is taking so much time.
nopCommerce uses Entity Framework libraries to call the database's stored procedures, but that call is not async, so I am trying to call the stored procedure in an async way using this function:
await dbcontext.Database.SqlQuery<TEntity>(commandText, parameters).ToListAsync();
As per my research from another SO post, ToListAsync() turns this call into an async one, so the task is handed back to the task library.
Now I need to figure out 3 things that I'm currently unable to:
1) I need to figure out whether that thread is running in the background. I assume it is, as the GC heap keeps growing, but I'm just not sure; below is a screenshot of how I tried to check that using the Diagnostics tool in Visual Studio:
2) I need to make sure SQL Server is giving enough time to the database calls from my code. I tried the following queries, but they don't show me any value for the process running the particular data export initiated by my code.
I tried this query:
select top 50
sum(qs.total_worker_time) as total_cpu_time,
sum(qs.execution_count) as total_execution_count,
count(*) as number_of_statements,
qs.plan_handle
from
sys.dm_exec_query_stats qs
group by qs.plan_handle
order by sum(qs.total_worker_time) desc
also tried this one:
SELECT
r.session_id
,st.TEXT AS batch_text
,SUBSTRING(st.TEXT, statement_start_offset / 2 + 1, (
(
CASE
WHEN r.statement_end_offset = - 1
THEN (LEN(CONVERT(NVARCHAR(max), st.TEXT)) * 2)
ELSE r.statement_end_offset
END
) - r.statement_start_offset
) / 2 + 1) AS statement_text
,qp.query_plan AS 'XML Plan'
,r.*
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
ORDER BY cpu_time DESC
Also, when I use sp_who or sp_who2, the statuses of the processes for my database stay 'runnable', like this (together with the CPU and DiskIO values shown):
3) I need to know: what if my DB call has completed successfully, but mapping the rows to the relevant list is what is taking a lot of time?
I would very much appreciate someone pointing me in the right direction, maybe help me with a query that can help me see the right results, or help me with viewing the running background threads and their status or maybe helping me with learning more about viewing the GC or threads and CPU utilization in a better way.
Any help will be highly appreciated. Thanks
A couple of diagnostic things to try:
Try adding a top 100 clause to the select statement, to see if there's a problem in the communication layer, or in the data mapper.
How much data is being returned by the stored procedure? If the procedure is returning more than a million rows, you may not be querying the data you mean.
Have you tried running it both synchronously and asynchronously?
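One way to separate the SQL execution time from the materialization/mapping time (point 3 in the question) is to bypass EF for a test run and stream the procedure with a plain SqlDataReader, timing the two phases independently. This is only a diagnostic sketch; the connection string, procedure name, and parameters are placeholders:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;

static void TimeProcedure(string connectionString)
{
    var sw = Stopwatch.StartNew();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.ExportProcedure", conn))   // hypothetical procedure name
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.CommandTimeout = 0;                                     // no timeout for a long-running export
        // cmd.Parameters.AddWithValue(...) for the real parameters goes here
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            Console.WriteLine($"First result ready after {sw.Elapsed}");
            long rows = 0;
            while (reader.Read())
                rows++;                                             // swap in the real mapping once the raw timings are known
            Console.WriteLine($"{rows} rows streamed in {sw.Elapsed}");
        }
    }
}
If the reader finishes quickly, the time is going into materializing 5 million entities into a list; if it does not, the problem is server-side or in the network.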

SqlDataReader get specific ResultSet

When a query returns multiple results, we iterate among them using NextResult() on the SqlDataReader. This way we access the result sets sequentially.
Is there any way to access the results in a random / non-sequential way, for example jump first to the third result set, then to the first, etc.?
I am searching for something like rdr.GetResult(1), or a workaround.
Since I was asked why I want something like this:
First of all, I have no access to the query and so I cannot change it, so in my client I will have the results in the sequence the server writes/produces them.
To process (build collections, entities --> business logic) the first one, I need the information from both the second and the third one.
Again, since it is not an option to modify some of the code, I cannot (without writing a lot of code) 'store' the connection info (e.g. ids) in order to connect the two result sets in a later step.
The most 'elegant' solution (for sure not the only one) is to process the result sets in a non-sequential way. So that is why I am asking if there is such a way.
Update 13/6
While Jeroen Mostert's answer gives a thoughtful explanation of why, Think2ceCode1ce's answer shows the right direction for a workaround. The link in the comments shows how an additional DataSet could be utilized to work in an async way. IMHO this would be the way to go if I were writing a general solution. However, in my case I based my solution on the nature of my data and the logic behind them. In short: (1) I read the data as they come, sequentially, using the SqlDataReader; (2) while reading the result set that is first in order but second in logic, I store some of the data I need in a dictionary and a collection; (3) while reading the result set that is third in order but first in logic, I iterate through the collection I built earlier and, based on the dictionary data, build my final result.
The final code seems more efficient, and is more maintainable, than using the async DataAdapter. However, this is a very specific solution based on my data.
Provides a way of reading a forward-only stream of rows from a SQL
Server database.
https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader(v=vs.110).aspx
You need to use a DataAdapter for disconnected and non-sequential access. To use this you just have to change a bit of your ADO.NET code.
Instead of
SqlDataReader sqlReader = sqlCmd.ExecuteReader();
You need
DataTable dt = new DataTable();
SqlDataAdapter sqlAdapter = new SqlDataAdapter(sqlCmd);
sqlAdapter.Fill(dt);
If your SQL returns multiple result sets, you would use DataSet instead of DataTable, and then access result sets like ds.Tables[index_or_name].
https://msdn.microsoft.com/en-us/library/bh8kx08z(v=vs.110).aspx
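For example, a sketch assuming sqlCmd already contains the multi-result batch: filling a DataSet buffers every result set on the client, after which they can be read in any order:
using System.Data;
using System.Data.SqlClient;

DataSet ds = new DataSet();
using (SqlDataAdapter sqlAdapter = new SqlDataAdapter(sqlCmd))
{
    sqlAdapter.Fill(ds);            // all result sets are pulled into memory here
}

DataTable third = ds.Tables[2];     // jump straight to the third result set
DataTable first = ds.Tables[0];     // then back to the first

foreach (DataRow row in third.Rows)
{
    // build the lookups/entities that the first result set depends on
}
The trade-off is memory: everything is held client-side, which the next answer also points out.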
No, this is not possible. The reason why is quite elementary: if a batch returns multiple results, it must return them in order -- the statement that returns result set #2 does not run before the one that returns result set #1, nor does the client have any way of saying "please just skip that first statement entirely" (as that could have dire consequences for the batch as a whole). Indeed, there's not even any way in general to tell how many result sets a batch will produce -- all of this is done at runtime, the server doesn't know in advance what will happen.
Since there's no way, server-side, to skip or index result sets, there's no meaningful way to do it client-side either. You're free to ignore the result sets streamed back to you, but you must still process them in order before you can move on -- and once you've moved on, you can't go back.
There are two possible global workarounds:
If you process all data and cache it locally (with a DataAdapter, for example) you can go back and forth in the data as you please, but this requires keeping all data in memory.
If you enable MARS (Multiple Active Result Sets) you can execute another query even as the first one is still processing. This does require splitting up your existing single batch code into individual statements (which, if you really can't change anything about the SQL at all, is not an option), but you could go through result sets at will (without caching). It would still not be possible for you to "go back" within a single result set, though.
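A minimal sketch of the MARS route, assuming the batch could be split into its individual statements (the connection string and statements here are placeholders): with MultipleActiveResultSets=True, several readers can be open on the same connection at once, so they can be consumed in whatever order the client needs:
using System.Data.SqlClient;

var connectionString =
    "Server=.;Database=MyDb;Integrated Security=true;MultipleActiveResultSets=True";

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Each statement from the original batch becomes its own command.
    using (var cmdThird = new SqlCommand("SELECT Id, Name FROM dbo.ThirdSet", conn))   // hypothetical statements
    using (var cmdFirst = new SqlCommand("SELECT Id, Info FROM dbo.FirstSet", conn))
    using (var readerThird = cmdThird.ExecuteReader())
    using (var readerFirst = cmdFirst.ExecuteReader())
    {
        while (readerThird.Read()) { /* process the logically-first data */ }
        while (readerFirst.Read()) { /* then the rest */ }
    }
}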

Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?
I've created a simple Repository method to save a record:
public void SaveOffer(Offer offer)
{
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
// add new offer
if (dbOffer == null)
{
this.db.Offers.InsertOnSubmit(offer);
}
//update existing offer
else
{
dbOffer = offer;
}
this.db.SubmitChanges();
}
But using this method, the program is much slower than inserting the data using ADO.NET SQL inserts (new SqlConnection, new SqlCommand for the select-if-exists, new SqlCommand for the update/insert).
On 100k CSV rows it takes about an hour, vs 1 minute or so the ADO.NET way. For 2M CSV rows, ADO.NET took about 20 minutes, while LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
Updates/Edits:
After 18 hours, the LINQ version had added just ~200K rows.
I've tested the import with plain LINQ inserts too, and it is also really slow compared with ADO.NET. I haven't seen a big difference between just inserts/SubmitChanges and selects/updates/inserts/SubmitChanges.
I still have to try batch commit, manually connecting to the db and compiled queries.
SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.
While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.
See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx
It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx that uses extension methods to fix linq's inability to batch inserts and updates etc.
Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?
Transactions help throughput by reducing the needs for fsync()'s, and delaying db.SubmitChanges will reduce the number of .NET<->db roundtrips.
Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.
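A rough sketch of that suggestion, using the names from the question (the batch size is an arbitrary assumption): accumulate inserts and call SubmitChanges once per block inside a transaction, instead of once per row:
using System.Transactions;

const int batchSize = 500;
int pending = 0;

using (var scope = new TransactionScope())
{
    foreach (Offer offer in offersFromCsv)        // offersFromCsv: the parsed CSV rows (hypothetical name)
    {
        db.Offers.InsertOnSubmit(offer);
        if (++pending % batchSize == 0)
            db.SubmitChanges();                   // one round-trip per batch instead of per row
    }
    db.SubmitChanges();                           // flush the remainder
    scope.Complete();
}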
Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.
You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ
Good luck !
I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).
I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).
I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.
Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta - pointless if you've only changed one row. Only call SubmitChanges in batches, not per record.
Alex gave the best answer, but I think a few things are being overlooked.
One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, then the DataContext will repeatedly open and close it itself. However, if you open it yourself, and then close it yourself when you're absolutely finished, things will run a lot faster since it won't have to reconnect to the database every time. I found this out when trying to find out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
A few other areas where you could speed things up:
While LINQ to SQL doesn't support straight-up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.
If live data integrity isn't super crucial, you could retrieve the list of offer_id values from the server before you start checking whether an offer already exists. This could significantly reduce the number of times you call the server to look up an item that isn't even there (see the sketch below).
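A sketch of that pre-fetch idea, using the names from the question (and assuming offer_id is an int): load the existing IDs once into a HashSet so the per-row existence check never leaves the process:
using System.Collections.Generic;
using System.Linq;

// One round-trip for every existing key, instead of one SELECT per CSV row.
HashSet<int> existingIds = new HashSet<int>(db.Offers.Select(o => o.offer_id));

foreach (Offer offer in offersFromCsv)            // offersFromCsv: the parsed CSV rows (hypothetical name)
{
    if (!existingIds.Contains(offer.offer_id))
    {
        db.Offers.InsertOnSubmit(offer);
        existingIds.Add(offer.offer_id);          // guard against duplicate ids within the file
    }
    // rows that already exist still need the update path from the original SaveOffer
}
db.SubmitChanges();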
Why not pass an Offer[] into that method and do all the changes in memory before submitting them to the database? Or you could submit in groups so you don't run out of cache. The main thing is how long you wait before sending the data over; the biggest time waster is the closing and opening of the connection.
Converting this to a compiled query is the easiest way I can think of to boost your performance here:
Change the following:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
to:
Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);
// Declared once, e.g. as a static field; OffersDataContext stands for your generated (dbml) data context class:
private static readonly Func<OffersDataContext, int, Offer> RetrieveOffer =
    CompiledQuery.Compile((OffersDataContext context, int offerId) =>
        context.Offers.SingleOrDefault(o => o.offer_id == offerId));
This change alone will not make it as fast as your ADO.NET version, but it will be a significant improvement, because without the compiled query you are dynamically building the expression tree every time you run this method.
As one poster already mentioned, you must refactor your code so that SubmitChanges is called only once if you want optimal performance.
Do you really need to check whether the record exists before inserting it into the DB? It looks strange to me, as the data comes from a CSV file.
P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.
I don't think it defeats the purpose at all. Why would it? Just fill a simple dataset with all the data from the CSV and do a SqlBulkCopy. I did a similar thing with a collection of 30,000+ rows and the import time went from minutes to seconds.
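A sketch of that approach (the column names and Transform helper are hypothetical): run the transformations while filling a DataTable in memory, then hand the whole table to SqlBulkCopy:
using System.Data;
using System.Data.SqlClient;

var table = new DataTable();
table.Columns.Add("offer_id", typeof(int));
table.Columns.Add("price", typeof(decimal));      // add the remaining Offer columns here

foreach (var csvRow in csvRows)                   // csvRows: the parsed CSV (hypothetical name)
{
    var offer = Transform(csvRow);                // your existing transformation logic
    table.Rows.Add(offer.offer_id, offer.price);
}

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Offers" })
    {
        bulk.WriteToServer(table);
    }
}
The transformations still happen in C#; only the final write is bulk-copied, which is where the time goes.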
I suspect it isn't the insert or update operations that are taking a long time, but rather the code that determines whether your offer already exists:
Offer dbOffer = this.db.Offers.SingleOrDefault (
o => o.offer_id == offer.offer_id);
If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
Usually, when not using LINQ to SQL, you would have an insert/update procedure or SQL script that determines whether the record you pass already exists. You're doing this expensive operation in LINQ, which can never hope to match the speed of native SQL (which is what happens when you use a SqlCommand and select whether the record exists), especially when looking up on a primary key.
Well, you must understand that LINQ creates the ADO code dynamically for every operation you do, instead of it being handwritten, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, ADO.NET code will always be faster, depending on how you write it.
I don't know whether LINQ will try to reuse its last statement or not; if it does, then separating the insert batch from the update batch may improve performance a little.
This code runs OK and prevents a large amount of pending changes from accumulating:
if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
repository2.GeoItems.SubmitChanges();
}
Then, at the end of the bulk insertion, use this:
repository2.GeoItems.SubmitChanges();

How can I get a percentage of LINQ to SQL submitchanges?

I wonder if anyone else has asked a similar question.
Basically, I have a huge tree I'm building up in RAM using LINQ objects, and then I dump it all in one go using DataContext.SubmitChanges().
It works, but I can't find a way to give the user a visual indication of how far the query has progressed so far. If I could ultimately implement a sort of progress bar, that would be great, even if there is a minimal loss in performance.
Note that I have quite a large amount of rows to put into the DB, over 750,000 rows.
I haven't timed it exactly, but it does take a long while to put them in.
Edit: I thought I'd better give some indication of what I'm doing.
Basically, I'm building a suffix tree from the Lord of the Rings. Thus, there are a lot of Nodes, and certain Nodes have positions associated to them (Nodes that happen to be at the end of a suffix). I am building the Linq objects along these lines.
suffixTreeDB.NodeObjs.InsertOnSubmit(new NodeObj()
{
NodeID = 0,
ParentID = 0,
Path = "$"
});
After the suffix tree has been fully generated in RAM (which only takes a few seconds), I then call suffixTreeDB.SubmitChanges();
What I'm wondering is if there is any faster way of doing this. Thanks!
Edit 2: I did a stopwatch run, and apparently it takes precisely 6 minutes for the DB to be written.
I suggest you divide the calls you are doing, as they are sent as separate calls to the db anyway. This will also reduce the size of the transaction (which LINQ creates when calling SubmitChanges).
If you divide them into 10 blocks of 75,000, you can provide a rough estimate on a 1/10 scale.
Update 1: After re-reading your post and your new comments, I think you should take a look at SqlBulkCopy instead. If you need to improve the time of the operation, that's the way to go. Check this related question/answer: What's the fastest way to bulk insert a lot of data in SQL Server (C# client)
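A sketch of that chunking idea, using the names from the question (it assumes the nodes are first collected in a plain list instead of being InsertOnSubmit'd as they are generated; the chunk size is arbitrary):
using System;
using System.Linq;

const int chunkSize = 75000;
int written = 0;

foreach (var chunk in allNodes                     // allNodes: List<NodeObj> built in RAM (hypothetical name)
    .Select((node, index) => new { node, index })
    .GroupBy(x => x.index / chunkSize, x => x.node))
{
    suffixTreeDB.NodeObjs.InsertAllOnSubmit(chunk);
    suffixTreeDB.SubmitChanges();                  // one smaller transaction per block

    written += chunk.Count();
    Console.WriteLine($"{written * 100 / allNodes.Count}% written");
}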
I was able to get percentage progress for ctx.SubmitChanges() by using ctx.Log and an ActionTextWriter:
ctx.Log = new ActionTextWriter(s => {
if (s.StartsWith("INSERT INTO"))
insertsCount++;
ReportProgress(insertsCount);
});
more details are available at my blog post
http://epandzo.wordpress.com/2011/01/02/linq-to-sql-ctx-submitchanges-progress/
and stackoverflow question
LINQ to SQL SubmitChangess() progress
This isn't ideal, but you could create another thread that periodically queries the table you're populating to count the number of records that have been inserted. I'm not sure how/if this will work if you are running in a transaction though, since there could be locking/etc.
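A sketch of that polling idea (the connection string and table name are placeholders, and as noted it may not see rows that are still inside an open transaction):
using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

static Task ReportProgressAsync(string connectionString, int expectedRows, CancellationToken token)
{
    return Task.Run(async () =>
    {
        while (!token.IsCancellationRequested)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.NodeObjs WITH (NOLOCK)", conn))   // NOLOCK so the poll is not blocked by the insert transaction
            {
                conn.Open();
                int count = (int)cmd.ExecuteScalar();
                Console.WriteLine($"{count * 100L / expectedRows}% inserted");
            }
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    });
}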
What I really think I need is a form of bulk insert; however, it appears that LINQ doesn't support it.
