I have an importer process which is running as a windows service (debug mode as an application) and it processes various xml documents and csv's and imports into an SQL database. All has been well until I have have had to process a large amount of data (120k rows) from another table (as I do the xml documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is c# 2.0 using NHibernate as an ORM data is being pulled into the actual c# logic and processed then spat back into the same database along with logs into other tables.
The only other prob which sometimes happens instead is that for some reason a cursor is being opening on this massive table, which I can only assume is being generated from ADO.net the statement like exec sp_cursorfetch 180153005,16,113602,100 is being called thousands of times according to Profiler
When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything about the point where it hangs (is it an INSERT, a lookup SELECT, or what?)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.
Are you running this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try turning the recovery mode to Simple (non-logged) , or use the BCP methods to get data in (they bypass T logging)
Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.
I found out that it was nHibernate creating the cursor on the large table. I am yet to understand why, but in the mean time I have replaced the large table data access model with straight forward ado.net calls
Since you are rewriting it anyway, you may not be aware that you can call BCP directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting perforance info.
Related
I have the following scenario: I am building a dummy web app that pulls betting odds every minute, stores all the events, matches, odds etc. to the database and then updates the UI.
I have this structure: Sports > Events > Matches > Bets > Odds and I am using code first approach and for all DB-related operations I am using EF.
When I am running my application for the very first time and my database is empty I am receiving XML with odds which contains: ~16 sports, ~145 events, ~675 matches, ~17100 bets & ~72824 odds.
Here comes the problem: how to save all this entities in timely manner? Parsing is not that time consuming operation - 0.2 seconds, but when I try to bulk store all these entities I face memory problems and the save took more than 1 minute so next odd pull is triggered and this is nightmare.
I saw somewhere to disable the Configuration.AutoDetectChangesEnabled and recreate my context on every 100/1000 records I insert, but I am not nearly there. Every suggestion will be appreciated. Thanks in advance
When you are inserting huge (though it is not that huge) amounts of data like that, try using SqlBulkCopy. You can also try using Table Value Parameter and pass it to a stored procedure but I do not suggest it for this case as TVPs perform well for records under 1000. SqlBulkCopy is super easy to use which is a big plus.
If you need to do an update to many records, you can use SqlBulkCopy for that as well but with a little trick. Create a staging table and insert the data using SqlBulkCopy into the staging table, then call a stored procedure which will get records from the staging table and update the target table. I have used SqlBulkCopy for both cases numerous times and it works pretty well.
Furthermore, with SqlBulkCopy you can do the insertion in batches as well and provide feedback to the user, however, in your case, I do not think you need to do that. But nonetheless, this flexibility is there.
Can I do it using EF only?
I have not tried but there is this library you can try.
I understand your situation but:
All actions you've been doing it all depends on your machine specs and
the software itself.
Now if machine specs cannot handle the process it will be the time to
change a plan like to limit the count of records to be inserted till
it all to be done.
I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the application starts to slow down relative to the time it takes to process each item when the number of items to be processed increases.
I am trying to be very careful as far as freeing memory for allocated objects as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing for the next item. I make use of "using" structures for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for - if you're using "using" statements - are you making sure that you're not keeping your connection open while manipulating your objects? Best to make sure you get your data from the database, close the connection, do your manipulation and then send the data back.
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of you data manipulation inside of the stored procedure or in functions called from them - you do NOT want to offload your entire business classes to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers etc. you can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divy up your work - await/async is very helpful for this, especially if you want to have calculations while also sending previous data back to the server.
Thanks for all the input. I checked the issues of opening/closing connections, etc. to see that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were setup relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
Also, found performance issues in selecting with a filter from an in memory DataTable. Somehow what I was doing get bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this. But it did show me the need to micro-test each step of my logic when trying to drag so much data around.
From what I've read, there appears to be marginal performance benefits using stored procedures vs simply building the commands in C# and calling them explicitly in the program's code, at least when it comes to machines that share the server program and db engine (and when the procedures are simple). Most people seem to think it's a 'preference issue', and add a few other minor benefits to justify their case.
However, one I couldn't find any information on, is the benefit of a stored procedure when the database engine is located on a separate physical machine from the main application.
If I am not mistaken, in a server farm, wouldn't a stored procedure offload the processing on some cpu threads from the main server application, and have the primary processing done on the db engine server's cpu instead? Or, is this already done on the db engine's cpu anyways, when the C# libraries 'build' the information for the db engine to process?
Specifically, I have a long-running transaction that I could do multiple calls in a C# transaction block, but I suspect that a stored proc will in fact have a huge performance benefit by reducing the network calls to the db engine, as well as guaranteeing the processing is not being done on the main server application.
Is this true?
Performance gains from a stored procedure (versus something like Dapper or an OR/M like Entity Framework) can vary anywhere from nearly identical to a very noticeable performance improvement. I don't think your question can be answered without seeing the code that would be translated to a stored procedure.
Having said that, in my experience making a single stored procedure call versus multiple statements from the application code, yes, it would likely be faster.
If the SP is just a simple query (ie one SELECT statement) the performance gain is that a SP is pre-compiled. While the query is running you should not see any difference if it is a query or a SP.
I'm not sure of the effect if the SP is more complicated because this would depend on the query.
The more important benefit of a SP is that all the data are kept in the DBMS instead of being sent back and forward to the client. If you are dealing with large amount of data the benefit is more evident. The difference rises if your DB is located on a different machine and even more if the connection between them is slow.
On the contrary you must consider that a SP usually is not compiled to machine code so if the SP implements very complex logic it could be faster to implement the logic on the client.
Then you should also consider that moving the business logic to the server is not so good for code maintenance, you could add a technology debit implementing in the DB something that should be in your client code.
So, there isn't a solution valid for all the seasons but usually a well written SP is faster than the same code running on the client
There are a few issues at play here. As others have said, it kind of depends. A raw select statement will be barely noticeable. If there's a hugely complex query then a SP can save a lot of repetitive parsing. If there's a lot of intermediate data then SP will keep the data local, reducing network traffic. If your DB has a higher spec than the client it might run faster due to CPU horsepower.
Downsides can be things like bogging down the DB for everyone with processing that could be done on the client. This is generally if you're running an underpowered SQL server. Another subtle side to this is that licensing costs for a multi-core DB server can be impressive. Your $ per cycle on a SQL Server box can be many times what it is on your client.
I have an MSSQL 2008 table with a few million records. I need to iterate over each row, modify some of the data, and copy the updated record to a new table using a C# application that gets executed on a daily basis.
I have tried doing this using ADO.NET entities, but there are memory issues involved with this method, not to mention it is very slow. I have read up on bulk-copy libraries and SQL-only ways for copying one table to another, but none of them involve modifying records before copying them. I need to find a better way for performing this operation.
As you mention memory issues I'm guessing you're trying to load the million rows into memory, process them and then write them back to the database.
You can avoid this by 'streaming' the data instead of loading it entirely. The SqlDataReader will handle buffering for you so on the reading side you can do a simple WHILE loop that fetches rows one by one. The actual conversion you already have working it seems so all you need to do is take care of writing the results back into the database. IMHO the fastest way to do so is by storing a buffer of multiple results (start with 100, work up and see where the sweet spot is) in a data-table and then push that data-table into the database using the SqlBulkCopy class.
Rinse & repeat.
PS: Sounds like a 'fun' problem. Do you have any sample data sitting somewhere to test this out ? 5 hours sounds like a LONG time for something that looks trivial at first, then again 20 million times virtually nothing still adds up. More specifically I wonder how 'large' the data is on the RTF side : are the values ca 2k on average or rather 200k? And what kind of hardware do you run this on ?
The fastest performing option would be to re-write your C# application logic into a CLR stored procedure so that all processing takes place on the server.
Checking around the internet, it looks like Microsoft's official answer to converting rich to plain text is to load the data into a RichTextBox control and then pull it out with the RichTextBox.Text property. That sucks for a lot of reasons, but mostly because it means you're going to have to get your hands dirty. Your best bet is to write a small app that invokes the RichTextBox control and passes all of your data to/from the database (using the SqlDataReader should alleviate the memory issues you mentioned).
Just as a matter of process - I would suggest building an intermediary table that your "cleansed" data rows get dumped into before appending them to your production table. Once you get the stored proc figured out just right, you can create a trigger that automatically invokes your stored proc every time a record gets added to your dirty table. This will ultimately eliminate the need to run your program every day to move records, as the trigger will make sure it happens "on the fly".
Edit - one last thought
It occurred to me that you might not be comfortable writing stored procedures and triggers, which is ok. A more "programmatic" solution would be to kick all of the files in your dirty table out to a delimited text file, which can easily be downloaded and parsed. Once you have the text file, you could manipulate it with your app (read it, cleanse it, create a cleansed file..what have you) and then upload for reading back into your database. Depending on your comfort/background/skill level, this might actually be the better solution to get the job done.
Hope this helps!
Use SSIS. Schedule a daily job that does your transformation and runs the SSIS package. This will take care of batching and memory consumption, and will offer a few fast connectors for the read and write of data. You can embed your custom C# code (the RTF stripping into pure text) as an SSIS component, see Developing Custom Objects for Integration Services.
I am writing an application that logs status updates (GPS locations) from devices to a database. The updates occur at a set interval for each device, which is currently every 3 seconds. I'm using a simple table in SQL Server 08 for storing each update.
I've noticed that running the inserts is an area of slow down in my application. Its not a severe slow down, but noticable. Naturally, I'd like to write to the database in as an efficient way as possible. I have an idea to improve the performance and am looking for input and advice to see if it will help:
The status updates come in from an asynchronous Socket thread. In my current implementation, the database insert call is executed from this thread. I'm thinking I can create a queue for holding update data that the Socket thread can quickly add its update to and then go on its merry way. There would then be a separate thread whose sole responsibility would be checking the update queue and inserting the updates into the database.
Basically this whole process rests on the assumption that writing to the database from one location with a bunch of data all at once is more efficient than writing one row of data at a random time. Is my assumption correct, or way off base? Also, on the SQL side, is there a command to tell it to write a bunch of rows at once that would improve write performance?
This is how the database is being written to:
I'm using LinqToSQL in C#, so for each insert, I first create a DataContext instance. From the DataContext object I then call a stored procedure which inserts the location update.
The table is indexed by datetime, for the time of the update.
Have a look at the SqlBulkCopy class - this allows you to use BCP to insert chunks of data very quickly.
Also, make sure your indexes are efficient. If you have a clustered index on anything that does not increase sequentially (integer, date) then you will suffer performance slowdowns as the pages are filled up.
Have you looked MSMQ ( Microsoft Message Queuing (MSMQ)) ? That seems to me an option to take a look.
Yes, inserting in batches will typically be faster than separate inserts given your description. Each insert will require a connection to be set up and packets to be transferred. If you have a single small insert that takes one packet and you issue three of those, but you alternatively have three inserts that are small enough that they can all fit in one packet then it will help.
Quantifying it is difficult just based on your description - you'll need to do testing for that. For example, if you are keeping a dedicated connection open at all times anyway, as hova suggests, then you might see less of an impact.
Another area you might want to take a look at is whether you are setting up and tearing down a connection for each insert. That alone might make a performance improvement, negating the need for batching.
You'll also want to have as few indexes on the table as possible.
It sounds like a good idea. Why not give it a shot and see how it performs?
On the SQL side you'd want to have a look at making sure you are using parameterized queries.
Also batching your INSERT statements will certainly increase the performance.
Connection management is also key, of course that depends on how the application is built and whether it depends on a connection being there.
Are you not afraid to loose data while are you collecting data to batch copy?
I'm writing application doing the same. At start I will have to write data from 3,5k GPS devices. One device should send data each minute but it can send faster. Destination number of devices is 10,5k.
I'm wondering about inserting performance too. For now I'm saving received data to db on every packet using pure ADO.NET ICommand and stored procedure. On my test serwer (Xeon 3,4GHz and one 1TB hard disk - normal desktop ;) it takes for now 1ms or less.
#GRIMUS - should I wondering if there will be more devices?