I am writing a C# database program that reads data from one database, does calculations on the data, then writes the results to another database. This involves a large volume of data requiring repeated cycles of reads from the source database then writes to the destination database.
I am having problems with my memory being eaten up. I use dispose and clear functions where I can, but my research indicates that these don't really free the memory used by data tables and data connections.
One suggestion I found was to put the database calls within using structures. But for me this would mean opening and closing both data connections many times during a run. This seems rather inelegant and a sure way to make the program run slower. Also, the program may evolve to the point where I will also need to write data back to the source database.
The most straightforward way to structure my workflow is to keep both database connections open all the time. But this seems to be part of my problem.
Does anyone have suggestions as to program structure to help with memory usage?
Let's break this problem down a little.
Memory is being used to hold the data you retrieve from the database.
Clearing this data will fix your memory usage problem.
So the question is, how is your data stored?
Is it a list of objects? Is it in a dataset? Is it in some sort of ORM model?
If you can clear/flush/null/dispose of the DATA you should be good to go.
Instantiating a connection object inside a "using" statement only makes sure the connection object is disposed of; it doesn't explicitly clear any data you retrieved. (ORMs may dispose of the data inside of them, but not any copies you made by calling .ToList(), for example.)
For external resources like databases, the filesystem, etc., it is always best to follow a lock-late, release-early policy, and that is why you are probably getting feedback to open/close your connections as needed. But this won't help your memory issue. And it is completely fine to keep a database connection open for multiple commands.
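As a rough sketch (the table name, connection string, and ProcessRows method are made up), note how the "using" blocks scope the connection while the retrieved data has to be released separately:

    using System.Data;
    using System.Data.SqlClient;

    // The "using" blocks scope the connection/command/adapter; the DataTable is the part
    // that actually holds your rows and must be cleared when you are done with it.
    using (var conn = new SqlConnection(sourceConnectionString))
    using (var cmd = new SqlCommand("SELECT * FROM SourceTable", conn))
    using (var adapter = new SqlDataAdapter(cmd))
    {
        conn.Open();
        var table = new DataTable();
        adapter.Fill(table);

        ProcessRows(table);   // your calculations (hypothetical method)

        table.Clear();        // drop the rows...
        table.Dispose();      // ...and release the DataTable itself
    }
    // Nothing here keeps memory alive afterwards, provided no other field or list
    // still references "table" or rows copied out of it.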
Related
I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the time it takes to process each item increases as the number of items to be processed grows; in other words, the application slows down the further it gets into a run.
I am trying to be very careful as far as freeing memory for allocated objects as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing for the next item. I make use of "using" structures for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for - if you're using "using" statements - are you making sure that you're not keeping your connection open while manipulating your objects? Best to make sure you get your data from the database, close the connection, do your manipulation and then send the data back.
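A hedged sketch of that shape (the table names, columns, connection string, and Calculate method are placeholders): pull the data, let the connection close, do the work, then reconnect only to write.

    // Fetch: the adapter opens and closes the connection for us.
    var work = new DataTable();
    using (var conn = new SqlConnection(connStr))
    using (var adapter = new SqlDataAdapter("SELECT Id, Value FROM Items WHERE Batch = @b", conn))
    {
        adapter.SelectCommand.Parameters.AddWithValue("@b", batchId);
        adapter.Fill(work);
    }

    // Manipulate: no connection is held while we calculate.
    foreach (DataRow row in work.Rows)
        row["Value"] = Calculate(row);   // hypothetical calculation

    // Write back: open a fresh connection only for as long as the writes take.
    using (var conn = new SqlConnection(connStr))
    {
        conn.Open();
        foreach (DataRow row in work.Rows)
        {
            using (var cmd = new SqlCommand("INSERT INTO Results(Id, Value) VALUES(@id, @val)", conn))
            {
                cmd.Parameters.AddWithValue("@id", row["Id"]);
                cmd.Parameters.AddWithValue("@val", row["Value"]);
                cmd.ExecuteNonQuery();
            }
        }
    }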
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of your data manipulation inside of the stored procedures or in functions called from them - you do NOT want to offload your entire business classes to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers, etc.). You can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divvy up your work - await/async is very helpful for this, especially if you want to run calculations while also sending previous results back to the server.
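A minimal sketch of that overlap with await/async, where ReadItemAsync, Calculate, and WriteResultAsync stand in for your own data access and business logic (this loop lives inside an async method):

    // Start with a completed "previous write" so the first iteration has nothing to wait on.
    Task pendingWrite = Task.CompletedTask;

    foreach (var item in items)
    {
        var data = await ReadItemAsync(item);      // I/O-bound read
        var result = Calculate(data);              // CPU-bound work overlaps the write below
        await pendingWrite;                        // make sure the previous write has finished
        pendingWrite = WriteResultAsync(result);   // start this write, then move on to the next item
    }

    await pendingWrite;                            // flush the final write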
Thanks for all the input. I checked the issues of opening/closing connections, etc. to make sure that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were set up relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
Also, I found performance issues in selecting with a filter from an in-memory DataTable. Somehow what I was doing got bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this. But it did show me the need to micro-test each step of my logic when trying to drag so much data around.
I'm trying to improve upon this program that I wrote for work. Initially I was rushed, and they don't care about performance or anything. So, I made a horrible decision to query an entire database (a SQLite database), and then store the results in lists for use in my functions. However, I'm now considering having each of my functions threaded, and having the functions query only the parts of the database that they need. There are ~25 functions. My question is: is this safe to do? Also, is it possible to have that many concurrent connections? I will only be PULLING information from the database, never inserting or updating.
The way I've had it described to me[*] is to have each concurrent thread open its own connection to the database, as each connection can only process one query or modification at a time. The group of threads with their connections can then perform concurrent reads easily. If you've got a significant problem with many concurrent writes causing excessive blocking or failure to acquire locks, you're getting to the point where you're exceeding what SQLite does for you (and should consider a server-based DB like PostgreSQL).
Note that you can also have a master thread open the connections for the worker threads if that's more convenient, but it's advised (for your sanity's sake if nothing else!) to only actually use each connection from one thread.
[* For a normal build of SQLite. It's possible to switch things off at build time, of course.]
SQLite has no write concurrency, but it supports arbitrarily many connections that read at the same time.
Just ensure that every thread has its own connection.
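A minimal sketch, assuming the System.Data.SQLite provider, a read-only workload, and a database file named app.db:

    using System.Data.SQLite;
    using System.Threading.Tasks;

    // Each worker opens (and disposes) its own connection; connections are never shared across threads.
    static void RunQueryOnOwnConnection(string sql)
    {
        using (var conn = new SQLiteConnection("Data Source=app.db;Read Only=True"))
        using (var cmd = new SQLiteCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // consume the row here
                }
            }
        }
    }

    // e.g. run two of the ~25 read-only functions concurrently:
    // Parallel.Invoke(() => RunQueryOnOwnConnection(queryA), () => RunQueryOnOwnConnection(queryB));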
25 simultaneous connections is not a smart idea. That's a huge number.
I usually create a multi-layered design for this problem. I send all requests to the database through a kind of ObjectFactory class that has an internal cache. The ObjectFactory will forward the request to a ConnectionPoolHandler and will store the results in its cache. This connection pool handler uses X simultaneous connections but dispatches them to several threads.
However, some remarks must be made before applying this design. You first have to ask yourself the following 2 questions:
Is your application the only application that has access to this database?
Is your application the only application that modifies data in this database?
If the first question is answered negatively, then you could encounter locking issues. If the second question is answered negatively, then it will be extremely difficult to apply caching. You may even prefer not to implement any caching at all.
Caching is especially interesting in case you are often requesting objects based on a unique reference, such as the primary key. In that case you can store the most often used objects in a Map. A popular collection for caching is an "LRUMap" ("Least-Recently-Used" map). The benefit of this collection is that it automatically keeps the most often used objects at the top. At the same time it has a maximum size and automatically removes items from the map that are rarely used.
A second advantage of caching is that each object exists only once. For example:
An Employee is fetched from the database.
The ObjectFactory converts the resultset to an actual object instance.
The ObjectFactory immediately stores it in cache.
A bit later, a bunch of employees are fetched using an SQL "... where name like 'John%'" statement.
Before converting the resultset to objects, the ObjectFactory first checks if the IDs of these records are perhaps already stored in cache.
Found a match! Aha, this object does not need to be recreated.
There are several advantages to having a certain object only once in memory.
Last but not least, in Java there is something like "Weak References". These are references that can in fact be cleaned up by the garbage collector. I am not sure if it exists in C# or what it's called. By implementing this, you don't even have to care about the maximum number of cached objects; your garbage collector will take care of it.
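(For the record, C# does have this: System.WeakReference and the generic WeakReference<T>.) A minimal sketch of a weak cache, where Employee and LoadEmployeeFromDb are stand-ins for your own types:

    // Cache entries hold weak references, so the GC is free to collect an Employee
    // once nothing else in the application references it.
    private readonly Dictionary<int, WeakReference<Employee>> _cache =
        new Dictionary<int, WeakReference<Employee>>();

    public Employee GetEmployee(int id)
    {
        WeakReference<Employee> weak;
        Employee cached;
        if (_cache.TryGetValue(id, out weak) && weak.TryGetTarget(out cached))
            return cached;                           // still alive: reuse the single instance

        var employee = LoadEmployeeFromDb(id);       // hypothetical data access call
        _cache[id] = new WeakReference<Employee>(employee);
        return employee;
    }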
I have an application used to import a large dataset (millions of records) from one database to another, doing a diff in the process (i.e. removing things that were deleted, updating things, etc.). Due to many foreign key constraints and such, and to try to speed up the processing of the application, it loads up the entire destination database into memory and then tries to load up parts of the source database and does an in-memory compare, updating the destination in memory as it goes. In the end it writes these changes back to the destination. The databases do not match one to one, so a single table in one may be multiple tables in the other, etc.
So to my question: it currently takes hours to run this process (sometimes close to a day depending on the amount of data added/changed) and this makes it very difficult to debug. Historically, when we encounter a bug, we have made a change, and then rerun the app which has to load all of the data into memory again (taking quite some time) and then run the import process until we get to the part we were at and then we cross our fingers and hope our change worked. This isn't fun :(
To speed up the debugging process I am making an architectural change by moving the import code into a separate dll that is loaded into a separate appdomain so that we can unload it, make changes, and reload it and try to run a section of the import again, picking up where we left off, and seeing if we get better results. I thought that I was a genius when I came up with this plan :) But it has a problem. I either have to load up all the data from the destination database into the second appdomain and then, before unloading, copy it all to the first using the [Serializable] deal (this is really really slow when unloading and reloading the dll) or load the data in the host appdomain and reference it in the second using MarshalByRefObject (which has turned out to make the whole process slow it seems)
So my question is: How can I do this quickly? Like, a minute max! I would love to just copy the data as if it was just passed by reference and not have to actually do a full copy.
I was wondering if there was a better way to implement this so that the data could better be shared between the two or at least quickly passed between them. I have searched and found things recommending the use of a database (we are loading the data in memory to AVOID the database) or things just saying to use MarshalByRefObject. I'd love to do something that easy but it hasn't really worked yet.
I read somewhere that loading a C++ dll or unmanaged dll will cause it to ignore app domains and could introduce some problems. Is there any way I could use this to my advantage, i.e., load an unmanaged dll that holds my list for me or something, and use it to trick my application into using the same memory area for both appdomains so that the lists just stick around when I unload the other dll by unloading the app domain?
I hope this makes sense. It's my first question on here so if I've done a terrible job do help me out. This has frustrated me for a few days now.
The app domains approach is a good way of separating things for the sake of loading/unloading only part of your application. Unfortunately, as you discovered, exchanging data between two app domains is not easy/fast. It is just like two different system processes trying to communicate, which will always be slower than same-process communication. So the way to go is to use the quickest possible inter-process communication mechanism. Skip WCF as it adds overhead you do not need here. Use named pipes, through which you can stream data very fast. I have used them before with good results. To go even faster you can try MemoryMappedFile (link) but that's more difficult to implement. Start with named pipes and if that is too slow go for memory mapped files.
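A rough sketch of the named-pipe idea (the pipe name, Record type, and its fields are made up; one side runs in the host, the other in the reloadable appdomain):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Pipes;

    class Record { public int Id; public string Payload; }   // stand-in for your row type

    // Host side: stream the in-memory records to whoever connects.
    static void SendRecords(List<Record> records)
    {
        using (var server = new NamedPipeServerStream("import-data", PipeDirection.Out))
        {
            server.WaitForConnection();
            using (var writer = new BinaryWriter(server))
            {
                writer.Write(records.Count);
                foreach (var r in records)
                {
                    writer.Write(r.Id);
                    writer.Write(r.Payload);
                }
            }
        }
    }

    // Worker side: read the records back without marshalling a huge object graph.
    static List<Record> ReceiveRecords()
    {
        using (var client = new NamedPipeClientStream(".", "import-data", PipeDirection.In))
        {
            client.Connect();
            using (var reader = new BinaryReader(client))
            {
                int count = reader.ReadInt32();
                var result = new List<Record>(count);
                for (int i = 0; i < count; i++)
                    result.Add(new Record { Id = reader.ReadInt32(), Payload = reader.ReadString() });
                return result;
            }
        }
    }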
Even when using fast way of sending, you may hit another bottleneck - data serialization. For large amounts of data, standard serialization (even binary) is very slow. You may want to look at Google's protocol buffers.
One word of caution on AppDomain - any uncaught exception in one of the app domains brings the whole process down. They are not that separated, unfortunately.
On a side note: I do not know what your application does, but millions of records does not seem that excessive. Maybe there is room for optimization?
You didn't say if it were SQL Server, but did you look at using SSIS for doing this? There are evidently some techniques that can make it fast with big data.
I'm running a number of threads which each attempt to perform INSERTs into one SQLite database. Each thread creates its own connection to the DB. They each create a command, open a transaction, perform some INSERTs, and then close the transaction. It seems that the second thread to attempt anything gets the following SQLiteException: The database file is locked. I have tried unwrapping the INSERTs from the transaction, as well as narrowing the scope of the INSERTs contained within each commit, with no real effect; subsequent access to the db file raises the same exception.
Any thoughts? I'm stumped and I'm not sure where to look next...
Update your insertion code so that if it encounters an exception indicating database lock, it waits a bit and tries again. Increase the wait time by random increments each time (the "random backoff" algorithm). This should allow the threads to each grab the global write lock. Performance will be poor, but the code should work without significant modification.
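A minimal sketch of that retry loop, assuming the System.Data.SQLite provider (adjust the exception type to whichever wrapper you actually use):

    using System;
    using System.Data.SQLite;
    using System.Threading;

    // Retry the command when the database reports it is locked, waiting a little
    // longer (plus a random component) before each new attempt.
    static void ExecuteWithRetry(SQLiteCommand cmd, int maxAttempts = 10)
    {
        var rng = new Random();
        int delayMs = 50;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                cmd.ExecuteNonQuery();
                return;
            }
            catch (SQLiteException)   // e.g. "database is locked"
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(delayMs + rng.Next(delayMs));
                delayMs *= 2;   // back off further each time
            }
        }
    }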
However, SQLite is not appropriate for highly-concurrent modification. You have two permanent solutions:
Move to a "real" database, such as PostgreSQL or MySQL
Serialize all your database modifications through one thread, to avoid contention for SQLite's single write lock.
Two things to check:
1) Confirm that your version of SQLite was compiled with THREAD support
2) Confirm that you are not opening the database EXCLUSIVE
I was not doing this in C#, but rather on Android, but I got around this "database is locked" error by keeping the SQLite database open, within the wrapper class that owns it, for the entire lifetime of that wrapper class. Each insert done within this class can then run on its own thread (because, depending on your data storage situation, SD card versus device memory, etc., DB writing could take a long time), and I even tried throttling it, making about a dozen insert threads at once, and each one was handled very well because the insert method didn't have to worry about opening/closing a DB.
I'm not sure if a persistent DB life-cycle is considered good style, though (it may be considered bad in most cases), but for now it's working pretty well.
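A rough C# analogue of that idea, sketched with System.Data.SQLite and an invented Samples table: one connection lives as long as the wrapper, and a lock serializes the writes.

    using System;
    using System.Data.SQLite;

    // Holds one open connection for its whole lifetime and lets many threads call Insert;
    // the lock ensures only one write hits SQLite at a time.
    public sealed class SqliteWriter : IDisposable
    {
        private readonly SQLiteConnection _conn;
        private readonly object _writeLock = new object();

        public SqliteWriter(string connectionString)
        {
            _conn = new SQLiteConnection(connectionString);
            _conn.Open();
        }

        public void Insert(string name, int value)   // columns are illustrative
        {
            lock (_writeLock)
            {
                using (var cmd = new SQLiteCommand(
                    "INSERT INTO Samples(Name, Value) VALUES(@n, @v)", _conn))
                {
                    cmd.Parameters.AddWithValue("@n", name);
                    cmd.Parameters.AddWithValue("@v", value);
                    cmd.ExecuteNonQuery();
                }
            }
        }

        public void Dispose()
        {
            _conn.Dispose();
        }
    }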
I have an importer process which is running as a Windows service (debug mode as an application) and it processes various XML documents and CSVs and imports them into a SQL database. All has been well until I had to process a large amount of data (120k rows) from another table (as I do the XML documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is C# 2.0 using NHibernate as an ORM. Data is being pulled into the actual C# logic and processed, then spat back into the same database, along with logs written into other tables.
The only other problem which sometimes happens instead is that for some reason a cursor is being opened on this massive table, which I can only assume is being generated by ADO.NET. A statement like exec sp_cursorfetch 180153005,16,113602,100 is being called thousands of times according to Profiler.
When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything about the point where it hangs (is it an INSERT, a lookup SELECT, or what?)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.
Are you running this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try switching the recovery model to Simple (minimal logging), or use the BCP methods to get data in (they can be minimally logged).
Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.
I found out that it was NHibernate creating the cursor on the large table. I have yet to understand why, but in the meantime I have replaced the data access for the large table with straightforward ADO.NET calls.
Since you are rewriting it anyway, you may not be aware that you can call BCP directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting performance info.
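A hedged sketch of what that looks like (the "Results" table, its columns, the connection string, and the source DataTable are placeholders):

    using System.Data;
    using System.Data.SqlClient;

    using (var conn = new SqlConnection(destinationConnectionString))
    using (var bulk = new SqlBulkCopy(conn))
    {
        conn.Open();
        bulk.DestinationTableName = "Results";
        bulk.BatchSize = 10000;              // commit in chunks rather than one huge batch
        bulk.BulkCopyTimeout = 0;            // no timeout for long-running loads
        bulk.ColumnMappings.Add("Id", "Id");
        bulk.ColumnMappings.Add("Value", "Value");
        bulk.WriteToServer(sourceDataTable); // sourceDataTable holds the rows you computed
    }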