I have an application used to import a large dataset (millions of records) from one database to another, doing a diff in the process (i.e. removing things that were deleted, updating things, etc.). Because of the many foreign key constraints, and to try to speed up processing, it loads the entire destination database into memory, then loads parts of the source database and does an in-memory compare, updating the destination in memory as it goes. In the end it writes these changes back to the destination. The databases do not match one to one, so a single table in one may be multiple tables in the other, etc.
So to my question: it currently takes hours to run this process (sometimes close to a day depending on the amount of data added/changed) and this makes it very difficult to debug. Historically, when we encounter a bug, we have made a change, and then rerun the app which has to load all of the data into memory again (taking quite some time) and then run the import process until we get to the part we were at and then we cross our fingers and hope our change worked. This isn't fun :(
To speed up the debugging process I am making an architectural change by moving the import code into a separate dll that is loaded into a separate appdomain, so that we can unload it, make changes, reload it and try to run a section of the import again, picking up where we left off, and seeing if we get better results. I thought that I was a genius when I came up with this plan :) But it has a problem. I either have to load up all the data from the destination database into the second appdomain and then, before unloading, copy it all to the first using the [Serializable] deal (this is really really slow when unloading and reloading the dll), or load the data in the host appdomain and reference it in the second using MarshalByRefObject (which seems to have made the whole process slow).
So my question is: How can I do this quickly? Like, a minute max! I would love to just copy the data as if it was just passed by reference and not have to actually do a full copy.
I was wondering if there was a better way to implement this so that the data could better be shared between the two or at least quickly passed between them. I have searched and found things recommending the use of a database (we are loading the data in memory to AVOID the database) or things just saying to use MarshalByRefObject. I'd love to do something that easy but it hasn't really worked yet.
I read somewhere that loading a C++ dll or unmanaged dll will cause it to ignore app domains and could introduce some problems. Is there any way I could use this to my advantage, i.e. load an unmanaged dll that holds my list for me or something, and use it to trick my application into using the same memory area for both appdomains, so that the lists just stick around when I unload the other dll by unloading the app domain?
I hope this makes sense. It's my first question on here so if I've done a terrible job do help me out. This has frustrated me for a few days now.
The app domain approach is a good way of separating things so that you can load and unload only part of your application. Unfortunately, as you discovered, exchanging data between two app domains is neither easy nor fast. It is much like two separate system processes trying to communicate, which will always be slower than communication within a single process. So the way to go is to use the quickest possible inter-process communication mechanism. Skip WCF, as it adds overhead you do not need here. Use named pipes, through which you can stream data very fast. I have used them before with good results. To go even faster you can try MemoryMappedFile, but that's more difficult to implement. Start with named pipes, and if that is too slow go for memory mapped files.
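To make the named pipe suggestion concrete, here is a minimal sketch using System.IO.Pipes. It runs the writer and the reader in one process purely for illustration; the pipe name, the record layout (an int id plus a string) and the PipeStreamingSketch class are all made up for the example and are not from the question.

```csharp
using System;
using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;

public static class PipeStreamingSketch
{
    // Streams `count` hypothetical records through a named pipe and
    // returns how many arrived intact on the reading side.
    public static int StreamRecords(string pipeName, int count)
    {
        // Writing side: stands in for the domain/process that owns the data.
        Task writerTask = Task.Run(() =>
        {
            using (var pipe = new NamedPipeServerStream(pipeName, PipeDirection.Out))
            {
                pipe.WaitForConnection();
                using (var writer = new BinaryWriter(pipe))
                {
                    writer.Write(count);                 // header: number of records
                    for (int i = 0; i < count; i++)
                    {
                        writer.Write(i);                 // assumed field: id
                        writer.Write("record-" + i);     // assumed field: name
                    }
                }
            }
        });

        // Reading side: stands in for the reloadable import code.
        int received = 0;
        using (var pipe = new NamedPipeClientStream(".", pipeName, PipeDirection.In))
        {
            pipe.Connect();
            using (var reader = new BinaryReader(pipe))
            {
                int expected = reader.ReadInt32();
                for (int i = 0; i < expected; i++)
                {
                    int id = reader.ReadInt32();
                    string name = reader.ReadString();
                    if (name == "record-" + id) received++;
                }
            }
        }

        writerTask.Wait();   // surfaces any exception from the writing side
        return received;
    }
}
```

In a real setup the two halves would live in different app domains or processes, and the record layout would have to match whatever the import actually holds in memory.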
Even with a fast way of sending the data, you may hit another bottleneck: data serialization. For large amounts of data, standard serialization (even binary) is very slow. You may want to look at Google's protocol buffers.
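To illustrate what replacing the standard serializer can look like, here is a rough sketch of a hand-written binary layout using BinaryWriter and BinaryReader. This is not protocol buffers, just the same general idea of a compact, schema-specific format; the ImportRow type and its fields are assumptions, since the real record types are not shown in the question.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical record type; the real import rows are not shown in the post.
public struct ImportRow
{
    public int Id;
    public string Name;
    public decimal Amount;
}

public static class ImportRowCodec
{
    // Writes a row count followed by each row's fields in a fixed order.
    public static byte[] Serialize(IList<ImportRow> rows)
    {
        using (var buffer = new MemoryStream())
        using (var writer = new BinaryWriter(buffer))
        {
            writer.Write(rows.Count);
            foreach (ImportRow row in rows)
            {
                writer.Write(row.Id);
                writer.Write(row.Name ?? string.Empty);   // a null name is stored as ""
                writer.Write(row.Amount);
            }
            writer.Flush();
            return buffer.ToArray();
        }
    }

    // Reads the rows back in exactly the order Serialize wrote them.
    public static List<ImportRow> Deserialize(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            int count = reader.ReadInt32();
            var rows = new List<ImportRow>(count);
            for (int i = 0; i < count; i++)
            {
                var row = new ImportRow();
                row.Id = reader.ReadInt32();
                row.Name = reader.ReadString();
                row.Amount = reader.ReadDecimal();
                rows.Add(row);
            }
            return rows;
        }
    }
}
```

Whether this beats the built-in serializers by enough to matter depends on the data, so it is worth timing both on a realistic sample before committing to either.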
One word of caution on AppDomain - any uncaught exception in one of the app domains brings the whole process down. They are not that separated, unfortunately.
On a side note: I do not know what your application does, but millions of records does not seem that excessive. Maybe there is room for optimization?
You didn't say if it were SQL Server, but did you look at using SSIS for doing this? There are evidently some techniques that can make it fast with big data.
I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the time it takes to process each item goes up as the number of items to be processed increases.
I am trying to be very careful as far as freeing memory for allocated objects as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing for the next item. I make use of "using" structures for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for - if you're using "using" statements - are you making sure that you're not keeping your connection open while manipulating your objects? Best to make sure you get your data from the database, close the connection, do your manipulation and then send the data back.
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of your data manipulation inside of the stored procedure or in functions called from them - you do NOT want to offload your entire business classes to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers, etc.). You can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divvy up your work - await/async is very helpful for this, especially if you want to run calculations while also sending previous data back to the server.
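As a rough sketch of the async/await idea, the helper below computes the next item while the previous result is still being written. The names OverlappedWriter, compute and writeAsync are hypothetical; in a real application writeAsync would wrap whatever asynchronous database call is in use.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OverlappedWriter
{
    // Computes each result while the previous write is still in flight.
    // At most one write is outstanding at a time, so results are written
    // in the same order as the input items.
    public static async Task RunAsync<TIn, TOut>(
        IEnumerable<TIn> items,
        Func<TIn, TOut> compute,
        Func<TOut, Task> writeAsync)
    {
        Task pendingWrite = Task.CompletedTask;
        foreach (TIn item in items)
        {
            TOut result = compute(item);        // overlaps with the previous write
            await pendingWrite;                 // wait for it and surface its errors
            pendingWrite = writeAsync(result);  // start writing this result
        }
        await pendingWrite;                     // wait for the last write
    }
}
```

This only pays off if the write really is asynchronous I/O; if the calculation is CPU-heavy and should not block the UI thread of a WinForms app, the whole call can be pushed onto a worker with Task.Run.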
Thanks for all the input. I checked the issues of opening/closing connections, etc. to see that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were setup relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
Also, I found performance issues in selecting with a filter from an in-memory DataTable. Somehow what I was doing got bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this. But it did show me the need to micro-test each step of my logic when trying to drag so much data around.
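For anyone who does need repeated filtered lookups against an in-memory DataTable, one common alternative to filtering the table again for every item is to build a dictionary index once and look rows up by key. This is only a sketch of that general technique, not what the poster ended up doing; the DataTableIndex name and the assumption that the key column holds int values are mine.

```csharp
using System.Collections.Generic;
using System.Data;

public static class DataTableIndex
{
    // Groups the rows of `table` by the int value in `keyColumn`, so later
    // lookups are a dictionary hit instead of a scan over every row.
    // Assumes the column is of type int and contains no DBNull values.
    public static Dictionary<int, List<DataRow>> BuildIndex(DataTable table, string keyColumn)
    {
        var index = new Dictionary<int, List<DataRow>>();
        foreach (DataRow row in table.Rows)
        {
            int key = (int)row[keyColumn];
            List<DataRow> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<DataRow>();
                index[key] = bucket;
            }
            bucket.Add(row);
        }
        return index;
    }
}
```

The index is built in one pass over the table, after which each lookup no longer depends on the total row count.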
I've been working on an internal developer tool on and off for a few weeks now, but I'm running into an ugly stumbling block I haven't managed to find a good solution for. I'm hoping someone can offer some ideas or guidance on the best ways to use the existing frameworks in .NET.
Background: the purpose of this tool is to load multiple different types of log files (Windows Event Log, IIS, SQL trace, etc.) to the same database table so they can be sorted and examined together. My personal goal is to make the entire thing streamlined so that we only make a single pass and do not cache the entire log either in memory or to disk. This is important when log files reach hundreds of MB or into the GB range. Fast performance is good, but slow and unobtrusive (allowing you to work on something else in the meantime) is better than running faster but monopolizing the system in the process, so I've focused on minimizing RAM and disk usage.
I've iterated through a few different designs so far trying to boil it down to something simple. I want the core of the log parser--the part that has to interact with any outside library or file to actually read the data--to be as simple as possible and conform to a standard interface, so that adding support for a new format is as easy as possible. Currently, the parse method returns an IEnumerable<Item> where Item is a custom struct, and I use yield return to minimize the amount of buffering.
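For illustration, the shape described above might look roughly like the following. The fields of Item, the ILogParser interface name and the tab-separated format are all assumptions made for the sketch; the real formats go through the vendor libraries mentioned below.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical fields; the post only says Item is a custom struct.
public struct Item
{
    public DateTime Timestamp;
    public string Source;
    public string Message;
}

// Assumed name for the standard interface every format parser implements.
public interface ILogParser
{
    IEnumerable<Item> Parse(TextReader input);
}

// Toy parser for lines of the form "timestamp<TAB>source<TAB>message".
public sealed class TabSeparatedParser : ILogParser
{
    public IEnumerable<Item> Parse(TextReader input)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            string[] parts = line.Split(new[] { '\t' }, 3);
            if (parts.Length < 3) continue;   // skip malformed lines

            DateTime timestamp;
            if (!DateTime.TryParse(parts[0], CultureInfo.InvariantCulture,
                                   DateTimeStyles.RoundtripKind, out timestamp))
                continue;                     // skip lines with an unreadable timestamp

            // yield return hands back one item at a time, so only the
            // current line is ever held in memory.
            yield return new Item { Timestamp = timestamp, Source = parts[1], Message = parts[2] };
        }
    }
}
```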
However, we quickly run into some ugly constraints: the libraries provided (generally by Microsoft) to process these file formats. The biggest and ugliest problem: one of these libraries only works in 64-bit. Another one (Microsoft.SqlServer.Management.Trace TraceFile for SSMS logs) only works in 32-bit. As we all know, you can't mix and match 32- and 64-bit code. Since the entire point of this exercise is to have one utility that can handle any format, we need to have a separate child process (which in this case is handling the 32-bit-only portion).
The end result is that I need the 64-bit main process to start up a 32-bit child, provide it with the information needed to parse the log file, and stream the data back in some way that doesn't require buffering the entire contents to memory or disk. At first I tried using stdout, but that fell apart with any significant amount of data. I've tried using WCF, but it's really not designed to handle the "service" being a child of the "client", and it's difficult to get them synchronized backwards from how they want to work, plus I don't know if I can actually make them stream data correctly. I don't want to use a mechanism that opens up unsecured network ports or that could accidentally crosstalk if someone runs more than one instance (I want that scenario to work normally--each 64-bit main process would spawn and run its own child). Ideally, I want the core of the parser running in the 32-bit child to look the same as the core of a parser running in the 64-bit parent, but I don't know if it's even possible to continue using yield return, even with some wrapper in place to help manage the IPC. Is there any existing framework in .NET that makes this relatively easy?
WCF does have a P2P mode; however, if all your processes are on the local machine you are better off with IPC such as named pipes, since the latter runs in kernel mode and does not have the messaging overhead of the former.
Failing that, you could try COM, which should not have a problem talking between 32- and 64-bit processes.
In case anyone stumbles across this, I'll post the solution that we eventually settled on. The key was to redefine the inter-process WCF service interface to be different from the intra-process IEnumerable interface. Instead of attempting to yield return across process boundaries, we stuck a proxy layer in between that uses an enumerator, so we can call a "give me an item" method over and over again. It's likely this has more performance overhead than a true streaming solution, since there's a method call for every item, but it does seem to get the job done, and it doesn't leak or consume memory.
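A stripped-down sketch of that proxy idea, with the WCF plumbing left out: the service side wraps an enumerator behind a "give me an item" call, and the client side turns repeated calls back into an IEnumerable. The names IItemSource, EnumeratorItemSource and ItemClient are invented for the example and are not the actual contract we used.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the inter-process service contract:
// one call per item instead of a streamed sequence.
public interface IItemSource<T>
{
    bool TryGetNext(out T item);
}

// Service side: wraps the parser's IEnumerable in an enumerator.
public sealed class EnumeratorItemSource<T> : IItemSource<T>, IDisposable
{
    private readonly IEnumerator<T> _enumerator;

    public EnumeratorItemSource(IEnumerable<T> source)
    {
        _enumerator = source.GetEnumerator();
    }

    public bool TryGetNext(out T item)
    {
        if (_enumerator.MoveNext())
        {
            item = _enumerator.Current;
            return true;
        }
        item = default(T);
        return false;
    }

    public void Dispose()
    {
        _enumerator.Dispose();
    }
}

// Client side: rebuilds a lazy sequence from repeated calls, so the
// calling code can keep consuming an IEnumerable as before.
public static class ItemClient
{
    public static IEnumerable<T> ReadAll<T>(IItemSource<T> service)
    {
        T item;
        while (service.TryGetNext(out item))
        {
            yield return item;
        }
    }
}
```

A variation that returns a small batch per call would cut the number of round trips, at the cost of a little buffering.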
We did follow Micky's suggestion of using named pipes, but still within WCF. We're also using named semaphores to coordinate the two processes, so we don't attempt to make service calls until the "child service" has finished starting up.
I have a C# application that generates data every second (stock tick data), which can be discarded after each iteration.
I would like to pass this data to a ColdFusion (10) application. I have considered having the C# application write the data to a file every second and having the ColdFusion application read that data, but this is most likely going to cause issues, with the potential for both applications trying to read or write the file at the same time?
I was wondering if using Memory Mapped Files would be a better approach? If so, how could I access the memory mapped file from ColdFusion?
Any advice would be greatly appreciated. Thanks.
We have produced a number of stock applications that include tick by tick tracking of watchlists, charting etc. I think the idea of a file is probably not a great idea unless you are talking about a single stock with regular intervals. In my experience a change every "second" is probably way understating the case. Some stocks (AAPL or GOOG are good examples) have hundreds of "ticks" per second during peak times.
So if you are NOT taking every tick but really are "updating the file" every 1 second then your idea has some merit in that you could use a file watching gateway to fire events for you and "see" that the file is updated.
But keep in mind that you are in effect introducing something "in the middle". A file now stands between your Java or CF applications and the quote engine. That's going to introduce latency no matter what you choose to do (file handles getting and releasing etc). And the locks of one process may interfere with the other.
When you are dealing with Facebook updates milliseconds don't really matter much - in spite of all the teenage girls who probably disagree with me :) With stock quotes however, half of the task is shaving off milliseconds to get your processes as close to real time as possible.
Our choice is usually to choose sockets instead of something in the middle bridging the data. The quote engine then keeps its watchlist and updates its arrays like normal but also sends any updates downstream to the socket engine, which pushes them to something that can handle them (a chart application, watchlist, socket gateway for a webpage etc).
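To give a feel for the socket approach on the C# side, here is a small loopback sketch that pushes ticks as text lines over TCP. The line-per-tick format and the TickPushSketch name are assumptions, and the receiving half is written in C# only so the example is self-contained; in your case the ColdFusion side would be the socket client.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class TickPushSketch
{
    // Pushes each tick as one text line over a loopback TCP connection
    // and returns the lines seen by the receiving side.
    public static List<string> PushAndReceive(IEnumerable<string> ticks)
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);   // port 0: let the OS pick a free port
        listener.Start();
        try
        {
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;

            // Sending side: stands in for the quote engine.
            Task sender = Task.Run(() =>
            {
                using (TcpClient connection = listener.AcceptTcpClient())
                using (var writer = new StreamWriter(connection.GetStream()))
                {
                    foreach (string tick in ticks)
                    {
                        writer.WriteLine(tick);
                    }
                }
            });

            // Receiving side: reads until the sender closes the connection.
            var received = new List<string>();
            using (var client = new TcpClient())
            {
                client.Connect(IPAddress.Loopback, port);
                using (var reader = new StreamReader(client.GetStream()))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        received.Add(line);
                    }
                }
            }

            sender.Wait();   // surfaces any exception from the sending side
            return received;
        }
        finally
        {
            listener.Stop();
        }
    }
}
```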
Hope this helps - it's not a clear answer but more of a clarification to the hurdles you face.
I work on a big project at my company. We collect data which we get via the API methods of the CMS.
ex.
DataSet users = CMS.UserHelper.GetLoggedUser(); // returns dataset with users
Now on some pages we need many different kinds of data: not just users, but also nodes of the CMS tree or specific data from a subtree.
So we thought of writing our own "helper class" from which we can later get the different data easily.
ex:
MyHelperClass.GetUsers();
MyHelperClass.Objects.GetSingleObject( ID );
Now the problem is that our "Helper Class" is really big, and we would like to collect different data from it and write it into a typed dataset. Later we can give a repeater that typed dataset, which contains data from different tables (which in turn comes via the API from the methods I showed above).
Problem is: It is so slow now, even at loading the page! Does it load or init the whole class??
By the way CMS is Kentico if anyone works with it.
I'm tired. I tried the whole night, but it's soooo slow. Please take a look at the architecture.
Maybe you will find some crimes which are not allowed :S
I hope we get it work faster. Thank you.
Architecture diagram: http://img705.imageshack.us/img705/3087/classj.jpg
Bottlenecks usually come in a few forms:
Slow or flakey network.
Heavy reading/writing to disk, as disk IO is 1000s of times slower than reading or writing to memory.
CPU throttle caused by long-running or inefficiently implemented algorithm.
Lots of things could affect this, including your database queries and indexes, the number of people accessing your site, lack of memory on your web server, lots of reflection in your code, just plain slow hardware etc. No one here can tell you why your site is slow, you need to profile it.
For what it's worth, you asked a question about your API architecture -- from a code point of view, it looks fine. There's nothing wrong with copying fields from one class to another, and the performance penalty incurred by wrapper class casting from object to Guid or bool is likely to be so tiny that it's negligible.
Since you asked about performance, it's not very clear why you're connecting class architecture to performance. There are really really tiny micro-optimizations you could apply to your classes which may or may not affect performance -- but the four or five nanoseconds you'll gain with those micro-optimizations have already been lost simply by reading this answer. Network latency and DB queries will absolutely dwarf the performance subtleties of your API.
In a comment, you stated "so there is no problem with static classes or a basic mistake of me". Performance-wise, no. From a web-app point of view, probably. In particular, static fields are global and initialized once per AppDomain, not per session -- the variables mCurrentCultureCode and mcurrentSiteName sound session-specific, not global to the AppDomain. I'd double-check those to see your site renders correctly when users with different culture settings access the site at the same time.
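A toy illustration of the static-field concern, with the web framework left out entirely: StaticSiteState mimics a static field such as mCurrentCultureCode, while RequestState is a hypothetical per-request holder (in ASP.NET that role would typically be played by something like Session or HttpContext.Items). The class names are made up for the sketch.

```csharp
using System.Collections.Generic;

// Static state: one copy per AppDomain, shared by every request.
public static class StaticSiteState
{
    public static string CurrentCultureCode;   // mimics mCurrentCultureCode
}

// Per-request state: each request carries its own copy.
public sealed class RequestState
{
    public RequestState(string cultureCode)
    {
        CultureCode = cultureCode;
    }

    public string CultureCode { get; private set; }
}

public static class StaticStateDemo
{
    // Every simulated request stores its culture in the static field, then
    // each one reads it back: they all see the value stored last.
    public static List<string> ReadBackStatic(IList<string> cultures)
    {
        foreach (string culture in cultures)
        {
            StaticSiteState.CurrentCultureCode = culture;
        }
        var seen = new List<string>();
        for (int i = 0; i < cultures.Count; i++)
        {
            seen.Add(StaticSiteState.CurrentCultureCode);
        }
        return seen;
    }

    // Same sequence with per-request objects: each request reads back
    // the culture it stored itself.
    public static List<string> ReadBackPerRequest(IList<string> cultures)
    {
        var states = new List<RequestState>();
        foreach (string culture in cultures)
        {
            states.Add(new RequestState(culture));
        }
        var seen = new List<string>();
        foreach (RequestState state in states)
        {
            seen.Add(state.CultureCode);
        }
        return seen;
    }
}
```

With two simulated requests for different cultures, the static version reports the second culture for both, which is the rendering mix-up worth checking for.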
Are you already using Caching and Session state?
The basic idea being to defer as much of the data loading to these storage mediums as possible and not do it on individual page loads. Caching especially can be useful if you only need to get the data once and want to share it between users and over time.
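As a minimal sketch of the caching idea, here is a tiny in-memory cache with a fixed time-to-live. It is a stand-in written for illustration, not the ASP.NET cache or Kentico's own caching API; the TimedCache name and the loader delegate are assumptions.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class TimedCache<TKey, TValue>
{
    private sealed class Entry
    {
        public TValue Value;
        public DateTime ExpiresAtUtc;
    }

    private readonly ConcurrentDictionary<TKey, Entry> _entries =
        new ConcurrentDictionary<TKey, Entry>();
    private readonly TimeSpan _timeToLive;

    public TimedCache(TimeSpan timeToLive)
    {
        _timeToLive = timeToLive;
    }

    // Returns the cached value if it is still fresh; otherwise calls the
    // loader (for example, the expensive CMS API call) and stores the result.
    // Two callers racing on an expired key may both run the loader.
    public TValue GetOrLoad(TKey key, Func<TKey, TValue> loader)
    {
        DateTime now = DateTime.UtcNow;
        Entry entry;
        if (_entries.TryGetValue(key, out entry) && entry.ExpiresAtUtc > now)
        {
            return entry.Value;
        }

        TValue value = loader(key);
        _entries[key] = new Entry { Value = value, ExpiresAtUtc = now + _timeToLive };
        return value;
    }
}
```

Data that is the same for every visitor fits a shared cache like this; anything user-specific belongs in session state instead.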
If you are already doing these things, or can't directly implement them, try deferring as much of this data gathering as possible, opting to short-circuit it and not do the loading up front. If the data is only occasionally used this can also save you a lot of time in page loads.
I suggest you try to profile your application and see where the bottlenecks are:
Slow load from the DB?
Slow network traffic?
Slow rendering?
Too much traffic for the client?
Profiling should be part of almost every senior programmer's general toolbox. Learn it, and you'll have the answers yourself.
Cheers!
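If a full profiler is not at hand, a crude first pass at the questions above is to time each stage with System.Diagnostics.Stopwatch. The StepTimer helper below is a hypothetical sketch of that, not a replacement for a real profiler.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class StepTimer
{
    private readonly Dictionary<string, TimeSpan> _totals = new Dictionary<string, TimeSpan>();

    // Runs `work`, adds its elapsed time to the running total for `step`,
    // and returns whatever `work` returned.
    public T Measure<T>(string step, Func<T> work)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        try
        {
            return work();
        }
        finally
        {
            stopwatch.Stop();
            TimeSpan previous;
            _totals.TryGetValue(step, out previous);   // stays TimeSpan.Zero on first use
            _totals[step] = previous + stopwatch.Elapsed;
        }
    }

    // Accumulated time per step name.
    public IReadOnlyDictionary<string, TimeSpan> Totals
    {
        get { return _totals; }
    }
}
```

Wrapping the database load, the dataset building and the rendering in separate Measure calls should at least show which of the listed suspects deserves a closer look.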
First things first... Enable Trace for your application and try to optimize response size and caching, and work with some application and DB profilers... Just by looking at the code, I am afraid no one will be able to help you better.
I have an importer process which runs as a Windows service (in debug mode, as an application); it processes various XML documents and CSVs and imports them into a SQL database. All was well until I had to process a large amount of data (120k rows) from another table (as I do the XML documents).
I am now finding that the SQL server's memory usage is hitting a point where it just hangs. My application never receives a time out from the server and everything just goes STOP.
I am still able to make calls to the database server separately but that application thread is just stuck with no obvious thread in SQL Activity Monitor and no activity in Profiler.
Any ideas on where to begin solving this problem would be greatly appreciated as we have been struggling with it for over a week now.
The basic architecture is C# 2.0 using NHibernate as an ORM; data is pulled into the actual C# logic, processed, then written back into the same database along with logs into other tables.
The only other problem, which sometimes happens instead, is that for some reason a cursor is being opened on this massive table, which I can only assume is generated by ADO.NET. A statement like exec sp_cursorfetch 180153005,16,113602,100 is being called thousands of times according to Profiler.
When are you COMMITting the data? Are there any locks or deadlocks (sp_who)? If 120,000 rows is considered large, how much RAM is SQL Server using? When the application hangs, is there anything about the point where it hangs (is it an INSERT, a lookup SELECT, or what?)?
It seems to me that that commit size is way too small. Usually in SSIS ETL tasks, I will use a batch size of 100,000 for narrow rows with sources over 1,000,000 in cardinality, but I never go below 10,000 even for very wide rows.
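To sketch what batching the commits could look like in C#, the helper below splits any sequence into fixed-size batches; each batch would then be written and committed as one unit. The Batching class is hypothetical, and the actual transaction handling is left out because it depends on the data access layer in use.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Splits `source` into lists of at most `batchSize` items, preserving order.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        return Iterate(source, batchSize);
    }

    // Separate iterator so the argument checks above run immediately
    // rather than on first enumeration.
    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>();
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>();
            }
        }
        if (batch.Count > 0)
        {
            yield return batch;   // final, possibly smaller batch
        }
    }
}
```

With the 120,000 rows from the question and a batch size of 10,000, this would produce twelve commits instead of one per row.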
I would not use an ORM for large ETL, unless the transformations are extremely complex with a lot of business rules. Even still, with a large number of relatively simple business transforms, I would consider loading the data into simple staging tables and using T-SQL to do all the inserts, lookups etc.
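Are you running this into SQL using BCP? If not, the transaction logs may not be able to keep up with your input. On a test machine, try switching the recovery model to Simple (the log is still written, but it is truncated at checkpoints and bulk operations can be minimally logged), or use the BCP methods to get data in (under the Simple or Bulk-Logged recovery model they can be minimally logged rather than fully logged).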
Adding on to StingyJack's answer ...
If you're unable to use straight BCP due to processing requirements, have you considered performing the import against a separate SQL Server (separate box), using your tool, then running BCP?
The key to making this work would be keeping the staging machine clean -- that is, no data except the current working set. This should keep the RAM usage down enough to make the imports work, as you're not hitting tables with -- I presume -- millions of records. The end result would be a single view or table in this second database that could be easily BCP'ed over to the real one when all the processing is complete.
The downside is, of course, having another box ... And a much more complicated architecture. And it's all dependent on your schema, and whether or not that sort of thing could be supported easily ...
I've had to do this with some extremely large and complex imports of my own, and it's worked well in the past. Expensive, but effective.
I found out that it was NHibernate creating the cursor on the large table. I have yet to understand why, but in the meantime I have replaced the large table's data access model with straightforward ADO.NET calls.
Since you are rewriting it anyway, you may not be aware that you can get BCP-style bulk loading directly from .NET via the System.Data.SqlClient.SqlBulkCopy class. See this article for some interesting performance info.