C# Datasets, paging and large amounts of data

I want to show a large amount of data in a dataset: approximately 100,000 records with about 10 columns, which consumes a large amount of RAM (around 700 MB). I have also tried paging, which reduces this by about 15-20%, but I don't really like the Previous/Next buttons involved when using paging. I'm not writing the data to disk at present; should I be? If so, what is the most common method? The data isn't to be stored forever, just whilst it is being viewed; then a new query may be run and another 70,000 records could be viewed. What is the best way to proceed?
Thanks for the advice.

The reality is that the end user rarely needs to see the totality of their dataset, so I would use whichever method you like for presenting the data (a ListView, say) and build a custom pager so that the dataset is only fed the number of records desired. Otherwise, each page load would result in re-pulling the entire dataset.
The XML method to a temp file, or utilizing a temp table created through a stored proc, are alternatives, but you still must sift through and present the data.

An important question is where this data comes from. That will help determine what options are available to you. Writing to disk would work, but it probably isn't the best choice, for three reasons:
As a user, I'd be pretty annoyed if your app suddenly chewed up 700 MB of disk space with no warning at all. But, then, I'd notice such things. I suppose a lot of users wouldn't. Still: it's a lot of space.
Depending on the source of the data, even the initial transfer could take longer than you really want to allow.
Again, as a user, there's NO WAY I'm manually digging through 700 MB worth of data. That means you almost certainly never need to show all of it. You want to load only the requested page, one (or a couple of) pages at a time.

I would suggest memory mapped files...not sure if .NET includes support for it yet.
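For what it's worth, .NET 4.0 and later include memory-mapped file support in the System.IO.MemoryMappedFiles namespace. A minimal sketch of the idea, assuming the query results have already been serialized to a hypothetical records.dat scratch file with a fixed record size, so that only the page currently being viewed is ever read into memory:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MappedRecordViewer
{
    const int RecordSize = 1024;   // hypothetical fixed record length in the scratch file

    static void Main()
    {
        // Map the scratch file instead of holding the whole result set in a DataSet.
        using (var mmf = MemoryMappedFile.CreateFromFile("records.dat", FileMode.Open))
        {
            long firstRecord = 50000;   // start of the page the user asked for
            int recordsPerPage = 100;

            // Only this window of the file is paged into memory by the OS.
            using (var view = mmf.CreateViewAccessor(firstRecord * RecordSize,
                                                     recordsPerPage * RecordSize))
            {
                var buffer = new byte[RecordSize];
                for (int i = 0; i < recordsPerPage; i++)
                {
                    view.ReadArray(i * RecordSize, buffer, 0, RecordSize);
                    Console.WriteLine(Encoding.UTF8.GetString(buffer).TrimEnd('\0'));
                }
            }
        }
    }
}
```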

That is a lot of data to be working with and keeping around in memory.
Is this an ASP.NET app or a Windows app?
I personally have found that a custom pager setup (to control the next/previous links) combined with paging at the database level is the only way to get the best performance and fetch only the data that is needed.

Implement paging in SQL if you want to reduce the memory footprint.
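A minimal sketch of what database-level paging can look like, assuming SQL Server 2012 or later (for OFFSET/FETCH) and a hypothetical Orders table; on older versions the same effect needs a ROW_NUMBER() wrapper:

```csharp
using System.Data;
using System.Data.SqlClient;

class PagedLoader
{
    public static DataTable LoadPage(string connectionString, int pageIndex, int pageSize)
    {
        const string sql = @"
            SELECT OrderId, CustomerName, OrderDate      -- only the columns you display
            FROM   dbo.Orders                            -- hypothetical table
            ORDER  BY OrderId
            OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY;";

        var page = new DataTable();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@Offset", pageIndex * pageSize);
            command.Parameters.AddWithValue("@PageSize", pageSize);

            using (var adapter = new SqlDataAdapter(command))
            {
                adapter.Fill(page);   // only one page of rows ever lives in memory
            }
        }
        return page;
    }
}
```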

Related

Scaling C# database application

I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the time it takes to process each item increases as the number of items to be processed grows.
I am trying to be very careful about freeing memory for allocated objects once I am through with them. I want nothing hanging around from the processing of one item when I start processing the next. I make use of "using" blocks for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things you can be watchful for - if you're using "using" statements - are you making sure that you're not keeping your connection open while manipulating your objects? Best to make sure you get your data from the database, close the connection, do your manipulation and then send the data back.
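A sketch of that read / close / process / write-back shape, assuming hypothetical ItemData and Results tables and a placeholder Transform step for the calculations:

```csharp
using System.Data;
using System.Data.SqlClient;

class ItemProcessor
{
    public static void ProcessItem(string connectionString, int itemId)
    {
        DataTable input;

        // 1. Pull the item's data and let the connection close immediately.
        using (var connection = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter(
                   "SELECT * FROM dbo.ItemData WHERE ItemId = @Id", connection))
        {
            adapter.SelectCommand.Parameters.AddWithValue("@Id", itemId);
            input = new DataTable();
            adapter.Fill(input);          // the adapter opens and closes the connection itself
        }

        // 2. Do the calculations with no connection held open (hypothetical helper).
        DataTable results = Transform(input);

        // 3. Reconnect only to push the results back in one bulk operation.
        using (var connection = new SqlConnection(connectionString))
        using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Results" })
        {
            connection.Open();
            bulkCopy.WriteToServer(results);
        }
    }

    static DataTable Transform(DataTable input)
    {
        // placeholder for the per-item calculations described in the question
        return input;
    }
}
```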
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of your data manipulation inside the stored procedures, or in functions called from them - you do NOT want to offload your entire business classes to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers, etc.). You can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divvy up your work - await/async is very helpful for this, especially if you want to run calculations while also sending previous data back to the server.
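A rough sketch of that overlap, with hypothetical Calculate and WriteResultsAsync helpers (the latter standing in for something like SqlBulkCopy.WriteToServerAsync), so the next item's calculations run while the previous item's results are still being written:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Threading.Tasks;

class OverlappedProcessor
{
    // Overlap the CPU-bound calculation for item N with the database write for item N-1.
    public static async Task ProcessAllAsync(IEnumerable<Item> items)
    {
        Task pendingWrite = Task.CompletedTask;

        foreach (var item in items)
        {
            DataTable results = Calculate(item);        // hypothetical CPU-bound work
            await pendingWrite;                         // make sure the previous write finished
            pendingWrite = WriteResultsAsync(results);  // start writing, keep looping
        }

        await pendingWrite;                             // flush the final write
    }

    static DataTable Calculate(Item item) { /* per-item calculations */ return new DataTable(); }
    static Task WriteResultsAsync(DataTable results) { /* e.g. SqlBulkCopy.WriteToServerAsync */ return Task.CompletedTask; }
}

class Item { }
```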
Thanks for all the input. I checked the issues of opening/closing connections, etc. to make sure I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were set up relative to what an end user would require, but they really gummed up the speed of data inserts. A heads-up to folks: think about database constraints for updating data vs. using the data.
I also found performance issues when selecting with a filter from an in-memory DataTable. Somehow what I was doing got bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this, but it did show me the need to micro-test each step of my logic when trying to drag so much data around.
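For anyone wanting to micro-test a step like that, a Stopwatch around the suspect call is usually enough; a small sketch, assuming a hypothetical KeyColumn filter on the in-memory DataTable:

```csharp
using System;
using System.Data;
using System.Diagnostics;

class StepTimer
{
    public static void TimeFilter(DataTable table, string key)
    {
        var stopwatch = Stopwatch.StartNew();

        // The step under suspicion: DataTable.Select scans every row on each call.
        DataRow[] matches = table.Select("KeyColumn = '" + key + "'");

        stopwatch.Stop();
        Console.WriteLine("Select over {0} rows took {1} ms ({2} matches)",
                          table.Rows.Count, stopwatch.ElapsedMilliseconds, matches.Length);
    }
}
```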

Speed up UniVerse access times using UniObjects

I am accessing a UniVerse database and reading out all the records in it in order to synchronize it to a MySQL database, which is used for compatibility with some other applications that use the data. Some of the tables are >250,000 records long with >100 columns, and the server is rather old and still used by many simultaneous users, so it can take a very ... long ... time to read the records.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Fetching each record can take between 250 ms and 3/4 of a second, depending on database usage. Removing the extraction methods speeds it up only marginally, since I think it still downloads all of the record information anyway when I call UniFile.read, even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?
Do you really need to use SSELECT (sorted select)? The sorting on record key will create an additional performance overhead. If you do not need to synchronise in a sorted manner, just use a plain SELECT and this should improve the performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatic-resizing - type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If however the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.
Ensure your files are sized correctly using ANALYZE-FILE fileName. If they are not dynamic, ensure there is not too much overflow.
Using SELECT instead of SSELECT will mean you are reading data from the database sequentially rather than randomly, which will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the Pick data separators, characters 254, 253 and 252, will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to the performance.
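As a sketch of that conversion step (not UniObjects-specific, just plain string handling), assuming the attribute, value and sub-value marks only need to be flattened into delimiters the external database can store:

```csharp
using System;

class PickRecordSplitter
{
    const char AttributeMark = (char)254;  // field separator
    const char ValueMark     = (char)253;  // multi-value separator
    const char SubValueMark  = (char)252;  // sub-value separator

    // Split one raw UniVerse record into its attributes, then flatten any
    // multi-values into delimiters an external database can cope with.
    public static string[] SplitRecord(string rawRecord)
    {
        string[] attributes = rawRecord.Split(AttributeMark);

        for (int i = 0; i < attributes.Length; i++)
        {
            attributes[i] = attributes[i]
                .Replace(ValueMark, ';')      // replacement characters are arbitrary:
                .Replace(SubValueMark, ',');  // pick whatever the MySQL side expects
        }
        return attributes;
    }
}
```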
It is not clear from the initial post, however a WRITESEQ would probably be the most efficient way to output the file data.

Process each row and copy it to new table using C#

I have an MSSQL 2008 table with a few million records. I need to iterate over each row, modify some of the data, and copy the updated record to a new table using a C# application that gets executed on a daily basis.
I have tried doing this using ADO.NET entities, but there are memory issues involved with this method, not to mention it is very slow. I have read up on bulk-copy libraries and SQL-only ways for copying one table to another, but none of them involve modifying records before copying them. I need to find a better way for performing this operation.
As you mention memory issues I'm guessing you're trying to load the million rows into memory, process them and then write them back to the database.
You can avoid this by 'streaming' the data instead of loading it entirely. The SqlDataReader will handle buffering for you, so on the reading side you can do a simple WHILE loop that fetches rows one by one. The actual conversion you already seem to have working, so all you need to do is take care of writing the results back into the database. IMHO the fastest way to do so is by storing a buffer of multiple results (start with 100 and work up to find the sweet spot) in a DataTable and then pushing that DataTable into the database using the SqlBulkCopy class.
Rinse & repeat.
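A sketch of that read / buffer / flush loop, assuming hypothetical SourceTable and DestinationTable tables and a placeholder Transform helper for the per-row modification:

```csharp
using System.Data;
using System.Data.SqlClient;

class StreamingCopier
{
    public static void CopyWithTransform(string sourceConn, string destConn)
    {
        const int batchSize = 100;  // starting point; tune upward to find the sweet spot

        using (var source = new SqlConnection(sourceConn))
        using (var command = new SqlCommand("SELECT Id, Payload FROM dbo.SourceTable", source))
        {
            source.Open();

            using (SqlDataReader reader = command.ExecuteReader())
            using (var destination = new SqlConnection(destConn))
            using (var bulkCopy = new SqlBulkCopy(destination) { DestinationTableName = "dbo.DestinationTable" })
            {
                destination.Open();

                var buffer = NewBuffer();
                while (reader.Read())
                {
                    // Per-row modification happens here (hypothetical Transform helper).
                    buffer.Rows.Add(reader.GetInt32(0), Transform(reader.GetString(1)));

                    if (buffer.Rows.Count >= batchSize)
                    {
                        bulkCopy.WriteToServer(buffer);  // flush the batch
                        buffer.Clear();
                    }
                }
                if (buffer.Rows.Count > 0)
                    bulkCopy.WriteToServer(buffer);      // flush the remainder
            }
        }
    }

    static DataTable NewBuffer()
    {
        // Columns must line up with the destination table's schema.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Payload", typeof(string));
        return table;
    }

    static string Transform(string payload) { return payload; } // placeholder for the real modification
}
```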
PS: Sounds like a 'fun' problem. Do you have any sample data sitting somewhere to test this out? 5 hours sounds like a LONG time for something that looks trivial at first; then again, 20 million times virtually nothing still adds up. More specifically, I wonder how 'large' the data is on the RTF side: are the values around 2 KB on average, or more like 200 KB? And what kind of hardware do you run this on?
The fastest performing option would be to re-write your C# application logic into a CLR stored procedure so that all processing takes place on the server.
Checking around the internet, it looks like Microsoft's official answer to converting rich to plain text is to load the data into a RichTextBox control and then pull it out with the RichTextBox.Text property. That sucks for a lot of reasons, but mostly because it means you're going to have to get your hands dirty. Your best bet is to write a small app that invokes the RichTextBox control and passes all of your data to/from the database (using the SqlDataReader should alleviate the memory issues you mentioned).
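The RichTextBox round-trip itself is only a few lines; a sketch, noting it needs a reference to System.Windows.Forms even from a console-style app:

```csharp
using System.Windows.Forms;

static class RtfStripper
{
    // Convert an RTF string to plain text by round-tripping it through a RichTextBox.
    public static string ToPlainText(string rtf)
    {
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;      // load the rich text
            return box.Text;    // read back the plain-text rendering
        }
    }
}
```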
Just as a matter of process - I would suggest building an intermediary table that your "cleansed" data rows get dumped into before appending them to your production table. Once you get the stored proc figured out just right, you can create a trigger that automatically invokes your stored proc every time a record gets added to your dirty table. This will ultimately eliminate the need to run your program every day to move records, as the trigger will make sure it happens "on the fly".
Edit - one last thought
It occurred to me that you might not be comfortable writing stored procedures and triggers, which is ok. A more "programmatic" solution would be to kick all of the files in your dirty table out to a delimited text file, which can easily be downloaded and parsed. Once you have the text file, you could manipulate it with your app (read it, cleanse it, create a cleansed file..what have you) and then upload for reading back into your database. Depending on your comfort/background/skill level, this might actually be the better solution to get the job done.
Hope this helps!
Use SSIS. Schedule a daily job that does your transformation and runs the SSIS package. This will take care of batching and memory consumption, and will offer a few fast connectors for the read and write of data. You can embed your custom C# code (the RTF stripping into pure text) as an SSIS component, see Developing Custom Objects for Integration Services.

Is a clear and replace more efficient than a loop checking all records?

I have a C# List that is filled from a database. So far it's only 1,400 records, but I expect it to grow a LOT. Routinely I check the entire list for new data. What I'm trying to figure out is this: is it faster to simply clear the List and reload all the data from the table, or would checking each record be faster?
Intuition tells me that the dump-and-load method would be faster, but I thought I should check first...
If my understanding is correct, you would have to load the list from MySQL anyway and verify that the in-memory list is up to date, correct? So then the only issue you refer to is the in-memory management of the list.
Well, typically I would try to profile the different behaviours first and see which performs better.
But as you state, I would think that a clear and recreate should be faster than a systematic check and update.
You should dump and reload, definitely. I base this advice purely on my (perhaps unwarranted) fear of your code that checks for new data.
Depends how slow the load is. I'd say accessing something in memory is always going to be faster than loading from a DB. However, you need to do your own calculations.
Remember: don't optimise without numbers to back you up.
If the database is not getting hit several times an hour, then definitely reload the data from the database. Offload as much work as possible to the database, as it is designed to take full advantage of system resources.
Add a column called 'InsertedTime' to your table. At intervals, update your list with the rows where InsertedTime > ListCreatedTime (a variable in your app).
This assumes that rows are not deleted from your database.
If you want to handle database changes such as updates and deletions, you can instead set flags in your database. Later, when you've updated your list, you can delete those records from your app.
But anyway, profile your dump-and-load method and compare it with this approach to see which is faster in your scenario.
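A sketch of that incremental pull, assuming MySQL via the MySql.Data connector, the InsertedTime column described above, and a hypothetical Record type for the in-memory list:

```csharp
using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

class IncrementalLoader
{
    DateTime listCreatedTime = DateTime.MinValue;   // last time the list was refreshed

    public void MergeNewRows(string connectionString, List<Record> list)
    {
        const string sql =
            "SELECT Id, Name, InsertedTime FROM Records WHERE InsertedTime > @Since";

        using (var connection = new MySqlConnection(connectionString))
        using (var command = new MySqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@Since", listCreatedTime);
            connection.Open();

            using (MySqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Only rows inserted since the last refresh reach the app.
                    list.Add(new Record
                    {
                        Id = reader.GetInt32(0),
                        Name = reader.GetString(1)
                    });
                }
            }
        }
        listCreatedTime = DateTime.Now;   // the next pass only fetches newer rows
    }
}

class Record
{
    public int Id { get; set; }
    public string Name { get; set; }
}
```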

Dealing with huge SQL resultset

I am working with a rather large mysql database (several million rows) with a column storing blob images. The application attempts to grab a subset of the images and runs some processing algorithms on them. The problem I'm running into is that, due to the rather large dataset that I have, the dataset that my query is returning is too large to store in memory.
For the time being, I have changed the query to not return the images. While iterating over the resultset, I run another select which grabs the individual image that relates to the current record. This works, but the tens of thousands of extra queries have resulted in a performance decrease that is unacceptable.
My next idea is to limit the original query to 10,000 results or so, and then keep querying over spans of 10,000 rows. This seems like the middle of the road compromise between the two approaches. I feel that there is probably a better solution that I am not aware of. Is there another way to only have portions of a gigantic resultset in memory at a time?
Cheers,
Dave McClelland
One option is to use a DataReader. It streams the data, but it's at the expense of keeping an open connection to the database. If you're iterating over several million rows and performing processing for each one, that may not be desirable.
I think you're heading down the right path of grabbing the data in chunks, probably using MySQL's LIMIT clause, correct?
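A sketch of the chunked approach, assuming a hypothetical Images table with a numeric auto-increment Id; it uses keyset paging (WHERE Id > last seen) rather than a growing OFFSET, since large offsets get slower the deeper you page:

```csharp
using System.Collections.Generic;
using MySql.Data.MySqlClient;

class ChunkedReader
{
    // Pull the resultset in fixed-size chunks so only one chunk is in memory at a time.
    public static IEnumerable<long> ReadIdsInChunks(string connectionString, int chunkSize)
    {
        long lastId = 0;
        while (true)
        {
            var ids = new List<long>();
            using (var connection = new MySqlConnection(connectionString))
            using (var command = new MySqlCommand(
                       "SELECT Id FROM Images WHERE Id > @LastId ORDER BY Id LIMIT " + chunkSize,
                       connection))
            {
                command.Parameters.AddWithValue("@LastId", lastId);
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                        ids.Add(reader.GetInt64(0));
                }
            }

            if (ids.Count == 0)
                yield break;                      // no more rows

            foreach (long id in ids)
                yield return id;                  // the caller fetches/processes each image

            lastId = ids[ids.Count - 1];          // resume from the last key seen
        }
    }
}
```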
When dealing with such large datasets it is important not to need to have it all in memory at once. If you are writing the result out to disk or to a webpage, do that as you read in each row. Don't wait until you've read all rows before you start writing.
You also could have set the images to DelayLoad = true so that they are only fetched when you need them rather than implementing this functionality yourself. See here for more info.
I see 2 options.
1) If this is a Windows app (as opposed to a web app) you can read each image using a data reader and dump the file to a temp folder on the disk, then do whatever processing you need against the physical file.
2) Read and process the data in small chunks. 10k rows can still be a lot depending on how large the images are and how much processing you want to do. Returning 5k worth of rows at a time, and reading more in a separate thread when you are down to 1k remaining to process, can make for a seamless process.
Also while not always recommended, forcing garbage collection before processing the next set of rows can help to free up memory.
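A sketch of option 1, assuming the same hypothetical Images table; each blob is streamed straight from the reader to a temp file so only one buffer's worth of image data is in memory at a time:

```csharp
using System;
using System.IO;
using MySql.Data.MySqlClient;

class BlobDumper
{
    // Stream each image out of the reader and onto disk, one at a time.
    public static void DumpImages(string connectionString, string tempFolder)
    {
        using (var connection = new MySqlConnection(connectionString))
        using (var command = new MySqlCommand("SELECT Id, Image FROM Images", connection))
        {
            connection.Open();
            using (MySqlDataReader reader = command.ExecuteReader())
            {
                var buffer = new byte[64 * 1024];
                while (reader.Read())
                {
                    string path = Path.Combine(tempFolder, reader.GetInt32(0) + ".img");
                    using (var file = File.Create(path))
                    {
                        long offset = 0;
                        long read;
                        // GetBytes copies a slice of the blob into the buffer.
                        while ((read = reader.GetBytes(1, offset, buffer, 0, buffer.Length)) > 0)
                        {
                            file.Write(buffer, 0, (int)read);
                            offset += read;
                        }
                    }
                }
            }
        }
    }
}
```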
I've used a solution like one outlined in this tutorial before:
http://www.asp.net/(S(pdfrohu0ajmwt445fanvj2r3))/learn/data-access/tutorial-25-cs.aspx
You could use multi-threading to pre-fetch a portion of the next few pages of data (at first pull rows 1-10,000, and in the background pull rows 10,001-20,000 and 20,001-30,000), and delete the earlier pages (say, if you are at rows 50,000-60,000, delete the first 10,000 rows) to conserve memory if that is an issue. Use the user's current "page" location as a pointer to pull the next range of data or delete some out-of-range data.
