Dealing with huge SQL resultset - c#

I am working with a rather large MySQL database (several million rows) with a column storing blob images. The application grabs a subset of the images and runs some processing algorithms on them. The problem I'm running into is that, due to the rather large dataset, the resultset my query returns is too large to store in memory.
For the time being, I have changed the query to not return the images. While iterating over the resultset, I run another select that grabs the individual image belonging to the current record. This works, but the tens of thousands of extra queries have caused an unacceptable performance decrease.
My next idea is to limit the original query to 10,000 results or so, and then keep querying over spans of 10,000 rows. This seems like a middle-of-the-road compromise between the two approaches, but I suspect there is a better solution I am not aware of. Is there another way to hold only portions of a gigantic resultset in memory at a time?
Cheers,
Dave McClelland

One option is to use a DataReader. It streams the data, but at the expense of keeping an open connection to the database. If you're iterating over several million rows and performing processing for each one, that may not be desirable.
I think you're heading down the right path by grabbing the data in chunks, probably using MySQL's LIMIT clause, correct?
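A sketch of that chunked approach as keyset paging: remember the last primary key seen and ask for rows after it, which stays fast where a growing OFFSET would not. The table and column names are hypothetical, and a real query should use a parameterized command rather than string interpolation:

```csharp
using System;

// Hypothetical table/column names. Each iteration fetches the next chunk
// after the highest primary key seen so far.
static string NextChunkQuery(long lastId, int chunkSize) =>
    $"SELECT id, image FROM images WHERE id > {lastId} ORDER BY id LIMIT {chunkSize}";

Console.WriteLine(NextChunkQuery(0, 10000));
```

The caller keeps the `id` of the last row in each chunk and feeds it back in as `lastId` until a chunk comes back empty.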

When dealing with datasets this large, it is important not to need the whole thing in memory at once. If you are writing the result out to disk or to a web page, do that as you read in each row. Don't wait until you've read all rows before you start writing.
You could also have set the images to DelayLoad = true, so that they are only fetched when you need them, rather than implementing this functionality yourself.

I see two options.
1) If this is a Windows app (as opposed to a web app), you can read each image using a data reader and dump the file to a temp folder on disk; then you can do whatever processing you need against the physical file.
2) Read and process the data in small chunks. 10k rows can still be a lot, depending on how large the images are and how much processing you want to do. Returning 5k rows at a time, and reading more on a separate thread when you are down to 1k remaining to process, can make for a seamless pipeline.
Also, while not always recommended, forcing garbage collection before processing the next set of rows can help free up memory.
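The read-ahead idea in option 2 can be sketched with a bounded producer/consumer queue: one task fetches the next chunk while the current one is processed. The database fetch is simulated with an in-memory source; `FetchChunks` stands in for the real chunked query:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Simulated source; in the real app each chunk would come from a LIMIT query.
static IEnumerable<int[]> FetchChunks(int total, int chunkSize)
{
    for (int start = 0; start < total; start += chunkSize)
        yield return Enumerable.Range(start, Math.Min(chunkSize, total - start)).ToArray();
}

static long ProcessAll(int total, int chunkSize)
{
    // Bounded to 2 chunks: one being processed, one prefetched in the background.
    using var queue = new BlockingCollection<int[]>(boundedCapacity: 2);
    var producer = Task.Run(() =>
    {
        foreach (var chunk in FetchChunks(total, chunkSize)) queue.Add(chunk);
        queue.CompleteAdding();
    });
    long sum = 0;
    foreach (var chunk in queue.GetConsumingEnumerable())
        foreach (var row in chunk) sum += row;   // stand-in for the image processing
    producer.Wait();
    return sum;
}

Console.WriteLine(ProcessAll(25, 10));   // 0+1+...+24 = 300
```

The bounded capacity is what keeps memory flat: the producer blocks instead of racing ahead of the consumer.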

I've used a solution like the one outlined in this tutorial before:
http://www.asp.net/(S(pdfrohu0ajmwt445fanvj2r3))/learn/data-access/tutorial-25-cs.aspx
You could use multi-threading to pre-pull portions of the next few ranges: at first, pull rows 1-10,000, and in the background pull 10,001-20,000 and 20,001-30,000. Delete earlier pages as you go (say, once you are at rows 50,000-60,000, drop rows 1-10,000) to conserve memory, if that is an issue. Use the user's current "page" as a pointer to decide which range of data to pull next and which out-of-range data to delete.
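The windowing logic above reduces to one question: given the current page, which pages should stay resident? A sketch (the window sizes here, one page behind and two ahead, are arbitrary):

```csharp
using System;
using System.Linq;

// Pages to keep in memory: the current page, one behind, and two ahead.
// Anything outside this window can be dropped to conserve memory.
static int[] ResidentPages(int currentPage, int lastPage)
{
    int lo = Math.Max(0, currentPage - 1);
    int hi = Math.Min(lastPage, currentPage + 2);
    return Enumerable.Range(lo, hi - lo + 1).ToArray();
}

Console.WriteLine(string.Join(",", ResidentPages(5, 9)));  // 4,5,6,7
```

A background thread then pre-pulls any page in the window that is not yet loaded, and evicts pages that have fallen out of it.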

Related

What's the most efficient way to update thousands of records

We have a C# application which parses data from text files. We then have to update records in our SQL database based on the information in the text files. What's the most efficient way to pass the data from the application to SQL Server?
We currently use a delimited string and then loop through the string in a stored procedure to update the records. I am also testing a TVP (table-valued parameter). Are there any other options out there?
Our files contain thousands of records and we would like a solution that takes the least amount of time.
Please do not use a DataTable, as that just wastes CPU and memory for no benefit (other than, possibly, familiarity). I have detailed a very fast and flexible approach in my answer to the following question, which is very similar to this one:
How can I insert 10 million records in the shortest time possible?
The example shown in that answer is INSERT-only, but it can easily be adapted to include UPDATE. It also uploads all rows in a single shot, but that, too, can be adapted: set a counter, exit the IEnumerable method after X records have been passed in, and close the file once there are no more records. This requires storing the file pointer (i.e., the stream) in a static variable that keeps getting passed to the IEnumerable method, so that reading can resume at the most recent position on the next call. I have a working example of this method in the following answer; it uses a SqlDataReader as input, but the technique is the same and requires very little modification:
How to split one big table that has 100 million data to multiple tables?
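The resumable-iterator idea described above can be sketched without the TVP plumbing: keep the reader outside the iterator so each call picks up where the previous batch stopped. Plain strings stand in for SqlDataRecord here, and an in-memory StringReader for the file stream:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Held across calls, playing the role of the static stream described above.
TextReader source = new StringReader("r1\nr2\nr3\nr4\nr5");

// Yields at most batchSize records per call, resuming where the last call stopped.
IEnumerable<string> NextBatch(int batchSize)
{
    for (int i = 0; i < batchSize; i++)
    {
        var line = source.ReadLine();
        if (line == null) yield break;   // no more records; real code closes the file here
        yield return line;               // real code would yield a SqlDataRecord
    }
}

Console.WriteLine(string.Join(",", NextBatch(2)));  // r1,r2
Console.WriteLine(string.Join(",", NextBatch(2)));  // r3,r4
Console.WriteLine(string.Join(",", NextBatch(2)));  // r5
```

Each `NextBatch` enumeration would back one TVP call, so the server receives the file in fixed-size set-based chunks.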
And for some perspective, 50k records is not even close to "huge". I have been uploading / merging / syncing data using the method I am showing here on 4 million row files and that hit several tables with 10 million (or more) rows.
Things to not do:
Use a DataTable: as I said, if you are just filling it for the purpose of using it with a TVP, it is a waste of CPU, memory, and time.
Make one update at a time in parallel (as suggested in a comment on the question): this is just crazy. Relational database engines are heavily tuned to work most efficiently with sets, not singleton operations. There is no way that 50k individual inserts will be more efficient than even 500 inserts of 100 rows each. Doing it individually just guarantees more contention on the table, even if just row locks (that's 100k lock + unlock operations). It could be faster than a single 50k-row transaction that escalates to a table lock (as Aaron mentioned), but that is why you do it in smaller batches, just so long as "small" does not mean 1 row ;).
Set the batch size arbitrarily. Staying below 5000 rows helps reduce the chance of lock escalation, but don't just pick 200. Experiment with several batch sizes (100, 200, 500, 700, 1000) and try each one a few times. You will see what is best for your system. Just make sure the batch size is configurable through the app.config file or some other means (a table in the DB, a registry setting, etc.) so that it can be changed without having to re-deploy code.
SSIS (powerful, but very bulky and not fun to debug)
Things which work, but not nearly as flexible as a properly done TVP (i.e. passing in a method that returns IEnumerable<SqlDataRecord>). These are ok, but why dump the records into a temp table just to have to parse them into the destination when you can do it all inline?
BCP / OPENROWSET(BULK...) / BULK INSERT
.NET's SqlBulkCopy
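Whichever transport you pick, the batching argued for above needs a way to split the incoming records without materializing them all; a minimal streaming chunker (the batch size would come from app.config or similar):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Splits a record stream into batches of at most batchSize.
// Streams the input: never holds more than one batch in memory.
static IEnumerable<List<T>> Batches<T>(IEnumerable<T> source, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize) { yield return batch; batch = new List<T>(batchSize); }
    }
    if (batch.Count > 0) yield return batch;   // final partial batch
}

var sizes = Batches(Enumerable.Range(0, 1050), 500).Select(b => b.Count);
Console.WriteLine(string.Join(",", sizes));  // 500,500,50
```

Each yielded batch then becomes one set-based INSERT/UPDATE (or one TVP call), rather than one statement per row.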
The best way to do this, in my opinion, is to create a temp table, use SqlBulkCopy to insert into that temp table (https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx), and then simply UPDATE the table based on the temp table.
Based on my tests (using Dapper and also LINQ), updating in one bulk or in batches takes far longer than just creating a temp table and sending a command to the server to update the data based on the temp table. The process is faster because SqlBulkCopy populates the data natively and quickly, and the rest is completed on the SQL Server side, which goes through fewer calculation steps; at that point the data already resides on the server end.
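The server-side half of that approach, with hypothetical table and column names (the app creates the temp table, SqlBulkCopy fills it, then one set-based UPDATE finishes the job):

```sql
-- Hypothetical names throughout.
CREATE TABLE #Staging (Id INT PRIMARY KEY, NewValue NVARCHAR(100));

-- ... SqlBulkCopy with DestinationTableName = '#Staging' runs here,
--     on the same connection so the temp table stays in scope ...

UPDATE t
SET    t.Value = s.NewValue
FROM   dbo.Target AS t
JOIN   #Staging  AS s ON s.Id = t.Id;
```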

update sql server rows, while reading the same table

I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first column is of type nchar(24). It is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with it. The results of my calculations are two strings, which are the data I want to insert into the two empty columns.
My question is what is the fastest way to read the information from the first column of the table and update the second and third column.
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As there are billions of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL Server is on the local machine, throughput is nothing we have to be worried about. One calculation takes about 0.02154 seconds; I have a total of 2,809,475,760 rows, which is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
As you can't hold all rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while occasionally writing at the same time. Alternatively, you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that in principle, so only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here: snapshot isolation causes the reader neither to block nor to be blocked.
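Stripped of connection details, the read-compute-flush loop described above looks like this; `ReadKeys` stands in for the streaming read on the snapshot-isolation connection, and `FlushUpdates` for a batched UPDATE issued on the second connection (both are stubs here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var flushed = new List<int>();   // sizes of the batches actually written

// Stand-in for the batched UPDATE issued on the writer connection.
void FlushUpdates(List<(string Key, string A, string B)> batch)
    => flushed.Add(batch.Count);

// Stand-in for the streaming read of the nchar(24) key column.
IEnumerable<string> ReadKeys(int n) => Enumerable.Range(0, n).Select(i => $"key{i}");

const int BatchSize = 1000;
var pending = new List<(string, string, string)>(BatchSize);
foreach (var key in ReadKeys(2500))
{
    // the expensive calculation producing the two new column values
    pending.Add((key, key + "-a", key + "-b"));
    if (pending.Count == BatchSize) { FlushUpdates(pending); pending.Clear(); }
}
if (pending.Count > 0) FlushUpdates(pending);   // final partial batch

Console.WriteLine(string.Join(",", flushed));   // 1000,1000,500
```

Only one batch of computed results is ever held in memory, so the reader can stream the full table while the writer trails behind it.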
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; prematurely optimizing is exactly the kind of thing you should avoid. Just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple how-to and no one-solution-fits-all here.
If there are billions of rows, does performance matter? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the case is massively different from being on 10 Gb fiber.
The computations? How expensive are they? Just c = a + b, or heavy processing of other text files?
These are just a couple of the questions raised; there is a lot more involved that we are not aware of and would need in order to answer correctly.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some of the waiting time for calculations.
EDIT in reply to comment
If the calculations take 20 ms each, the biggest problem is IO. Multithreading won't bring you much.
Read the records in sequence using snapshot isolation so it's not hampered by write locks, and update in batches. My guess is that the reader will stay ahead of the writer without much trouble; reading in batches adds complexity without gaining much.
Find the sweet spot for the right batch size by experimenting.

Best way to page through results in SqlCommand?

I have a database table that contains RTF documents. I need to extract these programmatically (I am aware I can use a cursor to step through the table; I need to do some data manipulation). I created a C# program that will do that, but the problem is that it cannot load the whole table (about 2 million rows) into memory.
There is an MSDN page here that says there are basically two ways to loop through the data:
use the DataAdapter.Fill method to load page by page
run the query many times, iterating using the primary key. Basically, you run it once with a TOP 500 limit (or whatever) and PK > (last PK)
I have tried option 2, and it seems to work. But can I be sure I am pulling back all the data? When I do a SELECT COUNT(*) FROM Document, it reports the same number of rows. Still, I'm nervous. Any tips for data validation?
Also, which is faster? The query is pretty slow; I optimized it as much as possible, but there is a ton of data to transport over the WAN.
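Option 2, plus the count check described above, can be sketched against an in-memory "table"; a real version would issue the TOP/PK > last query on each loop iteration, and all names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simulated table: PKs with gaps, as with an identity column after deletes.
var table = Enumerable.Range(1, 5000).Where(pk => pk % 7 != 0).ToList();

// One page: SELECT TOP(pageSize) ... WHERE pk > lastPk ORDER BY pk
List<int> Page(int lastPk, int pageSize)
    => table.Where(pk => pk > lastPk).OrderBy(pk => pk).Take(pageSize).ToList();

int fetched = 0, last = 0;
while (true)
{
    var page = Page(last, 500);
    if (page.Count == 0) break;
    fetched += page.Count;
    last = page[^1];   // remember the highest PK seen
}
Console.WriteLine(fetched == table.Count);  // True: every row visited exactly once
```

The validation is exactly the asker's instinct: the paged loop's running total must equal SELECT COUNT(*), which holds as long as the PK column is unique and the ordering is stable.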
I think the answer requires a lot more understanding of your true requirements. It's hard for me to imagine a recurring process or requirement where you have to regularly extract 2 million binary files to do some processing on them! If this is a one-time thing then alright, let's get 'er done!
Here are some initial thoughts:
Could you deploy your C# routine to SQL directly and execute everything via CLR?
Could you run your C# app locally on the box and take advantage of the shared memory protocol?
Do you have to process every single row? If, for instance, you're checking whether the structure of the RTF data has changed versus another file, could you create hashes of each that can be compared?
If you must get all the data out, try exporting it to local disk and then XCOPY'ing it to another location.
If you want a chunk of rows at a time, create a table that just keeps a list of all the IDs that have been processed. When grabbing the next 500 rows, just find rows that aren't in that table yet. Of course, update that table with the new IDs that you've exported.
If you must do all this, it could have a serious effect on OLTP performance. Either throttle it to run only during off hours, or take a *.bak and process it on a separate box. Actually, if this is a one-time thing, restore it to the same box that's running SQL and use the shared memory protocol.
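The processed-IDs tracking table suggested above reduces to "take the next N rows not yet recorded, then record them"; simulated here with a HashSet standing in for the tracking table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the tracking table of already-exported IDs.
var processed = new HashSet<int>();
var allIds = Enumerable.Range(1, 1234).ToList();

// "Next 500 rows not in the tracking table yet", then record them.
List<int> NextUnprocessed()
{
    var batch = allIds.Where(id => !processed.Contains(id)).Take(500).ToList();
    foreach (var id in batch) processed.Add(id);
    return batch;
}

int batches = 0;
while (NextUnprocessed().Count > 0) batches++;
Console.WriteLine(batches);          // 3 (500 + 500 + 234)
Console.WriteLine(processed.Count);  // 1234
```

Because the tracking table survives restarts, this also makes the extraction resumable: a crashed run picks up at the first unrecorded ID.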

Speed up UniVerse access times using UniObjects

I am accessing a UniVerse database and reading out all the records in it, for the purpose of synchronizing it to a MySQL database which is used for compatibility with some other applications that use the data. Some of the tables are >250,000 records long with >100 columns, and the server is rather old and still used by many simultaneous users, so it sometimes takes a very ... long ... time to read the records.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Depending on the moment, fetching each record can take between 250 ms and 3/4 of a second, depending on database usage. Removing the extraction methods only speeds it up marginally, since I think it still downloads all of the record information anyway when I call UniFile.read, even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?
Do you really need to use SSELECT (sorted select)? The sorting on record key will create an additional performance overhead. If you do not need to synchronise in a sorted manner just use a plain SELECT and this should improve the performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that, it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatically resizing, type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If, however, the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.
Ensure your files are sized correctly using ANALYZE-FILE fileName. If not dynamic ensure there is not too much overflow.
Using SELECT instead of SSELECT will mean you are reading data from the file sequentially rather than randomly, and will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the pick data separators chars 254, 253 and 252 will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to the performance.
It is not clear from the initial post, but a WRITESEQ would probably be the most efficient way to output the file data.
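On converting the delimiters mentioned above: UniVerse field, value, and sub-value marks are chars 254, 253, and 252, and a single pass can map them to whatever the target schema needs. How multi-values are flattened below (with ';' and ',') is an assumption; the real mapping depends on the MySQL schema:

```csharp
using System;

// UniVerse dynamic-array delimiters.
const char FieldMark = (char)254;     // separates attributes (columns)
const char ValueMark = (char)253;     // separates multi-values within an attribute
const char SubValueMark = (char)252;  // separates sub-values within a value

// Split one record into columns; multi-values and sub-values are
// flattened with ';' and ',' (hypothetical choices).
string[] ToColumns(string record)
{
    var fields = record.Split(FieldMark);
    for (int i = 0; i < fields.Length; i++)
        fields[i] = fields[i].Replace(ValueMark, ';').Replace(SubValueMark, ',');
    return fields;
}

var rec = "SMITH" + FieldMark + "NY" + ValueMark + "LA" + FieldMark + "A" + SubValueMark + "B";
Console.WriteLine(string.Join("|", ToColumns(rec)));  // SMITH|NY;LA|A,B
```

Doing this with `Split`/`Replace` in one pass per record, rather than character-by-character string concatenation, is one of the places where the conversion cost can balloon.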

C# Datasets, paging and large amounts of data

I want to show a large amount of data in a dataset: 100,000 records, approximately 10 columns. This consumes a large amount of RAM (700 MB). I have also tried paging, which reduces this by about 15-20%, but I don't really like the Previous/Next buttons involved. I'm not writing the data to disk at present; should I be? If so, what is the most common method? The data isn't stored forever, just while it is being viewed; then a new query may be run and another 70,000 records could be viewed. What is the best way to proceed?
Thanks for the advice.
The reality is that the end user rarely needs to see the totality of their dataset, so I would use whichever method you like for presenting the data (a ListView, say) and build a custom pager so that the dataset is only fed the number of records desired. Otherwise, each page load would result in re-querying for the dataset.
The XML method to a temp file, or a temp table created through a stored proc, are alternatives, but you still must sift through and present the data.
An important question is where this data comes from. That will help determine what options are available to you. Writing to disk would work, but it probably isn't the best choice, for three reasons:
As a user, I'd be pretty annoyed if your app suddenly chewed up 700 MB of disk space with no warning at all. But, then, I'd notice such things; I suppose a lot of users wouldn't. Still: it's a lot of space.
Depending on the source of the data, even the initial transfer could take longer than you really want to allow.
Again, as a user, there's NO WAY I'm manually digging through 700 MB worth of data. That means you almost certainly never need to show it all. You want to load only the requested page, one (or a couple of) pages at a time.
I would suggest memory mapped files...not sure if .NET includes support for it yet.
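For reference, .NET does include support as of .NET 4: System.IO.MemoryMappedFiles. A minimal round-trip against a temp file, where the view accessor touches pages on demand instead of loading the whole file into managed memory:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[1024]);   // 1 KB backing file for the demo

int readBack;
using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewAccessor(0, 1024))
{
    view.Write(100, 42);             // write an int at byte offset 100
    readBack = view.ReadInt32(100);  // read it back through the mapping
}
File.Delete(path);
Console.WriteLine(readBack);  // 42
```

For the 700 MB dataset above, rows serialized to a mapped file could be paged in by offset without the process owning the full 700 MB of managed heap.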
That is a lot of data to be working with and keeping around in memory.
Is this an ASP.NET app, or a Windows app?
I personally have found that a custom pager setup (to control the next/previous links) with paging at the database level is the only possible way to get the best performance: only fetch the data needed....
Implement paging in SQL if you want to reduce the memory footprint.
