I have a database table that contains RTF documents. I need to extract these programmatically (I am aware I can use a cursor to step through the table - I need to do some data manipulation). I created a C# program that will do that, but the problem is that it can not load the whole table (about 2 million rows) into memory.
There is an MSDN page here that says there are basically two ways to loop through the data:
use the DataAdapter.Fill method to load page by page
run the query many times, iterating on the primary key. Basically you run it with a TOP 500 limit (or whatever) and WHERE PK > (last PK), repeating until no more rows come back
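To make option 2 concrete, here is roughly what such a loop could look like in C# (a simplified sketch; the table name, column names, and integer key are placeholders, and the ORDER BY on the key is what makes the paging reliable):

```csharp
using System.Data.SqlClient;

class KeysetPagingSketch
{
    static void Main()
    {
        const string connectionString = "...";   // placeholder connection string
        const int batchSize = 500;
        int lastId = 0;                           // assumes an ever-increasing integer PK
        bool gotRows = true;

        while (gotRows)
        {
            gotRows = false;
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "SELECT TOP (@batchSize) DocumentId, RtfContent " +
                "FROM Document WHERE DocumentId > @lastId " +
                "ORDER BY DocumentId", connection))
            {
                command.Parameters.AddWithValue("@batchSize", batchSize);
                command.Parameters.AddWithValue("@lastId", lastId);
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        gotRows = true;
                        lastId = reader.GetInt32(0);       // remember the last PK seen
                        string rtf = reader.GetString(1);  // manipulate the RTF here
                    }
                }
            }
        }
    }
}
```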
I have tried option 2, and it seems to work. But can I be sure I am pulling back all the data? When I do a SELECT COUNT(*) FROM Document, it reports the same number of rows I've pulled back. Still, I'm nervous. Any tips for data validation?
Also which is faster? The data query is pretty slow - I optimized the query as much as possible, but there is a ton of data to transport over the WAN.
I think the answer requires a lot more understanding of your true requirements. It's hard for me to imagine a recurring process or requirement where you have to regularly extract 2 million binary files to do some processing on them! If this is a one-time thing then alright, let's get 'er done!
Here are some initial thoughts:
Could you deploy your C# routine to SQL directly and execute everything via CLR?
Could you run your C# app locally on the box and take advantage of the shared memory protocol?
Do you have to process every single row? If, for instance, you're validating whether the structure of the RTF data has changed versus another file, could you create hashes of each that can be compared instead?
If you must get all the data out, maybe try exporting it to local disk and then XCOPY'ing it to another location.
If you want to get a chunk of rows at a time, create a table that just keeps a list of all the IDs that have been processed. When grabbing the next 500 rows, just find rows that aren't in that table yet. Of course, update that table with the new IDs that you've exported (see the sketch after this list).
If you must do all this it could have a serious effect on OLTP performance. Either throttle it to only run off hours or take a *.bak and process it on a separate box. Actually, if this is a one-time thing, restore it to the same box that's running the SQL and use the shared memory protocol.
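A rough sketch of what the tracking-table idea in the list above implies, assuming a Document table keyed on an integer DocumentId and a tracking table named ProcessedDocument (both names are made up):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

static class ProcessedIdTracking
{
    // Hypothetical schema: Document(DocumentId int PK, RtfContent nvarchar(max)),
    // ProcessedDocument(DocumentId int PK) -- the "already exported" tracking table.
    public static IList<int> ProcessNextBatch(string connectionString)
    {
        var processedIds = new List<int>();
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Grab the next 500 rows that have not been exported yet.
            using (var select = new SqlCommand(@"
                SELECT TOP (500) d.DocumentId, d.RtfContent
                FROM Document AS d
                WHERE NOT EXISTS (SELECT 1 FROM ProcessedDocument AS p
                                  WHERE p.DocumentId = d.DocumentId)
                ORDER BY d.DocumentId;", connection))
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    // ... export/manipulate reader.GetString(1) here ...
                    processedIds.Add(id);
                }
            }

            // Record the IDs just exported so the next batch skips them.
            foreach (int id in processedIds)
            {
                using (var insert = new SqlCommand(
                    "INSERT INTO ProcessedDocument (DocumentId) VALUES (@id);", connection))
                {
                    insert.Parameters.AddWithValue("@id", id);
                    insert.ExecuteNonQuery();
                }
            }
        }
        return processedIds;
    }
}
```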
Related
I'm trying to record logs in my database. My question is: which approach puts less load on the database when writing the logs? I'm thinking of storing long-term logs, maybe 3-5 years maximum, for an Inventory Program.
Process: I'll be using a barcode scanner.
After scanning a barcode, I'll get all the details of who is logged in, date and time, product details then saved per piece.
I came up with two ideas.
After the scanning event, it will be saved to a DataTable; then, after finishing a batch, the DataTable will be written to a *.txt file and then uploaded to my database.
After every scanned barcode, an INSERT query will be executed. I suspect this option will be heavy on the server side, since I'm not the only one using this server.
What are the pros and cons of the two options?
Are there more efficient ways of storing logs?
Based on your use case, I also think you need to consider at least 2 additional factors, the first being how important it is that the scanned item is logged in the database immediately. If you need the scanned item to be logged because you'll be checking to see if it's been scanned, for example to prevent duplicate scans, then doing a single insert is probably a very good idea. The second thing to consider is whether you will ever need to "unscan" an item, and at which part of the process. If the person scanning needs the ability to revert the scan immediately, it might be a good idea to wait until they're done all their scanning before dumping the data to the database, as this will let you avoid ever having to delete from the table.
Overall I wouldn't worry too much about what the database can handle; SQL Server is very good at handling simultaneous single inserts into a table that's designed for that use case. If you're only going to be inserting new data at the end of the table, and not updating or deleting existing records, performance is going to scale very well. The same goes for larger batch inserts: they're very efficient no matter how many rows you want to bring in, assuming your table is designed for that purpose.
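To illustrate, a minimal sketch of the per-scan insert (your second option), using a hypothetical ScanLog table:

```csharp
using System;
using System.Data.SqlClient;

static class ScanLogger
{
    // Assumed table: ScanLog(ScanId int identity, Barcode nvarchar(50),
    //                        ScannedBy nvarchar(50), ScannedAt datetime2).
    public static void LogScan(string connectionString, string barcode, string userName)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO ScanLog (Barcode, ScannedBy, ScannedAt) " +
            "VALUES (@barcode, @scannedBy, @scannedAt);", connection))
        {
            command.Parameters.AddWithValue("@barcode", barcode);
            command.Parameters.AddWithValue("@scannedBy", userName);
            command.Parameters.AddWithValue("@scannedAt", DateTime.UtcNow);

            connection.Open();
            command.ExecuteNonQuery();   // one short-lived insert per scan event
        }
    }
}
```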
So overall I would probably pick the more efficient solution from the application side for your specific use case, and then once you have decided that, you can shape the database around the code, rather than trying to shape your code around suspected limitations of the database.
What are the pros and cons of the two options?
Basically your question is which way is more efficient (bulk insert or multiple single inserts)?
The answer is always "it depends" and is always situation-based, so unfortunately I don't think there's one right answer for you. Some things to consider:
The way you structure the log table.
If you choose bulk insert, how many rows do you want to insert at 1 time?
Is it a read-only table? And if you want to read from it, how often do you do so?
Do you need to scale it up?
etc...
Are there more efficient ways of storing logs?
There are some possible ways to improve that I can think of (not all of them can work together):
If you go with the first option, maybe you can schedule the insert to non-peak hours
If you go with the first option, chunk the log files and do the insert
Use another database to do the logging
If you go with the second option, do some load testing
Personally, I prefer to go with the second option if the project is small to medium size and the logging is a critical part of the project.
Hope it helps.
Go with the second option, and use transactions. This way the data will not be committed to the db until you complete the transaction (which can be scheduled). This will also prevent broken data from getting into your database when a crash or something else occurs.
Transactions in .net
Transaction Tutorial in C#
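A minimal sketch of that idea, using SqlTransaction and the same hypothetical ScanLog table as above (names are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class ScanEvent
{
    public string Barcode;
    public string ScannedBy;
    public DateTime ScannedAt;
}

static class BatchedScanLogger
{
    // Inserts a whole batch of scans in one transaction; nothing is committed
    // until Commit() runs, and a crash mid-batch leaves no partial data behind.
    public static void SaveBatch(string connectionString, IEnumerable<ScanEvent> scans)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction transaction = connection.BeginTransaction())
            {
                try
                {
                    foreach (ScanEvent scan in scans)
                    {
                        using (var command = new SqlCommand(
                            "INSERT INTO ScanLog (Barcode, ScannedBy, ScannedAt) " +
                            "VALUES (@barcode, @user, @at);", connection, transaction))
                        {
                            command.Parameters.AddWithValue("@barcode", scan.Barcode);
                            command.Parameters.AddWithValue("@user", scan.ScannedBy);
                            command.Parameters.AddWithValue("@at", scan.ScannedAt);
                            command.ExecuteNonQuery();
                        }
                    }
                    transaction.Commit();
                }
                catch
                {
                    transaction.Rollback();
                    throw;
                }
            }
        }
    }
}
```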
I apologize if this question is a bit nebulous.
I am writing a C# application which does data manipulation against a SQL Server database. For a group of items, I read data for each item, do calculations on the data, then write the results to the database.
The problem I am having is that the time it takes to process each item grows as the number of items to be processed increases.
I am trying to be very careful about freeing memory for allocated objects as soon as I am through with them. I want to have nothing hanging around from the processing of one item when I start the processing of the next item. I make use of "using" blocks for data tables and the BulkCopy class to try to force memory cleanup.
Yet, I start to get geometrically increasing run times per item the more items I try to process in one invocation of the program.
My program is a WinForms app. I don't seem to be eating up the server's memory with what I am doing. I am trying to make the processing of each item isolated from the processing of all other items, to make sure it would not matter how many items I process in each invocation of the application.
Has anyone seen this behavior in their applications and know what to look for to correct this?
A couple of things to be watchful for: if you're using "using" statements, are you making sure that you're not keeping your connection open while manipulating your objects? It's best to get your data from the database, close the connection, do your manipulation, and then send the data back.
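A rough sketch of that read / close / process / write pattern (the ItemData and ItemResults tables are made up for illustration):

```csharp
using System.Data;
using System.Data.SqlClient;

static class ItemProcessor
{
    public static void ProcessItem(string connectionString, int itemId)
    {
        var itemData = new DataTable();

        // 1. Read: hold the connection only long enough to fill the DataTable.
        using (var connection = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter(
            "SELECT * FROM ItemData WHERE ItemId = @itemId", connection))
        {
            adapter.SelectCommand.Parameters.AddWithValue("@itemId", itemId);
            adapter.Fill(itemData);          // Fill opens and closes the connection itself
        }

        // 2. Work: do the calculations with no connection held open.
        DataTable results = Calculate(itemData);

        // 3. Write: open a fresh connection just for the bulk write, then dispose it.
        using (var connection = new SqlConnection(connectionString))
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            connection.Open();
            bulkCopy.DestinationTableName = "ItemResults";
            bulkCopy.WriteToServer(results);
        }
    }

    static DataTable Calculate(DataTable input)
    {
        return input;   // placeholder for the per-item calculations from the question
    }
}
```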
Are you using stored procedures for fetching/sending complex objects? You can also experiment with doing some of your data manipulation inside of the stored procedure or in functions called from them - you do NOT want to offload all of your business logic to the database, but you can do some of it there, depending on what you're doing.
Make sure your data structure is optimized as well (primary key indices, foreign keys, triggers, etc.). You can get some scripts from http://www.brentozar.com/first-aid/ to check the optimization of your database.
As mentioned above, try using some parallel/asynchronous patterns to divvy up your work - await/async is very helpful for this, especially if you want to run calculations while also sending previous data back to the server.
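And a sketch of the async overlap idea, where the previous item's write runs while the next item's calculation is happening (the destination table name is a placeholder):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

static class OverlappedWriter
{
    // Write the previous item's results while calculating the next item.
    public static async Task ProcessAllAsync(string connectionString, int[] itemIds)
    {
        Task pendingWrite = Task.CompletedTask;

        foreach (int itemId in itemIds)
        {
            DataTable results = Calculate(itemId);   // CPU-bound work for this item

            await pendingWrite;                      // make sure the previous write finished
            pendingWrite = WriteResultsAsync(connectionString, results);
        }

        await pendingWrite;                          // flush the final write
    }

    static async Task WriteResultsAsync(string connectionString, DataTable results)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            await connection.OpenAsync();
            bulkCopy.DestinationTableName = "ItemResults";   // hypothetical destination table
            await bulkCopy.WriteToServerAsync(results);
        }
    }

    static DataTable Calculate(int itemId)
    {
        return new DataTable();   // placeholder for the real per-item calculation
    }
}
```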
Thanks for all the input. I checked the issues of opening/closing connections, etc. to see that I was being tidy. The thing that really helped was removing the primary keys on the destination data table. These were set up relative to what an end user would require, but they really gummed up the speed of data inserts. A heads up to folks to think about database constraints for updating data vs. using the data.
Also, I found performance issues in selecting with a filter from an in-memory DataTable. Somehow what I was doing got bogged down with a larger number of rows (30,000). I realized that I was mishandling the data and did not really need to do this. But it did show me the need to micro-test each step of my logic when trying to drag so much data around.
I am accessing a UniVerse database and reading out all the records in it for the purpose of synchronizing it to a MySQL database which is used for compatibility with some other applications which use the data. Some of the tables are >250,000 records long with >100 columns and the server is rather old and still used by many simultaneous users and so it takes a very ... long ... time to read the records sometimes.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Depending on the moment, fetching each record can take between 250 ms and 3/4 of a second, depending on database usage. Removing the methods for extraction only speeds it up marginally, since I think it still downloads all of the record information anyway when I call UniFile.read, even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?
Do you really need to use SSELECT (sorted select)? The sorting on record key will create an additional performance overhead. If you do not need to synchronise in a sorted manner just use a plain SELECT and this should improve the performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatic-resizing - type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If however the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.
Ensure your files are sized correctly using ANALYZE-FILE fileName. If not dynamic ensure there is not too much overflow.
Using SELECT instead of SSELECT will mean you are reading data from the database sequentially rather than randomly, and it will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the pick data separators chars 254, 253 and 252 will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to the performance.
It is not clear from the initial post, but a WRITESEQ would probably be the most efficient way to output the file data.
I have an MSSQL 2008 table with a few million records. I need to iterate over each row, modify some of the data, and copy the updated record to a new table using a C# application that gets executed on a daily basis.
I have tried doing this using ADO.NET entities, but there are memory issues involved with this method, not to mention it is very slow. I have read up on bulk-copy libraries and SQL-only ways for copying one table to another, but none of them involve modifying records before copying them. I need to find a better way for performing this operation.
As you mention memory issues I'm guessing you're trying to load the million rows into memory, process them and then write them back to the database.
You can avoid this by 'streaming' the data instead of loading it all at once. The SqlDataReader will handle buffering for you, so on the reading side you can do a simple WHILE loop that fetches rows one by one. The actual conversion you already have working, it seems, so all you need to do is take care of writing the results back into the database. IMHO the fastest way to do that is by storing a buffer of multiple results (start with 100, work up and see where the sweet spot is) in a DataTable and then pushing that DataTable into the database using the SqlBulkCopy class.
Rinse & repeat.
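A rough sketch of that streaming approach (table and column names are made up; two connections are used so the open data reader and the bulk copy don't fight over one):

```csharp
using System.Data;
using System.Data.SqlClient;

static class StreamingCopy
{
    // Stream rows from the source table, convert each one, and push the results
    // to the destination table in buffered batches.
    public static void Run(string connectionString, int batchSize = 100)
    {
        var buffer = new DataTable();
        buffer.Columns.Add("Id", typeof(int));
        buffer.Columns.Add("PlainText", typeof(string));

        using (var readConnection = new SqlConnection(connectionString))
        using (var writeConnection = new SqlConnection(connectionString))
        {
            readConnection.Open();
            writeConnection.Open();

            using (var command = new SqlCommand(
                "SELECT Id, RtfContent FROM SourceDocuments", readConnection))
            using (var reader = command.ExecuteReader())
            using (var bulkCopy = new SqlBulkCopy(writeConnection)
                   { DestinationTableName = "ConvertedDocuments" })
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    string converted = Convert(reader.GetString(1));   // your existing conversion
                    buffer.Rows.Add(id, converted);

                    if (buffer.Rows.Count >= batchSize)
                    {
                        bulkCopy.WriteToServer(buffer);   // flush the buffered batch
                        buffer.Clear();
                    }
                }

                if (buffer.Rows.Count > 0)
                {
                    bulkCopy.WriteToServer(buffer);       // flush the tail
                }
            }
        }
    }

    static string Convert(string rtf)
    {
        return rtf;   // placeholder for the RTF-to-text conversion
    }
}
```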
PS: Sounds like a 'fun' problem. Do you have any sample data sitting somewhere to test this out ? 5 hours sounds like a LONG time for something that looks trivial at first, then again 20 million times virtually nothing still adds up. More specifically I wonder how 'large' the data is on the RTF side : are the values ca 2k on average or rather 200k? And what kind of hardware do you run this on ?
The fastest performing option would be to re-write your C# application logic into a CLR stored procedure so that all processing takes place on the server.
Checking around the internet, it looks like Microsoft's official answer to converting rich to plain text is to load the data into a RichTextBox control and then pull it out with the RichTextBox.Text property. That sucks for a lot of reasons, but mostly because it means you're going to have to get your hands dirty. Your best bet is to write a small app that invokes the RichTextBox control and passes all of your data to/from the database (using the SqlDataReader should alleviate the memory issues you mentioned).
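A minimal sketch of that RichTextBox round-trip (needs a reference to System.Windows.Forms and, depending on how you host it, may need to run on an STA thread):

```csharp
using System.Windows.Forms;

static class RtfConverter
{
    // Converts an RTF string to plain text by round-tripping it through a
    // never-shown WinForms RichTextBox control.
    public static string ToPlainText(string rtf)
    {
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;      // throws ArgumentException if the RTF is malformed
            return box.Text;
        }
    }
}
```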
Just as a matter of process - I would suggest building an intermediary table that your "cleansed" data rows get dumped into before appending them to your production table. Once you get the stored proc figured out just right, you can create a trigger that automatically invokes your stored proc every time a record gets added to your dirty table. This will ultimately eliminate the need to run your program every day to move records, as the trigger will make sure it happens "on the fly".
Edit - one last thought
It occurred to me that you might not be comfortable writing stored procedures and triggers, which is ok. A more "programmatic" solution would be to kick all of the files in your dirty table out to a delimited text file, which can easily be downloaded and parsed. Once you have the text file, you could manipulate it with your app (read it, cleanse it, create a cleansed file..what have you) and then upload for reading back into your database. Depending on your comfort/background/skill level, this might actually be the better solution to get the job done.
Hope this helps!
Use SSIS. Schedule a daily job that does your transformation and runs the SSIS package. This will take care of batching and memory consumption, and will offer a few fast connectors for the read and write of data. You can embed your custom C# code (the RTF stripping into pure text) as an SSIS component, see Developing Custom Objects for Integration Services.
I have folders where approx 3000 new csv files come in on a daily basis, each containing between 50 and 2000 lines of information.
Currently, there is a process in place which picks these files up one at a time and takes each line one at a time and sends it to a stored procedure to insert the contents into a database.
This means that over the course of a day, it can struggle to get through the 3000 files before the next 3000 come in!
I'm looking to improve this process and had the following ideas
Use new Parallel feature of C# 4.0 to allow multiple files to be processed at once, still passing through the lines one by one to the stored proc
Create a new temporary database table into which all the rows in a file can be inserted at once, then call the stored procedure on the newly added rows in the temp table.
Split the process into 2 tasks. One job to read data from the files into the temporary database table, the other to process the rows in the temporary table.
Any other ideas on how I could look at doing this? Currently it can take up to 20 seconds per file, I'd really like to improve performance on this considerably.
SQL Server Bulk Insert might be just what you need
http://msdn.microsoft.com/en-us/library/ms188365.aspx
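For illustration, one way to fire a BULK INSERT for a single file from C# (the table name and delimiters are assumptions, the path must come from trusted code since it is concatenated into the statement, and the file has to be visible to the SQL Server service account, not just the client machine):

```csharp
using System.Data.SqlClient;

static class CsvBulkLoader
{
    // Loads one CSV file into a staging table with a single BULK INSERT statement.
    public static void Load(string connectionString, string csvPath)
    {
        string sql =
            "BULK INSERT dbo.StagingTable " +
            $"FROM '{csvPath}' " +
            "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```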
Another issue that may be making all of those inserts take a long time is that every time a row is added, your table's indexes may need to be updated. A search like this will give lots of good articles on ways to maybe get better performance out of your current procedure:
http://www.google.com/search?q=sql+insert+performance
You can use SQL Server's native BCP utility.
More info about BCP utility can be found here: Importing and Exporting Bulk Data by Using the bcp Utility
You can also take a look at: About Bulk Import and Bulk Export Operations
Let's say that all 3000 files to be imported have 2000 rows each. That's 6 million rows per day. The bottleneck might not be at the client doing the inserts, but with the database itself. If indexes are enabled on the table(s) in question, inserts could be slow, depending upon how heavily indexed the table(s) is/are. What indications have led you to conclude that it is the database which is waiting around for something to do and that it is the import routine that is lagging behind, rather than the other way around?
You said
"Currently, there is a process in place which picks these files up one at a time and takes each line one at a time and sends it to a stored procedure to insert the contents into a database."
(Emphasis added.)
That seems to mean one line equals one transaction.
Fix that.
Pre-process the files so they're acceptable for bulk loading.
Pre-process the files so they form valid SQL INSERT statements, and load them that way (in a single transaction).
I guess both of those sound like "replace your stored procedure". But the real point is to reduce the number of transactions. Either of those options would reduce the number of transactions for this process from 6 million a day (worst case) to 3000 a day.
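Another way to get the same reduction without replacing the stored procedure at all is to wrap each file in one explicit transaction; a sketch (the procedure name and its parameter are assumptions):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

static class FileImporter
{
    // Imports one CSV file as a single transaction instead of one transaction per line.
    public static void ImportFile(string connectionString, string csvPath)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction transaction = connection.BeginTransaction())
            {
                foreach (string line in File.ReadLines(csvPath))
                {
                    using (var command = new SqlCommand("dbo.InsertCsvLine", connection, transaction))
                    {
                        command.CommandType = CommandType.StoredProcedure;
                        command.Parameters.AddWithValue("@line", line);
                        command.ExecuteNonQuery();
                    }
                }
                transaction.Commit();   // 3000 commits a day instead of up to 6 million
            }
        }
    }
}
```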