I have almost 100,000 records in the database and I need to compare them to each other with the Longest Common Subsequence (LCS) algorithm, and I need to do that with 1,000 new records every day.
My application is written in C# .NET, and the problem is that the comparison is slow at the application level: comparing 1,000 records takes more than 10 hours.
Does anyone know how much faster this would be if I wrote the algorithm as a stored procedure in SQL, or is there any other way?
You might want to try writing a stored proc in C# if you are using SQL Server 2005 or 2008. This might scale better in the long run as you get more and more records and can no longer keep them all in memory.
Check out the MSDN Introduction to SQL Server CLR Integration.
This will use more CPU on your DB server, but you don't have to transfer data back and forth.
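If you go that route, a CLR procedure is just a static C# method decorated with the SqlProcedure attribute. A minimal sketch, with a hypothetical Records table and the comparison step left as a comment:

    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public class LcsProcedures
    {
        [SqlProcedure]
        public static void CompareRecord(int newRecordId)
        {
            // "context connection=true" runs inside the hosting SQL Server
            // process, so no data crosses the network.
            using (SqlConnection conn = new SqlConnection("context connection=true"))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand(
                    "SELECT Id, Content FROM dbo.Records WHERE Id <> @id", conn);
                cmd.Parameters.AddWithValue("@id", newRecordId);

                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Run the LCS comparison here; collect the results and
                        // write them back (or send them via SqlContext.Pipe)
                        // after the reader is closed.
                    }
                }
            }
        }
    }

You register the compiled assembly with CREATE ASSEMBLY and expose the method with CREATE PROCEDURE ... EXTERNAL NAME, as described in the MSDN article above.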
If you have 'just' 100,000 records, collect them all when your app starts, run your algorithm in memory, and store any results/alterations to the db when you finish.
It'll be much faster.
I'm not sure T-SQL will give you the same flexibility that C# does, especially when you deal with complex algorithms like LCS. Store all the needed records in memory and work with them from there.
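For reference, the usual dynamic-programming LCS in C# looks roughly like the sketch below (assuming this is more or less what you are already running); it is O(n*m) time per pair of records, which is why an all-pairs comparison over 100,000 records gets expensive so quickly:

    using System;

    static class Lcs
    {
        // Classic DP formulation: O(a.Length * b.Length) time,
        // but only two rows of the table are kept in memory.
        public static int Length(string a, string b)
        {
            int[] prev = new int[b.Length + 1];
            int[] curr = new int[b.Length + 1];

            for (int i = 1; i <= a.Length; i++)
            {
                for (int j = 1; j <= b.Length; j++)
                {
                    curr[j] = a[i - 1] == b[j - 1]
                        ? prev[j - 1] + 1
                        : Math.Max(prev[j], curr[j - 1]);
                }
                int[] tmp = prev; prev = curr; curr = tmp;   // reuse the row buffers
                Array.Clear(curr, 0, curr.Length);
            }
            return prev[b.Length];
        }
    }

Keeping only two rows keeps memory flat, but the time cost per comparison is unchanged, so the bigger win is avoiding comparisons you don't need (see the ranking idea further down).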
The most important thing is to think outside the box for a minute and consider another approach: try to insert flags (a ranking) of some kind when a new item is inserted. No one can really advise you here, since you haven't given us any detail about what you are doing and what you are comparing, but you can probably ease the process with some ranking computed during insertion. I don't mean running the full comparison whenever a new item is added, but rather triggering an event every hour or so that updates the table without user input.
It's true that a stored procedure is faster than LINQ or a view; that is the way to collect your data quickly.
How do you determine that two of your records follow on from each other (i.e. that they're part of a sub-sequence)? Maybe you don't need to compare the whole 1MB of each record and could speed things up by only analysing some portion of that?
It sounds to me like your algorithm is flawed, or that a DB might not be the best way of storing your data, if it's taking 2 seconds to compare each record.
I am accessing a UniVerse database and reading out all the records in it for the purpose of synchronizing it to a MySQL database which is used for compatibility with some other applications which use the data. Some of the tables are >250,000 records long with >100 columns and the server is rather old and still used by many simultaneous users and so it takes a very ... long ... time to read the records sometimes.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Fetching each record can take anywhere from 250 ms to three-quarters of a second, depending on database usage. Removing the extraction methods only speeds it up marginally, since I think it still downloads all of the record information anyway when I call UniFile.read, even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?
Do you really need to use SSELECT (sorted select)? The sorting on record key will create an additional performance overhead. If you do not need to synchronise in a sorted manner just use a plain SELECT and this should improve the performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatic-resizing - type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If, however, the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.
Ensure your files are sized correctly using ANALYZE-FILE fileName. If not dynamic ensure there is not too much overflow.
Using SELECT instead of SSELECT means you are reading data from the database sequentially rather than randomly, which will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the Pick data separators (chars 254, 253 and 252) will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to the performance.
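As an illustration only (the replacement delimiters and the helper name here are made up and depend on what the MySQL side can accept), the conversion in C# can be as simple as:

    static class PickMarks
    {
        // UniVerse dynamic-array delimiters - the chars 254/253/252 mentioned above.
        const char AttributeMark = '\u00FE'; // char 254
        const char ValueMark     = '\u00FD'; // char 253
        const char SubvalueMark  = '\u00FC'; // char 252

        // Split a raw record into attributes and swap the remaining marks
        // for separators the external database can store.
        public static string[] SplitRecord(string rawRecord)
        {
            string[] attributes = rawRecord.Split(AttributeMark);
            for (int i = 0; i < attributes.Length; i++)
            {
                attributes[i] = attributes[i]
                    .Replace(ValueMark, '|')     // pick separators that cannot
                    .Replace(SubvalueMark, '~'); // occur in your real data
            }
            return attributes;
        }
    }

Doing this with a single pass of Split/Replace per record is usually far cheaper than character-by-character string concatenation.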
It is not clear from the initial post, however a WRITESEQ would probably be the most efficient way to output the file data.
I have an MSSQL 2008 table with a few million records. I need to iterate over each row, modify some of the data, and copy the updated record to a new table using a C# application that gets executed on a daily basis.
I have tried doing this using ADO.NET entities, but there are memory issues involved with this method, not to mention it is very slow. I have read up on bulk-copy libraries and SQL-only ways for copying one table to another, but none of them involve modifying records before copying them. I need to find a better way for performing this operation.
As you mention memory issues I'm guessing you're trying to load the million rows into memory, process them and then write them back to the database.
You can avoid this by 'streaming' the data instead of loading it all at once. The SqlDataReader will handle buffering for you, so on the reading side you can do a simple WHILE loop that fetches rows one by one. It seems you already have the actual conversion working, so all you need to do is take care of writing the results back into the database. IMHO the fastest way to do so is to store a buffer of multiple results (start with 100, work up and see where the sweet spot is) in a DataTable and then push that DataTable into the database using the SqlBulkCopy class.
Rinse & repeat.
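A rough sketch of that read/transform/bulk-write loop; the table, column names and the Transform method are placeholders standing in for whatever per-row conversion you already have:

    using System.Data;
    using System.Data.SqlClient;

    static void CopyAndTransform(string connectionString)
    {
        DataTable buffer = new DataTable();
        buffer.Columns.Add("Id", typeof(int));
        buffer.Columns.Add("Body", typeof(string));

        using (SqlConnection source = new SqlConnection(connectionString))
        using (SqlConnection target = new SqlConnection(connectionString))
        {
            source.Open();
            target.Open();

            SqlCommand cmd = new SqlCommand("SELECT Id, Body FROM dbo.SourceTable", source);
            using (SqlDataReader reader = cmd.ExecuteReader())
            using (SqlBulkCopy bulk = new SqlBulkCopy(target) { DestinationTableName = "dbo.TargetTable" })
            {
                while (reader.Read())
                {
                    buffer.Rows.Add(reader.GetInt32(0), Transform(reader.GetString(1)));

                    if (buffer.Rows.Count >= 1000)   // tune this batch size
                    {
                        bulk.WriteToServer(buffer);
                        buffer.Clear();
                    }
                }
                if (buffer.Rows.Count > 0)
                    bulk.WriteToServer(buffer);
            }
        }
    }

    // Placeholder for the per-row conversion you already have working.
    static string Transform(string body)
    {
        return body;
    }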
PS: Sounds like a 'fun' problem. Do you have any sample data sitting somewhere to test this out? 5 hours sounds like a LONG time for something that looks trivial at first; then again, 20 million times virtually nothing still adds up. More specifically, I wonder how 'large' the data is on the RTF side: are the values ca. 2 KB on average, or more like 200 KB? And what kind of hardware do you run this on?
The fastest performing option would be to re-write your C# application logic into a CLR stored procedure so that all processing takes place on the server.
Checking around the internet, it looks like Microsoft's official answer to converting rich text to plain text is to load the data into a RichTextBox control and then pull it out with the RichTextBox.Text property. That sucks for a lot of reasons, but mostly because it means you're going to have to get your hands dirty. Your best bet is to write a small app that invokes the RichTextBox control and passes all of your data to/from the database (using the SqlDataReader should alleviate the memory issues you mentioned).
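The conversion itself is only a few lines. A sketch, assuming the column really contains valid RTF and noting that RichTextBox lives in System.Windows.Forms (and, strictly speaking, wants an STA thread):

    using System.Windows.Forms;

    static string RtfToPlainText(string rtf)
    {
        // The control does the conversion: assign Rtf, read back Text.
        // Invalid RTF will throw an ArgumentException here.
        using (RichTextBox box = new RichTextBox())
        {
            box.Rtf = rtf;
            return box.Text;
        }
    }

Reusing a single RichTextBox instance instead of constructing one per row should also save a noticeable amount of time over millions of records.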
Just as a matter of process - I would suggest building an intermediary table that your "cleansed" data rows get dumped into before appending them to your production table. Once you get the stored proc figured out just right, you can create a trigger that automatically invokes your stored proc every time a record gets added to your dirty table. This will ultimately eliminate the need to run your program every day to move records, as the trigger will make sure it happens "on the fly".
Edit - one last thought
It occurred to me that you might not be comfortable writing stored procedures and triggers, which is ok. A more "programmatic" solution would be to kick all of the files in your dirty table out to a delimited text file, which can easily be downloaded and parsed. Once you have the text file, you could manipulate it with your app (read it, cleanse it, create a cleansed file..what have you) and then upload for reading back into your database. Depending on your comfort/background/skill level, this might actually be the better solution to get the job done.
Hope this helps!
Use SSIS. Schedule a daily job that does your transformation and runs the SSIS package. This will take care of batching and memory consumption, and will offer a few fast connectors for the read and write of data. You can embed your custom C# code (the RTF stripping into pure text) as an SSIS component, see Developing Custom Objects for Integration Services.
This is more or less a design question. We have to process about 1 million rows and send an XML file to a third party. Initially we have to send all 1 million records; later we will send only the deltas.
Right now the stored procedure is taking approximately 15 to 20 minutes to return the data. It's a console app at the moment. I know it's not a good way to get 1 million records at a time.
I want to know the following things:
1) Is a console app in C# that connects to the database the right approach or not?
2) Are there any other ways of doing this?
I'd appreciate your guidance on this; there is no need for any coding, we just need some advice on how to proceed.
Thanks in advance.
My thoughts:
don't fetch all the data and then process it, but process it as it arrives - via IDataReader or LINQ
use an equally streaming approach for the file; perhaps XmlWriter directly, or maybe XStreamingElement - in either case reading from the source above (a rough sketch follows below)
this vastly reduces the amount of memory you need, and allows your machine to do something useful while waiting on the network IO
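A rough sketch of that combination, with a placeholder connection string, query and element names:

    using System.Data.SqlClient;
    using System.Xml;

    static void ExportRows(string connectionString, string outputPath)
    {
        XmlWriterSettings settings = new XmlWriterSettings { Indent = true };

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (XmlWriter xml = XmlWriter.Create(outputPath, settings))
        {
            conn.Open();
            xml.WriteStartElement("Rows");

            SqlCommand cmd = new SqlCommand("SELECT Id, Name FROM dbo.SourceRows", conn);
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())   // only one row is held in memory at a time
                {
                    xml.WriteStartElement("Row");
                    xml.WriteAttributeString("Id", reader.GetInt32(0).ToString());
                    xml.WriteElementString("Name", reader.GetString(1));
                    xml.WriteEndElement();
                }
            }

            xml.WriteEndElement();
        }
    }

Both the reader and the writer stream, so memory use stays flat no matter how many rows the stored procedure returns.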
Re 1: Depends on your architecture. That simple. It is a VIABLE approach.
Re 2: Yes, tons. All viable. You could make a system service that handles data generation upon request. You could have a web application.
In general, a console app will work fine, and 1 million rows in a result set are not exactly a lot either. Totally workable.
15-20 minutes is odd, though. Where is the time spent? Transferring and writing out 1 million rows should not take more than 2-3 minutes.
1) Yeah, why not.
2) Yes.
Use cursors.
You will need to be a little more specific on what you are doing during the 15 to 20 minutes.
You are asking about the "right" way to do things - what are you optimising for?
Speed? A 15 - 20 minute stored proc sounds dangerous. What is it doing?
Maintenance / Readability? A console app will work. It would also be easier to test (unit testing etc) than a stored proc.
I have never liked long-running stored procedures because it's not easy to see progress. At least with a console app you can output something.
Trust me, 1 million records isn't a big deal for a well-known commercial database; it shouldn't take 15 to 20 minutes to return the records. Something else is wrong! Are you building the XML in the stored procedure? If so, remove that and implement the XML building in C#. The SP should have only one simple task: fetching data. It won't take long if you are not joining 1 million records against another 1 million records. Once the data comes into the application (a console application is OK in this case), build the XML with, say, LINQ-to-XML. If you are still not satisfied with the performance, make your code parallel.
EDIT: Your SP is time-consuming, and you need to optimize it. An example: in the SP, T_Data with 1m records joins T_User with 1m records, which costs a lot of time. After optimization: in the SP, T_Data joins a single record from T_User (essentially a WHERE expression, which is very fast); in the C# code you fetch the records from T_User and, for each record, call the SP, get the data, and build one piece of your XML. All of these can be processed concurrently. At the end, you merge all the pieces of XML into one.
I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added per second (~20), and I need to efficiently pull them into memory so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly with 20 rows a second, then there is nothing better to do than to pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources.
If the updates occur in bursts of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache.
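Since SqlDependency came up, a minimal usage sketch follows; the table and column names are placeholders, and the query has to follow the query-notification rules (explicit column list, two-part table names, no unsupported constructs):

    using System.Data.SqlClient;

    class NewRowWatcher
    {
        private readonly string _connectionString;

        public NewRowWatcher(string connectionString)
        {
            _connectionString = connectionString;
            SqlDependency.Start(_connectionString);   // once per application
            Subscribe();
        }

        private void Subscribe()
        {
            using (SqlConnection conn = new SqlConnection(_connectionString))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT Id, InsertedAt FROM dbo.Updates", conn))
            {
                // Attach the dependency before executing the command.
                SqlDependency dependency = new SqlDependency(cmd);
                dependency.OnChange += OnDependencyChange;

                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Refresh your in-memory list / GUI binding here.
                    }
                }
            }
        }

        private void OnDependencyChange(object sender, SqlNotificationEventArgs e)
        {
            // A notification fires only once; re-subscribe and re-read.
            Subscribe();
        }
    }

Remember to call SqlDependency.Stop with the same connection string when the application shuts down.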
Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e. events) is almost always a better solution than polling.
I am writing an application that logs status updates (GPS locations) from devices to a database. The updates occur at a set interval for each device, which is currently every 3 seconds. I'm using a simple table in SQL Server 08 for storing each update.
I've noticed that running the inserts is an area of slowdown in my application. It's not a severe slowdown, but it is noticeable. Naturally, I'd like to write to the database as efficiently as possible. I have an idea to improve the performance and am looking for input and advice to see if it will help:
The status updates come in from an asynchronous Socket thread. In my current implementation, the database insert call is executed from this thread. I'm thinking I can create a queue for holding update data that the Socket thread can quickly add its update to and then go on its merry way. There would then be a separate thread whose sole responsibility would be checking the update queue and inserting the updates into the database.
Basically this whole process rests on the assumption that writing to the database from one location with a bunch of data all at once is more efficient than writing one row of data at a random time. Is my assumption correct, or way off base? Also, on the SQL side, is there a command to tell it to write a bunch of rows at once that would improve write performance?
This is how the database is being written to:
I'm using LinqToSQL in C#, so for each insert, I first create a DataContext instance. From the DataContext object I then call a stored procedure which inserts the location update.
The table is indexed by datetime, for the time of the update.
Have a look at the SqlBulkCopy class - this allows you to use BCP to insert chunks of data very quickly.
Also, make sure your indexes are efficient. If you have a clustered index on anything that does not increase sequentially (integer, date) then you will suffer performance slowdowns as the pages are filled up.
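Tying that back to the queue idea in the question, one possible shape is sketched below; the table name, columns and the one-second flush interval are arbitrary, and error handling, shutdown and overlapping-flush protection are left out:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading;

    class UpdateWriter
    {
        private readonly List<object[]> _pending = new List<object[]>();
        private readonly object _lock = new object();
        private readonly string _connectionString;
        private readonly Timer _flushTimer;   // keep a reference so it isn't collected

        public UpdateWriter(string connectionString)
        {
            _connectionString = connectionString;
            // Flush the queue roughly once a second on a background thread.
            _flushTimer = new Timer(_ => Flush(), null, 1000, 1000);
        }

        // Called from the socket thread: just enqueue and return immediately.
        public void Enqueue(int deviceId, DateTime timestampUtc, double lat, double lon)
        {
            lock (_lock)
                _pending.Add(new object[] { deviceId, timestampUtc, lat, lon });
        }

        private void Flush()
        {
            List<object[]> batch;
            lock (_lock)
            {
                if (_pending.Count == 0) return;
                batch = new List<object[]>(_pending);
                _pending.Clear();
            }

            // Shape the batch as a DataTable and push it in one round trip.
            DataTable table = new DataTable();
            table.Columns.Add("DeviceId", typeof(int));
            table.Columns.Add("TimestampUtc", typeof(DateTime));
            table.Columns.Add("Latitude", typeof(double));
            table.Columns.Add("Longitude", typeof(double));
            foreach (object[] row in batch)
                table.Rows.Add(row);

            using (SqlConnection conn = new SqlConnection(_connectionString))
            using (SqlBulkCopy bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.StatusUpdates" })
            {
                conn.Open();
                bulk.WriteToServer(table);
            }
        }
    }

Bear in mind that anything still sitting in the in-memory queue is lost if the process dies before a flush, which is exactly the concern raised in a later answer.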
Have you looked at MSMQ (Microsoft Message Queuing)? That seems to me an option worth a look.
Yes, inserting in batches will typically be faster than separate inserts, given your description. Each insert requires a connection to be set up and packets to be transferred. If a single small insert takes one packet and you issue three of those separately, while the three inserts are small enough that they could all fit in one packet, then batching will help.
Quantifying it is difficult just based on your description - you'll need to do testing for that. For example, if you are keeping a dedicated connection open at all times anyway, as hova suggests, then you might see less of an impact.
Another area you might want to take a look at is whether you are setting up and tearing down a connection for each insert. That alone might make a performance improvement, negating the need for batching.
You'll also want to have as few indexes on the table as possible.
It sounds like a good idea. Why not give it a shot and see how it performs?
On the SQL side you'd want to have a look at making sure you are using parameterized queries.
Also batching your INSERT statements will certainly increase the performance.
Connection management is also key, of course that depends on how the application is built and whether it depends on a connection being there.
Aren't you afraid of losing data while you are collecting it for the batch copy?
I'm writing an application doing the same thing. At the start I will have to write data from 3.5k GPS devices. One device should send data every minute, but it can send faster. The target number of devices is 10.5k.
I'm wondering about insert performance too. For now I'm saving the received data to the db on every packet, using plain ADO.NET (ICommand) and a stored procedure. On my test server (Xeon 3.4 GHz and one 1 TB hard disk - a normal desktop ;) it currently takes 1 ms or less.
#GRIMUS - should I be worried when there are more devices?