Importing large data by API - C#

I am trying to develop a system in which I will sync my database with a third-party database via a provided API.
The API has a format in which we can provide a From-Date and a To-Date.
Problems
There is no API which gives me only modified records.
The data is too large (1,000 records/day on average).
I need a scheduler so that all the records are updated automatically.
I also need to keep track of modified records (which is the biggest problem, as I can't get them by modified date).
Note: As per the previous requirement, I have already developed a system in which I can specify the From-Date and To-Date and the records get updated (it is complete, with a GUI; no Ajax was used). Even if I request a single day's records, the system gets a timeout error.
Note 2: I really shouldn't say this, but the client is very strict; he just needs a solution, nothing else will do.

Assuming that the data doesn't need to be "fresh", can you not write a process to run hourly / nightly, fetching that day's worth of data and processing it into your DB?
Obviously this would only work if you're sure previous records are not updated?
Does the API provide batches?
Why did you choose a web client with Ajax to process this data? Would a Windows / console application be better suited?

If the data is too big to retrieve by any given query, you're just going to have to do it by ID. Figure out a good size (100 records? 250?), and just spin through every record in the system by groups of that size.
You didn't say if you're pulling down data, pushing up data, or both. If you're only pulling it down, then that's the best you can do, and it will get slower and slower as more records are added. If you're only pushing it, then you can track a "pushed date". If it's both, how do you resolve conflicts?
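If it helps, here is a minimal sketch of the pull-by-ID approach, assuming an integer primary key. The three delegates are hypothetical hooks, not a real client library: wire them to your own database and to whatever calls the third-party API actually exposes.

    using System;
    using System.Collections.Generic;

    class BatchPuller
    {
        const int BatchSize = 250; // tune: 100? 250? measure what the API tolerates

        // All three hooks are hypothetical stand-ins for your own code.
        public static void PullAll(
            Func<int, int, List<int>> fetchLocalIds,   // (lastId, batchSize) -> next block of IDs
            Func<List<int>, List<object>> fetchRemote, // ids -> current remote records
            Action<List<object>> upsertLocally)        // write the records into your DB
        {
            int lastId = 0;
            while (true)
            {
                List<int> ids = fetchLocalIds(lastId, BatchSize);
                if (ids.Count == 0)
                    break;                             // walked the whole key space

                upsertLocally(fetchRemote(ids));       // overwrite the local copies

                lastId = ids[ids.Count - 1];           // resume after the last key seen
            }
        }
    }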

Related

Use MSSQL CLR to retrieve data as JSON

This is my problem:
I have a table with more than 1 million records, and I'm using those records to generate reports with Crystal Reports. When selecting a large number of records, timeout errors sometimes occur, or the computer gets stuck.
I have already added indexes, and I'm retrieving the data using stored procedures.
There are several tables joined with the main table (the one with 1 million records), and the data is also grouped inside the report.
So I'm asking: can't we use MSSQL CLR to get this data from the database by compressing it or converting it to lightweight objects? If anyone has an idea, it will be appreciated.
Thanks.
There are two separate issues in your post, and they are unlikely to be solved by a CLR solution:
Timeout
There are no details about where the timeout actually occurs (on the RDBMS while performing the selection, on the client side while retrieving the data, or in the report engine while actually building the report), so I suppose the timeout occurs on the RDBMS. Building a CLR library will not improve the time required to gather the data. The solution is indexing (as you already did) and query-plan analysis to identify bottlenecks and issues. Had you provided any SQL code, it would have been possible to give you some relevant hints.
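If the timeout instead turns out to be on the client side, one knob worth knowing about is ADO.NET's CommandTimeout, which defaults to 30 seconds. A minimal sketch, with an assumed connection string and procedure name; raising it only hides a slow query, so treat it as a stopgap while the indexing work happens:

    using System.Data;
    using System.Data.SqlClient;

    class ReportRunner
    {
        static void Main()
        {
            // Connection string and procedure name are assumptions.
            using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Reports;Integrated Security=True"))
            using (var cmd = new SqlCommand("dbo.GetReportData", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.CommandTimeout = 300; // seconds; the default 30 is easy to exceed on big reports

                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // stream rows into the report instead of buffering everything
                    }
                }
            }
        }
    }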
Computer getting stuck
This looks like an issue related to the amount of data making the machine struggle, and there is very little you can do. Once again, a CLR library on the server will not lower the amount of data handed to the client, and IMHO it would only make the situation worse (the client would receive compressed data to uncompress: an additional heavy task for an already overloaded machine). Your best bets are to increase the amount of RAM, buy a better CPU (higher speed and/or more cores), and run the reports on a schedule rather than interactively.
There are no technical details at all in the post, so all of the above are more or less wild guesses; should you need a more detailed answer and advice, please post another question with technical details about the issues.
https://stackoverflow.com/help/how-to-ask

Best way to page through results in SqlCommand?

I have a database table that contains RTF documents. I need to extract these programmatically (I am aware I can use a cursor to step through the table; I need to do some data manipulation). I created a C# program that will do that, but the problem is that it cannot load the whole table (about 2 million rows) into memory.
There is an MSDN page here that says there are basically two ways to loop through the data:
Use the DataAdapter.Fill method to load page by page.
Run the query many times, iterating using the primary key. Basically you run it with a TOP 500 limit (or whatever) and PK > (last PK).
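A minimal sketch of option 2 (keyset paging) with a plain SqlDataReader; the connection string and the table and column names ("dbo.Document", "Id", "Rtf") are assumptions, so adjust them to the real schema:

    using System.Data.SqlClient;

    class DocumentPager
    {
        static void Main()
        {
            const string connStr = "Data Source=.;Initial Catalog=Docs;Integrated Security=True";

            int lastPk = 0;
            while (true)
            {
                using (var conn = new SqlConnection(connStr))
                using (var cmd = new SqlCommand(
                    @"SELECT TOP 500 Id, Rtf
                      FROM dbo.Document
                      WHERE Id > @lastPk
                      ORDER BY Id", conn))
                {
                    cmd.Parameters.AddWithValue("@lastPk", lastPk);
                    conn.Open();

                    int rows = 0;
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            lastPk = reader.GetInt32(0);     // remember where this page ended
                            string rtf = reader.GetString(1);
                            // ... do the data manipulation here ...
                            rows++;
                        }
                    }
                    if (rows == 0)
                        return; // ran off the end of the table: done
                }
            }
        }
    }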
I have tried option 2, and it seems to work. But can I be sure I am pulling back all the data? When I do a SELECT COUNT(*) FROM Document, it pulls back the same number of rows. Still, I'm nervous. Any tips for data validation?
Also, which is faster? The data query is pretty slow; I optimized the query as much as possible, but there is a ton of data to transport over the WAN.
I think the answer requires a lot more understanding of your true requirements. It's hard for me to imagine a recurring process or requirement where you have to regularly extract 2 million binary files to do some processing on them! If this is a one-time thing then alright, let's get 'er done!
Here are some initial thoughts:
Could you deploy your C# routine to SQL directly and execute everything via CLR?
Could you run your C# app locally on the box and take advantage of the shared memory protocol?
Do you have to process every single row? If, for instance, you're validating whether the structure of the RTF data has changed versus another file, could you create hashes of each that can be compared?
If you must get all the data out, maybe try exporting it to local disk and then XCOPY'ing it to another location.
If you want to get a chunk of rows at a time, create a table that just keeps a list of all IDs that have been processed. When grabbing the next 500 rows, just find rows that aren't in that table yet. Of course, update that table with the new IDs that you've exported. (A sketch of this follows the list.)
If you must do all this, it could have a serious effect on OLTP performance. Either throttle it to only run off hours, or take a *.bak and process it on a separate box. Actually, if this is a one-time thing, restore it to the same box that's running SQL and use the shared memory protocol.
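A minimal sketch of that tracking-table idea, assuming a helper table created once with CREATE TABLE dbo.ProcessedDocs (Id int PRIMARY KEY) and the same hypothetical dbo.Document schema as above:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    class TrackedExporter
    {
        static void Main()
        {
            const string connStr = "Data Source=.;Initial Catalog=Docs;Integrated Security=True";

            using (var conn = new SqlConnection(connStr))
            {
                conn.Open();

                var mark = new SqlCommand("INSERT dbo.ProcessedDocs (Id) VALUES (@id)", conn);
                mark.Parameters.Add("@id", SqlDbType.Int);

                while (true)
                {
                    // Next 500 rows that have no entry in the tracking table yet.
                    var page = new SqlCommand(
                        @"SELECT TOP 500 d.Id, d.Rtf
                          FROM dbo.Document d
                          LEFT JOIN dbo.ProcessedDocs p ON p.Id = d.Id
                          WHERE p.Id IS NULL
                          ORDER BY d.Id", conn);

                    var done = new List<int>();
                    using (var reader = page.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            done.Add(reader.GetInt32(0));
                            // ... export / process reader.GetString(1) here ...
                        }
                    }
                    if (done.Count == 0)
                        break;

                    foreach (int id in done) // record what was exported
                    {
                        mark.Parameters["@id"].Value = id;
                        mark.ExecuteNonQuery();
                    }
                }
            }
        }
    }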

How to load millions of records in ASP.NET MVC 3 or 4 to the client side (slow issue)

I have a project that loads millions of records. I will be using ASP.NET MVC 3 or 4. I am sure my page would load very slowly because of the amount of data retrieved from the server. I have been creating a SQL Server Agent job to perform the queries early and save the results in a temporary table, which will be used by the site. I am sure there are many better ways. Please help. I have heard of IndexedDB and WebSQL for client-side databases. Any help/suggestion would be appreciated.
My first post/question here on Stack Overflow!
Thanks in advance!
You might want to look at pagination. If the search returns 700,000+ records, you can separate them into different pages (100 per page or something) so the page won't take forever to load; a sketch follows below.
Check a similar question here.
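A minimal server-side paging sketch for an MVC controller. "Record" and the "Db" context are hypothetical stand-ins for your own model and data access, stubbed here only so the sketch is self-contained:

    using System.Linq;
    using System.Web.Mvc;

    // Hypothetical EF model, stubbed for the sketch.
    public class Record { public int Id { get; set; } public string Name { get; set; } }
    public class Db : System.Data.Entity.DbContext
    {
        public System.Data.Entity.DbSet<Record> Records { get; set; }
    }

    public class RecordsController : Controller
    {
        private readonly Db _db = new Db();

        public ActionResult Index(int page = 1, int pageSize = 100)
        {
            var rows = _db.Records
                          .OrderBy(r => r.Id)          // a stable order is required before Skip/Take
                          .Skip((page - 1) * pageSize) // translated into SQL by the provider
                          .Take(pageSize)
                          .ToList();                   // only one page crosses the wire
            return View(rows);
        }
    }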
I've been dealing with a similar problem: 500K records stored on the client side in IndexedDB. I use multiple web workers to store the data on the client side (supported only in Chrome), and I only post the IDs and the action applied to the data to the server side.
In order to achieve greater speed, I've found that the optimal data transfer rate is achieved using 4-8 web workers, all retrieving and storing 1K-2.5K items per call; that's about 1 MB of data.
Things you should consider are:
limiting the actions that are allowed during the synchronization process
updates of the data on the server vs. the client side - I use the approach of always updating the data on the server side and calling a sync procedure to sync the changed data back to the client side
what you are going to store in memory - Google Chrome's limit is 1.6 GB, so be careful with memory leaks; I currently use no more than 300 MB during synchronization, and 100-150 MB is the average during regular work
You need to split the data into pages and retrieve a certain range from the database. Do you really need millions of rows at once? I think there should be a filter first of all.
Paging is definitely the way to go. However, you want to make the site responsive, so what you should do is use Ajax running in the background to load the data continuously while letting the user interact with the site using the initial set of data.

Providing "Totals" for custom SQL queries on a regular basis

I would like some advice on how to best go about what I'm trying to achieve.
I'd like to provide the user with a screen that will display one or more "icons" (so to speak) and display a total next to each one (a bit like the iPhone does). Don't worry about the UI; the question is not about that. It is more about how to handle the back-end.
Let's say, for argument's sake, I want to provide the following:
Total number of unread records
Total number of waiting for approval
Total number of pre-approved
Total number of approved
etc...
I suppose the easiest way to describe the above would be "MS Outlook". Whenever emails arrive in your inbox, you can see the number of unread emails being updated immediately. I know it's local, so it's a bit different, but now imagine the same principle applied to the queries above.
This could vary from user to user, and while dynamic stored procedures are not ideal, I don't think I could write one SP for each scenario; but again, that's not the issue here.
Now the recommendation part:
Should I be creating a timer that polls the database every minute (for example?) and runs all my relevant SQL queries, which will then provide me with the relevant information?
Is there a way to do this in real time without having a "polling" mechanism i.e. Whenever a query changes, it updates the total/count and then pushes out the count of the query to the relevant client(s)?
Should I have some sort of table storing these "totals" for each query and handle the updating of these immediately based on triggers in SQL and then when queried by a user, it would only read the "total" rather than trying to calculate them?
The problem with triggers is that these would have to be defined individually, and I'm really trying to keep this as generic as possible... Again, I'm not 100% clear on how to handle this, to be honest, so let me know what you think is best or how you would go about it.
Ideally, when a specific query is created, I'd like to provide two choices: a) general (where anyone can use it) and b) specific, where the "username" would be used as part of the query and the count returned would apply only to that user; but that's another issue.
The important part is really the notification part. While the polling is easy, I'm not sure I like it.
Imagine if I had 50 queries to execute and 500 users (unlikely, but still!) looking at the screen with these icons. If 500 users polled the database every minute and 50 queries were executed each time, that could potentially be 25,000 queries per minute... It just doesn't sound right.
As mentioned, ideally, a) I'd love to have the data change in real time rather than having to wait a minute to be notified of a new "count", and b) I want to reduce the number of queries to a minimum. Maybe I won't have a choice.
The idea behind this is that they will have a small icon for each of these queries, with a little number displayed next to it indicating how many records apply to the relevant query. When they click on it, it will bring them the relevant result data rather than just the count, and they can then deal with it accordingly.
I don't know if I've explained this correctly; if it's unclear, please ask, but hopefully I have, and I'll be able to get some feedback on this.
Looking forward to your feedback.
Thanks.
I am not sure if this is the ideal solution, but it may be a decent one.
The following are the assumptions I have made:
Your front end is a web application, i.e. ASP.NET.
The data which needs to be fetched on a regular basis is not huge.
The data which needs to be fetched does not change very frequently.
If I were in this situation, I would go with the following approach:
Implement SQL caching using the SqlCacheDependency class. This class will fetch the data from the database and store it in the application's cache. The cache gets invalidated whenever the data in the table on which the dependency is created changes, thus fetching the new data and building the cache again. You just need to get the data from the cache; everything else (polling the database, etc.) is done by ASP.NET itself. Here is a link which describes the steps to implement SQL caching, and believe me, it is not that difficult to implement. (A sketch follows after this list.)
Use AJAX to update the counts on the UI so that the user does not feel the pinch of a PostBack.
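A minimal SqlCacheDependency sketch. It assumes polling-based notifications were enabled once with aspnet_regsql (-ed for the database, -et -t for the table) and that web.config has a matching <sqlCacheDependency> entry; the entry name "MyDb", the table "PendingApprovals", and the QueryPendingApprovalCount helper are all assumptions:

    using System.Web;
    using System.Web.Caching;

    public static class ApprovalCounts
    {
        public static int GetPendingCount(HttpContext ctx)
        {
            object cached = ctx.Cache["PendingApprovalCount"];
            if (cached != null)
                return (int)cached; // served from cache: no query at all

            int count = QueryPendingApprovalCount();

            // ASP.NET drops this entry when the PendingApprovals table changes,
            // so the next request recomputes and re-caches the total.
            var dependency = new SqlCacheDependency("MyDb", "PendingApprovals");
            ctx.Cache.Insert("PendingApprovalCount", count, dependency);
            return count;
        }

        static int QueryPendingApprovalCount()
        {
            // placeholder for your own DB call,
            // e.g. SELECT COUNT(*) FROM dbo.PendingApprovals WHERE ...
            return 0;
        }
    }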
What about "Improving Performance with SQL Server 2008 Indexed Views"?
"This is often particularly effective for aggregate views in decision
support or data warehouse environments"

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have a SQL Server 2008 database and am using C# 4.0 with LINQ to Entities classes set up for database interaction.
There exists a table which is indexed on a DateTime column whose value is the insertion time for the row. Several new rows are added each second (~20), and I need to pull them into memory effectively so that I can display them in a GUI. For simplicity, let's just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results, forcing me to become a slow consumer (getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are:
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access:
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control; I wish only to read the rows, never to modify or add. The most important things are to not overload the SQL Server, to keep the GUI up to date and responsive, and to use as little memory as possible... you know, the basics ;)
Thanks!
I'm a little late to the party here, but if your edition of SQL Server 2008 supports it, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from it into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.
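As a rough illustration, reading the captured changes from C# might look like the following. It assumes CDC was enabled once by a DBA (EXEC sys.sp_cdc_enable_db; then EXEC sys.sp_cdc_enable_table for the source table); the capture instance "dbo_Readings", the table, and the connection string are assumptions, while the fn_cdc_* functions are the standard ones SQL Server generates:

    using System;
    using System.Data.SqlClient;

    class CdcReader
    {
        // "dbo_Readings" follows the default schema_table capture-instance name.
        const string Sql = @"
            DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn('dbo_Readings');
            DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();
            SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Readings(@from, @to, N'all');";

        static void Main()
        {
            using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Telemetry;Integrated Security=True"))
            using (var cmd = new SqlCommand(Sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // __$operation column: 1 = delete, 2 = insert,
                        // 3 = update (before image), 4 = update (after image)
                    }
                }
            }
        }
    }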
Rather than polling the database, maybe you can use SQL Server Service Broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.
If your table is updated constantly at 20 rows a second, then there is nothing better to do than to pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) to retrieve the last rows that were inserted, this method will consume the fewest resources. A polling sketch follows.
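A minimal polling sketch along those lines; the table and column names ("dbo.Readings", "InsertedAt") and the connection string are assumptions:

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    class NewRowPoller
    {
        static void Main()
        {
            const string connStr = "Data Source=.;Initial Catalog=Telemetry;Integrated Security=True";
            DateTime lastSeen = DateTime.MinValue;

            while (true)
            {
                using (var conn = new SqlConnection(connStr))
                using (var cmd = new SqlCommand(
                    @"SELECT Id, InsertedAt, Payload
                      FROM dbo.Readings
                      WHERE InsertedAt > @lastSeen
                      ORDER BY InsertedAt", conn))
                {
                    cmd.Parameters.AddWithValue("@lastSeen", lastSeen);
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            lastSeen = reader.GetDateTime(1); // high-water mark for the next poll
                            // ... marshal the row onto the WPF dispatcher for display ...
                        }
                    }
                }
                Thread.Sleep(1000); // once a second: only new rows cross the wire
            }
        }
    }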
If the updates occur in bursts of 20 per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way; read The Mysterious Notification to understand how it actually works). You can mix LINQ with SqlDependency; see linq2cache. A sketch follows.
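A minimal SqlDependency sketch, assuming Service Broker is enabled on the database and the same hypothetical table names as above. Note the query must be notification-eligible (explicit column list, two-part table name, no SELECT *), and each notification is one-shot, so you re-subscribe after it fires:

    using System;
    using System.Data.SqlClient;

    class NewRowWatcher
    {
        const string ConnStr = "Data Source=.;Initial Catalog=Telemetry;Integrated Security=True";

        static void Main()
        {
            SqlDependency.Start(ConnStr); // once per app domain
            Subscribe();
            Console.ReadLine();           // keep the process alive in this sketch
        }

        static void Subscribe()
        {
            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(
                "SELECT Id, InsertedAt FROM dbo.Readings", conn))
            {
                var dependency = new SqlDependency(cmd);
                dependency.OnChange += (sender, e) =>
                {
                    // Notifications are one-shot: re-query and re-subscribe here.
                    Subscribe();
                };

                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // ... refresh the newest-50 list in the GUI ...
                    }
                }
            }
        }
    }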
Do you have to query to be notified of new data?
You may be better off using push notifications from a service bus (e.g. NServiceBus).
Using notifications (i.e. events) is almost always a better solution than polling.
