I am wondering what approach to take.
Problem flow:
I have a web API which accepts requests from the client.
The API layer talks to the business layer and then to the data layer.
The data layer gets a huge record set (5,000,000 rows); the business layer then processes the columns and rows (using the maximum number of processor threads), and once processing is done the API streams the content as an Excel/CSV file to the client (browser).
Right now the entire download happens in one flow (fire and wait until the response is ready).
I would like to isolate this huge business operation of processing 5,000,000 rows into a separate engine or task queue (I don't want my web site to fall over with an out-of-memory exception), and I would also like to keep the user experience smooth.
I am considering server-sent events, SignalR, or browser long polling, so that I can push the file to the client once the data is processed and the file is ready.
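To make the intent concrete, this is roughly the shape I have in mind (only a sketch: ExportRequest, ExportQueue, ExportWorker and ExportHub are placeholder names, and the actual row processing and file writing are omitted):

```csharp
// Sketch only: names are placeholders, and the heavy 5,000,000-row processing is omitted.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.Hosting;

public record ExportRequest(string UserId, string QueryId);

public class ExportHub : Hub { }   // the browser connects here and listens for "exportReady"

public class ExportQueue
{
    private readonly Channel<ExportRequest> _channel = Channel.CreateUnbounded<ExportRequest>();

    public ValueTask EnqueueAsync(ExportRequest request) => _channel.Writer.WriteAsync(request);

    public IAsyncEnumerable<ExportRequest> DequeueAllAsync(CancellationToken ct) =>
        _channel.Reader.ReadAllAsync(ct);
}

public class ExportWorker : BackgroundService
{
    private readonly ExportQueue _queue;
    private readonly IHubContext<ExportHub> _hub;

    public ExportWorker(ExportQueue queue, IHubContext<ExportHub> hub)
    {
        _queue = queue;
        _hub = hub;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var request in _queue.DequeueAllAsync(stoppingToken))
        {
            // Heavy processing happens here, off the request thread, streaming rows
            // to a file on disk/blob storage instead of building the result in memory.
            var downloadUrl = $"/exports/{request.QueryId}.csv";

            // Tell the browser the file is ready; it then downloads it with a normal GET.
            await _hub.Clients.User(request.UserId)
                      .SendAsync("exportReady", downloadUrl, stoppingToken);
        }
    }
}
```

The API action would just enqueue the request and return 202 Accepted, so the web process never holds the whole result set in memory.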
Is there a better way to achieve this?
Here are a few suggestions based on what I understand.
Serialization. I would not recommend responding with CSV or Excel for such a large dataset unless it is the only format the client can handle. If you have some control over the workflow, I would change the client to accept a format such as JSON, or better yet an even more optimized serializer that is size- and speed-efficient, such as Protobuf, Avro, or Thrift.
Pagination. (Assuming you are able to implement the above suggestion.) Responding with a large payload in one go usually hurts performance across the board. It is very common for an API to accept parameters that define the page number and page size. In your case, you could create a unique reference id for the query (e.g. "query-001") which can then be called as /api/query/query-001?page=10&items-per-page=10000. A rough sketch of such an endpoint is shown after these suggestions.
Caching. If the query is made frequently, cache the result so that you are not hitting your data layer on every request (for example, when different pages of the same query are requested). You could either spool the data to disk or keep it in memory. A cache will significantly improve performance and can also spare you some painful debugging when it comes to performance-tuning the system.
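Here is a rough sketch of what such a paged, cached endpoint could look like, assuming ASP.NET Core and IMemoryCache; QueryController and IReportRepository are illustrative names, not something from your existing code:

```csharp
// Rough sketch, assuming ASP.NET Core and IMemoryCache; names are illustrative only.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Caching.Memory;

public interface IReportRepository
{
    Task<IReadOnlyList<object>> GetPageAsync(string queryId, int page, int itemsPerPage);
}

[ApiController]
[Route("api/query")]
public class QueryController : ControllerBase
{
    private readonly IMemoryCache _cache;
    private readonly IReportRepository _repository;

    public QueryController(IMemoryCache cache, IReportRepository repository)
    {
        _cache = cache;
        _repository = repository;
    }

    // GET /api/query/query-001?page=10&items-per-page=10000
    [HttpGet("{queryId}")]
    public async Task<IActionResult> GetPage(
        string queryId,
        [FromQuery] int page = 1,
        [FromQuery(Name = "items-per-page")] int itemsPerPage = 10000)
    {
        var cacheKey = $"{queryId}:{page}:{itemsPerPage}";

        // Serve repeated requests for the same page from the cache instead of the data layer.
        var rows = await _cache.GetOrCreateAsync(cacheKey, async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(10);
            return await _repository.GetPageAsync(queryId, page, itemsPerPage);
        });

        return Ok(rows);   // serialized as JSON by default
    }
}
```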
Related
I'm working on a program which will asynchronously load large amounts of data from a server via gRPC (both ends use protobuf-net.Grpc).
While I can simply throw large amounts of data at the client via IAsyncEnumerable, I sometimes want to prioritize sending certain parts earlier (where the priorities are determined on the fly and not known at the start, kind of like sending a video stream and skipping ahead).
If I were to send data and wait for a response every time I'd be leaving a lot of bandwidth unused. The alternative would be throwing tons of data at the client, which could cause network congestion and delay prioritized packets by an indefinite amount.
Can I somehow use HTTP/2's / TCP's flow/congestion control for myself here? Or will I need to implement a basic flow/congestion-control system on top of gRPC?
To be slightly more precise: I'd like to send as much data as possible without filling up any internal buffers and causing delays down the line.
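The direction I'm currently leaning towards is an application-level bound, roughly like the sketch below (simplified: Chunk and the priority handling are placeholders, and thread safety of the pending queue and re-prioritization while streaming are ignored):

```csharp
// Sketch of the idea: a small bounded buffer between the prioritizer and the gRPC stream.
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public record Chunk(byte[] Payload);

public class PrioritizedSender
{
    // Keep the capacity small: the larger it is, the more low-priority data can already
    // be "committed" to the stream when a high-priority request arrives.
    private readonly Channel<Chunk> _buffer = Channel.CreateBounded<Chunk>(capacity: 4);

    // Producer: always takes the currently highest-priority chunk and waits
    // (asynchronously) whenever the consumer side hasn't caught up yet.
    public async Task ProduceAsync(PriorityQueue<Chunk, int> pending, CancellationToken ct)
    {
        while (pending.TryDequeue(out var chunk, out _))
            await _buffer.Writer.WriteAsync(chunk, ct);
        _buffer.Writer.Complete();
    }

    // Consumer: this is what the protobuf-net.Grpc service method would return.
    public async IAsyncEnumerable<Chunk> StreamAsync(
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        await foreach (var chunk in _buffer.Reader.ReadAllAsync(ct))
            yield return chunk;
    }
}
```

The bounded channel is what stands in for flow control here: the producer can only get a few chunks ahead of what the stream has actually consumed.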
This is more of a programming strategy and direction question than a question about the actual code.
I am programming in C#.
I have an application that remotely starts processes on many different clients on the network, could be up to 1000 clients in theory.
It then monitors the status of the remote processes by reading a log file on each client.
I currently do this by running one thread that loops through all of the clients in a list, and reading the log file. It works fine for 10 or 20 machines, but 1000 would probably be untenable.
There are several problems with this approach:
First, if the thread doesn’t finish reading all of the client statuses before it’s called again, the client statuses at the end of the list might not be read and updated.
Secondly, if any client in the list goes offline during this period, the updating hangs until that client is back online again.
So I require a different approach, and have thought up a few possible ways to resolve this.
1. Spawn a separate thread for each client to read its log file and update its progress.
   a. However, I'm not sure whether having 1000 threads running on my machine would be acceptable.
2. Test the connection to each machine first, before trying to read the file, and if it cannot connect, ignore that client for this iteration and move on to the next one in the list.
   a. This still has the same problem of not getting through the list before the next call, and it adds more delay because it tests the connection via a port first. With 1000 clients, this would be noticeable.
3. Have each client send its data to the machine running the application whenever there is an update.
   a. This could create a lot of chatter, with 1000 machines trying to send data repeatedly.
So I'm trying to figure out whether there is a more efficient and reliable method that I haven't considered, or which one of these would be best.
Right now I'm leaning towards having the clients send updates to the application, instead of having the application pull the data.
Looking for thoughts, concerns, ideas and recommendations.
In my opinion, you are approaching this (monitoring) the wrong way. Instead of keeping the logs in a text file on each client, you would be better off storing them in a central data repository, which can be of any kind. Since you are monitoring the performance of those systems, your design and the mechanism behind it must not impact the target systems negatively; with the current design, the disk and CPU could be hit so hard in certain cases that the monitoring becomes a performance problem in itself.
I recommend creating a log repository server using a fast in-memory database like Redis and sending the logged data directly to that server. Keep in mind that this database should run on a separate machine. You can then configure Redis to persist the received data to disk once a particular number of entries is reached or a particular interval elapses. The in-memory nature is an advantage here, because a monitoring application like this queries the data a lot, and Redis is fast enough to handle millions of entries without trouble.
The blueprint is:
1. Centralize all log data in a single repository.
2. Configure clients to send the monitored information to the centralized repository.
3. Have the main server (the monitoring system) read the data from the centralized repository when required.
I'm not trying to advertise a particular tool here; I'm only sharing my own experience. There are many other tools you can use for this purpose, such as Elasticsearch.
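As a rough sketch of step 2, assuming the StackExchange.Redis client (the key scheme and class name are just illustrative), each client could push its status lines like this:

```csharp
// Sketch only: "logs:{machine}" is an illustrative key scheme, not a requirement.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class LogShipper
{
    private readonly IDatabase _db;
    private readonly string _machineName;

    public LogShipper(string redisHost, string machineName)
    {
        _db = ConnectionMultiplexer.Connect(redisHost).GetDatabase();
        _machineName = machineName;
    }

    // Called by each client whenever its local process writes a new status line.
    public Task PushAsync(string statusLine) =>
        _db.ListRightPushAsync($"logs:{_machineName}", $"{DateTime.UtcNow:O} {statusLine}");
}

// On the monitoring server, read whatever has arrived without touching the clients, e.g.:
// var latest = await db.ListRangeAsync("logs:client-042", -10, -1);  // last 10 entries
```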
I have a project that loads millions of records. I will be using ASP.NET MVC 3 or 4. I am sure my page would load very slowly because of how much data is retrieved from the server. I have been creating a SQL Server Agent job to run the queries ahead of time and save the results in a temporary table, which the site then uses. I am sure there are better ways; please help. I have heard of IndexedDB and WebSQL for a client-side database. Any help/suggestion would be appreciated.
This is my first post/question here on Stack Overflow!
Thanks in advance!
You might want to look at pagination. If the search returns 700,000+ records you can separate them into different pages (100 per page or something) so the page won't take forever to load.
Check a similar question here.
I've been dealing with a similar problem: 500K records stored on the client side in IndexedDB. I use multiple web workers to store the data on the client side (supported only in Chrome), and I only post the ids and the action applied to the data to the server side.
To achieve greater speed, I've found that the optimal data transfer rate is reached using 4-8 web workers, each retrieving and storing 1K-2.5K items per call, which is about 1 MB of data.
Things you should consider are:
- limiting the actions that are allowed during the synchronization process
- how to update data on the server vs. the client side: I use the approach of always updating the data on the server side and then calling a sync procedure to bring the changed data back to the client side (a sketch of such a sync call is shown after this list)
- what you are going to keep in memory: the Google Chrome limit is 1.6 GB, so be careful with memory leaks; I currently use no more than 300 MB during synchronization, and 100-150 MB is the average during regular work
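Since the original question mentions ASP.NET MVC, here is a minimal sketch of what the server side of that sync call could look like, assuming an Entity Framework context and a LastModified column; Item, AppDbContext and ItemsController are placeholder names:

```csharp
// Sketch only: placeholder model, context and controller; error handling omitted.
using System;
using System.Data.Entity;
using System.Linq;
using System.Web.Mvc;

public class Item
{
    public int Id { get; set; }
    public string Payload { get; set; }
    public DateTime LastModified { get; set; }
}

public class AppDbContext : DbContext
{
    public DbSet<Item> Items { get; set; }
}

public class ItemsController : Controller
{
    private readonly AppDbContext _db = new AppDbContext();

    // GET /Items/ChangedSince?since=2024-01-01T00:00:00Z
    // The client remembers the timestamp of its last sync, asks only for rows changed
    // after it, and merges the result into IndexedDB instead of re-downloading everything.
    public JsonResult ChangedSince(DateTime since)
    {
        var changed = _db.Items
                         .Where(i => i.LastModified > since)
                         .OrderBy(i => i.LastModified)
                         .Take(2500)                      // roughly one worker-sized batch
                         .ToList();

        return Json(changed, JsonRequestBehavior.AllowGet);
    }
}
```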
You need to split the data into pages and retrieve a certain range from the database. Do you really need millions of rows at once? I think there should be a filter first of all.
Paging is definitely the way to go. However, you also want to make the site responsive, so use Ajax in the background to load the data continuously while letting the user interact with the site using the initial set of data.
I wonder how to update fast-changing numbers on a website.
I have a machine that generates a lot of output, and I need to show it online. My problem is that the update frequency is high, and therefore I am not sure how to handle it.
It would be nice to show the last N numbers, say ten. The numbers are updated at 30 Hz. That might be too much for the human eye, but the display is only there so a person can keep an eye on things.
I wonder how to do this. A page reload would keep the browser continuously loading the page, and the web page would need to show more than just these numbers.
I could write a bare-bones web server that serves the numbers on a specific IP address and port, but even then I wonder whether the page reloading would be too slow and give users a strange experience.
How should I deal with such an extreme update rate of data on a website? Usually websites are not like that.
In the tags for this question I named the languages that I understand. In the end I will probably write in C#.
a) WebSockets, in conjunction with Ajax to update only parts of the page, would work. Disadvantage: the client's infrastructure (proxies) must support them, which is currently not the case 99% of the time.
b) With existing infrastructure the approach is long polling. You make an XMLHttpRequest using JavaScript. If no data is present, the request is held on the server side for, say, 5 to 10 seconds; as soon as data is available, the request is answered immediately, and the client then immediately sends a new request. I managed to get >500 updates per second with a Java client connecting via a proxy over HTTP to a web server (displaying real-time stock data).
You need to bundle several updates into each response in order to get enough throughput.
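A rough sketch of what the server side of option b) could look like, assuming ASP.NET Core (UpdateBuffer is just a stand-in for whatever accumulates the 30 Hz samples):

```csharp
// Sketch of a long-polling endpoint: hold the request until data arrives or a timeout passes,
// then answer with everything that accumulated since the last call.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

public class UpdateBuffer
{
    private readonly ConcurrentQueue<double> _pending = new();

    public void Add(double value) => _pending.Enqueue(value);

    public List<double> DrainAll()
    {
        var batch = new List<double>();
        while (_pending.TryDequeue(out var v)) batch.Add(v);
        return batch;
    }
}

[ApiController]
[Route("api/updates")]
public class UpdatesController : ControllerBase
{
    private readonly UpdateBuffer _buffer;

    public UpdatesController(UpdateBuffer buffer) => _buffer = buffer;

    // The browser calls this in a loop; several updates are bundled into one response.
    [HttpGet]
    public async Task<IActionResult> Poll()
    {
        var deadline = DateTime.UtcNow.AddSeconds(10);
        while (DateTime.UtcNow < deadline)
        {
            var batch = _buffer.DrainAll();
            if (batch.Count > 0)
                return Ok(batch);           // answer as soon as anything is available
            await Task.Delay(50);           // roughly the 30 Hz sample period
        }
        return Ok(Array.Empty<double>());   // timeout: the client simply polls again
    }
}
```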
You don't have to use a page reload. You can use WebSockets to establish an open two-way communication between a browser (via JavaScript) and your server.
Python Tornado has support for this built in, and there are a couple of other Python servers that support it as well. Socket.IO is a great JavaScript library, with fallbacks, that makes the client side easier.
On the backend you can use Redis or a NewSQL database like VoltDB for fast in-memory database updates. Caching helps a lot with high-latency components (especially in a write-heavy application).
On the front end you can look into WebSockets and the Comet web application model: http://en.wikipedia.org/wiki/Comet_%28programming%29
Many gaming companies have to deal with fast counter updates and displays, so it might be worth looking into how they do it. Zynga uses a protocol called AMF: http://highscalability.com/blog/2010/3/10/how-farmville-scales-the-follow-up.html
I have a smart client (WPF) that makes calls to the server via services (WCF). The screen I am working on holds a list of objects that it loads when the constructor is called. I am able to add, edit and delete records in the list.
Typically, after every add or delete I reload the entire model from the service again. There are a number of reasons for this, including the fact that the data may have changed on the server between calls.
This approach has proved to be a big performance hit, because I am loading everything and sending the whole list up and down the wire on every add and edit.
What other options are open to me? Should I only send the required information to the server, and how would I go about not reloading all the data every time an add or delete is performed?
The optimal way of doing what you're describing (I'm going to assume that you know that client/server I/O is the bottleneck already) is to send only changes in both directions once the client is populated.
This can be straightforward if you've adopted a journaling model for updates to the data. In order for any process to make a change to the shared data, it has to create a time-stamped transaction that gets added to a journal. The update to the data is made by a method that applies the transaction to the data.
Once your data model supports transaction journals, you have a straightforward way of keeping the client and server in sync with a minimum of network traffic: to update the client, the server sends all of the journal entries that have been created since the last time the client was updated.
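A minimal sketch of the journaling idea (the entry shape, sequence numbers and the apply step are placeholders, not a drop-in for your WCF contracts):

```csharp
// Sketch only: illustrates the journal/"changes since" pattern, not a production design.
using System;
using System.Collections.Generic;
using System.Linq;

public record JournalEntry(long Sequence, DateTime TimestampUtc, string Action, int ItemId, string Payload);

public class Journal
{
    private readonly List<JournalEntry> _entries = new();
    private long _nextSequence = 1;

    // Every add/edit/delete goes through here instead of mutating the shared data directly.
    public JournalEntry Append(string action, int itemId, string payload = "")
    {
        var entry = new JournalEntry(_nextSequence++, DateTime.UtcNow, action, itemId, payload);
        _entries.Add(entry);
        return entry;
    }

    // What the service returns when the client says "I've seen everything up to X".
    public IReadOnlyList<JournalEntry> Since(long lastSeenSequence) =>
        _entries.Where(e => e.Sequence > lastSeenSequence).ToList();
}

// Client side (pseudo-usage): remember the highest sequence applied, then on demand
//   var delta = service.GetJournalSince(lastSeen);        // hypothetical service method
//   foreach (var e in delta) ApplyToLocalModel(e);        // hypothetical apply method
//   if (delta.Count > 0) lastSeen = delta[delta.Count - 1].Sequence;
```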
This can be a considerable amount of work to retrofit into an existing design. Before you go down this road, you want to be sure that the problem you're trying to fix is in fact the problem that you have.
Make sure this functionality is well-encapsulated so you can play with it without having to touch other components.
Have your source under version control and check in often.
I highly recommend having a suite of automated unit tests to verify that everything works as expected before refactoring and continues to work as you perform each change.
If the performance hit is on the server-to-client transfer of data, more so than on the querying, processing and disk I/O on the server, you could consider computing a hash of a given collection or graph of objects and passing that hash to a service method on the server, which would query the database, calculate its own hash, compare the two, and return true or false. Only if it returns false would you then reload the data. This works well if changes are unlikely or infrequent, because it requires two calls to get the data when it has changed. If changes in the db are a concern, you might not want to fetch changes only when the user modifies or adds something -- this could be a completely separate action based on a timer, for example. Your concurrency strategy really depends on your data, the number of users, the likelihood of more than one user wanting to change the same data at the same time, and so on.
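A small sketch of that hash check, assuming .NET 5+ (for SHA256.HashData and Convert.ToHexString); GetOrdersForScreen and HasChanged are hypothetical names for your own query and service method:

```csharp
// Sketch only: both sides hash the same canonical serialization of the collection
// and data is only transferred when the hashes differ.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class CollectionHash
{
    public static string Compute<T>(IEnumerable<T> items)
    {
        // Both client and server must serialize the same data in the same, deterministic
        // order, otherwise identical collections will produce different hashes.
        var json = JsonSerializer.Serialize(items);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(json));
        return Convert.ToHexString(hash);
    }
}

// Client:  if (service.HasChanged(CollectionHash.Compute(localItems))) ReloadFromServer();
// Service: bool HasChanged(string clientHash) =>
//              clientHash != CollectionHash.Compute(GetOrdersForScreen());
```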