I have the following challenge:
I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. Each thread needs to read some metadata about how to process its work from 3 objects. The objects/data reside in a remote RavenDb, and even though RavenDb is very fast at retrieving objects over HTTP, it is still under considerable load from 30+ instances hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDb does not change.
I've decided to implement local storage caching. First, I read a tiny record which indicates whether the metadata has changed (it changes VERY rarely), and then, if local storage has the object cached, I read it from local file storage instead of RavenDb, using File.ReadAllText().
This approach appears to be bogging the machine down and processing slows down considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.
Is there any way I can have the OS help me out and cache those files? Or is there an alternative to caching this data?
I'm looking at about 1,000 files of varying sizes, ranging from 100 KB to 10 MB, stored on each Cloud Role instance.
Not a straight answer, but three possible options:
Use the built-in RavenDB caching mechanism
My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
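For example, here is a minimal sketch of turning on aggressive caching with the 3.5 client; the URL, database name, and the ProcessingMetadata type and document id are placeholders, not something from your setup:

using System;
using Raven.Client;
using Raven.Client.Document;

var store = new DocumentStore { Url = "http://your-ravendb:8080", DefaultDatabase = "Metadata" }.Initialize();

// While this scope is open, loads are served from the client's HTTP cache
// for up to 5 minutes without asking the server again.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = store.OpenSession())
{
    var metadata = session.Load<ProcessingMetadata>("processingMetadata/1"); // hypothetical document
}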
The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.
One possible option you could implement is to divide the workload. For example:
Server A => fetch files that start with A-D
Server B => fetch files that start with E-H
Server C => ...
This would ensure that you optimize the cache on each server.
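A rough sketch of that partitioning idea, using a stable hash so every instance agrees on who owns which key (the helper name and the instance-index lookup are hypothetical):

using System;
using System.Security.Cryptography;
using System.Text;

// Deterministically map a document key to one of N worker instances so each
// instance's local cache only ever holds its own slice of the metadata.
static int AssignInstance(string documentKey, int instanceCount)
{
    // string.GetHashCode() can differ between processes, so use a stable hash instead.
    using (var md5 = MD5.Create())
    {
        var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(documentKey));
        return (int)(BitConverter.ToUInt32(hash, 0) % (uint)instanceCount);
    }
}

// A worker only processes keys assigned to its own instance index:
// if (AssignInstance(key, totalInstances) == myInstanceIndex) { ... }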
Get a bigger machine
If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:
Disk access
Deserialization of the JSON
For these issues, the only thing I can imagine would be to get bigger resources:
If it's the disk, use premium storage with SSDs.
If it's deserialization, get VMs with more CPU.
Cache files in RAM
Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need THAT much RAM, since 1,000 files * 10 MB is at most 10 GB, and most of the files are far smaller. Doing this would eliminate disk access and, if you cache the deserialized objects rather than the raw text, the deserialization cost as well.
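A minimal sketch of what that could look like with MemoryCache; the 10-minute expiration and the LoadFromRavenDb helper are assumptions, not part of your code:

using System;
using System.Runtime.Caching;

public static class MetadataCache
{
    static readonly MemoryCache Cache = MemoryCache.Default;

    public static string Get(string documentId)
    {
        var cached = Cache.Get(documentId) as string;
        if (cached != null)
            return cached;                                                // in-memory hit, no disk, no HTTP

        var fresh = LoadFromRavenDb(documentId);                          // hypothetical fetch
        Cache.Set(documentId, fresh, DateTimeOffset.UtcNow.AddMinutes(10)); // expiry is a guess, tune it
        return fresh;
    }

    static string LoadFromRavenDb(string documentId)
    {
        // placeholder for the real RavenDB load
        return "{}";
    }
}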
But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.
Related
This is more of a programming strategy and direction question, than the actual code itself.
I am programming in C-Sharp.
I have an application that remotely starts processes on many different clients on the network, could be up to 1000 clients in theory.
It then monitors the status of the remote processes by reading a log file on each client.
I currently do this by running one thread that loops through all of the clients in a list, and reading the log file. It works fine for 10 or 20 machines, but 1000 would probably be untenable.
There are several problems with this approach:
First, if the thread doesn’t finish reading all of the client statuses before it’s called again, the client statuses at the end of the list might not be read and updated.
Secondly, if any client in the list goes offline during this period, the updating hangs, until that client is back online again.
So I require a different approach, and have thought up a few possible ways to resolve this.
1. Spawn a separate thread for each client to read its log file and update its progress.
a. However, I'm not sure if having 1000 threads running on my machine is something that would be acceptable.
2. Test the connection to each machine first, before trying to read the file, and if it cannot connect, just ignore it for that iteration and move on to the next client in the list.
a. This still has the same problem of not getting through the list before the next call, and it adds delay because it tests the connection via a port first. With 1000 clients, this would be noticeable.
3. Have each client send the data to the machine running the application whenever there is an update.
a. This could create a lot of chatter with 1000 machines trying to send data repeatedly.
So I'm trying to figure out whether there is another, more efficient and reliable method that I haven't considered, or which one of these would be the best.
Right now I’m leaning towards having the clients send updates to the application, instead of having the application pulling the data.
Looking for thoughts, concerns, ideas and recommendations.
In my opinion, you are approaching this monitoring the wrong way. Instead of keeping the logs in text files on each client, you'd be better off preserving them in a central data repository, which can be of any kind. Since you are monitoring the performance of those systems, your design and the mechanism behind it must not impact the performance of the target systems negatively; with the current design, the disk and CPU on each client can become so involved in certain cases that the monitoring causes a performance issue of its own.
I recommend creating a log repository server using a fast in-memory database like Redis and sending the logged data directly to that server. Keep in mind that this database should run on a separate virtual machine. You can then configure Redis persistence to write the received data to disk once a particular number of entries is reached or a particular interval elapses. The in-memory model is an advantage here because a monitoring application like this queries the information a lot, and Redis performs well enough to handle millions of entries efficiently.
The blueprint is:
1. Centralize all log data in a single repository.
2. Configure clients to send the monitored information to the centralized repository.
3. Have the main server (the monitoring system) read the data from the centralized repository when required.
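A minimal sketch of step 2 with StackExchange.Redis; the host name, key naming, and the JSON payload shape are assumptions:

using System;
using StackExchange.Redis;

var redis = ConnectionMultiplexer.Connect("log-repo.example.com:6379");
var db = redis.GetDatabase();

// One JSON entry per status update; RPUSH keeps entries in arrival order so
// the monitoring server can read them back later (e.g. with LRANGE).
string entry = "{\"client\":\"client-042\",\"time\":\"" + DateTime.UtcNow.ToString("o") + "\",\"status\":\"Running\"}";
db.ListRightPush("logs:client-042", entry);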
I'm not trying to advertise a particular tool here; I'm only sharing my own experience. There are many other tools you can use for this purpose, such as ElasticSearch.
What are the benefits and downsides of using Redis for caching data such as userId-UserName pairs or NewsId-NewsDomainName? Why shouldn't I cache this data in app memory by creating Dictionaries for it? I think that must be much faster than using Redis?
Thank you!
Depending on what your workload looks like, you may want one or the other, or a combination of both caching strategies. Why?
in-process caching is faster (good for latency), and more importantly, it doesn't produce any network traffic to get a hit (good for scalability);
remote caching, Redis or the like, allows you to keep one copy of the cached data that is accessed by all servers*, so it uses less memory (unless you only have one app server, which seems unlikely), and is less prone to data inconsistency problems (which seems important if you are dealing with user data)
In a cache cluster, or any data cluster where requests for a particular piece of data go to a small set of servers, one of the biggest issues is hotspots. In this case, you may want to combine both: cache hot keys locally, but only very briefly, so that you avoid overwhelming the backend servers without serving stale data for long.
* although, if there's more than one cache server in the cluster, and cluster management has server ejection/readmission logic but no data flush logic, you may have stale data on some of the servers.
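A rough sketch of that combination: a very short-lived local cache in front of Redis (the 5-second TTL and the key format are assumptions):

using System;
using System.Runtime.Caching;
using StackExchange.Redis;

public class TwoTierCache
{
    readonly MemoryCache _local = MemoryCache.Default;
    readonly IDatabase _redis;

    public TwoTierCache(IDatabase redis) { _redis = redis; }

    public string GetUserName(string userId)
    {
        string key = "user:" + userId;

        var hit = _local.Get(key) as string;
        if (hit != null)
            return hit;                                                    // local hit: no network traffic

        string remote = _redis.StringGet(key);                             // shared cache is the source of truth
        if (remote != null)
            _local.Set(key, remote, DateTimeOffset.UtcNow.AddSeconds(5));  // brief, to limit staleness

        return remote;
    }
}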
What if you have multiple servers? Would your second server know what is stored on the first server? No. That could be the main reason you need to use Redis.
And if you store, say, a large amount of data in your app server's memory, that can also affect your server's performance.
My application has different tasks, each one posting an XML document via HTTP POST to a different endpoint. For every task I need to keep count of the messages I have sent, each identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint, will save the id of the last message sent, so that if there is a problem and the application needs to restart, it won't send the same message again and will resume from where it left off.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server to save the counters, with a row for every task, gives me a 50% decrease in throughput. Saving the counter value in a text file for every task is a bit faster, but still far from my goal. I am looking for a way to persist this information so that I can get as close as possible to my goal. I thought that appending the last processed id rather than updating it might help me avoid write locks, but the bottom line is that I don't care if, for the sake of performance, I have to waste disk space or accept a longer startup time for reading back the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with ESENT storage, via the ManagedEsent PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
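Roughly, persisting and recovering a counter could look like this; the endpoint id, the example value, and the flush-after-every-message policy are assumptions on my part:

using System;
using Microsoft.Isam.Esent.Collections.Generic;

using (var counters = new PersistentDictionary<Guid, int>("ThreadStorage"))
{
    Guid endpointId = Guid.NewGuid();      // in practice, a stable id per endpoint

    // After the endpoint acknowledges a message, record its id and flush to disk.
    int lastAckedId = 42;
    counters[endpointId] = lastAckedId;
    counters.Flush();

    // On restart, resume from the last persisted value (0 if nothing was stored).
    int resumeFrom;
    counters.TryGetValue(endpointId, out resumeFrom);
}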
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue the OP asked about.
I used SQL Server sequence numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution if you want to avoid SQL Server.
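For reference, a minimal sketch of the sequence approach mentioned above; the sequence name, database, and connection string are assumptions:

using System.Data.SqlClient;

// One-time setup in T-SQL: CREATE SEQUENCE dbo.MessageId START WITH 1 INCREMENT BY 1;
using (var conn = new SqlConnection("Server=.;Database=Messaging;Integrated Security=true"))
using (var cmd = new SqlCommand("SELECT NEXT VALUE FOR dbo.MessageId", conn))
{
    conn.Open();
    long nextId = (long)cmd.ExecuteScalar();    // sequences are bigint by default
}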
My program has to write hundreds of files to disk, received from external sources over the network.
Each file is a simple document that I currently store with a GUID as its name in a specific folder, but creating, writing, and closing hundreds of files is a lengthy process.
Is there a better way to store this many files on disk?
I've come to a solution, but I don't know if it is the best.
First, I create 2 files: one of them acts like an allocation table, and the second is a huge file storing the content of all my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more create a problem?
Edit: What is the fastest way to store 1,000 text files on disk? (Write operations happen frequently.)
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except that the files have a good chance of becoming fragmented or being located far apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (this works nicely because revisions are not modified once created).
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single-disk system, the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
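If it helps, here is a quick way to sample that counter from C#; the "_Total" instance aggregates all physical disks:

using System;
using System.Diagnostics;
using System.Threading;

var queueLength = new PerformanceCounter("PhysicalDisk", "Avg. Disk Queue Length", "_Total");
queueLength.NextValue();                 // the first sample of an averaged counter is always 0
Thread.Sleep(1000);
Console.WriteLine("Avg. Disk Queue Length: {0}", queueLength.NextValue());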
Verify that you are indeed write bound. If so then the best way to speed up performance of doing massive writes is to buy disks with very fast write performance. Note that most RAID setups will result in decreased performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e., for each received file, enqueue a write function on a thread pool thread that actually persists the data to a file on disk.
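Something along these lines; the target folder, file extension, and payload are placeholders:

using System;
using System.IO;
using System.Threading;

// Hand each received document to the thread pool so the network loop is never
// blocked waiting on disk I/O.
void OnFileReceived(byte[] payload)
{
    string path = Path.Combine(@"D:\incoming", Guid.NewGuid().ToString("N") + ".doc");
    ThreadPool.QueueUserWorkItem(_ => File.WriteAllBytes(path, payload));
}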
Is it a better practice to store media files (documents, video, images, and eventually executables) in the database itself, or should I just put a link to them in the database and store them as individual files?
Read this white paper by Microsoft Research ("To BLOB or Not To BLOB") - it covers the question in depth.
Executive summary - if you have lots of small (150 KB and less) files, you might as well store them in the DB. Of course, this holds for the databases they tested with and their test procedures. I suggest reading the article in full to at least gain a good understanding of the trade-offs.
That is an interesting paper that Oded has linked to - if you are using SQL Server 2008 with its FILESTREAM feature, the conclusion is similar. I have quoted a couple of salient points from the linked FILESTREAM white paper:
"FILESTREAM storage is not appropriate in all cases. Based on prior research and FILESTREAM feature behavior, BLOB data of size 1 MB and larger that will not be accessed through Transact-SQL is best suited to storing as FILESTREAM data."
"Consideration must also be given to the update workload, as any partial update to a FILESTREAM file will generate a complete copy of the file. With a particularly heavy update workload, the performance may be such that FILESTREAM is not appropriate"
Two requirements drive the answer to your question:
Is there more than one application server reading binaries from the database server?
Do you have a database connection that can stream binaries for write and read?
Multiple application servers pulling binaries from one database server really hinders your ability to scale. Consider that database connections usually - necessarily - come from a smaller pool than the application servers' request-servicing pool. Also consider the volume of data the binaries consume as they are sent from the database server to the application server over the pipe. The database server will likely queue requests because its pool of connections will be consumed delivering binaries.
Streaming is important so that a file is never held completely in server memory on read or write (it looks like @Andrew's answer about SQL Server 2008 FILESTREAM may speak to this). Imagine a file several gigabytes in size - read completely into memory, it would be enough to crash many application servers, which just don't have the physical memory to accommodate it. If you don't have streaming database connections, storing in the database is really not viable, unless you constrain file size such that your application server software is allocated at least as much memory as the max file size * number of request-servicing connections + some additional overhead.
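For what it's worth, here is a sketch of streaming a BLOB out of SQL Server without buffering the whole file in memory; the table, column, connection string, and output path are assumptions:

using System.Data;
using System.Data.SqlClient;
using System.IO;

int documentId = 1;                                          // placeholder
using (var conn = new SqlConnection("Server=.;Database=Files;Integrated Security=true"))
using (var cmd = new SqlCommand("SELECT Content FROM dbo.Documents WHERE Id = @id", conn))
{
    cmd.Parameters.AddWithValue("@id", documentId);
    conn.Open();
    // SequentialAccess lets the reader stream the column instead of loading it whole.
    using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        if (reader.Read())
        {
            using (var dbStream = reader.GetStream(0))       // .NET 4.5+
            using (var file = File.Create(@"C:\temp\out.bin"))
            {
                dbStream.CopyTo(file);                       // copies in small chunks
            }
        }
    }
}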
Now let's say you don't put the files in the database. Most operating systems are very good at caching frequently accessed files, so right off the bat you get an added benefit. Plus, if you're using web servers, they are pretty good at sending back the right response headers, such as mime type, content length, e-tags, etc., which you would otherwise end up coding yourself. The real issues are replication between servers - most application servers are pretty good at doing this via HTTP, streaming the read and write - and, as another answerer pointed out, keeping the database and file system in sync for backups.
Storing BLOB data in the database is not considered the right way to go unless the files are very small. Instead, storing their paths is more appropriate; it will greatly improve database query and retrieval performance.
Here is a detailed comparison I have made:
http://akashkava.com/blog/127/huge-file-storage-in-database-instead-of-file-system/