Store Documents/Video in database or as separate files? - c#

Is it a better practice to store media files (documents, video, images, and eventually executables) in the database itself, or should I just put a link to them in the database and store them as individual files?

Read this white paper by Microsoft Research ("To BLOB or Not To BLOB") - it goes into depth on the question.
Executive summary: if you have lots of small files (150 KB and less), you might as well store them in the DB. Of course, this holds for the databases they tested and the test procedures they used. I suggest reading the article in full to at least gain a good understanding of the trade-offs.

That is an interesting paper that Oded has linked to - if you are using SQL Server 2008 with its FILESTREAM feature, the conclusion is similar. I have quoted a couple of salient points from the linked FILESTREAM white paper:
"FILESTREAM storage is not appropriate in all cases. Based on prior research and FILESTREAM feature behavior, BLOB data of size 1 MB and larger that will not be accessed through Transact-SQL is best suited to storing as FILESTREAM data."
"Consideration must also be given to the update workload, as any partial update to a FILESTREAM file will generate a complete copy of the file. With a particularly heavy update workload, the performance may be such that FILESTREAM is not appropriate"

Two requirements drive the answer to your question:
Is there more than one application server reading binaries from the database server?
Do you have a database connection that can stream binaries for write and read?
Multiple application servers pulling binaries from one database server really hinders your ability to scale. Consider that database connections usually - necessarily - come from a smaller pool than the application servers' request-servicing pool. Add to that the volume of data the binaries consume while being sent from the database server to the application server over the pipe. The database server will likely queue requests because its pool of connections will be consumed delivering binaries.
Streaming is important so that a file is never completely in server memory on read or write (it looks like @Andrew's answer about SQL Server 2008 FILESTREAM may speak to this). Imagine a file several gigabytes in size - if read completely into memory it would be enough to crash many application servers, which just don't have the physical memory to accommodate it. If you don't have streaming database connections, storing in the database is really not viable, unless you constrain file size such that your application server software is allocated at least as much memory as the max file size * number of request-servicing connections + some additional overhead.
Now let's say you don't put the files in the database. Most operating systems are very good at caching frequently accessed files, so right off the bat you get an added benefit. Plus, if you're using web servers, they are pretty good at sending back the right response headers, such as mime type, content length, e-tags, etc., which you otherwise end up coding yourself. The real issues are replication between servers - most application servers are pretty good at doing this via HTTP, streaming the read and write - and, as another answerer pointed out, keeping the database and file system in sync for backups.
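To illustrate the "store on disk and let the web server do the work" approach, here is a minimal sketch of a classic ASP.NET handler that streams a file from disk with a content type and length; the query string parameter and the media folder are assumptions, and it is only one way to do it. TransmitFile hands the streaming off to IIS, so the file never has to sit in managed memory.

using System.IO;
using System.Web;

public class MediaHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Path.GetFileName strips any directory component, which avoids path traversal.
        string fileName = Path.GetFileName(context.Request.QueryString["file"]);
        string path = context.Server.MapPath("~/App_Data/media/" + fileName);

        context.Response.ContentType = MimeMapping.GetMimeMapping(fileName); // .NET 4.5+
        context.Response.AddHeader("Content-Length", new FileInfo(path).Length.ToString());
        // TransmitFile streams straight from disk; the file is never buffered in managed memory.
        context.Response.TransmitFile(path);
    }
}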

Storing BLOB data in the database is not considered the right way to go unless the blobs are very small. Instead, storing their path is more appropriate; it will greatly improve database query and retrieval performance.

Here is a detailed comparison I have made:
http://akashkava.com/blog/127/huge-file-storage-in-database-instead-of-file-system/

Related

Fastest way to read files in a multi-processing environment? C#

I have the following challenge:
I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. In each thread, it needs to read some metadata about how to process the thread from 3 objects. The objects/data reside in a remote RavenDB, and even though RavenDB is very fast at retrieving the objects via HTTP, it is still under considerable load from 30+ workers that are hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDB does not change.
I've decided to implement local storage caching. First, I read a tiny record which indicates if the metadata has changed (it changes VERY rarely), and then I read from local file storage instead of RavenDb, if local storage has the object cached. I'm using File.ReadAllText()
This approach appears to be bogging the machine down and processing slows down considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.
Is there any way I can have the OS help me out and cache those files? Perhaps there is an alternative to caching this data?
I'm looking at about ~1000 files of varying sizes, ranging from 100 KB to 10 MB, stored on each Cloud Role instance.
Not a straight answer, but three possible options:
Use the built-in RavenDB caching mechanism
My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.
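For completeness, a minimal sketch of what turning on the client-side cache can look like with the RavenDB 3.5 client; the store URL, database name, document id and the ProcessingMetadata class are assumptions, and the tuning options are covered in the linked documentation page.

using System;
using Raven.Client;
using Raven.Client.Document;

class ProcessingMetadata { public string Id { get; set; } /* other fields */ }

class RavenCachingSketch
{
    static void Main()
    {
        using (IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",   // assumption
            DefaultDatabase = "Workers"      // assumption
        }.Initialize())
        {
            // Inside this scope, repeated loads of the same documents are served
            // from the client's cache and only re-checked against the server
            // once the duration expires.
            using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
            using (IDocumentSession session = store.OpenSession())
            {
                var metadata = session.Load<ProcessingMetadata>("ProcessingMetadatas/1");
                Console.WriteLine(metadata.Id);
            }
        }
    }
}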
One possible option you could implement is to divide the workload. For example:
Server A => fetch files that start with A-D
Server B => fetch files that start with E-H
Server C => ...
This would ensure that you optimize the cache on each server.
Get a bigger machine
If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:
Disk access
Deserialization of the JSON
For these issues, the only thing I can imagine would be to get bigger resources:
If it's the disk, use premium storage with SSDs.
If it's deserialization, get VMs with more CPU.
Cache files in RAM
Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You may not need that much: 1000 files * 10 MB is at most about 10 GB, and far less if most files are closer to the 100 KB end. Doing this would eliminate disk access and deserialization.
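A minimal sketch of the in-RAM option using System.Runtime.Caching; the key scheme and the one-hour expiry are assumptions:

using System;
using System.Runtime.Caching;

static class MetadataRamCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static string GetOrAdd(string key, Func<string> loadFromRaven)
    {
        var cached = Cache.Get(key) as string;
        if (cached != null)
            return cached;               // served from RAM, no disk and no RavenDB call

        // Cache miss: load from RavenDB once and keep it for an hour.
        string value = loadFromRaven();
        Cache.Set(key, value, new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.UtcNow.AddHours(1)
        });
        return value;
    }
}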
But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.

Best way to create disc cache for web service

I have created a web service that delivers images. It will always be one-way communication. The images will never be changed on the side that gets them from the service.
It has multiple sources, and some can be far away, on bad connections.
I have created a memory cache for it, but I would like to also have a disc cache, to store images for longer periods.
I am a bit unsure on the best approach to do this.
First of all, all of my sources are webservers, so I don't really know how to check the last modified date (as an example) of my images, which I would like to use, to see if the file has changed.
Second, how do I best store my local cache? Just drop the files in a folder and compare dates with the original source?
Or, perhaps store all the timestamps in a txt file, with all the images, to avoid checking files.
OR, maybe store them in a local SQL express DB?
The images, in general, are not very large. Most are around 200 KB. Every now and then, however, there will be one of 7+ MB.
The big problem is, that some of the locations, where the service will be hosted, are on really bad connections, and they will need to use the same image, many times.
There are no real performance requirements, I just want to make it as responsive as possible, for the locations that have a horrible connection, to our central servers.
I can't install any "real" cache systems. It has to be something I can handle in my code.
Why don't you install a proxy server on your server, and access all the remote web-servers through that? The proxy server will take care of caching for you.
EDIT: Since you can't install anything and don't have a database available, I'm afraid you're stuck with implementing the disk cache yourself.
The good news is - it's relatively easy. You need to pick a folder and place your image files there. And you need a unique mapping between your image identification and a file name. If your image IDs are numbers, the mapping is very simple...
When you receive a request for an image, first check for it on the disk. If it's there, you have it already. If not, download it from the remote server, store it there, then serve it from there.
You'll need to take concurrent requests into account. Make sure writing the files to disk is a relatively brief process (you can write them once you finish downloading them). When you write the file to disk, make sure nobody can open it for reading, that way you avoid sending incomplete files.
Now you just need to handle the case where the file isn't in your cache, and two requests for it are received at once. If performance isn't a real issue, just download it twice.
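Here is a rough sketch of that flow, assuming numeric image IDs, a dedicated cache folder and HttpClient for the remote fetch; the "download twice on a race" case is handled by simply letting File.Move lose:

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ImageDiskCache
{
    private readonly string cacheFolder;
    private readonly HttpClient http = new HttpClient();

    public ImageDiskCache(string cacheFolder)
    {
        this.cacheFolder = cacheFolder;
        Directory.CreateDirectory(cacheFolder);
    }

    public async Task<byte[]> GetImageAsync(int imageId, Uri remoteUrl)
    {
        string path = Path.Combine(cacheFolder, imageId + ".img");

        if (File.Exists(path))
            return File.ReadAllBytes(path);          // cache hit

        // Cache miss: download, write to a temp file, then publish atomically
        // so readers never see a partially written image.
        byte[] data = await http.GetByteArrayAsync(remoteUrl);
        string temp = path + "." + Guid.NewGuid().ToString("N") + ".tmp";
        File.WriteAllBytes(temp, data);
        try
        {
            File.Move(temp, path);                   // fails if another request won the race
        }
        catch (IOException)
        {
            File.Delete(temp);                       // someone else already cached it
        }
        return data;
    }
}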

What is the fastest way to write hundreds of files to disk using C#?

My program should write hundreds of files to disk, received from external resources (the network).
Each file is a simple document that I currently store with a GUID as its name, in a specific folder, but creating hundreds of files, writing them, and closing them is a lengthy process.
Is there any better way to store this number of files on disk?
I've come to a solution, but I don't know if it is the best.
First, I create 2 files, one of them is like allocation table and the second one is a huge file storing all the content of my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30GB or more create a problem?
Edit: What is the fastest way to store 1000 text files on disk? (the write operation is performed frequently)
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance the files will either become fragmented or end up located far apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (but this works nicely since the revisions are not modified once created).
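A tiny sketch of that folder-per-1000 layout (the numbering scheme and extension are assumptions, not how Subversion itself lays out its files):

using System.IO;

static class BucketedStore
{
    // e.g. document 12345 -> <root>\12\12345.doc, so no single directory grows huge.
    public static string PathFor(string rootFolder, long documentNumber)
    {
        string bucket = (documentNumber / 1000).ToString();
        string folder = Path.Combine(rootFolder, bucket);
        Directory.CreateDirectory(folder);
        return Path.Combine(folder, documentNumber + ".doc");
    }
}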
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above about 6.
Verify that you are indeed write-bound. If so, then the best way to speed up the performance of massive writes is to buy disks with very fast write performance. Note that many RAID setups will result in decreased write performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e. for each received "file", enqueue a write function on a thread pool thread that actually persists the data to a file on disk.
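For example, a minimal sketch of that idea (the GUID naming follows the question; everything else is an assumption):

using System;
using System.IO;
using System.Threading;

class PooledFileWriter
{
    private readonly string folder;

    public PooledFileWriter(string folder)
    {
        this.folder = folder;
        Directory.CreateDirectory(folder);
    }

    // Hands the write off to the thread pool so the receiving code
    // isn't blocked while the document is persisted to disk.
    public void QueueWrite(byte[] documentBytes)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            string path = Path.Combine(folder, Guid.NewGuid().ToString("N") + ".doc");
            File.WriteAllBytes(path, documentBytes);
        });
    }
}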

SQLite & C# ] How can I control the number of people editing a db file?

I'm programming a simple customer-information management software now with SQLite.
One exe file, one db file, some dll files. - That's it :)
2-4 people may run this exe file simultaneously and access the database.
Not only just reading, but frequent editing will be done by them too.
Yeah, now here comes one of the most famous problems... "synchronization".
I was trying to create / remove a temporary empty file whenever someone is trying to edit the database (this file acts as a 'key' to access the db).
But there must be a better way to do it :(
What would be the best way of preventing this problem?
Well, SQLite already locks the database file for each use, the idea being that multiple applications can share the same database.
However, the documentation for SQLite explicitly warns about using this over the network:
"SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, the file locking logic of many network filesystem implementations contains bugs (on both Unix and Windows). If file locking does not work like it should, it might be possible for two or more client programs to modify the same part of the same database at the same time, resulting in database corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it.
A good rule of thumb is that you should avoid using SQLite in situations where the same database will be accessed simultaneously from many computers over a network filesystem."
So assuming your "2-4 people" are on different computers, using a network file share, I'd recommend that you don't use SQLite. Use a traditional client/server RDBMS instead, which is designed for multiple concurrent connections from multiple hosts.
Your app will still need to consider concurrency issues (unless it speculatively acquires locks on whatever the user is currently looking at, which is generally a nasty idea) but at least you won't have to deal with network file system locking issues as well.
You are looking at some classic problems in dealing with multiple users accessing a database: the Lost Update.
See this tutorial on concurrency:
http://www.brainbell.com/tutors/php/php_mysql/Transactions_and_Concurrency.html
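As an illustration of guarding against the Lost Update, here is a minimal optimistic-concurrency sketch using System.Data.SQLite; the Customers table and its Version column are assumptions:

using System.Data.SQLite;

class CustomerRepository
{
    // Returns false when another user updated the row since we read it,
    // so the caller can re-read and retry instead of silently overwriting.
    public bool TryUpdateName(string connectionString, int customerId, long versionWeRead, string newName)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            using (var command = new SQLiteCommand(
                "UPDATE Customers SET Name = @name, Version = Version + 1 " +
                "WHERE Id = @id AND Version = @version", connection))
            {
                command.Parameters.AddWithValue("@name", newName);
                command.Parameters.AddWithValue("@id", customerId);
                command.Parameters.AddWithValue("@version", versionWeRead);
                return command.ExecuteNonQuery() == 1;   // 0 rows => someone else won
            }
        }
    }
}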
At least you won't have to worry about the db file itself getting corrupted by this, because SQLite locks the whole file when it's being written. That being said, SQLite doesn't recommend using it if you expect your app to be accessed simultaneously by multiple clients.

What is the most cost-effective way to break up a centralised database?

Following on from this question...
What to do when you’ve really screwed up the design of a distributed system?
... the client has reluctantly asked me to quote for option 3 (the expensive one), so they can compare prices to a company in India.
So, they want me to quote (hmm). In order for me to get this as accurate as possible, I need to decide how I'm actually going to do it. Here are 3 scenarios...
Scenarios
Split the database
My original idea (perhaps the most tricky) will yield the best speed on both the website and the desktop application. However, it may require some synchronising between the two databases, as the two "systems" are so heavily connected. If not done properly and not tested thoroughly, I've learnt that synchronisation can be hell on earth.
Implement caching on the smallest system
To side-step the sync option (which I'm not fond of), I figured it may be more productive (and cheaper) to move the entire central database and web service to their office (i.e. in-house), and have the website (still on the hosted server) download data from the central office and store it in a small database (acting as a cache)...
Set up a new server in the customer's office (in-house).
Move the central database and web service to the new in-house server.
Keep the web site on the hosted server, but alter the web service URL so that it points to the office server.
Implement a simple cache system for images and most frequently accessed data (such as product information).
... the down-side is that when the end-user in the office updates something, their customers will effectively be downloading the data from a 60KB/s upload connection (albeit once, as it will be cached).
Also, not all data can be cached, for example when a customer updates their order. Also, connection redundancy becomes a huge factor here; what if the office connection is offline? Nothing to do but show an error message to the customers, which is nasty, but a necessary evil.
Mystery option number 3
Suggestions welcome!
SQL replication
I had considered MSSQL replication. But I have no experience with it, so I'm worried about how conflicts are handled, etc. Is this an option? Considering there are physical files involved, and so on. Also, I believe we'd need to upgrade from SQL Express to a paid SQL Server edition, and buy two licenses.
Technical
Components
ASP.Net website
ASP.net web service
.Net desktop application
MSSQL 2008 express database
Connections
Office connection: 8 mbit down and 1 mbit up contended line (50:1)
Hosted virtual server: Windows 2008 with 10 megabit line
Having just read your original question related to this for the first time, I'd say you may have laid the foundation for resolving the problem simply because you are communicating with the database through a web service.
This web service may well be the saving grace as it allows you to split the communications without affecting the client.
A good while back I was involved in designing just such a system.
The first thing we identified was the data that rarely changes - and we immediately locked all of this out of consideration for distribution. A manual process for administering it via the web server was the only way to change this data.
The second thing we identified was the data that should be owned locally. By this I mean data that only one person or location at a time would need to update, but that may need to be viewed at other locations. We fixed all of the keys on the related tables to ensure that duplication could never occur and that no auto-incrementing fields were used.
The third item was the tables that were truly shared - and although we worried a lot about these during stages 1 & 2 - in our case this part was straightforward.
When I'm talking about a server here I mean a DB Server with a set of web services that communicate between themselves.
As designed, our architecture had one designated 'master' server. This was the definitive source for resolving conflicts.
The rest of the servers were, in the first instance, a large cache of anything covered by item 1. In fact it wasn't so much a cache as a database duplication, but you get the idea.
The second function of each non-master server was to coordinate changes with the master. This involved a very simplistic process of passing most of the work transparently through to the master server.
We spent a lot of time designing and optimising all of the above - to finally discover that the single best performance improvement came from simply compressing the web service requests to reduce bandwidth (but it was over a single channel ISDN, which probably made the most difference).
The fact is that if you do have a web service then this will give you greater flexibility about how you implement this.
I'd probably start by investigating the feasibility of implementing one of the SQL Server replication methods.
Usual disclaimers apply:
Splitting the database will not help a lot, but it will add a lot of headaches. IMO, you should first try to optimize the database: update some indexes or maybe add several more, optimize some queries, and so on. For database performance tuning I recommend reading some articles from simple-talk.com.
Also, in order to save bandwidth, you can add bulk processing to your Windows client and add compression (zipping/archiving) to your web service.
And you should probably upgrade to MS SQL 2008 Express; it's also free.
It's hard to recommend a good solution for your problem with the information I have. It's not clear where the bottleneck is. I strongly recommend you profile your application to find the exact location of the bottleneck (e.g. is it in the database, or in a saturated connection, and so on) and add a description of it to the question.
EDIT 01/03:
When the bottleneck is an up connection then you can do only the following:
1. Add compression (archiving) of messages on both the service and the client (see the sketch below)
2. Implement bulk operations and use them
3. Try to reduce the number of operations per use case, for the most frequent cases
4. Add a local database for the Windows clients, perform all operations against it, and synchronize the local db with the main one on a timer
And SQL replication will not help you a lot in this case. The fastest and cheapest solution is to increase the upload bandwidth, because all the other ways (except the first one) will take a lot of time.
If you choose to rewrite the service to support bulking, I recommend having a look at the Agatha Project.
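As a minimal sketch of point 1 (compressing the messages), GZipStream from the framework is usually enough; how you plug it into the service (a WCF behavior, manual byte[] payloads, etc.) depends on your setup and is not shown here:

using System.IO;
using System.IO.Compression;

static class PayloadCompression
{
    // Compresses a serialized message before it crosses the slow upload link.
    public static byte[] Compress(byte[] payload)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(payload, 0, payload.Length);
            }
            return output.ToArray();
        }
    }

    // Reverses Compress on the receiving side.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}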
Actually, hearing how many users they have on that one connection, it may be time to up the bandwidth at the office (not at all my normal response). If you factor out the CRM system, what else is a top user of the bandwidth? It may be that they have simply reached the point of needing more bandwidth, period.
But I am still curious to see how much of the information you are passing is actually getting used. Make sure you are transferring it efficiently - any chance you could add some quick and easy measurements to see how much data people actually consume when looking at it?
