I have created a webservice that delivers images. It will always be one-way communication; the images will never be changed on the side that receives them from the service.
It has multiple sources, and some can be far away on bad connections.
I have created a memory cache for it, but I would also like a disk cache, to store images for longer periods.
I am a bit unsure on the best approach to do this.
First of all, all of my sources are webservers, so I don't really know how to check the last-modified date (for example) of my images, which I would like to use to see whether a file has changed.
Second, how do I best store my local cache? Just drop the files in a folder and compare dates with the original source?
Or perhaps store all the timestamps in a text file alongside the images, to avoid checking each file?
Or maybe store them in a local SQL Express DB?
The images, in general, are not very large: most are around 200 KB, but every now and then there will be one of 7+ MB.
The big problem is that some of the locations where the service will be hosted are on really bad connections, and they will need to use the same image many times.
There are no real performance requirements; I just want to make it as responsive as possible for the locations that have a horrible connection to our central servers.
I can't install any "real" cache systems. It has to be something I can handle in my code.
Why don't you install a proxy server on your server, and access all the remote web-servers through that? The proxy server will take care of caching for you.
EDIT: Since you can't install anything and don't have a database available, I'm afraid you're stuck with implementing the disk cache yourself.
The good news is - it's relatively easy. You need to pick a folder and place your image files there. And you need a unique mapping between your image identification and a file name. If your image IDs are numbers, the mapping is very simple...
When you receive a request for an image, first check for it on disk. If it's there, you already have it. If not, download it from the remote server, store it on disk, and serve it from there.
You'll need to take concurrent requests into account. Make sure writing the files to disk is a relatively brief process (you can write them once you finish downloading them), and when you write a file to disk, make sure nobody can open it for reading; that way you avoid sending incomplete files.
Now you just need to handle the case where the file isn't in your cache and two requests for it arrive at once. If performance isn't a real issue, just download it twice.
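A minimal sketch of that flow in C# might look like this (the class name, file-naming scheme, and temp-file trick are just one way to do it; it only loosely guards against the double-download case mentioned above):

```csharp
using System;
using System.IO;
using System.Net;

// Sketch only: cache folder and naming scheme are placeholders.
public class ImageDiskCache
{
    private readonly string _cacheFolder;

    public ImageDiskCache(string cacheFolder)
    {
        _cacheFolder = cacheFolder;
        Directory.CreateDirectory(_cacheFolder);
    }

    public byte[] GetImage(string imageId, string remoteUrl)
    {
        string cachePath = Path.Combine(_cacheFolder, imageId + ".img");

        // Already cached on disk: serve it from there.
        if (File.Exists(cachePath))
            return File.ReadAllBytes(cachePath);

        // Not cached: download, write to a temp file, then rename so a
        // concurrent reader never sees a half-written file.
        using (var client = new WebClient())
        {
            byte[] data = client.DownloadData(remoteUrl);
            string tempPath = cachePath + "." + Guid.NewGuid() + ".tmp";
            File.WriteAllBytes(tempPath, data);
            if (!File.Exists(cachePath))
                File.Move(tempPath, cachePath);
            else
                File.Delete(tempPath);   // another request beat us to it
            return data;
        }
    }
}
```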
I have to design a backup algorithm for some files used by a Windows Service and I already have some ideas, but I would like to hear the opinion of the wiser ones, in order to try and improve what I have in mind.
The software that I am dealing with follows a client-server architecture.
On the server side, we have a Windows service that performs tasks such as monitoring folders, etc., and it has several XML configuration files (around 10). These are the files that I want to back up.
On the client side, the user has a graphical interface that allows him to modify these configuration files, although this shouldn't happen very often. Communication with the server is done using WCF.
So the config files might be modified remotely by the user, but the administrator might also modify them manually on the server (the Windows service monitors these changes).
And for the moment, this is what I have in mind for the backup algorithm (quite simple though):
When: backups will be performed in two situations:
Periodically: a parallel thread in the server application will make a copy of the configuration files every XXXX months/weeks/whatever (configurable parameter). That is, it does not perform a backup each time the files are modified by user action, but only on this schedule.
Every time the user launches the client: every time the server detects that a user has launched the application, the server side will perform a backup.
How:
There will be a folder named Backup in the ProgramData folder of the Windows service. There, each time a backup is performed, a sub-folder named BackupYYYYMMDDHHmm will be created, containing all of the relevant files.
Maintenance: backup folders won't be kept forever. Periodically, all of those older than XXXX weeks/months/years (configurable parameter) will be deleted. Alternatively, I might keep only the N most recent backup sub-folders (configurable parameter). I still haven't chosen an option, but I think I'll go for the first one. A rough sketch of these two steps is below.
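For illustration, this is roughly what I have in mind for the backup and cleanup steps (the paths and folder names are just placeholders):

```csharp
using System;
using System.IO;

// Sketch only: paths and the "Backup" folder name are placeholders.
public static class ConfigBackup
{
    public static void BackupConfigFiles(string configFolder, string backupRoot)
    {
        // e.g. Backup201403151230
        string backupFolder = Path.Combine(backupRoot,
            "Backup" + DateTime.Now.ToString("yyyyMMddHHmm"));
        Directory.CreateDirectory(backupFolder);

        foreach (string file in Directory.GetFiles(configFolder, "*.xml"))
        {
            File.Copy(file, Path.Combine(backupFolder, Path.GetFileName(file)), true);
        }
    }

    public static void DeleteOldBackups(string backupRoot, TimeSpan maxAge)
    {
        // Retention: remove backup folders older than the configured age.
        foreach (string dir in Directory.GetDirectories(backupRoot))
        {
            if (DateTime.Now - Directory.GetCreationTime(dir) > maxAge)
                Directory.Delete(dir, recursive: true);
        }
    }
}
```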
So, this is it. Comments are very welcome. Thanks!!
I think your design is viable. Just a few comments:
Do you need to back up to a separate place rather than the server itself? I don't feel it's safe to back up important data on the same server; I would rather back it up to a separate disk (perhaps a network location).
You need to implement the monitoring/backup/retention/etc. yourself, and that sounds complicated - how long do you want to spend on this?
Personally, I would use a simple trick to achieve the backup. Since the data are light, plain-text files (XML), I might simply back them up to a source control system: make the folder an SVN checkout (or use some other tool) and create a simple script that detects changes and checks them into SVN, then schedule the script to run every few hours (or more often, depending on your needs; it could also be triggered by your service/app on demand). This eliminates unnecessary copies of the data, since only changes are checked in, and it is much more traceable, since SVN keeps the full history. A rough sketch is below.
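Something along these lines could work as the check-in step (assuming the config folder is already an SVN working copy and svn.exe is on the PATH; the names are placeholders):

```csharp
using System.Diagnostics;

// Sketch only: assumes the config folder is already an SVN working copy
// and that svn.exe is on the PATH.
public static class SvnBackup
{
    public static void CommitConfigChanges(string configFolder)
    {
        RunSvn(configFolder, "add --force .");                         // pick up any new files
        RunSvn(configFolder, "commit -m \"Automatic config backup\""); // commits only actual changes
    }

    private static void RunSvn(string workingDirectory, string arguments)
    {
        var psi = new ProcessStartInfo("svn", arguments)
        {
            WorkingDirectory = workingDirectory,
            UseShellExecute = false
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    }
}
```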
Hope the above helps a bit...
My program has to write hundreds of files to disk, received from external resources (the network).
Each file is a simple document that I currently store, named by a GUID, in a specific folder, but creating, writing, and closing hundreds of files is a lengthy process.
Is there a better way to store this many files on disk?
I've come to a solution, but I don't know if it is the best.
First, I create two files: one of them acts like an allocation table, and the second one is a huge file storing all the content of my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more create a problem?
Edit: What is the fastest way to store 1000 text files on disk? (Write operations happen frequently.)
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance for the files to either become fragmented or be located further apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (which works nicely since the revisions are not modified once created).
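As a rough illustration of that layout (this assumes numeric document IDs; the names and extension are placeholders):

```csharp
using System.IO;

// Sketch only: maps a numeric document id to a sub-folder holding at most
// 1000 files, similar to the Subversion layout described above.
public static class DocumentStore
{
    public static string GetDocumentPath(string rootFolder, int documentId)
    {
        // Ids 0-999 go in folder "0", 1000-1999 in folder "1", and so on.
        string bucket = (documentId / 1000).ToString();
        string folder = Path.Combine(rootFolder, bucket);
        Directory.CreateDirectory(folder);   // no-op if it already exists
        return Path.Combine(folder, documentId + ".doc");
    }
}
```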
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single-disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
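If you want to sample that counter from code rather than Performance Monitor, a minimal sketch might look like this (these are the standard Windows counter names; "_Total" aggregates all disks):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Sketch only: samples the "Avg. Disk Queue Length" counter once a second.
public static class DiskQueueSampler
{
    public static void Sample(int seconds)
    {
        using (var counter = new PerformanceCounter(
            "PhysicalDisk", "Avg. Disk Queue Length", "_Total"))
        {
            for (int i = 0; i < seconds; i++)
            {
                Console.WriteLine(counter.NextValue());
                Thread.Sleep(1000);
            }
        }
    }
}
```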
Verify that you are indeed write-bound. If so, the best way to speed up massive writes is to buy disks with very fast write performance. Note that many RAID setups will result in decreased write performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e., for each received "file", queue a work item on the thread pool that actually persists the data to a file on disk.
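A minimal sketch of that idea (the folder and GUID naming are just illustrative):

```csharp
using System;
using System.IO;
using System.Threading;

// Sketch only: folder and naming are placeholders.
public static class DocumentWriter
{
    public static void QueueWrite(string folder, byte[] documentData)
    {
        // Hand the actual disk write off to a thread-pool worker so the
        // receiving code isn't blocked on I/O.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            string path = Path.Combine(folder, Guid.NewGuid().ToString());
            File.WriteAllBytes(path, documentData);
        });
    }
}
```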
Is it a better practice to store media files (documents, video, images, and eventually executables) in the database itself, or should I just put a link to them in the database and store them as individual files?
Read this white paper by MS Research (To BLOB or Not To BLOB) - it goes into depth on the question.
Executive summary: if you have lots of small files (150 KB and less), you might as well store them in the DB. Of course, this holds for the databases they tested with and the test procedures they used. I suggest reading the article in full to at least gain a good understanding of the trade-offs.
That is an interesting paper that Oded has linked to. If you are using SQL Server 2008 with its FILESTREAM feature, the conclusion is similar. I have quoted a couple of salient points from the linked FILESTREAM white paper:
"FILESTREAM storage is not appropriate in all cases. Based on prior research and FILESTREAM feature behavior, BLOB data of size 1 MB and larger that will not be accessed through Transact-SQL is best suited to storing as FILESTREAM data."
"Consideration must also be given to the update workload, as any partial update to a FILESTREAM file will generate a complete copy of the file. With a particularly heavy update workload, the performance may be such that FILESTREAM is not appropriate"
Two requirements drive the answer to your question:
Is there more than one application server reading binaries from the database server?
Do you have a database connection that can stream binaries for write and read?
Multiple application servers pulling binaries from one database server really hinders your ability to scale. Consider that database connections usually - necessarily - come from a smaller pool than the application servers' request-servicing pool, and consider the volume of binary data being sent from the database server to the application servers over the pipe. The database server will likely queue requests because its pool of connections will be consumed delivering binaries.
Streaming is important so that a file is not held completely in server memory on read or write (it looks like @Andrew's answer about SQL Server 2008 FILESTREAM may speak to this). Imagine a file several gigabytes in size: if read completely into memory, it would be enough to crash many application servers, which simply don't have the physical memory to accommodate it. If you don't have streaming database connections, storing files in the database is really not viable, unless you constrain file sizes such that your application server software is allocated at least as much memory as the maximum file size * the number of request-servicing connections, plus some additional overhead.
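As a rough illustration of what "streaming" means here - copying in small chunks instead of buffering the whole file (the buffer size and names are arbitrary):

```csharp
using System.IO;

// Sketch only: the point is that the whole file never has to sit in
// memory at once.
public static class StreamingSave
{
    public static void SaveStream(Stream input, string destinationPath)
    {
        var buffer = new byte[64 * 1024];
        using (var output = File.Create(destinationPath))
        {
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, bytesRead);
            }
        }
    }
}
```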
Now let's say you don't put the files in the database. Most operating systems are very good at caching frequently accessed files, so right off the bat you get an added benefit. Plus, if you're using web servers, they are pretty good at sending back the right response headers, such as MIME type, content length, ETags, etc., which you would otherwise end up coding yourself. The real issue is replication between servers, but most application servers are pretty good at doing this via HTTP - streaming the read and write - and, as another answerer pointed out, keeping the database and file system in sync for backups.
Storing BLOB data in the database is generally not considered the right way to go unless the files are very small. Storing their paths instead is more appropriate; it will greatly improve database query and retrieval performance.
Here is a detailed comparison I have made:
http://akashkava.com/blog/127/huge-file-storage-in-database-instead-of-file-system/
I have a class that is creating an instance of StreamReader to an xml file on the local filesystem.
It may be possible that this same file is requested multiple times per-second.
I was wondering whether I need to manually add this file to the System.Web cache and read it from there, or whether Windows itself is clever enough to cache the file, so that when ASP.NET requests it a second/third time it doesn't have to do a disk seek/read and can pull it from its own cache.
This article, http://dotnetperls.com/file-read-benchmarks, seems to back this up, but this article, http://msdn.microsoft.com/en-us/library/18c1wd61%28v=VS.100%29.aspx (although not discussing it from a performance perspective, and maybe for other reasons entirely), describes how to add a physical file to the cache.
I don't think caching the file itself would be all that useful, since it resides on the same server as the page, and it probably gets cached automatically by the OS anyway. Instead, what you should cache is the output of the StreamReader: store the XML after you have read it in, and you will save the time it takes to read the file plus whatever processing you do to get it into a usable format.
Then you can manually add it to the HttpRuntime.Cache, and you can even set a file dependency on the original file to expire the cached output.
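A minimal sketch of that approach (the cache key and file path are placeholders):

```csharp
using System.IO;
using System.Web;
using System.Web.Caching;

// Sketch only: "MyXmlData" and the path are placeholders.
public static class XmlCache
{
    public static string GetXmlContent(string filePath)
    {
        var cached = HttpRuntime.Cache["MyXmlData"] as string;
        if (cached != null)
            return cached;

        string content = File.ReadAllText(filePath);

        // The CacheDependency evicts the entry as soon as the file changes.
        HttpRuntime.Cache.Insert("MyXmlData", content, new CacheDependency(filePath));
        return content;
    }
}
```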
I maintain several client sites that have no dynamic data whatsoever, everything is static asp.net with c#.
Are there any pitfalls to caching the entire page for extreme periods of time, like a week?
Kibbee, we use a couple of controls (ad rotator, some of the AJAX extensions) on the sites. They could probably be written entirely in HTML, but for convenience's sake I just stuck with what we use for every other site.
The only significant pitfall to long cache times occurs when you want to update the data. To be safe, you have to assume that it will take up to a week for the new version to become available. Intermediate hosts such as ISP-level proxy servers often cache aggressively, so this delay will happen.
If there are large files to be cached, I'd look at ensuring your content engine supports If-Modified-Since.
For smaller files (page content, CSS, images, etc), where reducing the number of round-trips is the key, having a long expiry time (a year?) and changing the URL when the content changes is the best. This lets you control when user agents will fetch the new content.
Yahoo! have published a two part article on reducing HTTP requests and browser cache usage. I won't repeat it all here, but these are good reads which will guide you on what to do.
My feeling is to pick a time period high enough to cover most users single sessions but low enough to not cause too much inconvenience should you wish to update the content. Be sure to support If-Modified-Since if you have a Last-Modified for all your content.
Finally, if your content is cacheable at all and you need to push new content out now, you can always use a new URL. This final cacheable content URL can sit behind a fixed HTTP 302 redirect should you wish to publish a permanent link to the latest version.
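For what it's worth, here is a rough sketch of setting a long public expiry plus a Last-Modified on a response in ASP.NET (the one-week lifetime is just an example value):

```csharp
using System;
using System.Web;

// Sketch only: the one-week lifetime is illustrative.
public static class CacheHeaders
{
    public static void SetLongExpiry(HttpResponse response, DateTime lastModifiedUtc)
    {
        response.Cache.SetCacheability(HttpCacheability.Public);
        response.Cache.SetExpires(DateTime.UtcNow.AddDays(7));
        response.Cache.SetMaxAge(TimeSpan.FromDays(7));

        // Lets clients revalidate with If-Modified-Since instead of re-downloading.
        response.Cache.SetLastModified(lastModifiedUtc);
    }
}
```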
We have a similar issue on a project I am working on. There is data that is pretty much static, but is open to change..
What I ended up doing is saving the data to a local file and then monitoring it for changes. The DB server is then never hit unless we remove the file, in which case the app scoots off to the DB and regenerates the data file.
So what we basically have is a little bit of disk I/O while loading/saving, no traffic to the DB server unless necessary, and we are still in control of it (we can either delete the file manually or script it, etc.).
I should also add that you could then tie this in with the actual web server caching model if you wanted to reduce the disk I/O (we didn't really need to in our case).
This could be totally the wrong way to go about it, but it seems to work quite nicely for us :)
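As a rough sketch of that approach (the names are placeholders, and LoadFromDatabase stands in for the real DB call):

```csharp
using System.IO;

// Sketch only: serve from a local data file if present, otherwise rebuild
// it from the database.
public static class DataFileCache
{
    public static string GetData(string dataFilePath)
    {
        if (!File.Exists(dataFilePath))
        {
            // File was deleted (or never created): rebuild it from the DB.
            string fresh = LoadFromDatabase();
            File.WriteAllText(dataFilePath, fresh);
            return fresh;
        }
        return File.ReadAllText(dataFilePath);
    }

    private static string LoadFromDatabase()
    {
        return "...";   // placeholder for the real database query
    }
}
```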
If it's static, why bother caching at all? Let IIS worry about it.
When you say that you have no data, how are you even using ASP.NET or C#? What functionality does that provide over plain HTML? Also, if you do plan on caching, it's probably best to cache to a file and then, when a request is made, stream out the file. The OS will take care of keeping the file in memory so that you won't have to read it off disk all the time.
You may want to build in a cache updating mechanism if you want to do this, just to make sure you can clear the cache if you need to do a code update. Other than that, there aren't any problems that I can think of.
If it is static, you would probably be better off generating the pages once and then serving up the resulting static HTML files directly.