Necessary to manually cache static files in C# / ASP.NET?

I have a class that is creating an instance of StreamReader to an XML file on the local filesystem.
It may be possible that this same file is requested multiple times per second.
I was wondering whether I need to manually add this file to the System.Web.Cache and read it from there, or whether Windows itself is clever enough to cache the file, so that when ASP.NET requests it the second/third etc. time it doesn't have to do a disk seek/read operation and pulls it from its own cache?
This article: http://dotnetperls.com/file-read-benchmarks seems to back this up, but this article: http://msdn.microsoft.com/en-us/library/18c1wd61%28v=VS.100%29.aspx (although not discussing it from a performance perspective, and maybe for other reasons entirely) lists how to add a physical file to the cache.

I don't think caching the file itself would be all that useful, since it resides on the same server as the page, and it probably does get cached automatically by IIS anyway. Instead, what you should cache is the output from the StreamReader. Store the XML after you have read it in, and then you will save both the time it takes to read it and whatever processing you do to get it into a usable format.
You can manually add it to the HttpRuntime.Cache, and you can even set a file dependency on the original file to expire the cached output.
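For example, a minimal sketch of that approach (the cache key and the use of XDocument here are just placeholders for whatever parsing you actually do):

    using System.Web;
    using System.Web.Caching;
    using System.Xml.Linq;

    public static class ConfigXmlCache
    {
        private const string CacheKey = "MyXmlData"; // hypothetical key name

        public static XDocument GetXml(string physicalPath)
        {
            var cached = HttpRuntime.Cache[CacheKey] as XDocument;
            if (cached != null)
                return cached;

            // Parse once and cache the parsed result rather than the raw file contents.
            var doc = XDocument.Load(physicalPath);

            // The CacheDependency evicts the entry as soon as the file changes on disk.
            HttpRuntime.Cache.Insert(CacheKey, doc, new CacheDependency(physicalPath));

            return doc;
        }
    }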

Related

Multiple Async File Uploads with chunking to ASP.Net Web API

I have read a number of closely related questions but not one that hits this exactly. If it is a duplicate, please send me a link.
I am using an angular version of the flowjs library for doing HTML5 file uploads (https://github.com/flowjs/ng-flow). This works very well and I am able to upload multiple files simultaneously in 1MB chunks. There is an ASP.Net Web API Files controller that accepts these and saves them on disk. Although I can make this work, I am not doing it efficiently and would like to know a better approach.
First, I used the MultipartFormDataStreamProvider in an async method that worked great as long as the file uploaded in a single chunk. Then I switched to just using the FileStream to write the file to disk. This also worked as long as the chunks arrived in order, but of course, I cannot rely on that.
Next, just to see it work, I wrote the chunks to individual file streams and combined them after the fact, hence the inefficiency. A 1GB file would generate a thousand chunks that needed to be read and rewritten after the upload was complete. I could hold all file chunks in memory and flush them after they are all uploaded, but I'm afraid the server would blow up.
It seems that there should be a nice asynchronous solution to this dilemma, but I don't know what it is. One possibility might be to use async/await to combine previous chunks while writing the current chunk. Another might be to use Begin/EndInvoke to create a separate thread, so that the file manipulation on disk is handled independently of the thread reading from the HttpContext; but this would rely on the ThreadPool, and I'm afraid the created threads will be unduly terminated when my MVC controller returns. I could create a FileWatcher that ran completely independent of ASP.Net, but that would be very kludgey.
So my questions are: 1) is there a simple solution already that I am missing? (seems like there should be) and 2) if not, what is the best approach to solving this inside the Web API framework?
Thanks, bob
I'm not familiar with that kind of chunked upload, but I believe this should work:
Use flowTotalSize to pre-allocate the file when the first chunk comes in.
Have one SemaphoreSlim per file to serialize the asynchronous writes for that file.
Each chunk will write to its own offset (flowChunkSize * (flowChunkNumber - 1)) within the file.
This doesn't handle situations where the uploads are unexpectedly terminated. That kind of solution usually involves allocating/writing a temporary file (with a special extension) and then moving/renaming that file once the last chunk arrives.
Don't forget to ensure that your file writing is actually asynchronous.
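Roughly, a sketch of that idea (untested; the helper name is made up, and it assumes you have already pulled the chunk bytes and flow.js's flowChunkNumber/flowChunkSize/flowTotalSize form fields out of the multipart request):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    public static class ChunkWriter
    {
        // One SemaphoreSlim per target file, keyed by path, to serialize writes to that file.
        private static readonly ConcurrentDictionary<string, SemaphoreSlim> Locks =
            new ConcurrentDictionary<string, SemaphoreSlim>();

        public static async Task WriteChunkAsync(
            string targetPath, byte[] chunk, int chunkNumber, int chunkSize, long totalSize)
        {
            var gate = Locks.GetOrAdd(targetPath, _ => new SemaphoreSlim(1, 1));
            await gate.WaitAsync();
            try
            {
                using (var fs = new FileStream(
                    targetPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.None,
                    bufferSize: 4096, useAsync: true))
                {
                    // Pre-allocate the full file the first time we see it.
                    if (fs.Length < totalSize)
                        fs.SetLength(totalSize);

                    // flow.js chunk numbers are 1-based, so chunk N starts at (N - 1) * chunkSize.
                    fs.Seek((long)(chunkNumber - 1) * chunkSize, SeekOrigin.Begin);
                    await fs.WriteAsync(chunk, 0, chunk.Length);
                }
            }
            finally
            {
                gate.Release();
            }
        }
    }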
Using @Stephen Cleary's answer, and this thread: https://github.com/flowjs/ng-flow/issues/41, I was able to make an ASP.NET Web API implementation and have uploaded it for those still wondering about this question, such as @Herb Caudill:
https://github.com/samhowes/NgFlowSample/tree/master.
The original answer is the real answer to this question, but I don't have enough reputation yet to comment. I did not use a SemaphoreSlim, but instead enabled file write sharing. I did, however, pre-allocate the file and make sure that each chunk gets written to the right location by calculating an offset.
I will be contributing this to the Flow samples at: https://github.com/flowjs/flow.js/tree/master/samples
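This is not taken from that sample, but a rough illustration of the write-sharing variant described above (it assumes the file was already created and pre-allocated when the first chunk arrived):

    using System.IO;
    using System.Threading.Tasks;

    public static class SharedWriteChunkWriter
    {
        // FileShare.Write lets concurrent requests each open the same file and write
        // to their own, non-overlapping offsets without an explicit lock.
        public static async Task WriteChunkAsync(
            string targetPath, byte[] chunk, int chunkNumber, int chunkSize)
        {
            using (var fs = new FileStream(
                targetPath, FileMode.Open, FileAccess.Write, FileShare.Write,
                bufferSize: 4096, useAsync: true))
            {
                fs.Seek((long)(chunkNumber - 1) * chunkSize, SeekOrigin.Begin);
                await fs.WriteAsync(chunk, 0, chunk.Length);
            }
        }
    }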
This is what I have done: upload the chunks, save them on the server, and save the location of each chunk in the database along with its order (not the order they arrived in, but the chunk's position in the file it belongs to).
Then I introduced another endpoint to merge those chunks. Since this part can be a long process, I used a messaging service to run it in the background.
And after the service is done merging the file, it sends a notification (or you can trigger an event).
Agreed, it won't fix the problem of having to save all those chunks, but after the merging is done we can just delete them from disk. Note that some IIS configuration is required for the upload to work smoothly.
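A rough sketch of that merge-and-delete step (the names are made up, and in practice this would run inside the background job; chunkPaths must already be in file order as stored in the database):

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    public static class ChunkMerger
    {
        public static async Task MergeAsync(IReadOnlyList<string> chunkPaths, string targetPath)
        {
            using (var output = new FileStream(
                targetPath, FileMode.Create, FileAccess.Write, FileShare.None,
                bufferSize: 81920, useAsync: true))
            {
                // Append each chunk in its stored order, not the order it was uploaded.
                foreach (var chunkPath in chunkPaths)
                {
                    using (var input = File.OpenRead(chunkPath))
                    {
                        await input.CopyToAsync(output);
                    }
                }
            }

            // Once the merge succeeds, the individual chunk files can be deleted.
            foreach (var chunkPath in chunkPaths)
                File.Delete(chunkPath);
        }
    }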
Here are my two cents on this old question. Nowadays most applications use Azure or AWS for storage, but I'm still sharing my thoughts in case they help someone.

Performance wise: File.Copy vs File.WriteAllText function in C#?

I have file content in string and I need to put the same content in 3 different files.
So, I am using File.WriteAllText() function of C# to put the content in first file.
Now, for the other 2 files, I have two options:
Using File.Copy(firstFile, otherFile)
Using File.WriteAllText(otherFile, content)
Performance-wise, which option is better?
If the file is relatively small it is likely to remain cached in Windows disk cache, so the performance difference will be small, or it might even be that File.Copy() is faster (since Windows will know that the data is the same, and File.Copy() calls a Windows API that is extremely optimised).
If you really care, you should instrument it and time things, although the timings are likely to be completely skewed because of Windows file caching.
One thing that might be important to you though: if you use File.Copy(), the file attributes, including Creation Time, will be copied. If you programmatically create all the files, the Creation Time is likely to differ between them.
If this is important to you, you might want to programmatically set the file attributes after the copy so that they are the same for all files.
Personally, I'd use File.Copy().
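If you do decide to measure it yourself, a crude one-off timing looks something like this (repeat the runs and vary the file size, since the first run warms the disk cache and will skew the numbers):

    using System;
    using System.Diagnostics;
    using System.IO;

    class CopyVsWriteBenchmark
    {
        static void Main()
        {
            string content = new string('x', 1024 * 1024); // ~1 MB of dummy text
            File.WriteAllText("first.txt", content);

            var sw = Stopwatch.StartNew();
            File.Copy("first.txt", "copy.txt", overwrite: true);
            sw.Stop();
            Console.WriteLine("File.Copy:         {0:F2} ms", sw.Elapsed.TotalMilliseconds);

            sw.Restart();
            File.WriteAllText("write.txt", content);
            sw.Stop();
            Console.WriteLine("File.WriteAllText: {0:F2} ms", sw.Elapsed.TotalMilliseconds);
        }
    }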

Best way to create disc cache for web service

I have created a webservice that delivers images. It will always be one-way communication; the images will never be changed on the side that gets them from the service.
It has multiple sources, and some can be far away, on bad connections.
I have created a memory cache for it, but I would like to also have a disc cache, to store images for longer periods.
I am a bit unsure on the best approach to do this.
First of all, all of my sources are webservers, so I don't really know how to check the last modified date (as an example) of my images, which I would like to use to see if a file has changed.
Second, how do I best store my local cache? Just drop the files in a folder and compare dates with the original source?
Or perhaps store all the timestamps in a txt file alongside the images, to avoid checking the files themselves?
Or maybe store them in a local SQL Express DB?
The images, in general, are not very large. Most are around 200 KB, but every now and then there will be one of 7+ MB.
The big problem is, that some of the locations, where the service will be hosted, are on really bad connections, and they will need to use the same image, many times.
There are no real performance requirements; I just want to make it as responsive as possible for the locations that have a horrible connection to our central servers.
I can't install any "real" cache systems. It has to be something I can handle in my code.
Why don't you install a proxy server on your server, and access all the remote web-servers through that? The proxy server will take care of caching for you.
EDIT: Since you can't install anything and don't have a database available, I'm afraid you're stuck with implementing the disk cache yourself.
The good news is - it's relatively easy. You need to pick a folder and place your image files there. And you need a unique mapping between your image identification and a file name. If your image IDs are numbers, the mapping is very simple...
When you receive a request for an image, first check for it on the disk. If it's there, you have it already. If not, download it from the remote server and store it there, then serve it from there.
You'll need to take concurrent requests into account. Make sure writing the files to disk is a relatively brief process (you can write them once you finish downloading them). When you write the file to disk, make sure nobody can open it for reading; that way you avoid sending incomplete files.
Now you just need to handle the case where the file isn't in your cache, and two requests for it are received at once. If performance isn't a real issue, just download it twice.
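A bare-bones sketch of that, ignoring eviction and last-modified checks (the class name, the .img extension and the WebClient usage are just illustrative):

    using System.IO;
    using System.Net;

    public class DiskImageCache
    {
        private readonly string _cacheFolder;

        public DiskImageCache(string cacheFolder)
        {
            _cacheFolder = cacheFolder;
            Directory.CreateDirectory(cacheFolder);
        }

        // Assumes a simple unique mapping from image id to file name (e.g. a numeric id).
        public byte[] GetImage(string imageId, string remoteUrl)
        {
            string cachePath = Path.Combine(_cacheFolder, imageId + ".img");
            if (File.Exists(cachePath))
                return File.ReadAllBytes(cachePath);

            // Not cached yet: download, write to a temp file, then move it into place
            // so readers never see a half-written file.
            using (var client = new WebClient())
            {
                byte[] data = client.DownloadData(remoteUrl);
                string tempPath = cachePath + "." + Path.GetRandomFileName() + ".tmp";
                File.WriteAllBytes(tempPath, data);
                try { File.Move(tempPath, cachePath); }
                catch (IOException) { File.Delete(tempPath); } // another request cached it first
                return data;
            }
        }
    }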

Synchronizing filesystem and cached data on program startup

I have a program that needs to retrieve some data about a set of files (that is, a directory and all files within it, and subdirectories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added or changed (I presume I can detect changes via last modified/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
If it matters: the program is Windows-only written in C# on .NET CLR 3.5, using the SQLite for ADO.NET thing which is being accessed via the entity framework/LINQ for ADO.NET.
Our application is a cross-platform C++ desktop application, but it has very similar requirements. Here's a high-level description of what I did:
In our SQLite database there is a Files table that stores file_id, name, hash (currently we use the last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (names like Walker and Loader below are just what I happened to pick for class names):
On 1st Launch
The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
This takes a while, so 1st launch can be a bit slow.
There's a big testability benefit because you can test the walking the filesystem code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes the related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED.
The Walker then walks the filesystem to pick up new files.
Finally, the Loader loads all UNPARSED files.
Currently the "worst case scenario" (every file changes) is very rare, so we do this every time the application starts up. But by splitting the process up into these steps we could easily extend the implementation so that:
The Scrubber/Deleter could be refactored to leave the dirty records in place until after the new data is loaded (so the application "keeps working" while new data is cached into the database)
The Loader could load/parse on a background thread during an idle time in the main application
If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.
Just some thoughts / suggestions. Hope they help!
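The scrub step in particular is simple; here's a simplified in-memory sketch of it in C# (the real version iterates the SQLite Files table, and all the names here are made up):

    using System;
    using System.Collections.Generic;
    using System.IO;

    enum FileState { Unparsed, Parsed, Dirty, Deleted }

    class FileRecord
    {
        public long FileId;
        public string Path;
        public DateTime LastWriteUtc; // the "hash" value, per the description above
        public FileState State;
    }

    static class Scrubber
    {
        // Marks records DIRTY or DELETED; the Deleter and Loader then act on those states.
        public static void Scrub(IEnumerable<FileRecord> fileTable)
        {
            foreach (var record in fileTable)
            {
                if (!File.Exists(record.Path))
                    record.State = FileState.Deleted;
                else if (File.GetLastWriteTimeUtc(record.Path) != record.LastWriteUtc)
                    record.State = FileState.Dirty;
            }
        }
    }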
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you caught its change last time (with your FileSystemWatcher).
Hope this helps. Have fun.
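A minimal sketch of that last-run-time check (reading and persisting the timestamp is left out, and the directory filter is an assumption):

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class StartupScan
    {
        // lastRunUtc would be read from a settings file saved on the previous run.
        public static IEnumerable<string> FindChangedFiles(string rootDir, DateTime lastRunUtc)
        {
            foreach (var path in Directory.GetFiles(rootDir, "*.*", SearchOption.AllDirectories))
            {
                if (File.GetLastWriteTimeUtc(path) > lastRunUtc)
                    yield return path; // changed (or added) since the last run
            }
        }
    }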

Long-term Static Page Caching

I maintain several client sites that have no dynamic data whatsoever; everything is static ASP.NET with C#.
Are there any pitfalls to caching the entire page for extreme periods of time, like a week?
Kibbee, we use a couple of controls (an ad rotator, some of the AJAX extensions) on the sites. They could probably be written entirely in HTML, but for convenience's sake I just stuck with what we use for every other site.
The only significant pitfall to long cache times occurs when you want to update that data. To be safe, you have to assume that it will take up to a week for the new version to become available. Intermediate hosts such as ISP-level proxy servers often do cache aggressively, so this delay will happen.
If there are large files to be cached, I'd look at ensuring your content engine supports If-Modified-Since.
For smaller files (page content, CSS, images, etc), where reducing the number of round-trips is the key, having a long expiry time (a year?) and changing the URL when the content changes is the best. This lets you control when user agents will fetch the new content.
Yahoo! have published a two part article on reducing HTTP requests and browser cache usage. I won't repeat it all here, but these are good reads which will guide you on what to do.
My feeling is to pick a time period high enough to cover most users single sessions but low enough to not cause too much inconvenience should you wish to update the content. Be sure to support If-Modified-Since if you have a Last-Modified for all your content.
Finally, if your content is cacheable at all and you need to push new content out now, you can always use a new URL. This final cacheable content URL can sit behind a fixed HTTP 302 redirect URL should you wish to publish a permanent link to the latest version.
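In Web Forms you can set the relevant headers from code-behind (or a base page class); a hedged sketch, assuming a one-week window:

    using System;
    using System.IO;
    using System.Web;

    public partial class StaticPage : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // Allow browsers and intermediate proxies to cache this response for a week,
            // and advertise a Last-Modified date based on the underlying file.
            Response.Cache.SetCacheability(HttpCacheability.Public);
            Response.Cache.SetExpires(DateTime.Now.AddDays(7));
            Response.Cache.SetMaxAge(TimeSpan.FromDays(7));
            Response.Cache.SetLastModified(File.GetLastWriteTime(Request.PhysicalPath));
        }
    }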
We have a similar issue on a project I am working on. There is data that is pretty much static, but is open to change.
What I ended up doing is saving the data to a local file and then monitoring it for changes. The DB server is then never hit unless we remove the file, in which case it will scoot off to the DB and regenerate the data file.
So what we basically have is a little bit of disk IO while loading/saving, no traffic to the DB server unless necessary, and we are still in control of it (we can either delete the file manually or script it, etc.).
I should also add that you could then tie this up with the actual web server caching model if you wanted to reduce the disk IO (we didn't really need to in our case).
This could be totally the wrong way to go about it, but it seems to work quite nice for us :)
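A stripped-down sketch of that pattern (the class name and LoadFromDatabase are placeholders; it assumes dataFile is a full path):

    using System.IO;

    public class FileBackedCache
    {
        private readonly string _dataFile;
        private readonly FileSystemWatcher _watcher;
        private volatile string _cached;

        public FileBackedCache(string dataFile)
        {
            _dataFile = dataFile;
            _watcher = new FileSystemWatcher(
                Path.GetDirectoryName(dataFile), Path.GetFileName(dataFile));
            // Deleting or editing the file just drops the in-memory copy;
            // it will be rebuilt or re-read on the next request.
            _watcher.Deleted += (s, e) => _cached = null;
            _watcher.Changed += (s, e) => _cached = null;
            _watcher.EnableRaisingEvents = true;
        }

        public string GetData()
        {
            var local = _cached;
            if (local != null)
                return local;

            if (!File.Exists(_dataFile))
            {
                local = LoadFromDatabase();          // hit the DB only when needed
                File.WriteAllText(_dataFile, local); // our own write raises Changed; harmless re-read
            }
            else
            {
                local = File.ReadAllText(_dataFile);
            }

            _cached = local;
            return local;
        }

        private string LoadFromDatabase()
        {
            // Placeholder for the real query that regenerates the data file.
            return "regenerated data";
        }
    }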
If it's static, why bother caching at all? Let IIS worry about it.
When you say that you have no dynamic data, how are you even using ASP.NET or C#? What functionality do they provide you over plain HTML? Also, if you do plan on caching, it's probably best to cache to a file, and then when a request is made, stream out the file. The OS will take care of keeping the file in memory so that you won't have to read it off the disk all the time.
You may want to build in a cache updating mechanism if you want to do this, just to make sure you can clear the cache if you need to do a code update. Other than that, there aren't any problems that I can think of.
If it is static, you would probably be better off generating the pages once and then serving up the resulting static HTML files directly.
