I've been given a task to build a prototype for an app. I don't have any code yet, as the solution concepts that I've come up with seem stinky at best...
The problem:
The solution consists of various Azure projects which operate on lots of data stored in Azure SQL databases. Almost every action that happens creates a gzipped log file in blob storage, so that's one .gz file per log entry.
We should also have a small desktop (WPF) app that should be able to read, filter and sort these log files.
I have absolutely 0 influence on how the logging is done, so this is something that can not be changed to solve this problem.
Possible solutions that I've come up with (conceptually):
1:
connect to the blob storage
open the container
read/download blobs (with applied filter)
decompress the .gz files
read and display
The problem with this is that, depending on the filter, this could mean a whole lot of data to download (which is slow), and process (which will also not be very snappy). I really can't see this as a usable application.
2:
create a web role which will run a WCF or REST service
the service will take the filter params and other stuff and return a single xml/json file with the data; the processing will be done in the cloud
With this approach, will I run into problems with decompressing these files if there are a lot of them (will it take up extra space on the storage/compute instance where the service is running)?
EDIT: what I mean by filter is limit the results by date and severity (info, warning, error). The .gz files are saved in a structure that makes this quite easy, and I will not be filtering by looking into the files themselves.
3:
some other elegant and simple solution that I don't know of
I'd also need some way of making the app update the displayed logs in real time, which I suppose would need to be done with repeated requests to the blob storage/service.
This is not one of those "give me code" questions. I am looking for advice on best practices, or similar solutions that worked for similar problems. I also know this could be one of those "no one right answer" questions, as people have different approaches to problems, but I have some time to build a prototype, so I will be trying out different things, and I will select the right answer, which will be the one that showed a solution that worked, or the one that steered me in the right direction, even if it does take some time before I actually build something and test it out.
As I understand it, you have a set of log files in Azure Blob storage that are formatted in a particular way (gzip) and you want to display them.
How big are these files? Are you displaying every single piece of information in the log file?
Assuming that since these are log files, they are static and historical... meaning that once the log/gzip file is created it cannot be changed (you are not updating the gzip file once it is out on Blob storage). Only new files can be created...
One Solution
Why not create a worker role/job process that periodically goes out, scans the blob storage and builds a persisted "database" that you can display from. The nice thing about this is that you are not putting the unzipping/business logic to extract the log files in the WPF app or UI.
1) I would have the worker role scan the log file in Azure Blob storage
2) Have some kind of mechanism to track which ones were processed and a current "state", maybe the UTC date of the last gzip file processed
3) Do all the unzipping/extracting of the log file in the worker role
4) Have the worker role place the content in a SQL database, Azure Table Storage or Distributed Cache for access
5) Access can be done by a REST service (ASP.NET Web API/Node.js etc)
You can add more things if you need to scale this out, for example run this as a job to re-do all of the log files from a given time (refresh all). I don't know the size of your data so I am not sure if that is feasible.
Nice thing about this is that if you need to scale your job (overnight), you can spin up 2, 3, 6 worker roles...extract the content, pass the result to a Service Bus or Storage Queue that would insert into SQL, Cache etc for access.
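To make the scan/extract step concrete, here is a rough sketch using the classic Microsoft.WindowsAzure.Storage SDK; the "logs" container name, the lastCheckpointUtc variable and the SaveLogEntry helper are assumptions for illustration, not part of your setup:
var account = CloudStorageAccount.Parse(connectionString);
var container = account.CreateCloudBlobClient().GetContainerReference("logs");
// Scan the container and only touch blobs newer than the last checkpoint (step 2 above).
foreach (var item in container.ListBlobs(null, useFlatBlobListing: true))
{
    var blob = item as CloudBlockBlob;
    if (blob == null || blob.Properties.LastModified <= lastCheckpointUtc)
        continue; // already processed on a previous pass
    // One .gz blob = one log entry: decompress it and hand it off for persistence (steps 3 and 4).
    using (var gz = new GZipStream(blob.OpenRead(), CompressionMode.Decompress))
    using (var reader = new StreamReader(gz))
    {
        SaveLogEntry(blob.Name, reader.ReadToEnd()); // insert into SQL / Table Storage / cache
    }
}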
Simply storing the blobs isn't sufficient. The metadata you want to filter on should be stored somewhere else where it's easy to filter and retrieve all the metadata. So I think you should split this into 2 problems:
A. How do I efficiently list all "gzips" with their metadata, and how can I apply a filter on these gzips in order to show them in my client application?
Solutions
Blobs: Listing blobs is slow and filtering is not possible (you could group in a container per month or week or user or ... but that's not filtering).
Table Storage: Very fast, but searching is slow (only PK and RK are indexed)
SQL Azure: You could create a table with a list of "gzips" together with some other metadata (like user that created the gzip, when, total size, ...). Using a stored procedure with a few good indexes you can make search very fast, but SQL Azure isn't the most scalable solution
Lucene.NET: There's an AzureDirectory for Windows Azure which makes it possible to use Lucene.NET in your application. This is a super fast search engine that allows you to index your 'documents' (metadata) and this would be perfect to filter and return a list of "gzips"
Update: Since you only filter on date and severity you should review the Blob and Table options:
Blobs: You can create a container per date+severity (20121107-low, 20121107-medium, 20121107-high, ...). Assuming you don't have too many blobs per date+severity, you can simply list the blobs directly from the container. The only issue you might have here is that a user will want to see all items with a high severity from the last week (7 days). This means you'll need to list the blobs in 7 containers.
Tables: Even though you say table storage or a db aren't an option, do consider table storage. Using partitions and row keys you can easily filter in a very scalable way (you can also use CompareTo to get a range of items, for example all records between 1 and 7 November); see the sketch below. Duplicating data is perfectly acceptable in Table Storage. You could include some data from the gzip in the Table Storage entity in order to show it in your WPF application (the most essential information you want to show after filtering). This means you'll only need to process the blob when the user opens/double clicks the record in the WPF application.
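A sketch of such a query with the Table Storage SDK, assuming severity is used as the PartitionKey and a sortable timestamp (plus a unique suffix) as the RowKey; the LogIndex table and LogEntryEntity class are made-up names:
var table = account.CreateCloudTableClient().GetTableReference("LogIndex");
// All "error" entries between 1 and 7 November: a single indexed range scan inside one partition.
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "error"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "20121101"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "20121108")));
var results = table.ExecuteQuery(new TableQuery<LogEntryEntity>().Where(filter));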
B. How do I display a "gzip" in my application (after double clicking on a search result for example)
Solutions
Connect to the storage account from the WPF application, download the file, unzip it and display it. This means that you'll need to store the storage account credentials in the WPF application (or use SAS or a container policy), and if you decide to change something in the backend of how files are stored, you'll also need to change the WPF application.
Connect to a Web Role. This Web Role gets the blob from blob storage, unzips it and sends it over the wire (or sends it compressed in order to speed up the transfer). In case something changes in how you store files, you only need to update the Web Role.
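A rough sketch of that Web Role variant as an ASP.NET Web API action; the route parameters, the "logs" container layout and the storage account wiring are assumptions:
public class LogsController : ApiController
{
    private static readonly CloudStorageAccount account =
        CloudStorageAccount.Parse(ConfigurationManager.AppSettings["StorageConnectionString"]); // key name is an assumption

    // GET api/logs?date=20121107&severity=error&name=12345
    public HttpResponseMessage Get(string date, string severity, string name)
    {
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("logs")
                          .GetBlockBlobReference(string.Format("{0}/{1}/{2}.gz", date, severity, name));
        // Decompress on the server and return plain text; the WPF client never sees the storage account.
        using (var gz = new GZipStream(blob.OpenRead(), CompressionMode.Decompress))
        using (var reader = new StreamReader(gz))
        {
            return Request.CreateResponse(HttpStatusCode.OK, reader.ReadToEnd());
        }
    }
}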
Related
We are working on an educational website which allows users (teachers and students) to upload files (.pdf, .docx, .png and ...). We don't have any experience in this area and wanted to make sure we are doing the right thing to store and index these files. We would like to have an architecture that scales well to high volumes of data.
Currently we store the path to our files in the database like below (Nvarchar(MAX)):
~/Files/UserPhotos/2fd7199b-a491-433d-acf9-56ce54b6b14f_168467team-03.png
and we use the code below to save and retrieve files:
//save:
file.SaveAs(Server.MapPath("~/Files/UserPhotos/") + fileName);
//retrieve:
<img alt="" src="#Url.Content(Model.FilePath)">
Now our questions are:
Are we proceeding in a good direction?
Should we save files in a root directory or a virtual directory?
Imagine our server has 1 TB of storage; after storing 1 TB of data, if we add an extra hard drive, how should we manage the change?
We searched a lot but did not find any good tutorial or guidelines for the correct architecture.
sorry for my bad English.
In an ideal world, you would be using cloud storage such as Azure Blob Storage. If that's not an option, then the way I would do it is to create a separate web service that specifically deals with uploaded files and file storage.
By creating a separate web service that manages file storage, you will have isolated your concerns. This service can monitor hard drive storage space and balance it out as documents are being uploaded, and if you add additional servers in the future, you will already have separated the service, so it won't be as big of a mess as it would be otherwise.
You can index everything in a SQL data store as files are being uploaded. Your issues are actually much more complicated than what I've just mentioned though...
The other issue that needs attention is the game plan for if or when one of the hard drives goes kaput! Without a RAID 1 configuration of your hard drives, your availability plummets to NADA.
Cue issue number 2... availability != backups... You need to consider your game plan for how you intend to back the system up, how often, during what time of day, etc... The more data you have, the more difficult this gets...
This is why everyone is moving over to Azure / AWS etc... you just don't have to worry about these sorts of things anymore...
1. I usually save files in this way:
file.SaveAs(Server.MapPath("/Files/UserPhotos/") + fileName);
2. It is better to save them in a virtual directory, so that you can move your files folder to an extra hard disk and change the virtual directory's path in IIS when there are too many files in this folder.
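Building on that, a small sketch of saving against a configurable base path instead of Server.MapPath, so the folder can later be remapped to another drive or virtual directory without touching code; the "UserFilesRoot" app setting is a made-up key:
string baseDir = ConfigurationManager.AppSettings["UserFilesRoot"]; // e.g. E:\SiteFiles\UserPhotos
string safeName = Guid.NewGuid() + Path.GetExtension(file.FileName); // avoid collisions and odd client names
Directory.CreateDirectory(baseDir); // no-op if the folder already exists
file.SaveAs(Path.Combine(baseDir, safeName));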
I'm new to Web API and I'm working on my first project. I'm working on a mobile CRM system for our company.
I want to store company logos, customers' face photos, etc.
I found some tutorials on this topic, but unfortunately some of them were old (don't use async) and the others don't work.
At the end I found this one:
http://www.intstrings.com/ramivemula/articles/file-upload-using-multipartformdatastreamprovider-in-asp-net-webapi/
It works correctly, but I don't understand a few things.
1) Should I use App_Data (or any other folder like /Uploads) for storing these images, or rather store the images in the database?
2) Can I allow only supported image types like .jpg and .png and reject any other files?
3) How can I process the image in the upload method? For example resize it, reduce the file size, adjust the quality, etc.?
Thank you
1) We are storing files in a different location than app_data. We have a few customer groups and we gave them all a unique folder that we get from the database. Storing in database is also an option but if you go down this road, make sure that the files you are saving don't belong directly to a table that you need to retrieve often. There is no right or wrong, but have a read at this question and answer for some pros and cons.
2) If you followed that guide, you can put a check inside the loop to verify the file extension, rejecting anything that isn't on your list of allowed types:
List<string> allowedExtensions = new List<string> { ".jpg", ".png" };
foreach (MultipartFileData file in provider.FileData)
{
    string fileName = Path.GetFileName(file.LocalFileName);
    // reject anything whose extension is not in the allowed list
    if (!allowedExtensions.Contains(Path.GetExtension(fileName).ToLowerInvariant()))
        throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
    files.Add(Path.GetFileName(file.LocalFileName));
}
3) Resizing images is something that I have never personally done myself, but I think you should have a look at the System.Drawing namespace (the Graphics and Bitmap classes).
Found a link with an accepted answer for downscaling a picture: ASP.Net MVC Image Upload Resizing by downscaling or padding
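For completeness, a minimal resize sketch with System.Drawing (Bitmap/Graphics); the 800px maximum width and saving next to the uploaded file are arbitrary choices for illustration:
using (var original = Image.FromFile(file.LocalFileName))
{
    int width = Math.Min(original.Width, 800);
    int height = (int)(original.Height * (width / (double)original.Width)); // keep the aspect ratio
    using (var resized = new Bitmap(width, height))
    using (var g = Graphics.FromImage(resized))
    {
        g.InterpolationMode = InterpolationMode.HighQualityBicubic; // System.Drawing.Drawing2D
        g.DrawImage(original, 0, 0, width, height);
        resized.Save(file.LocalFileName + "_small.jpg", ImageFormat.Jpeg);
    }
}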
None of the questions are actually related to Web API or REST.
If you are using SQL Server 2008 or newer, the answer is to use FILESTREAM columns. This looks like a column in the database with all its advantages (i.e. backup, replication, transactions) but the data is actually stored in the file system. So you get the best of both worlds: it will not happen that someone accidentally deletes a file so the database references a nonexistent file, or vice versa, that records are deleted from the database but the files are not, leaving you with a bunch of orphan files. Using a database has many other advantages, i.e. metadata can be associated with files and permissions are easier to set up.
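As a sketch (not your exact schema), inserting into a FILESTREAM-backed table works with a plain parameterized command; the Documents table below is hypothetical and assumes a ROWGUIDCOL uniqueidentifier column with a default:
// CREATE TABLE Documents (Id uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
//                         Name nvarchar(260), Content varbinary(max) FILESTREAM)
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO Documents (Name, Content) VALUES (@name, @content)", conn))
{
    cmd.Parameters.AddWithValue("@name", fileName);
    cmd.Parameters.AddWithValue("@content", fileBytes); // the uploaded file as a byte[]
    conn.Open();
    cmd.ExecuteNonQuery();
}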
This depends on how files are uploaded. I.e. if using multipart forms, then examine the content type of each part before the part is saved. You can even create your own MultipartStreamProvider class. Being an API, maybe the upload method has a stream or byte array parameter and a content type parameter; in this case just test the value of the content type parameter before the content is saved. For other upload methods do something similar depending on what the input is.
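For the multipart case, a small sketch of that content type check against the provider used earlier; remember the header comes from the client, so treat it as a sanity check rather than proof of the actual format:
var allowedTypes = new HashSet<string> { "image/jpeg", "image/png" };
foreach (MultipartFileData file in provider.FileData)
{
    var contentType = file.Headers.ContentType; // Content-Type header of this part
    if (contentType == null || !allowedTypes.Contains(contentType.MediaType))
        throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}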
You can use .NET's built-in classes (i.e. Bitmap: SetResolution, RotateFlip; to resize, use a constructor that accepts a size), or if you are not familiar with image processing, rather choose an image processing library.
All of the above works in ASP.NET, MVC, Web API 1 and 2, custom HTTP handlers, basically in any .NET code.
@Binke
Never use string operations on paths. I.e. fileName.split('.')[1] will not return the extension if the file name is like this: some.file.txt, and will fail with an index out of range error if the file has no extension.
Always use the file API, i.e. Path.GetExtension.
Also, using the extension to get the content type is not safe, especially when pictures and videos are involved; just think of the avi extension, which is used by many video formats.
files.Add(Path.GetFileName(file.LocalFileName)) should be files.Add(fileName).
I have a unique set of processes that all need to be automated.
We receive very inconsistent data from our customer, so this requires a lot of responses from not very computer-literate users. I'd go with a console app if it wasn't for that.
That data needs to be transformed and then combined using a few different processes.
I need to create an application that can only be accessed from one person at a time (we don't want to have multiple people building the same data).
All processes can be run on one machine.
A basic outline is the following...
Get all of the zip files from our customer's FTP
Unzip all of these files into the specified directory
Take this data and verify its surface-level integrity
Transform the data to a new format
Import to the database
Build documents based on the data
I know how to write each of these functions; my question is more: should I do this in MVC3 with AJAX updates, WPF, Windows Forms, or straight ASP.NET? I know all of them, I just can't think of which fits this linear processing scheme. The user also needs constant updates on the progress of each file, so any of the ASP.NET derivatives get tricky with AJAX.
I'd recommend just making a console application. Do you need an interface?
Two options: a console application to be launched as a scheduled task, or a Windows Service.
If everything is automated I would create a Windows service to do everything. By doing so you'll also prevent the application from being run more than once simultaneously (unless you install it on several computers).
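If you do go the console/scheduled task route instead, a named mutex gives you the "only one person at a time" guarantee on a single machine; RunPipeline and the mutex name are placeholders:
static void Main()
{
    bool createdNew;
    using (var mutex = new Mutex(true, @"Global\ImportPipeline", out createdNew))
    {
        if (!createdNew)
        {
            Console.WriteLine("Another instance is already running.");
            return;
        }
        // download zips from FTP -> unzip -> verify -> transform -> import -> build documents
        RunPipeline();
    }
}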
I'm planning to develop an application that will read a log file and display statistics.
The first question, I guess, is to know if I need a database or not?
Will it be quicker to run queries against the database, or to read the file each time a user wants to see the statistics?
If I choose the database method, I will have to read the log file and update the database on a regular basis (between 1 and 10 minutes).
Is this article still good do you think (as it's from 2005): http://www.codeproject.com/KB/aspnet/ASPNETService.aspx
Or is it better to develop a Windows service? In that case, can I add the Windows service to my ASP.NET project in Visual Studio, or does it need to be a separate project?
You mentioned ASP.NET so I believe it is a web application. In such a case I would suggest using a database; this is a more robust, flexible and distributed solution.
Either way, consider using log4net; then you can easily switch between file and DB output at any time by simply adding another appender section to the configuration file.
If I choose the database method, I will have to read the log file and
update the database on a regular basis (between 1 and 10 minutes)
Exactly, you're going to have to do it anyway. The Database basically just becomes another bottleneck at that point. For this type of app, there's no need to do anything other than read the file when the user requests to see it, and display them the results on the fly.
No need to have a windows service either. I mean, I don't know all your details, but I'm assuming the log file is in a directory on your machine, so just access it, open it, parse it, and display it to the user when they choose to see it on the front end.
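A minimal sketch of that on-the-fly approach, assuming one entry per line with the severity as the first token (adjust the parsing to your real log format):
var counts = new Dictionary<string, int>();
foreach (var line in File.ReadLines(logPath))
{
    if (string.IsNullOrWhiteSpace(line))
        continue;
    var severity = line.Split(' ')[0]; // e.g. INFO, WARN, ERROR
    int n;
    counts.TryGetValue(severity, out n);
    counts[severity] = n + 1;
}
// counts now holds the statistics to render on the page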
If the only data you are going to work with is log files, you don't need any database.
But I assume that your application will parse the log files, create some statistics and STORE them somewhere, to make it possible for users to come back later and see statistics for some period of time. It is not great if you will be "re-calculating" those statistics every time (furthermore, you might lose the original log files by then).
Even though you could also store it in files, I do not recommend that at all. Don't be afraid of using a database, and don't be concerned about application performance at such an early stage. Do whatever helps you most to solve the problem... and as for me, using a database will solve your problem.
Recently I was working on displaying workflow diagram images in our web application. I managed to use the rehosted WF designer and create images on-the-fly on the server, but imagining how large the workflow diagrams can very quickly become, I wanted to give a better user experience by using some ajax control for displaying images that would support zoom & pan functionality.
I happened to come across the website of seadragon, which seems to be just an amazing piece of work that I could use. There is just one disadvantage - in order to use their library for generating deep zoom versions of images I have to use the file structure on a server. Because of the temporary nature of the images I am using (workflow diagrams with progress indicators), it is important to not only be able to create such images but also to get rid of them after some time.
Now the question is how I can best ensure that the temporary image files and the folder hierarchy can be created on the server (ASP.NET web app), and later cleaned up. I was thinking of using the cache functionality and, on expiration of a cache item, deleting the corresponding image folder hierarchy, or simply deleting the content of the whole temporary folder in the Application_Start and Application_End of Global.asax, but I'm not really sure whether this is a good idea and whether there are some security restrictions or file-system-related troubles. What do you think?
We do something similar for creating PDF reports and found the easiest way is to use a timestamp check to determine how "old" files are, and then delete them based on a period of time, in our case more than 2 hours old. This is done before the next PDF document is created, but as part of the creation process. We also created a specific folder and gave the ASP.NET user read/write access to it.
The only disadvantage is that if the process of creating PDFs is not used regularly there will be a build-up of files, but they will be cleaned up eventually. In 2 years and close to 4,000 PDFs we have yet to have an error doing it this way.
Use the App_Data folder. This folder is inside your application and writable by your app without having to go outside the context of the app, but it's also secured from casual browsing. It's meant to hold data files for your application.
Application_Start and Application_End will only fire once each, so if you need better cleanup than that, I would consider using a cache structure or a simple windows service to handle the cleanup.
First, you have to make sure your IIS worker process has rights to write/delete files from your cache directory (and NOT the rest of your site, just in case)
Second, I would stay away from using Application_Start and Application_End; Application_End is not 100% guaranteed to fire, and you could end up with a growing pile of orphaned images.
I would instead make a scheduled process, maybe one that runs once per hour or once a day, depending on what you want, and have it check how old each image in your cache is; if it's older than your arbitrary "expiry time", delete it.
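A minimal sketch of that age-based cleanup; the folder path and the two-hour cutoff are assumptions:
var cacheDir = new DirectoryInfo(@"D:\ImageCache");
var cutoff = DateTime.UtcNow.AddHours(-2);
foreach (var file in cacheDir.GetFiles("*", SearchOption.AllDirectories))
{
    if (file.LastWriteTimeUtc >= cutoff)
        continue; // still fresh enough to keep
    try { file.Delete(); }
    catch (IOException) { /* file may still be in use; skip it and retry on the next run */ }
}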
Other than that there's not much to it.