In a C# project that I am currently working on, we're attempting to calculate the MD5 of a large quantity of files over a network (current pot is 2.7 million, client pot may be in excess of 10 million). With the number of files that we are processing, speed is an issue.
The reason we do this is to verify the file was copied to a different location without modification.
We currently use the following code to calculate the MD5 of a file
MD5 md5 = new MD5CryptoServiceProvider();
StringBuilder sb = new StringBuilder();
byte[] hashMD5 = null;

try
{
    // Open a read-only stream to the file and compute its MD5 hash
    using (FileStream fsMD5 = new FileStream(sFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        hashMD5 = md5.ComputeHash(fsMD5);
}
catch (Exception ex)
{
    clsLogging.logError(clsLogging.ErrorLevel.ERROR, ex);
}

string md5sum = "";
if (hashMD5 != null)
{
    // Convert the hash into readable hex text
    foreach (byte hex in hashMD5)
        sb.Append(hex.ToString("x2"));
    md5sum = sb.ToString();
}
However, the speed of this isn't what my manager has been hoping for. We've gone through a number of changes to how, and for how many, files we calculate the MD5 (i.e. we didn't do it for files that we don't copy... until today, when my manager changed his mind, so ALL files must have an MD5 calculated for them, in case at some future time a client wishes to bugger with our program so all files are copied, I guess).
I realize that the speed of the network is probably a major contributing factor (100Mbit/s). Is there an efficient way to calculate the MD5 of the contents of a file over a network?
Thanks in advance.
Trevor Watson
Edit: put all code in block instead of just a part of it.
The bottleneck is that the whole file must be streamed/copied over the network, and your code looks good...
The different hash functions (MD5/SHA-256/SHA-512) have almost the same computation time.
Two possible solutions for this problem:
1) Run a hasher on the remote system and store the hashes into separate files - if that is possible in your environment.
2) Create a part-wise hash of the file, so that you only copy a part of the file.
I mean something like this:
part1Hash = md5(file.getXXXBytesFromFileAtPosition1)
part2Hash = md5(file.getXXXBytesFromFileAtPosition2)
part3Hash = md5(file.getXXXBytesFromFileAtPosition3)
finalHash = part1Hash ^ part2Hash ^ part3Hash;
You have to test which parts of the file are optimal to read, so the hashes stay unique.
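For illustration, here is a rough C# sketch of that part-wise idea (the method name, offsets and chunk size are my own placeholders, not from the original code); it hashes a small chunk at each offset and XORs the 16-byte MD5 results together:

using System.IO;
using System.Security.Cryptography;
using System.Text;

static string PartialHash(string path, long[] offsets, int chunkSize)
{
    byte[] combined = new byte[16]; // MD5 always yields 16 bytes

    using (var md5 = MD5.Create())
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        byte[] buffer = new byte[chunkSize];

        foreach (long offset in offsets)
        {
            fs.Seek(offset, SeekOrigin.Begin);
            int read = fs.Read(buffer, 0, buffer.Length);
            byte[] partHash = md5.ComputeHash(buffer, 0, read);

            // combine the partial hashes with bitwise XOR
            for (int i = 0; i < combined.Length; i++)
                combined[i] ^= partHash[i];
        }
    }

    var sb = new StringBuilder();
    foreach (byte b in combined)
        sb.Append(b.ToString("x2"));
    return sb.ToString();
}

Over a network share, each Seek/Read only pulls those chunks across the wire, which is the whole point of the approach.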
hope that helps...
edit: changed to bitwise xor
One possible approach would be to make use of the Task Parallel Library in .NET 4.0. 100Mbit/s will still be a bottleneck, but you should see a modest improvement.
I wrote a small application last year that walks the top levels of a folder tree checking folder and file security settings. Running over a 10Mbps WAN it took about 7 minutes to complete one of our large file shares. When I parallelised the operation the execution time came down to a bit over 1 minute.
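As a rough sketch of that approach (the file list, degree of parallelism and hex formatting are placeholders, not your code), something like Parallel.ForEach can hash several files at once while the network stays the limiting factor:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

static IDictionary<string, string> HashAll(IEnumerable<string> filePaths)
{
    var results = new ConcurrentDictionary<string, string>();

    Parallel.ForEach(filePaths, new ParallelOptions { MaxDegreeOfParallelism = 8 }, path =>
    {
        // one MD5 instance per file, since MD5 instances are not thread-safe
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(fs);
            results[path] = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    });

    return results;
}

Tuning MaxDegreeOfParallelism matters: a handful of concurrent streams is usually enough to keep a 100Mbit/s link saturated.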
Why don't you try installing a 'client' on each remote machine, which listens on a port and, when signaled, calculates the MD5 hash for the requested files.
The main server will then only need to ask each client to calculate the MD5. Using this distributed approach you will gain the combined speed of all the clients and reduce network congestion.
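A very rough sketch of such an agent, just to illustrate the shape of it (the port and the line-based protocol are placeholders I've made up): it listens on a TCP port, reads one file path per line, and replies with the MD5 hex string computed from the local disk, so the file contents never cross the network.

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Security.Cryptography;

class HashAgent
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Any, 5500); // placeholder port
        listener.Start();

        while (true)
        {
            using (TcpClient client = listener.AcceptTcpClient())
            using (NetworkStream stream = client.GetStream())
            using (var reader = new StreamReader(stream))
            using (var writer = new StreamWriter(stream) { AutoFlush = true })
            {
                string path = reader.ReadLine();             // path of the file to hash
                using (var md5 = MD5.Create())
                using (FileStream fs = File.OpenRead(path))  // local read, no network copy
                {
                    byte[] hash = md5.ComputeHash(fs);
                    writer.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
                }
            }
        }
    }
}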
Related
So, I know it is kinda crazy to report a bug at this point in Azure's life cycle, but I'm out of options. Here we go.
We have a service that you can upload files to and a client that downloads them. That BLOB storage is stuffed with about 27 GB of data.
On a few occasions our users reported that some files were coming back wrong, so we checked our MVC route to see if anything was wrong and found nothing.
So we created a simple console app that loops the download:
public static void Main()
{
    var firstHash = string.Empty;
    var client = new System.Net.WebClient();

    for (int i = 0; i < 5000; i++)
    {
        try
        {
            var date = DateTime.Now.ToString("HH-mm-ss-ffff");
            var destination = @"C:\Users\Israel\Downloads\RO65\BLOB - RO65 -" + date + ".rfa";

            client.DownloadFile("http://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa", destination);

            string hash = string.Empty;
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(destination))
                {
                    hash = Convert.ToBase64String(md5.ComputeHash(stream));
                }
            }

            if (string.IsNullOrEmpty(firstHash))
                firstHash = hash;

            if (hash != firstHash) hash += " ---------------------------------------------";

            Console.WriteLine("i: " + i.ToString() + " = " + hash);
        }
        catch { }
    }
}
So here is the result - every now and then it downloads the wrong file:
The first 1000 downloads were OK, the right file. Out of the blue the BLOB returns a different file, and then goes back to normal.
The only relation I found between the files is the extension and the file size in bytes. The hash is (of course) different.
Any thoughts?
I have tried to rerun your sample code and wasn't able to repro.
Questions:
For the two different versions of the file you are seeing downloaded, have you compared the contents of the two files? I think you said it was two completely different blobs being retrieved - however I wanted to verify that. How large is the delta between the two files?
Are you using RA-GRS with the client library's read-from-secondary retry option, meaning a network glitch could result in the read coming from the secondary region?
Suggestions:
Can you track the ETag of the retrieved files? This allows you to check whether the blob has changed since you first started reading it (see the sketch after these suggestions).
The Storage Service does enable you to explicitly validate the integrity of your objects to check whether they have been modified in transit, potentially due to network issues etc. See the Azure Storage MD5 Overview for more information. The simplest way, however, might just be to use HTTPS, as these validations are already built into HTTPS.
Can you also try to repro using https and let me know if that helps?
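As an illustration of the ETag suggestion (the URL and destination come from your loop; the header handling is my own sketch): WebClient exposes the response headers after each DownloadFile call, so you could log the blob's ETag and Content-MD5 next to the computed hash and see whether a mismatching download also carries a different ETag.

client.DownloadFile("http://myboxfree.blob.core.windows.net/public/91fe9d90-71ce-4036-b711-a5300159abfa.rfa", destination);

// available only after the request has completed
string etag = client.ResponseHeaders["ETag"];
string contentMd5 = client.ResponseHeaders["Content-MD5"];
Console.WriteLine("ETag: " + etag + ", Content-MD5: " + contentMd5);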
I wrote a program in C# that processes about 30 zipped folders which have about 35,000 files in total. My purpose is to read every single file and process its information. As of now, my code extracts all the folders and then reads the files. The problem with this process is that it takes about 15-20 minutes, which is a lot.
I am using the following code to extract files:
void ExtractFile(string zipfile, string path)
{
    ZipFile zip = ZipFile.Read(zipfile);
    zip.ExtractAll(path);
}
The extraction part is the one that takes the most time. I need to reduce this time. Is there a way I can read the contents of the files inside the zipped folders without extracting them? Or does anyone know any other way that can help me reduce the time of this code?
Thanks in advance
You could try reading each entry into a memory stream instead of to the file system:
ZipFile zip = ZipFile.Read(zipfile);

foreach (ZipEntry entry in zip.Entries)
{
    using (MemoryStream ms = new MemoryStream())
    {
        entry.Extract(ms);
        ms.Seek(0, SeekOrigin.Begin);
        // read from the stream
    }
}
Maybe instead of extracting to the hard disk, you should try reading it without extraction, using ZipFile.OpenRead; you would then use the ZipArchiveEntry.Open method.
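A minimal sketch of that approach with System.IO.Compression (zipfile is the same path variable as above; the StreamReader assumes text content):

using (ZipArchive archive = ZipFile.OpenRead(zipfile))
{
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        // Open gives a stream over the entry without writing anything to disk
        using (Stream entryStream = entry.Open())
        using (var reader = new StreamReader(entryStream))
        {
            string contents = reader.ReadToEnd();
            // process contents here
        }
    }
}

(This ZipFile is the one from System.IO.Compression.FileSystem, not the DotNetZip class used in the question, so the two should not be mixed in the same file without aliasing.)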
Also have a look at the CodeFluent Runtime tool, which claims improved performance.
Try breaking your work into separate async methods that you await, started one by one, if any single response takes longer than 50 ms. http://msdn.microsoft.com/en-us/library/hh191443.aspx
If, for example, we have 10 operations that are called one after another, with async/await we can run them in parallel, and the total time will depend only on the server's capacity.
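For example, a hedged sketch of that idea (DoWorkAsync is a placeholder for whatever call produces one response): start all the tasks first, then await them together, so the ten calls overlap instead of running one after another.

static async Task<string[]> RunAllAsync()
{
    var tasks = new List<Task<string>>();

    // start all 10 operations without awaiting them individually
    for (int i = 0; i < 10; i++)
        tasks.Add(DoWorkAsync(i));

    // await them together; total time is roughly the slowest call, not the sum
    return await Task.WhenAll(tasks);
}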
In mono for android I have an app that saves images to local storage for caching purposes. When the app launches it tries to load images from the cache before trying to load them from the web.
I'm currently having a hard time finding a good way to read and load them from local storage.
I'm currently using something equivalent to this:
List<byte> byteList = new List<byte>();

using (System.IO.BinaryReader binaryReader = new System.IO.BinaryReader(context.OpenFileInput("filename.jpg")))
{
    while (binaryReader.BaseStream.IsDataAvailable())
    {
        byteList.Add(binaryReader.ReadByte());
    }
}

return byteList.ToArray();
OpenFileInput() returns a stream that does not give me a length, so I have to read one byte at a time. It also can't seek. This seems to be causing images to load much slower than they ought to. Loading images from Resource.Drawable is almost instantaneous by comparison, but with my method there is a very noticeable pause, maybe 300 ms, for loading an 8 kB file. This seems like a really obvious task to be able to do, but I've tried many solutions and searched a lot for advice, to no avail.
I've also noticed this code seems to crash with an EndOfStream exception when not run on the UI thread.
Any help would be hugely appreciated
What do you intend on doing with the List<byte>? You want to "load images from the cache," but you don't specify what you want to load them into.
If you want to load them into a Android.Graphics.Bitmap, you could use BitmapFactory.DecodeStream(Stream):
Bitmap bitmap = BitmapFactory.DecodeStream(context.OpenFileInput("filename.jpg"));
This would remove the List<byte> intermediary.
If you really need all the bytes (for whatever reason), you can rely on the fact that System.Environment.GetFolderPath(System.Environment.SpecialFolder.Personal) is the same as Context.FilesDir, which is what context.OpenFileInput() will use, permitting:
byte[] bytes = System.IO.File.ReadAllBytes(
Path.Combine (
System.Environment.GetFolderPath(System.Environment.SpecialFolder.Personal),
"filename.jpg"));
However, if this is truly a cache, you should be using Context.CacheDir instead of Context.FilesDir, which is what Path.GetTempPath() returns:
byte[] cachedBytes = System.IO.File.ReadAllBytes(
Path.Combine(System.IO.Path.GetTempPath(), "filename.jpg"));
I've got the following problem. I upload CSV and Excel files via a WCF service. Hash calculation only works for CSV files; with XLS files I get a different value with every upload.
Hash Calculation:
using (FileStream file = new FileStream(datei.FullName, FileMode.Open))
{
    var sha1 = new SHA1CryptoServiceProvider();
    byte[] retVal = sha1.ComputeHash(file);

    var sb = new StringBuilder();
    foreach (var b in retVal)
        sb.Append(b.ToString("x2"));

    return sb.ToString();
}
Does anybody know where the problem might be located? Is it a problem with the binary xls file format?
Any help is deeply appreciated.
Marius
I strongly suspect the file is actually different each time. That's easy enough to check though - there are various free tools around to perform checksums/hashes. You could pick SHA1 and compare it with your own results, or use an MD5 tool etc.
Try running it both client side and server side - that way you'll be able to verify that the file itself hasn't been corrupted in transit.
Once you've worked out exactly where and when the file has changed, you'll need to decide what to do about it. For example, if Excel is adding a timestamp, you may want to mask that out when computing the hash.
I need to develop a WinForms app which will be able to decrypt a media file (a movie) and then play it without saving the decrypted file to the HDD (the decrypted file will ultimately be held in a memory stream). The problem is, how do I then play that movie from the memory stream? Is it possible?
It is possible, but I expect you will need to write your own DirectShow filter to do so, which once created will act as a file reader (implementing the IFileSourceFilter interface), and, as the video plays, will read successive frames from the file, decrypt them, and pass them up to the next filter.
This will only work, however, if the file is encrypted in a sequential form (i.e. each individual frame is encrypted as a separate entity). Otherwise, you will have to decrypt the entire file at once, which could be intensive and slow, and would probably have to hit the hard drive to store the resulting file.
But anyway, this link should get you started: http://msdn.microsoft.com/en-us/library/dd375454%28VS.85%29.aspx
I'm afraid that in order to create the DirectShow filter, you will need to use C++, and it isn't the easiest API to get your head around.
An alternate way to do it may be to use the Windows Media Format SDK, which allows you to pass custom video packets to a renderer in real time. There is also a good interop library for C# (WindowsMediaLib)
First of all, it's a good idea to encrypt the source video piece by piece, so the encrypted video file is a set of encrypted parts. Just split the original file into parts of the same size and encrypt each one.
Here is the scheme (OutputStream is the encrypted video file stream, InputStream is the original file stream, ChunkSize is the size of each part in the original file; we also write some metadata: the sizes of the original and encrypted pieces):
using (BinaryWriter Writer = new BinaryWriter(OutputStream))
{
    byte[] Buf = new byte[ChunkSize];
    List<int> SourceChunkSizeList = new List<int>();
    List<int> EncryptedChunkSizeList = new List<int>();
    int ReadBytes;

    while ((ReadBytes = InputStream.Read(Buf, 0, Buf.Length)) > 0)
    {
        byte[] EncryptedData = Encrypt(Buf, ReadBytes);
        OutputStream.Write(EncryptedData, 0, EncryptedData.Length);
        SourceChunkSizeList.Add(ReadBytes);
        EncryptedChunkSizeList.Add(EncryptedData.Length);
    }

    foreach (int SourceChunkSize in SourceChunkSizeList)
        Writer.Write(SourceChunkSize);

    foreach (int EncryptedChunkSize in EncryptedChunkSizeList)
        Writer.Write(EncryptedChunkSize);
}
Such metadata will help you find the right encrypted part quickly.
Secondly, don't decrypt the encrypted data on each read request. Cache it: video playback is in most cases just sequential reading.
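As a rough illustration of using that metadata (the method and parameter names are my own, based on the lists written above): given a byte offset requested by the player, you can walk the chunk-size lists to find which encrypted part to decrypt and cache.

// Sketch: locate the chunk that contains sourceOffset and report where it
// starts in both the original and the encrypted file.
static int FindChunk(long sourceOffset,
                     IList<int> sourceChunkSizes,
                     IList<int> encryptedChunkSizes,
                     out long sourceChunkStart,
                     out long encryptedChunkStart)
{
    sourceChunkStart = 0;
    encryptedChunkStart = 0;

    for (int i = 0; i < sourceChunkSizes.Count; i++)
    {
        if (sourceOffset < sourceChunkStart + sourceChunkSizes[i])
            return i;   // decrypt chunk i once, cache it, serve reads from the cache

        sourceChunkStart += sourceChunkSizes[i];
        encryptedChunkStart += encryptedChunkSizes[i];
    }

    return -1;  // offset is past the end of the file
}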
The tricky part is how to play the encrypted video file. You may either write a DirectShow filter (a video-specific solution), or check out a 3rd-party product (a multipurpose solution): BoxedApp, a virtualization SDK. What's cool is that they have an article that shows how to solve exactly your task: http://boxedapp.com/encrypted_video_streaming.html