We're writing an application to move content from a OneDrive account into Azure Storage. We've managed to get this working, but ran into memory issues working with big files (> 1 GB) and Block Blobs. We've decided that Append Blobs are the best way forward, as that will solve the memory issues.
We're using an RPC call to SharePoint to get the file stream for big files; more info can be found here:
http://sharepointfieldnotes.blogspot.co.za/2009/09/downloading-content-from-sharepoint-let.html
Using the following code is working fine when writing the file from OneDrive to local storage
using (var strOut = System.IO.File.Create("path"))
using (var sr = wReq.GetResponse().GetResponseStream())
{
byte[] buffer = new byte[16 * 1024];
int read;
bool isHtmlRemoved = false;
while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
{
if (!isHtmlRemoved)
{
string result = Encoding.UTF8.GetString(buffer, 0, read);
int startPos = result.IndexOf("</html>");
if (startPos > -1)
{
//skip past "</html>" plus the trailing newline
startPos += 8;
strOut.Write(buffer, startPos, read - startPos);
isHtmlRemoved = true;
}
}
else
{
strOut.Write(buffer, 0, read);
}
}
}
This creates the file with the correct size, but when we try to write it to an append blob in Azure Storage, we are not getting the complete file, and in other cases we get bigger files.
using (var sr = wReq.GetResponse().GetResponseStream())
{
byte[] buffer = new byte[16 * 1024];
int read;
bool isHtmlRemoved = false;
while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
{
if (!isHtmlRemoved)
{
string result = Encoding.UTF8.GetString(buffer, 0, read);
int startPos = result.IndexOf("</html>");
if (startPos > -1)
{
//skip past "</html>" plus the trailing newline
startPos += 8;
//strOut.Write(buffer, startPos, read - startPos);
appendBlob.UploadFromByteArray(buffer, startPos, read - startPos);
isHtmlRemoved = true;
}
}
else
{
//strOut.Write(buffer, 0, read);
appendBlob.AppendFromByteArray(buffer, 0, read);
}
}
}
Is this the correct way of doing it? Why would we be getting different file sizes?
Any suggestions will be appreciated
Thanks
In response to "Why would we be getting different file sizes?":
From the CloudAppendBlob.AppendFromByteArray documentation:
"This API should be used strictly in a single writer scenario
because the API internally uses the append-offset conditional header
to avoid duplicate blocks which does not work in a multiple writer
scenario." If you are indeed using a single writer, you need to
explicitly set the value of
BlobRequestOptions.AbsorbConditionalErrorsOnRetry to true.
You can also check if you are exceeding the 50,000 committed block
limit. Your block sizes are relatively small, so this is a
possibility with sufficiently large files (> 16KB * 50,000 = .82
GB).
In response to "Is this the correct way of doing it?":
If you feel you need to use Append Blobs, try using the CloudAppendBlob.OpenWrite method to achieve functionality more similar to your code example for local storage.
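For example, something like this (an untested sketch reusing the variables from the question's code; `OpenWrite(true)` creates the append blob, and the blob stream buffers and commits append blocks for you):

```csharp
// Sketch only: wReq and appendBlob are assumed from the question's code.
using (var blobStream = appendBlob.OpenWrite(true))
using (var sr = wReq.GetResponse().GetResponseStream())
{
    byte[] buffer = new byte[16 * 1024];
    int read;
    bool isHtmlRemoved = false;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
    {
        if (!isHtmlRemoved)
        {
            string result = Encoding.UTF8.GetString(buffer, 0, read);
            int startPos = result.IndexOf("</html>");
            if (startPos > -1)
            {
                // Skip past "</html>" plus the trailing newline.
                startPos += 8;
                blobStream.Write(buffer, startPos, read - startPos);
                isHtmlRemoved = true;
            }
        }
        else
        {
            blobStream.Write(buffer, 0, read);
        }
    }
}
```

This mirrors your local-storage loop one-for-one, with the blob stream standing in for the `FileStream`.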
Block Blobs seem like they might be a more appropriate fit for your scenario. Can you please post the code you were using to upload Block Blobs? You should be able to upload to Block Blobs without running out of memory. You can upload different blocks in parallel to achieve faster throughput. Using Append Blobs to append (relatively) small blocks will result in degradation of sequential read performance, as currently append blocks are not defragmented.
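In the meantime, here is a minimal sketch of a tuned Block Blob upload with this client library (`container` and `wReq` are assumed from the question, and the HTML-prefix stripping from your loop is omitted for brevity):

```csharp
// Sketch only: 'container' and 'wReq' are assumed from the question.
CloudBlockBlob blockBlob = container.GetBlockBlobReference("bigfile.dat");
blockBlob.StreamWriteSizeInBytes = 4 * 1024 * 1024; // 4 MB per block
var options = new BlobRequestOptions
{
    // Upload up to 4 blocks concurrently.
    ParallelOperationThreadCount = 4
};
using (var sr = wReq.GetResponse().GetResponseStream())
{
    // Uploads block by block; memory use stays bounded by the block size
    // times the parallelism, not by the file size.
    blockBlob.UploadFromStream(sr, accessCondition: null, options: options);
}
```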
Please let me know if any of these solutions work for you!
Related
I am using this code to extract a chunk from file
// info is FileInfo object pointing to file
var percentSplit = info.Length * 50 / 100; // extract 50% of file
var bytes = new byte[percentSplit];
var fileStream = File.OpenRead(fileName);
fileStream.Read(bytes, 0, bytes.Length);
fileStream.Dispose();
File.WriteAllBytes(splitName, bytes);
Is there any way to speed up this process?
Currently for a 530 MB file it takes around 4 - 5 seconds. Can this time be improved?
There are several aspects to your question, but none of them is language-specific.
Here are some things to consider:
What is the file system of the source/destination file?
Do you want to keep the original source file?
Do they lie on the same drive?
In C#, you will hardly find a method faster than File.Copy, which invokes the Win32 CopyFile function internally. Because the percentage is fifty, however, the following code might not be faster: it copies the whole file and then sets the length of the destination file.
var info=new FileInfo(fileName);
var percentSplit=info.Length*50/100; // extract 50% of file
File.Copy(info.FullName, splitName);
using(var outStream=File.OpenWrite(splitName))
outStream.SetLength(percentSplit);
Further, if
you don't keep the original source after the file is split
the destination drive is the same as the source
you are not using a crypto/compression-enabled file system
then the best thing you can do is not copy files at all.
For example, if your source file lies on FAT or FAT32 file system, what you can do is
create new dir entry(entries) for newly splitted parts of file
let the entry(entries) point(s) to the cluster of target part(s)
set correct file size for each entry
check for cross-link and avoid that
If your file system is NTFS, you might need to spend a long time studying the spec.
Good luck!
var percentSplit = (int)(info.Length * 50 / 100); // extract 50% of file
var buffer = new byte[8192];
using (Stream input = File.OpenRead(info.FullName))
using (Stream output = File.OpenWrite(splitName))
{
int bytesRead = 1;
while (percentSplit > 0 && bytesRead > 0)
{
bytesRead = input.Read(buffer, 0, Math.Min(percentSplit, buffer.Length));
output.Write(buffer, 0, bytesRead);
percentSplit -= bytesRead;
}
output.Flush();
}
The flush may not be needed, but it doesn't hurt. Interestingly, changing the loop to a do-while rather than a while had a big hit on performance; I suppose the IL is not as fast. My PC ran the original code in 4-6 seconds, while the attached code ran in about 1 second.
I get better results when reading/writing in chunks of a few megabytes. Performance also varies with the size of the chunk.
FileInfo info = new FileInfo(@"C:\source.bin");
FileStream f = File.OpenRead(info.FullName);
BinaryReader br = new BinaryReader(f);
FileStream t = File.OpenWrite(@"C:\split.bin");
BinaryWriter bw = new BinaryWriter(t);
long count = 0;
long split = info.Length * 50 / 100;
long chunk = 8000000;
DateTime start = DateTime.Now;
while (count < split)
{
if (count + chunk > split)
{
chunk = split - count;
}
bw.Write(br.ReadBytes((int)chunk));
count += chunk;
}
Console.WriteLine(DateTime.Now - start);
I am developing a TCP file transfer client-server program. At the moment I am able to send text files and other file formats perfectly fine, such as .zip with all contents intact on the server end. However, when I transfer a .gif the end result is a gif with same size as the original but with only part of the image showing as if most of the bytes were lost or not written correctly on the server end.
The client sends a 1KB header packet with the name and size of the file to the server. The server then responds with OK if ready and then creates a fileBuffer as large as the file to be sent is.
Here is some code to demonstrate my problem:
// Serverside method snippet dealing with data being sent
while (true)
{
// Spin the data in
if (streams[0].DataAvailable)
{
streams[0].Read(fileBuffer, 0, fileBuffer.Length);
break;
}
}
// Finished receiving file, write from buffer to created file
FileStream fs = File.Open(LOCAL_FOLDER + fileName, FileMode.CreateNew, FileAccess.Write);
fs.Write(fileBuffer, 0, fileBuffer.Length);
fs.Close();
Print("File successfully received.");
// Clientside method snippet dealing with a file send
while(true)
{
con.Read(ackBuffer, 0, ackBuffer.Length);
// Wait for OK response to start sending
if (Encoding.ASCII.GetString(ackBuffer) == "OK")
{
// Convert file to bytes
FileStream fs = new FileStream(inPath, FileMode.Open, FileAccess.Read);
fileBuffer = new byte[fs.Length];
fs.Read(fileBuffer, 0, (int)fs.Length);
fs.Close();
con.Write(fileBuffer, 0, fileBuffer.Length);
con.Flush();
break;
}
}
I've tried a binary writer instead of just using the filestream with the same result.
Am I incorrect in believing successful file transfer to be as simple as conversion to bytes, transportation and then conversion back to filename/type?
All help/advice much appreciated.
It's not about your image; it's about your code.
If your image bytes were lost or not written correctly, that means your file transfer code is wrong, and even a .zip or any other file would arrive corrupted.
It's a huge mistake to set the byte buffer length to the file size. Imagine sending a large file of about 1 GB: it would take 1 GB of RAM. For an ideal transfer you should loop over the file while sending.
Here's a way to send/receive files nicely with no size limitation.
Send File
using (FileStream fs = new FileStream(srcPath, FileMode.Open, FileAccess.Read))
{
long fileSize = fs.Length;
long sum = 0; //sum here is the total of sent bytes.
int count = 0;
data = new byte[1024 * 8]; //8 KB buffer .. you might use a smaller size also.
while (sum < fileSize)
{
count = fs.Read(data, 0, data.Length);
network.Write(data, 0, count);
sum += count;
}
network.Flush();
}
Receive File
long fileSize = // your file size that you are going to receive it.
using (FileStream fs = new FileStream(destPath, FileMode.Create, FileAccess.Write))
{
int count = 0;
long sum = 0; //sum here is the total of received bytes.
data = new byte[1024 * 8]; //8Kb buffer .. you might use a smaller size also.
while (sum < fileSize)
{
// Read blocks until data arrives, so no busy-wait is needed.
count = network.Read(data, 0, data.Length);
fs.Write(data, 0, count);
sum += count;
}
}
happy coding :)
When you write over TCP, the data can arrive in a number of packets. I think your early tests happened to fit into one packet, but this gif file is arriving in 2 or more. So when you call Read, you'll only get what's arrived so far - you'll need to check repeatedly until you've got as many bytes as the header told you to expect.
I found Beej's guide to network programming a big help when doing some work with TCP.
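A sketch of such a length-aware receive loop (assuming `fileSize` was parsed from your 1 KB header packet and `stream` is the server's NetworkStream):

```csharp
// Keep reading until exactly fileSize bytes have arrived.
byte[] fileBuffer = new byte[fileSize];
int totalRead = 0;
while (totalRead < fileSize)
{
    int read = stream.Read(fileBuffer, totalRead, fileSize - totalRead);
    if (read == 0)
        throw new IOException("Connection closed before the full file arrived.");
    totalRead += read;
}
```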
As others have pointed out, the data doesn't necessarily all arrive at once, and your code is overwriting the beginning of the buffer each time through the loop. The more robust way to write your reading loop is to read as many bytes as are available and increment a counter to keep track of how many bytes have been read so far so that you know where to put them in the buffer. Something like this works well:
int totalBytesRead = 0;
int bytesRead;
do
{
bytesRead = streams[0].Read(fileBuffer, totalBytesRead, fileBuffer.Length - totalBytesRead);
totalBytesRead += bytesRead;
} while (bytesRead != 0);
Stream.Read will return 0 when there's no data left to read.
Doing things this way will perform better than reading a byte at a time. It also gives you a way to ensure that you read the proper number of bytes. If totalBytesRead is not equal to the number of bytes you expected when the loop is finished, then something bad happened.
Thanks for your input Tvanfosson. I tinkered around with my code and managed to get it working. The synchronicity between my client and server was off. I took your advice though and replaced read with reading a byte one at a time.
I'm currently developing a torrent metainfo management library for Ruby.
I'm having trouble reading the pieces from the files. I just don't understand how I'm supposed to go about it. I know I'm supposed to SHA1 digest piece length bytes of a file once (or read piece length bytes multiple times, or what?)
I'm counting on your help.
Pseudo / Python / Ruby / PHP code preferred.
Thanks in advance.
C#
// Open the file
using (var file = File.Open(...))
{
// Move to the relevant place in the file where the piece begins
file.Seek(piece * pieceLength, SeekOrigin.Begin);
// Attempt to read up to pieceLength bytes from the file into a buffer
byte[] buffer = new byte[pieceLength];
int totalRead = 0;
while (totalRead < pieceLength)
{
var read = file.Read(buffer, totalRead, pieceLength - totalRead);
if (read == 0)
{
// the piece is smaller than the pieceLength,
// because it’s the last in the file
Array.Resize(ref buffer, totalRead);
break;
}
totalRead += read;
}
// If you want the raw data for the piece:
return buffer;
// If you want the SHA1 hashsum:
return SHA1.Create().ComputeHash(buffer);
}
Please take a look at this distribution here:
http://prdownload.berlios.de/torrentparse/TorrentParse.GTK.0.21.zip
Written in PHP, it contains an encoder and a decoder, and the ins and outs, I believe!
I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and length as the number of bytes. I'm using the following code:
private void copy(string srcFile, string dstFile, int offset, int length)
{
BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
reader.BaseStream.Seek(offset, SeekOrigin.Begin);
byte[] buffer = reader.ReadBytes(length);
BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
writer.Write(buffer);
}
Considering that I have to call this function about 100,000 times, it is remarkably slow.
Is there a way to make the Writer connected directly to the Reader? (That is, without actually loading the contents into the Buffer in memory.)
I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:
public static void CopySection(Stream input, string targetFile, int length)
{
byte[] buffer = new byte[8192];
using (Stream output = File.OpenWrite(targetFile))
{
int bytesRead = 1;
// This will finish silently if we couldn't read "length" bytes.
// An alternative would be to throw an exception
while (length > 0 && bytesRead > 0)
{
bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
output.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:
public static void CopySection(Stream input, string targetFile,
int length, byte[] buffer)
{
using (Stream output = File.OpenWrite(targetFile))
{
int bytesRead = 1;
// This will finish silently if we couldn't read "length" bytes.
// An alternative would be to throw an exception
while (length > 0 && bytesRead > 0)
{
bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
output.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
Note that this also closes the output stream (due to the using statement) which your original code didn't.
The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.
I think it'll be significantly faster, but obviously you'll need to try it to see...
This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
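Usage would then look something like this (a sketch with a hypothetical `sections` list; the point is that the input stream is opened once and the buffer is reused across all calls):

```csharp
// 'sections' is a hypothetical ordered list of (file name, length)
// pairs describing consecutive chunks of the source file.
byte[] buffer = new byte[8192];
using (Stream input = File.OpenRead(srcFile))
{
    foreach (KeyValuePair<string, int> section in sections)
    {
        // Each call continues from where the previous one stopped.
        CopySection(input, section.Key, section.Value, buffer);
    }
}
```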
The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability, as well as a benchmarking program that looks at different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
How large is length? You may do better to re-use a fixed sized (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.
(edit) something like:
private static void copy(string srcFile, string dstFile, int offset,
int length, byte[] buffer)
{
using(Stream inStream = File.OpenRead(srcFile))
using (Stream outStream = File.OpenWrite(dstFile))
{
inStream.Seek(offset, SeekOrigin.Begin);
int bufferLength = buffer.Length, bytesRead;
while (length > bufferLength &&
(bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
{
outStream.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
while (length > 0 &&
(bytesRead = inStream.Read(buffer, 0, length)) > 0)
{
outStream.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
You shouldn't re-open the source file each time you do a copy; it's better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help to order your seeks, so you don't make big jumps inside the file.
If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:
offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000
can be grouped to one read:
offset = 1234, length = 1074
Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
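The grouping step might be sketched like this (`Section` is a hypothetical class with `Offset` and `Length` fields, `sections` is assumed sorted by offset, and `maxGap` is the largest hole you are willing to read over):

```csharp
// Merge nearby ranges into larger reads.
List<Section> groups = new List<Section>();
foreach (Section s in sections)
{
    if (groups.Count > 0)
    {
        Section last = groups[groups.Count - 1];
        long lastEnd = last.Offset + last.Length;
        if (s.Offset <= lastEnd + maxGap)
        {
            // Extend the previous group to cover this section too.
            long newEnd = Math.Max(lastEnd, s.Offset + s.Length);
            last.Length = newEnd - last.Offset;
            continue;
        }
    }
    groups.Add(new Section { Offset = s.Offset, Length = s.Length });
}
// Each group is then read once; individual files are written from the
// in-memory block at (section.Offset - group.Offset).
```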
Have you considered using the CCR? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes this very easy.
static void Main(string[] args)
{
Dispatcher dp = new Dispatcher();
DispatcherQueue dq = new DispatcherQueue("DQ", dp);
Port<long> offsetPort = new Port<long>();
Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
new Handler<long>(Split)));
FileStream fs = File.Open(file_path, FileMode.Open);
long size = fs.Length;
fs.Dispose();
for (long i = 0; i < size; i += split_size)
{
offsetPort.Post(i);
}
}
private static void Split(long offset)
{
FileStream reader = new FileStream(file_path, FileMode.Open,
FileAccess.Read);
reader.Seek(offset, SeekOrigin.Begin);
long toRead = 0;
if (offset + split_size <= reader.Length)
toRead = split_size;
else
toRead = reader.Length - offset;
byte[] buff = new byte[toRead];
reader.Read(buff, 0, (int)toRead);
reader.Dispose();
File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}
This code posts offsets to a CCR port which causes a Thread to be created to execute the code in the Split method. This causes you to open the file multiple times but gets rid of the need for synchronization. You can make it more memory efficient but you'll have to sacrifice speed.
The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?
Over 100,000 accesses (sum the times):
How much time is spent allocating the buffer array?
How much time is spent opening the file for read (is it the same file every time?)
How much time is spent in read and write operations?
If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it: do you get identical output? Does it save time?)
Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 min 30 seconds). I generate three files totaling 700+ megabytes from one file using that technique.
Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.
If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.
Has no one suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files; this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course you should verify first that a sequential approach is not adequate.
(For future reference.)
Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).
Memory Mapped files are supported in managed code in .NET 4.0.
But as noted, you need to profile, and expect to switch to native code for maximum performance.
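A single section copy via a memory-mapped view might look like this (a sketch using the .NET 4.0 API; `srcFile`, `dstFile`, `offset`, and `length` are assumed as in the question):

```csharp
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
using (var view = mmf.CreateViewStream(offset, length,
    MemoryMappedFileAccess.Read))
using (var output = File.Create(dstFile))
{
    // The OS pages the mapped region in on demand; there is never a
    // full-section buffer allocated by this code.
    view.CopyTo(output);
}
```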
I'm trying to upload very large (>100 GB) blobs to Azure using Microsoft.Azure.Storage.Blob (9.4.2). However, it appears that even when using the stream-based blob write API, the library allocates memory proportional to the size of the file (a 1.2 GB test file results in a 2 GB process memory footprint). I need this to work in constant memory. My code is below (similar results using UploadFromFile, UploadFromStream, etc.):
var container = new CloudBlobContainer(new Uri(sasToken));
var blob = container.GetBlockBlobReference("test");
const int bufferSize = 64 * 1024 * 1024; // 64MB
blob.StreamWriteSizeInBytes = bufferSize;
using (var writeStream = blob.OpenWrite())
{
using (var readStream = new FileStream(archiveFilePath, FileMode.Open))
{
var buffer = new byte[bufferSize];
var bytesRead = 0;
while ((bytesRead = readStream.Read(buffer, 0, bufferSize)) != 0)
{
writeStream.Write(buffer, 0, bytesRead);
}
}
}
This behavior is pretty baffling - I can see in TaskMgr that the upload indeed starts right away, so it's not like it's buffering things up waiting to send; there is no reason why it needs to hang on to previously sent data. How does anyone use this API for non-trivial blob uploads?
I suggest you take a look at the BlobStorageMultipartStreamProvider sample, as it shows how a request stream can be "forwarded" to an Azure Blob stream, which might reduce the amount of memory used on the server side while uploading.
Hope it helps!
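If the sample doesn't map cleanly onto your scenario, another constant-memory approach with the same client library is to stage blocks yourself via PutBlock/PutBlockList (a sketch reusing `blob` and `archiveFilePath` from your code; block IDs must all be Base64 strings of equal length):

```csharp
// Upload the file block by block; only one buffer is live at a time.
var blockIds = new List<string>();
var buffer = new byte[4 * 1024 * 1024]; // 4 MB per block
using (var readStream = File.OpenRead(archiveFilePath))
{
    int blockNum = 0;
    int bytesRead;
    while ((bytesRead = readStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Fixed-width IDs: 4 bytes -> constant-length Base64.
        string blockId = Convert.ToBase64String(
            BitConverter.GetBytes(blockNum++));
        using (var ms = new MemoryStream(buffer, 0, bytesRead))
        {
            blob.PutBlock(blockId, ms, null);
        }
        blockIds.Add(blockId);
    }
}
// Commit the uploaded blocks in order.
blob.PutBlockList(blockIds);
```

At 4 MB per block this stays well under the 50,000-block limit even for multi-hundred-gigabyte files.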