"Where are my bytes?" or Investigation of file length traits

This is a continuation of my question about downloading files in chunks. The explanation is fairly long, so I'll divide it into several parts.
1) What was I trying to do?
I was creating a download manager for a Windows Phone application. First, I tried to solve the problem of downloading
large files (the explanation is in the previous question). Now I want to add a "resumable download" feature.
2) What I've already done.
At the moment I have a working download manager that gets around the Windows Phone RAM limit.
The idea is that it downloads the file in small consecutive chunks, using the HTTP Range header.
A quick explanation of how it works:
The file is downloaded in chunks of constant size. Let's call this size "delta". After a chunk has been downloaded,
it is saved to local storage (the hard disk; on WP it's called Isolated Storage) in Append mode (so the downloaded byte array is
always added to the end of the file). After downloading a single chunk, the condition
if (mediaFileLength >= delta) // mediaFileLength is a length of downloaded chunk
is checked. If it's true, that
means there's something left to download, and this method is invoked recursively. Otherwise, this chunk
was the last one, and there's nothing left to download.
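The loop described above can be sketched roughly as follows. This is only an illustration under assumptions: fetchChunk is a hypothetical stand-in for the real HTTP request with a Range header, it is assumed to return an empty array when asked for bytes past the end of the file, and the recursion from the post is written as a plain loop.

```csharp
using System;
using System.IO;

static class ChunkedDownload
{
    // fetchChunk(offset, count) is a hypothetical stand-in for an HTTP request
    // sent with "Range: bytes=offset-(offset+count-1)". It is assumed to return
    // fewer than count bytes (possibly zero) once the end of the file is reached.
    public static long Download(Func<long, int, byte[]> fetchChunk,
                                Stream target, int delta, long startOffset)
    {
        long offset = startOffset;
        while (true)
        {
            byte[] chunk = fetchChunk(offset, delta);
            target.Write(chunk, 0, chunk.Length); // append the chunk to local storage
            offset += chunk.Length;
            if (chunk.Length < delta)
                break; // a short chunk means nothing is left to download
        }
        return offset; // total number of bytes requested so far
    }

    static void Main()
    {
        byte[] remote = new byte[25]; // pretend this is the remote file
        Func<long, int, byte[]> fetch = (offset, count) =>
        {
            int n = (int)Math.Max(0, Math.Min(count, remote.Length - offset));
            byte[] chunk = new byte[n];
            Array.Copy(remote, (int)offset, chunk, 0, n);
            return chunk;
        };

        using (var local = new MemoryStream())
        {
            long end = Download(fetch, local, 10, 0);
            Console.WriteLine(end);          // 25
            Console.WriteLine(local.Length); // 25
        }
    }
}
```

Passing a non-zero startOffset is what a resume would do, which is why the number of completed chunks matters in the rest of the question.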
3) What's the problem?
As long as I used this logic for one-time downloads (by one-time I mean that you start downloading the file and wait until the download is finished),
it worked well. However, I decided that I need a "resume download" feature. So, the facts:
3.1) I know that the file chunk size is constant.
3.2) I know whether the file has been completely downloaded or not. (That's an indirect result of my app logic;
I won't weary you with the explanation, just take it as a fact.)
From these two statements I can conclude that the number of downloaded chunks is equal to
(CurrentFileLength)/delta, where CurrentFileLength is the size of the already downloaded file in bytes.
To resume downloading the file I should simply set the required headers and invoke the download method. That seems logical, doesn't it? So I tried to implement it:
// Check file size
using (IsolatedStorageFileStream fileStream = isolatedStorageFile.OpenFile("SomewhereInTheIsolatedStorage", FileMode.Open, FileAccess.Read))
{
int currentFileSize = Convert.ToInt32(fileStream.Length);
int currentFileChunkIterator = currentFileSize / delta;
}
And what do I see as a result? The downloaded file length is equal to 2432000 bytes (delta is 304160, the total file size is about 4.5 MB, and we've downloaded only half of it). So the result is
approximately 7.995. (It actually has an integer type, so it's 7, and it should be 8 instead!) Why is this happening?
Simple math tells us that the file length should be 8 * 304160 = 2433280, so the given value is very close, but not equal.
Further investigation showed that all values returned by fileStream.Length are inaccurate, but all are close.
Why is this happening? I don't know precisely, but perhaps the .Length value is taken from the file metadata.
Perhaps such rounding is normal for this property. Perhaps, when the download was interrupted, the file wasn't saved completely... (no, that's pure fantasy, it can't be).
So the problem is "how to determine the number of chunks downloaded". The question is how to solve it.
4) My thoughts about solving the problem.
My first thought was to use maths here: set some epsilon-neighborhood and use it in the currentFileChunkIterator = currentFileSize / delta; statement.
But that requires us to keep type I and type II errors in mind (or false alarm and miss, if you don't like the statistics terms). Perhaps there's nothing left to download.
Also, I didn't check whether the difference between the reported value and the true value keeps growing
or fluctuates cyclically. With small sizes (about 4-5 MB) I've seen only growth, but that doesn't prove anything.
So I'm asking for help here, as I don't like my solution.
5) What I would like to hear as an answer:
What causes the difference between the real value and the received value?
Is there a way to get the true value?
If not, is my solution good for this problem?
Are there other, better solutions?
P.S. I won't add a Windows Phone tag, because I'm not sure that this problem is OS-related. I used the Isolated Storage Tool
to check the size of the downloaded file, and it showed me the same value as the one I received (I'm sorry about the Russian in the screenshot):

In answer to your update:
This is my understanding so far: the length recorded for the file is larger (rounded up to the next 1 KiB) than the amount of data you actually wrote to it. This makes your assumption of "file.Length == amount downloaded" wrong.
One solution would be to track this information separately. Create some meta-data structure (which can be persisted using the same storage mechanism) to accurately track which blocks have been downloaded, as well as the entire size of the file:
[DataContract] //< I forgot how serialization on the phone works, please forgive me if the tags differ
struct Metadata
{
[DataMember]
public int Length;
[DataMember]
public int NumBlocksDownloaded;
}
This would be enough to reconstruct which blocks have been downloaded and which have not, assuming that you keep downloading them in a consecutive fashion.
edit
Of course you would have to change your code from a simple append to moving the position of the stream to the correct block, before writing the data to the stream:
file.Position = currentBlock * delta;
file.Write(block, 0, block.Length);

Just to point out a possible bug: don't forget to verify whether the file was modified on the server between requests, especially when a long time passes between them, which can occur on pause/resume.
The error could be significant: the file might be changed to a smaller size, making your chunk count wrong, or it might keep the same size but with modified contents, which would leave you with a corrupted file.

Have you heard the anecdote about a noob programmer and 10 guru programmers? The gurus were trying to find an error in his solution, and the noob had already found it but didn't tell anyone, because it was something so stupid that he was afraid of being laughed at.
Why did I remember this? Because the situation is similar.
The explanation in my question was very heavy, and I decided not to mention some small aspects that I was sure worked correctly. (And they really did work correctly.)
One of these small aspects was the fact that the downloaded file was encrypted with AES using PKCS7 padding. Well, the decryption worked correctly, I knew it, so why should I mention it? And I didn't.
So then I tried to find out what exactly causes the error with the last chunk. The most credible theory was a problem with buffering, and I tried to find where I was losing the missing bytes. I tested again and again, but I couldn't find them, as every chunk was saved without any losses. And one day I realized:
There is no spoon
There is no error.
What does AES with PKCS7 padding mean for us? Well, the main thing is that the decrypted file is smaller than the encrypted one. Not by much, only by 16 bytes per decrypted part. And that was accounted for in my decryption method and download method, so there should be no problem, right?
But what happens when the download process is interrupted? The last chunk will be saved correctly; there will be no errors with buffering or anything else. And then we want to continue the download. The number of downloaded chunks will be calculated as currentFileChunkIterator = currentFileSize / delta;
And here I should ask myself: "Why are you trying to do something THAT stupid?"
"The size of one downloaded chunk, as saved, is not delta. Actually, it's less than delta." (Decryption makes each part 16 bytes smaller, remember?)
The delta itself consists of 10 equal parts that are decrypted separately. So we should divide not by delta, but by (delta - 16 * 10), which is (304160 - 160) = 304000.
I sense a rat here. Let's try to find out the number of the downloaded chunks:
2432000 / 304000 = 8. Wait... OH SHI~
So, that's the end of story.
The whole solution logic was right.
The only reason it failed was my assumption that, for some reason, the size of the downloaded, decrypted file should be the same as the sum of the downloaded encrypted chunks.
And, of course, since I didn't mention the decryption (it's mentioned only in the previous question, which is only linked), none of you could give me a correct answer. I'm terribly sorry about that.
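The arithmetic in this answer can be checked with a few lines of C#. The numbers are the ones quoted in the post; the class and member names are made up for illustration, and ResumeOffset is my own inference about which byte the next Range request should start from, not something the post states.

```csharp
using System;

static class ChunkMath
{
    // Values quoted in the post; the names are illustrative.
    public const int EncryptedChunkSize = 304160; // "delta": bytes fetched per request
    public const int PartsPerChunk = 10;          // each chunk is decrypted as 10 parts
    public const int PaddingPerPart = 16;         // PKCS7 padding removed from each part

    // Size of one chunk once it has been decrypted and appended to the local file.
    public static int DecryptedChunkSize
    {
        get { return EncryptedChunkSize - PartsPerChunk * PaddingPerPart; }
    }

    // Number of complete chunks already present in the decrypted local file.
    public static long CompletedChunks(long decryptedFileLength)
    {
        return decryptedFileLength / DecryptedChunkSize;
    }

    // Assumed offset into the encrypted remote file at which to resume.
    public static long ResumeOffset(long decryptedFileLength)
    {
        return CompletedChunks(decryptedFileLength) * EncryptedChunkSize;
    }

    static void Main()
    {
        Console.WriteLine(DecryptedChunkSize);           // 304000
        Console.WriteLine(CompletedChunks(2432000));     // 8
        Console.WriteLine(2432000 / EncryptedChunkSize); // 7 (the original, wrong count)
        Console.WriteLine(ResumeOffset(2432000));        // 2433280
    }
}
```

Dividing the local (decrypted) length by the encrypted chunk size mixes two different units, which is where the 7.995 came from.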

Continuing from my comment:
The original file size, as I understand from your description, is 2432000 bytes.
The chunk size is set to 304160 bytes (the "delta").
So the machine that sent the file was able to fill 7 chunks and sent them.
The receiving machine now has 7 x 304160 bytes = 2129120 bytes.
The last chunk will not be filled to the end, as there are not enough bytes left to fill it, so it will contain 2432000 - 2129120 = 302880 bytes, which is less than 304160.
If you add the numbers you get 7 x 304160 + 1 x 302880 = 2432000 bytes.
So according to that, the original file was transferred in full to the destination.
The problem is that you are calculating 8 x 304160 = 2433280, insisting that even the last chunk must be filled completely. But with what, and why?
With all due humility: are you caught in some kind of math confusion, or did I misunderstand your problem?
Please answer: what is the original file size, and what size is being received at the other end? (Totals!)

Related

Why is my encoding showing twice?

byte[] lengthBytes = new byte[4];
serverStream.Read(lengthBytes, 0, 4);
MessageBox.Show("'>>" + System.Text.Encoding.UTF8.GetString(lengthBytes) + "<<'");
MessageBox.Show("Hello");
This is the code I used for debugging. I get 2 message boxes from the first call now. When I used Debug.WriteLine, it was also printed twice.
Msgbox 1: '>>/ (Note that this is still 4 characters long; the last 3 bytes are null.)
Msgbox 2: '>>{"ac<<'
Msgbox 3: Hello
I'm trying to send 4 bytes containing an integer, the length of the message. This is going fine ('/' is UTF-8 for 47). The problem is that the first 4 bytes of the message are also being read ('{"ac'). I have no idea how this happens. I've been debugging this for several hours and I just can't get my head around it. One of my friends suggested making an account on StackOverflow, so here I am :p
Thanks for all the help :)
EDIT: The real code for the people who asked
My code http://kutj.es/2ah-j9

You are making traditional programmer mistakes; everybody has to make them once to learn how to avoid them and do it right. This primarily went off the rails because you wrote debugging code that is itself buggy, which made it a lot harder to find your mistake:
Never write debugging code that uses MessageBox.Show(). It is a very, very evil function because it causes re-entrancy. That's an expensive word that means it only freezes the user interface; it doesn't freeze your program. The program continues to run, and one of the things that can go wrong is that the code you posted is executed again. Re-entered. You'll see two message boxes. And you'll have a completely corrupted program state, because your code was never written to assume it could be re-entered. Which is why you complained that 4 bytes of data were swallowed.
The proper tool to use here is the feature that really freezes your program. A debugger breakpoint.
Never assume that binary data can be converted to text. Those 4 bytes you received contain binary zeros. There is no character for that. Worse, a zero acts as a string terminator in many operating system calls, the kind used by the debugger, Debug.WriteLine() etc. This is why you can't see the "<<".
The proper tool to use here is a debugger watch or tooltip; it lets you look into the array directly. If you absolutely have to generate a diagnostic string, then use BitConverter.ToString().
Never assume that a stream's Read() method will always return the number of bytes you asked for. Using the return value in your code is a hard requirement. This is the real bug in your program, the only one you are actually trying to fix.
The proper solution is to continue to call Read() until you have counted down the number of bytes you expected to receive, based on the length you read earlier. You'll need a MemoryStream to store the chunks of byte[]s you get.
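A minimal sketch of that read loop, under a few assumptions: the helper names are invented, the message is assumed to be framed as a 4-byte length in the machine's byte order followed by the payload, and a preallocated buffer is used in place of the MemoryStream mentioned above because the length is known up front.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamHelpers
{
    // Reads exactly count bytes, looping because Stream.Read may return fewer.
    public static byte[] ReadExactly(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException(
                    "Stream ended after " + offset + " of " + count + " bytes.");
            offset += read;
        }
        return buffer;
    }

    // Assumed framing: 4-byte length prefix, then that many payload bytes.
    public static byte[] ReadMessage(Stream stream)
    {
        byte[] lengthBytes = ReadExactly(stream, 4);
        int length = BitConverter.ToInt32(lengthBytes, 0); // machine byte order
        if (length < 0)
            throw new InvalidDataException("Negative message length.");
        return ReadExactly(stream, length);
    }

    static void Main()
    {
        // Length 5 (little-endian) followed by the UTF-8 bytes of "hello".
        byte[] wire = { 5, 0, 0, 0, 104, 101, 108, 108, 111 };

        // Hex dump for diagnostics instead of decoding binary data as text.
        Console.WriteLine(BitConverter.ToString(wire, 0, 4)); // 05-00-00-00

        using (var stream = new MemoryStream(wire))
        {
            byte[] payload = ReadMessage(stream);
            Console.WriteLine(Encoding.UTF8.GetString(payload)); // hello
        }
    }
}
```

With a NetworkStream the same loop applies; only the payload, not the length prefix, is ever passed to Encoding.UTF8.GetString.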

Perhaps this link regarding Encoding.GetString() will help you out a bit. The part to pay attention to being:
If the data to be converted is available only in sequential blocks
(such as data read from a stream) or if the amount of data is so large
that it needs to be divided into smaller blocks, you should use the
Decoder object returned by the GetDecoder method of a derived class.

The problem was that I started the getMessage method twice. This started the while loop twice (in different threads).
Elgonzo helped me find the problem, he is a great guy :)

Process very large XML file

I need to process an XML file with the following structure:
<FolderSizes>
<Version></Version>
<DateTime Un=""></DateTime>
<Summary>
<TotalSize Bytes=""></TotalSize>
<TotalAllocated Bytes=""></TotalAllocated>
<TotalAvgFileSize Bytes=""></TotalAvgFileSize>
<TotalFolders Un=""></TotalFolders>
<TotalFiles Un=""></TotalFiles>
</Summary>
<DiskSpaceInfo>
<Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
UsedBytes=""><![CDATA[ ]]></Drive>
</DiskSpaceInfo>
<Folder ScanState="">
<FullPath Name=""><![CDATA[ ]]></FullPath>
<Attribs Int=""></Attribs>
<Size Bytes=""></Size>
<Allocated Bytes=""></Allocated>
<AvgFileSz Bytes=""></AvgFileSz>
<Folders Un=""></Folders>
<Files Un=""></Files>
<Depth Un=""></Depth>
<Created Un=""></Created>
<Accessed Un=""></Accessed>
<LastMod Un=""></LastMod>
<CreatedCalc Un=""></CreatedCalc>
<AccessedCalc Un=""></AccessedCalc>
<LastModCalc Un=""></LastModCalc>
<Perc><![CDATA[ ]]></Perc>
<Owner><![CDATA[ ]]></Owner>
<!-- Special element; see paragraph below -->
<Folder></Folder>
</Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big at a whopping 11GB so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.
What I would like to do is to import the information into a SQL database because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway, and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB memory. The application ended up using 24GB memory, and taking very, very long, so I just cancelled it.
Attempt to process the file using an XmlReader object
This approach worked better in that it didn't use as much memory, but I still had a few problems:
It was taking really long because I was reading the file one line at a time.
Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... Sounds prone to problems and errors.
Did I mention it takes really long reading the file one line at a time; and that still without actually processing that line - literally just reading it.
Import the information using SQL Server
I created a stored procedure using XQuery that calls itself recursively to process the <Folder> elements. This went quite well (I think better than the other two approaches) until one of the <Folder> elements ended up being rather big, producing the error "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." I read up about it and I don't think it's an adjustable limit.
Here are more things I think I should try:
Re-write my C# application to use unmanaged code
I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged might be a solution.
Rather go for C++ instead of C#
Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and even then probably not very efficiently at first.
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file, I run through it and check the size in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:
That's about 1500 lines per second. If the average line length is about 50 characters, that's 50 bytes per line, or 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line. It's not actually processing the line or doing anything with it, so when that starts, the processing rate will drop significantly.
This is the method that runs during the size calculation:
private int _totalLines = 0;
private bool _cancel = false; // set to true when the cancel button is clicked
private void CalculateFileSize()
{
xmlStream = new StreamReader(_filePath);
xmlReader = new XmlTextReader(xmlStream);
while (xmlReader.Read())
{
if (_cancel)
return;
if (xmlReader.LineNumber > _totalLines)
_totalLines = xmlReader.LineNumber;
InterThreadHelper.ChangeText(
lblLinesRemaining,
string.Format("{0} lines", _totalLines));
string elapsed = string.Format(
"{0}:{1}:{2}:{3}",
timer.Elapsed.Days.ToString().PadLeft(2, '0'),
timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));
InterThreadHelper.ChangeText(lblElapsed, elapsed);
if (_cancel)
return;
}
xmlStream.Dispose();
}
Still running, 27 minutes in :(

You can read an XML file as a logical stream of elements instead of trying to read it line by line and piece it back together yourself. See the code sample at the end of this article.
Also, your question has already been asked here.
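As a rough sketch of reading the file as a stream of elements, here is one way it could look for the <Folder> structure shown in the question. The FolderRecord type and the choice of which fields to keep are assumptions made for illustration; only XmlReader itself comes from the question. Memory use depends on the nesting depth, not on the file size.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical record type; a real import would carry more of the fields.
sealed class FolderRecord
{
    public string FullPath;
    public string SizeBytes;
    public int Depth;
}

static class FolderSizesReader
{
    // Yields one record per <Folder> element as soon as its end tag is reached.
    public static IEnumerable<FolderRecord> ReadFolders(TextReader input)
    {
        var open = new Stack<FolderRecord>(); // currently open <Folder> elements
        string currentElement = null;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        currentElement = reader.Name;
                        if (reader.Name == "Folder")
                        {
                            if (reader.IsEmptyElement) break; // <Folder/> has no end tag
                            open.Push(new FolderRecord { Depth = open.Count + 1 });
                        }
                        else if (reader.Name == "Size" && open.Count > 0)
                        {
                            open.Peek().SizeBytes = reader.GetAttribute("Bytes");
                        }
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (currentElement == "FullPath" && open.Count > 0)
                            open.Peek().FullPath = reader.Value;
                        break;
                    case XmlNodeType.EndElement:
                        currentElement = null;
                        if (reader.Name == "Folder")
                            yield return open.Pop();
                        break;
                }
            }
        }
    }

    static void Main()
    {
        string xml =
            "<FolderSizes><Folder><FullPath Name='a'><![CDATA[C:\\a]]></FullPath><Size Bytes='10'></Size>" +
            "<Folder><FullPath Name='b'><![CDATA[C:\\a\\b]]></FullPath><Size Bytes='4'></Size></Folder>" +
            "</Folder></FolderSizes>";

        foreach (FolderRecord f in ReadFolders(new StringReader(xml)))
            Console.WriteLine("{0} {1} {2}", f.Depth, f.FullPath, f.SizeBytes);
        // 2 C:\a\b 4
        // 1 C:\a 10
    }
}
```

Note that nested folders are yielded before their parents, because a parent's end tag comes last. For the 11GB file, a StreamReader over the file would be passed in place of the StringReader, and each record could be handed to a batched database insert.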

Differences in length in TagLib# (C#) and TagLib (C++)

I am currently in the process of moving my C# application over to Qt / C++. I'm running into problems with lengths from TagLib. I find it odd that TagLib# returns audio durations in milliseconds, while TagLib returns its (incorrect) durations in seconds. TagLib just returns zero for the length values, while TagLib# remains correct.
Here is my source in C# / TagLib#...
TagLib.File tagfile = TagLib.File.Create(path);
uint milliseconds = (uint)tagfile.Properties.Duration.TotalMilliseconds;
And here is what should be nearly equivalent in C++ / TagLib. I've even forced it to read accurately. No success.
TagLib::FileName fn(path);
TagLib::FileRef fr(fn, true, TagLib::AudioProperties::Accurate);
uint length = fr.audioProperties()->length();
It works as expected for a good majority of my media files. However, a select few audio files fail to return any audio properties (the rest of the tag information reads fine!). The exact same audio properties are returned with no issues on TagLib#.
Any ideas are appreciated. Thanks.
Does anyone have any more ideas before the bounty ends?

Hi, there is a patch to TagLib that calculates the length in milliseconds. The author added a method, lengthMilliseconds(), that returns the length in milliseconds; maybe that could be useful for you:
http://web.archiveorange.com/archive/v/sF3Pjr01lSQjsqjrAC7L

A lot has changed in TagLib#'s parsing of audio files since it was originally ported, so it's hard to say where exactly the difference would occur. You may want to check your C++ program for debug messages.
My guess is that the difference is in how the two libraries react to invalid headers. It appears that if the first frame header it finds is invalid, TagLib won't calculate any audio property values. TagLib#, on the other hand, looks for the first valid header in the first 16KiB of the audio part of the file. If the first header it encounters is corrupt, it will scan for the next one. If I remember correctly, an incorrectly saved ID3v2 tag could result in 0xFF FF FF FF appearing in the beginning of the audio section of the file. This would trigger the type of failure described above.
The problem is at line 166 of taglib/mpeg/mpegproperties.cpp. This could be solved using the same approach as lines 171 to 191, but you would want to update the code to give up after a point in case it really isn't an MP3 file.

As of this writing, TagLib 1.11 BETA 2 natively supports getting the length of audio in milliseconds. You can do so with the following code:
TagLib::FileRef f(path);
int lengthInMillis = f.audioProperties()->lengthInMilliseconds();

How can I make a fixed hex editor?

So. Let's say I were to make a hex editor to edit... oh... let's say a .DLL file. How can I edit a .DLL file's hex by using C# or C++? And for the "fixed part", I want to make it so that I can browse from the program for a specific .DLL, have some pre-coded buttons on the programmed file, and when the button is pressed, it will automatically execute the requested action, meaning the button has been pre-coded to know what to look for in the .DLL and what to change it to. Can anyone help me get started on this?
Also, preferably C#. Thank you!

The basics are very simple.
A DLL, or any file, is a stream of bytes.
Basic file operations allow you to read and write arbitrary portions of a file. The term of art is "random-access file operations".
In C, the fundamental operations are read(), write(), and lseek().
read allows you to read a stream of bytes into a buffer, write allows you to write a buffer of bytes to a file, and lseek allows you to position yourself anywhere you want in the file.
Example:
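#include <fcntl.h>   /* open, O_RDWR */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit */
#include <unistd.h>  /* lseek, read, write, fsync, close */

int fd = open("test.dat", O_RDWR);
/* Move to byte offset 200 before reading. */
off_t offset = lseek(fd, 200, SEEK_SET);
if (offset == (off_t)-1) {
    printf("Boom!\n");
    exit(1);
}
char buf[1024];
ssize_t bytes_read = read(fd, buf, sizeof buf);
/* Move to byte offset 100 and write back only the bytes actually read. */
offset = lseek(fd, 100, SEEK_SET);
ssize_t bytes_written = bytes_read > 0 ? write(fd, buf, (size_t)bytes_read) : 0;
fsync(fd); /* push the written data out to disk */
close(fd);
This reads up to 1024 bytes from the file, starting at byte offset 200, then writes them back to the file at byte offset 100.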
Once you can change random bytes in a file, it's a matter of choosing what bytes to change, how to change them, and doing the appropriate reads/lseeks/writes to make the changes.
Note, those are the most primitive I/O operations, there are likely much better ones you can use depending on your language etc. But they're all based on those primitives.
Interpreting the bytes of a file, displaying them, etc. That's an exercise for the reader. But those basic I/O capabilities give you the fundamentals of changing files.
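Since the question prefers C#, here is a rough equivalent of the same seek/read/write idea using FileStream. It is only a sketch: the Patch helper and its parameters are hypothetical, and the idea that each pre-coded button maps to an (offset, expected bytes, replacement bytes) triple is an assumption about how the "fixed" editor might be organized.

```csharp
using System;
using System.IO;

static class BinaryPatcher
{
    // Overwrites the bytes at offset, but only if the expected bytes are found
    // there. Returns false (and leaves the file unchanged) otherwise.
    public static bool Patch(string path, long offset, byte[] expected, byte[] replacement)
    {
        if (expected.Length != replacement.Length)
            throw new ArgumentException("The patch must not change the file length.");

        using (var file = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            if (offset < 0 || offset + expected.Length > file.Length)
                return false;

            // Read the current bytes, looping because Read may return fewer.
            file.Seek(offset, SeekOrigin.Begin);
            byte[] current = new byte[expected.Length];
            int total = 0;
            while (total < current.Length)
            {
                int read = file.Read(current, total, current.Length - total);
                if (read == 0) return false;
                total += read;
            }

            for (int i = 0; i < current.Length; i++)
                if (current[i] != expected[i]) return false;

            // Seek back and overwrite in place.
            file.Seek(offset, SeekOrigin.Begin);
            file.Write(replacement, 0, replacement.Length);
            return true;
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x10, 0x20, 0x30, 0x40 });

        bool ok = Patch(path, 1, new byte[] { 0x20, 0x30 }, new byte[] { 0xAA, 0xBB });
        Console.WriteLine(ok);                                         // True
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path))); // 10-AA-BB-40
        File.Delete(path);
    }
}
```

Checking the expected bytes first guards against patching a different version of the DLL, where the same offset may hold something else.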

If the idea is to load a hex edit box you can use the following: Be.HexEditor
Editing a file's "hex" is nothing more than changing bytes in it. The part of having pre-programmed changes is going to be that more general type. But for actually viewing, finding and then having the option of changing anything you want, Be.HexEditor is a good option. I used it over a year ago, I would hope that it has some new features that will make your life easier.

How can I determine the length of an mp3 file's header?

I am writing a program to diff, and copy entire files or segments based on changes on either end (Rsync-esque... but more like Unison). The main idea is to keep my music folder (all mp3s) up to date over multiple locations.
I'd like to send segmented updates if only small portions of the file have changed, as opposed to copying the entire file. For this, I need a way to diff segments of the file.
I initially tried generating hashes for blocks of every file (Every n bytes I'd hash the segment). I noticed that when I changed one attribute (id3v2 tag on an mp3) all the hashed blocks would change. This makes sense, as I would guess the header is growing as it acquired new information.
This leads me to my actual question. I would like to know how to determine the length of an mp3's header, so I could create 2 comparable hashes.
1) The meta info of the file (header)
2) The actual mpeg stream with audio (This hash should remain unchanged if all I do is alter tag info)
Am I missing anything else?
Thanks!
Ty

If all you want to check the length of is id3v2 tags, then you can find out information about its structure at http://www.id3.org/id3v2.4.0-structure.
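If you read the first 3 bytes and they are equal to "ID3", then skip to the 7th byte and read the 4-byte tag size stored there. Be careful though, because the size is stored as a "synchsafe integer": only the lower 7 bits of each byte are used. The size does not include the 10-byte header itself.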
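A small sketch of decoding that size, assuming the first 10 bytes of the file have already been read into an array. The class and method names are invented; the layout (3-byte "ID3" marker, 2 version bytes, 1 flags byte, 4 synchsafe size bytes, and an optional 10-byte footer signalled by flag bit 0x10 in ID3v2.4) follows the structure document linked above.

```csharp
using System;

static class Id3v2
{
    // Returns the total length in bytes of a leading ID3v2 tag, header included,
    // or 0 if the buffer does not start with a valid ID3v2 header.
    public static int TagLength(byte[] first10)
    {
        if (first10 == null || first10.Length < 10) return 0;
        if (first10[0] != 'I' || first10[1] != 'D' || first10[2] != '3') return 0;

        // Bytes 6..9 hold the size; a synchsafe byte never has its top bit set.
        for (int i = 6; i < 10; i++)
            if ((first10[i] & 0x80) != 0) return 0;

        // 7 significant bits per byte, most significant byte first.
        int size = (first10[6] << 21) | (first10[7] << 14) | (first10[8] << 7) | first10[9];

        bool hasFooter = (first10[5] & 0x10) != 0; // ID3v2.4 footer flag
        return 10 + size + (hasFooter ? 10 : 0);
    }

    static void Main()
    {
        // "ID3", version 4.0, no flags, synchsafe size 00 00 02 01 = 257.
        byte[] header = { 0x49, 0x44, 0x33, 0x04, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01 };
        Console.WriteLine(TagLength(header)); // 267
    }
}
```

Everything from that offset onward could then be hashed separately from the tag, although a trailing ID3v1 tag (the last 128 bytes, starting with "TAG") would still need its own handling.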

If you want to determine the header information, you'll either:
a) need to use a mp3 library that can do the parsing for you, or
b) go to the mp3 specification and parse it out as needed.

I wound up using TagLibSharp. developer.novell.com/wiki/index.php/TagLib_Sharp
