I need to fit as much information as I can into a small file size. In this case, the data is in comma-separated format, and all values are stored as decimals with 2 decimal places (no headers).
I've had a look, and my understanding is that all the characters I need are stored using ASCII (1 byte per character) in the standard .txt file I'm currently using. A byte can hold 256 possible values, which is way more than I need - I could get by with only 16 distinct characters.
Could I save my data in some kind of 4-bit text file? I will be creating the file using C# (all my Google searches turn up advice on making a text file, not on making a text file with a smaller "alphabet"). Would doing this actually save any space in the end?
I could zip up anything before I send it, but any advice or ideas on getting the file size down would be greatly appreciated.
[The file] will be read by a piece of C# code.
You are therefore controlling the serialization format. You can pick any format you like.
A quick way to save space and reuse your existing code is to compress the CSV. GZip is built in, but it is rather weak. You can use a 7-Zip library instead; its LZMA algorithm is state of the art. It will get rid of most of the redundancy caused by the decimal points and by mostly using the characters 0-9 - not 100% of it, but close.
You can make this even more efficient by using a better format. You can use BinaryReader/Writer to easily write something entirely custom.
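For example, here is a minimal sketch of a custom binary format for 2dp values, storing each one as a scaled 32-bit integer (the method names are just for illustration):

using System.IO;

// Each 2dp decimal fits in a 4-byte int once scaled by 100,
// e.g. 123.45 becomes 12345 - no decimal points or commas are stored.
static void WriteValues(string path, decimal[] values)
{
    using var writer = new BinaryWriter(File.Create(path));
    writer.Write(values.Length); // count first, so the reader knows when to stop
    foreach (decimal v in values)
        writer.Write((int)(v * 100));
}

static decimal[] ReadValues(string path)
{
    using var reader = new BinaryReader(File.OpenRead(path));
    var values = new decimal[reader.ReadInt32()];
    for (int i = 0; i < values.Length; i++)
        values[i] = reader.ReadInt32() / 100m;
    return values;
}

That is a flat four bytes per value; if your values fit in a 16-bit range, you could write shorts and halve it again.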
Protocol Buffers is a bit easier and also extremely compact.
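A sketch using the protobuf-net NuGet package (the Record type and scaled-integer field are made-up examples; ProtoContract, ProtoMember and Serializer are that library's API):

using System.IO;
using ProtoBuf;

[ProtoContract]
class Record
{
    // Scaled integers again; protobuf's varint encoding keeps small values compact.
    [ProtoMember(1)]
    public int[] Cents { get; set; }
}

static void Save(string path, Record record)
{
    using var stream = File.Create(path);
    Serializer.Serialize(stream, record);
}

static Record Load(string path)
{
    using var stream = File.OpenRead(path);
    return Serializer.Deserialize<Record>(stream);
}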
I think that the question is legitimate, but the answer is that the logical conditions you impose leave no room for any solution.
So if you could avoid the CSV structure in favor of a custom structure, you could save something, but you need it, and it pretty much determines your solution. The only variable left is how you encode the text, and standard text encodings won't go below 8 bits per character - you can only go higher, e.g. Unicode (16 bits).
I won't comment on using compression, since you mentioned you're already aware of it and are looking for alternative answers.
I want to write a huge amount of metadata to a JPEG, but .NET is fighting me. I'm at the point where I wonder if it would be easier to just modify the bytes myself. There's no image.Metadata.Comment = "My comment";, I can't find any projects that do it for you (see this answer), Microsoft's documentation is confusing, another StackOverflow post led to this article, which turns out at the end not to show you how to actually write metadata, and this code by John P works, but if you try to add a lot of characters you get the error System.IO.FileFormatException: Commit unsuccessful because too much metadata changed.
So pretty much nothing works at all. I want to add a comment of any length to my JPEG - if the JPEG itself is 1.3 MB, I want to be able to add a comment so long that it becomes 10 MB.
You don't say what type of metadata you're trying to write. But from your question it sounds as though you're writing large strings into the JPEG comment section.
JPEG files are basically a list of segments. Each segment starts with a marker (a 0xFF byte followed by a single type byte) and a two-byte length that includes the length field itself, so a segment's payload can be at most 65,533 bytes.
You can store comments in their own segment, the so-called COM segment.
If your comment is longer than that, you can store multiple COM segments in the file. The reader is supposed to concatenate them into the final comment.
Some discussion here.
As for how to do this in C#, I'm not aware of any library that supports this. I wrote and maintain MetadataExtractor for .NET and Java, but as the name suggests it's all about reading, not writing, metadata.
However, the container format for JPEG is not too complicated. It shouldn't be hard to write your own code that injects COM segments into the file and copies all other segments across verbatim.
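A rough sketch of that approach - it assumes a well-formed file and simply splices the COM segments in right after the SOI marker:

using System;
using System.IO;
using System.Text;

static void WriteComment(string srcPath, string destPath, string comment)
{
    byte[] jpeg = File.ReadAllBytes(srcPath);
    if (jpeg.Length < 2 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
        throw new InvalidDataException("Not a JPEG (missing SOI marker).");

    byte[] text = Encoding.UTF8.GetBytes(comment);
    const int MaxPayload = 65533; // the two-byte length field counts itself

    using var output = new FileStream(destPath, FileMode.Create);
    output.Write(jpeg, 0, 2); // copy SOI

    // Split the comment across as many COM segments as needed.
    for (int offset = 0; offset < text.Length; offset += MaxPayload)
    {
        int count = Math.Min(MaxPayload, text.Length - offset);
        int length = count + 2;
        output.WriteByte(0xFF);
        output.WriteByte(0xFE); // COM marker
        output.WriteByte((byte)(length >> 8)); // big-endian length
        output.WriteByte((byte)(length & 0xFF));
        output.Write(text, offset, count);
    }

    output.Write(jpeg, 2, jpeg.Length - 2); // remaining segments, verbatim
}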
In .NET, I need a way to compare two files. I thought of a class that represents a diff:
public enum DiffEntryState
{
    New,
    Removed,
    Changed
}

public class DiffEntry
{
    public byte[] Bytes;
    public long FileOffset;
    public DiffEntryState State = DiffEntryState.Changed;
}
The names should be pretty self-explanatory. I thought of adding a State to each entry so that I can distinguish between the cases where the first file is larger than the second, or vice versa.
I'm wondering if there is a common and fast way to retrieve the byte-by-byte differences of two files. I would simply create a stream for each file and compare chunks of these streams until one ends. Is there a better way, or does the Framework have a built-in solution? Keep in mind that I need the differences themselves, not just feedback that there ARE differences.
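Something like this sketch is the naive approach I have in mind (it assumes Read fills the buffer except at the end of a file, and it doesn't fully drain the longer file):

using System;
using System.Collections.Generic;
using System.IO;

static IEnumerable<DiffEntry> CompareFiles(string pathA, string pathB)
{
    using var a = File.OpenRead(pathA);
    using var b = File.OpenRead(pathB);

    var bufA = new byte[64 * 1024];
    var bufB = new byte[64 * 1024];
    long offset = 0;

    while (true)
    {
        int readA = a.Read(bufA, 0, bufA.Length);
        int readB = b.Read(bufB, 0, bufB.Length);
        int common = Math.Min(readA, readB);

        for (int i = 0; i < common; i++)
        {
            if (bufA[i] != bufB[i])
            {
                int start = i; // collect the whole run of differing bytes
                while (i < common && bufA[i] != bufB[i]) i++;
                var changed = new byte[i - start];
                Array.Copy(bufB, start, changed, 0, changed.Length);
                yield return new DiffEntry
                {
                    Bytes = changed,
                    FileOffset = offset + start,
                    State = DiffEntryState.Changed
                };
            }
        }

        if (readA != readB) // one file ended first: report the surplus
        {
            int extra = Math.Max(readA, readB) - common;
            var tail = new byte[extra];
            Array.Copy(readA > readB ? bufA : bufB, common, tail, 0, extra);
            yield return new DiffEntry
            {
                Bytes = tail,
                FileOffset = offset + common,
                State = readA > readB ? DiffEntryState.Removed : DiffEntryState.New
            };
            yield break;
        }

        if (readA == 0) yield break; // both files ended together
        offset += readA;
    }
}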
//Edit:
After sleeping a night on the problem, I guess I'm taking the wrong approach here. The whole tool is a backup solution which will be able to save only the changed bytes, and thus reduce the overall space needed for the backup. Instead of saving a compressed 14 MB file each time, only 200 KB or less will be saved.
But after thinking about the problem, I realized that saving only the byte-by-byte differences wouldn't be enough. Take a piece of text, for example:
"This is a string."
"This was a string."
As a matter of fact, the only change here is "is" to "was". But my approach would assume that the changed content is now "was a string". If this happens at the beginning of a huge file, well, this approach is useless.
Obviously, I need a way to index a file and detect all moved, copied or changed blocks in comparison to the original file.
Phew...
Take a look at Diff.NET, it could be helpful.
For general case binary differencing, look at A Linear Time, Constant Space Differencing Algorithm by Randal C. Burns and Darrell D. E. Long. Also, Randal Burns' master's thesis, Differential Compression: A Generalized Solution For Binary Files, goes into more detail and provides pseudo-code for the algorithm.
You might also get some useful ideas from About Remote Differential Compression and from Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression
For text file differencing, I recommend starting with An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers. This algorithm can be used to diff any two sequences. To compare two text files, generate sequences of hash codes (e.g., by calling string.GetHashCode()) for each line in each file. Then run those sequences (e.g., IList<int>) through Myers' algorithm to find the shortest edit script (i.e., inserts and deletes) that will convert the first sequence into the second.
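For example, building the per-line hash sequence is a one-liner per file (a full implementation would also verify matching lines to rule out hash collisions):

using System.Collections.Generic;
using System.IO;
using System.Linq;

// One hash code per line; the diff then works on these int sequences.
static IList<int> HashLines(string path) =>
    File.ReadLines(path).Select(line => line.GetHashCode()).ToList();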
I hope this helps. I'm the author of Diff.Net, and it uses Burns' algorithm for binary differencing and Myers' algorithm for text differencing. The source code for Diff.Net's libraries (Menees.Diffs and Menees.Diffs.Controls) is available under the Apache License, Version 2.0, and the references above should help you implement your own solution without having to start from scratch.
There is no built-in functionality.
So you have to compare the files byte by byte or use a library that does this for you.
I'm working on a solution, and one of the features is to check that some files have not been tampered with - in other words, hacked. I was planning on using the MD5 sum combined with the created and modified dates, but wanted to see if anybody has done something like this before. I'm using C# at the moment, but you could suggest any other language. I just want to hear about the technique or architecture.
We have an application that checks file validity for safety reasons. The CRC32 checksums are stored in a separate file using a simple dictionary lookup. Whether you use CRC32, MD5, or any other hashing/checksumming algorithm is purely a matter of choice: you simply need to know if the file has changed (at least, that's what you've said). Since every byte of the file is included in the calculation, any change will be picked up, including a simple swap of bytes.
Don't use file dates: they're too unreliable and too easily changed.
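As an illustration, here's a minimal sketch of that scheme using MD5, since it ships with the framework (CRC32 needs a third-party implementation); the method names are made up:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static Dictionary<string, string> BuildChecksums(IEnumerable<string> paths)
{
    using var md5 = MD5.Create();
    var map = new Dictionary<string, string>();
    foreach (string path in paths)
    {
        using var stream = File.OpenRead(path);
        map[path] = BitConverter.ToString(md5.ComputeHash(stream));
    }
    return map;
}

static bool IsUnchanged(string path, IReadOnlyDictionary<string, string> known)
{
    using var md5 = MD5.Create();
    using var stream = File.OpenRead(path);
    return known.TryGetValue(path, out var expected)
        && expected == BitConverter.ToString(md5.ComputeHash(stream));
}

Bear in mind that a plain checksum only catches changes; an attacker who can modify the files can recompute the sums too, so the checksum store itself needs protecting (e.g. an HMAC with a secret key, or a signature).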
This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5 (or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5 (or more)-dimensional matrix or array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5 (or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
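For example, with fixed-width records you can compute the byte offset of any entry directly and seek straight to it; the record width and dimension size below are illustrative:

using System;
using System.Globalization;
using System.IO;
using System.Text;

class LookupTable
{
    const int RecordWidth = 10; // each value padded to exactly 10 bytes
    const int N = 20;           // entries per dimension

    readonly FileStream file;

    public LookupTable(string path) => file = File.OpenRead(path);

    // Row-major index into the 5-D table, then one seek and one small read.
    public double Get(int i1, int i2, int i3, int i4, int i5)
    {
        long index = ((((long)i1 * N + i2) * N + i3) * N + i4) * N + i5;
        file.Seek(index * RecordWidth, SeekOrigin.Begin);
        var buffer = new byte[RecordWidth];
        file.Read(buffer, 0, RecordWidth);
        return double.Parse(Encoding.ASCII.GetString(buffer).Trim(),
                            CultureInfo.InvariantCulture);
    }
}

With CSV you would instead have to scan for delimiters to find the right entry, which is exactly the cost a lookup table wants to avoid.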
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. MATLAB and similar programs tend to be great at certain types of computations and often have a lot built in to make them easier. That said, much of the math functionality built into such languages is available to other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
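A sketch of that dump, writing the innermost dimension as one comma-delimited row per line (Compute stands in for the real algorithm):

using System.Globalization;
using System.IO;
using System.Linq;

static void DumpTable(string path)
{
    using var writer = new StreamWriter(path);
    for (int a = 0; a < 20; a++)
    for (int b = 0; b < 20; b++)
    for (int c = 0; c < 20; c++)
    for (int d = 0; d < 20; d++)
        writer.WriteLine(string.Join(",",
            Enumerable.Range(0, 20).Select(e =>
                Compute(a, b, c, d, e).ToString("F6", CultureInfo.InvariantCulture))));
}

// Placeholder for the actual 5-variable algorithm.
static double Compute(int a, int b, int c, int d, int e) => 0.0;

As long as the importer traverses the dimensions in the same order, the flat file maps back onto the 5-D array unambiguously.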
What is the best way to search a large binary file for a certain substring in C#?
To provide some specifics, I'm trying to extract the DWARF information from an executable, so I only care about certain parts of the binary file (namely the sections starting with the strings .debug_info, .debug_abbrev, etc.)
I don't see anything obvious in Stream, FileStream, or BinaryReader, so it looks like I'll have to read chunks in and search through the data for the strings myself.
Is there a better way?
There's nothing built into .NET that will do the search for you, so you're going to need to read in the file chunk by chunk and scan for what you want to find.
You can speed up the search in two ways.
Firstly, use buffered IO and transfer large chunks at a time - don't read byte by byte; read 64KB, 256KB or 1MB chunks.
Secondly, don't do a naive linear scan for the piece you want - check out the Boyer-Moore algorithm (Wikipedia link) for string searching; you can apply it to searching for the DWARF information you want.
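For illustration, a Boyer-Moore-Horspool search (a simplified relative of full Boyer-Moore) over a byte buffer might look like this:

using System;

static int IndexOf(byte[] buffer, int count, byte[] pattern)
{
    if (pattern.Length == 0 || count < pattern.Length) return -1;

    // Skip table: how far we can safely shift on a mismatch.
    var skip = new int[256];
    for (int i = 0; i < 256; i++) skip[i] = pattern.Length;
    for (int i = 0; i < pattern.Length - 1; i++)
        skip[pattern[i]] = pattern.Length - 1 - i;

    int pos = 0;
    while (pos <= count - pattern.Length)
    {
        int j = pattern.Length - 1;
        while (j >= 0 && buffer[pos + j] == pattern[j]) j--;
        if (j < 0) return pos; // full match at pos
        pos += skip[buffer[pos + pattern.Length - 1]];
    }
    return -1;
}

When scanning chunk by chunk, keep the last pattern.Length - 1 bytes of each chunk and prepend them to the next read so a match straddling a boundary isn't missed.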
I think you'll have to do it yourself; BinaryReader was not designed for searching for text in a binary file. Be mindful, though, of the text encoding you use when searching.
There must be a DWARF C library you could compile and interop with? I did some searching and found this. If a library from there can be compiled into a DLL on Windows (I assume you're using Windows), you could use System.Runtime.InteropServices to interact with it and extract your information that way.
Perhaps?