Cutting random bytes off of a file byte array in C#

So I've been working on this project for a while now, involving LSB steganography. Really fun stuff. Anyways, I just finished writing the code for embedding and extracting files from an image(instead of just plaintext), and I'm running into this problem. I can recognize the MIME and extension of the bytes, but because the embedded file doesn't usually take up all of the LSBs of the image, there's a lot of garbage data. So I have the extracted file + some garbage in the byte array right after it. I need to figure out how to cut these, so that the file that is being exported is the correct, smaller size.
TLDR: I have a byte array with a recognized file in it, with some additional random bytes. How do I find out where the file ends and the random bytes begin?
Remember this is all in C#.
Any advice is appreciated.
Link to my project for reference: https://github.com/nicosogangstar/Steg

Generally you have two options.
End of stream marker
This is the more direct approach of the two, but it may lack some versatility depending on what data you want to hide. After you embed your data, continue by embedding a unique sequence of bits/bytes which you know cannot be prematurely encountered in the data before it. As you extract the bits, you can stop reading once you encounter this sequence. If you expect to hide only readable text, i.e. bytes with ASCII codes between 32 and 127, your marker can be as short as eight 0s or eight 1s. However, if you intend to hide any sort of binary data, where each byte value has a chance of appearing, you may accidentally encounter the marker while extracting legitimate data and thus halt the process prematurely.
Header information
You can add a header preceding the data, e.g. another 16-24 bits (or any other amount) which can be translated to a number that tells you how many bits/bytes/pixels to read before stopping. For example, if you want to hide a byte array of size 1000, first embed 2 bytes encoding the length of the secret and then follow with the actual data. More specifically, split the length across 2 bytes, where the first byte holds the 8th to 15th bits and the second byte holds the 0th to 7th bits of the number 1000 in binary.
00000011 11101000    1000 in binary
       3      232    byte values
You can embed all sorts of information in a header, such as whether the data is encrypted or compressed with some algorithm, the original filename of the data, how many LSBs to read when extracting the information, etc.
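As a sketch of that length-header scheme (the class and method names here are hypothetical, not from the linked project), the length could be framed before embedding and used to trim the garbage bytes after extraction:

```csharp
using System;

static class LengthHeader
{
    // Prepend a 2-byte big-endian length header to the payload.
    // Supports payloads up to 65535 bytes; widen to 4 bytes for larger files.
    public static byte[] AddHeader(byte[] payload)
    {
        var result = new byte[payload.Length + 2];
        result[0] = (byte)(payload.Length >> 8);   // bits 8-15
        result[1] = (byte)(payload.Length & 0xFF); // bits 0-7
        Array.Copy(payload, 0, result, 2, payload.Length);
        return result;
    }

    // Read the header back and return only the real payload, dropping any
    // trailing garbage that came out of the remaining unused LSBs.
    public static byte[] StripHeader(byte[] extracted)
    {
        int length = (extracted[0] << 8) | extracted[1];
        var payload = new byte[length];
        Array.Copy(extracted, 2, payload, 0, length);
        return payload;
    }
}
```

For a 1000-byte payload, the two header bytes come out as 3 and 232, matching the binary split above.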

Related

ProtectedMemory - How To Ensure That The Size Of The Byte Array Is A Multiple Of 16?

I'm reading up on the ProtectedMemory class in C# (which uses the Data Protection API in Windows (DPAPI)) and I see that in order to use the Protect() Method of the class, the data to be encrypted must be stored in a byte array whose size/length is a multiple of 16.
I know how to convert many different data types to byte array form and back again, but how can I guarantee that the size of a byte array is a multiple of 16? Do I literally need to create an array whose size is a multiple of 16 and keep track of the original data's length in another variable, or am I missing something? With traditional block ciphers, all of these details are handled for you automatically by the padding settings. Likewise, when I attempt to convert data back to its original form from a byte array, how do I ensure that any additional bytes are ignored, assuming of course that the original data wasn't a multiple of 16?
In the code sample provided in the .NET Framework documentation, the byte array utilised just so happens to be 16 bytes long so I'm not sure what best practice is in relation to this hence the question.
Yes, just to reiterate the possibilities given in the comments (and give an answer to this nice question), you can use:
a padding method that is also used for block cipher modes, see all the options on the Wikipedia page on the subject.
prefix a length in some form or other. A fixed size of 32 bits / 4 bytes is probably easiest. Do write down the type of encoding for the size (unsigned, little endian is probably best for C#).
Both of these already operate on bytes, so you may need to define a character encoding such as UTF-8 if you use a string.
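A minimal sketch of the length-prefix option (the class and method names are mine, not part of any API), writing the 4-byte little-endian length by hand so it does not depend on the platform's byte order:

```csharp
using System;

static class Pad16
{
    // Prefix the data with a 4-byte little-endian length, then zero-pad
    // the buffer up to the next multiple of 16.
    public static byte[] Pad(byte[] data)
    {
        int total = 4 + data.Length;
        int padded = (total + 15) / 16 * 16;   // round up to a multiple of 16
        var buffer = new byte[padded];         // trailing bytes stay zero
        for (int i = 0; i < 4; i++)
            buffer[i] = (byte)(data.Length >> (8 * i)); // little-endian length
        data.CopyTo(buffer, 4);
        return buffer;
    }

    // Read back the stored length and drop the padding bytes.
    public static byte[] Unpad(byte[] buffer)
    {
        int length = buffer[0] | buffer[1] << 8 | buffer[2] << 16 | buffer[3] << 24;
        var data = new byte[length];
        Array.Copy(buffer, 4, data, 0, length);
        return data;
    }
}
```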
You could also use a specific encoding of the string, e.g. one defined by ASN.1 / DER and then perform zero padding. That way you can even indicate the type of the data that has been encoded in a platform independent way. You may want to read up on masochism before taking this route.

Adding 8BIM profile metadata to a tiff image file

I'm working on a program which requires 8BIM profile information to be present in the tiff file for the processing to continue.
The sample tiff file (which does not contain the 8BIM profile information) when opened and saved in Adobe Photoshop gets this metadata information.
I'm clueless as to how to approach this problem.
The target framework is .net 2.0.
Any information related to this would be helpful.
No idea why you need the 8BIM to be present in your TIFF file. I will just give some general information and structure about 8BIM.
8BIM is the signature for Photoshop Image Resource Block (IRB). This kind of information could be found in images such as TIFF, JPEG, Photoshop native image format etc. It could also be found in non-image documents such as in PDF.
The structure of the IRB is as follows:
Each IRB block starts with a 4-byte signature which translates to the string "8BIM". After that comes a 2-byte unique identifier denoting the kind of resource for this IRB. For example: 0x040c for thumbnail; 0x041a for slices; 0x0408 for grid information; 0x040f for ICC profile, etc.
After the identifier is a variable-length name string. The first byte gives the length of the string (excluding the length byte itself), and the string follows. There is a requirement that the length of the whole field (including the length byte) be even; otherwise, pad one more byte after the string.
The next 4 bytes specify the size of the actual data for this resource block, followed by the data itself with the specified length. The total length of the data should also be even, so if the size of the data is odd, pad one more byte. That completes one whole 8BIM block.
There could be more than one IRBs but they all conform to the same structure as described above. How to interpret the data depends on the unique identifier.
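To make the layout concrete, here is a rough sketch (not production code, names are mine) of walking a buffer of IRBs according to the rules above:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class IrbReader
{
    // Walk a buffer of Image Resource Blocks laid out as described above:
    // "8BIM" + 2-byte ID + Pascal-style name (padded to even length) +
    // 4-byte big-endian size + data (padded to even length).
    public static List<(int Id, uint Size)> ListResources(byte[] irb)
    {
        var found = new List<(int, uint)>();
        int pos = 0;
        while (pos + 11 <= irb.Length)
        {
            if (Encoding.ASCII.GetString(irb, pos, 4) != "8BIM") break;
            pos += 4;

            int id = irb[pos] << 8 | irb[pos + 1];   // big-endian resource ID
            pos += 2;

            int nameField = irb[pos] + 1;            // length byte + name
            if (nameField % 2 != 0) nameField++;     // padded to an even length
            pos += nameField;

            uint size = (uint)(irb[pos] << 24 | irb[pos + 1] << 16
                             | irb[pos + 2] << 8 | irb[pos + 3]);
            pos += 4;

            found.Add((id, size));
            pos += (int)size + (int)(size % 2);      // data also padded to even
        }
        return found;
    }
}
```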
Now let's see how IRBs are included in images. In a JPEG image, metadata can be present as one of the application (APPn) segments. Since different applications could use the same APPn segment to store their own metadata, there must be some kind of identifier to let the image reader know what information is contained inside the APPn. Photoshop uses APP13 as its IRB container, and the APP13 segment contains "Photoshop 3.0" as its identifier.
TIFF images are tag-based and arranged in a directory structure. There is a private tag 0x8649, called "PHOTOSHOP", used to insert IRB information.
Let's take a look at the TIFF image format (quoted from this source):
The basic structure of a TIFF file is as follows:
The first 8 bytes forms the header. The first two bytes of which is
either "II" for little endian byte ordering or "MM" for big endian
byte ordering. In what follows we'll be assuming big endian ordering.
Note: any true TIFF reading software is supposed to handle both
types. The next two bytes of the header should be 0 and 42 decimal (2A hex).
The remaining 4 bytes of the header is the offset from the start of
the file to the first "Image File Directory" (IFD), this normally
follows the image data it applies to. In the example below there is
only one image and one IFD.
An IFD consists of two bytes indicating the number of entries followed
by the entries themselves. The IFD is terminated with 4 byte offset to
the next IFD or 0 if there are none. A TIFF file must contain at least
one IFD!
Each IFD entry consists of 12 bytes. The first two bytes identifies
the tag type (as in Tagged Image File Format). The next two bytes are
the field type (byte, ASCII, short int, long int, ...). The next four
bytes indicate the number of values. The last four bytes is either the
value itself or an offset to the values. Considering the first IFD
entry from the example given below:
0100 0003 0000 0001 0064 0000
 |    |        |        |
 |    |        |        +-- value of 100
 |    |        +----------- one value
 |    +-------------------- short int
 +------------------------- tag
In order to be able to read a TIFF IFD, two things must be done first:
A way to be able to read either big or little endian data
A random access input stream which wraps the image input so that we can jump forward and backward while reading the directory.
Now let's assume we have a structure called Entry for each and every 12-byte IFD entry. We read the first two bytes (endianness does not apply here, since they are either "MM" or "II") to determine the byte order. Now we can read the remaining IFD data and interpret it according to the endianness we just determined.
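A rough sketch of the first step, reading the TIFF header with either byte order (the helper names are mine, not a library API):

```csharp
using System;
using System.IO;

static class TiffHeader
{
    // Read an 8-byte TIFF header and return the offset of the first IFD,
    // honouring the "II"/"MM" byte-order mark.
    public static (bool LittleEndian, uint FirstIfdOffset) ReadHeader(Stream s)
    {
        var header = new byte[8];
        if (s.Read(header, 0, 8) != 8) throw new EndOfStreamException();

        bool little = header[0] == (byte)'I' && header[1] == (byte)'I';
        if (ReadUInt16(header, 2, little) != 42)   // the magic number 42
            throw new InvalidDataException("not a TIFF file");

        return (little, ReadUInt32(header, 4, little));
    }

    static ushort ReadUInt16(byte[] b, int i, bool little) =>
        little ? (ushort)(b[i] | b[i + 1] << 8) : (ushort)(b[i] << 8 | b[i + 1]);

    static uint ReadUInt32(byte[] b, int i, bool little) =>
        little ? (uint)(b[i] | b[i + 1] << 8 | b[i + 2] << 16 | b[i + 3] << 24)
               : (uint)(b[i] << 24 | b[i + 1] << 16 | b[i + 2] << 8 | b[i + 3]);
}
```

Each 12-byte Entry would then be read with the same two helpers, using the endianness returned here.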
Right now we have a list of Entry values. It's not so difficult to insert a new Entry into the list - in our case, a "Photoshop" Entry. The difficult part is how to write the data back out to create a new TIFF. You can't just write the Entries back to the output stream directly; that would break the overall structure of the TIFF. Care must be taken to keep track of where you write the data and to update the data pointers accordingly.
From the above description, we can see that it's not so easy to insert new Entries into the TIFF format. The JPEG format makes it much easier, given that each JPEG segment is self-contained.
I don't have related C# code, but there is a Java library here which can manipulate metadata for JPEG and TIFF images, such as inserting EXIF, IPTC, thumbnails, etc. as 8BIM. In your case, if file size is not a big issue, the above-mentioned library can insert a small thumbnail into a Photoshop tag as one 8BIM.

Compression of small string

I have data of 340 bytes in a string, mostly consisting of signs and numbers, like "føàA¹º#ƒUë5§Ž§".
I want to compress it into 250 or fewer bytes to save it on my RFID card.
As this data is related to a fingerprint template, I want lossless compression.
So is there any algorithm I can implement in C# to compress it?
If the data is strictly numbers and signs, I highly recommend changing the numbers into int based values. eg:
+12939272-23923+927392
can be compressed into 3 32-bit integers, which takes 22 bytes down to 12 bytes. Picking the right integer size (32-bit, 24-bit, or 16-bit) should help.
If the integer sizes vary greatly, you could start with 8 bits and use the value 255 to signal that the next 8 bits are the more significant bits of the integer, extending it to 15 bits.
Alternatively, you could identify the most frequent character and assign it the code 0; the second most frequent character gets 10, and the third 110. This is a very crude compression, but if your data is very limited, it might just do the job for you.
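The sign-and-digits idea could be sketched like this (a toy example, assuming every number in the string fits in 32 bits):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class NumberPacker
{
    // Pack a string of signed decimal numbers like "+12939272-23923+927392"
    // into 32-bit integers (4 bytes each) instead of one byte per character.
    public static byte[] Pack(string s) =>
        Regex.Matches(s, @"[+-]?\d+")
             .Cast<Match>()
             .SelectMany(m => BitConverter.GetBytes(int.Parse(m.Value)))
             .ToArray();

    // Recover the numbers, 4 bytes at a time.
    public static int[] Unpack(byte[] packed) =>
        Enumerable.Range(0, packed.Length / 4)
                  .Select(i => BitConverter.ToInt32(packed, i * 4))
                  .ToArray();
}
```

The 22-character example string packs into 12 bytes this way.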
Is there any other information you know about your string? For instance does it contain certain characters more often than others? Does it contain all 255 characters or just a subset of them?
If so, huffman encoding may help you, see this or this other link for implementations in C#.
To be honest, it just depends on what your input string looks like. What I'd do is try rar, zip, and 7zip (LZMA) with very small dictionary sizes (otherwise they'll just use up too much space for preprocessed information) and see how big the raw compressed output is (you will probably have to use their libraries in order to strip the headers and conserve space). If any of them produce a file under 250 bytes, then find the C# library for it and there you go.

How do I accomplish random reads of a UTF8 file

My understanding is that reads to a UTF8 or UTF16 Encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages for example).
How can I use .NET to skip to an approximate position within the file, and read the unicode text from a semi-random position?
Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?
Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip all bytes with the leading bits 10 (continuation bytes). The first byte that does not have leading 10 is the starting byte of a proper UTF-8 character, and you can read the following bytes using a regular UTF-8 decoder.
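A sketch of that resynchronization step (the class and method names are mine; note that if the read window ends mid-character, the decoder will emit a replacement character for the final bytes):

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8RandomRead
{
    // Jump to an arbitrary byte offset, skip UTF-8 continuation bytes
    // (10xxxxxx), and decode text from the next character boundary.
    public static string ReadFrom(string path, long offset, int maxBytes)
    {
        using var fs = File.OpenRead(path);
        fs.Seek(offset, SeekOrigin.Begin);

        int b;
        while ((b = fs.ReadByte()) != -1 && (b & 0xC0) == 0x80)
        {
            // continuation byte, keep skipping
        }
        if (b == -1) return "";

        var buffer = new byte[maxBytes];
        buffer[0] = (byte)b;                      // first lead byte we found
        int read = fs.Read(buffer, 1, maxBytes - 1);
        return Encoding.UTF8.GetString(buffer, 0, read + 1);
    }
}
```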
Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to work out how to jump into a random place and then scroll forward to a guaranteed 'start of character' position (which my feeling was would be a tricky proposition) - edit: this is wrong, see below. How about something like:
Establish the length of the file in bytes
Heuristically guess the number of characters - for example, by scaling by a constant established from some suitable corpus; or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file
Pick a pseudo-random number in 1..<guessed number of characters in file>
If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
Read the file's bytes, decoding as UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
A buffered read here will need to use two buffers which are alternately 'first' to avoid losing context when a character's bytes are split across two reads, eg:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
As noted by @jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character count estimation stuff above might still prove useful.
For UTF-16, you always have to jump to an even byte position. Then you can check whether a trailing surrogate follows. If so, skip it, otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.

Getting a string, int, etc in binary representation?

Is it possible to get strings, ints, etc in binary format? What I mean is that assume I have the string:
"Hello" and I want to store it in binary format, so assume "Hello" is
11110000110011001111111100000000 in binary (I know it's not; I just typed something quickly).
Can I store the above binary not as a string, but in the actual format, with the bits?
In addition to this, is it actually possible to store less than 8 bits. What I am getting at is if the letter A is the most frequent letter used in a text, can I use 1 bit to store it with regards to compression instead of building a binary tree.
Is it possible to get strings, ints,
etc in binary format?
Yes. There are several different methods for doing so. One common method is to make a MemoryStream out of an array of bytes, and then make a BinaryWriter on top of that memory stream, and then write ints, bools, chars, strings, whatever, to the BinaryWriter. That will fill the array with the bytes that represent the data you wrote. There are other ways to do this too.
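For example, the MemoryStream/BinaryWriter approach might look like this (a small sketch, with hypothetical method names):

```csharp
using System;
using System.IO;
using System.Text;

static class BinaryDemo
{
    // Serialize mixed values into a byte array via MemoryStream + BinaryWriter.
    public static byte[] Serialize(int number, bool flag, string text)
    {
        using var ms = new MemoryStream();
        using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(number);  // 4 bytes
            writer.Write(flag);    // 1 byte
            writer.Write(text);    // 7-bit-encoded length prefix + UTF-8 bytes
        }
        return ms.ToArray();
    }

    // Read the values back in the same order they were written.
    public static (int, bool, string) Deserialize(byte[] bytes)
    {
        using var reader = new BinaryReader(new MemoryStream(bytes));
        return (reader.ReadInt32(), reader.ReadBoolean(), reader.ReadString());
    }
}
```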
Can I store the above binary not as a string, but in the actual format with the bits.
Sure, you can store an array of bytes.
is it actually possible to store less than 8 bits.
No. The smallest unit of storage in C# is a byte. However, there are classes that will let you treat an array of bytes as an array of bits. You should read about the BitArray class.
What encoding would you be assuming?
What you are looking for is something like Huffman coding, it's used to represent more common values with a shorter bit pattern.
How you store the bit codes is still limited to whole bytes. There is no data type that uses less than a byte. The way that you store variable width bit values is to pack them end to end in a byte array. That way you have a stream of bit values, but that also means that you can only read the stream from start to end, there is no random access to the values like you have with the byte values in a byte array.
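A small illustration of both points, using the BitArray class to pack made-up variable-width codes end to end (the codes here are illustrative, not a real Huffman table):

```csharp
using System;
using System.Collections;

static class BitPacking
{
    // Pack the codes A=1, B=01, C=00 end to end and count the bytes used.
    public static byte[] PackDemo()
    {
        // Bit stream for "A B A C": 1, 01, 1, 00 (six bits total)
        var bits = new BitArray(new[] { true, false, true, true, false, false });

        // Storage still rounds up to whole bytes underneath.
        var bytes = new byte[(bits.Length + 7) / 8];
        bits.CopyTo(bytes, 0);
        return bytes;  // six bits of codes fit in a single byte
    }
}
```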
What I am getting at is if the letter
A is the most frequent letter used in
a text, can I use 1 bit to store it
with regards to compression instead of
building a binary tree.
The algorithm you're describing is known as Huffman coding. To relate to your example, if 'A' appears frequently in the data, then the algorithm will represent 'A' as simply 1. If 'B' also appears frequently (but less frequently than A), the algorithm usually would represent 'B' as 01. Then, the rest of the characters would be 00xxxxx... etc.
In essence, the algorithm performs statistical analysis on the data and generates a code that will give you the most compression.
You can use things like:
BitConverter.GetBytes(1);
Encoding.ASCII.GetBytes("text");
Encoding.Unicode.GetBytes("text");
Once you have the bytes, you can do all the bit twiddling you want. You would need an algorithm of some sort before we can give you much more useful information.
The string is actually stored in binary format, as are all strings.
The difference between a string and another data type is that when your program displays the string, it retrieves the binary and shows the corresponding (ASCII) characters.
If you were to store data in a compressed format, you would need to assign more than 1 bit per character. How else would you identify which character is the most frequent?
If 1 represents an 'A', what does 0 mean? all the other characters?
