I am working on a Data Matrix encoding technique. I studied the ECC-200 encoding method; for an 8x8 Data Matrix it uses 8 bits, of which 3 bits are used for data and the remaining 5 bits for error correction. But I want to use the full 8 bits for data. How can I do this and avoid the error-correction codewords?
According to the specification, this Data Matrix is 8x8 plus 2 for the alignment and finder patterns, so it's usually called a 10x10. It is not possible to change the number of bits available for data versus error correction; they are fixed and not customizable, unlike in QR Code.
A 10x10 symbol has 3 codewords of data and 5 codewords of error correction, which gives 62.5% error correction. You cannot change this.
Be careful not to confuse bits with codewords. In a 10x10 symbol (8x8 data region), the 3 codewords of data correspond to 6 numeric characters, 3 alphanumeric characters, or 1 byte.
You can find more information here: http://www.barcodephp.com/en/2d/datamatrix/technical
There you can see how many characters of each type you can include in each size.
I'm reading up on the ProtectedMemory class in C# (which uses the Windows Data Protection API, DPAPI), and I see that in order to use the Protect() method of the class, the data to be encrypted must be stored in a byte array whose size/length is a multiple of 16.

I know how to convert many different data types to byte-array form and back again, but how can I guarantee that the size of a byte array is a multiple of 16? Do I literally need to create an array whose size is a multiple of 16 and keep track of the original data's length in another variable, or am I missing something? With traditional block ciphers all of these details are handled for you automatically with padding settings. Likewise, when I attempt to convert data back to its original form from a byte array, how do I ensure that any additional bytes are ignored, assuming of course that the original data wasn't a multiple of 16?

In the code sample provided in the .NET Framework documentation, the byte array utilised just so happens to be 16 bytes long, so I'm not sure what best practice is in relation to this; hence the question.
Yes, just to iterate over the possibilities given in the comments (and give an answer to this nice question), you can use:
a padding method that is also used for block cipher modes, see all the options on the Wikipedia page on the subject.
prefix a length in some form or other. A fixed size of 32 bits / 4 bytes is probably easiest. Do write down the type of encoding for the size (unsigned, little endian is probably best for C#).
Both of these already operate on bytes, so you may need to define a character encoding such as UTF-8 if you use a string.
You could also use a specific encoding of the string, e.g. one defined by ASN.1 / DER and then perform zero padding. That way you can even indicate the type of the data that has been encoded in a platform independent way. You may want to read up on masochism before taking this route.
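For illustration, here's a minimal sketch of the length-prefix option (the method names are mine; it assumes a 4-byte little-endian length followed by zero padding up to a multiple of 16):

static byte[] PadToMultipleOf16(byte[] data)
{
    // 4-byte length prefix; BitConverter is little endian on x86/x64
    byte[] lengthPrefix = BitConverter.GetBytes(data.Length);
    int total = ((4 + data.Length + 15) / 16) * 16;
    byte[] padded = new byte[total]; // new arrays are zero-filled, so padding is free
    Buffer.BlockCopy(lengthPrefix, 0, padded, 0, 4);
    Buffer.BlockCopy(data, 0, padded, 4, data.Length);
    return padded;
}

static byte[] Unpad(byte[] padded)
{
    int length = BitConverter.ToInt32(padded, 0); // read the prefix back
    byte[] data = new byte[length];
    Buffer.BlockCopy(padded, 4, data, 0, length);
    return data;
}

The padded array can then be passed to Protect(), and after Unprotect() the original bytes can be recovered with Unpad().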
So I've been working on this project for a while now, involving LSB steganography. Really fun stuff. Anyways, I just finished writing the code for embedding and extracting files from an image (instead of just plaintext), and I'm running into a problem. I can recognize the MIME type and extension of the bytes, but because the embedded file doesn't usually take up all of the LSBs of the image, there's a lot of garbage data, so I have the extracted file plus some garbage in the byte array right after it. I need to figure out how to cut the garbage, so that the exported file has the correct, smaller size.
TLDR: I have a byte array with a recognized file in it, with some additional random bytes. How do I find out where the file ends and the random bytes begin?
Remember this is all in C#.
Any advice is appreciated.
Link to my project for reference: https://github.com/nicosogangstar/Steg
Generally you have two options.
End of stream marker
This is the more direct approach of the two, but it may lack some versatility depending on what data you want to hide. After you embed your data, continue by embedding a unique sequence of bits/bytes which you know cannot be prematurely encountered in the data. As you extract the bits, you can stop reading once you encounter this sequence. If you expect to hide only readable text, i.e. bytes with ASCII codes between 32 and 127, your marker can be as short as eight 0s or eight 1s. However, if you intend to hide arbitrary binary data, where every byte value has a chance of appearing, you may accidentally encounter the marker while extracting legitimate data and thus halt the process prematurely.
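A minimal sketch of the extraction side of this idea, assuming the payload is printable ASCII and a single zero byte (eight 0s) serves as the marker:

using System.Collections.Generic;

static byte[] ReadUntilMarker(IEnumerable<byte> extracted)
{
    var payload = new List<byte>();
    foreach (byte b in extracted)
    {
        if (b == 0) break; // a zero byte cannot occur in printable ASCII (32-127)
        payload.Add(b);
    }
    return payload.ToArray();
}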
Header information
You can add a header preceding the data, e.g., another 16-24 bits (or any other amount) which can be translated to a number telling you how many bits/bytes/pixels to read before stopping. For example, if you want to hide a byte array of size 1000, first embed 2 bytes encoding the length of the secret and then follow with the actual data. More specifically, split the length across 2 bytes, where the first byte holds the 8th to 15th bits and the second byte holds the 0th to 7th bits of the number 1000 in binary.
00000011 11101000    1000 in binary
       3      -24    byte values (as signed bytes; 232 unsigned)
You can embed all sorts of information in a header, such as whether the data is encrypted or compressed with some algorithm, the original filename of the data, how many LSBs to read when extracting the information, etc.
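As a rough sketch of the header approach (hypothetical helper; 2-byte big-endian length as in the example above):

static byte[] AddLengthHeader(byte[] payload)
{
    byte[] framed = new byte[payload.Length + 2];
    framed[0] = (byte)(payload.Length >> 8);   // bits 8-15 of the length
    framed[1] = (byte)(payload.Length & 0xFF); // bits 0-7 of the length
    Buffer.BlockCopy(payload, 0, framed, 2, payload.Length);
    return framed; // embed this; on extraction, read 2 bytes, then exactly that many more
}

For a 1000-byte payload the header bytes are 3 and 232 (that is, 00000011 11101000), matching the example above.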
I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16; it may in fact be in any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings there is no direct way to translate the request "get me UTF-16 code units X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all the data for a given cell to the client at any one time unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't cause hundreds of megabytes to be allocated to satisfy the request.
I wish to support the random access exposed by GetChars() relatively efficiently. If the first request asks for code units 1 billion to 1 billion + 10, I don't see any way to avoid retrieving all the data in the cell from the server up to the requested code units; but subsequently asking for code units 1 billion + 10 to 1 billion + 20, or even 999,999,000 to 1 billion, should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server (rather than retrieving from the beginning every time). On a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET Framework frustrates me quite a bit. Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of source bytes maps to exactly X UTF-16 code units, as the 'bytesUsed' parameter seems to include source bytes that were just stashed in the object's internal state rather than output as chars. This causes me problems when trying to decode starting from, or ending in, the middle of a partial code point, giving me garbage.
So, my question is: is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #code units), without resorting to something like converting in a loop and decreasing the size of the source byte by byte?
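For reference, here is a sketch of the piecemeal conversion being described, where FetchBytesFromServer is a hypothetical placeholder for the server byte-range request:

Decoder decoder = Encoding.GetEncoding("shift_jis").GetDecoder(); // example non-UTF-16 encoding
byte[] bytes = FetchBytesFromServer(byteOffset, 4096);            // hypothetical server call
char[] chars = new char[4096];
int bytesUsed, charsUsed;
bool completed;
decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
                false, out bytesUsed, out charsUsed, out completed);
// charsUsed UTF-16 code units were produced, but bytesUsed can also count
// trailing bytes of a partial code point that were merely buffered in the
// decoder's internal state, so #bytes -> #code units isn't directly recoverable.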
Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", meaning that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state" -- the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than generally realized, due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character, such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂. This means that a single abstract UTF-16 "text element" can be made up of one or two "base" chars plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm, when you want to fetch N characters from the server starting at offset K, would be to fetch N+E characters starting at K-E for some "large enough" E, then scan backwards until the first text-element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
in StringInfo.cs.
A bit of a nuisance, but doable.
For other, stateful, encodings, I would not know how to do this; the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family, you may well need to scan back arbitrarily far.
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).Select((enc) => enc.DisplayName).ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. For instance, the default character encoding for a SQL Server database is iso_1, a.k.a. ISO 8859-1. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping, to use when restarting from the associated offset. This way I don't lose any partial code points it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with the potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
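A minimal sketch of that cloning trick, assuming .NET Framework, where the built-in Decoder implementations are serializable:

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;

static Decoder CloneDecoder(Decoder original)
{
    // Round-trip through BinaryFormatter to deep-copy the decoder,
    // including any partially consumed code point in its internal state.
    var formatter = new BinaryFormatter();
    using (var stream = new MemoryStream())
    {
        formatter.Serialize(stream, original);
        stream.Position = 0;
        return (Decoder)formatter.Deserialize(stream);
    }
}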
I'm working on a program which requires 8BIM profile information to be present in a TIFF file for processing to continue.
The sample TIFF file (which does not contain the 8BIM profile information), when opened and saved in Adobe Photoshop, gets this metadata information.
I'm clueless as to how to approach this problem.
The target framework is .net 2.0.
Any information related to this would be helpful.
I have no idea why you need the 8BIM to be present in your TIFF file, so I will just give some general information about the structure of 8BIM.

8BIM is the signature of a Photoshop Image Resource Block (IRB). This kind of information can be found in images such as TIFF, JPEG, Photoshop's native image format, etc. It can also be found in non-image documents, such as PDFs.
The structure of the IRB is as follows:
Each IRB block starts with a 4-byte signature which translates to the string "8BIM". After that is a 2-byte unique identifier denoting the kind of resource for this IRB. For example: 0x040c for thumbnail; 0x041a for slices; 0x0408 for grid information; 0x040f for ICC profile, etc.

After the identifier is a variable-length string for the name. The first byte of the string gives the length of the string (excluding the length byte itself); after the first byte comes the string itself. There is a requirement that the length of the whole string (including the length byte) be even; otherwise, pad one more byte after the string.

The next 4 bytes specify the size of the actual data for this resource block, followed by the data of the specified length. The total length of the data also should be an even number, so if the size of the data is odd, pad one more byte. This completes a whole 8BIM.

There can be more than one IRB, but they all conform to the same structure described above. How to interpret the data depends on the unique identifier.
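To make the layout concrete, here is a rough, untested sketch of writing one IRB following the structure just described:

using System.IO;
using System.Text;

static void WriteIrb(BinaryWriter writer, ushort resourceId, string name, byte[] data)
{
    writer.Write(Encoding.ASCII.GetBytes("8BIM"));  // 4-byte signature
    writer.Write((byte)(resourceId >> 8));          // 2-byte identifier, big endian
    writer.Write((byte)(resourceId & 0xFF));        //   e.g. 0x040C for thumbnail
    byte[] nameBytes = Encoding.ASCII.GetBytes(name);
    writer.Write((byte)nameBytes.Length);           // length byte of the name string
    writer.Write(nameBytes);
    if ((1 + nameBytes.Length) % 2 != 0)            // name (with length byte) must be even
        writer.Write((byte)0);
    writer.Write((byte)(data.Length >> 24));        // 4-byte data size, big endian
    writer.Write((byte)(data.Length >> 16));
    writer.Write((byte)(data.Length >> 8));
    writer.Write((byte)data.Length);
    writer.Write(data);                             // the resource data itself
    if (data.Length % 2 != 0)                       // data must also be padded to even
        writer.Write((byte)0);
}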
Now let's see how IRBs are included in images. For a JPEG image, metadata can be present as one of the application (APPn) segments. Since different applications could use the same APPn segment to store their own metadata, there must be some kind of identifier to let the image reader know what kind of information is contained inside the APPn segment. Photoshop uses APP13 as its IRB container, and the APP13 segment contains "Photoshop 3.0" as its identifier.
A TIFF image, by contrast, is tag-based and arranged in a directory structure; there is a private tag 0x8649, called "PHOTOSHOP", used to insert IRB information.
Let's take a look at the TIFF image format (quoted from this source):
The basic structure of a TIFF file is as follows:
The first 8 bytes form the header. The first two bytes are either "II" for little-endian byte ordering or "MM" for big-endian byte ordering. In what follows we'll be assuming big-endian ordering. Note: any true TIFF-reading software is supposed to be able to handle both types. The next two bytes of the header should be 0 and 42dec (2Ahex). The remaining 4 bytes of the header are the offset from the start of the file to the first "Image File Directory" (IFD); this normally follows the image data it applies to. In the example below there is only one image and one IFD.

An IFD consists of two bytes indicating the number of entries, followed by the entries themselves. The IFD is terminated with a 4-byte offset to the next IFD, or 0 if there are none. A TIFF file must contain at least one IFD!

Each IFD entry consists of 12 bytes. The first two bytes identify the tag type (as in Tagged Image File Format). The next two bytes are the field type (byte, ASCII, short int, long int, ...). The next four bytes indicate the number of values. The last four bytes are either the value itself or an offset to the values. Consider the first IFD entry from the example given below:
              0100 0003 0000 0001 0064 0000
               |    |        |     |
tag -----------+    |        |     |
short int ----------+        |     |
one value -------------------+     |
value of 100 ----------------------+
In order to be able to read a TIFF IFD, two things must be done first:
A way to be able to read either big or little endian data
A random access input stream which wraps the image input so that we can jump forward and backward while reading the directory.
Now let's assume we have a structure called Entry for each 12-byte IFD entry. We read the first two bytes of the file (endianness doesn't matter here since they are either MM or II) to determine the byte order. Now we can read the remaining IFD data and interpret it according to the endianness we know.
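A small sketch of that first step (the helper names are made up):

using System.IO;

static long ReadTiffHeader(BinaryReader reader, out bool bigEndian)
{
    ushort byteOrder = reader.ReadUInt16();  // 'II' = 0x4949, 'MM' = 0x4D4D either way
    bigEndian = byteOrder == 0x4D4D;
    if (ReadUInt16(reader, bigEndian) != 42) // the magic 42 that marks a TIFF
        throw new InvalidDataException("Not a TIFF file.");
    return ReadUInt32(reader, bigEndian);    // offset of the first IFD
}

static ushort ReadUInt16(BinaryReader reader, bool bigEndian)
{
    byte[] b = reader.ReadBytes(2);
    return bigEndian ? (ushort)((b[0] << 8) | b[1]) : (ushort)((b[1] << 8) | b[0]);
}

static uint ReadUInt32(BinaryReader reader, bool bigEndian)
{
    byte[] b = reader.ReadBytes(4);
    return bigEndian ? (uint)((b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3])
                     : (uint)((b[3] << 24) | (b[2] << 16) | (b[1] << 8) | b[0]);
}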
Right now we have a list of Entry objects. It's not so difficult to insert a new Entry into the list - in our case, a "Photoshop" Entry. The difficult part is writing the data back out to create a new TIFF. You can't just write the Entries directly back to the output stream; that would break the overall structure of the TIFF. Care must be taken to keep track of where you write the data and to update the data pointers accordingly.
From the above description, we can see that it's not so easy to insert new Entries into the TIFF format. The JPEG format makes it much easier, given that each JPEG segment is self-contained.
I don't have related C# code, but there is a Java library here which can manipulate metadata for JPEG and TIFF images, such as inserting EXIF, IPTC, thumbnails, etc. as 8BIM. In your case, if file size is not a big issue, the above-mentioned library can insert a small thumbnail into a Photoshop tag as one 8BIM.
I have data of 340 bytes in a string, mostly consisting of signs and numbers, like "føàA¹º#ƒUë5§Ž§".
I want to compress it into 250 bytes or less to save it on my RFID card.
As this data is related to fingerprint temp., I want lossless compression.
So is there any algorithm that I can implement in C# to compress it?
If the data consists strictly of numbers and signs, I highly recommend changing the numbers into int-based values, e.g.:
+12939272-23923+927392
can be compressed into 3 32-bit integers, which takes it from 22 bytes down to 12 bytes. Picking the right integer size (whether 32-bit, 24-bit, or 16-bit) should help.
If the integer sizes vary greatly, you could start with 8 bits and use the value 255 as an escape to indicate that the next 8 bits become the more significant bits of the integer, making it a 15-bit value.
Alternatively, you could identify the most frequent character and assign 0 to it; the second most frequent character gets 10, and the third 110. This is a very crude compression, but if your data is very limited, it might just do the job for you.
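As a quick sketch of the int-packing idea, assuming the string really is just sign-prefixed runs of digits:

using System;
using System.Linq;
using System.Text.RegularExpressions;

static byte[] PackNumbers(string s)
{
    // "+12939272-23923+927392" -> 12939272, -23923, 927392
    return Regex.Matches(s, @"[+-]?\d+")
                .Cast<Match>()
                .SelectMany(m => BitConverter.GetBytes(int.Parse(m.Value)))
                .ToArray(); // 3 ints = 12 bytes instead of 22 characters
}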
Do you know any other information about your string? For instance, does it contain certain characters more often than others? Does it contain all 255 characters or just a subset of them?
If so, Huffman encoding may help you; see this or this other link for implementations in C#.
To be honest, it just depends on what your input string looks like. What I'd do is try rar, zip, and 7zip (LZMA) with very small dictionary sizes (otherwise they'll just use up too much space on preprocessed information) and see how big the raw compressed output is (you will probably have to use their libraries in order to make them strip headers to conserve space). If any of them produce output under 250 bytes, then find the C# library for it and there you go.
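For a quick local test of how small the zip family gets the data, here's a sketch using the framework's raw Deflate stream (no zip container or headers, available since .NET 2.0):

using System.IO;
using System.IO.Compression;

static int CompressedSize(byte[] raw)
{
    using (MemoryStream output = new MemoryStream())
    {
        using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            deflate.Write(raw, 0, raw.Length); // raw deflate, no container overhead
        }
        // ToArray still works after the DeflateStream has closed the MemoryStream
        return output.ToArray().Length;
    }
}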