How do I accomplish random reads of a UTF8 file - c#

My understanding is that reads from a UTF-8 or UTF-16 encoded file can't necessarily be random, because of the occasional multi-byte sequence or surrogate pair (used for East Asian languages, for example).
How can I use .NET to skip to an approximate position within the file and read the Unicode text from that semi-random position?
Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for before I start decoding?

Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip over any bytes whose leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the lead byte of a proper UTF-8 character, and you can read the following bytes with a regular UTF-8 decoder.
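A minimal sketch of that skip-forward read, assuming a plain UTF-8 file; the method name and character count are illustrative:

using System;
using System.IO;
using System.Text;

// Sketch: seek to a random byte offset, skip continuation bytes (10xxxxxx),
// then decode normally from the next character boundary.
static string ReadFromRandomPosition(string path, int charCount)
{
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        var random = new Random();
        stream.Position = (long)(random.NextDouble() * stream.Length);

        int b;
        while ((b = stream.ReadByte()) != -1 && (b & 0xC0) == 0x80)
        {
            // skip continuation bytes
        }
        if (b == -1)
            return string.Empty;           // landed at the very end of the file

        stream.Position -= 1;              // step back onto the lead byte we just read
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            var buffer = new char[charCount];
            int read = reader.Read(buffer, 0, buffer.Length);
            return new string(buffer, 0, read);
        }
    }
}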

Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to jump into a random place and then scroll forwards to a guaranteed 'start of character' position (my feeling was that this would be a tricky proposition; edit: this is wrong, see the note at the end). How about something like:
Establish the length of the file in bytes
Heuristically guess the number of characters - for example, by scaling by a constant established from some suitable corpus; or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file
Pick a pseudo-random number in 1..<guessed number of characters in file>
If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
Read the file's bytes, decoding to UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
A buffered read here will need to use two buffers which are alternately 'first', to avoid losing context when a character's bytes are split across two reads, e.g.:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
As noted by @jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character-count estimation above might still prove useful.
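For the buffered, counting read described in the steps above, here is a minimal sketch; a System.Text.Decoder keeps the bytes of a split character between calls, which avoids the two-buffer juggling. The method name and buffer size are illustrative, and it counts UTF-16 chars rather than code points:

using System;
using System.IO;
using System.Text;

// Sketch: count characters while decoding in chunks; the Decoder keeps the bytes of
// any character split across two reads and completes it on the next call.
static char ReadNthChar(string path, long targetIndex)
{
    var decoder = Encoding.UTF8.GetDecoder();
    var bytes = new byte[4096];
    var chars = new char[Encoding.UTF8.GetMaxCharCount(4096)];
    long seen = 0;
    char last = '\0';

    using (var stream = File.OpenRead(path))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, bytesRead, chars, 0);
            if (targetIndex < seen + charCount)
                return chars[(int)(targetIndex - seen)];   // reached the desired character
            if (charCount > 0)
                last = chars[charCount - 1];
            seen += charCount;
        }
    }
    return last;   // fell off the end of the file: use the last character
}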

For UTF-16, you always have to jump to an even byte position. Then read the code unit at that position: if it is a trailing (low) surrogate, skip it; otherwise you are already at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.
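A small sketch of that check for UTF-16LE (the byte order and the absence of a BOM are assumptions here):

using System.IO;

// Sketch: synchronize to a character boundary in a UTF-16LE file.
static long SyncToUtf16CharStart(Stream stream, long approximateOffset)
{
    long position = approximateOffset & ~1L;        // force an even byte offset
    stream.Position = position;

    int lo = stream.ReadByte();                     // little-endian: low byte first
    int hi = stream.ReadByte();
    if (lo == -1 || hi == -1)
        return position;                            // at or past the end of the file

    int codeUnit = (hi << 8) | lo;
    if (codeUnit >= 0xDC00 && codeUnit <= 0xDFFF)   // trailing (low) surrogate
        return position + 2;                        // skip it; the next unit starts a character

    return position;                                // already at a character boundary
}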

Related

Decompress .Z files (LZW Compression) in C#

I am looking to implement the Rosetta Code LZW decompression method in C# and I need some help. The original code is available here: http://rosettacode.org/wiki/LZW_compression#C.23
I am only focusing on the Decompress method as I "simply" (if only) want to decompress .Z-files in my C# program in .NET 6.
I want my version to take a byte[] as input and return a byte[] (as I am reading the file with .ReadAllBytes() and want to create a new file with the decompressed result).
My problem comes from the fact that in C#, chars are 16-bit (2 bytes) and not 8-bit (1 byte). This really messes with my head, as that consequently (in my mind) means that each char should be represented by two bytes. In the code at Rosetta Code, the initial dictionary created only contains integer keys of 0 -> 255, meaning up to 1 byte, not two. I am wondering whether this is an error in their implementation? What do you think? And how would you go about converting this algorithm to a method with the signature: byte[] Decompress(byte[]) ?
Thanks
No, there is no error. No, you don't need to convert the algorithm to work on 16-bit values. The usual lossless compression libraries operate on sequences of bytes. Your string of characters would first need to be converted to a sequence of bytes, e.g. to UTF-8, e.g. byte[] bs = Encoding.UTF8.GetBytes(str);. UTF-8 would be the right choice, since that encoding gives the compressor the best shot at compressing. In fact, just encoding UTF-16 to UTF-8 will almost always compress the strings, which makes it a good starting point. (In fact, using UTF-16 as the standard for character strings in .NET is a terrible choice for exactly this reason, but I digress.)
Any data you compress would first be serialized to a sequence of bytes for these compressors, if it isn't bytes already, in a way that permits reversing the transformation after decompression on the other end.
Since you are decompressing, someone encoded the characters into a sequence of bytes, so you need to first find out what they did. It may just be a sequence of ASCII characters, which are already one byte per character. Then you would use System.Text.Encoding.ASCII.GetString(bs); to make a character string out of it.
When compressing data we usually talk about sequences of symbols. A symbol in this context might be a byte, a character, or something completely different.
Your example obviously uses characters as its symbols, but there should not be any real problem just replacing these with bytes instead (a rough sketch follows at the end of this answer). The more difficult part will be its use of strings to represent sequences of characters. You will need an equivalent representation of byte sequences that provides functionality like:
Concatenation/appending
Equality
GetHashCode (for performance)
Immutability (i.e. appending a byte should produce a new sequence, not modify the existing sequence)
Note that LZW implementations have to agree on some specific things to be compatible, so implementing the posted example may or may not allow you to decode .Z files encoded with another implementation. If your goal is to decode actual files, you may have better luck asking on Software Recommendations for a preexisting decompression library.
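As a rough sketch of the conversion, here is the Rosetta-style decompressor with string replaced by List<byte>. It assumes the integer codes have already been unpacked from the input bytes, which for real .Z files additionally involves variable-width codes and a clear code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch: LZW decompression over byte sequences instead of strings.
// Unpacking the codes from the compressed bytes is format-specific and omitted.
static byte[] Decompress(List<int> codes)
{
    // Initial dictionary: one single-byte sequence per possible byte value (0-255).
    var dictionary = new Dictionary<int, List<byte>>();
    for (int i = 0; i < 256; i++)
        dictionary[i] = new List<byte> { (byte)i };

    var w = new List<byte>(dictionary[codes[0]]);
    var output = new List<byte>(w);

    foreach (int k in codes.Skip(1))
    {
        List<byte> entry;
        if (dictionary.TryGetValue(k, out var known))
            entry = new List<byte>(known);
        else if (k == dictionary.Count)
            entry = new List<byte>(w) { w[0] };     // the code that is about to be defined
        else
            throw new InvalidDataException($"Bad compressed code: {k}");

        output.AddRange(entry);

        // New dictionary entry: w followed by the first byte of entry.
        dictionary[dictionary.Count] = new List<byte>(w) { entry[0] };

        w = entry;
    }
    return output.ToArray();
}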

Cutting random bytes off of file byte array in C#

So I've been working on this project for a while now, involving LSB steganography. Really fun stuff. Anyways, I just finished writing the code for embedding and extracting files from an image (instead of just plain text), and I'm running into this problem. I can recognize the MIME type and extension of the bytes, but because the embedded file doesn't usually take up all of the LSBs of the image, there's a lot of garbage data. So I have the extracted file plus some garbage in the byte array right after it. I need to figure out how to cut off the garbage, so that the file being exported is the correct, smaller size.
TLDR: I have a byte array with a recognized file in it, with some additional random bytes. How do I find out where the file ends and the random bytes begin?
Remember this is all in C#.
Any advice is appreciated.
Link to my project for reference: https://github.com/nicosogangstar/Steg
Generally you have two options.
End of stream marker
This is the more direct approach of the two, but it may lack some versatility depending on what data you want to hide. After you embed your data, continue with embedding a unique sequence of bits/bytes such that you know it cannot be prematurely encountered in the data before it. As you extract the bits, you can stop reading once you encounter this sequence. If you expect to hide only readable text, i.e. bytes with ASCII codes between 32 and 127, your marker can be as short as eight 0s or eight 1s. However, if you intend to hide any sort of binary data, where each byte value has a chance of appearing, you may accidentally encounter the marker while extracting legitimate data and thus halt the process prematurely.
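A minimal sketch of that stopping rule, assuming a marker sequence that genuinely cannot appear inside the payload:

using System.Collections.Generic;
using System.Linq;

// Sketch: collect extracted bytes until the marker sequence appears, then stop.
static byte[] TakeUntilMarker(IEnumerable<byte> extracted, byte[] marker)
{
    var result = new List<byte>();
    foreach (byte b in extracted)
    {
        result.Add(b);
        if (result.Count >= marker.Length &&
            result.Skip(result.Count - marker.Length).SequenceEqual(marker))
        {
            result.RemoveRange(result.Count - marker.Length, marker.Length);  // drop the marker
            break;
        }
    }
    return result.ToArray();
}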
Header information
You can add a header preceding the data, e.g., another 16-24 bits (or any other amount) which can be translated to a number that tells you how many bits/bytes/pixels to read before stopping. For example, if you want to hide a byte array of size 1000, first embed 2 bytes related to the length of the secret and then follow them with the actual data. More specifically, split the length into 2 bytes, where the first byte has the 8th to 15th bits and the second byte has the 0th to 7th bits of the number 1000 in binary.
00000011 11101000    1000 in binary
       3      232    byte values
You can embed all sorts of information in a header, such as whether the data is encrypted or compressed with some algorithm, the original filename of the data, how many LSBs to read when extracting the information, etc.
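A minimal sketch of such a 2-byte length header; the method names are illustrative, and a real header would likely carry the extra fields mentioned above:

using System;

// Sketch: frame the payload with a 2-byte big-endian length before embedding...
static byte[] AddLengthHeader(byte[] payload)
{
    if (payload.Length > ushort.MaxValue)
        throw new ArgumentException("Payload too large for a 2-byte length header.");

    var framed = new byte[payload.Length + 2];
    framed[0] = (byte)(payload.Length >> 8);    // bits 8-15 of the length
    framed[1] = (byte)(payload.Length & 0xFF);  // bits 0-7 of the length
    Array.Copy(payload, 0, framed, 2, payload.Length);
    return framed;
}

// ...and read the length back after extraction to know where the payload ends.
static byte[] StripLengthHeader(byte[] extracted)
{
    int length = (extracted[0] << 8) | extracted[1];
    var payload = new byte[length];
    Array.Copy(extracted, 2, payload, 0, length);
    return payload;
}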

How does the C# decoder know the exact number of bytes it should use for one char?

For example, a stream has four bytes: D8 00 DC 05. How does the decoder (e.g. System.Text.Decoder) know whether it should treat them as one char \uD800\uDC05 or two separate chars \uD800 and \uDC05? Thanks.
Perhaps I didn't describe my question clearly. My original intention was to understand how a UTF-8 decoder knows the exact number of bytes it should use for one char, as a UTF-8 character can take one to four bytes, and handling that variable length is the magic. A UTF-16 decoder doesn't have this problem with surrogate pairs. The above example is not appropriate for my question.
Your question is really about UTF-16 and surrogate pairs.
The two code units U+D800 and U+DC05 always form a surrogate pair. The two code units combine into one single code point, that is, one character.
C# calls the code units char which can be a bit misleading since it sometimes takes two char values (a pair of surrogates) to create one "character", as you have noticed.
Any code unit (char) value between U+D800 and U+DBFF always represents the first (leading, or "high") part of a surrogate pair, while any code unit between U+DC00 and U+DFFF is the corresponding second (trailing, or "low") part of the pair.
Code units outside this domain, i.e. either in U+0000 through U+D7FF or in U+E000 through U+FFFF stand for themselves, so in those ranges one UTF-16 code unit corresponds to one Unicode code point.
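For completeness, a small sketch using the built-in char helpers to combine the pair from the question:

using System;

char high = '\uD800';
char low = '\uDC05';

if (char.IsHighSurrogate(high) && char.IsLowSurrogate(low))
{
    int codePoint = char.ConvertToUtf32(high, low);   // combine the pair into one code point
    Console.WriteLine($"U+{codePoint:X}");            // prints U+10005
}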
EDIT: The question was changed to ask about UTF-8 instead.
I will use the word octet for a word of exactly 8 bits (so an octet is what most people call a byte).
In UTF-8 you can see from the position of the first 0 bit within the octet where this octet belongs in a UTF-8 sequence.
0xxxxxxx: If the first bit is 0, this octet constitutes a 1-octet sequence (ASCII value)
10xxxxxx: If the octet starts with 10, this is a continuation octet, i.e. not the initial octet in a sequence
110xxxxx: This is the initial octet in a 2-octet sequence
1110xxxx: This is the initial octet in a 3-octet sequence
11110xxx: This is the initial octet in a 4-octet sequence
Since modern UTF-8 does not allow 5-octet sequences, or longer, it is illegal for an octet to start with five ones, 11111xxx. But in early versions, the above scheme would be extended to allow 5-octet and 6-octet sequences (sometimes also longer).
When comparing UTF-16 and UTF-8, note that code points that require only a single 16-bit code unit in UTF-16, correspond exactly to code points that can be made with 1-, 2-, or 3-octet sequences in UTF-8. While code points that require a surrogate pair in UTF-16 (i.e. two UTF-16 code units) correspond exactly to those that require a 4-octet sequence in UTF-8.
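A small sketch of that table as code; the method name is illustrative:

// Sketch: classify a UTF-8 octet by the position of its first 0 bit.
// Returns the sequence length for a lead octet, 0 for a continuation octet,
// and -1 for an octet that is invalid in modern UTF-8.
static int Utf8SequenceLength(byte octet)
{
    if ((octet & 0b1000_0000) == 0b0000_0000) return 1;   // 0xxxxxxx: single octet (ASCII)
    if ((octet & 0b1100_0000) == 0b1000_0000) return 0;   // 10xxxxxx: continuation octet
    if ((octet & 0b1110_0000) == 0b1100_0000) return 2;   // 110xxxxx: 2-octet sequence
    if ((octet & 0b1111_0000) == 0b1110_0000) return 3;   // 1110xxxx: 3-octet sequence
    if ((octet & 0b1111_1000) == 0b1111_0000) return 4;   // 11110xxx: 4-octet sequence
    return -1;                                            // 11111xxx: not allowed
}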
The .NET Framework source code is available, so you can take a look. The source of System.Text.Decoder is here; in it you can find everything you want to know about your question.

Implementing DbDataReader.GetChars() efficiently when underlying data is not UTF-16

I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16, in fact may be any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all the data for a given cell to the client at any one time unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't have to have hundreds of megabytes of memory allocated to satisfy the request.
I wish to support the random-access exposed by GetChars() relatively efficiently. In the case of the first request asking for codeunits 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all data in the cell from the server up until the requested codepoints, but subsequently asking for codeunits 1 billion + 10 to 1 billion + 20, or even codepoints 999 million 999 thousand to 1 billion should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server (rather than retrieving from the beginning every time; on a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET Framework frustrates me quite a bit). Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 code units, as the 'bytesUsed' parameter seems to include source bytes which were just stashed into the object's internal state and not output as chars. This causes me problems when trying to decode starting from, or ending in, the middle of a partial code point, and it gives me garbage.
So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?
Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state", i.e. the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and will apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂ , which means that a single abstract UTF-16 "text element" can be made up of one or two "base" characters plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E starting at K-E for some "large enough" E, then scan backwards until the first text element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
in StringInfo.cs.
A bit of nuisance, but doable.
For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).Select((enc) => enc.DisplayName).ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1, a.k.a. ISO 8859-1, for instance. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
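A minimal sketch of that copy trick as described, for .NET Framework, where the built-in Decoder implementations are serializable; BinaryFormatter is obsolete on modern .NET, so treat this as an illustration of the approach rather than a recommendation:

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;

// Sketch: deep-copy a Decoder, including the partial code point it may have stashed,
// by round-tripping it through binary serialization (.NET Framework).
static Decoder SnapshotDecoder(Decoder decoder)
{
    var formatter = new BinaryFormatter();
    using (var stream = new MemoryStream())
    {
        formatter.Serialize(stream, decoder);
        stream.Position = 0;
        return (Decoder)formatter.Deserialize(stream);
    }
}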

How to read a char that has an ASCII value in the range 128-130 and convert it to an int value

I have an array of chars, some of them are ASCII 128 and 130 in decimal. I am trying to read them as normal chars, but instead of 128 I get 8218 as an int (cast to byte, I got 26). I need to get that number between 128 and 130. I found some articles on encodings; some people say I need to use Encoding 439.
Any ideas?
A char (System.Char) in the CLR environment is an unsigned 16-bit number, a UTF-16 code unit. From the Unicode Standard, Chapter 3, §3.9:
Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. A code unit is also referred to as a code value in the information industry.
In the Unicode Standard, specific values of some code units cannot be used to represent an encoded character in isolation. This restriction applies to isolated surrogate code units in UTF-16 and to the bytes 80–FF in UTF-8. Similar restrictions apply for the implementations of other character encoding standards; for example, the bytes 81–9F, E0–FC in SJIS (Shift-JIS) cannot represent an encoded character by themselves.
Your "ASCII" text is no longer ASCII once it's in the CLR world. ASCII is a 7-bit encoding and the code points 0x00–0x7F are maintained across all Unicode encodings (UTF-8, -16, -24, -32) for the sake of compatability. In the non-Unicode world, 0x80–0xFF have always had multiple character mappings (and don't even look at EBCDIC vs ASCII). Some ASCII implementations provided for parity as well: the high order bit would be set to maintain the desired parity.
Even parity. The high order bit is set to maintain an even number of 'on' bits in the octet.
Odd parity. The high order bit is set to maintain an odd number of 'on' bits in the octet.
No parity. The high order bit is never set.
Presumably you're reading your "ASCII" text using a UTF-8 encoder/decoder (the CLR default). To get the numeric values you expect in your chars, you'll need to read the text using an encoder/decoder suitable for the encoding your text is actually in (Windows-1252? something else?).
A better approach for you, perhaps, would be to read your text octet by octet as binary, using System.IO.FileStream, rather than System.IO.TextReader and its minions. Then you've got the raw octets and you can convert them to text as you wish, or do math on the raw octet values.
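A short sketch of that binary approach; the file name is hypothetical, and FileStream.ReadByte returns the raw octet value 0-255 (or -1 at end of file):

using System;
using System.IO;

// Sketch: read the file as raw octets rather than decoding it to chars.
using (var stream = new FileStream("input.dat", FileMode.Open, FileAccess.Read))
{
    int value;
    while ((value = stream.ReadByte()) != -1)
    {
        if (value >= 128 && value <= 130)
            Console.WriteLine($"Found octet {value}");   // the raw 128-130 values from the question
    }
}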
