Implementing DbDataReader.GetChars() efficiently when underlying data is not UTF-16 - c#

I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16, in fact may be any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all the data for a given cell to the client at any one time unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't force hundreds of megabytes to be allocated to satisfy the request.
I wish to support the random access exposed by GetChars() relatively efficiently. If the first request asks for code units 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all the data in the cell from the server up to the requested code units, but a subsequent request for code units 1 billion + 10 to 1 billion + 20, or even 999,999,000 to 1 billion, should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server, rather than retrieving from the beginning every time. (On a side note, the lack of anything similar to C++'s std::map::lower_bound in the .NET Framework frustrates me quite a bit.) Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 code units: the 'bytesUsed' parameter seems to include source bytes that were merely stashed into the object's internal state, not output as chars. This causes problems when I try to decode starting from, or ending in, the middle of a partial codepoint, and gives me garbage.
So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?

Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2C may mean either the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',', according to the "shift state" -- the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS"-family Japanese encodings, these shift states can appear anywhere in the string and apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized, due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂ . This means a single abstract UTF-16 "text element" can be made up of one or two "base" characters plus some number of diacriticals or other combining characters. All of these make up just one character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E starting at K-E for some "large enough" E, then scan backwards until the first text-element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
In StringInfo.cs.
A bit of a nuisance, but doable.
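As a possible alternative to reverse-engineering it, the public StringInfo.ParseCombiningCharacters method returns the starting index of every text element in a string, which may be enough to locate a safe boundary inside a window of already-decoded chars. A minimal sketch, with illustrative names (the window and limit are assumed to come from the fetch-N+E-starting-at-K-E step above):

using System;
using System.Globalization;

// Given a chunk of decoded chars, find the last text-element boundary at or
// before 'limit', so no surrogate pair or combining sequence gets split.
static int LastTextElementBoundary(string window, int limit)
{
    // Indices where each text element (base char + any combiners) begins.
    int[] starts = StringInfo.ParseCombiningCharacters(window);
    int boundary = 0;
    foreach (int s in starts)
    {
        if (s > limit) break;
        boundary = s;
    }
    return boundary;
}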
For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).Select((enc) => enc.DisplayName).ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice, because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1, a.k.a. ISO 8859-1, for instance. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.

I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
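A rough sketch of how those checkpoints might fit together. All names here are hypothetical: the fetch delegate stands in for the provider's range-read call to the server, and the clone delegate for whatever copy mechanism works on your framework (e.g. the serialization round-trip mentioned above).

using System;
using System.Collections.Generic;
using System.Text;

// One saved position in the cell: after consuming ByteOffset bytes of server
// data, the decoder had emitted CharOffset UTF-16 code units.
sealed class Checkpoint {
    public long CharOffset;
    public long ByteOffset;
    public Decoder DecoderState;   // a copy, so later reads can't disturb it
}

sealed class LongTextCell {
    private readonly List<Checkpoint> _checkpoints = new List<Checkpoint>();
    private readonly Func<long, int, byte[]> _fetchBytes;   // hypothetical server range-read
    private readonly Func<Decoder, Decoder> _cloneDecoder;  // e.g. serialize + deserialize

    public LongTextCell(Encoding encoding,
                        Func<long, int, byte[]> fetchBytes,
                        Func<Decoder, Decoder> cloneDecoder) {
        _fetchBytes = fetchBytes;
        _cloneDecoder = cloneDecoder;
        _checkpoints.Add(new Checkpoint {
            CharOffset = 0, ByteOffset = 0, DecoderState = encoding.GetDecoder() });
    }

    // Decode forward from the closest checkpoint at or before charOffset,
    // skipping chars until charOffset is reached, then filling 'buffer'.
    public int GetChars(long charOffset, char[] buffer, int bufferIndex, int count) {
        Checkpoint start = _checkpoints[0];          // a sorted structure (the missing
        foreach (var c in _checkpoints)              // lower_bound) would find this faster
            if (c.CharOffset <= charOffset && c.CharOffset > start.CharOffset)
                start = c;

        Decoder decoder = _cloneDecoder(start.DecoderState);
        long charPos = start.CharOffset;
        long bytePos = start.ByteOffset;
        var scratch = new char[8192];
        int written = 0;

        while (written < count) {
            byte[] chunk = _fetchBytes(bytePos, 8192);
            if (chunk.Length == 0) break;            // end of cell

            decoder.Convert(chunk, 0, chunk.Length, scratch, 0, scratch.Length,
                            false, out int bytesUsed, out int charsUsed, out _);
            bytePos += bytesUsed;                    // bytesUsed may include bytes still
                                                     // buffered inside the decoder, which is
                                                     // exactly why its copy is stored as well
            for (int i = 0; i < charsUsed; i++, charPos++)
                if (charPos >= charOffset && written < count)
                    buffer[bufferIndex + written++] = scratch[i];

            // Record a new checkpoint so later reads can resume near here.
            _checkpoints.Add(new Checkpoint {
                CharOffset = charPos, ByteOffset = bytePos,
                DecoderState = _cloneDecoder(decoder) });
        }
        return written;
    }
}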

Related

Decompress .Z files (LZW Compression) in C#

I am looking to implement the Rosetta Code LZW decompression method in C# and I need some help. The original code is available here: http://rosettacode.org/wiki/LZW_compression#C.23
I am only focusing on the Decompress method as I "simply" (if only) want to decompress .Z-files in my C# program in .NET 6.
I want my version to take a byte[] as input and return a byte[] (as I am reading the file with .ReadAllBytes() and want to create a new file with the decompressed result).
My problem comes from the fact that in C#, chars are 16-bit (2 bytes), not 8-bit (1 byte). This really messes with my head, as it (in my mind) means that each char should be represented by two bytes. In the code at Rosetta Code, the initial dictionary only contains integer keys 0 -> 255, i.e. one byte, not two. Is this an error in their implementation? What do you think? And how would you go about converting this algorithm to a method with the signature: byte[] Decompress(byte[]) ?
Thanks
No, there is no error, and no, you don't need to convert the algorithm to work on 16-bit values. The usual lossless compression libraries operate on sequences of bytes. Your string of characters would first need to be converted to a sequence of bytes, e.g. to UTF-8: byte[] bs = Encoding.UTF8.GetBytes(str);. UTF-8 would be the right choice, since that encoding gives the compressor the best shot at compressing; in fact, just re-encoding UTF-16 as UTF-8 will almost always shrink the strings, which makes it a good starting point. (Using UTF-16 as the standard for character strings in .NET is a terrible choice for exactly this reason, but I digress.)
Any data you compress would first be serialized to a sequence of bytes for these compressors, if it isn't bytes already, in a way that permits reversing the transformation after decompression on the other end.
Since you are decompressing, someone encoded the characters into a sequence of bytes, so you need to first find out what they did. It may just be a sequence of ASCII characters, which are already one byte per character. Then you would use System.Text.Encoding.ASCII.GetString(bs); to make a character string out of it.
When compressing data we usually talk about sequences of symbols. A symbol in this context might be a byte, a character, or something completely different.
Your example obviously uses characters as its symbols, but there should not be any real problem replacing them with bytes instead. The more difficult part is its use of strings to represent sequences of characters. You will need an equivalent representation of byte sequences (see the sketch after this list) that provides functionality like:
Concatenation/appending
Equality
GetHashCode (for performance)
Immutability (i.e. appending a byte should produce a new sequence, not modify the existing sequence)
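For example, a hypothetical immutable byte-sequence type along those lines (all names are illustrative, not part of any library) might look like this:

using System;

// A hypothetical immutable byte sequence usable as a dictionary key in an
// LZW dictionary; supports appending, equality, and a cached hash code.
sealed class ByteString : IEquatable<ByteString>
{
    private readonly byte[] _bytes;
    private readonly int _hash;

    public ByteString(byte[] bytes) : this(bytes, clone: true) { }

    private ByteString(byte[] bytes, bool clone)
    {
        _bytes = clone ? (byte[])bytes.Clone() : bytes;   // defensive copy keeps it immutable
        int h = 17;
        unchecked { foreach (byte b in _bytes) h = h * 31 + b; }
        _hash = h;
    }

    // Appending produces a new sequence; the existing one is never modified.
    public ByteString Append(byte b)
    {
        var copy = new byte[_bytes.Length + 1];
        Array.Copy(_bytes, copy, _bytes.Length);
        copy[_bytes.Length] = b;
        return new ByteString(copy, clone: false);
    }

    public int Length => _bytes.Length;
    public byte[] ToArray() => (byte[])_bytes.Clone();

    public bool Equals(ByteString other) =>
        other != null && _bytes.AsSpan().SequenceEqual(other._bytes);
    public override bool Equals(object obj) => Equals(obj as ByteString);
    public override int GetHashCode() => _hash;           // cached for dictionary performance
}

A Dictionary<ByteString, int> and a Dictionary<int, ByteString> could then play the roles that the string-keyed dictionaries play in the Rosetta Code version.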
Note that LZW implementations have to agree on some specific details to be compatible, so implementing the posted example may or may not allow you to decode .Z files encoded with another implementation. If your goal is to decode actual files, you may have better luck asking on Software Recommendations for a pre-existing decompression library.

How can I hold a list of string as efficiently (memory) as possible?

I have a huge list of strings. I want to hold this list as memory-efficiently as possible. I tried holding it in a List<string>, but it uses 24 bytes for each string that has only 5 characters, so there must be some overhead somewhere.
Then I tried holding it in a string array. The memory usage was a bit better, but I still have a memory usage problem.
How can I hold a list of strings? I know that "C# reserves 2 bytes for each character", so I would expect a string of 5 characters to take 5 * 2 = 10 bytes. Why does it use 24 bytes instead?
Thank you for your help.
Firstly, note that the difference between a List<string> that was created at the correct size, and a string[] (of the same size) is inconsequential for any non-trivial size; a List<T> is really just a fancy wrapper for T[] with insert/resize/etc capabilities. If you only need to hold the data: T[] is fine, but so is List<T> usually.
As for the string - it isn't C# that reserves anything - it is .NET that defines that a string is an object, which is internally a length (int) plus memory for char data, 2 bytes per char. But: objects in .NET have object headers, padding/alignment, etc - and importantly: a minimum size. So yes, they take more memory than just the raw data you're trying to represent.
If you only need the actual data, you could perhaps store the data not as string, but as raw memory - either a simple large byte[] or byte*, or as a twinned pair of int[]/int* (for lengths and/or offsets into the page) and a char[]/char* (for the actual character data), or a byte[]/byte* if you can work with encoded data (i.e. you're mainly interested in IO work). However, working with such a form will be hugely inconvenient - virtually no common APIs will want to play with you unless you are talking in string. There are some APIs that accept raw byte/char data, but they are largely the encoder/decoder APIs, and some IO APIs. So again: unless that's what you're doing: it won't end well. Very recently, some Span<char> / Span<byte> APIs have appeared which would make this slightly less inconvenient (if you can use the latest .NET Core builds, etc), but: I strongly suspect that in most common cases you're just going to have to accept the string overhead and live with it.
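As a rough illustration of that packed layout (purely a sketch, with illustrative names), the character data for all the strings can live in one shared char[] with an int[] of offsets alongside it:

using System;
using System.Collections.Generic;

// Packs many small strings into a single char[] plus an offset table,
// avoiding the per-object overhead of individual string instances.
sealed class PackedStrings
{
    private readonly char[] _chars;
    private readonly int[] _offsets;   // _offsets[i] .. _offsets[i+1] delimit item i

    public PackedStrings(IReadOnlyList<string> items)
    {
        _offsets = new int[items.Count + 1];
        int total = 0;
        for (int i = 0; i < items.Count; i++)
        {
            _offsets[i] = total;
            total += items[i].Length;
        }
        _offsets[items.Count] = total;

        _chars = new char[total];
        for (int i = 0; i < items.Count; i++)
            items[i].CopyTo(0, _chars, _offsets[i], items[i].Length);
    }

    public int Count => _offsets.Length - 1;

    // Materializes a string only when one is actually needed.
    public string Get(int index) =>
        new string(_chars, _offsets[index], _offsets[index + 1] - _offsets[index]);

    // On recent frameworks a span view avoids even that allocation.
    public ReadOnlySpan<char> GetSpan(int index) =>
        new ReadOnlySpan<char>(_chars, _offsets[index], _offsets[index + 1] - _offsets[index]);
}

Every lookup either allocates a fresh string or, with spans, hands back a view; either way the thousands of small object headers disappear, which is where the 24-bytes-per-string overhead comes from.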
Minimum size of any object in 64-bit .NET is 24 bytes.
In 32-bit it's a bit smaller, but there are always at least 8 bytes for the object header, and here we'd expect the string to store its length (4 bytes). 8 + 4 + 10 = 22. I'm guessing it also wants/needs all objects to be 4-byte aligned. So if you're storing them as objects, you're not going to get a much smaller representation.
If it's all 7-bit ASCII type characters, you could store them as arrays of bytes but each array would still take up some space.
Your best route (I appreciate this bit is more comment like) is to come up with different processing algorithms that don't require them to all be in memory at the same time in the first place.

Hash function to obtain a limited length result

I need to hash a number (about 22 digits) and the result length must be less than 12 characters. It can be a number or a mix of characters, and must be unique. (The number entered will be unique too).
For example, if the number entered is 000000000000000000001, the result should be something like 2s5As5A62s.
I looked at the typical ones, like MD5, SHA-1, etc., but their results are too long.
The problem with your question is that the input is larger than the output, yet you require uniqueness. If you're expecting a unique output, it won't happen. The reason is that if you have an input space of, say, 22 numeric digits (10^22 possibilities) and an output space of hexadecimal digits with a length of 11 digits (16^11 possibilities), you end up with more input possibilities than output possibilities.
You would need an output space of 19 hexadecimal digits and a perfect one-to-one function; otherwise you will have collisions pretty often (more than 50% of the time). I assume this is something you do not want, but you did not specify.
Since what you want cannot be done, I would suggest rethinking your design or using a checksum such as the cyclic redundancy check (CRC). CRC-64 will produce a 64-bit output, and when encoded with any Base64 algorithm it will give you something along the lines of what you want. This does not provide cryptographic strength like SHA-1, so it should never be used in anything related to information security.
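A sketch of that suggestion, assuming the System.IO.Hashing NuGet package for the CRC-64 implementation (any other CRC-64 routine would do just as well):

using System;
using System.Text;
using System.IO.Hashing;   // NuGet package: System.IO.Hashing (assumed available)

string input = "0000000000000000000001";                   // the 22-digit value as text
byte[] crc = Crc64.Hash(Encoding.ASCII.GetBytes(input));   // 8-byte (64-bit) checksum
string shortKey = Convert.ToBase64String(crc);             // 12 chars, one of which is '=' padding
Console.WriteLine(shortKey.TrimEnd('='));                  // 11 characters after trimming the padding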
However, if you were able to change your criteria to allow for long hash outputs, then I would strongly suggest you look at SHA-512, as it provides high-quality outputs with an extremely low chance of duplication; by low I mean that no two inputs have yet been found that produce the same hash in the history of the algorithm.
If both of these suggestions still are not great for you, then your last alternative is probably to just Base64-encode the input data. It will essentially utilize the standard English alphabet in the best way possible to represent your data, thus reducing the number of characters as much as possible while retaining a complete representation of the input data. This is not a hash function, but simply a method for encoding binary data.
Why not take MD5 or SHA-N, then re-encode it to Base64 (or base-whatever) and keep only 12 characters of it?
NB: in any case the hash will never be unique (but it can offer a low collision probability).
You can't use a hash if it has to be unique.
You need about 74 bits to store such a number. If you convert it to base-64 it will be about 12 characters.
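For illustration (a sketch only): packing the decimal digits into their minimal binary form and Base64-encoding that is fully reversible, although standard Base64 works in whole characters and 4-character blocks, so the result comes out at roughly 12 to 16 characters depending on the value and padding.

using System;
using System.Numerics;

string digits = "1234567890123456789012";                  // a 22-digit example
byte[] raw = BigInteger.Parse(digits).ToByteArray();        // 9-10 bytes for 22 decimal digits
string encoded = Convert.ToBase64String(raw);               // roughly 12-16 Base64 characters
Console.WriteLine(encoded);

// Fully reversible, unlike a hash (though leading zeros are not preserved,
// since the input is treated as a number):
BigInteger back = new BigInteger(Convert.FromBase64String(encoded));
Console.WriteLine(back);                                     // 1234567890123456789012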
Can you elaborate on what your requirement is for the hashing? Do you need to make sure the result is diverse? (i.e. not 1 = a, 2 = b)
Just thinking out loud, and a little bit laterally, but could you not apply run-length encoding to your number, treating it as data you want to compress? You could then use the Base64 version of the compressed result.

How do I determine why Enyim memcache is returning false when storing an item?

How can I determine WHY Enyim returned false from the following call:
cache.Store(Enyim.Caching.Memcached.StoreMode.Set, key, value);
Other items are getting stored fine, so it doesn't seem to be an issue with a connection to the server. The object does not appear to be greater than 1 MB.
So how can I determine what is causing the false?
One other thing to check is that the whole object graph you're storing is [Serializable]. If it isn't then Enyim will throw a serialization exception, which will tell you which type needs to be marked as serializable. Follow the instructions at https://github.com/enyim/EnyimMemcached/wiki/Configure-Logging to enable logging.
One possibility is that your key may include illegal characters. Typically the very low end of the ASCII range can cause this -- I believe 0x30 and above are certainly safe, and possibly 0x20 and higher as well. Referencing an ASCII character chart, you can see 0x00 through 0x1F are largely special characters; 0x20 through 0x2F are "normal" characters, but some reference material mentions that they may be used as control characters as well.
This issue caused me some problems; I've solved it by building a highly-unique key, with little regard for length, then generating an MD5 checksum of the key. The MD5 sum guarantees a minimal risk of key-collision, safe characters, and a shorter length than the actual key.
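A sketch of that approach (the method name is illustrative): build a descriptive logical key without worrying about length or characters, then use its MD5 hash, hex-encoded, as the actual memcached key.

using System;
using System.Security.Cryptography;
using System.Text;

static string ToCacheKey(string logicalKey)
{
    using (var md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(logicalKey));
        // 32 hex characters: all "safe", and well under memcached's 250-character key limit.
        return BitConverter.ToString(hash).Replace("-", "");
    }
}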
Memcached limits the size of objects to under 1 MB by default. Check the configuration on your memcached server. The limit is configurable, but changing it is not recommended, as it will affect the overall performance of the server itself.
We totally wrapped the Enyim client to make static methods that did the right connection pooling. We also did two things in our wrapper code:
1) Check that the key is <= 250 characters and contains valid characters
2) Check that the length is < 1 MB. We check the length on strings and on byte[].
We also filed an enhancement request.
It is: http://www.couchbase.org/issues/browse/NCBC-10
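A sketch of the two checks listed above, using memcached's documented defaults (250-character keys, values under 1 MB); the method name is illustrative:

using System;
using System.Linq;

static void ValidateForMemcached(string key, byte[] value)
{
    if (key.Length > 250)
        throw new ArgumentException("memcached keys are limited to 250 characters.", nameof(key));
    if (key.Any(c => c <= ' ' || c == (char)127))
        throw new ArgumentException("memcached keys must not contain spaces or control characters.", nameof(key));
    if (value.Length >= 1024 * 1024)
        throw new ArgumentException("memcached rejects values of 1 MB or more by default.", nameof(value));
}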

Binary encoding for low bandwidth connections?

In my application I have a simple XML formatted file containing structured data. Each data entry has a data type and a value. Something like
<entry>
<field type="integer">5265</field>
<field type="float">34.23</field>
<field type="string">Jorge</field>
</entry>
Now, this formatting allows us to have the data in a human-readable form in order to check various values, as well as to transform and read the file easily for interoperability.
The problem is that we have a very low-bandwidth connection (about 1000 bps, yeah, that's bits per second), so XML is not exactly the best format in which to transmit the data. I'm looking for ways to encode the XML file into a binary equivalent that is more suitable for transmission.
Do you know of any good tutorial on the matter?
Additionally, we compress the data before sending it (simple GZIP), so I'm a little concerned about losing compression ratio if I go binary. Would going binary hurt the compressed size so badly that it would be a bad idea to try to optimize it in the first place?
Note: This is not premature optimization; it's a requirement. 1000 bps is a really low bandwidth, so every byte counts.
Note 2: The application is written in C#, but any tutorial will do.
Try using ASN.1. The packed encoding rules should yield a pretty decently compressed form on their own, and the XML encoding rules should yield something equivalent to your existing XML.
Also, consider using 7zip instead of gzip.
You may want to investigate Google Protocol Buffers. They produce far smaller payloads than XML, though not necessarily the smallest payloads possible; whether they're acceptable for your use depends on a lot of factors. They're certainly easier than devising your own scheme from scratch, though.
They've been ported to C#/.NET and seem to work quite well there in my (thus far, somewhat limited) experience. There's a package at that link to integrate somewhat with VS and automatically create C# classes from the .proto files, which is very nice.
I'd dump the XML for transmission anyway (you could deconstruct it at the sender and reconstruct it at the receiver; in Java you could use a custom Input/OutputStream to do the work neatly). Go binary with fixed fields: data type, length, data.
Say if you have 8 or fewer datatypes, encode that in three bits. Then the length, e.g., as an 8-bit value (0..255).
Then for each datatype, encode differently.
Integer/Float: BCD - 4 bits per digit, use 15 as decimal point. Or just the raw bits themselves (might want different datatypes for 8-bit int, 16-bit int, 32-bit int, 64-bit long, 32-bit float, 64-bit double).
String - can you get away with 7-bit ASCII instead of 8? Etc. All upper-case letters + digits and some punctuation could get you down to 6-bits per character.
You might want to prefix it all with the total number of fields to transmit. And perform a CRC or 8/10 encoding if the transport is lossy, but hopefully that's already handled by the system.
However don't underestimate how well XML text can be compressed. I would certainly do some calculations to check how much compression is being achieved.
Anything which is efficient at converting the plaintext form to binary is likely to make the compression ratio much worse, yes.
However, it could well be that an XML-optimised binary format will be better than the compressed text anyway. Have a look at the various XML Binary formats listed on the Wikipedia page. I have a bit of experience with WBXML, but that's all.
As JeeBee says, a custom binary format is likely to be the most efficient approach, to be honest. You can try to gzip it, but the results of that will depend on what the data is like in the first place.
And yes, as Skirwan says, Protocol Buffers are a fairly obvious candidate here - but you may want to think about custom floating-point representations, depending on what your actual requirements are. If you only need four significant figures (and you know the scale), then sending a two-byte integer may well be the best bet.
The first thing to try is gzip; beyond that, I would try protobuf-net - I can think of a few ways of encoding that quite easily, but it depends how you are building the XML, and whether you mind a bit of code to shim between the two formats. In particular, I can imagine representing the different data types as either 3 optional fields on the same type, or 3 different subclasses of an abstract contract.
[ProtoContract]
class EntryItem {
    [ProtoMember(1)]
    public int? Int32Value {get;set;}
    [ProtoMember(2)]
    public float? SingleValue {get;set;}
    [ProtoMember(3)]
    public string StringValue {get;set;}
}
[ProtoContract]
class Entry {
    [ProtoMember(1)]
    public List<EntryItem> Items {get; set;}
}
With test:
[TestFixture]
public class TestEntries {
    [Test]
    public void ShowSize() {
        Entry e = new Entry {
            Items = new List<EntryItem> {
                new EntryItem { Int32Value = 5265 },
                new EntryItem { SingleValue = 34.23F },
                new EntryItem { StringValue = "Jorge" }
            }
        };
        var ms = new MemoryStream();
        Serializer.Serialize(ms, e);
        Console.WriteLine(ms.Length);
        Console.WriteLine(BitConverter.ToString(ms.ToArray()));
    }
}
Results (21 bytes)
0A-03-08-91-29-0A-05-15-85-EB-08-42-0A-07-1A-05-4A-6F-72-67-65
I would look into configuring your app to be responsive to smaller XML fragments; in particular ones which are small enough to fit in a single network packet.
Then arrange your data to be transmitted in order of importance to the user so that they can see useful stuff and maybe even start working on it before all the data arrives.
Late response -- at least it comes before year's end ;-)
You mentioned Fast Infoset. Did you try it? It should give you the best results in terms of both compactness and performance. Add GZIP compression and the final size will be really small, and you will have avoided the processing penalties of compressing XML. WCF-Xtensions offers a Fast Infoset message encoding and GZIP/DEFLATE/LZMA/PPMs compression too (works on .NET/CF/SL/Azure).
Here's the pickle you're in, though: you're compressing things with Gzip. Gzip is horrible on plain text until you hit roughly the length of the total concatenated works of Dickens, or about 1200 lines of code, because of the overhead of the dictionary and other structures Gzip uses for compression.
1 kbps is fine for the task of sending 7,500 chars (it'll take about a minute given optimal conditions), and for <300 chars you should be fine. However, if you're really that concerned, you're going to want to compress this down for brevity. Here's how I do things at this scale:
T[ype]L[ength][data data data]+
That is, T represents the TYPE: say 0x01 for INT, 0x02 for STRING, etc. LENGTH is just a one-byte count, so 0xFF = 255 chars long, etc. An example data packet would look like:
0x01 0x01 0x3F 0x01 0x01 0x2D 0x02 0x06 H E L L O 0x00
This says: I have an INT, length 1, of value 0x3F; an INT, length 1, of value 0x2D; then a STRING, length 6, of a null-terminated "HELLO" (ASCII assumed). Learn the wonders of System.Text.Encoding.UTF8.GetBytes, BitConverter, and ByteConverter.
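A sketch of writing that exact packet in C#; the type codes and layout are just the ones from the example above, not any established format:

using System;
using System.IO;
using System.Text;

var ms = new MemoryStream();
using (var w = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true))
{
    // INT, length 1, value 0x3F
    w.Write((byte)0x01); w.Write((byte)0x01); w.Write((byte)0x3F);
    // INT, length 1, value 0x2D
    w.Write((byte)0x01); w.Write((byte)0x01); w.Write((byte)0x2D);
    // STRING, length 6, null-terminated "HELLO"
    byte[] hello = Encoding.ASCII.GetBytes("HELLO\0");
    w.Write((byte)0x02); w.Write((byte)hello.Length); w.Write(hello);
}
Console.WriteLine(BitConverter.ToString(ms.ToArray()));
// 01-01-3F-01-01-2D-02-06-48-45-4C-4C-4F-00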
For reference, see this page for a sense of just how much 1 kbps is. Really, for the size you're dealing with, you should be fine.
