I've inspected ASCIIEncoding's GetByteCount method. It does lengthy calculations rather than simply returning String.Length, which doesn't make sense to me. Do you have any idea why?
EDIT: I've just tried reproducing this, and I can't currently force an ASCIIEncoding instance to have a different replacement. Instead, I'd have to use Encoding.GetEncoding to get a mutable one. So for ASCIIEncoding, I agree... but for other implementations where IsSingleByte returns true, you'd still have the potential problem below.
Consider trying to get the byte count of a string which doesn't just contain ASCII characters. The encoding has to take the EncoderFallback into account... which could do any number of things, including increasing the count by an indeterminate amount.
It could be optimized for the case where the encoder fallback is a "default" one which just replaces non-ASCII characters with "?" though.
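For illustration, such a fast path might look something like the sketch below. This is a hypothetical example, not the framework's actual code; it simply checks for the stock single-character replacement fallback and otherwise falls through to the normal calculation.
using System.Text;

static int FastAsciiByteCount(Encoding encoding, string text)
{
    // Stock fallback: every char - convertible or not - becomes exactly one byte
    if (encoding.IsSingleByte
        && encoding.EncoderFallback is EncoderReplacementFallback fallback
        && fallback.DefaultString.Length == 1)
    {
        return text.Length;
    }
    return encoding.GetByteCount(text); // general case: honour whatever the fallback does
}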
Further edit: I've just tried to confuse this with a surrogate pair, hoping that it would be represented by a single question mark. Unfortunately not:
string text = "x\ud800\udc00y";
Console.WriteLine(text.Length); // Prints 4
Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // Still prints 4!
Interestingly, the Mono runtime doesn't seem to include that behaviour:
// Get the number of bytes needed to encode a character buffer.
public override int GetByteCount (char[] chars, int index, int count)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    if (index < 0 || index > chars.Length) {
        throw new ArgumentOutOfRangeException ("index", _("ArgRange_Array"));
    }
    if (count < 0 || count > (chars.Length - index)) {
        throw new ArgumentOutOfRangeException ("count", _("ArgRange_Array"));
    }
    return count;
}
// Convenience wrappers for "GetByteCount".
public override int GetByteCount (String chars)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    return chars.Length;
}
and further down
[CLSCompliantAttribute(false)]
[ComVisible (false)]
public unsafe override int GetByteCount (char *chars, int count)
{
    return count;
}
For a multibyte character encoding like UTF-8, this method makes sense, because characters are stored with 1 to 4 bytes. I imagine that method also applies to a fixed-size encoding like ASCII, where every character fits in 7 bits. In the actual implementation, however, "aaaaaaaa" would be 8 bytes, as ASCII characters are stored in 1 byte (8 bits), so the length hack would work in the best-case scenario.
Previous versions of .NET Framework allowed spoofing by ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.
Source: MSDN
I understand your question as: does a worst-case scenario exist for the length hack?
Encoding ae = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback("[lol]"),
    new DecoderReplacementFallback("[you broke Me]"));
Console.WriteLine(ae.GetByteCount("õäöü"));
This will return 20, as the string "õäöü" contains 4 characters that are all outside the "us-ascii" character set limits (U+0000 to U+007F), so after the encoder fallback runs, the text will be "[lol][lol][lol][lol]".
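For comparison, the stock Encoding.ASCII keeps its default single-character "?" replacement, so the same four characters count as one byte each:
Console.WriteLine(Encoding.ASCII.GetByteCount("õäöü")); // 4
Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("õäöü"))); // ????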
I have this name:
string name = "Centos 64 bit";
I want to generate a 168-bit (or whatever is feasible) uid from this name, and to be able to get the name back from this id, and vice versa.
I tried GetHashCode() without success.
Result would be something like:
Centos 64 bit (=) 91C47A57-E605-4902-894B-74E791F37C1F
One solution I would recommend is to use a hash function and something like a dictionary. So, get a hash - say SHA256 - of your input string and truncate it to 168 bits (21 bytes).
Now, to go back from a uid to the original string, you would need a dictionary which stores pairs like (input_string, string_uid), where input_string is the original string and string_uid is the uid generated for input_string using the method from the first paragraph.
Using this dictionary you can easily get back to the original input string from string_uid.
This is one way - assuming, of course, that you are allowed to store mappings between string and uid.
The hash normally gives you the result as a byte array. Converting this byte array to a string is a separate step.
For example, if you have 10 bytes representing integers in the range [0, 255], encoding the byte array as a hex string will take 20 characters.
So the next question is: do you want the length of the uid as a string to be 21 bytes?
Because that would mean the hash output could only be around 10 bytes, which would significantly weaken the collision resistance of the output.
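A minimal sketch of that hash-plus-dictionary approach (the class and member names here are just illustrative):
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class UidRegistry
{
    private readonly Dictionary<string, string> uidToName = new Dictionary<string, string>();

    public string GetUid(string name)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(name));
            // Keep 21 bytes = 168 bits; hex-encode them for readability (42 chars)
            string uid = BitConverter.ToString(hash, 0, 21).Replace("-", "");
            uidToName[uid] = name; // remember the mapping for the reverse lookup
            return uid;
        }
    }

    public string GetName(string uid)
    {
        return uidToName.TryGetValue(uid, out var name) ? name : null;
    }
}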
What you want is not achievable. You need to store a lookup table of hash to name. Since you don't give more details of your system, it's hard to say whether that has to be persistent or in memory. If in memory, just use a dictionary of string -> string.
Here you go sir:
public byte[] GetUID(string name)
{
    var bytes = Encoding.ASCII.GetBytes(name);
    if (bytes.Length > 21)
        throw new ArgumentException("Value is too long to be used as an ID");
    var uid = new byte[21];
    Buffer.BlockCopy(bytes, 0, uid, 0, bytes.Length);
    return uid; // return the zero-padded 21-byte buffer, not the original bytes
}
public string GetName(byte[] UID)
{
    int length = UID.Length;
    for (int i = 0; i < UID.Length; i++)
    {
        if (UID[i] == 0)
        {
            length = i;
            break;
        }
    }
    return Encoding.ASCII.GetString(UID, 0, length);
}
Caveats: it works for strings up to 21 characters in length that only use ASCII characters (no Unicode support) and it doesn't encrypt the string in any way, but I believe it meets your requirements.
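For example, a round trip with the two methods above looks like this:
byte[] uid = GetUID("Centos 64 bit"); // 21 bytes, zero-padded ASCII
string name = GetName(uid);           // "Centos 64 bit"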
I need to limit the length of a byte[] encoded with the UTF-8 encoding. E.g. the byte[] length must be less than or equal to 1000. First I wrote the following code:
int maxValue = 1000;
if (text.Length > maxValue)
text = text.Substring(0, maxValue);
var textInBytes = Encoding.UTF8.GetBytes(text);
This works fine if the string only uses ASCII characters, because it's 1 byte per character. But if the characters go beyond that, it could be 2, 3, or even 4 bytes per character. That would be a problem with the above code. So to fix that problem I wrote this:
List<byte> textInBytesList = new List<byte>();
char[] textInChars = text.ToCharArray();
for (int a = 0; a < textInChars.Length; a++)
{
    byte[] valueInBytes = Encoding.UTF8.GetBytes(textInChars, a, 1);
    if ((textInBytesList.Count + valueInBytes.Length) > maxValue)
        break;
    textInBytesList.AddRange(valueInBytes);
}
I haven't tested the code, but I'm sure it will work as I want. However, I don't like the way it is done. Is there any better way to do this? Something I'm missing or not aware of?
Thank you.
My first posting on Stack Overflow, so be gentle! This method should take care of things pretty quickly for you.
public static byte[] GetBytes(string text, int maxArraySize, Encoding encoding) {
    if (string.IsNullOrEmpty(text)) return null;

    int tail = Math.Min(text.Length, maxArraySize);
    int size = encoding.GetByteCount(text.Substring(0, tail));
    while (tail >= 0 && size > maxArraySize) {
        size -= encoding.GetByteCount(text.Substring(tail - 1, 1));
        --tail;
    }
    return encoding.GetBytes(text.Substring(0, tail));
}
It's similar to what you're doing, but without the added overhead of the List or having to count from the beginning of the string every time. I start from the other end of the string, and the assumption is, of course, that all characters must be at least one byte. So there's no sense in starting to iterate down through the string any farther in than maxArraySize (or the total length of the string).
Then you can call the method like so..
byte[] bytes = GetBytes(text, 1000, Encoding.UTF8);
Consider the following code (.Dump() in LINQPad simply writes to the console):
var s = "𤭢"; //3 byte code point. 4 byte UTF32 encoded
s.Dump();
s.Length.Dump(); // 2
TextReader sr = new StringReader("𤭢");
int i;
while((i = sr.Read()) >= 0)
{
    // notice here we are yielded two
    // 2 byte values, but as ints
    i.ToString("X").Dump(); // D852, DF62
}
Given the outcome above, why does TextReader.Read() return an int and not a char. Under what circumstances might it read a value greater than 2 bytes?
TextReader.Read() will never read more than 2 bytes' worth of character data; however, it returns -1 to mean "no more characters to read" (end of string). Therefore, its return type needs to go up to Int32 (4 bytes) from Char (2 bytes) to be able to express the full Char range plus -1.
TextReader.Read() probably uses int to allow returning -1 when reaching the end of the text:
The next character from the text reader, or -1 if no more characters are available. The default implementation returns -1.
And, the Length is 2 because Strings are UTF-16 sequences, which require surrogate pairs to represent code points above U+FFFF.
{ 0xD852, 0xDF62 } <=> U+24B62 (𤭢)
You can get the UTF-32 code point from them with Char.ConvertToUtf32():
Char.ConvertToUtf32("𤭢", 0).ToString("X").Dump(); // 24B62
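If you want to work in whole code points rather than UTF-16 code units, you can pair the surrogates up yourself. A rough sketch (it assumes well-formed input, i.e. every high surrogate is immediately followed by its low surrogate):
TextReader reader = new StringReader("x𤭢y");
int i;
while ((i = reader.Read()) >= 0)         // -1 signals end of input
{
    char c = (char)i;
    if (char.IsHighSurrogate(c))
    {
        // combine the pair into a single UTF-32 code point
        int codePoint = char.ConvertToUtf32(c, (char)reader.Read());
        codePoint.ToString("X").Dump();  // 24B62
    }
    else
    {
        i.ToString("X").Dump();          // 78, 79 for 'x' and 'y'
    }
}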
Examples (asterisks next to odd behavior):
[Fact]
public void BigInteger_ToString_behavior_is_odd()
{
    writeHex(new BigInteger(short.MaxValue));      // 7fff
    writeHex(new BigInteger(short.MaxValue) + 1);  // 08000 **
    writeHex(new BigInteger(ushort.MaxValue));     // 0ffff **
    writeHex(new BigInteger(ushort.MaxValue) + 1); // 10000
    writeHex(new BigInteger(int.MaxValue));        // 7fffffff
    writeHex(new BigInteger(int.MaxValue) + 1);    // 080000000 **
    writeHex(new BigInteger(uint.MaxValue));       // 0ffffffff **
    writeHex(new BigInteger(uint.MaxValue) + 1);   // 100000000
    writeHex(new BigInteger(long.MaxValue));       // 7fffffffffffffff
    writeHex(new BigInteger(long.MaxValue) + 1);   // 08000000000000000 **
    writeHex(new BigInteger(ulong.MaxValue));      // 0ffffffffffffffff **
    writeHex(new BigInteger(ulong.MaxValue) + 1);  // 10000000000000000
}

private static void writeHex(BigInteger value)
{
    Console.WriteLine(value.ToString("x"));
}
Is there a reason for this?
How would I remove this extra zero? Can I just check if the string has a zero at the start and, if so, remove it? Any corner cases to think about?
Without a leading zero, a hex string whose first digit has the high bit set could be read as a negative number of the same bit width in two's complement. Putting a leading zero ensures that the high bit isn't set, so it can't possibly be interpreted as a negative number.
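You can see that interpretation at work when parsing hex back into a BigInteger with NumberStyles.HexNumber: without a leading zero, a string whose first digit is 8-f parses as a negative value.
using System.Globalization;
using System.Numerics;

var a = BigInteger.Parse("ffff", NumberStyles.HexNumber);  // -1 (high bit set, read as negative)
var b = BigInteger.Parse("0ffff", NumberStyles.HexNumber); // 65535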
Go ahead and remove the first character, if it's a zero, unless it's the only character in the string.
For my part, I'm not sure why this is done, but as you mentioned, converting to a string and then removing the leading zero should do the trick.
IMO positive values should include a leading zero, and I believe that is why you see those in your outputs.
To avoid it, maybe you could specify a specific format string for the output.
It seems that BigInteger with the x format specifier wants to write out a byte at a time.
See this example:
writeHex(new BigInteger(15));
0f
As such, feel free to remove any padded '0' at the beginning:
private static void writeHex(BigInteger value)
{
    // Trim the padded leading '0', but keep a single "0" for the value zero
    string hex = value.ToString("x").TrimStart('0');
    Console.WriteLine(hex.Length > 0 ? hex : "0");
}
Is there a reason for this?
A good reason for them to implement it this way is that it is still correct, and probably performs better in the tight loop they use to implement ToString (avoiding branches).
From reflector, the implementation looks like this:
StringBuilder builder = new StringBuilder();
byte[] buffer = value.ToByteArray();
// ... A bunch of pre-amble for special cases here,
// though obviously not including the high byte being < 0x10. Then:
while (index > -1)
{
    builder.Append(buffer[index--].ToString(str, info));
}
Edit:
Well, Ben brought up a good point. Some of those examples you gave output an odd number of nibbles, so I guess the implementation is just quirky :)
You can still use the string.TrimStart function to get around that problem.
No reason?!
Perhaps this is simply just a quirk! Remember, the base class libraries were developed by developers, i.e. humans! You can expect the odd quirk to creep into them.
It might be interesting to note that the byte[] returned by the method ToByteArray also contains a leading zero byte in your example cases.
So, to answer your question literally, your examples are formatted with a leading zero because the byte array representing the number contains a leading zero and it's that array that's spit out in hexadecimal.
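You can see that zero byte directly; ToByteArray returns the bytes in little-endian order, so the zero most-significant byte is the last element:
byte[] bytes = new BigInteger(ushort.MaxValue).ToByteArray();
Console.WriteLine(BitConverter.ToString(bytes)); // FF-FF-00 - the trailing 00 is the "leading" zero byte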
Is it possible to generate short GUID like in YouTube (N7Et6c9nL9w)?
How can it be done? I want to use it in web app.
You could use Base64:
string base64Guid = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
That generates a string like E1HKfn68Pkms5zsZsvKONw==. Since a GUID is always 128 bits, you can omit the == that you know will always be present at the end and that will give you a 22 character string. This isn't as short as YouTube though.
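In code, dropping the padding is just a Substring call (Base64 of 16 bytes is always 24 characters ending in "=="):
string base64Guid = Convert.ToBase64String(Guid.NewGuid().ToByteArray())
                           .Substring(0, 22); // drop the trailing "==" padding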
URL Friendly Solution
As mentioned in the accepted answer, base64 is a good solution but it can cause issues if you want to use the GUID in a URL. This is because + and / are valid base64 characters, but have special meaning in URLs.
Luckily, base64 leaves some URL-friendly characters unused, so we can substitute them in. Here is a more complete answer:
public string ToShortString(Guid guid)
{
    var base64Guid = Convert.ToBase64String(guid.ToByteArray());
    // Replace URL unfriendly characters
    base64Guid = base64Guid.Replace('+', '-').Replace('/', '_');
    // Remove the trailing ==
    return base64Guid.Substring(0, base64Guid.Length - 2);
}

public Guid FromShortString(string str)
{
    str = str.Replace('_', '/').Replace('-', '+');
    var byteArray = Convert.FromBase64String(str + "==");
    return new Guid(byteArray);
}
Usage:
Guid guid = Guid.NewGuid();
string shortStr = ToShortString(guid);
// shortStr will look something like 2LP8GcHr-EC4D__QTizUWw
Guid guid2 = FromShortString(shortStr);
Assert.AreEqual(guid, guid2);
EDIT:
Can we do better? (Theoretical limit)
The above yields a 22 character, URL friendly GUID.
This is because a GUID uses 128 bits, so representing it in base64 requires 128 / log2(64) = 128 / 6 characters, which is 21.33, which rounds up to 22.
There are actually 66 URL friendly characters (we aren't using . and ~). So theoretically, we could use base66 to get 128 / log2(66) characters, which is 21.17, which also rounds up to 22.
So this is optimal for a full, valid GUID.
However, a GUID uses 6 bits to indicate the version and variant, which in our case are constant. So we technically only need 122 bits, which in both bases rounds up to 21 characters (122 / 6 ≈ 20.33). So with more manipulation, we could remove another character. This requires wrangling the bits out, however, so I leave this as an exercise to the reader.
How does youtube do it?
YouTube IDs use 11 characters. How do they do it?
A GUID uses 122 bits, which guarantees collisions are virtually impossible. This means you can generate a random GUID and be certain it is unique without checking. However, we don't need so many bits for just a regular ID.
We could use a smaller ID. If we use 66 bits or less, we have a higher risk of collision, but can represent this ID with 11 characters (even in base64). One could either accept the risk of collision, or test for a collision and regenerate.
With 122 bits (a regular GUID), you would have to generate roughly 3×10^17 GUIDs to have a 1% chance of collision (by the birthday approximation, n ≈ √(2 · 2^122 · 0.01)).
With 66 bits, you would have to generate roughly 1.2×10^9, or about 1 billion, IDs to have a 1% chance of collision. That is not that many IDs.
My guess is YouTube uses 64 bits (which is more memory friendly than 66 bits), and checks for collisions to regenerate the ID if necessary.
If you want to abandon GUIDs in favor of smaller IDs, here is code for that:
class IdFactory
{
    private Random random = new Random();

    public int CharacterCount { get; }

    public IdFactory(int characterCount)
    {
        CharacterCount = characterCount;
    }

    public string Generate()
    {
        // bitCount = characterCount * log(targetBase) / log(2)
        var bitCount = 6 * CharacterCount;
        var byteCount = (int)Math.Ceiling(bitCount / 8f);
        byte[] buffer = new byte[byteCount];
        random.NextBytes(buffer);
        string guid = Convert.ToBase64String(buffer);
        // Replace URL unfriendly characters
        guid = guid.Replace('+', '-').Replace('/', '_');
        // Trim characters to fit the count
        return guid.Substring(0, CharacterCount);
    }
}
Usage:
var factory = new IdFactory(characterCount: 11);
string guid = factory.Generate();
// guid will look like Mh3darwiZhp
This uses a 64-character alphabet, which is not optimal, but requires much less code (since we can reuse Convert.ToBase64String).
You should be a lot more careful of collisions if you use this.
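For example, "test for a collision and regenerate" can be a simple loop around the factory above. Here is a sketch using an in-memory HashSet; in a real application this check would go against your data store instead:
using System.Collections.Generic;

class UniqueIdGenerator
{
    private readonly IdFactory factory = new IdFactory(characterCount: 11);
    private readonly HashSet<string> issued = new HashSet<string>();

    public string Next()
    {
        while (true)
        {
            string id = factory.Generate();
            if (issued.Add(id))   // Add returns false if the id was already handed out
                return id;
        }
    }
}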
A string that short is not a GUID. Given that, you could use the hexadecimal representation of an int, which gives you an 8-char string.
You can use an id you might already have. You can also use .GetHashCode against different simple types and there you have a different int. You can also XOR different fields. And if you are into it, you might even use a random number - hey, you have well over 2,000,000,000 possible values if you stick to the positives ;)
It's not a GUID but rather an auto-incremented unique alphanumeric string
Please see the following code where I am trying to do the same. It uses the TotalMilliseconds since the Unix epoch and a valid set of characters to generate a unique string that is incremented with each passing millisecond.
The other way is to use numeric counters, but that is expensive to maintain and creates a series where you can add or subtract values to guess the previous or the next unique string in the system, and we don't want that to happen.
Do remember:
This will not be globally unique but unique to the instance where it's defined
It uses Thread.Sleep() to handle multithreading issues
public string YoutubeLikeId()
{
    Thread.Sleep(1); // make everything unique while looping
    long ticks = (long)(DateTime.UtcNow
        .Subtract(new DateTime(1970, 1, 1, 0, 0, 0, 0))).TotalMilliseconds; // EPOCH
    char[] baseChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        .ToCharArray();
    int i = 32;
    char[] buffer = new char[i];
    int targetBase = baseChars.Length;
    do
    {
        buffer[--i] = baseChars[ticks % targetBase];
        ticks = ticks / targetBase;
    }
    while (ticks > 0);
    char[] result = new char[32 - i];
    Array.Copy(buffer, i, result, 0, 32 - i);
    return new string(result);
}
The output will come something like
XOTgBsu
XOTgBtB
XOTgBtR
XOTgBtg
XOTgBtw
XOTgBuE
Update: The same can be achieved from Guid as
var guid = Guid.NewGuid();
guid.ToString("N");
guid.ToString("N").Substring(0,8);
guid.ToString("N").Substring(8,4);
guid.ToString("N").Substring(12,4);
guid.ToString("N").Substring(16,4);
guid.ToString("N").Substring(20,12);
For a Guid ecd65132-ab5a-4587-87b8-b875e2fe0f35 it will break it down into chunks as ecd65132, ab5a, 4587, 87b8, b875e2fe0f35, but it's not guaranteed to always be unique.
Update 2: There is also a project called ShortGuid to get a URL-friendly GUID; it can be converted from/to a regular Guid.
When I went under the hood I found it works by encoding the Guid to Base64, as in the code below:
public static string Encode(Guid guid)
{
    string encoded = Convert.ToBase64String(guid.ToByteArray());
    encoded = encoded
        .Replace("/", "_")
        .Replace("+", "-");
    return encoded.Substring(0, 22);
}
The good thing about it is that it can be decoded again to get the Guid back with:
public static Guid Decode(string value)
{
    // avoid parsing larger strings/blobs
    if (value.Length != 22)
    {
        throw new ArgumentException(
            $"A ShortGuid must be exactly 22 characters long. Received a {value.Length} character string.");
    }
    string base64 = value
        .Replace("_", "/")
        .Replace("-", "+") + "==";
    byte[] blob = Convert.FromBase64String(base64);
    var guid = new Guid(blob);
    var sanityCheck = Encode(guid);
    if (sanityCheck != value)
    {
        throw new FormatException(
            $"Invalid strict ShortGuid encoded string. The string '{value}' is valid URL-safe Base64, " +
            $"but failed a round-trip test expecting '{sanityCheck}'."
        );
    }
    return guid;
}
So a Guid 4039124b-6153-4721-84dc-f56f5b057ac2 will be encoded as SxI5QFNhIUeE3PVvWwV6wg, and the output will look something like:
ANf-MxRHHky2TptaXBxcwA
zpjp-stmVE6ZCbOjbeyzew
jk7P-XYFokmqgGguk_530A
81t6YZtkikGfLglibYkDhQ
qiM2GmqCK0e8wQvOSn-zLA
As others have mentioned, YouTube's VideoId is not technically a GUID since it's not inherently unique.
As per Wikipedia:
The total number of unique keys is 2^128, or 3.4×10^38. This number is so large that the probability of the same number being generated randomly twice is negligible.
The uniqueness of YouTube's VideoId is maintained by their generator algorithm.
You can either write your own algorithm, or you can use some sort of random string generator and utilize a UNIQUE constraint in SQL to enforce its uniqueness.
First, create a UNIQUE CONSTRAINT in your database:
ALTER TABLE MyTable
ADD CONSTRAINT UniqueUrlId
UNIQUE (UrlId);
Then, for example, generate a random string (from philipproplesch's answer):
string shortUrl = System.Web.Security.Membership.GeneratePassword(11, 0);
If the generated UrlId is sufficiently random and sufficiently long you should rarely encounter the exception that is thrown when SQL encounters a duplicate UrlId. In such an event, you can easily handle the exception in your web app.
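A sketch of that exception handling, assuming SQL Server (where error numbers 2627 and 2601 indicate a unique constraint/index violation; SqlException lives in System.Data.SqlClient) and a hypothetical InsertVideo method that performs the INSERT:
string urlId;
while (true)
{
    urlId = System.Web.Security.Membership.GeneratePassword(11, 0);
    try
    {
        InsertVideo(urlId);   // hypothetical data-access call
        break;                // the insert succeeded, so the id is unique
    }
    catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
    {
        // Duplicate UrlId - extremely unlikely; generate another and retry
    }
}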
Technically it's not a Guid. YouTube has a simple randomized string generator that you can probably whip up in a few minutes using an array of allowed characters and a random number generator.
It might be not the best solution, but you can do something like that:
string shortUrl = System.Web.Security.Membership.GeneratePassword(11, 0);
This id is probably not globally unique. GUIDs should be globally unique as they include elements which should not occur elsewhere (the MAC address of the machine generating the ID, the time the ID was generated, etc.).
If what you need is an ID that is unique within your application, use a number fountain - perhaps encoding the value as a hexadecimal number. Every time you need an id, grab it from the number fountain.
If you have multiple servers allocating ids, you could grab a range of numbers (a few tens or thousands, depending on how quickly you're allocating ids) and that should do the job. An 8-digit hex number will give you 4 billion ids - but your first ids will be a lot shorter.
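A minimal in-memory sketch of that idea (IdFountain is an illustrative name; a real implementation would reserve each block from a durable counter such as a database sequence):
class IdFountain
{
    private long next;
    private long limit;

    // Reserve a block of numbers up front (e.g. from a database sequence)
    // so that multiple servers never hand out the same value.
    public void ReserveRange(long start, int count)
    {
        next = start;
        limit = start + count;
    }

    public string NextId()
    {
        if (next >= limit)
            throw new InvalidOperationException("Range exhausted - reserve another block");
        return (next++).ToString("x"); // hex keeps early ids short, e.g. "1a3f"
    }
}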
Maybe using NanoId will save you from a lot of headaches:
https://github.com/codeyu/nanoid-net
You can do something like:
var id = Nanoid.Generate("1234567890abcdef", 10); //=> "4f90d13a42"
And you can check the collision probability here:
https://alex7kom.github.io/nano-nanoid-cc/