Limit UTF-8 encoded bytes length from string

I need to limit the length of the byte[] produced by UTF-8 encoding, e.g. the byte[] length must be less than or equal to 1000. First I wrote the following code:
int maxValue = 1000;
if (text.Length > maxValue)
    text = text.Substring(0, maxValue);
var textInBytes = Encoding.UTF8.GetBytes(text);
This works well if the string uses only ASCII characters, because each one takes 1 byte. Beyond ASCII, a character can take 2, 3, or even 4 bytes in UTF-8, which breaks the code above. To fix that, I wrote this:
List<byte> textInBytesList = new List<byte>();
char[] textInChars = text.ToCharArray();
for (int a = 0; a < textInChars.Length; a++)
{
    byte[] valueInBytes = Encoding.UTF8.GetBytes(textInChars, a, 1);
    if ((textInBytesList.Count + valueInBytes.Length) > maxValue)
        break;
    textInBytesList.AddRange(valueInBytes);
}
I haven't tested the code, but I'm sure it will work as I want. However, I don't like the way it is done. Is there a better way to do this? Something I'm missing or not aware of?
Thank you.

My first posting on Stack Overflow, so be gentle! This method should take care of things pretty quickly for you:
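public static byte[] GetBytes(string text, int maxArraySize, Encoding encoding) {
    if (string.IsNullOrEmpty(text)) return null;
    // Every char needs at least one byte, so no more than maxArraySize chars can fit.
    int tail = Math.Min(text.Length, maxArraySize);
    // Don't start in the middle of a surrogate pair.
    if (tail > 0 && tail < text.Length && char.IsSurrogatePair(text[tail - 1], text[tail])) --tail;
    int size = encoding.GetByteCount(text.Substring(0, tail));
    // Walk backwards, dropping one character (or one whole surrogate pair) at a time.
    while (tail > 0 && size > maxArraySize) {
        int step = (tail > 1 && char.IsSurrogatePair(text[tail - 2], text[tail - 1])) ? 2 : 1;
        size -= encoding.GetByteCount(text.Substring(tail - step, step));
        tail -= step;
    }
    return encoding.GetBytes(text.Substring(0, tail));
}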
It's similar to what you're doing, but without the added overhead of the List or having to count from the beginning of the string every time. I start from the other end of the string, and the assumption is, of course, that all characters must be at least one byte. So there's no sense in starting to iterate down through the string any farther in than maxArraySize (or the total length of the string).
Then you can call the method like so:
byte[] bytes = GetBytes(text, 1000, Encoding.UTF8);
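A different way to get the same result is to walk forward over whole text elements, so that neither a surrogate pair nor a base character with its combining marks is ever cut in half. The sketch below is only an illustration of that idea, not the method from the answer above: the names `Utf8Limiter` and `GetBytesLimited` are made up for this example, and it assumes UTF-8 is the target encoding.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (name invented for this example).
public static class Utf8Limiter
{
    // Returns the UTF-8 bytes of the longest prefix of text that fits in maxBytes,
    // cutting only on text-element boundaries.
    public static byte[] GetBytesLimited(string text, int maxBytes)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        int byteCount = 0; // bytes used by the prefix accepted so far
        int charCount = 0; // UTF-16 chars in the prefix accepted so far

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            int elementBytes = Encoding.UTF8.GetByteCount(element);
            if (byteCount + elementBytes > maxBytes)
                break; // the next element would overflow the limit
            byteCount += elementBytes;
            charCount += element.Length;
        }

        return Encoding.UTF8.GetBytes(text.Substring(0, charCount));
    }
}
```

Calling `Utf8Limiter.GetBytesLimited(text, 1000)` would then play the same role as the `GetBytes(text, 1000, Encoding.UTF8)` call above. The forward walk costs one `GetByteCount` call per text element, so it is likely slower than trimming from the end when most of the string fits.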

Related

Converting non digit text to double and then back to string

I have a unique situation where I have to write code on top of an already established platform, so I am trying to figure out a hack to make something work.
The problem is that I have a user-defined string, basically the name of a signal. I need to get this into another program, but the only channel available is a double value. Below is what I have tried, but I have not been able to get it to work. I convert the string to a byte array and build a new string by looping over the bytes. Then I convert this string to a double, use BitConverter to get it back to a byte array, and try to recover the string.
I'm not sure if this can even be achieved. Any ideas?
string signal = "R3MEXA";
string newId = "1";
byte[] asciiBytes = System.Text.Encoding.ASCII.GetBytes(signal);
foreach (byte b in asciiBytes)
    newId += b.ToString();
double signalInt = Double.Parse(newId);
byte[] bytes = BitConverter.GetBytes(signalInt);
string result = System.Text.Encoding.ASCII.GetString(bytes);

Assuming your string consists of ASCII characters (7 bits each):
Convert your string into a bit array, seven bits per character.
Convert this bit array into a string of digits, using 3 bits for each digit (giving the digits 0..7).
Convert this string of digits to a double.
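The three steps above can be sketched roughly as follows. Everything named here (`SignalCodec`, `Encode`, `Decode`, `MaxChars`) is invented for illustration, and two assumptions go beyond the answer: a leading sentinel bit is added so the length survives the round trip, and input is capped at 6 characters so the digit string stays below 2^53 and the double holds it exactly.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (names invented for this example).
public static class SignalCodec
{
    // Assumption: 6 chars * 7 bits + 1 sentinel bit = 43 bits, i.e. 15 octal digits,
    // which read as a decimal number is still exactly representable in a double.
    public const int MaxChars = 6;

    public static double Encode(string signal)
    {
        if (signal.Length > MaxChars)
            throw new ArgumentException("Too long to fit in a double exactly.");

        long bits = 1; // sentinel bit so leading zero bits are not lost
        foreach (char c in signal)
        {
            if (c > 0x7F) throw new ArgumentException("ASCII characters only.");
            bits = (bits << 7) | c; // step 1: seven bits per character
        }

        string octalDigits = Convert.ToString(bits, 8); // step 2: 3 bits per digit
        return double.Parse(octalDigits, CultureInfo.InvariantCulture); // step 3
    }

    public static string Decode(double value)
    {
        string octalDigits = ((long)value).ToString(CultureInfo.InvariantCulture);
        long bits = Convert.ToInt64(octalDigits, 8);

        var result = new StringBuilder();
        while (bits > 1) // stop when only the sentinel bit is left
        {
            result.Insert(0, (char)(bits & 0x7F));
            bits >>= 7;
        }
        return result.ToString();
    }
}
```

With these assumptions, `SignalCodec.Decode(SignalCodec.Encode("R3MEXA"))` gives back the original six-character name. Longer names would need to be split across several doubles.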

You initially set newId to "1", which means that when you do the later conversion, you won't get the right output unless you account for that "1" again.

It doesn't work because, when you convert it back, you don't know how many digits each byte contributed.
So I padded every byte to a length of 3 digits.
string signal = "R3MEXA";
string newId = "1";
byte[] asciiBytes = System.Text.Encoding.ASCII.GetBytes(signal);
foreach (byte b in asciiBytes)
    newId += b.ToString().PadLeft(3, '0'); // Pad with zeros if the byte has fewer than 3 digits
double signalInt = Double.Parse(newId);
// Convert it back
List<byte> bytes = new List<byte>(); // A list, because we don't know how many bytes will come (or calculate it: at most _signal.Length / 3)
// string _signal = signalInt.ToString("F0"); // Maybe you know a better way to format the double without scientific notation
// This is my workaround to get the integer part from the double.
// It's not perfect, but I don't know another way at the moment without losing information.
string _signal = "";
while (signalInt > 1)
{
    int _int = (int)(signalInt % 10);
    _signal += (_int).ToString();
    signalInt /= 10;
}
_signal = String.Join("", _signal.Reverse());
for (int i = 1; i < _signal.Length; i += 3)
{
    byte b = Convert.ToByte(_signal.Substring(i, 3)); // Turn 3 digits into one byte
    if (b != 0) // With ToString("F0") there may be empty bytes at the end
        bytes.Add(b);
}
string result = System.Text.Encoding.ASCII.GetString(bytes.ToArray()); // Yields "R3MEX"; the "A" is lost because a double can't hold that many digits.
What can be improved?
Not every PadLeft is necessary. Work from back to front: if the third digit of a byte is greater than 2, you know that the byte has only two digits. (Sorry for my English; here is an example.)
Example
194 | 68 | 75 | 13
194687513
Reverse:
315786491
31 // 5 is too big -> 13
57 // 8 is too big -> 75
86 // 4 is too big -> 68
491 // 1 is ok -> 194

C# private function, IncrementArray

Can someone please explain in layman's terms the workings of this C# code?
for (int pos = 0; pos < EncryptedData.Length; pos += AesKey.Length)
{
    Array.Copy(incPKGFileKey, 0, PKGFileKeyConsec, pos, PKGFileKey.Length);
    IncrementArray(ref incPKGFileKey, PKGFileKey.Length - 1);
}
private Boolean IncrementArray(ref byte[] sourceArray, int position)
{
    if (sourceArray[position] == 0xFF)
    {
        if (position != 0)
        {
            if (IncrementArray(ref sourceArray, position - 1))
            {
                sourceArray[position] = 0x00;
                return true;
            }
            else return false;
        }
        else return false;
    }
    else
    {
        sourceArray[position] += 1;
        return true;
    }
}
I'm trying to port an app to Ruby but I'm having trouble understanding how the IncrementArray function works.

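IncrementArray adds one to the byte at the given position. If that byte is already 0xFF, it wraps to 0x00 and the carry is added to the previous index, unless it is index 0 already, in which case the array is left unchanged and the function returns false.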
The entire thing looks like some kind of encryption or decryption code. You might want to look for additional hints on which algorithm is used, as this kind of code is usually not self-explaining.

It looks to me like a big-endian addition algorithm:
Let's say you've got a long (64 bit, 8 byte) number:
var bigNumber = 0x123456FFFFFFFF;
But for some reason, it arrives as a byte array in big-endian format:
// Get the little-endian byte array representation of the number:
// [0xff 0xff 0xff 0xff 0x56 0x34 0x12 0x00]
byte[] source = BitConverter.GetBytes(bigNumber);
// Big-endian-ify it by reversing the byte array
source = source.Reverse().ToArray();
So now you want to add one to this "number" in its current form, while maintaining any carries/overflows like you would in normal arithmetic:
// increment the least significant byte by one, respecting carry
// (as it's big-endian, the least significant byte is the last one)
IncrementArray(ref source, source.Length - 1);
// re-little-endian-ify it so we can convert it back
source = source.Reverse().ToArray();
// now convert the array back into a long
var bigNumberIncremented = BitConverter.ToInt64(source, 0);
// Outputs: "Before +1:123456FFFFFFFF"
Console.WriteLine("Before +1:" + bigNumber.ToString("X"));
// Outputs: "After +1:12345700000000"
Console.WriteLine("After +1:" + bigNumberIncremented.ToString("X"));
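For a port to another language, it may help to see the same carry logic written as a loop instead of recursion. The following is a sketch of what I believe is an equivalent routine, not code from the original application: `CounterBytes` and `Increment` are hypothetical names, and the array is modified in place rather than passed by `ref`.

```csharp
using System;

// Hypothetical helper (names invented for this example).
public static class CounterBytes
{
    // Adds one to the big-endian number stored in counter[0..position].
    // Returns false and leaves the array untouched if every byte in that
    // range is already 0xFF (there is nowhere for the carry to go).
    public static bool Increment(byte[] counter, int position)
    {
        // Find the rightmost byte that can absorb the carry.
        int i = position;
        while (i >= 0 && counter[i] == 0xFF)
            i--;

        if (i < 0)
            return false; // all 0xFF: overflow, nothing changes

        counter[i]++;

        // Every 0xFF byte the carry passed through wraps to 0x00.
        for (int j = i + 1; j <= position; j++)
            counter[j] = 0x00;

        return true;
    }
}
```

For example, incrementing `{ 0x00, 0xFF, 0xFF }` at position 2 leaves `{ 0x01, 0x00, 0x00 }`, which is the same "add one with carry" behaviour described above.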

why does IsSingleByte Encoding's GetByteCount do calculation

I've inspected ASCIIEncoding's GetByteCount method. It does lengthy calculations rather than returning String.Length. That doesn't make sense to me. Do you have any idea why?

EDIT: I've just tried reproducing this, and I can't currently force an ASCIIEncoding instance to have a different replacement. Instead, I'd have to use Encoding.GetEncoding to get a mutable one. So for ASCIIEncoding, I agree... but for other implementations where IsSingleByte returns true, you'd still have the potential problem below.
Consider trying to get the byte count of a string which doesn't just contain ASCII characters. The encoding has to take the EncoderFallback into account... which could do any number of things, including increasing the count by an indeterminate amount.
It could be optimized for the case where the encoder fallback is a "default" one which just replaces non-ASCII characters with "?" though.
Further edit: I've just tried to confuse this with a surrogate pair, hoping that it would be represented by a single question mark. Unfortunately not:
string text = "x\ud800\udc00y";
Console.WriteLine(text.Length); // Prints 4
Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // Still prints 4!

Interestingly, the mono runtime doesn't seem to include that behaviour:
// Get the number of bytes needed to encode a character buffer.
public override int GetByteCount (char[] chars, int index, int count)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    if (index < 0 || index > chars.Length) {
        throw new ArgumentOutOfRangeException ("index", _("ArgRange_Array"));
    }
    if (count < 0 || count > (chars.Length - index)) {
        throw new ArgumentOutOfRangeException ("count", _("ArgRange_Array"));
    }
    return count;
}
// Convenience wrappers for "GetByteCount".
public override int GetByteCount (String chars)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    return chars.Length;
}
and further down
[CLSCompliantAttribute(false)]
[ComVisible (false)]
public unsafe override int GetByteCount (char *chars, int count)
{
    return count;
}

For a multibyte encoding like UTF-8 this method makes sense, because characters are stored in 1 to 4 bytes. I imagine the same method is also applied to a fixed-size encoding like ASCII, where every character is defined with 7 bits. In the actual implementation, however, "aaaaaaaa" would be 8 bytes, because ASCII characters are stored in 1 byte (8 bits) each, so the length shortcut works in the best-case scenario.
Previous versions of .NET Framework allowed spoofing by ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.
Source: MSDN
I understand your question as: does a worst-case scenario exist for the length shortcut?
Encoding ae = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback("[lol]"),
    new DecoderReplacementFallback("[you broke Me]"));
Console.WriteLine(ae.GetByteCount("õäöü"));
This returns 20, because the string "õäöü" contains 4 characters that are all outside the "us-ascii" range (U+0000 to U+007F), so after encoding the text is "[lol][lol][lol][lol]".

C# problem with byte[]

I am loading a file into a byte[]. As I understand it, each element of the byte[] should be a byte (8 bits). But when I print each byte, not all of them are 8 bits long (i.e. the printed value doesn't have 8 digits).
My Code:
FileStream stream = File.OpenRead(@"C:\Image\Img.jpg");
byte[] fileByte = new byte[stream.Length];
stream.Read(fileByte, 0, fileByte.Length);
for (int i = 0; i <= fileByte.Length - 1; i++)
{
    Console.WriteLine(Convert.ToString(fileByte[i], 2));
}
Output:
10001110
11101011
10001100
1000111
10011010
10010011
1001010
11000000
1001001
100100
I think my understanding is wrong here. Can you please tell me (or point me to some tutorials) what I am missing?

Leading 0's don't get printed.

When converting a numeric to a string, you lose any leading zeros. (Note that all of your entries start with "1".) You can use PadLeft to put them back in.
FileStream stream = File.OpenRead(@"C:\Image\Img.jpg");
byte[] fileByte = new byte[stream.Length];
stream.Read(fileByte, 0, fileByte.Length);
for (int i = 0; i <= fileByte.Length - 1; i++)
{
    Console.WriteLine(Convert.ToString(fileByte[i], 2).PadLeft(8, '0'));
}

They all have 8 bits, but the non significant zeroes (the zeroes on the left) are not printed.

It is simply that the leading zeros are not included...

Are the bytes printed without leading zeros? You chose a slightly awkward example, because we do not know the decimal values you are displaying (though someone who knows the header structure of a .jpg file might). I'm willing to bet the leading zeros are simply not displayed in the binary representations.

How do I truncate a string while converting to bytes in C#?

I would like to put a string into a byte array, but the string may be too big to fit. In the case where it's too large, I would like to put as much of the string as possible into the array. Is there an efficient way to find out how many characters will fit?

In order to truncate a string to a UTF8 byte array without splitting in the middle of a character I use this:
static string Truncate(string s, int maxLength) {
    if (Encoding.UTF8.GetByteCount(s) <= maxLength)
        return s;
    var cs = s.ToCharArray();
    int length = 0;
    int i = 0;
    while (i < cs.Length){
        int charSize = 1;
        if (i < (cs.Length - 1) && char.IsSurrogate(cs[i]))
            charSize = 2;
        int byteSize = Encoding.UTF8.GetByteCount(cs, i, charSize);
        if ((byteSize + length) <= maxLength){
            i = i + charSize;
            length += byteSize;
        }
        else
            break;
    }
    return s.Substring(0, i);
}
The returned string can then be safely transferred to a byte array of length maxLength.

You should be using the Encoding class to do your conversion to a byte array, correct? All Encoding objects override the method GetMaxCharCount, which gives you "The maximum number of characters produced by decoding the specified number of bytes." You should be able to use this value to trim your string and then encode it. Note that this is an upper bound: a string trimmed to that many characters can still encode to more bytes than the array holds, so check the result with GetByteCount.

An efficient way is to find how many bytes you need per character in the worst case with
Encoding.GetMaxByteCount(1);
then divide the size of your byte array by the result, and convert that many characters with
public virtual int Encoding.GetBytes (
    string s,
    int charIndex,
    int charCount,
    byte[] bytes,
    int byteIndex
)
If you want to use less memory, use
Encoding.GetByteCount(string);
but that is a much slower method.
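A minimal sketch of the pessimistic approach described above, assuming UTF-8 as the encoding. The names `PessimisticCopy` and `Fill` are made up for this example, and the surrogate-pair check is an addition that the answer does not mention.

```csharp
using System;
using System.Text;

// Hypothetical helper (names invented for this example).
public static class PessimisticCopy
{
    // Copies as many leading characters of s into buffer as are guaranteed
    // to fit under the worst-case estimate. Returns the number of bytes written.
    public static int Fill(string s, byte[] buffer)
    {
        // Worst-case number of bytes the encoding may need for one character.
        int worstCasePerChar = Encoding.UTF8.GetMaxByteCount(1);

        // Characters that are guaranteed to fit, capped at the string length.
        int charCount = Math.Min(s.Length, buffer.Length / worstCasePerChar);

        // Assumption beyond the answer: avoid cutting a surrogate pair in half.
        if (charCount > 0 && charCount < s.Length &&
            char.IsSurrogatePair(s[charCount - 1], s[charCount]))
        {
            charCount--;
        }

        return Encoding.UTF8.GetBytes(s, 0, charCount, buffer, 0);
    }
}
```

The trade-off is visible in the arithmetic: because the estimate is a worst case, a buffer will usually end up only partly filled, which is the price for never calling `GetByteCount` on the actual text.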

The Encoding class in .NET has a method called GetByteCount which can take in a string or char[]. If you pass in 1 character, it will tell you how many bytes are needed for that 1 character in whichever encoding you are using.
The method GetMaxByteCount is faster, but it does a worst case calculation which could return a higher number than is actually needed.
