C# .NET Framework: decoding strings from UTF-8 bytes

I have an application written in C#, and this application receives data over the network from a server using sockets (UDP, libenet).
In my application I have functions to process the raw bytes sent in a packet.
One of these functions reads a string delimited by \0.
My problem is that the server sends UTF-8 encoded strings to my C# application, but when I use these strings to display them in controls, I get gibberish instead of Polish letters.
The function that reads strings from the buffer:
public override string ReadString()
{
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        byte b;
        if (Remaining > 0)
            b = ReadByte();
        else
            b = 0;
        if (b == 0) break;
        // Probably here is the problem. Checked other encodings etc., but still the same
        sb.Append(Encoding.UTF8.GetString(new byte[] { b }, 0, 1));
    }
    return sb.ToString();
}
The function overrides the one from:
public class BitReader : BinaryReader
In my application I get garbled text instead of the Polish characters.

You can't read UTF-8 byte-wise, as a single character may take more than one byte.
See How to convert byte[] to string? (first read everything into one byte array / List, then decode it in a single call).
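For example, a minimal sketch of the reader above rewritten that way (assuming the same Remaining and ReadByte members, plus using System.Collections.Generic and System.Text):
public override string ReadString()
{
    // Collect the raw bytes first; decode only after the full UTF-8
    // sequence is available, so multi-byte characters stay intact.
    var bytes = new List<byte>();
    while (Remaining > 0)
    {
        byte b = ReadByte();
        if (b == 0) break; // \0 delimiter
        bytes.Add(b);
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}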

Related

Convert image/png string representation into byte array in C#

From a call to an external API, my method receives an image/png as an IRestResponse, where the Content property is a string representation.
I need to convert this string representation of the image/png into a byte array without saving it to disk first and then calling File.ReadAllBytes. How can I achieve this?
You can try a hex-string-to-byte conversion. Here is a method I've used before. Please note that you may have to pad the string depending on how it arrives; the method will throw an error to let you know. However, if whoever sent the image converted it into bytes and then into a hex string (which they should have, based on what you are saying), then you won't have to worry about padding.
// Requires using System; and using System.Globalization;
public static byte[] HexToByte(string HexString)
{
    if (HexString.Length % 2 != 0)
        throw new Exception("Invalid HEX");
    byte[] retArray = new byte[HexString.Length / 2];
    for (int i = 0; i < retArray.Length; ++i)
    {
        retArray[i] = byte.Parse(HexString.Substring(i * 2, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }
    return retArray;
}
This might not be the fastest solution, by the way, but it's a good representation of what needs to happen, so you can optimize later.
This also assumes the string being sent to you is the raw bytes converted to a hex string. If the sender applied anything like a base58 conversion, you will need to decode that first and then use this method.
I have found that the IRestResponse from RestSharp actually contains a RawBytes property with the raw bytes of the response content. This meets my needs and no conversion is necessary!
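So, assuming a RestSharp response object named response, getting the PNG bytes is a one-liner:
// RawBytes exposes the undecoded response body directly.
byte[] pngBytes = response.RawBytes;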

Encoding and null terminated strings

EDIT: I've come up with a solution; here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null; otherwise, specify the number of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
    if (Length == 0) return string.Empty;
    var sb = new StringBuilder();
    if (Length <= -1)
    {
        using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
        {
            int ch;
            while (true)
            {
                ch = sr.Read();
                if (ch <= 0) break;
                sb.Append((char)ch);
            }
            if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
            return sb.ToString();
        }
    }
    else return Encoding.GetString(Source, Offset, Length);
}
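Usage, for example, reading a null-terminated UTF-8 string out of a buffer (the buffer contents here are just for illustration):
byte[] buffer = Encoding.UTF8.GetBytes("blah\0\0oldstring");
string s = GetString(buffer, 0, -1, Encoding.UTF8); // "blah"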
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first: I used Encoding.IsSingleByte to determine a single character's length, read the byte(s), checked for 0s, and stopped or continued reading based on the result.
This is where it gets tricky. UTF-8 is a variable-length encoding: Encoding.IsSingleByte returns false for it, yet a character can still be just 1 byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF-8.
At that point I wasn't sure the method could be corrected, so I had another idea: just use the encoding's GetString method on the bytes, pass the maximum length the string can be as the count parameter, and then trim the zeros off the returned string.
That too has a caveat. I have to consider cases where my managed applications interact with byte arrays returned from unmanaged code; there will be a null terminator, of course, but possibly extra junk characters after it.
For example:
"blah\0\0oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF-8. The second solution also won't work: it trims the 0s, but the junk remains.
Any ideas for an elegant solution for C#?
Your best solution is to use an implementation of TextReader:
StreamReader if you're reading from a stream
StringReader if you're reading from a string
With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:
int ch = reader.Read();
Internally the magic is done through the C# Decoder class (which comes from your Encoding):
var decoder = Encoding.UTF8.GetDecoder();
The Decoder class needs a small array buffer to work with. Fortunately StreamReader knows how to keep the buffer filled and everything works.
Pseudocode
Untried and untested, but roughly:
string ReadNullTerminatedString(Stream stm, Encoding encoding)
{
    StringBuilder sb = new StringBuilder();
    TextReader rdr = new StreamReader(stm, encoding);
    int ch = rdr.Read();
    while (ch > 0) // Read() returns -1 when we've hit the end, and 0 is the null terminator
    {
        sb.Append((char)ch);
        ch = rdr.Read();
    }
    return sb.ToString();
}
Note: Any code released into public domain. No attribution required.

String to byte array only converts first 16 bytes according to Intellisense

I'm trying to convert a string to a byte[] using the ASCIIEncoding class in the .NET library. The string will never contain non-ASCII characters, but it will usually have a length greater than 16. My code looks like the following:
public static byte[] Encode(string packet)
{
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] byteArray = enc.GetBytes(packet);
    return byteArray;
}
By the end of the method, the byte array should contain packet.Length bytes, but IntelliSense tells me that all bytes after byteArray[15] are literally question marks that cannot be observed. I used Wireshark to view byteArray after I sent it, and it was received fine on the other side, but the end device did not follow the instructions encoded in byteArray. I'm wondering if this has anything to do with IntelliSense not being able to display all elements in byteArray, or if my packet is completely wrong.
If your packet string basically contains characters in the range 0-255, then ASCIIEncoding is not what you should be using. ASCII only defines character codes 0-127; anything in the range 128-255 will get turned into question marks (as you have observed) because those characters are not defined in ASCII.
Consider using a method like this to convert the string to a byte array. (This assumes that the ordinal value of each character is in the range 0-255 and that the ordinal value is what you want.)
public static byte[] ToOrdinalByteArray(this string str)
{
    if (str == null) { throw new ArgumentNullException("str"); }

    var bytes = new byte[str.Length];
    for (int i = 0; i < str.Length; ++i) {
        // Wrapping the cast in checked() will trigger an OverflowException
        // if the character being converted is out of range for a byte.
        bytes[i] = checked((byte)str[i]);
    }
    return bytes;
}
The Encoding class hierarchy is specifically designed for handling text. What you have here doesn't seem to be text, so you should avoid using these classes.
The standard encoders use the replacement character fallback strategy. If a character doesn't exist in the target character set, they encode a replacement character ('?' by default).
To me, that's worse than a silent failure; it's data corruption. I prefer that libraries tell me when my assumptions are wrong.
You can derive an encoder that throws an exception:
Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
If you are truly using only characters in Unicode's ASCII range then you'll never see an exception.
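A quick sketch of what that looks like in practice (the variable names are mine):
Encoding strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

byte[] ok = strictAscii.GetBytes("hello");   // fine: pure ASCII
byte[] boom = strictAscii.GetBytes("héllo"); // throws EncoderFallbackException instead of emitting '?'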

C# Network Stream getString method

I'm writing a library to simplify my network programming in future projects. I want it to be robust and efficient, because it will be in nearly all of my projects from now on. (By the way, both the server and the client will be using my library, so I'm not assuming a protocol in my question.) I'm writing a function for receiving strings from a network stream, where I use 31 bytes of buffer and one byte as a sentinel. The sentinel value indicates which byte, if any, is the EOF. Here's my code for your use or scrutiny...
public string getString()
{
    string returnme = "";
    while (true)
    {
        int[] buff = new int[32];
        for (int i = 0; i < 32; i++)
        {
            buff[i] = ns.ReadByte();
        }
        if (buff[31] > 31) { /* throw some error */ }
        for (int i = 0; i < buff[31]; i++)
        {
            returnme += (char)buff[i];
        }
        if (buff[31] != 31)
        {
            break;
        }
    }
    return returnme;
}
Edit: Is this the best (most efficient, practical, etc.) way to accomplish what I'm doing?
Is this the best (most efficient, practical, etc.) way to accomplish what I'm doing?
No. Firstly, you are limiting yourself to characters in the 0-255 code-point range, and that isn't enough; secondly, serializing strings is a solved problem. Just use an Encoding, typically UTF-8. As part of a network stream, this usually means "encode the length, encode the data" and "read the length, buffer that much data, decode the data". As another note: you aren't correctly handling the EOF scenario if ReadByte() returns a negative value.
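If you don't want to roll that framing yourself, note that BinaryWriter/BinaryReader already implement exactly this length-prefix scheme for strings. A minimal sketch, assuming the NetworkStream ns from the question and .NET 4.5+ for the leaveOpen overload:
// Writer side: Write(string) emits a 7-bit-encoded length prefix,
// then the UTF-8 bytes of the string.
using (var writer = new BinaryWriter(ns, Encoding.UTF8, leaveOpen: true))
{
    writer.Write("some message");
}

// Reader side: ReadString reads the prefix, buffers that many bytes,
// and decodes them in one go.
using (var reader = new BinaryReader(ns, Encoding.UTF8, leaveOpen: true))
{
    string message = reader.ReadString();
}
The prefix is BinaryWriter's own 7-bit variable-length integer, so this only interoperates with another BinaryReader; for a custom protocol you would write a fixed-size length field instead.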
As a small corollary, note that appending to a string in a loop is never a good idea; if you did do it that way, use a StringBuilder. But don't do it that way. My code would be something more like (hey, whadya know, here's my actual string-reading code from protobuf-net, simplified a bit):
// read the length
int bytes = (int)ReadUInt32Variant(false);
if (bytes == 0) return "";
// buffer that much data
if (available < bytes) Ensure(bytes, true);
// read the string
string s = encoding.GetString(ioBuffer, ioIndex, bytes);
// update the internal buffer data
available -= bytes;
position += bytes;
ioIndex += bytes;
return s;
As a final note, I would say: if you are sending structured messages, give some serious consideration to using a pre-rolled serialization API that specialises in this stuff. For example, you could then just do something like:
var msg = new MyMessage { Name = "abc", Value = 123, IsMagic = true };
Serializer.SerializeWithLengthPrefix(networkStream, msg);
and at the other end:
var msg = Serializer.DeserializeWithLengthPrefix<MyMessage>(networkStream);
Console.WriteLine(msg.Name); // etc
Job done.
I think you should use a StringBuilder object with a fixed size for better performance.
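That is, something along these lines (the capacity is a pre-allocation hint, not a hard limit):
var sb = new StringBuilder(32); // pre-sized to the 32-byte frame used above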

C# perform string operation on UTF-16 byte array

I'm reading a file into a byte[] buffer. The file contains a lot of UTF-16 strings (millions) in the following format:
The first byte contains the string length in chars (range 0..255).
The following bytes contain the string's characters in UTF-16 encoding (each char is represented by 2 bytes, i.e. byteCount = charCount * 2).
I need to perform standard string operations on all the strings in the file, for example IndexOf, EndsWith and StartsWith, with StringComparison.OrdinalIgnoreCase and StringComparison.Ordinal.
For now my code first converts each string from the byte array to the System.String type. I found the following code to be the most efficient way to do so:
// position/length validation removed to minimize the code
string result;
byte charLength = _buffer[_bufferI++];
int byteLength = charLength * 2;
fixed (byte* pBuffer = &_buffer[_bufferI])
{
    result = new string((char*)pBuffer, 0, charLength);
}
_bufferI += byteLength;
return result;
Still, new string(char*, int, int) is very slow because it performs an unnecessary copy for each string.
The profiler says System.String.wstrcpy(char*, char*, int32) is what's slow.
I need a way to perform string operations without copying the bytes for each string.
Is there a way to perform string operations on the byte array directly?
Is there a way to create a new string without copying its bytes?
No, you can't create a string without copying the character data.
The String object stores the metadata for the string (Length, etc.) in the same memory area as the character data, so you can't keep the character data in the byte array and pretend that it's a String object.
You could try other ways of constructing the string from the byte data and see if any of them has less overhead, like Encoding.Unicode.GetString.
If you are using a pointer, you could try to get multiple strings at a time, so that you don't have to fix the buffer for each string.
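For reference, the Encoding.Unicode route applied to the buffer layout described above (one length byte followed by charLength * 2 bytes), assuming the same _buffer/_bufferI fields:
byte charLength = _buffer[_bufferI++];
// Decode charLength UTF-16 code units (2 bytes each) straight from the buffer.
string result = Encoding.Unicode.GetString(_buffer, _bufferI, charLength * 2);
_bufferI += charLength * 2;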
You could read the file using a StreamReader with Encoding.Unicode, so you do not have the "byte overhead" in between:
using (StreamReader sr = new StreamReader(filename, Encoding.Unicode))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // Your code
    }
}
You could create extension methods on byte arrays to handle most of those string operations directly on the byte array and avoid the cost of converting. Not sure what all string operations you perform, so not sure if all of them could be accomplished this way.
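As a rough illustration of that idea, here is a hypothetical ordinal StartsWith over the length-prefixed UTF-16 records (the method name and layout assumptions are mine; OrdinalIgnoreCase would additionally need per-character case folding):
public static class Utf16BufferExtensions
{
    // Ordinal StartsWith against a length-prefixed UTF-16 record:
    // buffer[offset] holds the char count, followed by charCount * 2 bytes.
    public static bool RecordStartsWith(this byte[] buffer, int offset, string prefix)
    {
        byte charLength = buffer[offset];
        if (prefix.Length > charLength) return false;

        int dataStart = offset + 1;
        for (int i = 0; i < prefix.Length; i++)
        {
            // Reassemble the little-endian UTF-16 code unit.
            char c = (char)(buffer[dataStart + i * 2] | (buffer[dataStart + i * 2 + 1] << 8));
            if (c != prefix[i]) return false;
        }
        return true;
    }
}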
