Invalid JSON after calling Encoding.UTF8.GetString() [duplicate]

I'm having a problem comparing strings in a Unit Test in C# 4.0 using Visual Studio 2010. This same test case works properly in Visual Studio 2008 (targeting .NET 3.5).
Here's the relevant code snippet:
byte[] rawData = GetData();
string data = Encoding.UTF8.GetString(rawData);
Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);
While debugging this test, the data string appears to the naked eye to contain exactly the same text as the literal. But when I called data.ToCharArray(), I noticed that the first character of the string is the value 65279 (U+FEFF), the Byte Order Mark decoded from UTF-8. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.
How do I get Encoding.UTF8.GetString() to not put the Byte Order Marker in the resulting string?
Update: The problem was that GetData(), which reads a file from disk, was reading the raw bytes with File.ReadAllBytes(). I corrected this by reading the text with a StreamReader (which strips the BOM) and converting it back to bytes with Encoding.UTF8.GetBytes(), which is what it should have been doing in the first place! Thanks for all the help.

Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should contain the BOM to start with.
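For example, since the BOM decodes to the single character U+FEFF, you can trim it off the front of the decoded string (a minimal sketch):

```csharp
using System;
using System.Text;

class StripBomDemo
{
    // The decoded BOM is the single character U+FEFF at the front of the string.
    public static string StripBom(string decoded) => decoded.TrimStart('\uFEFF');

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 }; // UTF-8 BOM followed by "A"
        string clean = StripBom(Encoding.UTF8.GetString(withBom));
        Console.WriteLine(clean.Length); // 1
        Console.WriteLine(clean);        // A
    }
}
```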
EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example showing the same byte array being decoded into two characters via Encoding.UTF8.GetString, but one character via a StreamReader:
using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

        // Decoding directly keeps the BOM, so the length is 2.
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length);

        // StreamReader detects and strips the BOM, so the length is 1.
        string viaStreamReader;
        using (StreamReader reader = new StreamReader(new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length);
    }
}

There is a slightly more efficient way to do it than creating a StreamReader and a MemoryStream:

1) If you know that there is always a BOM:

    string viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);

2) If you don't know, check first:

    string viaEncoding;
    if (withBom.Length >= 3 && withBom[0] == 0xEF && withBom[1] == 0xBB && withBom[2] == 0xBF)
        viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
    else
        viaEncoding = Encoding.UTF8.GetString(withBom);
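A variation on the check above: rather than hard-coding the three bytes, you can ask the encoding for its own preamble (a sketch; Encoding.UTF8.GetPreamble() returns the UTF-8 BOM bytes):

```csharp
using System;
using System.Linq;
using System.Text;

class PreambleCheck
{
    // Skip the BOM if the data starts with the encoding's preamble bytes.
    public static string DecodeSkippingBom(byte[] data)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // { 0xEF, 0xBB, 0xBF }
        int skip = data.Length >= preamble.Length
                   && data.Take(preamble.Length).SequenceEqual(preamble)
            ? preamble.Length
            : 0;
        return Encoding.UTF8.GetString(data, skip, data.Length - skip);
    }

    static void Main()
    {
        Console.WriteLine(DecodeSkippingBom(new byte[] { 0xEF, 0xBB, 0xBF, 0x41 })); // A
    }
}
```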

Unfortunately the BOM won't be removed with a simple Trim(). But it can be done as follows:
byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
byte[] bom = { 0xef, 0xbb, 0xbf };
var text = System.Text.Encoding.UTF8.GetString(withBom);
Console.WriteLine($"Untrimmed: {text.Length}, {text}");
var trimmed = text.Trim(System.Text.Encoding.UTF8.GetString(bom).ToCharArray());
Console.WriteLine($"Trimmed: {trimmed.Length}, {trimmed}");
Output:
Untrimmed: 2, A
Trimmed: 1, A

I believe the extra character is removed if you Trim() the decoded string - though note that a parameterless Trim() does not treat U+FEFF as whitespace in .NET 4 and later, so you may need to pass the BOM character explicitly.

Related

Convert C# string to C char array

I am sending a string from C# to C via sockets:
write 5000 100
In C, I split the received string using spaces.
char **params = str_split(buffer, ' ');
And then access the 3rd parameter, and convert 100 into C char. However, I need to be able to send an array of chars from C# (1 byte each) so that I can use them in C.
For instance, let's say I need to send the following string:
write 5000 <byte[] { 0x01, 0x20, 0x45 }>
Of course, the byte array needs to be transformed into string characters in C# that can be sent via StreamWriter. StreamWriter accepts arrays of chars, which are 2 bytes each, but I need 1 byte each.
How can this be accomplished?
In C, char is 1 byte in size. Thus, to match it from C#, you will need to send byte values.
And it seems like you need two different inputs for your problem:
C#
string textFront = "write 5000"; //input 1
byte[] bytes = new byte[] { 0x01, 0x20, 0x45 }; //input 2
Then, to send them together, I would rather use a Stream, which lets you send a byte[]. We only need to (1) convert textFront into a byte[], (2) concatenate it with bytes, and (3) send the combined array.

byte[] frontBytes = Encoding.ASCII.GetBytes(textFront); // (1)
byte[] combined = new byte[frontBytes.Length + bytes.Length];
frontBytes.CopyTo(combined, 0);
bytes.CopyTo(combined, frontBytes.Length); // (2)
Stream stream = client.GetStream(); // (3) e.g. the NetworkStream from your TcpClient
stream.Write(combined, 0, combined.Length);
I don't quite understand your question; is this what you are looking for?
byte[] bytes = Encoding.UTF8.GetBytes("your string");
and vice versa
string text = Encoding.UTF8.GetString(bytes);
The StreamWriter constructor may receive an Encoding parameter. Maybe that's what you want.
var sw = new StreamWriter(your_stream, Encoding.ASCII);
sw.Write("something");
There is also the BinaryWriter class that can write strings and byte[].
var bw = new BinaryWriter(output_stream, Encoding.ASCII);
bw.Write("something");
bw.Write(new byte[] { 0x01, 0x20, 0x45 });
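One caveat worth knowing: BinaryWriter.Write(string) prefixes the string with its encoded length, which a plain C reader will not expect. If the receiver wants raw 1-byte characters, encode the string yourself and write the bytes directly (a sketch using a MemoryStream as a stand-in for the real network stream):

```csharp
using System;
using System.IO;
using System.Text;

class RawBytesDemo
{
    // Builds "write 5000 " as raw ASCII bytes followed by the payload,
    // with no length prefix anywhere.
    public static byte[] BuildMessage(string textFront, byte[] payload)
    {
        using (var ms = new MemoryStream())
        using (var bw = new BinaryWriter(ms, Encoding.ASCII))
        {
            bw.Write(Encoding.ASCII.GetBytes(textFront)); // raw bytes, no prefix
            bw.Write(payload);
            bw.Flush();
            return ms.ToArray();
        }
    }

    static void Main()
    {
        byte[] msg = BuildMessage("write 5000 ", new byte[] { 0x01, 0x20, 0x45 });
        Console.WriteLine(msg.Length); // 14 (11 ASCII chars + 3 payload bytes)
    }
}
```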

base64string to string back to base64string

I am trying an experiment to convert a base64string to a string then back to a base64string, however, I am not getting my original base64string:
String profilepic = "/9j/4AAQ";
string Orig = System.Text.Encoding.Unicode.GetString(Convert.FromBase64String(profilepic));
string New = Convert.ToBase64String(System.Text.Encoding.Unicode.GetBytes(Orig));
The string New contains "/f//4AAQ".
Any thoughts of why this is happening?
You are doing it wrong. You should do it as below:
namespace ConsoleApplication1
{
    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            string profilepic = "/9j/4AAQ";
            string New = Convert.ToBase64String(Encoding.Unicode.GetBytes(profilepic));
            byte[] raw = Convert.FromBase64String(New); // unpack the base-64 to a blob
            string s = Encoding.Unicode.GetString(raw);
            Console.WriteLine(s); // outputs /9j/4AAQ
            Console.ReadKey();
        }
    }
}
You're assuming that the base64-encoded binary data in your example contains a UTF-16 encoded message. This may simply not be the case, and the System.Text.Encoding.Unicode class may alter the contents by discarding the bytes that it doesn't understand.
Therefore, the result of base64-encoding the UTF-16 encoded byte stream of the returned string may not yield the same output.
Your input string contains the binary sequence 0xFF 0xD8 0xFF 0xE0 0x00 0x10 (in hex). Interpreted as UTF-16LE (which is what System.Text.Encoding.Unicode uses), the first character would be 0xD8FF, an unpaired surrogate, so it is placed in the string as the replacement character 0xFFFD, which explains the change.
I tried decoding it with Encoding.Unicode, Encoding.UTF8 and Encoding.Default, but none of them yielded anything intelligible.
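That is also why the round trip only works if you stay at the byte level: base-64 is lossless for arbitrary bytes, while decoding the bytes as text is not. A minimal sketch of the safe byte-level round trip:

```csharp
using System;

class Base64RoundTrip
{
    static void Main()
    {
        string profilepic = "/9j/4AAQ";

        // Decode to raw bytes and re-encode without interpreting them as text.
        byte[] raw = Convert.FromBase64String(profilepic);
        string again = Convert.ToBase64String(raw);

        Console.WriteLine(again == profilepic); // True
    }
}
```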

How do I load a string into a FileStream without going to disk?

string abc = "This is a string";
How do I load abc into a FileStream?
FileStream input = new FileStream(.....);
Use a MemoryStream instead...
MemoryStream ms = new MemoryStream(System.Text.Encoding.ASCII.GetBytes(abc));
remember a MemoryStream (just like a FileStream) needs to be closed when you have finished with it. You can always place your code in a using block to make this easier...
using(MemoryStream ms = new MemoryStream(System.Text.Encoding.ASCII.GetBytes(abc)))
{
//use the stream here and don't worry about needing to close it
}
NOTE: If your string is Unicode rather than ASCII, you should specify that encoding when converting to a byte array. A Unicode (UTF-16) character takes up 2 bytes instead of 1; for characters in the ASCII range the second byte is simply zero (e.g. 0x61 0x00 = "a" in little-endian UTF-16, whereas in ASCII 0x61 = "a").
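For instance, the same MemoryStream trick with UTF-16 ("Unicode" in .NET) instead of ASCII, where each character of this string becomes 2 bytes (a sketch):

```csharp
using System;
using System.IO;
using System.Text;

class StringToStreamDemo
{
    static void Main()
    {
        string abc = "This is a string";

        using (var ms = new MemoryStream(Encoding.Unicode.GetBytes(abc)))
        using (var reader = new StreamReader(ms, Encoding.Unicode))
        {
            Console.WriteLine(ms.Length);          // 32 (16 chars * 2 bytes)
            Console.WriteLine(reader.ReadToEnd()); // This is a string
        }
    }
}
```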

How do I ignore the UTF-8 Byte Order Marker in String comparisons?


Why am I getting an extra character (a dot or bullet point) at the beginning of my byte array?

I have the following code used to get xml from a DataSet into a byte array using UTF-8 encoding:
private static byte[] fGetXmlBytes(DataTable lvDataTable)
{
    XmlWriterSettings lvSettings = new XmlWriterSettings();
    lvSettings.Encoding = Encoding.UTF8;
    lvSettings.NewLineHandling = NewLineHandling.Replace;
    lvSettings.NewLineChars = String.Empty;

    using (MemoryStream lvMemoryStream = new MemoryStream())
    using (XmlWriter lvWriter = XmlWriter.Create(lvMemoryStream, lvSettings))
    {
        lvDataTable.WriteXml(lvWriter, XmlWriteMode.IgnoreSchema);
        //Lines used during debugging
        //byte[] lvXmlBytes = lvMemoryStream.GetBuffer();
        //String lsXml = Encoding.UTF8.GetString(lvXmlBytes, 0, lvXmlBytes.Length);
        return lvMemoryStream.GetBuffer();
    }
}
I want a byte array because I subsequently pass the data to compression and encryption routines that work on byte arrays. The problem is that I end up with an extra character at the start of the XML. Instead of:
<?xml version="1.0" encoding="utf-8"?><etc....
I get
.<?xml version="1.0" encoding="utf-8"?><etc....
Does anyone know why the character is there? Is there a way to prevent the character being added? Or to easily strip it out?
Colin
You will have to use an Encoding class that doesn't emit a preamble. The object returned by Encoding.UTF8 will emit a preamble, but you can create your own UTF8Encoding that doesn't emit a preamble like this:
lvSettings.Encoding = new UTF8Encoding(false);
The UTF-8 preamble is the Unicode byte order mark (U+FEFF) encoded using UTF-8. The purpose of the byte order mark is to indicate the endianness (byte order) of the 16-bit code units of a UTF-16 stream: if the initial bytes of the stream are 0xFE 0xFF the stream is big endian, and if they are 0xFF 0xFE it is little endian.
U+FEFF encoded using UTF-8 results in the bytes 0xEF 0xBB 0xBF, and somewhat ironically, because UTF-8 encodes into a sequence of 8-bit bytes, the byte order no longer matters.
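You can see the difference between the two encoder instances by asking each for its preamble (a sketch):

```csharp
using System;
using System.Text;

class PreambleDemo
{
    static void Main()
    {
        // Encoding.UTF8 is constructed to emit a BOM...
        Console.WriteLine(Encoding.UTF8.GetPreamble().Length);           // 3

        // ...while new UTF8Encoding(false) has no preamble at all, so an
        // XmlWriter configured with it writes no extra leading bytes.
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
    }
}
```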
Preamble perhaps? Info here: http://www.firstobject.com/dn_markutf8preamble.htm
The extra character is the UTF-8 preamble. When the byte array is parsed back into XML, the preamble will be correctly interpreted without error, so you might as well just leave it in there (or use a UTF8Encoding constructed without a BOM, as shown above, to avoid writing it).
I am doing mostly the same with this code and it works perfectly:

MemoryStream data = new MemoryStream(1000);
datatable.WriteXml(data);
return data.ToArray();
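The ToArray() call is doing real work here: MemoryStream.GetBuffer() returns the entire internal buffer, including unused capacity, while ToArray() copies only the bytes actually written. A minimal sketch of the difference:

```csharp
using System;
using System.IO;

class BufferVsToArray
{
    static void Main()
    {
        var ms = new MemoryStream(1000);          // capacity 1000, length 0
        ms.Write(new byte[] { 1, 2, 3 }, 0, 3);

        // GetBuffer exposes the whole backing array, trailing zeros included.
        Console.WriteLine(ms.GetBuffer().Length); // 1000

        // ToArray copies exactly the bytes that were written.
        Console.WriteLine(ms.ToArray().Length);   // 3
    }
}
```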
