How to read byte[] with the current encoding using StreamReader - C#

I would like to read byte[] using C# with the current encoding of the file.
As documented on MSDN, the default encoding is UTF-8 when the constructor is given no encoding:
var reader = new StreamReader(new MemoryStream(data));
I have also tried the following, but the data is still read as UTF-8:
var reader = new StreamReader(new MemoryStream(data), true);
I need to read the byte[] with the current encoding.

A file has no encoding. A byte array has no encoding. A byte has no encoding. Encoding is something that transforms bytes to text and vice versa.
What you see in text editors and the like is actually program magic: the editor tries out different encodings and then guesses which one makes the most sense. This is also what you enable with the boolean parameter. If this does not produce what you want, then this magic fails.
var reader = new StreamReader(new MemoryStream(data), Encoding.Default);
will use the OS/location-specific default encoding. If that is still not what you want, then you need to be completely explicit and tell the StreamReader exactly which encoding to use, for example (just as an illustration, since you said you did not want UTF-8):
var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8);
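For completeness, here is a minimal sketch (the data and text are made up for illustration) of how the boolean overload behaves: detection only works when the bytes start with a byte order mark, and CurrentEncoding is only updated after the first read.

using System;
using System.IO;
using System.Linq;
using System.Text;

class BomDetectionSketch
{
    static void Main()
    {
        // Hypothetical input: "héllo" encoded as UTF-16 LE, prefixed with its BOM.
        byte[] data = Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes("héllo"))
            .ToArray();

        // The boolean parameter is detectEncodingFromByteOrderMarks;
        // without a BOM, the supplied fallback encoding (UTF-8 here) is used.
        using var reader = new StreamReader(new MemoryStream(data),
            Encoding.UTF8, detectEncodingFromByteOrderMarks: true);

        string text = reader.ReadToEnd();          // detection happens on the first read
        Console.WriteLine(reader.CurrentEncoding); // System.Text.UnicodeEncoding (UTF-16), not the UTF-8 fallback
        Console.WriteLine(text);                   // héllo
    }
}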

I just tried different ways of figuring out the encoding of the bytes, and it is not possible to do so, because a byte array does not have an encoding, as Jan mentions in his reply. However, you can always take the value, convert it with UTF-8 or ASCII/Unicode, and test the resulting string values if you are doing something like System.Text.Encoding.UTF8.GetString(byte[] array):
public static bool IsUnicode(string input)
{
    // If the UTF-8 byte count differs from the ASCII byte count,
    // the string contains characters outside the ASCII range.
    var asciiBytesCount = Encoding.ASCII.GetByteCount(input);
    var unicodeBytesCount = Encoding.UTF8.GetByteCount(input);
    return asciiBytesCount != unicodeBytesCount;
}
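As a rough usage sketch (assuming the method above and using System.Text are in scope), the check only reports whether the string contains characters outside the ASCII range:

Console.WriteLine(IsUnicode("hello")); // false - every character fits in a single ASCII byte
Console.WriteLine(IsUnicode("héllo")); // true  - 'é' needs more than one UTF-8 byte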

Related

How can I convert an XElement to a byte array for a PutFile operation?

I need to convert a big XElement to a byte array so that it can be uploaded later to a fileshare. What is the correct method to call to do that?
Below is the signature of the internal method fileShare.PutFile:
void PutFile(string folder, string fileName, byte[] content);
Then given an XElement xml, I tried converting it to a byte array by encoding its XElement.Value using Encoding.Default.GetBytes() as follows:
byte[] bytes = Encoding.Default.GetBytes(xml.Value);
fileShare.PutFile(folderName, blobName, bytes);
I am not so sure that xml.Value (XElement.Value) is really what the GetBytes method needs, though. Is this correct?
To test this, I spun up a console app and put in some fake data. I did this for the XElement:
XElement root = new XElement("Root",
    new XElement("Child1", 1),
    new XElement("Child2", 2),
    new XElement("Child3", 3),
    new XElement("Child4", 4),
    new XElement("Child5", 5),
    new XElement("Child6", 6)
);
Then I tried that line of code to put it into a byte array:
byte[] bytes = Encoding.Default.GetBytes(root.Value);
When I step over and look at Autos, the bytes variable is byte[6], and when I expand it I see that [0] = 49 and so on.
Now this may not mean it is not working ... or does it mean that? How can I interpret the contents of the bytes array to check whether it is correct?
Firstly, using Encoding.Default is not recommended. From the docs:
Warning
Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these reasons, using the default encoding is not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding. You could also use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Secondly, XElement.Value returns
A String that contains all of the text content of this element. If there are multiple text nodes, they will be concatenated.
Thus if you upload the Value you will be stripping away the entire XML markup structure from your file leaving only the plain text. While you might want to do that, it seems very unlikely. If you compare the value with the string returned by XElement.ToString() the difference should be clear.
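To make the difference concrete, here is a small sketch (reusing a trimmed-down version of the root element from the question):

var root = new XElement("Root", new XElement("Child1", 1), new XElement("Child2", 2));

Console.WriteLine(root.Value);      // "12" - the concatenated text content, markup stripped
Console.WriteLine(root.ToString()); // the full, indented markup: <Root> <Child1>1</Child1> <Child2>2</Child2> </Root>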
Instead, to convert the XML contents of your XElement (including both markup and text) to a byte array, it would be better to write your XElement directly to a MemoryStream using an appropriately configured XmlWriterSettings and return the byte array thereby created. The following extension method does the job:
public static partial class XNodeExtensions
{
    static Encoding DefaultEncoding { get; } = new UTF8Encoding(false); // Disable the BOM because XElement.ToString() does not include it.

    public static byte[] ToByteArray(this XNode node, SaveOptions options = default, Encoding encoding = default)
    {
        // Emulate the settings of XElement.ToString() and XDocument.ToString()
        // https://referencesource.microsoft.com/#System.Xml.Linq/System/Xml/Linq/XLinq.cs,2004
        // I omitted the XML declaration because XElement.ToString() omits it, but you might want to include it, depending upon your needs.
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            Indent = (options & SaveOptions.DisableFormatting) == 0,
            Encoding = encoding ?? DefaultEncoding,
        };
        if ((options & SaveOptions.OmitDuplicateNamespaces) != 0)
            settings.NamespaceHandling |= NamespaceHandling.OmitDuplicates;
        return node.ToByteArray(settings);
    }

    public static byte[] ToByteArray(this XNode node, XmlWriterSettings settings)
    {
        using var ms = new MemoryStream();
        using (var writer = XmlWriter.Create(ms, settings))
            node.WriteTo(writer);
        return ms.ToArray();
    }
}
Now you can format your XElement to a UTF8-encoded byte array by doing:
var bytes = root.ToByteArray();
The extension method has the added advantage that, if you really need to use some encoding other than UTF8, unsupported Unicode characters will be escaped rather than replaced with a fallback as explained in this answer to XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter.
var bytes = root.ToByteArray(encoding : Encoding.Default);
To check for correctness, you could examine the contents of the byte array in the debugger or your console app by decoding it to a string as follows:
var resultString = Encoding.UTF8.GetString(bytes);
Console.WriteLine(resultString);
Or with the default encoding:
var resultString = Encoding.Default.GetString(bytes);
You could also assert that the contents of the byte array are correct by parsing the contents back to a new XElement and checking that the result is semantically identical to the original by using XNode.DeepEquals():
Assert.IsTrue(
    XNode.DeepEquals(root,
        XElement.Load(new StreamReader(new MemoryStream(bytes), encoding))));

C#: how to add a line break to a memory stream

I am merging 3 files, for example, but in the end there are no line breaks between the files...
MemoryStream m = new MemoryStream();
File.OpenRead("c:\file1.txt").CopyTo(m);
File.OpenRead("c:\file2.txt").CopyTo(m);
File.OpenRead("c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
How can I add a line break to the memory stream?
You can write the line break to the stream. You need to decide which one you want. Probably, you want Encoding.Xxx.GetBytes(Environment.NewLine). You also need to decide which encoding to use (which must match the encoding of the other files).
Since the line break string is ASCII, what matters is only the distinction between single-byte encodings and those that use more; Encoding.Unicode (UTF-16), for example, uses two bytes per newline character.
If you need to guess, you should probably go with UTF-8 without a BOM.
You also can try a fully text based approach:
var result = File.ReadAllText(a) + Environment.NewLine + File.ReadAllText(b);
Let me also point out that you need to dispose the streams that you open.
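Putting those points together, here is a minimal sketch (the file paths are just the placeholders from the question, and UTF-8 text is assumed) that disposes each source stream and writes an explicit newline between the files:

using System;
using System.IO;
using System.Text;

class MergeWithLineBreaks
{
    static void Main()
    {
        var files = new[] { @"c:\file1.txt", @"c:\file2.txt", @"c:\file3.txt" };
        byte[] newLine = Encoding.UTF8.GetBytes(Environment.NewLine);

        using var m = new MemoryStream();
        foreach (var path in files)
        {
            using (var file = File.OpenRead(path)) // dispose each source stream
                file.CopyTo(m);
            m.Write(newLine, 0, newLine.Length);   // explicit separator after each file
        }

        m.Position = 0;
        Console.WriteLine(new StreamReader(m, Encoding.UTF8).ReadToEnd());
    }
}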
Quick and dirty:
MemoryStream m = new MemoryStream();
File.OpenRead(@"c:\file1.txt").CopyTo(m);
m.WriteByte(0x0A); // this is the ASCII code for the \n line feed.
                   // You might want or need \r\n, in which case you'd
                   // need to write 0x0D (carriage return) before 0x0A.
File.OpenRead(@"c:\file2.txt").CopyTo(m);
m.WriteByte(0x0A);
File.OpenRead(@"c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
But as @usr points out, you really should think about the encoding.
Assuming you know the encoding, for example UTF-8, you can do:
using (var ms = new MemoryStream())
{
    // Do stuff ...
    var newLineBytes = Encoding.UTF8.GetBytes(Environment.NewLine);
    ms.Write(newLineBytes, 0, newLineBytes.Length);
    // Do more stuff ...
}

How do I load a string into a FileStream without going to disk?

string abc = "This is a string";
How do I load abc into a FileStream?
FileStream input = new FileStream(.....);
Use a MemoryStream instead...
MemoryStream ms = new MemoryStream(System.Text.Encoding.ASCII.GetBytes(abc));
Remember that a MemoryStream (just like a FileStream) needs to be closed when you have finished with it. You can always place your code in a using block to make this easier...
using (MemoryStream ms = new MemoryStream(System.Text.Encoding.ASCII.GetBytes(abc)))
{
    // use the stream here and don't worry about needing to close it
}
NOTE: If your string is Unicode rather than ASCII, you may want to specify this when converting to a byte array. A UTF-16 (Encoding.Unicode) character takes up 2 bytes instead of 1, with a padding byte added where needed (e.g. Encoding.Unicode encodes "a" as 0x61 0x00, whereas in ASCII it is the single byte 0x61).
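If the string can contain non-ASCII characters, a UTF-8 variant of the same idea (just a sketch, assuming using System.IO and System.Text are in scope) avoids the data loss of ASCII.GetBytes:

using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(abc)))
using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    Console.WriteLine(reader.ReadToEnd()); // round-trips non-ASCII characters intact
}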

issue with XML encoding

I tried to phrase this as a generic question but realized I don't know enough, so here is the problem I'm having.
Here is a snippet from a console application:
public void Run()
{
    Run(Console.Out);
}

public void Run(TextWriter writer)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(writer);
}
Then I run it from the console and use ">" to put it in a file.
c:\> QuickBooksETL extract US > qb_us.xml
If I try to load the result as I would normally:
var x = XDocument.Load("qb_us.xml");
I get the error:
Invalid character in the given encoding. Line 8, position 26.
So I tried to determine what .NET "thinks" it is using:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path);
sr.CurrentEncoding.Dump();
Result:
System.Text.UTF8Encoding
  BodyName          utf-8
  EncodingName      Unicode (UTF-8)
  HeaderName        utf-8
  WebName           utf-8
  WindowsCodePage   1200
  IsBrowserDisplay  True
  IsBrowserSave     True
  IsMailNewsDisplay True
  IsMailNewsSave    True
  IsSingleByte      False
  EncoderFallback   System.Text.EncoderReplacementFallback { DefaultString = �, MaxCharCount = 1 }
  DecoderFallback   System.Text.DecoderReplacementFallback { DefaultString = �, MaxCharCount = 1 }
  IsReadOnly        True
  CodePage          65001
Finally, I find by guessing that it works if I just explicitly say it's ASCII:
string path = @"\\ad1\accounting$\Xml\qb_us.xml";
StreamReader sr = new StreamReader(path, Encoding.ASCII);
var x = XDocument.Load(sr);
Any thoughts on where I am going wrong would be greatly appreciated. I admit I have never taken the "deep dive" on character encodings, but I'm willing to put in the effort to get this right.
The simple answer is not to get the console involved. Write directly to the file from your code:
public void Run(string filename)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(filename);
}

or create the TextWriter or Stream yourself and pass that in, e.g.

public void Run(Stream output)
{
    DataTable customers = _quickBooksAdapter.GetTableData("Customer");
    customers.WriteXml(output);
}
Note that by reading it as ASCII, you'll basically be getting question marks for any non-ASCII character in the original data. IIRC, that's the default behaviour of an encoding when it encounters binary data it can't handle.
Using a Stream it should default to writing out in UTF-8, and the XML declaration and the data within the file should match.
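For example, a minimal sketch of calling the stream-based overload (etl here is a hypothetical instance of the class exposing Run, and the path is the one from the question):

using (var output = File.Create(@"\\ad1\accounting$\Xml\qb_us.xml"))
{
    etl.Run(output); // the bytes on disk now match whatever the XML declaration says
}

var x = XDocument.Load(@"\\ad1\accounting$\Xml\qb_us.xml"); // should now load without the "Invalid character" error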
In my experience, if your data includes illegal characters (for example, character 12), the XML doesn't round trip unless you read the XML with an XmlTextReader with Normalization = false. I've been using XmlSerializer.Deserialize(), not XDocument.Load(). Still, you might try calling the Load(XmlReader) overload by passing in an XmlTextReader with Normalization = false.
I would add my voice to Jon's in suggesting that you write to your own stream, not Console.Out.

Why am I getting an extra character (a dot or bullet point) at the beginning of my byte array?

I have the following code used to get xml from a DataSet into a byte array using UTF-8 encoding:
private static byte[] fGetXmlBytes(DataTable lvDataTable)
{
    XmlWriterSettings lvSettings = new XmlWriterSettings();
    lvSettings.Encoding = Encoding.UTF8;
    lvSettings.NewLineHandling = NewLineHandling.Replace;
    lvSettings.NewLineChars = String.Empty;
    using (MemoryStream lvMemoryStream = new MemoryStream())
    using (XmlWriter lvWriter = XmlWriter.Create(lvMemoryStream, lvSettings))
    {
        lvDataTable.WriteXml(lvWriter, XmlWriteMode.IgnoreSchema);
        // Lines used during debugging
        //byte[] lvXmlBytes = lvMemoryStream.GetBuffer();
        //String lsXml = Encoding.UTF8.GetString(lvXmlBytes, 0, lvXmlBytes.Length);
        return lvMemoryStream.GetBuffer();
    }
}
I want a byte array because I subsequently pass the data to compression and encryption routines that work on byte arrays. The problem is that I end up with an extra character at the start of the XML. Instead of:
<?xml version="1.0" encoding="utf-8"?><etc....
I get
.<?xml version="1.0" encoding="utf-8"?><etc....
Does anyone know why the character is there? Is there a way to prevent the character being added? Or to easily strip it out?
Colin
You will have to use an Encoding class that doesn't emit a preamble. The object returned by Encoding.UTF8 will emit a preamble, but you can create your own UTF8Encoding that doesn't emit a preamble like this:
lvSettings.Encoding = new UTF8Encoding(false);
The UTF-8 preamble is the Unicode byte order mark (U+FEFF) encoded using UTF-8. The purpose of the Unicode byte order mark is to indicate the endianness (byte order) of the 16-bit code units of the stream. If the initial bytes in the stream are 0xFE 0xFF the stream is big endian; otherwise, if the initial bytes are 0xFF 0xFE the stream is little endian.
U+FEFF encoded using UTF-8 results in the bytes 0xEF 0xBB 0xBF, and somewhat ironically, because UTF-8 encodes into a sequence of 8-bit bytes, the byte order no longer matters.
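As a quick sketch of the difference, GetPreamble() shows what each encoding instance will emit at the start of the stream:

byte[] withBom    = Encoding.UTF8.GetPreamble();           // the static UTF8 property emits a preamble
byte[] withoutBom = new UTF8Encoding(false).GetPreamble(); // this one does not

Console.WriteLine(BitConverter.ToString(withBom));  // EF-BB-BF
Console.WriteLine(withoutBom.Length);               // 0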
Preamble perhaps? Info here: http://www.firstobject.com/dn_markutf8preamble.htm
The extra character is the UTF-8 preamble. AFAIK you cannot prevent the preamble from being written to the stream. However, does it really matter? When the byte array is parsed back into XML, the preamble will be correctly interpreted without error, so you might as well just leave it in there.
I am doing mostly the same with this code and it works perfectly:
MemoryStream data = new MemoryStream(1000);
datatable.WriteXml(data);
return data.ToArray();
