Download HTML to Word with Chinese characters - c#

We have a "Download to Word" feature in our application. Rather than creating an actual binary .doc file we create an HTML document and set the MIME type to indicate it's a Word document. Here's a stripped-down version of the method we're using.
private FileContentResult ExportToWord( string htmlSource, string filename )
{
    StringBuilder doc = new StringBuilder();
    doc.Append( "<html><body>" );
    doc.Append( htmlSource );
    doc.Append( "</body></html>" );

    byte[] buffer = Encoding.UTF8.GetBytes( doc.ToString() );

    FileContentResult result = new FileContentResult( buffer, "application/msword" );
    result.FileDownloadName = string.Format( "{0}.doc", filename );
    return result;
}
In the above example htmlSource is the body of the document, so it would contain something like:
<p>This is the first paragraph.</p>
All of the above works just fine until we introduce Unicode characters into htmlSource. If htmlSource contains
<p>这是一个测试</p>
then in the Word document we get a garbled string of mojibake instead of the Chinese text: with no encoding indicator in the file, Word interprets the UTF-8 bytes using the local ANSI code page, so each multi-byte character is rendered as a run of wrong characters.
We've tried replacing Encoding.UTF8 with Encoding.Unicode and Encoding.UTF32 but in both cases Word ends up displaying all the markup with null/space between each character (and the Chinese strings still don't show up correctly).
I've also tried using Server.HtmlEncode against the Chinese string, but that gives me back the same string of Chinese characters.
I'm at a loss as to how to solve this problem.

As it turns out, while finding the solution wasn't easy, the actual implementation was pretty simple. We just changed this line:
byte[] buffer = Encoding.UTF8.GetBytes( doc.ToString() );
To this:
byte[] buffer = Encoding.Unicode.GetPreamble()
    .Concat( Encoding.Unicode.GetBytes( doc.ToString() ) )
    .ToArray();
The GetPreamble() method prepends the byte-order mark (BOM) to the file so Word knows how to interpret its contents. Word can now detect that the file contains Unicode text and renders the markup properly instead of displaying it literally in the document.
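For completeness, here is the method with the fix in place; this is just the original method above with the new buffer line substituted in (Concat requires a using System.Linq directive):

private FileContentResult ExportToWord( string htmlSource, string filename )
{
    StringBuilder doc = new StringBuilder();
    doc.Append( "<html><body>" );
    doc.Append( htmlSource );
    doc.Append( "</body></html>" );

    // Prepend the UTF-16 byte-order mark so Word can detect the encoding.
    byte[] buffer = Encoding.Unicode.GetPreamble()
        .Concat( Encoding.Unicode.GetBytes( doc.ToString() ) )
        .ToArray();

    FileContentResult result = new FileContentResult( buffer, "application/msword" );
    result.FileDownloadName = string.Format( "{0}.doc", filename );
    return result;
}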

Related

How can I convert an XElement to a byte array for a PutFile operation?

I need to convert a big XElement to a byte array so that it can be uploaded later to a fileshare. What is the correct method to call to do that?
Below you see the signature of a method fileShare.PutFile that is internal:
void PutFile(string folder, string fileName, byte[] content);
Then given an XElement xml, I tried converting it to a byte array by encoding its XElement.Value using Encoding.Default.GetBytes() as follows:
byte[] bytes = Encoding.Default.GetBytes(xml.Value);
fileShare.PutFile(folderName, blobName, bytes);
I am not so sure that xml.Value (XElement.Value) is really what the GetBytes method needs, though. Is this correct?
To test this, I spun up a console app and put in some fake data. I did this for the XElement:
XElement root = new XElement("Root",
    new XElement("Child1", 1),
    new XElement("Child2", 2),
    new XElement("Child3", 3),
    new XElement("Child4", 4),
    new XElement("Child5", 5),
    new XElement("Child6", 6)
);
Then I tried that line of code to convert it to a byte array:
byte[] bytes = Encoding.Default.GetBytes(root.Value);
When I step over that line and look at the Autos window, I see that the bytes variable is byte[6], and when I expand it, [0] = 49 and so on.
Now this may not mean it isn't working... or does it? How can I interpret the contents of the bytes array to check whether it is correct?
Firstly, using Encoding.Default is not recommended. From the docs:
Warning
Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these reasons, using the default encoding is not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding. You could also use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Secondly, XElement.Value returns
A String that contains all of the text content of this element. If there are multiple text nodes, they will be concatenated.
Thus if you upload the Value you will be stripping away the entire XML markup structure from your file, leaving only the plain text. While you might want to do that, it seems very unlikely. If you compare the value with the string returned by XElement.ToString() the difference should be clear.
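With the sample root element from the question, the difference is easy to see; this also explains why the first byte in the question's array was 49, the ASCII code for the character '1':

Console.WriteLine(root.Value);      // prints "123456" - only the concatenated text nodes
Console.WriteLine(root.ToString()); // prints the indented "<Root><Child1>1</Child1>...</Root>" markup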
Instead, to convert the XML contents of your XElement (including both markup and text) to a byte array, it would be better to write your XElement directly to a MemoryStream using an appropriately configured XmlWriterSettings and return the byte array thereby created. The following extension method does the job:
public static partial class XNodeExtensions
{
    // Disable the BOM because XElement.ToString() does not include it.
    static Encoding DefaultEncoding { get; } = new UTF8Encoding(false);

    public static byte[] ToByteArray(this XNode node, SaveOptions options = default, Encoding encoding = default)
    {
        // Emulate the settings of XElement.ToString() and XDocument.ToString()
        // https://referencesource.microsoft.com/#System.Xml.Linq/System/Xml/Linq/XLinq.cs,2004
        // I omitted the XML declaration because XElement.ToString() omits it, but you might want to include it, depending upon your needs.
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            Indent = (options & SaveOptions.DisableFormatting) == 0,
            Encoding = encoding ?? DefaultEncoding,
        };
        if ((options & SaveOptions.OmitDuplicateNamespaces) != 0)
            settings.NamespaceHandling |= NamespaceHandling.OmitDuplicates;
        return node.ToByteArray(settings);
    }

    public static byte[] ToByteArray(this XNode node, XmlWriterSettings settings)
    {
        using var ms = new MemoryStream();
        using (var writer = XmlWriter.Create(ms, settings))
            node.WriteTo(writer);
        return ms.ToArray();
    }
}
Now you can format your XElement to a UTF8-encoded byte array by doing:
var bytes = root.ToByteArray();
The extension method has the added advantage that, if you really need to use some encoding other than UTF8, unsupported Unicode characters will be escaped rather than replaced with a fallback as explained in this answer to XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter.
var bytes = root.ToByteArray(encoding : Encoding.Default);
To check for correctness, you could examine the contents of the byte array in the debugger or your console app by decoding it to a string as follows:
var resultString = Encoding.UTF8.GetString(bytes);
Console.WriteLine(resultString);
Or with the default encoding:
var resultString = Encoding.Default.GetString(bytes);
You could also assert that the contents of the byte array are correct by parsing the contents back to a new XElement and checking that the result is semantically identical to the original by using XNode.DeepEquals():
Assert.IsTrue(
    XNode.DeepEquals(root,
        XElement.Load(new StreamReader(new MemoryStream(bytes), encoding))));
Demo fiddle here.

How can I use C# to search an XML file for specific words?

I'm very new to C# and XML files in general, but currently I have an XML file that still has some HTML markup in it (&amp, ;quot;, etc.), and I want to read through the XML file and remove all of those so it becomes easily readable. I can open and print the file to the console with no issue, but I'm stumped on how to search for those specific strings and remove them.
One way to do this would be to put all the words you want to remove into an array, and then use the Replace method to replace them with empty strings:
var xmlFilePath = @"c:\temp\original.xml";
var newFilePath = @"c:\temp\modified.xml";
var wordsToRemove = new[] { "&amp", ";quot;" };

// Read the existing xml file
var fileContents = File.ReadAllText(xmlFilePath);

// Remove the words
foreach (var word in wordsToRemove)
{
    fileContents = fileContents.Replace(word, "");
}

// Create a new file with the words removed
File.WriteAllText(newFilePath, fileContents);
I suppose you are looking for this: https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?view=netcore-3.1
Converts a string that has been HTML-encoded for HTTP transmission into a decoded string.
// Encode the string.
string myEncodedString = HttpUtility.HtmlEncode(myString);
Console.WriteLine($"HTML Encoded string is: {myEncodedString}");
StringWriter myWriter = new StringWriter();
// Decode the encoded string.
HttpUtility.HtmlDecode(myEncodedString, myWriter);
string myDecodedString = myWriter.ToString();
Console.Write($"Decoded string of the above encoded string is: {myDecodedString}");
Your string is HTML-encoded, probably for transmission over the network, so there is a built-in method to decode it.
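Applied to the question, a minimal sketch might look like this; the file paths are placeholders, and WebUtility.HtmlDecode (in System.Net) is equivalent to HttpUtility.HtmlDecode but does not require a reference to System.Web:

using System.IO;
using System.Net;

// Decode all HTML entities in the file and write the result to a new file.
string original = File.ReadAllText(@"c:\temp\original.xml");
string decoded = WebUtility.HtmlDecode(original);
File.WriteAllText(@"c:\temp\decoded.xml", decoded);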

Understand how to decode this base64 encoded string

I've handled base64 encoded images and strings and have been able to decode them using C# in the past.
I'm now trying to decode what looks to me like a base64 string, but the value I'm getting is only about 98% accurate, and I just don't understand what is affecting the output.
Here is the string:
http://pastebin.com/ntcth6uN
And this is the decoded value:
http://pastebin.com/Buh4xXDA
That IS what it should be, but you can clearly see where there are artifacts and the decoded value isn't quite right.
Any idea why it's failing?
var data = Convert.FromBase64String(Faces[i].InfoData);
Faces[i].InfoData = Encoding.UTF8.GetString(data);
Thanks for your help.
The string is not encoded as UTF-8 but with another encoding, so that same encoding must be used to decode it.
Use the following to decode it:
Encoding.ASCII.GetString(data);
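Applied to the question's code, that is a one-line change (a sketch, assuming Faces[i].InfoData holds the raw base64 string):

var data = Convert.FromBase64String(Faces[i].InfoData);
Faces[i].InfoData = Encoding.ASCII.GetString(data);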
If ASCII isn't the correct encoding, here's some code that will iterate through the available encodings and print the first 200 characters of each result so you can manually pick the original encoding:
String encStr = "PFdhdGNoIG5hbWU9IlBpa2FjaHUDV2F0Y2DDQ2hyb25vIiBkZXNjcmlwdGlvbj0iIiBhdXRob3I9IkFsY2hlbWlzdFByaW1lIiB3ZWJfbGluaz0iaHR0cgov42ZhY2VyZXBv4mNvbS9hcHAvZmFjZXMvdXNlci9BbGNoZW1pc3RQcmltZS8xIiBiZ19jb2xvcj0iYWJhYmFiIiBpbmRfbG9jPSJ0YyIDaW5kX2JnPSJZIiBob3R3b3JkX2xvYz0idGMiIGhvdHdvcmRfYmc9IlkiPDoDICADPExheWVyIHR5cGU9ImltYWdlIiBLPSIwIiB5PSIwIiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDcGF0ag0i4mltZzQ5OTIucHBuZyIDd2lkdGD9IjU1MCIDaGVpZ2h0PSI1NTAiIGNvbG9yPSJmZmZmZmYiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0idGVLdCIDeg0iMTciIHk9IjQ5IiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDdGVLdg0ie2RofTp7ZG16fSIDdGVLdF9zaXplPSIzMiIDZm9udg0iTENEUEhPTkUiIHRyYW5zZm9ybT0ibiIDY29sb3JfZGltPSIxZgFkMWQiIGNvbG9yPSIxZgFkMWQiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0idGVLdCIDeg0iOTAiIHk9IjU0IiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDdGVLdg0ie2Rzen0iIHRleHRfc2l6ZT0iMTDiIGZvbnQ9IkxgRFBIT05FIiB0cmFuc2Zvcm09ImLiIGNvbG9yX2RpbT0iMWQxZgFkIiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9InRleHQiIHD9IjEyNCIDeT0iNTQiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0ie3N3cnN9PT0DMCBhbmQDMTAwIG9yIgAiIGFsaWdubWVudg0iY2MiIHRleHQ9IntkYX0iIHRleHRfc2l6ZT0iMjAiIGZvbnQ9IkxgRFBIT05FIiB0cmFuc2Zvcm09ImLiIGNvbG9yX2RpbT0iMWQxZgFkIiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9InRleHQiIHD9Ii0LOSIDeT0iNTIiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0iMTAwIiBhbGlnbm1lbnQ9ImNjIiB0ZXh0PSJ7Ymx9IiB0ZXh0X3NpemU9IjE5IiBmb250PSJMQ0RQSE9ORSIDdHJhbnNmb3JtPSJuIiBjb2xvcl9kaW09IjFkMWQxZCIDY29sb3I9IjFkMWQxZCIDZGlzcGxheT0iYmQi4zLKICADIgxMYXllciB0eXBlPSJzaGFwZSIDeg0iMSIDeT0iNgDiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0ie3N3cnN9PT0DMCBhbmQDMCBvciAxMgAiIGFsaWdubWVudg0iY2MiIHNoYXBlPSJTcXVhcmUiIHdpZHRoPSIyNzDiIGhlaWdodg0iNgUiIGNvbG9yPSJhYmFiYWIiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0idGVLdCIDeg0iMCIDeT0iNgDiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0ie3N3cnN9PT0DMCBhbmQDMCBvciAxMgAiIGFsaWdubWVudg0iY2MiIHRleHQ9Intzd219Ontzd3N9Ontzd3Nzc30iIHRleHRfc2l6ZT0iMzYiIGZvbnQ9IkxgRFBIT05FIiB0cmFuc2Zvcm09ImLiIGNvbG9yX2RpbT0iMWQxZgFkIiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9InRleHQiIHD9Ii0xMTDiIHk9Ijk2IiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDdGVLdg0ie3N3cn0DYW5kICZhcG9zO1NUT1AmYXBvczsDb3IDJmFwb3M7U1RBUlQmYXBvczsiIHRleHRfc2l6ZT0iMTUiIGZvbnQ9IlJvYm90by1SZWd1bGFyIiB0cmFuc2Zvcm09ImLiIGNvbG9yX2RpbT0iMWQxZgFkIiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIDdGFwX2FjdGlvbj0ic3dfc3RhcnRfc3RvcCIvPDoDICADPExheWVyIHR5cGU9InRleHQiIHD9IjExOCIDeT0iOTYiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0iMTAwIiBhbGlnbm1lbnQ9ImNjIiB0ZXh0PSJSRVNFVCIDdGVLdF9zaXplPSIxNSIDZm9udg0iUm9ib3Rv4VJlZ3VsYXIiIHRyYW5zZm9ybT0ibiIDY29sb3JfZGltPSIxZgFkMWQiIGNvbG9yPSIxZgFkMWQiIGRpc3BsYXk9ImJkIiB0YXBfYWN0aW9uPSJzd19yZXNldCIvPDoDICADPExheWVyIHR5cGU9InNoYXBlIiBLPSItMTI0IiB5PSIxNgEiIGd5cm89IjAiIHJvdGF0aW9uPSItOTAiIHNrZXdfeg0iMCIDc2tld195PSIwIiBvcGFjaXR5PSIxMgAiIGFsaWdubWVudg0iY2MiIHNoYXBlPSJUcmlhbmdsZSIDd2lkdGD9IjM1IiBoZWlnaHQ9IjM1IiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9InNoYXBlIiBLPSItMTE3IiB5PSIxNgQiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0iMCIDYWxpZ25tZW50PSJjYyIDc2hhcGU9IlNxdWFyZSIDd2lkdGD9IjEyMCIDaGVpZ2h0PSIxMjAiIGNvbG9yPSIyOWI5ZgMiIGRpc3BsYXk9ImJkIiB0YXBfYWN0aW
9uPSJzd19zdGFydF9zdG9wIi8+CiADICA8TGF5ZXIDdHlwZT0ic2hhcGUiIHD9IjEyNCIDeT0iMTQxIiBneXJvPSIwIiByb3RhdGlvbj0iOTAiIHNrZXdfeg0iMCIDc2tld195PSIwIiBvcGFjaXR5PSIxMgAiIGFsaWdubWVudg0iY2MiIHNoYXBlPSJUcmlhbmdsZSIDd2lkdGD9IjM1IiBoZWlnaHQ9IjM1IiBjb2xvcj0iMWQxZgFkIiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9InNoYXBlIiBLPSIxMTciIHk9IjE0NCIDZ3lybz0iMCIDcm90YXRpb2L9IjAiIHNrZXdfeg0iMCIDc2tld195PSIwIiBvcGFjaXR5PSIwIiBhbGlnbm1lbnQ9ImNjIiBzaGFwZT0iU3F1YXJlIiB3aWR0ag0iMTIwIiBoZWlnaHQ9IjEyMCIDY29sb3I9IjI5YjlkMyIDZGlzcGxheT0iYmQiIHRhcF9hY3Rpb2L9InN3X3Jlc2V0Ii8+CiADICA8TGF5ZXIDdHlwZT0idGVLdCIDeg0i4TDyIiB5PSItNgMiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0iMTAwIiBhbGlnbm1lbnQ9ImNjIiB0ZXh0PSJ7d3RkfSIDdGVLdF9zaXplPSIyNSIDZm9udg0iTENEUEhPTkUiIHRyYW5zZm9ybT0ibiIDY29sb3JfZGltPSIxZgFkMWQiIGNvbG9yPSIxZgFkMWQiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0iaW1hZ2VfY29uZCIDeg0i4TD5IiB5PSItOTIiIGd5cm89IjAiIHJvdGF0aW9uPSIwIiBza2V3X3D9IjAiIHNrZXdfeT0iMCIDb3BhY2l0eT0iMTAwIiBhbGlnbm1lbnQ9ImNjIiBwYXRoPSJ3ZWF0aGVyX3NldF8z4nBwbmciIHdpZHRoPSI3MCIDaGVpZ2h0PSI3MCIDY29sb3I9IjQ1NgU0NSIDY29uZF92YWx1ZT0iJmFwb3M7e3djaX0mYXBvczsDPT0DJmFwb3M7MgFkJmFwb3M7IGFuZCAxIG9yICZhcG9zO3t3Y2l9JmFwb3M7Ig09ICZhcG9zOzAyZCZhcG9zOyBhbmQDMiBvciAmYXBvczt7d2NpfSZhcG9zOyA9PSAmYXBvczswM2QmYXBvczsDYW5kIgMDb3IDJmFwb3M7e3djaX0mYXBvczsDPT0DJmFwb3M7MgRkJmFwb3M7IGFuZCA0IG9yICZhcG9zO3t3Y2l9JmFwb3M7Ig09ICZhcG9zOzA5ZCZhcG9zOyBhbmQDNSBvciAmYXBvczt7d2NpfSZhcG9zOyA9PSAmYXBvczsxMGQmYXBvczsDYW5kIgYDb3IDJmFwb3M7e3djaX0mYXBvczsDPT0DJmFwb3M7MTFkJmFwb3M7IGFuZCA3IG9yICZhcG9zO3t3Y2l9JmFwb3M7Ig09ICZhcG9zOzEzZCZhcG9zOyBhbmQDOCBvciAmYXBvczt7d2NpfSZhcG9zOyA9PSAmYXBvczs1MGQmYXBvczsDYW5kIgkDb3IDMSIDY29uZF9ncmlkPSIzegMiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0idGVLdCIDeg0iMTEzIiB5PSIzIiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDdGVLdg0iTW9vbiIDdGVLdF9zaXplPSIxOCIDZm9udg0iTENEUEhPTkUiIHRyYW5zZm9ybT0ibiIDY29sb3JfZGltPSIxZgFkMWQiIGNvbG9yPSIxZgFkMWQiIGRpc3BsYXk9ImJkIi8+CiADICA8TGF5ZXIDdHlwZT0ic2hhcGUiIHD9IjExNCIDeT0i4TUyIiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDc2hhcGU9IkNpcmNsZSIDd2lkdGD9IjYwIiBoZWlnaHQ9IjYwIiBjb2xvcj0iNgU0NTQ1IiBkaXNwbGF5PSJiZCIvPDoDICADPExheWVyIHR5cGU9ImltYWdlX2NvbmQiIHD9IjExNCIDeT0i4TUyIiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDcGF0ag0ibW9vbl9zZXRfMy5wcG5nIiB3aWR0ag0iNjAiIGhlaWdodg0iNjAiIGNvbG9yPSJhYmFiYWIiIGNvbmRfdmFsdWU9Int3bXB9IiBjb25kX2dyaWQ9IjNLMyIDZGlzcGxheT0iYmQi4zLKICADIgxMYXllciB0eXBlPSJ0ZXh0IiBLPSIwIiB5PSIwIiBneXJvPSIwIiByb3RhdGlvbj0iMCIDc2tld19LPSIwIiBza2V3X3k9IjAiIG9wYWNpdHk9IjEwMCIDYWxpZ25tZW50PSJjYyIDdGVLdg0iIiB0ZXh0X3NpemU9IjQwIiBmb250PSJSb2JvdG8tUmVndWxhciIDdHJhbnNmb3JtPSJuIiBjb2xvcl9kaW09ImZmZmZmZiIDY29sb3I9ImZmZmZmZiIDZGlzcGxheT0iYmQi4zLKICADIgxMYXllciB0eXBlPSJ0ZXh0IiBLPSIxMCIDeT0i4TExNiIDZ3lybz0iMCIDcm90YXRpb2L9IjAiIHNrZXdfeg0iMCIDc2tld195PSIwIiBvcGFjaXR5PSIxMgAiIGFsaWdubWVudg0iY2MiIHRleHQ9IntkZHd9IHtkbm5ufSB7ZGR9IiB0ZXh0X3NpemU9IjE2IiBmb250PSJMQ0RQSE9ORSIDdHJhbnNmb3JtPSJuIiBjb2xvcl9kaW09IjFkMWQxZCIDY29sb3I9IjFkMWQxZCIDZGlzcGxheT0iYmQi4zLKPC9XYXRjagLK";
var data = Convert.FromBase64String(encStr);

// Iterate over all encodings, and decode
foreach (EncodingInfo ei in Encoding.GetEncodings())
{
    Encoding e = ei.GetEncoding();
    Console.Write("{0,-15} - {1}{2}", ei.CodePage, e.EncodingName, System.Environment.NewLine);
    Console.WriteLine(e.GetString(data).Substring(0, 200));
    Console.Write(System.Environment.NewLine + "---------------------" + System.Environment.NewLine);
}
Fiddle: https://dotnetfiddle.net/LcV5s8

How to strip out 0x0a special char from utf8 file using c# and keep file as utf8?

The following is a line from a UTF-8 file from which I am trying to remove the special char (0X0A), which shows up as a black diamond with a question mark below:
2464577 外國法譯評 True s6620178 Unspecified <1>�1009-672
This file is generated when SSIS reads a SQL table and then writes it out using a flat file manager set to code page 65001.
When I open the file in Notepad++, the character displays as 0x0A.
I'm looking for some C# code to definitely strip that char out and replace it with either nothing or a blank space.
Here's what I have tried:
string fileLocation = "c:\\MyFile.txt";
var content = string.Empty;

using (StreamReader reader = new System.IO.StreamReader(fileLocation))
{
    content = reader.ReadToEnd();
    reader.Close();
}

content = content.Replace('\u00A0', ' ');
//also tried: content.Replace((char)0X0A, ' ');
//also tried: content.Replace((char)0X0A, '');
//also tried: content.Replace((char)0X0A, (char)'\0');

Encoding encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileLocation, FileMode.Create))
{
    using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    {
        writer.Write(encoding.GetPreamble()); //This is for writing the BOM
        writer.Write(content);
    }
}
I also tried this code to get the actual string value:
byte[] bytes = { 0x0A };
string text = Encoding.UTF8.GetString(bytes);
And it comes back as "\n". So in the code above I also tried replacing "\n" with " ", both in double quotes and single quotes, but still no change.
At this point I'm out of ideas. Anyone got any advice?
Thanks.
You may want to have a look at regex replacement. For a good example of this, take a look at the post towards the bottom of this page:
http://social.msdn.microsoft.com/Forums/en-US/1b523d24-dab6-4870-a9ca-5d313d1ee602/invalid-character-returned-from-webservice
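For illustration only (this is an assumption about the approach, not the exact code from the linked post), a regex replacement over the question's content string might look like this; the black diamond with a question mark usually indicates the Unicode replacement character U+FFFD:

using System.Text.RegularExpressions;

// Replace the replacement character (U+FFFD) and other non-printable control
// characters (except tab, CR and LF) with a space.
content = Regex.Replace(content, "[\uFFFD\u0000-\u0008\u000B\u000C\u000E-\u001F]", " ");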
You can convert the string to a char array and loop through the array.
Then check what char the black diamond is and just remove it.
string content = "blahblah" + (char)10 + "blahblah";
char find = (char)10;
content = content.Replace(find.ToString(), "");
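Putting that together with the question's read/write code, a minimal sketch might look like this (assuming the offending character really is U+000A; adjust the code point if inspection shows it is something else, such as U+FFFD):

using System.IO;
using System.Text;

// Read the UTF-8 file, strip the character, and write it back as UTF-8.
string fileLocation = "c:\\MyFile.txt";
string content = File.ReadAllText(fileLocation, Encoding.UTF8);
content = content.Replace(((char)0x0A).ToString(), " ");
// new UTF8Encoding(true) writes the BOM, matching the intent of the original BinaryWriter code.
File.WriteAllText(fileLocation, content, new UTF8Encoding(true));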

Decoding base64-encoded data from xml document

I receive some XML files with embedded base64-encoded images that I need to decode and save as files.
An unmodified (other than zipped) example of such a file can be downloaded below:
20091123-125320.zip (60KB)
However, I get errors like "Invalid length for a Base-64 char array" and "Invalid character in a Base-64 string". I marked the line in the code where I get the error in the code.
A file could look like this:
<?xml version="1.0" encoding="windows-1252"?>
<mediafiles>
  <media media-type="image">
    <media-reference mime-type="image/jpeg"/>
    <media-object encoding="base64"><![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]></media-object>
    <media.caption>What up</media.caption>
  </media>
</mediafiles>
And the code to process like this:
var xd = new XmlDocument();
xd.Load(filename);
var nodes = xd.GetElementsByTagName("media");

foreach (XmlNode node in nodes)
{
    var mediaObjectNode = node.SelectSingleNode("media-object");

    //The line below is where the errors occur
    byte[] imageBytes = Convert.FromBase64String(mediaObjectNode.InnerText);

    //Do stuff with the bytearray to save the image
}
The XML data is from an enterprise newspaper system, so I am pretty sure the files are OK, and there must be something wrong in the way I process them. Maybe a problem with the encoding?
I have tried writing out the contents of mediaObjectNode.InnerText, and it is the base64-encoded data, so navigating the XML document is not the issue.
I have been googling, binging, stackoverflowing and crying - and found no solution... Help!
Edit:
Added an actual example file (and a bounty). Please note the downloadable file uses a slightly different schema, since I simplified it in the above example, removing irrelevant stuff...
For a first shot I didn't use any programming language, just Notepad++.
I opened the XML file in it and copied the raw base64 content into a new file (without the square brackets).
Then I selected everything (Ctrl-A) and used Extensions - Mime Tools - Base64 decode. This threw an error about the wrong text length (it must be a multiple of 4). So I just added two equal signs ('=') as placeholders at the end to get a correct length.
Another retry and it decoded successfully into 'something'. Just save the file as .jpg and it opens like a charm in any picture viewer.
So I would say there IS something wrong with the data you get: it just doesn't have the right number of equal signs at the end to pad it to a length that can be broken into packets of 4.
The 'easy' way would be to add equal signs until the decoding no longer throws an error. The better way would be to count the number of characters (minus CR/LFs!) and add the needed ones in one step.
Further investigation
After some coding and reading up on the Convert function, the problem is that the producer attached a spurious equal sign. Notepad++ has no problem with tons of equal signs, but Microsoft's Convert function only works with zero, one or two of them. So if you pad the already existing one with additional equal signs you get an error too! To get this thing to work, you have to cut off all existing equal signs, calculate how many are needed and append them again.
Just for the bounty, here is my code (not absolutely perfect, but enough for a good starting point): ;-)
static void Main(string[] args)
{
    var elements = XElement
        .Load("test.xml")
        .XPathSelectElements("//media/media-object[@encoding='base64']");

    foreach (XElement element in elements)
    {
        var image = AnotherDecode64(element.Value);
    }
}

static byte[] AnotherDecode64(string base64Decoded)
{
    string temp = base64Decoded.TrimEnd('=');
    int asciiChars = temp.Length - temp.Count(c => Char.IsWhiteSpace(c));

    switch (asciiChars % 4)
    {
        case 1:
            //This would always produce an exception!!
            //Regardless what (or what not) you attach to your string!
            //Better would be some kind of throw new Exception()
            return new byte[0];
        case 0:
            asciiChars = 0;
            break;
        case 2:
            asciiChars = 2;
            break;
        case 3:
            asciiChars = 1;
            break;
    }

    temp += new String('=', asciiChars);
    return Convert.FromBase64String(temp);
}
As Oliver has already said, the base64 string is not valid: the string length must be a multiple of 4 after removing whitespace characters. If you look at the end of the base64 string (see below) you will see that the last line is shorter than the rest.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z=
If you remove this line, your program will work, but the resulting image will have a missing section in the bottom right-hand corner. You need to pad this line so the overall string length is correct. From my calculations, if you add 3 characters it should work.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z=
Remove the last two characters until the image decodes properly:
public Image Base64ToImage(string base64String)
{
    // Convert Base64 String to byte[]
    byte[] imageBytes = null;
    bool iscatch = true;

    while (iscatch)
    {
        try
        {
            imageBytes = Convert.FromBase64String(base64String);
            iscatch = false;
        }
        catch
        {
            int length = base64String.Length;
            base64String = base64String.Substring(0, length - 2);
        }
    }

    MemoryStream ms = new MemoryStream(imageBytes, 0, imageBytes.Length);

    // Convert byte[] to Image
    ms.Write(imageBytes, 0, imageBytes.Length);
    Image image = Image.FromStream(ms, true);
    pictureBox1.Image = image;
    return image;
}
Try using Linq to XML:
using System.Xml.XPath;

class Program
{
    static void Main(string[] args)
    {
        var elements = XElement
            .Load("test.xml")
            .XPathSelectElements("//media/media-object[@encoding='base64']");

        foreach (var element in elements)
        {
            byte[] image = Convert.FromBase64String(element.Value);
        }
    }
}
UPDATE:
After downloading the XML file and analyzing the value of the media-object node it is clear that it is not a valid base64 string:
string value = "PUT HERE THE BASE64 STRING FROM THE XML WITHOUT THE NEW LINES";
byte[] image = Convert.FromBase64String(value);
throws a System.FormatException saying that the length is not a valid base64 string. Even when I remove the \n from the string it doesn't work:
var elements = XElement
    .Load("20091123-125320.xml")
    .XPathSelectElements("//media/media-object[@encoding='base64']");

foreach (var element in elements)
{
    string value = element.Value.Replace("\n", "");
    byte[] image = Convert.FromBase64String(value);
}
also throws System.FormatException.
I've also had a problem decoding a Base64-encoded string from an XML document (specifically an Office Open XML package document).
It turned out that the string had an additional layer of encoding applied (HTML encoding), so doing the HTML decode first and then the Base64 decode did the trick:
private static byte[] DecodeHtmlBase64String(string value)
{
    return System.Convert.FromBase64String(System.Net.WebUtility.HtmlDecode(value));
}
Just in case someone else stumbles on the same issue.
Well, it's all very simple. CDATA is a node itself, so mediaObjectNode.InnerText actually produces <![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]>, which is obviously not valid Base64-encoded data.
To make things work, use mediaObjectNode.ChildNodes[0].Value and pass that value to Convert.FromBase64String.
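In the question's loop, that change would look roughly like this (a sketch based on the code in the question):

foreach (XmlNode node in nodes)
{
    var mediaObjectNode = node.SelectSingleNode("media-object");

    // Read the CDATA child node's value instead of InnerText before decoding.
    byte[] imageBytes = Convert.FromBase64String(mediaObjectNode.ChildNodes[0].Value);

    //Do stuff with the byte array to save the image
}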
Is the character encoding correct? The error sounds like there's a problem that causes invalid characters to appear in the array. Try copying out the text and decoding manually to see if the data is indeed valid.
(For the record, windows-1252 is not exactly the same as iso-8859-1, so that may be the cause of a problem, barring other sources of corruption.)
