Remove '�' from different encoded file when reading in C#

I can't control which encoding some of our clients use when saving a file, and when it's ASCII the file may contain characters that cannot be decoded, which then show up as '�'. How can I remove these '�' characters after the file is read?
I am reading the file with the line below, and for each column I would like to replace that character with whitespace in C# .NET.
using (var parser = new TextFieldParser("", Encoding.UTF8))

Looks like you can create a UTF-8 Encoding with a custom error replacement:
var encoding = Encoding.GetEncoding(
    "UTF-8",
    null,
    new DecoderReplacementFallback(string.Empty));
using (var parser = new TextFieldParser("", encoding)) {
    ⋮
}
I don’t know if the encoder fallback is allowed to be null. Replace it with new EncoderReplacementFallback(string.Empty) if not!
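As a minimal sketch of the same idea, the helper below decodes a byte array through a StreamReader and substitutes a space for every invalid sequence, which is what the question asks for. The class name, the Decode helper and the sample bytes are made up for illustration, and a non-null EncoderReplacementFallback is passed so the sketch does not depend on whether null is accepted.

```csharp
using System;
using System.IO;
using System.Text;

public static class LenientUtf8Demo
{
    // Hypothetical helper: decodes bytes as UTF-8 and substitutes `replacement`
    // for every byte sequence that is not valid UTF-8.
    public static string Decode(byte[] data, string replacement)
    {
        var encoding = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));

        using (var reader = new StreamReader(new MemoryStream(data), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // 0x80 is a lone continuation byte, which is never valid UTF-8 on its own.
        byte[] data = { (byte)'a', 0x80, (byte)'b' };
        Console.WriteLine("[" + Decode(data, " ") + "]");
    }
}
```

The same encoding object can be handed to TextFieldParser instead of StreamReader; passing string.Empty as the replacement drops the invalid bytes rather than turning them into spaces.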

Related

How can I use C# to search an XML file for specific words?

I'm very new to C# and XML files in general, but currently I have an XML file that still has some html markup in it (&amp, ;quot;, etc.) and I want to read through the XML file and remove all of those so it becomes easily readable. I can open and print the file to the console with no issue, but I'm stumped trying to search for those specific strings and remove them.

One way to do this would be to put all the words you want to remove into an array, and then use the Replace method to replace them with empty strings:
var xmlFilePath = @"c:\temp\original.xml";
var newFilePath = @"c:\temp\modified.xml";
var wordsToRemove = new[] {"&amp", ";quot;"};
// Read existing xml file
var fileContents = File.ReadAllText(xmlFilePath);
// Remove words
foreach (var word in wordsToRemove)
{
    fileContents = fileContents.Replace(word, "");
}
// Create new file with words removed
File.WriteAllText(newFilePath, fileContents);

I suppose you are looking for this: https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?view=netcore-3.1
Converts a string that has been HTML-encoded for HTTP transmission into a decoded string.
// Encode the string.
string myEncodedString = HttpUtility.HtmlEncode(myString);
Console.WriteLine($"HTML Encoded string is: {myEncodedString}");
StringWriter myWriter = new StringWriter();
// Decode the encoded string.
HttpUtility.HtmlDecode(myEncodedString, myWriter);
string myDecodedString = myWriter.ToString();
Console.Write($"Decoded string of the above encoded string is: {myDecodedString}");
Your string is html encoded, probably for transmission over network. So there is a built in method to decode it.
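A small sketch of decoding instead of deleting, using System.Net.WebUtility.HtmlDecode, which lives outside System.Web. The sample strings are invented; the second one assumes the text was encoded twice, which is only a guess based on the fragments quoted in the question.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Thin wrapper so the behaviour is easy to call and check.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }

    public static void Main()
    {
        // One pass turns entities back into the characters they stand for.
        Console.WriteLine(Decode("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"));

        // Text that was encoded twice needs two passes:
        // "&amp;quot;" becomes "&quot;", which then becomes a double quote.
        string once = Decode("&amp;quot;");
        string twice = Decode(once);
        Console.WriteLine(once + " -> " + twice);
    }
}
```

Unlike removing the entity text, decoding keeps the characters the entities stood for, so quotes and ampersands remain in the output.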

XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter

I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly, and are instead replaced with "??".
Here is sample code that demonstrates how the XML is written from the XmlDocument:
MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream,enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();
What can be done to correctly encode Kanji in this specific situation?

As mentioned in comments, the ? character is showing up because Kanji characters are not supported by the encoding ISO-8859-1, so it substitutes ? as a fallback character. Encoding fallbacks are discussed in the Documentation Remarks for Encoding:
Note that the encoding classes allow errors (unsupported characters) to:
Silently change to a "?" character.
Use a "best fit" character.
Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.
This is the behavior you are seeing.
However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding on XmlWriterSettings.Encoding like so:
MemoryStream mStream = new MemoryStream();
var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
Encoding = enc,
CloseOutput = false,
// Remove to enable the XML declaration if you want it. XmlTextWriter doesn't include it automatically.
OmitXmlDeclaration = true,
};
using (var writer = XmlWriter.Create(mStream, settings))
{
doc.WriteTo(writer);
}
mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();
By setting the Encoding property of XmlWriterSettings, the XML writer will be made aware whenever a character is not supported by the current encoding and automatically replace it with an XML character entity reference rather than some hardcoded fallback.
E.g. say you have XML like the following:
<Root>
<string>畑 はたけ hatake "field of crops"</string>
</Root>
Then your code will output the following, mapping all Kanji to the single fallback character:
<Root><string>? ??? hatake "field of crops"</string></Root>
Whereas the new version will output:
<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>
Notice that the Kanji characters have been replaced with character entities such as &#x7551;? All compliant XML parsers will recognize and reconstruct those characters, and thus no information will be lost despite the fact that your preferred encoding does not support Kanji.
Finally, as an aside, note that the documentation for XmlTextWriter states:
Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.
So replacing it with an XmlWriter is a good idea in general.
Sample .Net fiddle demonstrating usage of both writers and asserting that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.
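The round trip can be checked with a short sketch like the one below. It is not the fiddle mentioned above; the class and method names are invented for illustration. It writes a document with XmlWriter using ISO-8859-1, confirms that no '?' fallback bytes were emitted, then parses the bytes back and compares the text content.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Latin1RoundTripDemo
{
    // Writes the document as ISO-8859-1 bytes through XmlWriter.
    public static byte[] WriteLatin1(XmlDocument doc)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.GetEncoding("ISO-8859-1"),
            OmitXmlDeclaration = true,
        };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.WriteTo(writer);
            }
            return stream.ToArray();
        }
    }

    // Decodes the bytes as ISO-8859-1, parses them and returns the root's text.
    public static string ReadBackText(byte[] latin1Xml)
    {
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(latin1Xml);
        var doc = new XmlDocument();
        doc.LoadXml(text);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        // \u7551 \u306F\u305F\u3051 is the Japanese text from the example above.
        doc.LoadXml("<Root><string>\u7551 \u306F\u305F\u3051 hatake</string></Root>");

        byte[] bytes = WriteLatin1(doc);
        bool noFallback = Array.IndexOf(bytes, (byte)'?') < 0;
        bool sameText = ReadBackText(bytes) == doc.DocumentElement.InnerText;
        Console.WriteLine("no '?' bytes: " + noFallback + ", text preserved: " + sameText);
    }
}
```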

XML Illegal Characters in path using XPathDocument to do XSL Transform in C#

I have an XML in a string that I need to actually transform to html using an xsl.
I do the transform with XslCompiledTransform. In order for this to work, I am parsing the string that contains the XML to XML using XPathDocument.
However if I try to parse the string straight to the XPathDocument, then I get the error:
Illegal Characters in path.
So I had to include a StringReader in order to be able to parse the string to the XPathDocument. (Using the solutions in the posts I linked below.)
Here is my step by step procedure:
The string is retrieved from SDL Trados Studio. Depending on the XML that is being worked on (how it was originally created and loaded for translation), the string sometimes has a BOM and sometimes does not. The 'xml' is actually assembled from the segments of the source and target text and the structure elements: the textual elements are escaped for XML, and the markup and text are joined into one string. (My separate post on the removal of the BOM is "C# XPathDocument parsing string to XML with BOM".)
The string is then parsed into an XPathDocument using a StringReader.
The transform is done with the XslCompiledTransform, using a StringBuilder and a StringWriter.
Transformed xml (now html) is saved to a file.
Here is my code:
//Recreate XML file using an extractor returns a string array
string strSourceXML = String.Join("", extractor.TextSrc);
//strip BOM
strSourceXML = strSourceXML.Substring(strSourceXML.IndexOf("<?"));
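//Passing the string directly fails with "Illegal characters in path",
//because XPathDocument(string) expects a URI, not XML content:
//var xSourceDoc = new XPathDocument(strSourceXML);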
//Load XSL
var xTr = new XslCompiledTransform();
var xslt = Settings.GetValue("WordPreview", "XSLTpath", "");
xTr.Load(xslt);
//Parse XML string
dynamic xSourceDoc;
using (StringReader s = new StringReader(strSourceXML))
{
xSourceDoc = new XPathDocument(s);
}
//Transform the XML
StringBuilder sb1 = new StringBuilder();
StringWriter swSource = new StringWriter(sb1);
xTr.Transform(xSourceDoc, null, swSource);
//Transformed file saved to the disk
string tmpSourceDoc = Path.GetTempFileName();
System.IO.StreamWriter writer1 = new System.IO.StreamWriter(tmpSourceDoc, false, Encoding.Unicode);
writer1.Write(sb1.ToString());
writer1.Close();
My question is: Is there a simpler way to solve it? Any suggestions to transform the string straight using the XSLT? Or if not, is there a direct way to parse a string to the XPathDocument?
I have searched over many posts on Stack Overflow such as these:
XML Illegal Characters in path
Illegal characters in path error while parsing XML in C#
Illegal characters in path when loading a string with XDocument
But none of them give me the solution to do this simpler. Any suggestion is welcome. Thanks.

You don't need the intermediate StringBuilder and StringWriter.
The XslCompiledTransform instance can write directly to the stream on disk.
string strSourceXML = string.Concat(extractor.TextSrc);
strSourceXML = strSourceXML.Substring(strSourceXML.IndexOf("<?"));
var xTr = new XslCompiledTransform();
var xslt = Settings.GetValue("WordPreview", "XSLTpath", "");
xTr.Load(xslt);
string tmpSourceDoc = Path.GetTempFileName();
using (var reader = new StringReader(strSourceXML))
using (var writer = new StreamWriter(tmpSourceDoc, false, Encoding.Unicode))
{
var xSourceDoc = new XPathDocument(reader);
xTr.Transform(xSourceDoc, null, writer);
}
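For reference, here is a self-contained sketch of the same pattern that runs without Trados or a stylesheet on disk. The XML, the stylesheet and the Transform helper are invented for illustration, and the leading BOM is removed with TrimStart('\uFEFF') as one possible alternative to the Substring call.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class StringTransformDemo
{
    // Hypothetical helper: applies an XSLT given as a string to XML given as a string.
    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }

        // A leading BOM character would make the XML declaration invalid, so drop it.
        using (var xmlReader = new StringReader(xml.TrimStart('\uFEFF')))
        using (var output = new StringWriter())
        {
            // XPathDocument(TextReader) parses content; XPathDocument(string) expects a URI.
            var document = new XPathDocument(xmlReader);
            transform.Transform(document, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        string xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
            "<xsl:output method='text'/>" +
            "<xsl:template match='/'><xsl:value-of select='/greeting/@to'/></xsl:template>" +
            "</xsl:stylesheet>";
        string xml = "\uFEFF<?xml version='1.0'?><greeting to='world'/>";

        Console.WriteLine(Transform(xml, xslt));
    }
}
```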

How to read byte[] with current encoding using streamreader

I would like to read byte[] using C# with the current encoding of the file.
As written in MSDN the default encoding will be UTF-8 when the constructor has no encoding:
var reader = new StreamReader(new MemoryStream(data));
I have also tried this, but still get the file as UTF-8:
var reader = new StreamReader(new MemoryStream(data), true);
I need to read the byte[] with the current encoding.

A file has no encoding. A byte array has no encoding. A byte has no encoding. Encoding is something that transforms bytes to text and vice versa.
What you see in text editors and the like is actually program magic: the editor tries out different encodings and then guesses which one makes the most sense. The boolean parameter enables a limited form of this: StreamReader only looks for a byte order mark at the start of the data. If this does not produce what you want, then this magic fails.
var reader = new StreamReader(new MemoryStream(data), Encoding.Default);
will use the OS/location-specific default encoding (on .NET Framework; on .NET Core and later, Encoding.Default is always UTF-8). If that is still not what you want, then you need to be completely explicit and tell the StreamReader exactly which encoding to use, for example (just as an example, you said you did not want UTF8):
var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8);
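To see what the boolean parameter actually does, the sketch below feeds StreamReader a few bytes that start with a UTF-16 little-endian byte order mark and asks for UTF-8. The helper name and sample bytes are made up; note that CurrentEncoding is only meaningful after the first read.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Reads the bytes with BOM detection enabled and reports which encoding was used.
    public static string Read(byte[] data, out string detectedEncoding)
    {
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            detectedEncoding = reader.CurrentEncoding.WebName;
            return text;
        }
    }

    public static void Main()
    {
        // FF FE is the UTF-16 little-endian byte order mark, followed by "hi".
        byte[] utf16 = { 0xFF, 0xFE, (byte)'h', 0x00, (byte)'i', 0x00 };

        string detected;
        string text = Read(utf16, out detected);
        Console.WriteLine(text + " (" + detected + ")");
    }
}
```

Without a byte order mark there is nothing for StreamReader to detect, and it falls back to the encoding passed in.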

I tried several ways of detecting the encoding of a byte array, and it is not possible: as Jan mentions in his reply, a byte array carries no encoding. However, you can always decode the bytes as UTF-8 or ASCII/Unicode (for example with Encoding.UTF8.GetString(byte[])) and test the resulting string:
public static bool IsUnicode(string input)
{
    var asciiBytesCount = Encoding.ASCII.GetByteCount(input);
    var unicodeBytesCount = Encoding.UTF8.GetByteCount(input);
    return asciiBytesCount != unicodeBytesCount;
}
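A related check, sketched here as an assumption about what may be useful rather than as part of the answer above: a UTF8Encoding created with throwOnInvalidBytes set to true raises DecoderFallbackException on malformed input, so it can tell you whether the bytes are at least valid UTF-8 before you try another encoding. The helper name is hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8ValidationDemo
{
    // Returns true when the bytes decode as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] data)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = { 0xC3, 0xA9 };   // U+00E9 encoded as UTF-8
        byte[] latin1 = { 0xE9 };       // U+00E9 encoded as ISO-8859-1

        Console.WriteLine(IsValidUtf8(utf8));
        Console.WriteLine(IsValidUtf8(latin1));
    }
}
```

Passing this check does not prove the data was written as UTF-8, since pure ASCII and some other byte sequences are valid in several encodings; failing it does prove the data is not UTF-8.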

XDocument: saving XML to file without BOM

I'm generating a UTF-8 XML file using XDocument.
XDocument xml_document = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement(ROOT_NAME,
new XAttribute("note", note)
)
);
...
xml_document.Save(file_path);
The file is generated correctly and validated with an xsd file with success.
When I try to upload the XML file to an online service, the service says that my file is wrong at line 1; I have discovered that the problem is caused by the BOM on the first bytes of the file.
Do you know why the BOM is appended to the file and how can I save the file without it?
As stated in Byte order mark Wikipedia article:
While Unicode standard allows BOM in UTF-8 it does not require or recommend it. Byte order has no meaning in UTF-8 so a BOM only serves to identify a text stream or file as UTF-8 or that it was converted from another format that has a BOM
Is it an XDocument problem or should I contact the guys of the online service provider to ask for a parser upgrade?

Use an XmlTextWriter and pass it to the XDocument's Save() method; that way you have more control over the encoding used:
var doc = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("root", new XAttribute("note", "boogers"))
);
using (var writer = new XmlTextWriter(".\\boogers.xml", new UTF8Encoding(false)))
{
doc.Save(writer);
}
The UTF8Encoding class constructor has an overload that specifies whether or not to use the BOM (Byte Order Mark) with a boolean value, in your case false.
The result of this code was verified using Notepad++ to inspect the file's encoding.

First of all: the service provider MUST handle it, according to XML spec, which states that BOM may be present in case of UTF-8 representation.
You can force to save your XML without BOM like this:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = new UTF8Encoding(false); // The false means, do not emit the BOM.
using (XmlWriter w = XmlWriter.Create("my.xml", settings))
{
doc.Save(w);
}
(Googled from here: http://social.msdn.microsoft.com/Forums/en/xmlandnetfx/thread/ccc08c65-01d7-43c6-adf3-1fc70fdb026a)
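To confirm the result without opening an editor, you can inspect the first three bytes of the output. The sketch below saves to a MemoryStream instead of a file so it runs anywhere; the class name, helper names and sample document are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class BomCheckDemo
{
    // Serializes the document as UTF-8, with or without a byte order mark.
    public static byte[] Save(XDocument doc, bool emitBom)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(emitBom) };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            return stream.ToArray();
        }
    }

    // The UTF-8 byte order mark is the byte sequence EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    public static void Main()
    {
        var doc = new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement("root", new XAttribute("note", "example")));

        Console.WriteLine("UTF8Encoding(false) has BOM: " + StartsWithUtf8Bom(Save(doc, false)));
        Console.WriteLine("UTF8Encoding(true) has BOM: " + StartsWithUtf8Bom(Save(doc, true)));
    }
}
```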

The most expedient way to get rid of the BOM character when using XDocument is to save the document, read the file straight back in, and then write it back out. The File routines will strip the character out for you:
XDocument xTasks = new XDocument();
XElement xRoot = new XElement("tasklist",
new XAttribute("timestamp",lastUpdated),
new XElement("lasttask",lastTask)
);
...
xTasks.Add(xRoot);
xTasks.Save("tasks.xml");
// read it straight in, write it straight back out. Done.
string[] lines = File.ReadAllLines("tasks.xml");
File.WriteAllLines("tasks.xml",lines);
(It's hokey, but it works for the sake of expediency; at least you'll have a well-formed file to upload to your online provider.) ;)

For UTF-8 documents:
String XMLDec = xDoc.Declaration.ToString();
StringBuilder sb = new StringBuilder(XMLDec);
sb.Append(xDoc.ToString());
Encoding encoding = new UTF8Encoding(false); // false = without BOM
File.WriteAllText(outPath, sb.ToString(), encoding);
