Convert all HTML entities not predefined for XML to Unicode - C#

I am trying to manipulate a string containing HTML code and then save the content to an .htm file. Afterwards, the .htm file is imported into a Word file. The goal is to append a document formatted in HTML to a Word document. This process is part of a much larger program, and I cannot modify the given parameters.
To easily modify the HTML code, I thought using XDocument would be a great idea.
So I tried this:
void AppendContent(string content, Document doc)
{
    string filePath = ...; // somewhere in /AppData/Local
    var xDoc = XDocument.Parse(content);
    // code left out because irrelevant:
    // finds all "img" elements in order to
    // extract the embedded pictures and save them as external files
    FileHelper.SaveToFile(filePath, xDoc.ToString());
    // ... After this, the file is appended to the Word file (the one in doc)
}
The first attempt actually worked with a small test HTML file. But any of the big documents I am trying to append to the Word document causes an exception to be thrown:
XDocument.Parse cannot parse entities like "nbsp" or "uuml" (German ü). I already found out that XML only supports a handful of predefined entities, so I would have to add the definitions to the HTML file manually. This is not an option, because this operation is supposed to work with ANY HTML file.
I found the following fix:
var decodedContent = WebUtility.HtmlDecode(content);
var xDoc = XDocument.Parse(decodedContent);
This converts all entities to the characters they represent, so "uuml" is converted to "ü", etc. This worked until I hit a document that contained the "amp" entity, which is then converted to "&", and so XDocument.Parse complains again.
I'm looking for a way to convert HTML entities to a Unicode representation ("\0x1234"), or for an HTML decode that does not decode the XML-predefined entities.
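One possible approach, sketched here under assumptions rather than taken from the original post: match each named entity with a regular expression, leave the five XML-predefined names alone, and rewrite every other name as a numeric character reference, which XML accepts without any declaration. The names EntityHelper and DecodeNonXmlEntities are hypothetical. The sketch only covers entity names that WebUtility.HtmlDecode knows, and it does not make otherwise malformed HTML (for example unclosed tags) parseable.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class EntityHelper
{
    // The five named entities that XML predefines; these stay escaped.
    private static readonly HashSet<string> XmlPredefined =
        new HashSet<string> { "amp", "lt", "gt", "quot", "apos" };

    // Matches named references such as &nbsp; or &uuml;.
    // Numeric references like &#252; are already valid XML and are not matched.
    private static readonly Regex NamedEntity =
        new Regex("&([A-Za-z][A-Za-z0-9]*);");

    public static string DecodeNonXmlEntities(string html)
    {
        return NamedEntity.Replace(html, match =>
        {
            if (XmlPredefined.Contains(match.Groups[1].Value))
                return match.Value;

            string decoded = WebUtility.HtmlDecode(match.Value);
            if (decoded == match.Value)
                return match.Value; // name unknown to HtmlDecode: left as-is

            // Emit each decoded code point as a hexadecimal character reference.
            var sb = new StringBuilder();
            for (int i = 0; i < decoded.Length; i++)
            {
                int codePoint = char.ConvertToUtf32(decoded, i);
                if (char.IsHighSurrogate(decoded[i])) i++; // skip the low surrogate
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            }
            return sb.ToString();
        });
    }
}
```

With this sketch, XDocument.Parse(EntityHelper.DecodeNonXmlEntities(content)) would see "&#xFC;" instead of "&uuml;", while "&amp;" stays escaped.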

Related

Save XDocument without any formatting changes

I have an XML file in which I have to replace the value of a single element. For this I load my XML file into an XDocument: var camtXml = XDocument.Load(fileStream); After making my changes and saving the XDocument to a file, there are multiple changes that shouldn't have been made. As you can see in the following picture (left side: file from XDocument, right side: original file):
The UTF-8 was changed from upper to lower case, CR line feeds were added, and the indentation was changed by removing whitespace. I really want to use XDocument because its library makes it easy to create and iterate through XElements. But the formatting changes are a showstopper. Is there a way to preserve the formatting, or is there an alternative to XDocument with the same features, like XPath, XElement, etc.?
I found this, but it didn't solve my problem:
XDocument how to save without Byte Order Mark AND preseve formatting/whitespace

Convert file path to UTF-8

I want to get, print, and write to a text file the full path on disk of a file named A&T+X-8_L_R1.png, but when I print it I get A&amp;T+X-8_L_R1.png.
AFAIK I need to change the encoding. I did a search and found this potential solution but it doesn't work:
String filePathString = relativeUri.ToString();
byte[] bytes = Encoding.Default.GetBytes(filePathString);
filePathString = Encoding.UTF8.GetString(bytes);
filePathNode.SetValue(filePathString);
This is the full code of my class: http://pastebin.com/dZLGeS8p
The class searches recursively for *.png files and creates an XML structure from their paths. When I save the XML file the special characters from the paths like & are changed.
Can anyone point me to a solution?
You are writing an XML file, not a plain text file. In XML, an ampersand needs to be escaped to &amp;.
So the result you get is perfectly ok. It's even required to be like this.
I recommend opening the XML file with an application that can properly validate and display XML. It will be easier to see that the file is correct.
The UTF-8 conversion in your code isn't required. If the XML file is encoded in UTF-8, your XML classes will take care of any required conversions.
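A small sketch of the round trip may make this easier to see. It assumes LINQ to XML (XElement) rather than the asker's pastebin class, and the names AmpersandDemo, Serialize, and ReadBack are made up for illustration. The ampersand is escaped only in the serialized markup; reading the value back returns the original path.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical demo class, not taken from the asker's code.
public static class AmpersandDemo
{
    // Serializes a path into a <filePath> element; the writer escapes '&' itself.
    public static string Serialize(string path)
    {
        return new XElement("filePath", path).ToString();
    }

    // Parses the markup again and returns the unescaped text content.
    public static string ReadBack(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Serialize("A&T+X-8_L_R1.png") yields <filePath>A&amp;T+X-8_L_R1.png</filePath>, and ReadBack on that markup returns A&T+X-8_L_R1.png again.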

When saving XML file with XElement, alignment in file changes as well, how to avoid?

I am using
XElement root = XElement.Load(filepath);
to load the XML file and then find the elements I need.
IEnumerable<XElement> commands = from command in MyCommands
where (string) command.Attribute("Number") == Number
select command;
foreach (XElement command in commands)
{
command.SetAttributeValue("Group", GroupFound);
}
When I am done with my changes, I save the file with the following code.
root.Save(filepath);
When the file is saved, all the lines in my XML file are affected. Visual Studio aligns all the lines by default, but I need to keep the original file format.
I cannot alter any part of the document except the values of the Group attributes set via command.SetAttributeValue.
You would need to do:
XElement root = XElement.Load(filepath, LoadOptions.PreserveWhitespace);
then do:
root.Save(filepath, SaveOptions.DisableFormatting);
This will preserve your original whitespace through the use of LoadOptions and SaveOptions.
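The two options can be tried out on a string without touching any file. The following is a sketch under assumptions: GroupUpdater and SetGroup are hypothetical names, and XElement.Parse and ToString stand in for Load and Save. Whitespace between elements survives, but whitespace inside a tag (for example extra spaces between attributes) and the original quote style are not stored by LINQ to XML and are normalized on output.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper illustrating PreserveWhitespace + DisableFormatting.
public static class GroupUpdater
{
    public static string SetGroup(string xml, string number, string group)
    {
        // Keep whitespace-only text nodes instead of discarding them.
        XElement root = XElement.Parse(xml, LoadOptions.PreserveWhitespace);

        foreach (XElement command in root.Descendants()
                     .Where(e => (string)e.Attribute("Number") == number))
        {
            command.SetAttributeValue("Group", group);
        }

        // Do not re-indent on output.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```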
The information you're looking to preserve is lost to begin with in the XDocument.
XDocument doesn't care whether your elements had tabs or spaces on the line in front of them, whether there were multiple whitespace characters between attributes, and so on. If you want to rely on the Save() method, you have to give up the idea that you can preserve formatting.
To preserve formatting, you'll need to add custom processing and figure out precisely where to make changes. Alternatively, you may be able to adjust your save options to match the formatting you have, if your XML comes from a machine and is not human edited.

C# XMLDocument Encoding?

I'm trying to code a function that validates an XML settings file, so if a node does not exist in the file, it should create it.
I have this function
private void addMissingSettings() {
    XmlDocument xmldocSettings = new XmlDocument();
    xmldocSettings.Load("settings.xml");
    XmlNode xmlMainNode = xmldocSettings.SelectSingleNode("settings");
    XmlNode xmlChildNode = xmldocSettings.CreateElement("ExampleNode");
    xmlChildNode.InnerText = "Hello World!";
    // add to parent node
    xmlMainNode.AppendChild(xmlChildNode);
    xmldocSettings.Save("settings.xml");
}
But in my XML file, if I have
<rPortSuffix desc="Read Suffix">&#xC;&#xA;</rPortSuffix>
<wPortSuffix desc="Write Suffix">&#x3;</wPortSuffix>
When I save the document, it saves those lines as
<rPortSuffix desc="Read Suffix">
</rPortSuffix>
<wPortSuffix desc="Write Suffix"></wPortSuffix>
<ExampleNode>Hello World!</ExampleNode>
Is there a way to prevent this behaviour? Like setting a working charset or something like that?
The two files are equivalent, and should be treated as being equivalent by all XML parsers, I believe.
Additionally, Unicode character U+0003 isn't a valid XML character, so you've fundamentally got other problems if you're trying to represent it in your file. Even though that particular .NET XML parser doesn't seem to object, other parsers may well do so.
If you need to represent absolutely arbitrary characters in your XML, I suggest you do so in some other form - e.g.
<rPortSuffix desc="Read Suffix">\u000c\u000a</rPortSuffix>
<wPortSuffix desc="Write Suffix">\u0003</wPortSuffix>
Obviously you'll then need to parse that text appropriately, but at least the XML parser won't get in the way, and you'll be able to represent any UTF-16 code unit.
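A minimal sketch of that kind of escaping, under stated assumptions: the class and method names are hypothetical, only characters below U+0020 are escaped, and the backslash itself is escaped as well so that decoding is unambiguous. Other characters that XML forbids are not handled here.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the \uXXXX convention is the one suggested in the answer.
public static class UnicodeEscaper
{
    // Replaces control characters and backslashes with \uXXXX sequences.
    public static string Escape(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c < ' ' || c == '\\')
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Turns every \uXXXX sequence back into the UTF-16 code unit it names.
    public static string Unescape(string value)
    {
        return Regex.Replace(value, @"\\u([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }
}
```

The escaped text can be stored as ordinary element content, for example via InnerText, and passed through Unescape after reading it back.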

C# XML conversion

I have a string containing fully formatted XML data, created using a Perl script.
I now want to convert this string into an actual XML file in C#. Is there any way to do this?
Thanks,
You can load a string into an in-memory representation, for example using the LINQ to XML XDocument type. Loading the string is done with the Parse method, and saving the document to a file is done with the Save method:
using System.Xml.Linq;
XDocument doc = XDocument.Parse(xmlContent);
doc.Save(fileName);
The question is why you would do that if you already have a correctly formatted XML document.
Good reasons that I can think of are:
To verify that the content is really valid XML
To generate XML with nice indentation and line breaks
If that's not what you need, then you should just write the data to a file (as others suggest).
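For the first of those reasons, a hedged sketch: the hypothetical TrySave below writes the file only when the string parses as well-formed XML, and reports failure otherwise. XmlFileWriter and TrySave are illustrative names, not an existing API.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical wrapper around XDocument.Parse and Save.
public static class XmlFileWriter
{
    // Returns true and writes the file only if xmlContent is well-formed XML.
    public static bool TrySave(string xmlContent, string fileName)
    {
        try
        {
            XDocument doc = XDocument.Parse(xmlContent);
            doc.Save(fileName); // re-serialized with indentation by default
            return true;
        }
        catch (XmlException)
        {
            // Parse throws XmlException when the content is not well-formed.
            return false;
        }
    }
}
```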
Could be as simple as
File.WriteAllText(@"C:\Test.xml", "your-xml-string");
or
File.WriteAllText(@"C:\Test.xml", "your-xml-string", Encoding.UTF8);
XmlDocument doc = new XmlDocument();
doc.LoadXml(... your string ...); // LoadXml parses a string; Load expects a path or stream
doc.Save(... your destination path...);
See also
http://msdn.microsoft.com/fr-fr/library/d5awd922%28v=VS.80%29.aspx
