How to save HTML in XML file using Linq to XML?


I am trying to use Linq to XML to save and retrieve some HTML between an XML file and a Windows Forms application. When it is saved to the XML file, the HTML tags get XML-encoded instead of being stored as straight HTML.
Example HTML:
<P><FONT color=#004080><U>Sample HTML</U></FONT></P>
Saved in XML File:
&lt;P&gt;&lt;FONT color=#004080&gt;&lt;U&gt;Sample HTML&lt;/U&gt;&lt;/FONT&gt;&lt;/P&gt;
When I manually edit the XML file and put in the desired HTML the Linq pulls in the HTML and displays it properly.
Here is the code that saves the HTML to the XML file:
XElement currentReport = (from item in callReports.Descendants("callReport")
where (int)item.Element("localId") == myCallreports.LocalId
select item).FirstOrDefault();
currentReport.Element("studio").Value = myCallreports.Studio;
currentReport.Element("visitDate").Value = myCallreports.Visitdate.ToShortDateString();
// *** The next two XElements store the HTML
currentReport.Element("recomendations").Value = myCallreports.Comments;
currentReport.Element("reactions").Value = myCallreports.Ownerreaction;
I assume this is happening b/c of xml encoding but I am not sure how to deal with it. This question gave me some clues...but no answer (for me, at least).
Thanks for the help,
Oran

Setting the Value property will automatically encode the html string. This should do the trick, but you'll need to make sure that your HTML is valid XML (XHTML).
currentReport.Element("recomendations").ReplaceNodes(XElement.Parse(myCallreports.Comments));
Edit: You may need to wrap the user entered HTML in <div> </div> tags. XElement.Parse expects to find a string with at least a start and end xml tag. So, this might work better:
currentReport.Element("recomendations").ReplaceNodes(XElement.Parse("<div>" + myCallreports.Comments + "</div>"));
Then you just have to make sure that tags like <br> are being sent in as <br />.
Edit 2: The other option would be use XML CDATA. Wrap the HTML with <![CDATA[ and ]]>, but I've never actually used that and I'm not sure how it affects reading the xml.
currentReport.Element("recomendations").ReplaceNodes(new XCData(myCallreports.Comments));
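The two options in this answer can be sketched side by side. This is a minimal sketch, not the poster's actual code: the class name HtmlInXmlSketch and both method names are made up for illustration. It assumes the first option should store only the parsed children, not the wrapping div.

```csharp
using System.Xml.Linq;

public static class HtmlInXmlSketch
{
    // Option 1: store the HTML as real child nodes.
    // XElement.Parse throws XmlException if the fragment is not well-formed XML.
    public static void StoreAsNodes(XElement target, string xhtml)
    {
        // The temporary <div> wrapper gives the parser a single root element.
        XElement wrapper = XElement.Parse("<div>" + xhtml + "</div>");
        target.ReplaceNodes(wrapper.Nodes());
    }

    // Option 2: store the HTML text verbatim inside a CDATA section.
    public static void StoreAsCData(XElement target, string html)
    {
        target.ReplaceNodes(new XCData(html));
    }
}
```

Note that the sample HTML from the question would fail Option 1 as written, because color=#004080 is an unquoted attribute value and XML requires quotes. With Option 2, reading target.Value back returns the original HTML string.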

Try using currentReport.Element("studio").InnerXml instead of currentReport.Element("studio").Value

Related

How to fix this bullet issue

In my website, admin uploads a .docx file. I convert the file into xml using OpenXmlPowerTools Api.
The issue is the document has some bullets in it.
• This is my bullet 1 in the document.
• This is my bullet 2 in the document.
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(wDoc, settings);
var htmlString = html.ToString();
File.WriteAllText(destFileName.FullName, htmlString, Encoding.UTF8);
Now when I open the xml file, it renders the bullets as below:-
I need to read each node of the XML, save it in the database, and reconstruct the HTML from the nodes.
Please don't ask me why so, as I am not the boss of the system.
How do I get the bullets render correctly in xml so that I can save the right
html in the database?

I fixed the same issue for my own requirement, and it has worked without problems so far.
In a case like this you will have to use a workaround: copy the character, compare it against the strings you read in, and replace it with the equivalent HTML-encoded character when found. In your case that is the bullet character, "&bull;" or "&#8226;".
The code should look like this:
listItem == "Compare with your copied character like one in your pic" ? "&bull;" : listItem
you can find more equivalent characters at this link:
http://www.zytrax.com/tech/web/entities.html

Hey I don't think XML can read bullets. I'll advise you programmatically handle it. Try and debug and see what the square is being represented as and then do an if statement to find it and replace it with a code you can define so that when you return it to use it you can convert that code if found to a bullet.
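An alternative to comparing strings by hand, offered as an assumption and not something either answer proposes: let the XML writer emit every non-ASCII character (including the bullet, U+2022) as a numeric character reference by writing with an ASCII encoding. The class and method names below are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class BulletSketch
{
    // Serializes the element so that the output contains only ASCII bytes.
    // Characters the encoding cannot represent are written by XmlWriter
    // as numeric character references instead of raw characters.
    public static string ToAsciiXml(XElement html)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,
            OmitXmlDeclaration = true
        };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                html.Save(writer);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Any XML parser turns the reference back into the bullet when the file is read, so the database would still receive the real character.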

how to used less than sign in xml document?

I am using C#/.NET and need to load an XML string into an XmlDocument. It loads fine, but when one of the nodes contains special characters such as the following, loading fails.
Sometimes I also have HTML tags with style and class attributes. How can I load such a string into an XML document?
Here is the string that produces an error:
<restdata>
<listingAddress>
fsdfsdf dfdf <Not Specified=""> Argentina dsfsf</listingAddress>
<listingAddress>
xxk dfsdf 899993
</listingAddress>
</restdata>
In my case the error is probably caused by <Not Specified="">.
Sometimes there may also be HTML tags.
How can this be handled in a general way, so that any data loads correctly?

Generally, if you need to use characters that are reserved in XML, you can use their encoded entities:
Use &lt; for <
Use &gt; for >
Use &amp; for &
Use &quot; for "
You can find a complete list of them here. If you need to programmatically encode HTML content in C#, you can use the HttpUtility.HtmlEncode() method:
// Your original text
var input = "<a href='http://example-site.com'>This is a link</a>";
// This yields &lt;a href=&#39;http://example-site.com&#39;&gt;This is a link&lt;/a&gt;
var encoded = HttpUtility.HtmlEncode(input);
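If the XML string is being assembled by hand, another route is to let LINQ to XML build it, since it escapes reserved characters in text content automatically. A minimal sketch, assuming the element names from the question; the RestDataSketch class and its Build method are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RestDataSketch
{
    // Builds <restdata> with one <listingAddress> per input string.
    // XElement escapes <, > and & in the text, so the result is well-formed.
    public static string Build(params string[] addresses)
    {
        var root = new XElement("restdata",
            addresses.Select(a => new XElement("listingAddress", a)));
        return root.ToString();
    }
}
```

The returned string can be passed to XmlDocument.LoadXml, and reading InnerText on a listingAddress node gives back the original unescaped text.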

Using C# to convert incorrect html string to real html

My original issue is that I am trying to serialize a string containing html tags to an XML element.
hello World, this
is
a nice
test
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
However, I have 2 issues:
1. Serializing HTML to XML: I did not succeed in defining the Serializable class to serialize correctly with XmlSerializer, so I decided that using CDATA sections might be the better way. This is, however, not correctly deserialized by the target tool (which I have no influence on). What I need is plain and correct HTML (XHTML?) within the XML output file.
2. The string looks e.g. as above, but is not fully correct HTML (no <p> tags, no <br> tags).
Now I would like to replace the newlines by a p or br tag. I have had a look here and used the suggested solution:
string result = "<p>" + text
.Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
.Replace(Environment.NewLine, "<br />")
.Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";
However, this does not in all cases generate valid html. In the example above, it would create <br />s between the <li> tags or cause <ul> tags within <p> tags - which is both not allowed.
Target would be to have a result like the following (line breaks are only for better readability and don't matter here)
<p>hello World, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
Do you have any suggestion how to solve this either with a string.Replace, Regex, or better solution (HtmlDocument)?
Please note: I have no influence on deserialization; the XML output is evaluated by a tool I have no influence on, and it has to be UTF-8 encoded.
Thank you!
EDIT: Clearly separated the 2 issues
EDIT2: No influence on deserialization
EDIT3: Added target output

What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms that into a valid DOM, that a HTML parser can handle.
You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.
Or you can just encode the input HTML in such a way that it doesn't interfere with the XML that you're trying to put it in, like a CDATA section or base64-encoding the input would also suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
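The base64 option mentioned above might look like the following. This is only a sketch: the OpaquePayloadSketch class and its method names are made up, and it assumes the consumer of the XML is able to decode base64, which the asker's target tool may not do.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class OpaquePayloadSketch
{
    // Stores arbitrary (possibly malformed) HTML as base64 text,
    // so nothing in it can interfere with the surrounding XML.
    public static XElement Wrap(string elementName, string html)
    {
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(html));
        return new XElement(elementName, encoded);
    }

    // Reverses Wrap: decodes the element's text back into the original HTML.
    public static string Unwrap(XElement element)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
    }
}
```

The trade-off is that the payload is no longer human-readable in the file, and both sides have to agree on the encoding.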

I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:
1) Replace line breaks with HTML line breaks
string result = text.Replace(Environment.NewLine, "<br />");
2) Use the HTMLAgility pack to fix any invalid HTML
var doc = new HtmlDocument();
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
doc.OptionFixNestedTags = false;
doc.LoadHtml(result);
if (doc.ParseErrors.Count() > 0)
{
// throw error
}else{
// get fixed html
result= doc.DocumentNode.OuterHtml;
}

Stop CDATA tags from being output-escaped when writing to XML in C#

We're creating a system that outputs some data to an XML schema. Some of the fields in this schema need their formatting preserved, as the data will be parsed by the end system into what may become a Word document layout. To do this we're using <![CDATA[Some formatted text]]> tags inside the App.Config file, then putting that into the appropriate property of an xsd.exe-generated class from our schema. Ideally the formatting wouldn't be our problem, but unfortunately that's just how the system is going.
The App.Config section looks as follows:
<header>
<![CDATA[Some sample formatted data]]>
</header>
The data assignment looks as follows:
HeaderSection header = ConfigurationManager.GetSection("header") as HeaderSection;
report.header = "<[CDATA[" + header.Header + "]]>";
Finally, the Xml output is handled as follows:
xs = new XmlSerializer(typeof(report));
fs = new FileStream (reportLocation, FileMode.Create);
xs.Serialize(fs, report);
fs.Flush();
fs.Close();
This should in theory produce a section in the final XML that has the information with CDATA tags around it. However, the angle brackets are being converted into &lt; and &gt;.
I've looked at ways of disabling output escaping, but so far can only find references to XSLT sheets. I've also tried @"<[CDATA[" with the strings, but again no luck.
Any help would be appreciated!

You're confusing markup with content.
When you assign the string "<![CDATA[ ... ]]>" to the value, you are saying that is the content that you wish to put in there. The XmlSerializer does not, and indeed should not, attempt to infer any markup semantics from this content, and simply escapes it according to the normal rules.
If you want CDATA markup in there, then you need to explicitly instruct the serializer to do so. Some examples of how to do this are here.
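One common way to instruct XmlSerializer explicitly is to expose the string through a property of type XmlCDataSection. The sketch below is an assumption about how that could look for this report; it is not the xsd.exe-generated class, and the names ReportSketch, Header and HeaderCData are hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

[XmlRoot("report")]
public class ReportSketch
{
    // The plain string the rest of the code works with; not serialized directly.
    [XmlIgnore]
    public string Header { get; set; }

    // Serialized in place of Header; written as <header><![CDATA[...]]></header>.
    [XmlElement("header")]
    public XmlCDataSection HeaderCData
    {
        get { return new XmlDocument().CreateCDataSection(Header ?? string.Empty); }
        set { Header = value == null ? null : value.Value; }
    }

    // Helper that serializes a report to a string for inspection.
    public static string Serialize(ReportSketch report)
    {
        var serializer = new XmlSerializer(typeof(ReportSketch));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, report);
            return writer.ToString();
        }
    }
}
```

With this shape, the assignment becomes report.Header = header.Header, with no CDATA markers added by hand.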

Have you tried changing
report.header = "<[CDATA[" + header.Header + "]]>";
to
report.header = "<![CDATA[" + header.Header + "]]>";

Protecting from XSLT injection

I use an XSL transform to convert an XML file to HTML in .NET. I transform the node values in the XML into HTML tag contents and attributes.
I compose the XML using .NET DOM manipulation, setting the InnerText property of the nodes to arbitrary and possibly malicious text.
Right now, maliciously crafted input strings will make my HTML unsafe. Unsafe in the sense that some JavaScript might come from the user and find its way to a link href attribute in the output HTML, for example.
The question is simple: what sanitizing, if any, do I have to do with my text before assigning it to the InnerText property? I thought that assigning to InnerText instead of InnerXml would do all the needed sanitization of the text, but that seems not to be the case.
Does my transform have to have any special characteristics to make this work safely? Are there any .NET-specific caveats that I should be aware of?
Thanks!

You should sanitize your XML before transforming it with XSLT. You probably will need something like:
string encoded = HttpUtility.HtmlEncode("<script>alert('hi')</script>");
XmlElement node = xml.CreateElement("code");
node.InnerText = encoded;
Console.WriteLine(encoded);
Console.WriteLine(node.OuterXml);
With this, you'll get
&lt;script&gt;alert('hi')&lt;/script&gt;
When you add this text into your node, you'll get
<code>&lt;script&gt;alert('hi')&lt;/script&gt;</code>
Now, if you run your XSLT, this encoded HTML will not cause any problems in your output.

It turns out that the problem came from the XSL itself, which used disable-output-escaping. Without that, the transform itself will do all the encoding necessary.
If you must use disable-output-escaping, you have to use the appropriate encoding function for each element: HtmlEncode for tag contents, HtmlAttributeEncode for attribute values, and UrlEncode for URL attribute values (e.g. href).
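As a rough illustration of matching the encoder to the output context, here is a sketch that builds a link by hand. The "/search?q=" path and the OutputEncodingSketch class are assumptions made up for the example. UrlEncode is applied only to the query value, not to the whole URL.

```csharp
using System.Web;

public static class OutputEncodingSketch
{
    // Builds an anchor tag from untrusted input, encoding each piece
    // for the context it ends up in.
    public static string SearchLink(string query, string label)
    {
        // URL context: encode the untrusted value placed in the query string.
        string href = "/search?q=" + HttpUtility.UrlEncode(query);

        // Attribute context for href, element-content context for the label.
        return "<a href=\"" + HttpUtility.HtmlAttributeEncode(href) + "\">"
            + HttpUtility.HtmlEncode(label) + "</a>";
    }
}
```

Encoding alone does not stop a user-supplied href such as a javascript: URL; if whole URLs come from users, checking the scheme against an allow-list would be a separate step.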
