OpenOfficeXML and HTML written to spreadsheet's cells

I have been trying to write HTML content into an Excel spreadsheet's cells using ExcelPackage OpenOfficeXML and C#.
I am getting errors stating that the input string has an invalid token. Has anyone come across anything similar?
Saving HTML directly in Excel works OK.
I do not want to use HTML encoding, as the content has to be in readable form.

Without knowing more of the specifics:
If you're using XML to create your spreadsheet, you should wrap HTML content in a CDATA section:
<someEntry> <![CDATA[ YOUR HTML HERE ]]> </someEntry>
In C# you can add a CDATA section. It is a type of node.
XmlCDataSection CData;
CData = doc.CreateCDataSection("<someNodeName><h1>blah</h1></someNodeName>");
XmlElement root = doc.DocumentElement;
root.AppendChild(CData);
Otherwise, you probably need to escape single and double quotes (\' and \").
Or you can use System.Security.SecurityElement.Escape(), which replaces <, >, & and both single and double quotes with their XML entities.
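As a rough illustration of the two options above, here is a minimal sketch. The class name CellContent and its method names are made up for this example and are not part of ExcelPackage; the element name someEntry is taken from the snippet above.

```csharp
using System.Security;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class CellContent
{
    // Wraps raw HTML in a CDATA section so the markup is stored unescaped.
    // The input must not contain "]]>", which would end the section early.
    public static string WrapInCData(string html)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<someEntry/>");
        doc.DocumentElement.AppendChild(doc.CreateCDataSection(html));
        return doc.OuterXml;
    }

    // Replaces <, >, &, " and ' with the corresponding XML entities.
    public static string EscapeForXml(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

WrapInCData keeps the markup readable inside the section, while EscapeForXml trades readability for a string that is safe to place in element text or attribute values.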

Escaping single quotes solved the problem - Replace("'", @"""");
Thank you all.

Related

How to escape invalid characters inside XML string in C#

I have an XML string in C#. This XML has several tags. In some of these tags there are invalid characters like '&' in the text. I need to escape these characters inside the text across the whole long XML string, but I want to keep the tags.
I have tried HttpUtility.HtmlEncode and a few other available methods, but they encode the whole string rather than just the text inside the tags. For example,
<node1>This is a string & so is this</node1> should be converted to
<node1>This is a string &amp; so is this</node1>
Any ideas? Thanks.
P.S. I know a similar question has been asked before, but I have not found a complete solution for this problem.

I guess the simplest solution is to load the whole XML document in memory as an XmlDocument and then go through the elements and replace the values with their HTML-encoded form.

You can use a CDATA section, like this:
<YourXml>
<Id>1</Id>
<Content>
<![CDATA[
your special characters
]]>
</Content>
</YourXml>

I don't get what the big deal is here. When you have the entire XML as a string, the easiest way to achieve what you want is to use the Replace function.
For example, if the whole XML is in the string str, then all you have to do is:
str = str.Replace("&", "&amp;");
That's it. Sometimes very simple solutions exist for big problems. Hope this helps.
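One caveat with a plain Replace is that it also rewrites ampersands that already belong to an entity, so a correctly escaped entity ends up escaped twice. A possible variation, sketched here with a hypothetical AmpersandFixer class, uses a regular expression that only touches bare ampersands. It assumes the only named entities in the input are the five predefined XML ones.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not taken from any answer above.
public static class AmpersandFixer
{
    // Matches "&" unless it already starts a predefined XML entity
    // or a numeric character reference such as &#160; or &#xA0;.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

Running it twice over the same string gives the same result as running it once, which a plain Replace does not guarantee.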

XDocument or XmlDocument is the way to go. If, for some crazy out-of-your-control reason, you need to encode just the text blocks inside an XmlElement:
using System.Text;
using System.Xml;

static string EncodeText(string unescapedText) {
    if (string.IsNullOrEmpty(unescapedText)) {
        return unescapedText;
    }
    var builder = new StringBuilder(unescapedText.Length);
    // Fragment conformance allows writing bare text without a root element.
    using (var writer = XmlWriter.Create(builder, new XmlWriterSettings {
        ConformanceLevel = ConformanceLevel.Fragment
    })) {
        // The writer escapes reserved characters such as & and < in the text.
        writer.WriteString(unescapedText);
    }
    return builder.ToString();
}

Cleanup xml file - Invalid character in the given encoding

I am integrating against Magento ecommerce using their "SOAP" api, and the API returns "XML" results. Problem is, this is not always well formed:
<product>
<entity_id>18</entity_id>
<price regular="2925  <span>Nok</span>"/>
...
In this specific case, the price regular attribute has both an invisible character 0xa0 (before the span tag), and < > within the attribute text.
I have no way to get proper well-formed XML from Magento it seems, so the alternative is to clean it up before I feed it to my XmlSerializer deserialization:
XmlSerializer serializer = new XmlSerializer(typeof(Responses.Product.product));
product = serializer.Deserialize(textReader) as Responses.Product.product;
The invisible character I can get rid of using a simple text replace, but I'm more unsure about the <> within the attribute text.
My question is: how can I clean it up so that it is valid XML?

The character 0x3c is the < character. For an invisible character you would rather be looking for something like the 0x09 TAB character.
To fix the broken markup you could look for that specific HTML tag in the content, using a regular expression to allow any currency within the tag:
xml = Regex.Replace(xml, "<span>([A-Za-z]{3})</span>", "&lt;span&gt;$1&lt;/span&gt;");
This works as long as there aren't any span elements with three-character content in the XML code itself. You could do similar replacements for other HTML tags, but try to keep the pattern as specific as possible, to avoid false positives.
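A more general variation, offered only as a sketch, is to escape angle brackets wherever they appear inside an attribute value and to replace the non-breaking space at the same time. The AttributeCleaner class is hypothetical. It assumes attribute values are always double-quoted, never contain a double quote, and that element text never contains something shaped like name="value". Bare ampersands inside attribute values are not handled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for pre-cleaning the response before deserialization.
public static class AttributeCleaner
{
    // Matches name="value" pairs with a double-quoted value.
    private static readonly Regex Attribute =
        new Regex("(?<name>[\\w:.-]+)=\"(?<value>[^\"]*)\"");

    public static string EscapeMarkupInAttributes(string xml)
    {
        return Attribute.Replace(xml, m =>
        {
            string value = m.Groups["value"].Value
                .Replace('\u00A0', ' ')   // non-breaking space to plain space
                .Replace("<", "&lt;")
                .Replace(">", "&gt;");
            return m.Groups["name"].Value + "=\"" + value + "\"";
        });
    }
}
```

The cleaned string can then be wrapped in a StringReader and passed to XmlSerializer.Deserialize as in the question.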

Problem getting XML properly formatted

I use classes (autogenerated from a schema) to generate XML documents. This has worked fine until now, when I need to use inline HTML elements. I've tried several different methods, but as soon as I use the inline HTML, the "<" and ">" get replaced with &lt; etc.
Example:
<meta>
<name>test</name>
<value>test <br />new row</value>
</meta>
becomes "destroyed" later on when trying to get it as a string for database storage, the value is changed to:
<value>test &lt;br /&gt;new row</value>
How is it possible to keep the angle brackets intact?

You need to use CDATA sections for XML (or XML like) content.

The XML writer is escaping the reserved characters such as <, > etc. If you're reading the text back using an XML reader, then your &lt; will be correctly read as <.
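The round trip can be checked with a small sketch like the following. RoundTripDemo is a made-up name, and XmlDocument stands in here for the autogenerated classes from the question.

```csharp
using System.Xml;

// Hypothetical demo class showing that escaping is undone on read.
public static class RoundTripDemo
{
    // Serializes the text into a <value> element; the writer escapes < and >.
    public static string Write(string text)
    {
        var doc = new XmlDocument();
        XmlElement value = doc.CreateElement("value");
        value.InnerText = text;
        return value.OuterXml;
    }

    // Parses the element back; the reader turns the entities into characters again.
    public static string Read(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

Write("test <br />new row") produces the escaped form shown in the question, and Read on that string returns the original text with its angle brackets intact.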

Reading RSS feed with Linq-to-XML and C# - how to decode CDATA section?

I am trying to read an RSS feed using C# and Linq to XML.
The feed is encoded in utf-8 (see http://pc03224.kr.hsnr.de/infosys/feed/) and reading it out generally works fine except for the description node because it is enclosed in a CDATA section.
For some reason I can't see the CDATA tag in the debugger after reading out the content of the "description" tag, but I guess it must be there somewhere, because only in this section the German umlauts (äöü) and other special characters are not shown correctly. Instead they remain in the string in encoded form, like &uuml;.
Can I somehow read them out correctly or at least decode them afterwards?
This is a sample of the RSS section giving me troubles:
<description><![CDATA[blabla bietet H&ouml;rern meiner Vorlesungen “IAS”, “WEB” und “SWE” an, Lizenzen f&uuml;r blabla [...]]]></description>
Here is my code which reads out and parses the RSS feed data:
RssItems = (from xElem in xml.Descendants("channel").Descendants("item")
select new RssItem
{
Content = xElem.Descendants("description").FirstOrDefault().Value,
...
}).ToList();
Thanks in advance!

Your code is working as intended. A CDATA section means that the contents should not be interpreted, i.e. "&ouml;" should not be treated as an HTML entity but just as a sequence of characters.
Contact the author of the RSS feed and tell him to fix it, either by removing the CDATA tags so the entities get interpreted, or by putting the intended characters directly into the HTML file.
Alternatively, have a look at HttpUtility.HtmlDecode to decode the CDATA contents a second time.
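A minimal sketch of that second decoding step follows. It uses System.Net.WebUtility.HtmlDecode as a stand-in for HttpUtility.HtmlDecode so that no System.Web reference is needed. The RssDescription class and its method name are invented for this example.

```csharp
using System.Linq;
using System.Net;
using System.Xml.Linq;

// Hypothetical helper; reads one <item> and decodes entities left in the CDATA text.
public static class RssDescription
{
    public static string ReadDecoded(string itemXml)
    {
        XElement item = XElement.Parse(itemXml);
        XElement description = item.Descendants("description").FirstOrDefault();
        if (description == null)
        {
            return null;
        }
        // Value returns the CDATA text verbatim; HtmlDecode resolves the HTML entities in it.
        return WebUtility.HtmlDecode(description.Value);
    }
}
```

In the query from the question, the same call could wrap the expression assigned to Content.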

Protecting from XSLT injection

I use an XSL transform to convert an XML file to HTML in .NET. I transform the node values in the XML into HTML tag contents and attributes.
I compose the XML using .NET DOM manipulation, setting the InnerText property of the nodes to arbitrary and possibly malicious text.
Right now, maliciously crafted input strings will make my HTML unsafe. Unsafe in the sense that some JavaScript might come from the user and find its way into a link href attribute in the output HTML, for example.
The question is simple: what sanitizing, if any, do I have to do with my text before assigning it to the InnerText property? I thought that assigning to InnerText instead of InnerXml would do all the needed sanitization of the text, but that seems not to be the case.
Does my transform have to have any special characteristics to make this work safely? Are there any .NET-specific caveats that I should be aware of?
Thanks!

You should sanitize your XML before transforming it with XSLT. You probably will need something like:
string encoded = HttpUtility.HtmlEncode("<script>alert('hi')</script>");
XmlElement node = xml.CreateElement("code");
node.InnerText = encoded;
Console.WriteLine(encoded);
Console.WriteLine(node.OuterXml);
With this, you'll get
&lt;script&gt;alert('hi')&lt;/script&gt;
When you add this text into your node, you'll get
<code>&amp;lt;script&amp;gt;alert('hi')&amp;lt;/script&amp;gt;</code>
Now, if you run your XSLT, this encoded HTML will not cause any problems in your output.

It turns out that the problem came from the XSL itself, which used disable-output-escaping. Without that, the transform itself will do all the encoding necessary.
If you must use disable-output-escaping, you have to use the appropriate encoding function for each element: HtmlEncode for tag contents, HtmlAttributeEncode for attribute values and UrlEncode for URL attribute values (e.g. href).
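To see the default behaviour, here is a small sketch that runs untrusted text through a stylesheet which does not use disable-output-escaping. The XsltEscapingDemo class, the code element name and the inline stylesheet are all assumptions made for illustration, not the asker's actual transform.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical demo; the stylesheet is deliberately minimal.
public static class XsltEscapingDemo
{
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html'/>
  <xsl:template match='/code'>
    <p><xsl:value-of select='.'/></p>
  </xsl:template>
</xsl:stylesheet>";

    public static string Render(string untrustedText)
    {
        // InnerText stores the raw characters; no manual encoding is applied here.
        var doc = new XmlDocument();
        doc.AppendChild(doc.CreateElement("code")).InnerText = untrustedText;

        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }

        using (var output = new StringWriter())
        {
            // The transform escapes the text when it writes the result.
            xslt.Transform(doc, null, output);
            return output.ToString();
        }
    }
}
```

Passing a script tag through Render yields escaped text inside the paragraph rather than a live script element. Output escaping does not, however, stop a javascript: URL from landing in an href, so URL values still need their own validation.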
