I am working on a project which requires me to process XHTML files to fix the content of certain tags. The fixing itself is not a problem; however, I have trouble when saving the files. The code I am using is:
var spanNodesList = p.GetSpanNodesList(xDoc);
foreach (XElement span in spanNodesList)
{
    if (span.Value == null || span.Value == "")
    {
        span.Remove();
    }
    else
    {
        string[] words = p.SplitNodeText(span.Value);
        XElement parent = span.Parent;
        span.Remove();
        foreach (string word in words)
        {
            parent.Add(new XElement("span", word,
                new XAttribute("id", "w" + p.currentNodeID.ToString())));
            p.currentNodeID++;
        }
    }
}
List<XElement> GetSpanNodesList(XDocument file)
{
    // Get only 'word' nodes
    var spanNodes = file.Descendants("{http://www.w3.org/1999/xhtml}span");
    if (spanNodes != null)
    {
        var spanNodesList = spanNodes.ToList();
        spanNodesList.RemoveAll(x => (x.Attribute("id") == null) || !x.Attribute("id").Value.Contains("w"));
        return spanNodesList;
    }
    else return null;
}
At first I couldn't get any elements at all; I found out somewhere on SO that I needed to add the namespace reference in file.Descendants("{http://www.w3.org/1999/xhtml}span"), since the call without it yielded no results. That has indeed helped and I now get the nodes I want. However, the resulting output has two problems.
<span id="w1" xmlns="">Word one</span>
<span id="w2" xmlns="">Word two</span>
<span id="w3" xmlns="">Word three</span>
It adds an xmlns="" attribute, which I don't need (and which was not in the original file), and it adds an <?xml version="1.0" encoding="utf-8"?> declaration. I assume this is expected behaviour resulting from what I coded, so my question is: what can I do to remove these 'problems'? Or perhaps there's a better way of dealing with XHTML files? Also, I don't know if this is relevant, but the source files reference a number of different namespaces...
Cheers
Bartosz
When you add the span element, you're doing it without a namespace - whereas some ancestor element has set the default namespace. All you need to do is use the right namespace for your new elements:
XNamespace ns = "http://www.w3.org/1999/xhtml";
...
parent.Add(new XElement(ns + "span", ...));
Likewise you can use:
var spanNodes = file.Descendants(ns + "span");
which is rather more readable, IMO. You almost certainly don't need to worry about the XML declaration.
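Putting the two changes together, here is a minimal self-contained sketch of the whole fix. The SplitNodeText helper and currentNodeID counter from the question are replaced with a plain Split and a local counter, purely for illustration:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class SpanSplitter
{
    static readonly XNamespace ns = "http://www.w3.org/1999/xhtml";

    public static string SplitSpans(string xhtml)
    {
        var xDoc = XDocument.Parse(xhtml);
        int currentNodeID = 1;

        // Snapshot with ToList() because we mutate the tree while iterating.
        var spans = xDoc.Descendants(ns + "span")
                        .Where(x => x.Attribute("id") != null &&
                                    x.Attribute("id").Value.Contains("w"))
                        .ToList();

        foreach (var span in spans)
        {
            var parent = span.Parent;
            var words = span.Value.Split(new[] { ' ' },
                StringSplitOptions.RemoveEmptyEntries); // stand-in for SplitNodeText
            span.Remove();
            foreach (var word in words)
            {
                // ns + "span" keeps the new element in the XHTML namespace,
                // so the serializer emits no xmlns="" attribute.
                parent.Add(new XElement(ns + "span", word,
                    new XAttribute("id", "w" + currentNodeID++)));
            }
        }
        return xDoc.ToString();
    }

    static void Main()
    {
        Console.WriteLine(SplitSpans(
            "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>" +
            "<span id='w0'>Word one</span></p></body></html>"));
    }
}
```

If you ever do need to suppress the declaration when writing to a file, save through an XmlWriter whose XmlWriterSettings.OmitXmlDeclaration is true; xDoc.ToString() already omits it.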
Related
I'm parsing an XML file to compare it to another XML file. XML Diff works nicely, but we have found there are a lot of junk tags that exist in one file and not in the other; they have no bearing on our results but clutter up the report. I have loaded the XML file into memory to do some other things to it, and I'm wondering if there is an easy way, at the same time, to go through the file and remove all tags that start with, for example, color=. The value of color is all over the map, so it's not easy to grab them all and remove them.
There doesn't seem to be any way in XML Diff to say "ignore these tags".
I could roll through the file, find each instance, find the end of it, and delete it, but I'm hoping there is something simpler. If not, oh well.
Edit: Here's a piece of the XML:
<numericValue color="-103" hidden="no" image="stuff.jpg" key="More stuff." needsQuestionFormatting="false" system="yes" systemEquivKey="Stuff." systemImage="yes">
<numDef increment="1" maximum="180" minimum="30">
<unit deprecated="no" key="BPM" system="yes" />
</numDef>
</numericValue>
If you are using Linq to XML, you can load your XML into an XDocument via:
var doc = XDocument.Parse(xml); // Load the XML from a string
Or
var doc = XDocument.Load(fileName); // Load the XML from a file.
Then search for all elements with matching names and use System.Xml.Linq.Extensions.Remove() to remove them all at once:
string prefix = "L"; // Or whatever.
// Use doc.Root.Descendants() instead of doc.Descendants() to avoid accidentally removing the root element.
var elements = doc.Root.Descendants().Where(e => e.Name.LocalName.StartsWith(prefix, StringComparison.Ordinal));
elements.Remove();
Update
In your XML, the color="-103" substring is an attribute of an element, rather than an element itself. To remove all such attributes, use the following method:
public static void RemovedNamedAttributes(XElement root, string attributeLocalNamePrefix)
{
    if (root == null)
        throw new ArgumentNullException(nameof(root));
    // StartsWith honours the prefix semantics of the parameter name;
    // passing the full name "color" still matches exactly.
    foreach (var node in root.DescendantsAndSelf())
        node.Attributes()
            .Where(a => a.Name.LocalName.StartsWith(attributeLocalNamePrefix, StringComparison.Ordinal))
            .Remove();
}
Then call it like:
var doc = XDocument.Parse(xml); // Load the XML
RemovedNamedAttributes(doc.Root, "color");
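As a runnable sketch with the question's sample XML trimmed down (class and method names here are illustrative):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class AttributeScrubber
{
    // Removes every attribute with the given local name, anywhere in the tree.
    public static void RemoveNamedAttributes(XElement root, string attributeLocalName)
    {
        if (root == null)
            throw new ArgumentNullException(nameof(root));
        foreach (var node in root.DescendantsAndSelf())
            node.Attributes()
                .Where(a => a.Name.LocalName == attributeLocalName)
                .Remove();  // System.Xml.Linq.Extensions.Remove()
    }

    static void Main()
    {
        var doc = XDocument.Parse(
            "<numericValue color=\"-103\" hidden=\"no\">" +
            "<numDef increment=\"1\"><unit key=\"BPM\" /></numDef>" +
            "</numericValue>");

        RemoveNamedAttributes(doc.Root, "color");
        Console.WriteLine(doc); // the color attribute is gone; the rest survives
    }
}
```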
I want to get the value of the element rsm:CIIHExchangedDocument/ram:ID,
but I have problems with multiple namespaces and null values (I cannot know whether the requested element exists).
It can be achieved this way:
XElement invoice = XElement.Load(invoiceStream);
XNamespace rsm = invoice.GetNamespaceOfPrefix("rsm");
XNamespace ram = invoice.GetNamespaceOfPrefix("ram");
if (invoice.Element(rsm + "CIIHExchangedDocument") != null)
{
    if (invoice.Element(rsm + "CIIHExchangedDocument").Element(ram + "ID") != null)
    {
        string id = invoice.Element(rsm + "CIIHExchangedDocument").Element(ram + "ID").Value;
    }
}
but I think using XPath will suit my needs better. I want to do something like this:
invoice.XPathSelectElement("rsm:CIIHExchangedDocument/ram:ID");
I need to retrieve a lot of elements of different depth in the document, and I have many namespaces.
What is the simplest way to achieve this? Execution speed is also important to me.
I believe what you are looking for is the XPathNavigator class. An example of how to use it can be found in the XPathNavigator documentation.
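Alternatively, XPathSelectElement from System.Xml.XPath works directly on an XElement if you hand it an XmlNamespaceManager with the prefixes registered. A sketch, assuming both prefixes are declared on the document root (the namespace URIs in Main are made up for the demo; use the ones your invoices declare):

```csharp
using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

class InvoiceReader
{
    public static string GetId(XElement invoice)
    {
        var nsMgr = new XmlNamespaceManager(new NameTable());
        // Register every prefix used in the XPath expression, mapped to the
        // URIs the document itself declares for those prefixes.
        nsMgr.AddNamespace("rsm", invoice.GetNamespaceOfPrefix("rsm").NamespaceName);
        nsMgr.AddNamespace("ram", invoice.GetNamespaceOfPrefix("ram").NamespaceName);

        // Returns null instead of throwing when the element is missing,
        // which replaces the nested null checks.
        var id = invoice.XPathSelectElement("rsm:CIIHExchangedDocument/ram:ID", nsMgr);
        return id?.Value;
    }

    static void Main()
    {
        var invoice = XElement.Parse(
            "<rsm:CrossIndustryInvoice xmlns:rsm='urn:example:rsm' xmlns:ram='urn:example:ram'>" +
            "<rsm:CIIHExchangedDocument><ram:ID>INV-42</ram:ID></rsm:CIIHExchangedDocument>" +
            "</rsm:CrossIndustryInvoice>");
        Console.WriteLine(InvoiceReader.GetId(invoice));
    }
}
```

If you query many elements, build the XmlNamespaceManager once and reuse it for every XPathSelectElement call.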
I'm trying to parse an XML file containing all the uploaded videos on a certain channel. I'm attempting to get the value of the url attribute in one of the <media:content> nodes and put it in the ViewerLocation field. However, there are several of them. My current code is this:
var videos = from xElem in xml.Descendants(atomNS + "entry")
             select new YouTubeVideo()
             {
                 Title = xElem.Element(atomNS + "title").Value,
                 Description = xElem.Element(atomNS + "content").Value,
                 DateUploaded = xElem.Element(atomNS + "published").Value,
                 ThumbnailLocation = xElem.Element(mediaNS + "group").Element(mediaNS + "content").Attribute("url").Value,
                 ViewerLocation = xElem.Element(mediaNS + "group").Element(mediaNS + "content").Attribute("url").Value
             };
It gets me the first <media:content> node for each entry, as you would expect. However, the first one in the XML isn't what I want; I want the second.
Below is the relevant XML.
<!-- I currently get the value held in this node -->
<media:content
url='http://www.youtube.com/v/ZTUVgYoeN_b?f=gdata_standard...'
type='application/x-shockwave-flash' medium='video'
isDefault='true' expression='full' duration='215' yt:format='5'/>
<!-- What i actually want is this one -->
<media:content
url='rtsp://rtsp2.youtube.com/ChoLENy73bIAEQ1kgGDA==/0/0/0/video.3gp'
type='video/3gpp' medium='video'
expression='full' duration='215' yt:format='1'/>
<media:content
url='rtsp://rtsp2.youtube.com/ChoLENy73bIDRQ1kgGDA==/0/0/0/video.3gp'
type='video/3gpp' medium='video'
expression='full' duration='215' yt:format='6'/>
I want the second node because it has a type of 'video/3gpp'. How would I go about selecting that one? My logic would be: if the type attribute equals "video/3gpp", get that value.
But I do not know how to express this in LINQ.
Thanks,
Danny.
Probably something like:
where xElem.Element(atomNS + "content").Attribute("type").Value == "video/3gpp"
Edit: I didn't quite know how to expand and explain this one without assuming the OP had no knowledge of LINQ. You want to make your original query:
from xElem in xml.Descendants(atomNS + "entry")
where xElem.Element(atomNS + "content").Attribute("type").Value == "video/3gpp"
select new YouTubeVideo() {
...
}
You can interrogate attributes of a node, just like you can look at the elements of the document. If there are multiple elements with that attribute, you could then (assuming you always want the first you find)..
( from xElem in xml.Descendants(atomNS + "entry")
where xElem.Element(atomNS + "content").Attribute("type").Value == "video/3gpp"
select new YouTubeVideo() {
...
}).First();
I changed the original post, as I believe the node you're querying is the Element(atomNS + "content"), not the top level xElem
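Since the media:group holds several repeated media:content children, another option is to query those children directly and filter on the type attribute, rather than filtering whole entries. A sketch, assuming the mediaNS namespace object from the question (the sample uses the Media RSS namespace URI):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class VideoQuery
{
    // Returns the url of the first media:content child whose type matches,
    // or null when no such element exists.
    public static string GetUrlByType(XElement entry, XNamespace mediaNS, string type)
    {
        return entry.Element(mediaNS + "group")
                    ?.Elements(mediaNS + "content")
                    .Where(c => (string)c.Attribute("type") == type)
                    .Select(c => (string)c.Attribute("url"))
                    .FirstOrDefault();
    }

    static void Main()
    {
        XNamespace media = "http://search.yahoo.com/mrss/";
        var entry = XElement.Parse(
            "<entry xmlns:media='http://search.yahoo.com/mrss/'><media:group>" +
            "<media:content url='http://flash' type='application/x-shockwave-flash'/>" +
            "<media:content url='rtsp://mobile' type='video/3gpp'/>" +
            "</media:group></entry>");

        // Picks the second content element because its type matches.
        Console.WriteLine(VideoQuery.GetUrlByType(entry, media, "video/3gpp"));
    }
}
```

In the original query you would use this for ViewerLocation, e.g. ViewerLocation = GetUrlByType(xElem, mediaNS, "video/3gpp").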
Using XPath from this Xml Library (Just because I know how to use it) with associated Get methods:
string videoType = "video/3gpp";
XElement root = XElement.Load(file); // or .Parse(xmlstring)
var videos = root.XPath("//entry")
    .Select(xElem => new YouTubeVideo()
    {
        Title = xElem.Get("title", "title"),
        Description = xElem.Get("content", "content"),
        DateUploaded = xElem.Get("published", "published"),
        ThumbnailLocation = xElem.XGetElement("group/content[@type={0}]/url", "url", videoType),
        ViewerLocation = xElem.XGetElement("group/content[@type={0}]/url", "url", videoType)
    });
If the video type doesn't change, you can replace the XGetElement calls with:
xElem.XGetElement("group/content[@type='video/3gpp']/url", "url")
It's a lot cleaner not having to specify namespaces using the library. There are Microsoft's XPathSelectElements() and XPathSelectElement() that you can look into, but they require you to specify the namespaces and don't have the nice Get methods, IMO. The caveat is that the library isn't a complete XPath implementation, but it does work with the above.
I've been finding that for something I consider pretty important, there is very little information, and there are few libraries, on how to deal with this problem.
I found the script below while searching. I really don't know all the million ways a hacker could try to insert dangerous tags.
I have a rich HTML editor, so I need to keep non-dangerous tags but strip out bad ones.
So, is this script missing anything?
It uses the HTML Agility Pack.
public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Remove potentially harmful elements
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);
        }
    }

    // Remove hrefs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "#");
        }
    }

    // Remove imgs with refs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "#");
        }
    }

    // Remove on<Event> handlers from all tags
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onFocus");
            node.Attributes.Remove("onBlur");
            node.Attributes.Remove("onClick");
            node.Attributes.Remove("onMouseOver");
            node.Attributes.Remove("onMouseOut");
            node.Attributes.Remove("onDoubleClick");
            node.Attributes.Remove("onLoad");
            node.Attributes.Remove("onUnload");
        }
    }

    // Remove any style attributes that contain the word 'expression' (IE evaluates this as script)
    nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("style");
        }
    }

    return doc.DocumentNode.WriteTo();
}
Edit
Two people have suggested whitelisting. I actually like the idea, but I've never actually done it because no one has been able to tell me how to do it in C#, and I can't really find tutorials for it in C# either (at least the last time I looked; I will check again).
How do you make a whitelist? Is it just a list collection?
How do you actually parse out all the HTML tags, script tags, and every other tag?
Once you have the tags, how do you determine which ones are allowed? Compare them to your list collection? But what happens if the incoming content has, say, 100 tags and you allow 50? You have to compare each of those 100 tags against the 50 allowed ones. That's quite a bit to go through and could be slow.
Once you find an invalid tag, how do you remove it? I don't really want to reject a whole block of text just because one tag is invalid. I'd rather remove the tag and keep the rest.
Should I be using the HTML Agility Pack?
That code is dangerous -- you should be whitelisting elements, not blacklisting them.
In other words, make a small list of tags and attributes you want to allow, and don't let any others through.
EDIT: I'm not familiar with HTML agility pack, but I see no reason why it wouldn't work for this. Since I don't know the framework, I'll give you pseudo-code for what you need to do.
doc.LoadHtml(html);
var validTags = new List<string>(new string[] { "b", "i", "u", "strong", "em" });

var nodes = doc.DocumentNode.SelectAllNodes();
foreach (HtmlNode node in nodes)
    if (!validTags.Contains(node.Tag.ToLower()))
        node.Parent.ReplaceNode(node, node.InnerHtml);
Basically, for each tag, if it's not contained in the whitelist, replace the tag with just its inner HTML. Again, I don't know your framework, so I can't really give you specifics, sorry. Hopefully this gets you started in the right direction.
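If the editor output is well-formed XHTML, the unwrap-unless-whitelisted idea can be sketched with plain LINQ to XML (for real-world, malformed HTML the HTML Agility Pack is the safer parser; the tag lists below are illustrative, not a vetted security policy):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class WhitelistSanitizer
{
    static readonly HashSet<string> Allowed =
        new HashSet<string> { "b", "i", "u", "strong", "em", "p" };
    // Elements whose *content* is also dangerous are dropped wholesale,
    // not unwrapped.
    static readonly HashSet<string> DropEntirely =
        new HashSet<string> { "script", "style", "iframe", "object", "embed" };

    public static void Sanitize(XElement element)
    {
        // Bottom-up: sanitize children first so the snapshot stays valid.
        foreach (var child in element.Elements().ToList())
            Sanitize(child);

        if (element.Parent == null)
            return; // keep the root wrapper

        var name = element.Name.LocalName.ToLowerInvariant();
        if (DropEntirely.Contains(name))
            element.Remove();
        else if (!Allowed.Contains(name))
            element.ReplaceWith(element.Nodes().ToList()); // unwrap: keep inner content
    }

    static void Main()
    {
        var root = XElement.Parse(
            "<div><b>keep</b><script>alert(1)</script><font><em>inner</em></font></div>");
        Sanitize(root);
        Console.WriteLine(root); // script dropped, font unwrapped, b/em kept
    }
}
```

On the performance worry: a HashSet lookup is O(1), so checking 100 tags against a 50-entry whitelist is 100 constant-time lookups, not 100 × 50 comparisons.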
Yes, I already see you're missing onmousedown, onmouseup, onchange, onsubmit, etc. This is part of why you should use whitelisting for both tags and attributes. Even if you had a perfect blacklist now (very unlikely), tags and attributes are added fairly often.
See Why use a whitelist for HTML sanitizing?.
I need to parse through an aspx file (from disk, not the one rendered in the browser), make a list of all the server-side ASP.NET controls present on the page, and then create an XML file from it. Which would be the best way to do this? Also, are there any available libraries for it?
For eg, if my aspx file contains
<asp:label ID="lbl1" runat="server" Text="Hi"></asp:label>
my xml file would be
<controls>
<ID>lbl1</ID>
<runat>server</runat>
<Text>Hi</Text>
</controls>
XML parsers wouldn't understand the ASP directives: <%#, <%=, etc.
You'll probably be best off using regular expressions for this, likely in three stages:
Match any tag elements from the entire page.
For each tag, match the tag and control type.
For each tag that matches (2), match any attributes.
So, starting from the top, we can use the following regex:
(?<tag><[^%/](?:.*?)>)
This will match any tag that doesn't start with <% or </, and does so lazily (we don't want greedy expressions, as we wouldn't read the content correctly). The following could be matched:
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
<asp:Image runat="server" />
<img src="/test.png" />
For each of those captured tags, we want to then extract the tag and type:
<(?<tag>[a-z][a-z1-9]*):(?<type>[a-z][a-z1-9]*)
Creating named capture groups makes this easier, this will allow us to easily extract the tag and type. This will only match server tags, so standard html tags will be dropped at this point.
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
Will yield:
{ tag = "asp", type = "Content" }
With that same tag, we can then match any attributes:
(?<name>\S+)=["']?(?<value>(?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Which yields:
{ name = "ID", value = "ph_PageContent" },
{ name = "ContentPlaceHolderID", value = "ph_MainContent" },
{ name = "runat", value = "server" }
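The three expressions can be exercised on their own before wiring them into the full method; a quick sketch using a made-up sample tag:

```csharp
using System;
using System.Text.RegularExpressions;

class TagScanner
{
    static void Main()
    {
        var options = RegexOptions.CultureInvariant | RegexOptions.IgnoreCase;
        // Stage 1: any tag that is not a directive or closing tag.
        var tagExpr = new Regex("(?<tag><[^%/](?:.*?)>)", options);
        // Stage 2: server tags only, split into prefix and control type.
        var serverTagExpr = new Regex("<(?<tag>[a-z][a-z1-9]*):(?<type>[a-z][a-z1-9]*)", options);
        // Stage 3: name="value" attribute pairs.
        var attributeExpr = new Regex(
            "(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);

        string markup = "<asp:Image ID=\"img1\" runat=\"server\" />";

        foreach (Match tagMatch in tagExpr.Matches(markup))
        {
            Match serverTag = serverTagExpr.Match(tagMatch.Value);
            if (!serverTag.Success) continue; // plain HTML tags are skipped here
            Console.WriteLine("tag={0} type={1}",
                serverTag.Groups["tag"].Value, serverTag.Groups["type"].Value);
            foreach (Match attr in attributeExpr.Matches(tagMatch.Value))
                Console.WriteLine("  {0} = {1}",
                    attr.Groups["name"].Value, attr.Groups["value"].Value);
        }
    }
}
```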
So putting that all together, we can create a quick function that can create an XmlDocument for us:
public XmlDocument CreateDocumentFromMarkup(string content)
{
    if (string.IsNullOrEmpty(content))
        throw new ArgumentException("'content' must have a value.", "content");

    RegexOptions options = RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase;
    Regex tagExpr = new Regex("(?<tag><[^%/](?:.*?)>)", options);
    Regex serverTagExpr = new Regex("<(?<tag>[a-z][a-z1-9]*):(?<type>[a-z][a-z1-9]*)", options);
    Regex attributeExpr = new Regex("(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);

    XmlDocument document = new XmlDocument();
    XmlElement root = document.CreateElement("controls");

    // The lambda parameter is named 'doc' to avoid shadowing the outer 'document' local.
    Func<XmlDocument, string, string, XmlElement> creator = (doc, name, value) => {
        XmlElement element = doc.CreateElement(name);
        element.InnerText = value;
        return element;
    };

    foreach (Match tagMatch in tagExpr.Matches(content)) {
        Match serverTagMatch = serverTagExpr.Match(tagMatch.Value);
        if (serverTagMatch.Success) {
            XmlElement controlElement = document.CreateElement("control");
            controlElement.AppendChild(
                creator(document, "tag", serverTagMatch.Groups["tag"].Value));
            controlElement.AppendChild(
                creator(document, "type", serverTagMatch.Groups["type"].Value));
            XmlElement attributeElement = document.CreateElement("attributes");
            foreach (Match attributeMatch in attributeExpr.Matches(tagMatch.Value)) {
                if (attributeMatch.Success) {
                    attributeElement.AppendChild(
                        creator(document, attributeMatch.Groups["name"].Value, attributeMatch.Groups["value"].Value));
                }
            }
            controlElement.AppendChild(attributeElement);
            root.AppendChild(controlElement);
        }
    }
    return document;
}
The resultant document could look like this:
<controls>
  <control>
    <tag>asp</tag>
    <type>Content</type>
    <attributes>
      <ID>ph_PageContent</ID>
      <ContentPlaceHolderID>ph_MainContent</ContentPlaceHolderID>
      <runat>server</runat>
    </attributes>
  </control>
</controls>
Hope that helps!
I used the three regular expressions below with the above code, and it gives me HTML tags as well. I can also obtain the value between the opening and closing tags.
Regex tagExpr = new Regex("(?<tag><[^%/](?:.*?)>[^/<]*)", options);
Regex serverTagExpr = new Regex("<(?<type>[a-z][a-z1-9:]*)[^>/]*(?:/>|[>/])(?<value>[^</]*)", options);
Regex attributeExpr = new Regex("(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);
Func<XmlDocument, string, string, XmlElement> creator = (document, name, value) => {
    XmlElement element = document.CreateElement(name);
    element.InnerText = value;
    return element;
};
The generic lambda above works in .NET 3.5 and later, so if anyone is using an earlier version, create a function like:
public XmlElement creator(XmlDocument document, string name, string value)
{
    XmlElement element = document.CreateElement(name);
    element.InnerText = value;
    return element;
}
This will work.
ASPX files should be valid XML, so maybe XSLT would be a good solution. The W3 Schools site has a good introduction and reference. You could then call this XSLT from a simple program to pick the required file(s).
Alternatively, you could use Linq to XML to load the ASPX file(s) and iterate over the controls in a Linq-style.
If the code for a tag is written on more than one line, we may have an issue extracting the tag data. To avoid that, I removed the newline characters from the source string we pass to the above function (content):
string contentRemovedNewLines = Regex.Replace(content, @"\t|\n|\r", "");
Then we can use contentRemovedNewLines instead of content.
The above code works as I wanted. One more thing can be added: you can call the method as shown below and then save the result as an XML file, so we can check whether the expected result is there or not.
XmlDocument xmlDocWithWebContent = CreateDocumentFromMarkup(sourceToRead);
string xmlfileLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "tempXmlOutputFileOfWebSource.xml");
xmlDocWithWebContent.Save(xmlfileLocation);
To do that, we have to append a root element to the XML document:
XmlDocument document = new XmlDocument();
XmlNode xmlNode = document.CreateNode(XmlNodeType.XmlDeclaration, "", "");
document.AppendChild(xmlNode);
XmlElement root = document.CreateElement("controls");
document.AppendChild(root);
I used the above fix for that.