strip out tag occurrences from XML - c#

I'd like to strip out occurrences of a specific tag, leaving the inner XML intact. I'd like to do this with one pass (rather than searching, replacing, and starting from scratch again). For instance, from the source:
<element>
<RemovalTarget Attribute="Something">
Content Here
</RemovalTarget>
</element>
<element>
More Here
</element>
I'd like the result to be:
<element>
Content Here
</element>
<element>
More Here
</element>
I've tried something like this (forgive me, I'm new to Linq):
var elements = from element in doc.Descendants()
where element.Name.LocalName == "RemovalTarget"
select element;
foreach (var element in elements) {
element.AddAfterSelf(element.Value);
element.Remove();
}
but on the second pass through the loop I get a null reference, presumably because modifying the document invalidates the collection. What is an efficient way to remove these tags from a potentially large document?

You'll have to skip the deferred execution with a call to ToList, which probably won't hurt your performance in large documents, since iterating over the matches and replacing them costs far less than the original search. As @jacob_c pointed out, I should be using element.Nodes() to replace it properly, and as @Panos pointed out, I should reverse the list in order to handle nested replacements accurately.
Also, use XElement.ReplaceWith, much faster than your current approach in large documents:
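// Call Reverse() before ToList(): List<T>.Reverse() returns void, so
// ToList().Reverse() cannot be assigned to a variable.
// Reverse document order processes nested targets before their containers.
var elements = doc.Descendants("RemovalTarget").Reverse().ToList();
foreach (var element in elements) {
    // Replace the tag with its child nodes, keeping the inner XML intact.
    element.ReplaceWith(element.Nodes());
}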
One last point: in reviewing what this may be used for, I tend to agree with @Trull that XSLT may be what you're actually looking for if, say, you're removing all <b> tags from a document. Otherwise, enjoy this fairly decent and fairly well-performing LINQ to XML implementation.

Have you considered using XSLT? It seems like the perfect solution, as you are doing exactly what XSLT is meant for: transforming one XML document into another. The templating system will delve into nested nastiness for you without problems.
Here is a basic example
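A minimal sketch of such a transform, assuming the tag to strip is named RemovalTarget as in the question and has no namespace. The stylesheet is an identity transform plus one template that emits only the children of RemovalTarget. The class and method names (XsltTagStripper, Strip) are made up for illustration.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XsltTagStripper
{
    // Identity transform, plus a template that drops the RemovalTarget
    // wrapper and keeps processing whatever it contains.
    const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='RemovalTarget'>
    <xsl:apply-templates select='node()'/>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: transforms an XML string and returns the result.
    public static string Strip(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
            transform.Load(sheet);

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
            transform.Transform(input, writer);
        return output.ToString();
    }
}
```

Because the RemovalTarget template applies templates to its children, nested occurrences are handled without any special ordering.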

I would recommend XSLT, as Trull suggested, as the best solution.
Alternatively, you might look at using a StringBuilder and regex matching to remove the items.
You could also walk through the document, working with nodes and parent nodes to move the content from inside each node to its parent, but that would be tedious and quite unnecessary given the other potential solutions out there.

A lightweight solution would be to use XmlReader to go through the input document and XmlWriter to write the output.
Note: the XmlReader and XmlWriter classes are abstract; use the derived classes appropriate for your situation.
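A rough sketch of that reader/writer copy, under a few assumptions: the input fits in a string (a real version would stream from files), the tag is matched by local name only, and entity references and DOCTYPE nodes are not handled. StreamingTagStripper and Strip are illustrative names, not an existing API.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class StreamingTagStripper
{
    // Copies the input node by node, skipping the start and end tags of
    // tagName while still copying everything between them.
    public static string Strip(string xml, string tagName)
    {
        var output = new StringBuilder();
        // Fragment mode allows several top-level elements, as in the question.
        var readerSettings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var writerSettings = new XmlWriterSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment,
            OmitXmlDeclaration = true
        };

        using (var reader = XmlReader.Create(new StringReader(xml), readerSettings))
        using (var writer = XmlWriter.Create(output, writerSettings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.LocalName == tagName) break; // drop the opening tag
                        bool isEmpty = reader.IsEmptyElement;   // capture before touching attributes
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement();  // <x/> has no EndElement node
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.LocalName != tagName) writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

This is a single forward pass that never builds a tree, which is the point of the approach for large documents.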

Depending on how you manage your XML, you could use a regular expression to remove the tags.
Here's a simple console application that demonstrates the use of a regex:
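static void Main(string[] args)
{
    // args[0] is the input path, args[1] the output path.
    string content = File.ReadAllText(args[0]);
    // Matches opening and closing RemovalTarget tags, including any attributes.
    // \b keeps it from also matching longer names that start with RemovalTarget.
    Regex tag = new Regex(@"</?RemovalTarget\b[^>]*>");
    string cleanContent = tag.Replace(content, string.Empty);
    File.WriteAllText(args[1], cleanContent);
}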
This leaves newline characters in the file, but it shouldn't be too difficult to augment the regular expression.

Related

Removing empty elements from xml with regex that matches a sequence twice

I'm looking to remove empty elements from an XML file because the reader expects a value. This is not the xsi:nil="true" case or an element without content such as <Element /> (see "Deserialize Xml with empty elements in C#"), but an element whose inner part is simply missing: <Element></Element>
I've tried writing my own code for removing these elements, but my code is too slow and the files too large. The end of every item will also contain this pattern. So the following regex would remove valid xml:
@"<.*></.*>"
I need some sort of regex that makes sure the two .* parts match the same text.
So:
<Item><One>1</One><Two></Two><Three>3</Three></Item>
Would change into:
<Item><One>1</One><Three>3</Three></Item>
The fact that it's all on one line makes this harder, because the end of the item comes right after the end of Three, producing the pattern I'd like to look for.
I don't have access to the original data that would allow recreating valid xml.
You want to capture one or more word characters inside <...> and match the closing tag by using a \1 backreference to what was captured by the first group.
<(\w+)></\1>
See demo at regex101
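In C#, the backreference pattern could be applied roughly like this. The class and method names are hypothetical; the pattern and the sample string are the ones from this thread.

```csharp
using System.Text.RegularExpressions;

static class EmptyElementRegex
{
    // (\w+) captures the tag name; \1 requires the closing tag to repeat it.
    static readonly Regex EmptyPair = new Regex(@"<(\w+)></\1>", RegexOptions.Compiled);

    // Removes every <Name></Name> pair found in a single pass.
    public static string RemoveOnce(string xml)
    {
        return EmptyPair.Replace(xml, string.Empty);
    }
}
```

Applied to the sample from the question, this removes <Two></Two> and leaves </Three></Item> alone, because a match must start with < followed by a word character rather than a slash.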
AFAIK there is no need to capture any group, because <a></b> (which a simple regex without capturing would match) is just invalid XML and can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, which is not your case.
Note that you have a problem with your regex (besides the unescaped /): you're matching any character with ., but not every character is allowed in XML tag names. If you absolutely want to use .* then it should be .*? and you should exclude /.
What I would do is to keep regex as simple as possible (still matching valid XML node names or - even better - only what you know is your data input):
<\w+><\/\w+>
You should/may have a better check for the tag name; for example, \s*[\w\d]+\s* may be slightly better, though a regex with fewer steps will perform better for very large files. You may also want to allow an optional newline between the opening and closing tags.
Note that you may need to loop until no more replacements are done if, for example, you have <outer><inner></inner></outer> and you want it to be reduced to an empty string (especially in this case don't forget to compile your regex).
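The repeat-until-stable idea can be sketched as follows, using the simpler non-capturing pattern from this answer. EmptyElementSweeper and RemoveAll are illustrative names only.

```csharp
using System.Text.RegularExpressions;

static class EmptyElementSweeper
{
    // Compiled once, since the same pattern runs on every pass.
    static readonly Regex EmptyPair = new Regex(@"<\w+></\w+>", RegexOptions.Compiled);

    // Re-runs the replacement until a pass removes nothing, so elements
    // that only become empty after their children are removed go too.
    public static string RemoveAll(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            xml = EmptyPair.Replace(xml, string.Empty);
        } while (xml.Length != previous.Length);
        return xml;
    }
}
```

Each pass rescans the whole string, so deeply nested empty elements cost one full pass per nesting level.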
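Use LINQ to XML:
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
// Remove elements that have no child elements and no text, in place,
// so the rest of the tree (including any nesting) stays as it was.
item.Descendants().Where(x => !x.HasElements && x.Value.Length == 0).Remove();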

Removing useless TextNodes in HtmlAgilityPack

I'm scraping a number of websites using HtmlAgilityPack. The problem is that it seems to insist on inserting TextNodes in most places which are either empty or just contain a mass of \n, whitespaces and \r.
They tend to cause me issues when I'm counting child nodes, since Firebug doesn't show them but HtmlAgilityPack does.
Is there a way of telling HtmlAgilityPack to stop doing it, or at least clearing out these textnodes? (I want to keep USEFUL ones though). While we're here, same thing for Comment and Script tags.
You can use the following extension method:
static class HtmlNodeExtensions
{
public static List<HtmlNode> GetChildNodesDiscardingTextOnes(this HtmlNode node)
{
return node.ChildNodes.Where(n => n.NodeType != HtmlNodeType.Text).ToList();
}
}
And call it like this:
List<HtmlNode> nodes = someNode.GetChildNodesDiscardingTextOnes();
There is a difference between "no whitespace" between two nodes and "some whitespace". So all-whitespace textnodes still are needed and significant.
Couldn't you preprocess the html and remove all nodes that you do not need, before starting the "real scraping"?
See also this answer for the "how to remove".
Create an extension method that operates on the "Child" collection (or similar) on a node that uses some LINQ to filter out unwanted nodes. Then, when you traverse your tree do something like this:
myNode.Children.FilterNodes().ForEach(x => {});
I am looking for a better answer. Here is my current method with respect to childnodes like tables rows and table cells. Nodes are identified by their name TR, TH, TD so I strip out #text every time.
List<HtmlNode> rows = table.ChildNodes.Where(w => w.Name != "#text").ToList();
Sure, it is tedious and works and could be improved by an extension.

How to iterate over xml using linq2xml or Xquery

I have an incoming file with data as
<root><![CDATA[<defs><elements>
<element><item>aa</item><int>1</int></element>
<element><item>bb</item><int>2</int></element>
<element><item>cc</item><int>3</int></element>
</elements></defs>]]></root>
writing multiple foreach( xElement x in root.Elements ) seems superfluous !
looking for a less verbose method preferably using C#
UPDATE - yes - the input is in a CDATA, rest assured it's not my design and i have ZERO control over it !
Assuming that nasty CDATA section is intentional, and you're only interested in the text content of your leaf elements, you can do something like:
XElement root = XElement.Load(yourFile);
var data = from element in XElement.Parse(root.Value).Descendants("element")
select new {
Item = element.Elements("item").First().Value,
Value = element.Elements("int").First().Value
};
That said, if the code that generates your input file is under your control, consider getting rid of the CDATA section. Storing XML within XML that way is not the way to go most of the time, as it defeats the purpose of the markup language (and requires multiple parser passes, as shown above).

How do I match complete XML objects in a string?

I'm attempting to find complete XML objects in a string. They have been placed in the string by an XmlSerializer, but may or may not be complete. I've toyed with the idea of using a regular expression, because it seems like the kind of thing they were built for, except for the fact that I'm trying to parse XML.
I'm trying to find complete objects in the form:
<?xml version="1.0"?>
<type>
<field>value</field>
...
</type>
My thought was a regex to find <?xml version="1.0"?><type> and </type>, but if a field has the same name as type, it obviously won't work.
There's plenty of documentation on XML parsers, but they seem to all need a complete, fully-formed document to parse. My XML objects can be in a string surrounded by pretty much anything else (including other complete objects).
hw<e>reR#lot$0fr#ndm&nchrs%<?xml version="1.0"?><type><field>...</field>...</type>#ndH#r$omOre!!>nuT6erjc?y!<?xml version="1.0"?><type><field>...</field>...</type>ty!=]
A regex would be able to match a string while excluding the random characters, but not find a complete XML object. I'd like some way to extract an object, parse it with a serializer, then repeat until the string contains no more valid objects.
Can you use a regular expression to search for the "<?xml" piece, assume that's the beginning of an XML object, and then use an XmlReader to read and check the remainder of the string until you have parsed one entire element at the root level (and stop reading from the stream once the root node has been completely parsed)?
Edit: For more information about using XMLReader, I suggest one of the questions I asked: I can never predict xmlreader behavior, any tips on understanding?
My final solution was to stick with the "Read" method when parsing XML and avoid other methods that actually read from the stream advancing the current position.
You could try using the Html Agility Pack, which can be used to parse "malformed XML" and make it accessible with a DOM.
It would be necessary to know which element you are looking for (like <type> in your example), because it will be parsing the accidental elements too (like <e> in your example).

C# xml read/write/xpath without using XmlDocument

I am refactoring some code in an existing system. The goal is to remove all instances of the XmlDocument to reduce the memory footprint. However, we use XPath to manipulate the xml when certain rules apply. Is there a way to use XPath without using a class that loads the entire document into memory? We've replaced all other instances with XmlTextReader, but those only worked because there is no XPath and the reading is very simple.
Some of the XPath uses values of other nodes to base its decision on. For instance, the value of the message node may be based on the value of the amount node, so there is a need to access multiple nodes at one time.
If your XPath expression is based on accessing multiple nodes, you're just going to have to read the XML into a DOM. Two things, though. First, you don't have to read all of it into a DOM, just the part you're querying. Second, which DOM you use makes a difference; XPathDocument is read-only and tuned for XPath query speed, unlike the more general-purpose but expensive XmlDocument.
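One way to read only the part you query is to stream with XmlReader and materialize one record at a time, then run XPath against that small subtree. This is a sketch under assumptions: the element name order and the query are invented for illustration (message and amount come from the question), and it only works when each rule needs nodes from a single record.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

static class StreamingXPath
{
    // Yields each element with the given local name as its own small tree;
    // only one such subtree needs to be in memory at a time.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == name)
                yield return (XElement)XNode.ReadFrom(reader); // leaves the reader after the element
            else
                reader.Read();
        }
    }

    // Hypothetical rule: collect the message of every order whose amount exceeds 100.
    public static List<string> MessagesForLargeAmounts(string xml)
    {
        var messages = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            foreach (XElement order in StreamElements(reader, "order"))
            {
                // The XPath can look at sibling nodes because the whole record is loaded.
                XElement message = order.XPathSelectElement("message[../amount > 100]");
                if (message != null) messages.Add(message.Value);
            }
        }
        return messages;
    }
}
```

Queries that need to look across records still require the whole document in memory.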
I suppose that using System.Xml.Linq.XDocument is also prohibited? Otherwise it would be a good choice, as it is faster than XmlDocument (as far as I remember).
Supporting XPath means supporting queries like:
//address[/states/state[@code=current()/@code]='California']
or
//item[@id != preceding-sibling/item/@id]
which require the XPath processor to be able to look everywhere in the document. You're not going to find a forward-only XPath processor.
One way to do this is to use XPathDocument, which can be constructed from a TextReader, so you can use a StringReader.
XPathDocument still holds the document in memory, but as a read-only structure optimized for XPath, which avoids the overhead of building a full editable DOM with XmlDocument.
Here is an example which returns the value of the first node that satisfies the XPath query:
public string extract(string input_xml)
{
    // SEARCH_EXPRESSION is the XPath query string, defined elsewhere.
    XPathDocument document = new XPathDocument(new StringReader(input_xml));
    XPathNavigator navigator = document.CreateNavigator();
    XPathNodeIterator node_iterator = navigator.Select(SEARCH_EXPRESSION);
    // MoveNext returns false when nothing matches the query.
    if (!node_iterator.MoveNext())
        return null;
    return node_iterator.Current.Value;
}
