C# - XML - Removing the nodes without using recursion

I have the following recursive method, which takes an XHTML document and marks nodes based on certain conditions. It is called as shown below for a number of HTML documents:
XmlDocument document = new XmlDocument();
document.LoadXml(xmlAsString);
PrepNodesForDeletion(document.DocumentElement, document.DocumentElement);
The method definition is below
/// <summary>
/// Recursive function to identify and mark all unnecessary nodes so that they can be removed from the document.
/// </summary>
/// <param name="nodeToCompareAgainst">The node that we are recursively comparing all of its descendant nodes against</param>
/// <param name="nodeInQuestion">The node whose children we are comparing against the "nodeToCompareAgainst" node</param>
static void PrepNodesForDeletion(XmlNode nodeToCompareAgainst, XmlNode nodeInQuestion)
{
if (infinityIndex++ > 100000)
{
throw new InvalidOperationException("PrepNodesForDeletion exceeded 100000 calls.");
}
foreach (XmlNode childNode in nodeInQuestion.ChildNodes)
{
// make sure we compare all of the childNodes descendants to the nodeToCompareAgainst
PrepNodesForDeletion(nodeToCompareAgainst, childNode);
if (AreNamesSame(nodeToCompareAgainst, childNode) && AllAttributesPresent(nodeToCompareAgainst, childNode))
{
// the function AnyAttributesWithDifferingValues assumes that all attributes are present between the two nodes
if (AnyAttributesWithDifferingValues(nodeToCompareAgainst, childNode) && InnerTextIsSame(nodeToCompareAgainst, childNode))
{
MarkNodeForDeletion(nodeToCompareAgainst);
}
else if (!AnyAttributesWithDifferingValues(nodeToCompareAgainst, childNode))
{
MarkNodeForDeletion(childNode);
}
}
// make sure we compare all of the childNodes descendants to the childNode
PrepNodesForDeletion(childNode, childNode);
}
}
And then the following method which would delete the marked node:-
static void RemoveMarkedNodes(XmlDocument document)
{
// in order for us to make sure we remove everything we meant to remove, we need to do this in a while loop
// for instance, if the original xml is = <a><a><b><a/></b></a><a/></a>
// this should result in the xml being passed into this function as:
// <a><b><a DeleteNode="TRUE" /></b><a DeleteNode="TRUE"><b><a DeleteNode="TRUE" /></b></a><a DeleteNode="TRUE" /></a>
// then this function (without the while) will not delete the last <a/>, even though it is marked for deletion
// if we incorporate a while loop, then we can insure all nodes marked for deletion are removed
// TODO: understand the reason for this -- see http://groups.google.com/group/microsoft.public.dotnet.xml/browse_thread/thread/25df058a4efb5698/7dd0a8b71739216c?lnk=st&q=xmlnode+removechild+recursive&rnum=2&hl=en#7dd0a8b71739216c
XmlNodeList nodesToDelete = document.SelectNodes("//*[@DeleteNode='TRUE']");
while (nodesToDelete.Count > 0)
{
foreach (XmlNode nodeToDelete in nodesToDelete)
{
nodeToDelete.ParentNode.RemoveChild(nodeToDelete);
}
nodesToDelete = document.SelectNodes("//*[@DeleteNode='TRUE']");
}
}
When I use the PrepNodesForDeletion method without the infinityIndex counter, I get an OutOfMemoryException for a few HTML documents. However, if I use the infinityIndex counter, some nodes may not be deleted for some HTML documents.
Could anybody suggest a way to remove the recursion? Also, I am not familiar with the HtmlAgility pack, so if this can be done using that, could somebody provide a code sample?
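On the removal step in the question: a minimal sketch, assuming the marker is a DeleteNode="TRUE" attribute as in the comments above. One likely explanation for needing the while loop (an assumption, not confirmed here) is that the list returned by SelectNodes is evaluated against the live document, so removing nodes while enumerating it can end the enumeration early. Copying the matches into a list first avoids that. The class name MarkedNodeRemoval is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class MarkedNodeRemoval
{
    // Removes every element carrying DeleteNode="TRUE" and returns how many were found.
    public static int RemoveMarkedNodes(XmlDocument document)
    {
        // Copy the matches first so that removals cannot disturb the enumeration.
        List<XmlNode> marked = document
            .SelectNodes("//*[@DeleteNode='TRUE']")
            .Cast<XmlNode>()
            .ToList();

        foreach (XmlNode node in marked)
        {
            // ParentNode is null only for a node that is not attached to any parent.
            node.ParentNode?.RemoveChild(node);
        }

        return marked.Count;
    }
}
```

With the snapshot in place a single pass is enough, because every marked node is visited exactly once regardless of what has already been detached.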

Well, if I understand your algorithm correctly you want to do this:
For each node in the tree compare it against all its child nodes in a non-recursive fashion, correct?
// walk the tree in DFS
public void XmlTreeWalk(XmlNode root, Action<XmlNode, XmlNode> action)
{
var nodesToCompare = new Stack<XmlNode>();
foreach (XmlNode child in root.ChildNodes)
{
nodesToCompare.Push(child);
}
while (nodesToCompare.Count > 0)
{
var top = nodesToCompare.Pop();
action(root, top);
foreach (XmlNode child in top.ChildNodes)
{
nodesToCompare.Push(child);
}
}
}
// for each node: prepare all its children for deletion
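public void PrepareForDeletion(XmlNode root)
{
// compare the root against its own descendants, then every other node against its descendants
PrepareSubtreeForDeletion(root, root);
XmlTreeWalk(root, (unused, node) => PrepareSubtreeForDeletion(node, node));
}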
// for each node, compare all its children against the toCompare node
private void PrepareSubtreeForDeletion(XmlNode toCompare, XmlNode root)
{
XmlTreeWalk(root, (unused, current) => MarkNodeForDeletion(toCompare, current));
}
// your delete logic
public void MarkNodeForDeletion(XmlNode toCompare, XmlNode toCompareAgainst)
{
...
}
What this should do is walk the tree top to bottom and, for each node, walk the subtree of that node, comparing all of its descendants against that node.
I haven't tested it, so it might contain bugs, but the idea should be clear. As far as I can tell, this algorithm is O(n^2).

To remove recursion, the children and parents must know about each other.
Then you can traverse, say, down the right leg from the root until you reach the bottom of the rightmost leg.
From there, go up one, then down left one, and then down right until the bottom. Repeat up one, down left, and then right as far as possible, and so on, until you have looped over the entire tree structure.
I'm not sure what you are attempting to do, so I can't suggest how to apply this method to your problem.
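The parent-link traversal described above can be sketched as follows. This is only an illustration under assumptions: it uses the FirstChild, NextSibling and ParentNode properties that XmlNode already exposes, it goes left-first (document order) rather than right-first, and the names ParentLinkWalker and Walk are hypothetical.

```csharp
using System;
using System.Xml;

public static class ParentLinkWalker
{
    // Visits root and every node below it in document order,
    // using neither recursion nor an explicit stack.
    public static void Walk(XmlNode root, Action<XmlNode> visit)
    {
        XmlNode current = root;
        while (current != null)
        {
            visit(current);

            // Go down first whenever there is a child.
            if (current.FirstChild != null)
            {
                current = current.FirstChild;
                continue;
            }

            // No children: climb until a node with an unvisited sibling is found,
            // never climbing above the starting node.
            while (current != root && current.NextSibling == null)
            {
                current = current.ParentNode;
            }

            if (current == root)
            {
                break; // back at the start, the whole subtree has been visited
            }

            current = current.NextSibling;
        }
    }
}
```

Because the only state is the current node, memory use stays constant no matter how deep the document is.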

Your problem is that you have badly formed XML, and as a direct result your DOM is a mess. What I think you are going to have to do is use a SAX-style parser (one must exist for .NET) and implement the logic to fix the DOM yourself, which appears to be what you're trying to do.
This method isn't recursive, but it is going to require some work that you didn't realize you needed to do.
Also note that you are getting an out of memory exception and not a stack overflow exception, which reinforces the idea that too much recursion is not your problem per se.

Related

Delete item in nested collections of Nth level

I'm having trouble trying to delete an item inside a tree-structured object.
My object is as below
TreeNode
{
string name;
ObservableCollection<TreeNode> Children;
}
I thought I could recursively process the tree, find my node and delete it, but I ran into trouble.
I did something along the lines of
Updated:
void DeleteNode(ObservableCollection<TreeNode> children, TreeNode nodetodelete)
{
if (children.Remove(nodetodelete))
{
return;
}
else
{
foreach (var child in children)
{
// recurse into the child's own collection
DeleteNode(child.Children, nodetodelete);
}
}
}
I realized while writing the code that I would eventually run into a collection-modified exception, since I am iterating through a collection that has a chance of being changed.
I could build a giant chain of for loops, since I know the exact maximum depth (which I did as a placeholder), but that seems really bad.
Can anyone point me in a better general direction? I kind of wonder if my data structure is the cause of this.
Update:
This looks awful and is kind of a code smell, but I got the recursion to "work"
by throwing an exception when I find my node.
void DeleteNode(ObservableCollection<TreeNode> children, TreeNode nodetodelete)
{
if (children.Remove(nodetodelete))
{
throw new FoundException();
}
else
{
foreach (var child in children)
{
DeleteNode(child.Children, nodetodelete);
}
}
}
Is there any other way of breaking out of a recursion?
I would deal with this by making a small change to my design (assuming the snippet in your question is pseudocode for a class):
TreeNode
{
string name;
TreeNode Parent;
ObservableCollection<TreeNode> Children;
public void Delete()
{
Parent.Children.Remove(this);
}
}
This makes a little bit more work for you maintaining an extra reference when manipulating your object graph, but saves you a lot of effort and code when doing things like deletes as you can see above.
You haven't shown how you're constructing TreeNodes, but I'd make the parent and a collection for the children arguments of the constructor.
You can safely iterate over the collection of children nodes and remove them, as long as you don't change the original collection. This can be done by creating an array of the collection and iterating over that instead.
void DeleteNode(ObservableCollection<TreeNode> children, TreeNode nodetodelete)
{
if (children.Remove(nodetodelete))
{
return;
}
else
{
foreach (var child in children.ToArray())
{
// If anything is deleted in the collection, it will not break the iteration here, as we are iterating over an array and not "children"
DeleteNode(child.Children, nodetodelete);
}
}
}
ToArray() creates a copy of the collection for you to iterate over. If a child node is deleted from children, the foreach loop will not throw an exception, because only the original collection changes while we iterate over the copy.
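As for breaking out of the recursion without throwing, one common option is to return a flag. The following is a sketch only, assuming TreeNode exposes a public Children collection as in the question's pseudocode; the TreeEditing class name is hypothetical.

```csharp
using System.Collections.ObjectModel;

public class TreeNode
{
    public string Name { get; set; }
    public ObservableCollection<TreeNode> Children { get; } = new ObservableCollection<TreeNode>();
}

public static class TreeEditing
{
    // Returns true as soon as the node has been removed, so every caller stops searching.
    public static bool DeleteNode(ObservableCollection<TreeNode> children, TreeNode nodeToDelete)
    {
        // Remove is attempted before this collection is enumerated,
        // so the enumeration below never sees a modification of "children".
        if (children.Remove(nodeToDelete))
        {
            return true;
        }

        foreach (TreeNode child in children)
        {
            if (DeleteNode(child.Children, nodeToDelete))
            {
                return true; // found deeper down: unwind without visiting more siblings
            }
        }

        return false; // not present in this subtree
    }
}
```

The return value doubles as a result for the caller, who can tell whether the node was actually found.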

A special C# Tree algorithm in Umbraco CMS

I'm creating a special tree algorithm and I need a bit of help with the code that I currently have, but before you take a look at it, please let me explain what it is meant to do.
I have a tree structure and I'm interacting with a node (any of the nodes in the tree; these nodes are Umbraco CMS classes). Upon interaction I walk the tree up to the root and store these nodes in a global collection (a List<Node> in this particular case). So far, so good. Then, upon interaction with another node, I must check whether the list already contains the parents of the clicked node. If it contains every parent but not the node itself, then the interaction is on the lowest level (I hope you are still with me).
Unfortunately, calling the Contains() function in Umbraco CMS doesn't detect that the list already contains the values, so the same values are added all over again even though I added the Contains() check.
Can anyone give me a hand here if they have already met such a problem? I exchanged the Contains() function for the Except and Union functions, and they yield the same result: the list still contains duplicates.
var currentValue = (string)CurrentPage.technologies;
List<Node> globalNodeList = new List<Node>();
string[] result = currentValue.Split(',');
foreach (var item in result)
{
var node = new Node(int.Parse(item));
if (globalNodeList.Count > 0)
{
List<Node> nodeParents = new List<Node>();
if (node.Parent != null)
{
while (node != null)
{
if (!nodeParents.Contains(node))
{
nodeParents.Add(node);
}
node = (Node)node.Parent;
}
}
else { globalNodeList.Add(node); }
if (nodeParents.Count > 0)
{
var differences = globalNodeList.Except<Node>(globalNodeList);
globalNodeList = globalNodeList.Union<Node>(differences).ToList<Node>();
}
}
else
{
if (node.Parent != null)
{
while (node != null)
{
globalNodeList.Add(node);
node = (Node)node.Parent;
}
}
else
{
globalNodeList.Add(node);
}
}
}
}
If I understand your question, you only want to see whether a particular node is an ancestor of another node. If so, just do a string check on the Path property of the node. The Path property is a comma-separated string of ids, so there is no need to build the list yourself.
Just myNode.Path.Contains(",1001") will work.
Small remarks:
If you are using Umbraco 6, use IPublishedContent instead of Node.
If you do want to build a list like this, I would rather pass multiple ids to the Umbraco helper and let Umbraco build the list (from cache).
For the second remark, you are able to do this:
var myList = Umbraco.Content(1001,1002,1003);
or with an array/list
var myList = Umbraco.Content(someNode.Path.Split(','));
and because you are crawling up to the root, you might need to add a .Reverse()
More information about the UmbracoHelper can be found in the documentation: http://our.umbraco.org/documentation/Reference/Querying/UmbracoHelper/
If you are using Umbraco 4 you can use @Library.NodesById(...)
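One caveat with a plain substring check such as Path.Contains(",1001"): it would also match a longer id like 10011. A small sketch of a stricter check, assuming only that the path is a comma-separated list of ids as described above (the PathChecks class and HasAncestor method are hypothetical names, not part of Umbraco):

```csharp
using System.Globalization;
using System.Linq;

public static class PathChecks
{
    // path is assumed to look like "-1,1001,1050": ids from the root down to the node itself.
    public static bool HasAncestor(string path, int ancestorId)
    {
        string wanted = ancestorId.ToString(CultureInfo.InvariantCulture);

        // Compare whole segments so that 1001 does not match 10011.
        return path.Split(',').Any(id => id == wanted);
    }
}
```

Note that the node's own id is normally the last segment of its path, so this returns true for the node itself as well; drop the last segment first if that matters.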

Recursive collection search

I have a collection (List<Element>) of objects as described below:
class Element
{
string Name;
string Value;
ICollection<Element> ChildCollection;
IDictionary<string, string> Attributes;
}
I build a List<Element> collection of Element objects based on some XML that I read in, this I am quite happy with. How to implement searching of these elements currently has me, not stumped, but wondering if there is a better solution.
The structure of the collection looks something like this:
- Element (A)
  - Element (A1)
    - Element (A1.1)
  - Element (A2)
- Element (B)
  - Element (B1)
    - Element (B1.1)
    - Element (B1.2)
- Element (C)
  - Element (C1)
  - Element (C2)
  - Element (C3)
Currently I am using recursion to search the Attributes dictionary of each top level (A, B, C) Element for a particular KeyValuePair. If I do not find it in the top level Element I start searching its ChildElement collection (1, 1.1, 2, 2.1, n, etc.) in the same manner.
What I am curious about is if there is a better method of implementing a search on these objects or if recursion is the better answer in this instance, if I should implement the search as I am currently, top -> child -> child -> etc. or if I should search in some other manner such as all top levels first?
Could I, and would it be reasonable to use the TPL to search each top level (A, B, C) in parallel?
Recursion is one way of implementing a tree search where you visit elements in depth-first order. You can implement the same algorithm with a loop instead of recursion by using a stack data structure to store the nodes of your tree that you need to visit.
If you use the same algorithm with a queue instead of a stack, the search would proceed in breadth-first order.
In both cases the general algorithm looks like this:
var nodes = ... // some collection of nodes
nodes.Add(root);
while (nodes.Count != 0) {
var current = nodes.Remove ... // Take the current node from the collection.
foreach (var child in current.ChildCollection) {
nodes.Add(child);
}
// Process the current node
if (current.Attributes ...) {
...
}
}
Note that the algorithm is not recursive: it uses an explicit collection of nodes to save the current state of the search, whereas a recursive implementation uses the call stack for the same purpose. If nodes is a Stack<Element>, the search proceeds in depth-first order; if nodes is a Queue<Element>, the search proceeds in breadth-first order.
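A concrete version of the loop above might look like this. It is a sketch under assumptions: the Element class mirrors the one in the question but with public members, and ElementSearch, FindDepthFirst and FindBreadthFirst are hypothetical names.

```csharp
using System.Collections.Generic;

public class Element
{
    public string Name { get; set; }
    public string Value { get; set; }
    public ICollection<Element> ChildCollection { get; } = new List<Element>();
    public IDictionary<string, string> Attributes { get; } = new Dictionary<string, string>();
}

public static class ElementSearch
{
    // Depth-first: a Stack means the most recently discovered node is examined next.
    public static Element FindDepthFirst(IEnumerable<Element> roots, string key, string value)
    {
        var pending = new Stack<Element>(roots);
        while (pending.Count > 0)
        {
            Element current = pending.Pop();
            if (current.Attributes.TryGetValue(key, out string found) && found == value)
                return current;
            foreach (Element child in current.ChildCollection)
                pending.Push(child);
        }
        return null; // no element carries the requested key/value pair
    }

    // Breadth-first: a Queue means all nodes of one level are examined before the next level.
    public static Element FindBreadthFirst(IEnumerable<Element> roots, string key, string value)
    {
        var pending = new Queue<Element>(roots);
        while (pending.Count > 0)
        {
            Element current = pending.Dequeue();
            if (current.Attributes.TryGetValue(key, out string found) && found == value)
                return current;
            foreach (Element child in current.ChildCollection)
                pending.Enqueue(child);
        }
        return null;
    }
}
```

The two methods differ only in the collection type. Note that the stack version visits siblings in reverse order, since the last one pushed is popped first.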
I grabbed this bit from SO somewhere. It's not mine, but I can't provide a link to it. This class flattens a TreeView for a recursive search, and it looks like it should do the same for you.
public static class SOExtension
{
public static IEnumerable<TreeNode> FlattenTree(this TreeView tv)
{
return FlattenTree(tv.Nodes);
}
public static IEnumerable<TreeNode> FlattenTree(this TreeNodeCollection coll)
{
return coll.Cast<TreeNode>()
.Concat(coll.Cast<TreeNode>()
.SelectMany(x => FlattenTree(x.Nodes)));
}
}
I found the link I got this from; it's very easy to use. Have a look at the question "Is there a method for searching for TreeNode.Text field in TreeView.Nodes collection?"
You can re-use existing components designed specifically for traversing in different ways, such as the NETFx IEnumerable.Traverse extension method. It lets you traverse an enumerable tree depth-first or breadth-first.
Example to get a flattened enumerable of directories:
IEnumerable<DirectoryInfo> directories = ... ;
IEnumerable<DirectoryInfo> allDirsFlattened = directories.Traverse(TraverseKind.BreadthFirst, dir => dir.EnumerateDirectories());
foreach (DirectoryInfo directoryInfo in allDirsFlattened)
{
...
}
For BreadthFirst it uses Queue<T> internally and for DepthFirst it uses Stack<T> internally.
It does not traverse nodes in parallel, and unless the traversal is resource-demanding, parallelism isn't appropriate at this level. But that depends on the context.

Most appropriate way to construct a File and Directory class in order to easily filter results when placing them on a tree

I am creating a program that recursively finds all the files and directories in the specified path. So one node may have other nodes if that node happens to be a directory.
Here is my Node class:
class Node
{
public List<Node> Children = new List<Node>(); // if node is a directory then children will be the files and directories in this directory
public FileSystemInfo Value { get; set; } // can either be a FileInfo or DirectoryInfo
public bool IsDirectory
{
get{ return Value is DirectoryInfo;}
}
public long Size // HERE IS WHERE I AM HAVING PROBLEMS! I NEED TO RETRIEVE THE
{ // SIZE OF DIRECTORIES AS WELL AS FOR FILES.
get
{
long sum = 0;
if (Value is FileInfo)
sum += ((FileInfo)Value).Length;
else
sum += Children.Sum(x => x.Size);
return sum;
}
}
// this is the method I use to filter results in the tree
public Node Search(Func<Node, bool> predicate)
{
// if node is a leaf
if(this.Children.Count==0)
{
if (predicate(this))
return this;
else
return null;
}
else // Otherwise if node is not a leaf
{
var results = Children.Select(i => i.Search(predicate)).Where(i => i != null).ToList();
if (results.Any()) // THIS IS HOW I REMOVE NODES AND RECONSTRUCT THE TREE WITH A FILTER
{
var result = (Node)MemberwiseClone();
result.Children = results;
return result;
}
return null;
}
}
}
and thanks to that node class I am able to display the tree as:
In one column I display the name of the directory or file, and on the right the size. The size is formatted as currency just because the commas make it easier to read.
Now my problem: the reason I have this program is to perform some advanced searches. For example, I may only want to search for files that have the ".txt" extension. If I apply that filter to my tree I will get:
(Note that I compile the text to a function that takes a Node and returns a bool, and I pass that function to the Search method on my Node class in order to filter results. More information on how to dynamically compile code can be found at: http://www.codeproject.com/Articles/10324/Compiling-code-during-runtime) Anyway, that has nothing to do with this question. The important part is that I removed all the nodes that did not match the criteria, and because I removed those nodes, the sizes of the directories changed!
So my question is: how can I filter results while maintaining the real size of each directory? I guess I will have to remove the Size property and replace it with a field. The problem with that is that every time I add to the tree I will have to update the size of all the parent directories, and that gets complex. Before I start coding it that way, I would appreciate your opinion on how I should implement the class.
Since you're using recursion and the size is a node-level property computed from the children, you can't expect it to keep the same sum after you remove nodes. You either promote it to an upper level (the collection) or use an external counter within the recursion (one that counts regardless of the filter; you'll need to carry it through the recursion).
Anyway, why are you implementing core .NET functionality again? Is there any reason beyond filtering or recursive search? Both are pretty well implemented in the BCL.
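One way to keep the real size independent of the filter is to compute it once, before any filtering, and copy the stored value into the filtered tree. The sketch below is only an illustration of that idea and deliberately leaves out FileSystemInfo; SizedNode, File, Directory and Filter are hypothetical names, and sizes are fixed when the tree is built bottom-up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SizedNode
{
    public string Name { get; private set; }
    public long TotalSize { get; private set; }   // stored once, never derived from filtered children
    public List<SizedNode> Children { get; private set; } = new List<SizedNode>();

    public static SizedNode File(string name, long length) =>
        new SizedNode { Name = name, TotalSize = length };

    // A directory's size is summed from its children at construction time.
    public static SizedNode Directory(string name, params SizedNode[] children) =>
        new SizedNode
        {
            Name = name,
            Children = children.ToList(),
            TotalSize = children.Sum(c => c.TotalSize)
        };

    // Returns a filtered copy of the subtree, or null if nothing matches.
    // TotalSize is copied from the original node, not recomputed.
    public SizedNode Filter(Func<SizedNode, bool> predicate)
    {
        if (Children.Count == 0)
            return predicate(this) ? this : null;

        List<SizedNode> kept = Children
            .Select(c => c.Filter(predicate))
            .Where(c => c != null)
            .ToList();

        if (kept.Count == 0)
            return null;

        return new SizedNode { Name = Name, TotalSize = TotalSize, Children = kept };
    }
}
```

Because the tree is built from the leaves upward, no parent sizes need updating afterwards; the trade-off is that the tree has to be rebuilt if the files on disk change.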

Removing default namespace attributes in XML with C# - can't pass object by ref and then iterate

I'm currently working on a buggy bit of code that's designed to strip out all the namespaces from an XML document and re-add them in the header. We use it because we ingest very large xml documents and then re-serve them in small fragments, so each item needs to replicate the namespaces in the parent document.
The XML is first loaded as an XmlDocument and then passed to a function that removes the namespaces:
_fullXml = new XmlDocument();
_fullXml.LoadXml(itemXml);
RemoveNamespaceAttributes(_fullXml.DocumentElement);
The remove function iterates through the whole documents looking for namespaces and removing them. It looks like this:
private void RemoveNamespaceAttributes(XmlNode node){
if (node.Attributes != null)
{
for (int i = node.Attributes.Count - 1; i >= 0; i--)
{
if (node.Attributes[i].Name.Contains(':') || node.Attributes[i].Name == "xmlns")
node.Attributes.Remove(node.Attributes[i]);
}
}
foreach (XmlNode n in node.ChildNodes)
{
RemoveNamespaceAttributes(n);
}
}
However, I've discovered that it doesn't work - it leaves all the namespaces intact.
If you iterate through the code with the debugger then it looks to be doing what it's supposed to - the nodes objects have their namespace attributes removed. But the original _fullXml document remains untouched. I assume this is because the function is looking at a clone of the data passed to it, rather than the original data.
So my first thought was to pass it by ref. But I can't do that because the iterative part of the function inside the foreach loop has a compile error - you can't pass the object n by reference.
Second thought was to pass the whole _fullXml document but that doesn't work either, guessing because it's still a clone.
So it looks like I need to solve the problem of passing the document by ref and then iterating through the nodes to remove all namespaces. This will require re-designing this code fragment obviously, but I can't see a good way to do it. Can anyone help?
Cheers,
Matt
To strip namespaces it could be done like this:
void StripNamespaces(XElement input, XElement output)
{
foreach (XElement child in input.Elements())
{
XElement clone = new XElement(child.Name.LocalName);
output.Add(clone);
StripNamespaces(child, clone);
}
foreach (XAttribute attr in input.Attributes())
{
try
{
output.Add(new XAttribute(attr.Name.LocalName, attr.Value));
}
catch (Exception e)
{
// Decide how to handle duplicate attributes
//if(e.Message.StartsWith("Duplicate attribute"))
//output.Add(new XAttribute(attr.Name.LocalName, attr.Value));
}
}
}
You can call it like so:
XElement result = new XElement("root");
StripNamespaces(NamespaceXml, result);
I'm not 100% sure there aren't failure cases with this but it occurs to me that you can do
string x = Regex.Replace(xml, @"(xmlns:?|xsi:?)(.*?)=""(.*?)""", "");
on the raw xml to get rid of namespaces.
It's probably not the best way to solve this but I thought I'd put it out there.
