Create XML based on text tree - c#

I need to go from a list like this:
/home
/home/room1
/home/room1/subroom
/home/room2
/home/room2/miniroom
/home/room2/bigroom
/home/room2/hugeroom
/home/room3
to an XML file. I've tried using LINQ to XML to do this, but I just end up confused and am not sure where to go from there. Any help is much appreciated!
Edit:
I want the XML file to look something like this:
<home>
<room1>
<subroom>This is a subroom</subroom>
</room1>
<room2>
<miniroom>This is a miniroom</miniroom>
<bigroom>This is a bigroom</bigroom>
<hugeroom>This is a hugeroom</hugeroom>
</room2>
<room3></room3>
</home>
The text inside of the tags ("This is a subroom", etc.) is optional, but would be really nice to have!

Ok buddy, here's a solution.
A couple of notes and some explanation.
Your text structure can be split up into lines, and then each line can be split by the slashes into the names of the XML nodes. If you think of the text in this way, you get a list of "lines", each broken into a list of names.
/home
First of all, the first line, /home, is the root of the XML; we can get rid of it and just create an XDocument object with that name as the root element:
var xDoc = new XDocument(new XElement("home"));
Of course we don't want to hard code things but this is just an example. Now, on to the real work. If we store lines such as
/home/room1/
/home/room1/bigroom
etc...
as a List<T>, it will look like this:
myList = new List<List<string>>();
... [ add the items ]
myList[0][0] = home
myList[0][1] = room1
myList[1][0] = home
myList[1][1] = room1
myList[1][2] = bigroom
So what we can do to get the above structure is use string.Split() multiple times to break your text first into lines, then into parts of each line, and end up with a multidimensional array-style List<T> that contains List<T> objects, in this case, List<List<string>>.
First let's create the container object:
var possibleNodes = new List<List<string>>();
Next, we should split the lines. Let's call the variable that holds the text, "text".
var splitLines = text
.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.ToList();
This gives us a List but our lines are still not broken up. Let's split them again by the slash (/) character. This is where we build our node names. We can do this in a ForEach and just add to our list of possible nodes:
splitLines.ForEach(l =>
possibleNodes.Add(l
.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
.ToList()
)
);
Now, we need to know the DEPTH of the XML. Your text is at most 3 levels deep. The node depth is the maximum depth of any one given line of nodes, now stored in the List<List<string>>; we can use the .Max() method to get this:
var nodeDepth = possibleNodes.Max(n => n.Count);
A final setup step: We don't need the first line, because it's just "home" and it will be our root node. We can just create an XDocument object and give it this first line to use as the name of Root:
// Create the root node
XDocument xDoc = new XDocument(new XElement(possibleNodes[0][0]));
// We don't need it anymore
possibleNodes.RemoveAt(0);
Ok, here is where the real work happens, let me explain the rules:
We need to loop through the outer list, and through each inner list.
We can use the list indexes to understand which node to add to or which names to ignore
We need to keep hierarchy proper and not duplicate nodes, and some XLinq helps here
The loops - see the comments for a detailed explanation:
// This gets us looping through the outer nodes
for (var i = 0; i < possibleNodes.Count; i++)
{
// Here we go "sideways" by going through each inner list (each broken down line of the text)
for (var ii = 1; ii < nodeDepth; ii++)
{
// Some lines have more depth than others, so we have to check this here since we are looping on the maximum
if (ii < possibleNodes[i].Count)
{
// Let's see if this node already exists
var existingNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii]));
// Let's also see if a parent node was created in the previous loop iteration.
// This will tell us whether to add the current node at the root level, or under another node
var parentNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii - 1]));
// If the current node has already been added, we do nothing (this if statement is not entered into)
// Otherwise, existingNode will be null and that means we need to add the current node
if (null == existingNode)
{
// Now, use parentNode to decide where to add the current node
if (null == parentNode)
{
// The parent node does not exist; therefore, the current node will be added to the root node.
xDoc.Root.Add(new XElement(possibleNodes[i][ii]));
}
else
{
// There IS a parent node for this node!
// Therefore, we must add the current node to the parent node
// (remember, parent node is the previous iteration of the inner for loop on nodeDepth )
var newNode = new XElement(possibleNodes[i][ii]);
parentNode.Add(newNode);
// Add "this is a" text (bonus!) -- only adding this text if the current node is the last one in the list.
if (possibleNodes[i].Count -1 == ii)
{
newNode.Add(new XText("This is a " + newNode.Name.LocalName));
}
}
}
}
}
}
The bonus here is this code will work with any number of nodes and build your XML.
To check it, XDocument has a nifty overridden .ToString() implementation that just spits out all of the XML it is holding, so all you do is this:
Console.Write(xDoc.ToString());
And, you'll get this result:
(Note I added a test node to make sure it works with more than 3 levels)
Below, you will find the entire program with your test text, etc, as a working solution:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
namespace XmlFromTextString
{
class Program
{
static void Main(string[] args)
{
// This simulates text from a file; note that it must be flush to the left of the screen or else the extra spaces
// add unneeded nodes to the lists that are generated; for simplicity of code, I chose not to implement clean-up of that and just
// ensure that the string literal is not indented from the left of the Visual Studio screen.
string text =
@"/home
/home/room1
/home/room1/subroom
/home/room2
/home/room2/miniroom
/home/room2/test/thetest
/home/room2/bigroom
/home/room2/hugeroom
/home/room3";
var possibleNodes = new List<List<string>>();
var splitLines = text
.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.ToList();
splitLines.ForEach(l =>
possibleNodes.Add(l
.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
.ToList()
)
);
var nodeDepth = possibleNodes.Max(n => n.Count);
// Create the root node
XDocument xDoc = new XDocument(new XElement(possibleNodes[0][0]));
// We don't need it anymore
possibleNodes.RemoveAt(0);
// This gets us looping through the outer nodes
for (var i = 0; i < possibleNodes.Count; i++)
{
// Here we go "sideways" by going through each inner list (each broken down line of the text)
for (var ii = 1; ii < nodeDepth; ii++)
{
// Some lines have more depth than others, so we have to check this here since we are looping on the maximum
if (ii < possibleNodes[i].Count)
{
// Let's see if this node already exists
var existingNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii]));
// Let's also see if a parent node was created in the previous loop iteration.
// This will tell us whether to add the current node at the root level, or under another node
var parentNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii - 1]));
// If the current node has already been added, we do nothing (this if statement is not entered into)
// Otherwise, existingNode will be null and that means we need to add the current node
if (null == existingNode)
{
// Now, use parentNode to decide where to add the current node
if (null == parentNode)
{
// The parent node does not exist; therefore, the current node will be added to the root node.
xDoc.Root.Add(new XElement(possibleNodes[i][ii]));
}
else
{
// There IS a parent node for this node!
// Therefore, we must add the current node to the parent node
// (remember, parent node is the previous iteration of the inner for loop on nodeDepth )
var newNode = new XElement(possibleNodes[i][ii]);
parentNode.Add(newNode);
// Add "this is a" text (bonus!) -- only adding this text if the current node is the last one in the list.
if (possibleNodes[i].Count -1 == ii)
{
newNode.Add(new XText("This is a " + newNode.Name.LocalName));
// For the same default text on all child-less nodes, use this:
// newNode.Add(new XText("This is default text"));
}
}
}
}
}
}
Console.Write(xDoc.ToString());
Console.ReadKey();
}
}
}
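
The loops above look elements up by name anywhere in the document, so two rooms with the same name under different parents would be treated as one node. A minimal sketch of an alternative, assuming every line starts with the same root name: walk each path from the root and create only the segments that are missing. The class and method names (PathTreeBuilder, Build) are hypothetical and not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class PathTreeBuilder
{
    // Hypothetical helper: builds one XML tree from slash-separated paths.
    // Assumes every path begins with the same root segment (e.g. "home").
    public static XDocument Build(string text)
    {
        var paths = text
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim().Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
            .Where(parts => parts.Length > 0)
            .ToList();

        if (paths.Count == 0)
            throw new ArgumentException("No paths found.", nameof(text));

        var root = new XElement(paths[0][0]);
        foreach (var parts in paths)
        {
            var current = root;
            // Skip the root segment, then descend one segment at a time,
            // creating a child only when the current parent does not have it yet.
            foreach (var name in parts.Skip(1))
            {
                var child = current.Element(name);
                if (child == null)
                {
                    child = new XElement(name);
                    current.Add(child);
                }
                current = child;
            }
        }

        // Optional text: every element without child elements gets "This is a <name>".
        foreach (var leaf in root.Descendants().Where(e => !e.HasElements).ToList())
        {
            leaf.Value = "This is a " + leaf.Name.LocalName;
        }

        return new XDocument(root);
    }
}
```

Calling PathTreeBuilder.Build(text) with the test string from the program above returns an XDocument that can be printed with ToString(). Unlike the sample in the question, this sketch also puts text into childless top-level rooms such as room3; remove the last loop if that is not wanted.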

Time for LINQ magic?
// load file into string[]
var input = File.ReadAllLines("TextFile1.txt");
// in case you have more than one home in your file
var homes =
new XDocument(
new XElement("root",
from line in input
let items = line.Split(new[] { "/" }, StringSplitOptions.RemoveEmptyEntries)
group items by items[0] into g
select new XElement(g.Key,
from rooms in g.OrderBy(x => x.Length).Skip(1)
group rooms by rooms[1] into g2
select new XElement(g2.Key,
from name in g2.OrderBy(x => x.Length).Skip(1)
select new XElement(name[2], string.Format("This is a {0}", name[2]))))));
// get the right home
var home = new XDocument(homes.Root.Element("home"));

Related

HTMLAgilityPack error: "Multiple node elements can't be created."

I'm attempting to use the HTMLAgilityPack to retrieve and edit the inner text of some HTML. The inner text of each node I retrieve needs to be checked for matching strings, and those matching strings need to be highlighted like so:
var HtmlDoc = new HtmlDocument();
HtmlDoc.LoadHtml(item.Content);
var nodes = HtmlDoc.DocumentNode.SelectNodes("//div[@class='guide_subtitle_cell']/p");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(Methods.HighlightWords(htmlNode.InnerText, searchstring)), htmlNode);
}
This is the code for the HighlightWords method I use:
public static string HighlightWords(string input, string searchstring)
{
if (input == null || searchstring == null)
{
return input;
}
var lowerstring = searchstring.ToLower();
var words = lowerstring.Split(' ').ToList();
for (var i = 0; i < words.Count; i++)
{
Match m = Regex.Match(input, words[i], RegexOptions.IgnoreCase);
if (m.Success)
{
string ReplaceWord = string.Format("<span class='search_highlight'>{0}</span>", m.Value);
input = Regex.Replace(input, words[i], ReplaceWord, RegexOptions.IgnoreCase);
}
}
return input;
}
Can anyone suggest how to get this working or indicate what I'm doing wrong?
The problem is that HtmlTextNode.CreateNode can only create one node. When you add a <span> inside, that's another node, and CreateNode throws the exception you see.
Make sure that you are only doing a search and replace on the lowest leaf nodes (nodes with no children). Then rebuild that node as follows:
1. Create a new empty node to replace the old one.
2. Search for the text in .InnerText.
3. Use HtmlTextNode.Create to add the plain text before the text you want to highlight.
4. Then add your new <span> with the highlighted text with HtmlNode.CreateNode.
5. Then search for the next occurrence (start back at 1) until no more occurrences are found.
Your function HighlightWords must be returning multiple top-level HTML nodes. For example:
<p>foo</p>
<span>bar</span>
The HtmlAgilityPack only allows one top-level node to be returned. You can hardcode the return value for HighlightWords to test.
Also, this post has run across the same problem.

How to check XML nodes contained in different XML files for equality?

I have two XML files (file A and file B where file A is a subset of file B) which I read using the System.Xml.XmlDocument.LoadXml(fileName) method.
I am then selecting nodes within these files using System.Xml.XmlNode.SelectNodes(nodeName). I need to check that each selected XML node in file A is either equal to or a subset of that same node in file B. I also need to check that the order of the subnodes contained within any node in file A is the same as the order of those same subnodes within that node in file B.
For example,
fileA
<rootNodeA>
<elementA>
<subElementA>content</subElementA>
<subElementB>content</subElementB>
<subElementC>content</subElementC>
<subElementD>content</subElementD>
</elementA>
<elementB>
<subElementA>content</subElementA>
<subElementB>content</subElementB>
</elementB>
</rootNodeA>
fileB
<rootNodeB>
<elementA>
<subElementB>content</subElementB>
<subElementD>content</subElementD>
</elementA>
<elementB>
<subElementA>content</subElementA>
</elementB>
</rootNodeB>
As you see, fileB is a subset of fileA. I need to check that elementA node of file B is equal or a subset of that same elementA node in file A. This should be true for the subnodes (subElementA, etc.) as well and the content of the nodes/subnodes.
Also, if you look at elementA in fileA, there are 4 subelements in the order A, B, C, D. For that same elementA in fileB, there are 2 subelements in the order B, D. This order, i.e. B comes before D, is the same as the order in file A; I need to check this as well.
My idea is to compute Hashes of the nodes and then compare them but unsure of how or if this would satisfy the purpose.
EDIT: Code I have so far,
HashSet<XmlElement> hashA = new HashSet<XmlElement>();
HashSet<XmlElement> hashB = new HashSet<XmlElement>();
foreach (XmlElement node in nodeList)
{
hashA.Add(node);
}
foreach(XmlElement node in masterNodeList)
{
hashB.Add(node);
}
isSubset = new HashSet<XmlElement>(hashA).IsSubsetOf(hashB);
return isSubset;
This sounds like a simple recursive function.
I didn't check whether it actually works, but this should do it:
public static bool isSubset(XmlElement source, XmlElement target)
{
// child elements of both nodes, in document order (text nodes are ignored here)
var sourceChildren = source.ChildNodes.OfType<XmlElement>().ToArray();
var targetChildren = target.ChildNodes.OfType<XmlElement>().ToArray();
if (targetChildren.Length == 0)
{
if (sourceChildren.Length > 0) // surely not the same.
return false;
return string.Equals(source.InnerText, target.InnerText); // compare the text content.
}
var currentSearchIndex = 0; // where we are searching from (just past the last match)
foreach (var targetChild in targetChildren)
{
var findIndex = Array.FindIndex(sourceChildren, currentSearchIndex, el => el.Name == targetChild.Name);
if (findIndex == -1)
return false; // not found in source, therefore not a subset.
if (!isSubset(sourceChildren[findIndex], targetChild))
return false; // if the child is not a subset, then the parent isn't either.
currentSearchIndex = findIndex + 1; // move past the match so we don't match nodes that already passed.
}
return true; // every child of target was found in source, in order.
}
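
For comparison, here is a minimal sketch of the same recursive idea written against LINQ to XML (XElement) instead of XmlElement. The names XmlSubsetCheck and IsOrderedSubset are made up for this illustration. It assumes "subset" means: every child element of the smaller tree appears in the larger tree, in the same relative order, and childless elements must have equal text.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlSubsetCheck
{
    // Returns true when every child element of 'candidate' can be matched,
    // in order, to a child element of 'full' with the same name (recursively).
    public static bool IsOrderedSubset(XElement full, XElement candidate)
    {
        var fullChildren = full.Elements().ToArray();
        var candidateChildren = candidate.Elements().ToArray();

        // Leaf in the candidate: the counterpart must also be a leaf with equal text.
        if (candidateChildren.Length == 0)
        {
            return fullChildren.Length == 0 && full.Value == candidate.Value;
        }

        int searchFrom = 0; // never look at or before the previous match
        foreach (var child in candidateChildren)
        {
            int found = -1;
            for (int i = searchFrom; i < fullChildren.Length; i++)
            {
                if (fullChildren[i].Name == child.Name &&
                    IsOrderedSubset(fullChildren[i], child))
                {
                    found = i;
                    break;
                }
            }

            if (found == -1)
            {
                return false; // no remaining sibling matches this child
            }

            searchFrom = found + 1;
        }

        return true;
    }
}
```

The root elements themselves are not compared by name, so documents with differently named roots (rootNodeA and rootNodeB in the question) can be passed in directly via XDocument.Root.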

Removing invalid child nodes but keep its contents intact..?

I have some xml files that look like sample file
I want to remove invalid xref nodes from it but keep the contents of those nodes as it is.
The way to know whether a xref node is valid is to check its attribute rid's value exactly matches any of the attributes id of any node present in the entire file, so the output file of the above sample should be something like sample output file
The code I've written thus far is below
XDocument doc=XDocument.Load(@"D:\sample\sample.xml",LoadOptions.None);
var ids = from a in doc.Descendants()
where a.Attribute("id") !=null
select a.Attribute("id").Value;
var xrefs=from x in doc.Descendants("xref")
where x.Attribute("rid")!=null
select x.Attribute("rid").Value;
if (ids.Any() && xrefs.Any())
{
foreach(var xref in xrefs)
{
if (!ids.Contains(xref))
{
string content= File.ReadAllText(@"D:\sample\sample.xml");
string result=Regex.Replace(content,"<xref ref-type=\"[^\"]+\" rid=\""+xref+"\">(.*?)</xref>","$1");
File.WriteAllText(@"D:\sample\sample.xml",result);
}
}
Console.WriteLine("complete");
}
else
{
Console.WriteLine("No value found");
}
Console.ReadLine();
The problem is when the values of xref contain characters like ., *, ( etc., which need to be escaped properly in a regex replace or the replace can mess up the file.
Does anyone have a better solution to the problem?
You don't need regex to do this. Instead use element.ReplaceWith(element.Nodes()) to replace node with its children. Sample code:
XDocument doc = XDocument.Load(@"D:\sample\sample.xml", LoadOptions.None);
// use HashSet, since you only use it for lookups
var ids = new HashSet<string>(from a in doc.Descendants()
where a.Attribute("id") != null
select a.Attribute("id").Value);
// select both element itself (for update), and value of "rid"
var xrefs = from x in doc.Descendants("xref")
where x.Attribute("rid") != null
select new { element = x, rid = x.Attribute("rid").Value };
if (ids.Any()) {
var toUpdate = new List<XElement>();
foreach (var xref in xrefs) {
if (!ids.Contains(xref.rid)) {
toUpdate.Add(xref.element);
}
}
if (toUpdate.Count > 0) {
foreach (var xref in toUpdate) {
// replace with contents
xref.ReplaceWith(xref.Nodes());
}
doc.Save(@"D:\sample\sample.xml");
}
}
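
To see the ReplaceWith(Nodes()) technique in isolation, here is a small sketch that works on any XContainer, so it can be tried on an in-memory fragment without touching a file. The names XrefCleaner and UnwrapInvalidXrefs are invented for this example. It handles the innermost xref elements first so that nested invalid xrefs are unwrapped as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XrefCleaner
{
    // Replaces every <xref> whose rid matches no id attribute in the tree
    // with its own child nodes, keeping the text and markup inside it.
    public static void UnwrapInvalidXrefs(XContainer doc)
    {
        // All id values present anywhere below the container.
        var ids = new HashSet<string>(
            doc.Descendants().Attributes("id").Select(a => a.Value));

        // Materialize the list first: the tree must not change while it is queried.
        var invalid = doc.Descendants("xref")
            .Where(x => x.Attribute("rid") != null &&
                        !ids.Contains(x.Attribute("rid").Value))
            .ToList();

        // Descendants() yields document order (outer before inner),
        // so walk the list backwards to unwrap inner elements first.
        foreach (var xref in Enumerable.Reverse(invalid))
        {
            xref.ReplaceWith(xref.Nodes());
        }
    }
}
```

Because the rid values are compared as plain strings, characters such as ., * or ( need no escaping.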

passing a list to a list<T>

I am working with OpenXML and have something that is making me pull my hair out. Basically, I am editing a pre-existing document. It is a template, and the template should maintain the first page and the second, so every section I add (paragraph, table, etc.) should be added between the 2 pages. I have already accomplished that; I can insert a simple table this way:
DocTable docTable = new DocTable();
Paragraph paragraph = doc.MainDocumentPart.Document.Body.Descendants<Paragraph>()
.Where<Paragraph>(p => p.InnerText.Equals("some Text")).First();
Table table = docTable.createTable(Convert.ToInt16(2), Convert.ToInt16(2));
mainPart.Document.Body.InsertAfter(table, paragraph);
I basically search for the paragraph at the end of page 1 and insert the table after it. My problem is: I don't receive a single section from the front-end web page, I receive a list of sections. I defined this list as a list of objects without a defined type, since it can contain tables, paragraphs and other things.
So basically I have this:
List<Object> listOfSections = new List<Object>();
I receive the sections from the front end and identify what each one is with the key, like this:
foreach (DocumentAtributes section in sections.atributes)
{
if(section.key != "Document")
{
checkSection(mainPart, section, listOfSections);
}
}
public void checkSection(MainDocumentPart mainPart,DocumentAtributes section,List<Object> listOfSections)
{
switch (section.key)
{
case "Table":
DocTable docTable = new DocTable();
Table table = docTable.createTable(Convert.ToInt16(section.rows), Convert.ToInt16(section.cols));
listOfSections.Add(new Run(table));
break;
case "Paragraph":
DocRun accessTypeTitle = new DocRun();
Run permissionTitle = accessTypeTitle.createParagraph(section.text, PARAGRAPHCOLOR, Convert.ToInt16(section.fontSize), DEFAULTFONT,section.align);
listOfSections.Add(permissionTitle);
break;
case "Image":
DocImage docImage = new DocImage();
Run image = docImage.imageCreatorFromDisk(mainPart, "abcd", Convert.ToInt16(section.width), Convert.ToInt16(section.height), section.align, null, null, section.wrapChoice, section.base64);
listOfSections.Add(image);
break;
}
}
I need a way to pass this list to InsertAfter. It must be the list; I can't add the individual objects, because after I insert the first one, the next sections will also be added right after the paragraph, which is an issue since I want the order to be the same as in sections.atributes.
So InsertAfter accepts a list and I have a list of objects. The method is like this: insertAfter(List, refChild)
Can I cast my list of objects or do something else? I need some help here.
You can iterate the list in reverse to have the first element in the list immediately after the paragraph, followed by the second, then the third etc.
for (int i = listOfSections.Count - 1; i >= 0; i--)
{
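// listOfSections is a List<Object>, so each item is cast to OpenXmlElement before inserting
mainPart.Document.Body.InsertAfter((OpenXmlElement)listOfSections[i], paragraph);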
}
If you start with a list with elements:
Element1
Element2
Element3
Element4
And the document starts with just:
Paragraph
Then after each iteration you would end up with:
Iteration 1
Paragraph
Element4
Iteration 2
Paragraph
Element3
Element4
Iteration 3
Paragraph
Element2
Element3
Element4
and finally, Iteration 4
Paragraph
Element1
Element2
Element3
Element4
which is the desired result.

Select specific nodes in XML with LINQ

I'm writing a function that loads an XML document and converts it to a CSV. Since I need only some values from the XML file, the goal I'm trying to achieve is to select only the nodes I'm interested in.
Here's my code:
XDocument csvDocument = XDocument.Load(tempOutput);
StringBuilder csvBuilder = new StringBuilder(1000);
foreach (XElement node in csvDocument.Descendants("Sample"))
{
foreach (XElement innerNode in node.Elements())
{
csvBuilder.AppendFormat("{0},", innerNode.Value);
}
csvBuilder.Remove(csvBuilder.Length -1, 1);
csvBuilder.AppendLine();
}
csvOut = csvBuilder.ToString();
But this way I'm selecting ALL the child nodes inside the "Sample" node.
In the XML, "Sample" tree is:
<Sample Type="Object" Class ="Sample">
<ID>1</ID>
<Name>10096</Name>
<Type>2</Type>
<Rep>0</Rep>
<Selected>True</Selected>
<Position>1</Position>
<Pattern>0</Pattern>
</Sample>
Code works flawlessly, but I need only "ID" and "Selected" to be selected and their values written inside the CSV file.
Could anyone point me in the right direction, please?
Thanks.
Learn more about Linq-to-xml here. You're not really taking advantage of the 'linq-edness' of XObjects
var samples = csvDocument.Descendants("Sample")
.Select(el => new {
Id = el.Element("ID").Value,
Selected = el.Element("Selected").Value
});
This creates for you an IEnumerable<T> where 'T' is an anonymous type with the properties Id and Selected.
You can parse (int.Parse or bool.Parse) the Id and Selected values for type safety. But since you are simply writing to a StringBuilder object you may not care ...just an FYI.
The StringBuilder object can then be written as follows:
foreach (var sample in samples) {
csvBuilder.AppendFormat(myFormattedString, sample.Id, sample.Selected);
}
The caveat to this is that your anonymous object and the for-each loop should be within the same scope. But there are ways around that if necessary.
As always, there is more than one way to skin a cat.
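
Putting the projection and the loop together, a minimal self-contained sketch could look like the following. The class and method names (SampleCsv, ToCsv) are made up for this illustration, and the explicit (string) cast on XElement is used so that a missing ID or Selected element yields an empty field instead of a NullReferenceException.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class SampleCsv
{
    // Writes one CSV line per <Sample> element, containing only ID and Selected.
    public static string ToCsv(XDocument doc)
    {
        var csv = new StringBuilder();
        foreach (var sample in doc.Descendants("Sample"))
        {
            // The cast returns null (not an exception) when the element is absent.
            var id = (string)sample.Element("ID");
            var selected = (string)sample.Element("Selected");

            // string.Join treats null entries as empty strings.
            csv.AppendLine(string.Join(",", id, selected));
        }

        return csv.ToString();
    }
}
```

With the question's code this would be called as csvOut = SampleCsv.ToCsv(csvDocument); no trailing comma needs to be removed, because string.Join only places separators between values.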
Update ...in ref. to comment:
foreach (XElement node in csvDocument.Descendants("Sample"))
{
foreach (XElement innerNode in node.Elements())
{
// this logic assumes different formatting for values
// otherwise, change if statement to || each comparison
if(innerNode.Name == "ID") {
// append/format stringBuilder
continue;
}
if(innerNode.Name == "Selected") {
// append/format stringBuilder
continue;
}
}
csvBuilder.Remove(csvBuilder.Length -1, 1);
csvBuilder.AppendLine();
}
