Best way to combine nodes with Html Agility Pack

I've converted a large document from Word to HTML. It's close, but I have a bunch of "code" nodes that I'd like to merge into one "pre" node.
Here's the input:
<p>Here's a sample MVC Controller action:</p>
<code> public ActionResult Index()</code>
<code> {</code>
<code> return View();</code>
<code> }</code>
<p>We'll start by making the following changes...</p>
I want to turn it into this, instead:
<p>Here's a sample MVC Controller action:</p>
<pre class="brush: csharp"> public ActionResult Index()
{
return View();
}</pre>
<p>We'll start by making the following changes...</p>
I ended up writing a brute-force loop that iterates nodes looking for consecutive ones, but this seems ugly to me:
HtmlDocument doc = new HtmlDocument();
doc.Load(file);
var nodes = doc.DocumentNode.ChildNodes;
string contents = string.Empty;
foreach (HtmlNode node in nodes)
{
    if (node.Name == "code")
    {
        contents += node.InnerText + Environment.NewLine;
        if (node.NextSibling.Name != "code" &&
            !(node.NextSibling.Name == "#text" && node.NextSibling.NextSibling.Name == "code")
           )
        {
            node.Name = "pre";
            node.Attributes.RemoveAll();
            node.SetAttributeValue("class", "brush: csharp");
            node.InnerHtml = contents;
            contents = string.Empty;
        }
    }
}
nodes = doc.DocumentNode.SelectNodes(@"//code");
foreach (var node in nodes)
{
    node.Remove();
}
Normally I'd remove the nodes in the first loop, but that doesn't work during iteration since you can't change the collection as you iterate over it.
Better ideas?

The first approach: select all the <code> nodes, group them, and create a <pre> node per group:
var idx = 0;
var nodes = doc.DocumentNode
    .SelectNodes("//code")
    .GroupBy(n => new {
        Parent = n.ParentNode,
        Index = n.NextSiblingIsCode() ? idx : idx++
    });
foreach (var group in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    pre.AppendChild(doc.CreateTextNode(
        string.Join(Environment.NewLine, group.Select(g => g.InnerText))
    ));
    group.Key.Parent.InsertBefore(pre, group.First());
    foreach (var code in group)
        code.Remove();
}
The grouping key combines the parent node with a group index, which is incremented each time a new group starts.
I also used a NextSiblingIsCode extension method here:
public static bool NextSiblingIsCode(this HtmlNode node)
{
    return (node.NextSibling != null && node.NextSibling.Name == "code") ||
           (node.NextSibling is HtmlTextNode &&
            node.NextSibling.NextSibling != null &&
            node.NextSibling.NextSibling.Name == "code");
}
It is used to determine whether the next sibling is a <code> node.
The second approach: select only the first <code> node of each group, then walk forward from each of these nodes through the following <code> siblings until the first non-<code> node. I used XPath here:
var nodes = doc.DocumentNode.SelectNodes(
    "//code[name(preceding-sibling::*[1])!='code']"
);
foreach (var node in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    node.ParentNode.InsertBefore(pre, node);
    var content = string.Empty;
    var next = node;
    do
    {
        content += next.InnerText + Environment.NewLine;
        var previous = next;
        next = next.SelectSingleNode("following-sibling::*[1][name()='code']");
        previous.Remove();
    } while (next != null);
    pre.AppendChild(doc.CreateTextNode(
        content.TrimEnd(Environment.NewLine.ToCharArray())
    ));
}

Related

Linq to XML, simpler way to select element with highest depth level in hierarchy

I am trying to select the element with the highest depth (most nested) in hierarchy.
var elems = my_list.Elements()
.Where(x => x.Attribute("Name") != null && x.Attribute("Name").Value == "John")
;
Is there a simpler method than this one, together with filtering?
XElement elem2 = null;
int i = 0;
foreach (var elem in elems)
{
var depth = elem.AncestorsAndSelf().Count();
if(depth >= i)
{
i = depth;
elem2 = elem;
}
}

You can use MaxBy() (either from .NET 6 or from the MoreLinq package) along with your ancestor counting:
// Note the change to use Descendants, as otherwise only direct
// children will be returned, which will all have the same "level"
var element = my_list.Descendants()
    .Where(x => (string) x.Attribute("Name") == "John")
    .MaxBy(x => x.AncestorsAndSelf().Count());
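For runtimes older than .NET 6 where MoreLinq is not an option, roughly the same result can be sketched with OrderByDescending and FirstOrDefault. This is only an illustration: the class name, the FindDeepest helper, and the sample XML are made up for the example. On ties it returns the first of the deepest elements in document order, whereas the loop in the question keeps the last one.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DeepestElementDemo
{
    // Hypothetical helper: returns the most deeply nested element whose
    // Name attribute equals 'name', or null when nothing matches.
    public static XElement FindDeepest(XElement root, string name)
    {
        return root.Descendants()
            .Where(x => (string)x.Attribute("Name") == name)
            // Ancestors().Count() is the element's depth below the root.
            .OrderByDescending(x => x.Ancestors().Count())
            .FirstOrDefault();
    }

    public static void Main()
    {
        // Made-up sample data, not taken from the question.
        var root = XElement.Parse(
            "<People>" +
            "<Person Name=\"John\" Id=\"1\">" +
            "<Person Name=\"Mary\" Id=\"2\">" +
            "<Person Name=\"John\" Id=\"3\" />" +
            "</Person>" +
            "</Person>" +
            "<Person Name=\"John\" Id=\"4\" />" +
            "</People>");

        XElement deepest = FindDeepest(root, "John");
        Console.WriteLine((string)deepest.Attribute("Id")); // prints 3
    }
}
```

Sorting is O(n log n) rather than the single pass MaxBy makes, which is unlikely to matter for typical XML documents.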

parsing xsd complexType recursively

private ElementDefinition ParseComplexType(XElement complexType, string nameValue = "")
{
var name = complexType.Attribute("name");
ElementDefinition element = new ElementDefinition()
{
Elements = new List<ElementDefinition>(),
ElementName = name != null ? name.Value : string.Empty
};
foreach (var el in complexType.Descendants().Where(k => k.Parent.Parent == complexType && k.Name.LocalName == "element"))
{
ElementDefinition tempElement = new ElementDefinition();
var tempName = el.Attribute("name");
var tempType = el.Attribute("type");
if (tempName != null)
{
tempElement.ElementName = tempName.Value;
}
if (tempType != null)
{
var tempTypeValue = tempType.Value.Substring(tempType.Value.IndexOf(":") + 1, tempType.Value.Length - tempType.Value.IndexOf(":") - 1);
if (tipovi.Contains(tempTypeValue))
{
tempElement.ElementType = tempTypeValue;
element.Elements.Add(tempElement);
}
else
{
complexType = GetComplexType(tempTypeValue);
element.Elements.Add(ParseComplexType(complexType, tempName.Value));
}
}
}
if (nameValue != "") element.ElementName = nameValue;
return element;
}
Hi, this is a function I use for parsing XSD complexTypes.
This is the XSD schema I use: xsd Schema.
I have a problem parsing the complexType element at line 14.
It only parses the shipTo element, skips billTo, and parses items badly.
The result is http://pokit.org/get/?b335243094f635f129a8bc74571c8bf2.jpg
Which fixes can I apply to this function to make it work properly?
PS. "tipovi" is a list of XSD-supported types, e.g. string, positiveInteger....
EDITED:
private XElement GetComplexType(string typeName)
{
XNamespace ns = "http://www.w3.org/2001/XMLSchema";
string x = "";
foreach (XElement ele in xsdSchema.Descendants())
{
if (ele.Name.LocalName == "complexType" && ele.Attribute("name") != null)
{
x = ele.Attribute("name").Value;
if (x == typeName)
{
return ele;
}
}
}
return null;
}
GetComplexType finds complexType definition of an element type. For example, for "PurchaseOrderType" (line 10) it returns element at line 14.

NOTE: This is only a partial answer as it only explains the issue regarding the skipped "billTo" element. The code as presented in the question has many more issues.
The problem regarding skipping of the billTo element
The complexType variable is used in the predicate for the Linq method Where in the foreach loop:
complexType.Descendants().Where(k => k.Parent.Parent == complexType && k.Name.LocalName == "element")
This lambda expression uses the variable complexType, not merely its value.
By assigning another value to complexType deep down inside your foreach loop
complexType = GetComplexType(tempTypeValue);
you also change which elements are filtered by the predicate of the Where method for the remaining iterations of the foreach loop, because Where evaluates the predicate lazily as the loop advances.
The Fix
The solution is rather simple: Do not change the complexType variable within the foreach loop. You could do the call of GetComplexType like this:
XElement complexTypeUsedByElement = GetComplexType(tempTypeValue);
element.Elements.Add(ParseComplexType(complexTypeUsedByElement, tempName.Value));
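The capture behaviour described above can be reproduced in isolation. The following is a minimal sketch with made-up names (threshold, numbers), not the code from the question: the lambda reads the variable each time the lazily evaluated Where tests an element, so reassigning the variable inside the loop changes which of the remaining elements pass the filter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CapturedVariableDemo
{
    public static List<int> Run()
    {
        int threshold = 2;
        var numbers = new[] { 1, 2, 3, 4, 5 };
        var seen = new List<int>();

        // The lambda captures the variable 'threshold', not the value 2.
        // Where() is lazy, so the predicate runs per element as the loop advances.
        foreach (int n in numbers.Where(x => x > threshold))
        {
            seen.Add(n);
            // Reassigning the captured variable changes the filter
            // for every element that has not been tested yet.
            threshold = 4;
        }
        return seen;
    }

    public static void Main()
    {
        // Without the reassignment the result would be 3, 4, 5.
        Console.WriteLine(string.Join(", ", Run())); // prints 3, 5
    }
}
```

The element 4 is skipped for the same reason billTo is skipped in the question: by the time it is tested, the predicate is comparing against the new value.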

Use the object returned by LINQ

I'm using LINQ to find an object from an XML file. After I find the object, I want to print its details, but I'm not really sure how I can use the object I found.
This is my code:
var apartmentExist =
from apartment1 in apartmentXml.Descendants("Apartment")
where (apartment1.Attribute("street_name").Value == newApartment.StreetName) &&
(apartment1.Element("Huose_Num").Value == newApartment.HouseNum.ToString())
select apartment1.Value;
if (apartmentExist.Any() == false)
{
Console.WriteLine("Sorry, Apartment at {0} or at num {1}", newApartment.StreetName,
newApartment.HouseNum);
}
else
{
//print the details of apartment1
}
My XML is:
<?xml version="1.0" encoding="utf-8"?>
<Apartments>
<Apartment street_name="sumsum">
<Huose_Num>13</Huose_Num>
<Num_Of_Rooms>4</Num_Of_Rooms>
<Price>10000</Price>
<Flags>
<Elevator>true</Elevator>
<Floor>1</Floor>
<parking_spot>true</parking_spot>
<balcony>true</balcony>
<penthouse>true</penthouse>
<status_sale>true</status_sale>
</Flags>
</Apartment>
</Apartments>

Your LINQ query returns a sequence (an IEnumerable<T>), not a single object. If you expect it to return more than one element, use a foreach loop to print the elements; if there is only one result, call the .Single() extension method to get that one item instead of a collection.
Casting an XElement to string is safer than using the XElement.Value property, because the cast returns null instead of throwing a NullReferenceException when the element does not exist. You should also use the (int)XElement cast and compare numbers, instead of comparing XElement.Value to the string representation of a number.
Do not use the Descendants method here; use Elements instead. It makes the query faster because only the direct children that need to be searched are processed.
Call FirstOrDefault and check whether the result is null, instead of calling Any followed by First. This prevents the query from executing twice.
Instead of returning apartment1.Value, which is a string, return apartment1 itself. It is an XElement, so you can read its content later when necessary.
var apartmentExist =
    from apartment1 in apartmentXml.Root.Elements("Apartment")
    where ((string)apartment1.Attribute("street_name") == newApartment.StreetName) &&
          ((int)apartment1.Element("Huose_Num") == newApartment.HouseNum)
    select apartment1;
var apartment = apartmentExist.FirstOrDefault();
if (apartment == null)
{
    Console.WriteLine("Sorry, Apartment at {0} or at num {1}", newApartment.StreetName, newApartment.HouseNum);
}
else
{
    // you can use the apartment variable here. It's an XElement
    var huoseNum = (string)apartment.Element("Huose_Num");
    // flags: iterate the children of <Flags>, not the <Flags> element itself
    foreach (var flag in apartment.Element("Flags").Elements())
    {
        var name = flag.Name;
        var value = (string)flag;
    }
}
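The point about casts versus .Value can be checked with a small stand-alone sketch. The class and method names are invented for the example, and the XML is a cut-down version of the one in the question with Price and Num_Of_Rooms deliberately left out so that the missing-element case shows up.

```csharp
using System;
using System.Xml.Linq;

public static class SafeCastDemo
{
    // Cut-down sample: Price and Num_Of_Rooms are intentionally missing.
    public static XElement Sample()
    {
        return XElement.Parse(
            "<Apartment street_name=\"sumsum\"><Huose_Num>13</Huose_Num></Apartment>");
    }

    // The explicit string conversion returns null for a missing element.
    public static string PriceOrNull(XElement apartment)
    {
        return (string)apartment.Element("Price");
    }

    // The nullable int conversion also returns null for a missing element.
    public static int? RoomsOrNull(XElement apartment)
    {
        return (int?)apartment.Element("Num_Of_Rooms");
    }

    public static void Main()
    {
        XElement apartment = Sample();

        Console.WriteLine(PriceOrNull(apartment) == null);   // prints True
        Console.WriteLine(RoomsOrNull(apartment).HasValue);  // prints False

        // An existing element converts to a real number, so it compares numerically.
        Console.WriteLine((int)apartment.Element("Huose_Num") == 13); // prints True

        try
        {
            // .Value dereferences the null returned by Element("Price").
            string price = apartment.Element("Price").Value;
            Console.WriteLine(price);
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("Price element is missing");
        }
    }
}
```

Casting to the nullable form (int?) is the variant to reach for when an element is optional; the non-nullable (int) cast is meant for elements that must exist.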

You can do it with one LINQ query like this:
var apartment =
    (from a in apartmentXml.Descendants("Apartment")
     where (a.Attribute("street_name").Value == newApartment.StreetName) &&
           (a.Element("Huose_Num").Value == newApartment.HouseNum.ToString())
     // Flags is a single element, so bind it with let instead of a nested query
     let f = a.Element("Flags")
     select new {
         street_name = a.Attribute("street_name").Value,
         Huose_Num = a.Element("Huose_Num").Value,
         Num_Of_Rooms = a.Element("Num_Of_Rooms").Value,
         Price = a.Element("Price").Value,
         Flags = new {
             Elevator = f.Element("Elevator").Value,
             Floor = f.Element("Floor").Value,
             parking_spot = f.Element("parking_spot").Value,
             balcony = f.Element("balcony").Value,
             penthouse = f.Element("penthouse").Value,
             status_sale = f.Element("status_sale").Value
         }
     }).FirstOrDefault();
if (apartment == null)
{
    Console.WriteLine("Sorry, Apartment at {0} or at num {1}", newApartment.StreetName,
        newApartment.HouseNum);
}
else
{
    Console.WriteLine(apartment.street_name);
    Console.WriteLine(apartment.Huose_Num);
    Console.WriteLine(apartment.Num_Of_Rooms);
    Console.WriteLine(apartment.Price);
    Console.WriteLine(apartment.Flags.Elevator);
    Console.WriteLine(apartment.Flags.Floor);
    Console.WriteLine(apartment.Flags.parking_spot);
    Console.WriteLine(apartment.Flags.balcony);
    Console.WriteLine(apartment.Flags.penthouse);
    Console.WriteLine(apartment.Flags.status_sale);
}

Try this:
var xml = @"<?xml version=""1.0"" encoding=""utf-8""?>
<Apartments>
<Apartment street_name=""sumsum"">
<Huose_Num>13</Huose_Num>
<Num_Of_Rooms>4</Num_Of_Rooms>
<Price>10000</Price>
<Flags>
<Elevator>true</Elevator>
<Floor>1</Floor>
<parking_spot>true</parking_spot>
<balcony>true</balcony>
<penthouse>true</penthouse>
<status_sale>true</status_sale>
</Flags>
</Apartment>
</Apartments>
";
var apartmentXml = XElement.Parse( xml );
//apartmentXml.Dump(); // This is a linqpad feature
var new_street = "sumsum";
var new_house_num = "13";
var match_apartment = apartmentXml.Elements().Where (x => x.Attribute("street_name").Value == new_street && x.Element("Huose_Num").Value == new_house_num );
//match_apartment.Dump();
if (match_apartment.Count() < 1 )
{
Console.WriteLine("Sorry, Apartment at {0} or at num {1}", new_street,
new_house_num);
}
else
{
foreach( var x in match_apartment.Elements() )
{
Console.WriteLine("{0} | {1}", x.Name, x.Value );
}
}

apartmentExist is an IEnumerable, so to access an individual item within it, convert it to a list and use list indexing:
Console.WriteLine(apartmentExist.ToList()[0]);
This prints the first item found by the query above (a string, since the query selects apartment1.Value).

How iterate on a Jdom Element that contains a list of node?

I am pretty new to XML parsing in Java using org.jdom.** and I don't know C#.
I have to translate some methods from C# to Java and I have the following problem.
In C# I have something like this:
System.Xml.XmlNodeList nodeList;
nodeList = _document.SelectNodes("//root/settings/process-filters/process");
if (nodeList == null || nodeList.Count == 0) {
return risultato;
}
Objects.MyItem o;
foreach (System.Xml.XmlNode n in nodeList){
o = new Objects.MyItem(n.ChildNodes[1].InnerText, n.ChildNodes[0].InnerText);
risultato.Add(o);
}
And I have translated it into Java this way:
public List<ProcessiDaEscludere> getProcessiDaEscludere() {
Vector<ProcessiDaEscludere> risultato = new Vector<ProcessiDaEscludere>();
Element nodeList;
XPath xPath;
try {
// XPath query that accesses the root of the <process-filters> tag:
xPath = XPath.newInstance("//root/settings/process-filters/process");
nodeList = (Element) xPath.selectSingleNode(CONFIG_DOCUMENT);
if (nodeList == null || nodeList.getChildren().size() == 0)
return risultato;
ProcessiDaEscludere processoDaEscludere = new ProcessiDaEscludere();
}catch (JDOMException e){
}
return risultato;
}
My problem is that I have no idea how to iterate over the Element nodeList variable the way these C# lines do:
foreach (System.Xml.XmlNode n in nodeList){
o = new Objects.MyItem(n.ChildNodes[1].InnerText, n.ChildNodes[0].InnerText);
risultato.Add(o);
}
Can someone help me?

C# XML Parsing- find position of an element and read next elements

Hi, I have a sample XML as follows:
<ROOTELEMENT>
<RECORDSET>
<ROW><VALUE>AAA</VALUE></ROW>
<ROW><VALUE>0</VALUE></ROW>
<ROW><VALUE>00</VALUE></ROW>
<ROW><VALUE>BBB</VALUE></ROW>
<ROW><VALUE>1</VALUE></ROW>
<ROW><VALUE>2</VALUE></ROW>
<ROW><VALUE>CCC</VALUE></ROW>
<ROW><VALUE>3</VALUE></ROW>
<ROW><VALUE>30</VALUE></ROW>
</RECORDSET>
<RECORDSET>
<ROW><VALUE>DDD</VALUE></ROW>
<ROW><VALUE>4</VALUE></ROW>
<ROW><VALUE>40</VALUE></ROW>
<ROW><VALUE>EEE</VALUE></ROW>
<ROW><VALUE>5</VALUE></ROW>
<ROW><VALUE>6</VALUE></ROW>
<ROW><VALUE>FFF</VALUE></ROW>
<ROW><VALUE>7</VALUE></ROW>
<ROW><VALUE>70</VALUE></ROW>
</RECORDSET>
</ROOTELEMENT>
I have to get the position of a particular ROW with some VALUE. After that, I have to read the VALUE of a specified number of ROWs from that position onwards.
Example: if I give the value 'BBB', I have to get the next two values, '1' and '2'. If I give the value 'FFF', I have to get the next two values, '7' and '70'.
I am using .NET Framework 2.0 and cannot use LINQ. Please help me.

You can use the code below. It iterates through the nodes and stores the values you expect in foundValues:
string valueToFind = "FFF";
string xml = @"<ROOTELEMENT>
<RECORDSET>
<ROW><VALUE>AAA</VALUE></ROW>
<ROW><VALUE>0</VALUE></ROW>
<ROW><VALUE>00</VALUE></ROW>
<ROW><VALUE>BBB</VALUE></ROW>
<ROW><VALUE>1</VALUE></ROW>
<ROW><VALUE>2</VALUE></ROW>
<ROW><VALUE>CCC</VALUE></ROW>
<ROW><VALUE>3</VALUE></ROW>
<ROW><VALUE>30</VALUE></ROW>
</RECORDSET>
<RECORDSET>
<ROW><VALUE>DDD</VALUE></ROW>
<ROW><VALUE>4</VALUE></ROW>
<ROW><VALUE>40</VALUE></ROW>
<ROW><VALUE>EEE</VALUE></ROW>
<ROW><VALUE>5</VALUE></ROW>
<ROW><VALUE>6</VALUE></ROW>
<ROW><VALUE>FFF</VALUE></ROW>
<ROW><VALUE>7</VALUE></ROW>
<ROW><VALUE>70</VALUE></ROW>
</RECORDSET>
</ROOTELEMENT>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
int count = 0;
List<string> foundValues = new List<string>();
foreach (XmlNode root in doc.ChildNodes)
foreach (XmlNode recorset in root.ChildNodes)
foreach (XmlNode row in recorset.ChildNodes)
foreach (XmlNode value in row.ChildNodes)
{
if (value.InnerText == valueToFind || count == 1 || count == 2)
{
if (count == 1 || count == 2)
foundValues.Add(value.InnerText);
count++;
}
}
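As an alternative to the nested loops, the same lookup can be sketched with a single XPath query, which XmlDocument supports without LINQ. This is a swapped-in technique, not the answer's method: the ValuesAfter helper and its parameters are hypothetical, and the sketch assumes the search value contains no single quote, since it is concatenated into the XPath expression.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FollowingRowsDemo
{
    // Hypothetical helper: returns the VALUE text of the 'howMany' ROW
    // siblings that follow the ROW whose VALUE equals 'valueToFind'.
    // Assumes valueToFind contains no single quote.
    public static List<string> ValuesAfter(XmlDocument doc, string valueToFind, int howMany)
    {
        string xpath = "//ROW[VALUE='" + valueToFind + "']" +
                       "/following-sibling::ROW[position() <= " + howMany + "]/VALUE";

        List<string> result = new List<string>();
        foreach (XmlNode value in doc.SelectNodes(xpath))
        {
            result.Add(value.InnerText);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened version of the sample document from the question.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<ROOTELEMENT><RECORDSET>" +
            "<ROW><VALUE>AAA</VALUE></ROW>" +
            "<ROW><VALUE>0</VALUE></ROW>" +
            "<ROW><VALUE>00</VALUE></ROW>" +
            "<ROW><VALUE>BBB</VALUE></ROW>" +
            "<ROW><VALUE>1</VALUE></ROW>" +
            "<ROW><VALUE>2</VALUE></ROW>" +
            "</RECORDSET></ROOTELEMENT>");

        foreach (string s in ValuesAfter(doc, "BBB", 2))
        {
            Console.WriteLine(s); // prints 1, then 2
        }
    }
}
```

Because following-sibling only looks inside the same RECORDSET, a match near the end of one RECORDSET will not pull rows from the next one.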

A bit late, but here is a LINQ to XML alternative:
private static string getXML()
{
return @"<ROOTELEMENT>
<RECORDSET>
<ROW><VALUE>AAA</VALUE></ROW>
<ROW><VALUE>0</VALUE></ROW>
<ROW><VALUE>00</VALUE></ROW>
<ROW><VALUE>BBB</VALUE></ROW>
<ROW><VALUE>1</VALUE></ROW>
<ROW><VALUE>2</VALUE></ROW>
<ROW><VALUE>CCC</VALUE></ROW>
<ROW><VALUE>3</VALUE></ROW>
<ROW><VALUE>30</VALUE></ROW>
</RECORDSET>
<RECORDSET>
<ROW><VALUE>DDD</VALUE></ROW>
<ROW><VALUE>4</VALUE></ROW>
<ROW><VALUE>40</VALUE></ROW>
<ROW><VALUE>EEE</VALUE></ROW>
<ROW><VALUE>5</VALUE></ROW>
<ROW><VALUE>6</VALUE></ROW>
<ROW><VALUE>FFF</VALUE></ROW>
<ROW><VALUE>7</VALUE></ROW>
<ROW><VALUE>70</VALUE></ROW>
</RECORDSET>
</ROOTELEMENT>";
}
private static void parseXML()
{
var xmlString = getXML();
var xml = XDocument.Parse(xmlString);
var values = xml.Descendants("VALUE");
var groups = values.Select((value, index) => new
{
Index = index,
Value = value
})
.GroupBy(x => x.Index / 3)
.Select(g => new Tuple<XElement, XElement, XElement>(g.ElementAt(0).Value,
g.ElementAt(1).Value,
g.ElementAt(2).Value));
}
