Select node based on sibling properties - HtmlAgilityPack - C#

I have an HTML document that is structured as follows:
<ul class="beverageFacts">
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>
I need to parse the values of the <strong> tags into the corresponding strings, depending on the value of the sibling <span> tag.
I have the following:
String vintage;
String sugar;
String abv;
As of now, I am looping through each child node of the beverageFacts node and checking its value in order to assign it to the corresponding string.
The code I have so far to get the "Vintage" value is the following, but the result is always null:
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode subNode in childNodes)
{
if (subNode.InnerText.TrimStart() == "Vintage")
vintage = subNode.NextSibling.InnerText.Trim();
}
I believe my selection of the nodes is incorrect, but I cannot figure out how to properly do it in the most efficient way.
Is there an easy way to achieve this?
Edit 2013-07-29
I have tried to remove the whitespace, as suggested by enricoariel in the comments, using the following code:
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://www.systembolaget.se/" + articleID);
string cleanDoc = Regex.Replace(page.DocumentNode.OuterHtml, @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);
HtmlDocument cleanPage = new HtmlDocument();
cleanPage.LoadHtml(cleanDoc);
The result is still
String vintage = null;

Looking at the HTML markup, I realized I didn't go deep enough into the nodes.
Also, as enricoariel pointed out, there is whitespace between the tags that I did not clean properly. HtmlAgilityPack exposes that whitespace as a text node, so the first NextSibling of the <span> is the whitespace, not the <strong>. By skipping that sibling and jumping to the following one, I get the correct result:
foreach (HtmlNode bevFactNode in bevFactsNodes)
{
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode node in childNodes)
{
foreach(HtmlNode subNode in node.ChildNodes)
{
if (subNode.InnerText.Trim() == "Årgång")
vintage = HttpUtility.HtmlDecode(subNode.NextSibling.NextSibling.InnerText.Trim());
}
}
}
Console.WriteLine("Vintage: " + vintage);
will output
Vintage: 2007
I decoded the HTML to get the result formatted correctly.
Lessons learned!

To summarize, I think the best solution is to strip all whitespace between tags with a regex before retrieving the NextSibling value:
string myHtml =
@"
<ul class='beverageFacts'>
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>";
// Remove whitespace before and after tags
myHtml = Regex.Replace(myHtml, @"\s+<", "<", RegexOptions.Multiline | RegexOptions.Compiled);
myHtml = Regex.Replace(myHtml, @">\s+", "> ", RegexOptions.Compiled | RegexOptions.Multiline);
HtmlDocument doc = new HtmlDocument();
// Options must be set before loading for them to take effect
doc.OptionFixNestedTags = true;
// Strip any remaining line breaks; ordinary spaces are kept so that attributes and values such as "5 gram/liter" stay intact
doc.LoadHtml(myHtml.Replace("\r", "").Replace("\n", ""));
HtmlNodeCollection vals = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");
var myNodeContent = string.Empty;
foreach (HtmlNode val in vals)
{
if (val.InnerText == "Vintage")
{
myNodeContent = val.NextSibling.InnerText;
}
}
return myNodeContent;
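
As an alternative to cleaning the whitespace first, the sibling lookup can be pushed into the XPath expression itself. The following is only a sketch under a few assumptions: the class name BeverageFacts and the method GetFact are made up for illustration, the markup looks like the sample in the question, and the label never contains a single quote (it is concatenated into the XPath). The following-sibling::strong[1] step selects the first <strong> after the matching <span>, so the whitespace text node in between does not matter.

```csharp
using System;
using HtmlAgilityPack;

public static class BeverageFacts
{
    // Hypothetical helper: returns the trimmed text of the <strong> that follows
    // the <span> whose text equals the given label, or null if there is no match.
    public static string GetFact(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // normalize-space() ignores leading/trailing whitespace inside the <span>;
        // following-sibling::strong[1] skips any text node between the two tags.
        // Assumption: label contains no single quote, since it is concatenated in.
        string xpath = "//ul[@class='beverageFacts']/li/span[normalize-space()='"
                       + label + "']/following-sibling::strong[1]";

        HtmlNode strong = doc.DocumentNode.SelectSingleNode(xpath);
        if (strong == null)
        {
            return null; // SelectSingleNode returns null when nothing matches
        }

        // Decode HTML entities, then trim the surrounding whitespace.
        return HtmlEntity.DeEntitize(strong.InnerText).Trim();
    }
}
```

Calling BeverageFacts.GetFact(myHtml, "Vintage") with the sample markup should give the vintage value without any regex pre-processing; the same call with "ABV" or "Sugar" fills the other two strings.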

Related

C#: Regular expression to replace all <font> tags in HTML by <span>

I would like to replace all <font> tags in a HTML file by <span style="..."> and retain the attributes such as font color and font size.
Here are the test cases:
<font color='#000000'>Case 1</font><br />
<font size=6>Case 2</font><br />
<font color="red" size="12">Case 3</font>
Here is the expected result:
<span style="color:#000000">Case 1</span><br />
<span style="font-size:6rem">Case 2</span><br />
<span style="color:red; font-size:12rem">Case 3</span>
With the C# code below, cases 1 and 2 are replaced successfully because they have only one style attribute. However, the second attribute in case 3 is lost. Is it possible to improve the C# code below so that it keeps both "color" and "size"?
string pattern = "<font (color|size)=(?:\"|'|)([a-z0-9#\\-]+)(?:\"|'|).*?>(.*?)<\\/font>";
Regex regex = new Regex(pattern, RegexOptions.Singleline);
output = regex.Replace(output, delegate (Match m) {
string attr = m.Groups[1].Value.Trim();
string value = m.Groups[2].Value.Trim();
string text = m.Groups[3].Value.Trim();
if (attr.Equals("size")) {
attr = "font-size";
value += "px";
}
return string.Format("<span style=\"{0}:{1};\">{2}</span>", attr, value, text);
});
Thank you very much!

As commented by @Steve B:
Don't use regex. HTML has so many ways to write tags that you'll end up with a monstrous regex. My advice is to use HtmlAgilityPack, which allows you to parse and manipulate HTML. This lib is a golden nuget when dealing with HTML manipulation. And it's free and open source.
Here is how you can do this using HtmlAgilityPack:
public string ReplaceFontBySpan()
{
HtmlDocument doc = new HtmlDocument();
string htmlContent = @"<font color='#000000'>Case 1</font><br />
<font size=6>Case 2</font><br />
<font color='red' size='12'>Case 3</font>";
doc.LoadHtml(htmlContent);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//font"))
{
var attributes = node.Attributes;
foreach (var item in attributes)
{
if (item.Name.Equals("size"))
{
item.Name = "font-size";
item.Value = item.Value + "rem";
}
}
var attributeValueList = node.Attributes.Select(x => x.Name + ":" + x.Value).ToList();
string attributeName = "style";
string attributeValue = string.Join(";", attributeValueList);
HtmlNode span = doc.CreateElement("span");
span.Attributes.Add(attributeName, attributeValue);
span.InnerHtml = node.InnerHtml;
node.ParentNode.ReplaceChild(span, node);
}
return doc.DocumentNode.OuterHtml;
}
Output:

Manipulating an HTML file results in incorrect String indexes [duplicate]

string content="
<br /><br />Cooking School<br /><br />Feed your senses<br /><br />Take your cooking skills to the next level. Find a cooking school near you!<br /><br /><img src="http://www.sdlm1.com/autd3umrl_u_t.jpg" />
"
I need to replace the href value of every anchor tag with a different URL.
I used the following function, but it throws an error:
public List<string> GetLinksFromHtml(string content)
{
string regex = @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)'))";
var matches = Regex.Matches(content, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
var links = new List<string>();
foreach (Match item in matches)
{
string link = item.Groups[1].Value;
links.Add(link);
}
return links;
}
Thanks for any help

Trying to parse HTML with regex is not a good idea. See this post. Use a real HTML parser like HtmlAgilityPack.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
foreach (var a in doc.DocumentNode.Descendants("a"))
{
a.Attributes["href"].Value = "http://a.com?url=" + HttpUtility.UrlEncode(a.Attributes["href"].Value);
}
var newContent = doc.DocumentNode.OuterHtml;
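
One caveat with the loop above is that a.Attributes["href"] is null for anchors without an href (for example <a name="top">), which would throw. A slightly more defensive sketch follows. The class LinkRewriter, the method WrapLinks and the prefix parameter are hypothetical names chosen for illustration, and Uri.EscapeDataString is swapped in for HttpUtility.UrlEncode only to avoid the System.Web dependency (it encodes spaces as %20 rather than +).

```csharp
using System;
using HtmlAgilityPack;

public static class LinkRewriter
{
    // Hypothetical helper: prefixes every href with a redirect URL and
    // leaves anchors without an href attribute untouched.
    public static string WrapLinks(string html, string prefix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only anchors that actually carry an href attribute are selected.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode a in anchors)
            {
                string oldUrl = a.GetAttributeValue("href", "");
                a.SetAttributeValue("href", prefix + Uri.EscapeDataString(oldUrl));
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

With this shape, var newContent = LinkRewriter.WrapLinks(content, "http://a.com?url="); plays the same role as the snippet above.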

Get innerText from <div class> with an <a href> child

I am working with a WebBrowser control in C# and I need to get the text from a link. The link is just an <a href> without a class.
It looks like this:
<div class="class1" title="myfirstClass">
<a href="link.php">text I want read in C#
<span class="order-level"></span>
Shouldn't it be something like this?
HtmlElementCollection theElementCollection = default(HtmlElementCollection);
theElementCollection = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement curElement in theElementCollection)
{
if (curElement.GetAttribute("className").ToString() == "class1")
{
HtmlElementCollection childDivs = curElement.Children.GetElementsByName("a");
foreach (HtmlElement childElement in childDivs)
{
MessageBox.Show(childElement.InnerText);
}
}
}

This is how you get the elements by tag name:
HtmlElementCollection elems = webBrowser1.Document.GetElementsByTagName("div");
And with this you should be able to extract the value of the href, provided the element's markup is well-formed XML (XElement.Parse throws otherwise):
var hrefLink = XElement.Parse(elems[0].OuterHtml)
.Descendants("a")
.Select(x => x.Attribute("href").Value)
.FirstOrDefault();
If you have more than one "a" tag in it, you could also put this in a foreach loop, if that is what you want.
EDIT:
With XElement:
You can get the content including the outer node by calling element.ToString().
If you want to exclude the outer tag, you can call String.Concat(element.Nodes()).
To get innerHTML with HtmlAgilityPack:
Install HtmlAgilityPack from NuGet.
Use this code.
HtmlWeb web = new HtmlWeb();
HtmlDocument dc = web.Load("Your_Url");
var s = dc.DocumentNode.SelectSingleNode("//a[@name='a']").InnerHtml;
I hope it helps!

Here I created a console app to extract the text of the anchor.
static void Main(string[] args)
{
string input = "<div class=\"class1\" title=\"myfirstClass\"><a href=\"link.php\">text I want read in C#<span class=\"order-level\"></span>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
foreach (HtmlNode item in doc.DocumentNode.Descendants("div"))
{
var link = item.Descendants("a").First();
var text = link.InnerText.Trim();
Console.Write(text);
}
Console.ReadKey();
}
Note that this is an HtmlAgilityPack question, so tag the question accordingly.

HTML Agility Pack - Grab Text after a node

I have some HTML that I'm parsing using C#
The sample text is below, though this is repeated about 150 times with different records
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text in an array which will be like
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up until the BR tag, and then finding the next STRONG tag.
Any help would be appreciated.

You can use XPath following-sibling::text()[1] to get text node located directly after each strong. Here is a minimal but complete example :
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
var val = node.SelectSingleNode("following-sibling::text()[1]");
Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
dotnetfiddle demo
output :
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
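
To make that last step concrete, here is a rough sketch that strips the leading ":" and collects label/value pairs. The names CustomerFields and Parse are made up for illustration, and a list of two-element arrays is used instead of the two-dimensional customerArray from the question, since the number of records is not known up front.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CustomerFields
{
    // Hypothetical helper: returns one { label, value } pair per <strong> element.
    public static List<string[]> Parse(string html)
    {
        var rows = new List<string[]>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//strong");
        if (labels == null)
        {
            return rows; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode label in labels)
        {
            // First text node after the <strong>, e.g. ": Mr"
            HtmlNode text = label.SelectSingleNode("following-sibling::text()[1]");
            if (text == null)
            {
                continue;
            }

            // Drop surrounding whitespace, then the leading colon, then the space after it.
            string value = text.InnerText.Trim().TrimStart(':').Trim();
            rows.Add(new[] { label.InnerText.Trim(), value });
        }
        return rows;
    }
}
```

Each entry then corresponds to one row of the desired array: rows[0][0] is the label and rows[0][1] is its value.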

<strong> is a common tag, so here is something more specific to the sample format you provided:
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}

How to get Contents from HTML string in Array

I am working with some html contents. The format of the HTML is like below.
<li>
<ul>
<li>Test1</li>
<li>Test2</li>
</ul>
Odd string 1
<ul>
<li>Test3</li>
<li>Test4</li>
</ul>
Odd string 2
<ul>
<li>Test5</li>
<li>Test6</li>
</ul>
<li>
There can be multiple "odd string" in html content. So I want all the "odd string" in array. Is there any easy way ? (I am using C# and HtmlAgilityPack)

Select ul elements and refer to next sibling node, which will be your text:
HtmlDocument html = new HtmlDocument();
html.Load(html_file);
var odds = from ul in html.DocumentNode.Descendants("ul")
let sibling = ul.NextSibling
where sibling != null &&
sibling.NodeType == HtmlNodeType.Text && // check if text node
!String.IsNullOrWhiteSpace(sibling.InnerHtml)
select sibling.InnerHtml.Trim();

Something like:
MatchCollection matches = Regex.Matches(HTMLString, "</ul>.*?<ul>", RegexOptions.Singleline);
foreach (Match match in matches)
{
String oddstring = match.ToString().Replace("</ul>","").Replace("<ul>","");
}

Get all the ul descendants and check whether the next sibling node is of type HtmlNodeType.Text and is not empty:
List<string> oddStrings = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode ul in doc.DocumentNode.Descendants("ul"))
{
HtmlNode nextSibling = ul.NextSibling;
if (nextSibling != null && nextSibling.NodeType == HtmlNodeType.Text)
{
string trimmedText = nextSibling.InnerText.Trim();
if (!String.IsNullOrEmpty(trimmedText))
{
oddStrings.Add(trimmedText);
}
}
}

Agility Pack can already query those text nodes directly:
var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/li[1]/text()");

Use this XPATH:
//body/li[1]/text()
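
Both XPath expressions above assume the fragment sits inside html/body and would also return the whitespace-only text nodes between the tags. A sketch of a variant that does not depend on the surrounding document is shown below; the names OddStrings and Extract are hypothetical. It selects text nodes that are direct children of an li containing a ul, and the [normalize-space()] predicate drops the whitespace-only ones.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OddStrings
{
    // Hypothetical helper: collects the loose text that sits between the nested <ul> lists.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // li[ul] matches only the outer <li> (the inner ones have no <ul> child);
        // text()[normalize-space()] keeps text nodes that are not pure whitespace.
        HtmlNodeCollection texts =
            doc.DocumentNode.SelectNodes("//li[ul]/text()[normalize-space()]");
        if (texts == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode text in texts)
        {
            result.Add(text.InnerText.Trim());
        }
        return result;
    }
}
```

OddStrings.Extract(html).ToArray() then gives the array asked for in the question.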
