Parsing images out of a list using HtmlAgilityPack

There's an HTML page like this:
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
<li class="liclass">
some html
<a href="blabla" class="first aclass">
some other html
</li>
...
I want to get all the hrefs from the li elements while keeping the relation between each li and its a tag: the first li maps to the first a tag, the second to the second, and so on.
I have this code, but it always returns the href of the same a tag:
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
    String href = node.SelectNodes("//a[@class='first aclass']")[0].Attributes["href"].Value;
}
How can I improve that code?

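You probably want to collect all the hrefs. Note that an XPath starting with // is evaluated from the document root even when it is called on a child node, which is why you keep getting the same a tag. Prefix it with a dot (.//a) to search only inside the current li:
string href = "";
foreach (var node in docu.DocumentNode.SelectNodes("//li[@class='liclass']"))
{
    // ".//" keeps the search inside the current li
    href += node.SelectNodes(".//a[@class='first aclass']")[0].Attributes["href"].Value + ",";
}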
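
To make the difference between //a and .//a concrete, here is a minimal sketch. It uses the built-in System.Xml.XmlDocument instead of HtmlAgilityPack so that it runs without extra packages; HtmlAgilityPack's SelectNodes and SelectSingleNode follow the same XPath rule. The sample markup, the href values, and the names RelativeXPathDemo and HrefPerItem are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical sample markup, made well-formed so XmlDocument can load it.
    public const string Markup = @"
<ul>
  <li class=""liclass""><a href=""first.html"" class=""first aclass"">one</a></li>
  <li class=""liclass""><a href=""second.html"" class=""first aclass"">two</a></li>
  <li class=""liclass""><a href=""third.html"" class=""first aclass"">three</a></li>
</ul>";

    // Evaluates 'anchorPath' against each li node and collects one href per li.
    public static List<string> HrefPerItem(string anchorPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);

        var hrefs = new List<string>();
        foreach (XmlNode li in doc.SelectNodes("//li[@class='liclass']"))
        {
            XmlNode a = li.SelectSingleNode(anchorPath);
            if (a != null)
            {
                hrefs.Add(a.Attributes["href"].Value);
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        // Absolute path: restarts at the document root for every li.
        Console.WriteLine(string.Join(",", HrefPerItem("//a[@class='first aclass']")));
        // prints: first.html,first.html,first.html

        // Relative path: stays inside the current li.
        Console.WriteLine(string.Join(",", HrefPerItem(".//a[@class='first aclass']")));
        // prints: first.html,second.html,third.html
    }
}
```

Collecting into a list rather than a comma-joined string also keeps the position of each href aligned with the position of its li.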

Related

How to get value of nested img src with Html Agility Pack?

I'm trying to get nested img src values with Html Agility Pack and have tried multiple things with no success. There are 17 img elements I need to grab, all of them nested, and I can't figure it out. Here is the bare-bones HTML; I need the value of src on the last line of each img tag:
<div class="largeTitle">
    <article class="articleItem" data-id="0000">
        <a href="#blank_link" class="img">
            <img class=" lazyloaded" data-src="#blank_link" alt="test" onerror="script"
                 src="image_link.jpg">
        </a>
    </article>
    <article class="articleItem" data-id="0001">
        <a href="#blank_link" class="img">
            <img class=" lazyloaded" data-src="#blank_link" alt="test" onerror="script"
                 src="image_link.jpg">
        </a>
    </article>
</div>

With the URL you mentioned in the comments, you can do:
var web = new HtmlWeb();
var doc = web.Load("https://www.investing.com/");
var images = doc.DocumentNode.SelectNodes("//*[contains(@class,'js-articles')]//a[@class='img']//img");
foreach (var image in images)
{
    string source = image.Attributes["data-src"].Value;
    string label = image.Attributes["alt"].Value;
    Console.WriteLine($"\"{label}\" {source}");
}

How to split ul list into List<string> with li using class or another attribute

I have an HTML ul list:
<ul>
<li class="ng-scope">Item 1</li>
<li class="ng-scope">Item 2</li>
<li class="ng-scope">Item 3</li>
</ul>
I want to convert it into a List<string> in C#. The li element may or may not have attributes, e.g. it can be <li class="ng-scope"> or just <li>.
I am currently doing it like this:
string patternUL = @"<(ul|ol)[\s]*[^\>]*>(<li[ a-z=""\\]*>.*?</li>)+?</\1>";
string trg = Regex.Replace(source, patternUL, (param) =>
{
foreach (Capture c in param.Groups[2].Captures)
{
output += $"{Regex.Replace(c.Value.Replace("&amp;", "&"), "<li>(.*?)</li>", "$1")}|";
}
//}
return output;
});
But the list is not split into a List<string>, because the input doesn't match the pattern.
If I pass a ul list whose li elements have no attributes, it works fine.

Parsing HTML with regex is not recommended. Use a library such as Html Agility Pack instead. With it, you can get the text of every <li> as a list like this:
var html = @"
<ul>
<li class=""ng-scope"">Item 1</li>
<li class=""ng-scope"">Item 2</li>
<li class=""ng-scope"">Item 3</li>
</ul>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Select() requires "using System.Linq;"
var list = new List<string>(doc.DocumentNode.SelectNodes("//li").Select(li => li.InnerText));

I suggest using HtmlAgilityPack to parse the HTML:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(File.ReadAllText("test.txt")); // a plain string works here too
foreach (var li in doc.DocumentNode.SelectNodes("//li")) // select li elements only
{
    output += li.InnerText; // do whatever you need with the text here
}
It captures the following texts:
Item 1
Item 2
Item 3

How to get first tag in a list?

Suppose I have the following list:
<div id='page_competition_1_block_competition_left_tree_2'>
<div>
<ul>
<li>
<a href="#" />
<ul>
<li>
<a href="#">
</li>
</ul>
</li>
<li>
<a href="#" />
...
How can I get the first a tag for each li?
I tried using:
HtmlNodeCollection compsLi = doc.DocumentNode
    .SelectNodes("div[@id='page_match_1_block_competition_left_tree_2']//div//ul/li[1]");
but this returns null.

You need to specify a single / instead of //, so:
HtmlNodeCollection compsLi = doc.DocumentNode.SelectNodes("div[@id='page_match_1_block_competition_left_tree_2']//ul/li/a");
Essentially:
/: selects direct children of the current node.
//: selects descendants at any depth; at the start of an expression it searches from the document root.
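
As a rough illustration of the / versus // distinction, the sketch below counts matches for both forms on a small nested list. It uses System.Xml.XmlDocument so it runs without extra packages; the markup, the id value "tree", and the names ChildVsDescendantDemo and Count are invented for the example.

```csharp
using System;
using System.Xml;

public static class ChildVsDescendantDemo
{
    // Hypothetical well-formed nested list, loosely modelled on the question.
    public const string Markup = @"
<div id=""tree"">
  <ul>
    <li>
      <a href=""#outer1"" />
      <ul>
        <li><a href=""#inner1"" /></li>
      </ul>
    </li>
    <li>
      <a href=""#outer2"" />
    </li>
  </ul>
</div>";

    // Returns how many nodes the given XPath expression selects.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Single slashes: only a elements that are direct children of the top-level li elements.
        Console.WriteLine(Count("/div[@id='tree']/ul/li/a")); // prints 2

        // Double slash: every a element anywhere below the div, including the nested one.
        Console.WriteLine(Count("/div[@id='tree']//a")); // prints 3
    }
}
```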

You should be able to iterate through the compsLi object you retrieve. Additionally, I don't think you need the [1] in your selector. Once you get the <li> item you should be able to do something like this:
foreach(var node in compsLi)
{
var aNode = node.SelectSingleNode("./a");
...
}

Html nodes issue with HtmlAgilityPack

I'm having trouble parsing this HTML with the HtmlAgilityPack library.
In this piece of code, I would like to retrieve only the URL (href) that belongs to uploaded.net, but I can't determine which URL belongs to it.
<div class='downloads' id='download_block'>
<h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class='parts'>
<li>
text here
</li>
</ul>
</div>
And this is what I have:
nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")
I can't just use an array-index to determine the position like:
nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node
...because the number of nodes and the position of each host could change.
Note also that the URLs do not contain the host names; they are redirects like:
http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--
What could I do, in C# or VB.NET?

This should do it, though it is untested:
doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value
It uses contains because the heading text may include extra whitespace.
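
Here is a small sketch of the following-sibling lookup that can be run without HtmlAgilityPack, using System.Xml.XmlDocument. The a elements and their href values are placeholders (the snippet in the question does not show them), and the names FollowingSiblingDemo and FirstHrefFor are hypothetical. Adding [1] after following-sibling::ul restricts the match to the nearest ul after the heading, which is an extra safeguard rather than part of the expression above.

```csharp
using System;
using System.Xml;

public static class FollowingSiblingDemo
{
    // Hypothetical markup: the anchors and href values are placeholders.
    public const string Markup = @"
<div class=""downloads"" id=""download_block"">
  <h4>uploadable.ch</h4>
  <ul class=""parts""><li><a href=""http://example.invalid/r/aaa"">part 1</a></li></ul>
  <h4>uploaded.net</h4>
  <ul class=""parts""><li><a href=""http://example.invalid/r/bbb"">part 1</a></li></ul>
  <h4>novafile.com</h4>
  <ul class=""parts""><li><a href=""http://example.invalid/r/ccc"">part 1</a></li></ul>
</div>";

    // Returns the first href under the heading whose text contains 'host',
    // or null when no such heading exists.
    // Assumption: 'host' contains no quote characters, since it is spliced into the XPath.
    public static string FirstHrefFor(string host)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);

        // following-sibling::ul[1] picks the nearest ul after the matching h4.
        string xpath = "//h4[contains(text(),'" + host + "')]/following-sibling::ul[1]//a/@href";
        XmlNode href = doc.SelectSingleNode(xpath);
        return href?.Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstHrefFor("uploaded.net")); // prints http://example.invalid/r/bbb
    }
}
```

Because the lookup is keyed on the heading text, it keeps working if the hosts are reordered or new ones are added.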

The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this approach lets you do so by computing the correct index dynamically.
void Main()
{
var xml = @"
<div class=""downloads"" id=""download_block"">
<h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
<h4>uploadable.ch</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>uploaded.net</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
<h4>novafile.com</h4>
<ul class=""parts"">
<li>
text here
</li>
</ul>
</div>";
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(xml);
var nav = xmlDocument.CreateNavigator();
var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
var text = xmlDocument.SelectSingleNode("//ul[" + index + "]//a/@href").InnerText;
Console.WriteLine(text);
}
Basically, it gets the index of the uploaded.net h4 and then uses that index to select the corresponding ul tag and read the URL from the anchor tag inside it.
Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.

Given the snippet you supplied, this will help you get started:
var page = "<div class=\"downloads\" id=\"download_block\"> <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5> <h4>uploadable.ch</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>uploaded.net</h4> <ul class=\"parts\"> <li> text here </li> </ul> <h4>novafile.com</h4> <ul class=\"parts\"> <li> text here </li> </ul></div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploadable"));
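foreach (var node in nodes)
{
    // NextSibling.NextSibling skips the whitespace text node between the h4 and its ul
    var attr = node.NextSibling.NextSibling.Descendants().Where(x => x.Name == "a").FirstOrDefault().Attributes["href"];
    attr.Value.Dump(); // Dump() is a LINQPad extension method; use Console.WriteLine(attr.Value) elsewhere
}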

Regex to catch the parent element of an <li> tag

Simply put I have HTML that looks like this:
<ul>
<li>Unorderd Item 1</li>
<li>Unordered Item 2</li>
<li>Unordered Item 3
<ol>
<li>Ordered Item 1</li>
<li>Ordered Item 2</li>
</ol>
</li>
<li>Unordered Item 4</li>
</ul>
I'm looking for a regular expression or some logic of that nature that replaces the <li> tag with something depending on what its parent list element is.
I can use a plain regex or, more likely, the .NET System.Text.RegularExpressions class, so:
Regex.Replace
Regex.Matches
<-- I know I could/should be using an HTML parser, but this is being used in conjunction with an XSLT config doc, so using a regex seems to be the best way to go. -->
Desired Output:
<ul>
<Unordered>Unordered Item 1</Unordered>
<Unordered>....</Unordered>
<ol>
<Ordered>......</Ordered>
<Ordered>......</Ordered>
</ol>
<Unordered>.....</Unordered>
</ul>

I would use HtmlAgilityPack for this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlString);
foreach (var li in doc.DocumentNode.Descendants("li"))
{
if (li.ParentNode.Name == "ul") li.Name = "Unordered";
if (li.ParentNode.Name == "ol") li.Name = "Ordered";
}
var newHtml = doc.DocumentNode.OuterHtml;
Output:
<ul>
<unordered>Unorderd Item 1</unordered>
<unordered>Unordered Item 2</unordered>
<unordered>Unordered Item 3
<ol>
<ordered>Ordered Item 1</ordered>
<ordered>Ordered Item 2</ordered>
</ol>
</unordered>
<unordered>Unordered Item 4</unordered>
</ul>
