Traverse the dom with CsQuery

I'm trying to learn how to use CsQuery to traverse the DOM and extract specific text.
The HTML looks like this:
<div class="featured-rows">
<div class="row">
<div class="featured odd" data-genres-filter="MA0000002613">
<div class="album-cover">
<div class="artist">
Half Japanese
</div>
<div class="title">
<div class="label"> Joyful Noise </div>
<div class="styles">
<div class="rating allmusic">
<div class="rating average">
<div class="headline-review">
</div>
<div class="featured even" data-genres-filter="MA0000002572, MA0000002613">
</div>
<div class="row">
<div class="row">
<div class="row">
My code attempt looks like this:
public void GetRows()
{
var artistName = string.Empty;
var html = GetHtml("http://www.allmusic.com/newreleases");
var rows = html.Select(".featured-rows");
foreach(var row in rows)
{
var odd = row.Cq().Find(".featured odd");
foreach(var artist in odd)
{
artistName = artist.Cq().Text();
}
}
}
The first select for .featured-rows works, but I don't know how to get from there down to .artist to read its text.

You could try something like this:
var html = GetHtml("http://www.allmusic.com/newreleases");
var query = CQ.Create(html);
var row = query[".artist>a"];
string link = row.Attr("href");
string text = row.Text();
CsQuery is a port of jQuery, so you can search for jQuery examples and adapt them.
UPDATE:
To iterate over the featured blocks and get each artist link and title:
var rows = query[".featured.odd"];
foreach (var row in rows)
{
var artistLink = row.Cq().Find(".artist>a");
var title = row.Cq().Find(".title");
// do whatever you need with these here
}

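List<string> artists = html[".featured .artist a"].Select(dom => dom.TextContent).ToList();
Here html is your CQ object.

A space in a selector is the descendant combinator, so ".featured odd" looks for an <odd> element inside .featured. To match a single element that has both classes, join them without a space. This line:
var odd = row.Cq().Find(".featured odd");
should be
var odd = row.Cq().Find(".featured.odd");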
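
The selector fix can be tried in isolation with a small sketch like the one below. It assumes the CsQuery NuGet package is referenced; the markup is a trimmed stand-in for the page in the question, and "Another Artist" is a made-up value added only to show that the even block is skipped.

```csharp
using System;
using CsQuery;

class ArtistDemo
{
    static void Main()
    {
        // Trimmed-down stand-in for the markup in the question (hypothetical data).
        const string html = @"
<div class=""featured-rows"">
  <div class=""row"">
    <div class=""featured odd""><div class=""artist"">Half Japanese</div></div>
    <div class=""featured even""><div class=""artist"">Another Artist</div></div>
  </div>
</div>";

        CQ dom = CQ.Create(html);

        // .featured.odd  = one element carrying both classes;
        // the space before .artist = any descendant of that element.
        foreach (IDomObject artist in dom[".featured.odd .artist"])
        {
            Console.WriteLine(artist.Cq().Text().Trim());
        }
    }
}
```

Putting the whole path into one selector avoids the nested loops from the question; the nested form works too once the inner selector is ".featured.odd".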

Related

get div information with html agility pack

Hi, I want to process information on an HTML page. With the following code I can get the information.
This is the order in which the values are received:
new-link-1
new-link-2
new-link-3
But when it reaches the new-link-no-title section, the order breaks and changes to:
new-link-3
new-link-1
new-link-2
At the end, the program stops with an ArgumentOutOfRangeException.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = await web.LoadFromWebAsync(Link);
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex())
{
var x = item.SelectNodes("//div[@class='new-link-2']")[index].InnerText;
var xx = item.SelectNodes("//div[@class='new-link-3']//a")[index];
MessageBox.Show(item.InnerText);
MessageBox.Show(x);
MessageBox.Show(xx.Attributes["href"].Value);
}
And the HTML:
<div id="new-link">
<ul>
<li>
<div class="new-link-1"> فصل پنجم</div>
<div class="new-link-2"> تکمیل شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
<li class="new-link-no-titel">
<div class="new-link-1"> فصل ششم</div>
<div class="new-link-2"> درحال پخش</div>
<div class="new-link-3">
<i class="fa fa-arrow-down" title="حال پخش">
</i>
</div>
</li>
<li>
<div class="new-link-1"> قسمت 1</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلودلینک مستقیم
</div>
</li>
<li>
<div class="new-link-1"> قسمت 7</div>
<div class="new-link-2"> پخش شده</div>
<div class="new-link-3">
دانلود با لینک مستقیم
</div>
</li>
</ul>
</div>

This is what I found to be the issue with your code.
foreach ((var item, int index) in doc.DocumentNode.SelectNodes(".//div[@class='new-link-1']").WithIndex()) // -> gives 4 values for index
item.SelectNodes("//div[@class='new-link-2']") // -> this produces 4 nodes
item.SelectNodes("//div[@class='new-link-3']//a") // -> this produces only 3 nodes
Issue:
When you search with //div, you search all nodes in the document, not just those under the item you are currently on. Because one <li> contains no a element, the third list is one entry short: the links fall out of step with the other two lists, and the last index throws ArgumentOutOfRangeException.
Solution/Suggestion: Your current code searches all a elements starting from the root node. If you prefix the expression with a dot instead, only the descendants of the current node are considered. (Excerpt from here)
foreach (HtmlNode item in doc.DocumentNode.SelectNodes(".//li"))
{
try
{
var x0 = item.SelectSingleNode(".//div[@class='new-link-1']");
var x = item.SelectSingleNode(".//div[@class='new-link-2']");
var xx = item.SelectSingleNode(".//a");
MessageBox.Show(x0.InnerText);
MessageBox.Show(x.InnerText);
if (xx.Attributes["href"] != null)
MessageBox.Show(xx.Attributes["href"].Value);
}
catch { }
}
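
The difference between // and .// can be seen on a tiny document. This is a minimal sketch assuming the HtmlAgilityPack NuGet package; the two-item list and its hrefs are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathDemo
{
    static void Main()
    {
        // Hypothetical markup: the second <li> has no link.
        const string html = @"<ul>
<li><div class='new-link-1'>first</div><a href='/one'>one</a></li>
<li><div class='new-link-1'>second</div></li>
</ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode secondLi = doc.DocumentNode.SelectNodes("//li")[1];

        // "//a" starts at the document root, even though it is called on secondLi.
        var fromRoot = secondLi.SelectNodes("//a");
        // ".//a" starts at secondLi; nothing matches, so SelectNodes returns null.
        var fromItem = secondLi.SelectNodes(".//a");

        Console.WriteLine(fromRoot?.Count ?? 0);
        Console.WriteLine(fromItem?.Count ?? 0);
    }
}
```

With this input the first count should be 1 (the link from the other list item) and the second 0, which is the mismatch that made the indexed lookups in the question drift.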

How To Get Div inside Div htmlagilitypack

First, sorry about my bad English.
My question is: how can I scrape a div inside a div with HtmlAgilityPack in C#?
This is the test HTML:
<html>
<div class="all_ads">
<div class="ads__item">
<div class="test">
test 1
</div>
</div>
<div class="ads__item">
<div class="test">
test 2
</div>
</div>
<div class="ads__item">
<div class="test">
test 3
</div>
</div>
</div>
</html>
How can I write a loop that gets all the ads, and then a loop that reads the test div inside each ad?

You can select all the ads__item nodes inside the all_ads div as follows:
var res = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
//div[@class='all_ads']/div[@class='ads__item'] selects every div with class ads__item that is a direct child of the div with class all_ads. Here doc is your loaded HtmlDocument.

You can use this path: //div[contains(@class, 'test')]
This selects the divs whose class attribute contains test, that is, the inner div of each ads__item.
Then take the inner HTML of each selected div, like this:
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

class Program
{
static void Main(string[] args)
{
string html = File.ReadAllText(@"Path to your html file");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Select every div whose class attribute contains "test" and take its trimmed inner HTML.
var innerContent = doc.DocumentNode.SelectNodes("//div[contains(@class, 'test')]").Select(x => x.InnerHtml.Trim());
foreach (var item in innerContent)
Console.WriteLine(item);
Console.ReadLine();
}
}
Output:
test 1
test 2
test 3
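
The question asks for an outer loop over the ads and an inner lookup per ad, which neither answer spells out. A minimal sketch of that shape, assuming the HtmlAgilityPack NuGet package and a shortened copy of the test HTML:

```csharp
using System;
using HtmlAgilityPack;

class NestedDivDemo
{
    static void Main()
    {
        // Shortened copy of the test HTML from the question.
        const string html = @"<html><div class=""all_ads"">
<div class=""ads__item""><div class=""test"">test 1</div></div>
<div class=""ads__item""><div class=""test"">test 2</div></div>
</div></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Outer loop: every ads__item directly under all_ads.
        var ads = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
        if (ads == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode ad in ads)
        {
            // Inner query: the leading "./" keeps the search inside the current ad.
            HtmlNode test = ad.SelectSingleNode("./div[@class='test']");
            if (test != null)
                Console.WriteLine(test.InnerText.Trim());
        }
    }
}
```

Keeping the inner query relative to ad is what lets you read several fields per ad without the values of different ads getting mixed up.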

Get specific href values or link from email which is parsed as html in c#

I am processing emails in my C# service and need to extract certain links from them to store in a database. I am using HtmlAgilityPack. The div and p tags turn out to be interchangeable in the parsed email. I have to extract the links that appear below the labels 'Scheduler link', 'Data path' and 'Link'. After cleanup, a sample looks like this:
<html>
<body>
......//contains some other tags which i dont need, may include hrefs but
//i dont need them
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Scheduler link :</div>
<div align="justify" style="margin:0;"></div>
<div style="margin:0;"><a href="https://something.com/requests/26428">
https://something.com/requests/26428</a>
</div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div align="justify" style="margin:0;">Data path :</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a>
</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a>
</div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Link :</div>
<div align="justify" style="margin:0;"><a
href="https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y">
This is some text</a></div>
<div align="justify" style="margin:0 0 5pt 0;">This is another text</div>
......//contains some other tags which i dont need
</body>
</html>
I locate the div for 'Scheduler link', 'Data path' and 'Link' using regular expressions as follows:
HtmlNode schedulerLink = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["scheduler"]).Value.ToString() + "')]]");
HtmlNode dataPath = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["datapath"]).Value.ToString() + "')]]");
HtmlNode link = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["link"]).Value.ToString() + "')]]");
These calls return the respective nodes. The number of links under each of the three labels varies from email to email, and so does the order of the labels. I need to capture the links under each label in a list. I am using the following code:
foreach (HtmlNode link in schedulerLink.Descendants())
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!(link.InnerText.Contains("\r\n")))
{
if (link.InnerText.Contains("/"))
{
schedulersList.Add(link.InnerText.Trim());
}
}
}
Descendants() sometimes does not return the correct number of nodes. Also, how do I get the links for the three labels into three separate lists, given that Descendants() returns all the nodes below?

If I understand correctly, you want to capture the content of the first href attribute after a specific string such as "Scheduler link". I don't know HtmlAgilityPack, but my approach would be to search the email body with a regex like this:
Scheduler link(?:\s|\S)*?href="([^"]+)
This regex captures the content of the first href attribute after every occurrence of "Scheduler link" in the mail.
You can try it here: Regex101
To find the other types of links, replace the Scheduler link part with the respective string.
I hope this is helpful.
Additional info about the regex:
Scheduler link matches the string literally
(?:\s|\S)*?href=" lazily matches any characters, including newlines, up to the first occurrence of the literal string href="
([^"]+) captures everything up to, but not including, the next " character
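
In C#, the pattern above could be wrapped roughly as follows. This is a sketch only: FirstHrefAfter is a hypothetical helper name, and the sample body is a shortened, partly invented stand-in for the email (the file:///share/data path is made up).

```csharp
using System;
using System.Text.RegularExpressions;

class FirstHrefDemo
{
    // Hypothetical helper: returns the first href value that follows the label, or null.
    static string FirstHrefAfter(string body, string label)
    {
        // Regex.Escape keeps characters in the label from being read as regex syntax.
        string pattern = Regex.Escape(label) + @"(?:\s|\S)*?href=""([^""]+)";
        Match m = Regex.Match(body, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Shortened stand-in for the email body.
        string body =
            "<div>Scheduler link :</div>" +
            "<div><a href=\"https://something.com/requests/26428\">x</a></div>" +
            "<div>Data path :</div>" +
            "<div><a href=\"file:///share/data\">y</a></div>";

        Console.WriteLine(FirstHrefAfter(body, "Scheduler link"));
        Console.WriteLine(FirstHrefAfter(body, "Data path"));
    }
}
```

Note that this only returns the first link after each label, so a label followed by two links, as "Data path" is in the sample email, would need a different approach.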

Since the hrefs in your question differ from each other,
one way of doing it is to match on a distinctive part of each href:
var html = @"<html> <body> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Scheduler link :</div> <div align='justify' style='margin:0;'></div> <div style='margin:0;'><a href='https://something.com/requests/26428'> https://something.com/requests/26428</a> </div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div align='justify' style='margin:0;'>Data path :</div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a> </div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a> </div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Link :</div> <div align='justify' style='margin:0;'><a href='https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y'> This is some text</a></div> <div align='justify' style='margin:0 0 5pt 0;'>This is another text</div> </body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var schedulerNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"something\")]");
var dataPathNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"mycompany\")]");
var linkNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"Thisisanotherlink\")]");
foreach (var item in schedulerNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in dataPathNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in linkNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
Hope that helps!
EDIT:
var result = document.DocumentNode.SelectNodes("//div//text()[normalize-space()] | //a");
// select all textnodes and a tags
string sch = "Scheduler link :";
string dataLink = "Data path :";
string linkpath = "Link :";
foreach (var item in result)
{
if (item.InnerText.Trim().Contains(sch))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(sch)).Skip(1);
// skip everything up to and including the Scheduler label
Debug.WriteLine("====================Scheduler link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
// if href then add to list TODO
if (subitem.InnerText.Contains(dataLink)) // break when data link appears.
{
break;
}
}
}
if (item.InnerText.Trim().Contains(dataLink))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(dataLink)).Skip(1);
Debug.WriteLine("====================Data link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
if (subitem.InnerText.Contains(linkpath)) // stop when the Link label appears
{
break;
}
}
}
if (item.InnerText.Trim().Contains("Link :"))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(linkpath)).Skip(1);
Debug.WriteLine("====================Link=========================");
foreach (var subitem in processResult)
{
var hrefValue = subitem.GetAttributeValue("href", "");
Debug.WriteLine(hrefValue);
if (subitem.InnerText.Contains(dataLink))
{
break;
}
}
}
}
I have described the logic in the code comments.
Hope that helps.
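
Because the labels can arrive in any order, another option is a single pass that remembers the most recent label and files each link under it. The following is a rough sketch under assumptions, not a drop-in solution: it needs the HtmlAgilityPack NuGet package, the sample HTML is shortened and partly invented (file:///share/one, file:///share/two and https://example.com/page are made up), and the label strings must match the trimmed div text exactly.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkGroupingDemo
{
    static void Main()
    {
        // Shortened, partly hypothetical stand-in for the email body.
        const string html = @"<html><body>
<div>Scheduler link :</div>
<div><a href='https://something.com/requests/26428'>request</a></div>
<div>Data path :</div>
<div><a href='file:///share/one'>one</a></div>
<div><a href='file:///share/two'>two</a></div>
<div>Link :</div>
<div><a href='https://example.com/page'>page</a></div>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One list per label; the label text is the dictionary key.
        var links = new Dictionary<string, List<string>>
        {
            ["Scheduler link :"] = new List<string>(),
            ["Data path :"] = new List<string>(),
            ["Link :"] = new List<string>(),
        };

        string current = null; // label we are currently collecting for

        // div or p, since the question says the two are interchangeable.
        foreach (HtmlNode block in doc.DocumentNode.SelectNodes("//div | //p"))
        {
            string text = block.InnerText.Trim();
            if (links.ContainsKey(text))
            {
                current = text; // a new label starts a new group
                continue;
            }
            if (current == null) continue; // links before the first label are ignored

            foreach (HtmlNode a in block.Descendants("a"))
            {
                string href = a.GetAttributeValue("href", string.Empty);
                if (href.Length > 0) links[current].Add(href);
            }
        }

        foreach (var pair in links)
            Console.WriteLine(pair.Key + " " + string.Join(", ", pair.Value));
    }
}
```

Two limitations to keep in mind: links in nested div elements would be visited once per enclosing div, and any links that follow the last label, including unrelated ones further down the email, end up in that label's list unless you add a stop condition.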

C#:Regex How to Match specific div close tag but the last close tag?

For example:
<div id="outer">
<div id="a">
<div class="b"> 11111111111</div>
<div class="b"> 22222222222222</div>
</div>
</div>
Now I want to match the element whose id is a and replace it with an empty string, but I can't, because the div with id="a" is not the outermost div.
This is my C# code; it matches through to the last closing tag.
Regex regex = new Regex(@"<div id=""a([\s\S]*) (<\/[div]>+)");

Try this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var divs = doc.DocumentNode.Descendants().Where(x => x.Name == "div" && x.Id == "a");
foreach (var div in divs.ToArray())
{
div.InnerHtml = "";
}
var result = doc.DocumentNode.OuterHtml;
The result I get is:
<div id="outer">
<div id="a"></div>
</div>
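
If the goal is to drop the div with id a itself rather than just empty it, HtmlNode.Remove() can be used instead of clearing InnerHtml. A minimal sketch, assuming the HtmlAgilityPack NuGet package and a compact copy of the HTML from the question:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class RemoveDivDemo
{
    static void Main()
    {
        // Compact copy of the example HTML.
        const string html = @"<div id=""outer""><div id=""a""><div class=""b"">1</div><div class=""b"">2</div></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() snapshots the matches so removing nodes does not disturb the enumeration.
        foreach (HtmlNode div in doc.DocumentNode.Descendants("div").Where(d => d.Id == "a").ToList())
        {
            div.Remove(); // drops the element together with its children
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

After the loop only the outer div should remain in the serialized output. Either way, letting a parser find the matching close tag avoids the nesting problem that the regex runs into.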

Non-Sequential List in Model

I have a list of Guitar objects in my view model.
public List<Guitar> Guitars { get; set; }
The user can add these by clicking a button (thanks to jQuery clone()). I noticed that if they remove the first list item ([0]), the model comes back with a null list, and if they remove something in the middle of the list, such as [1], the model only contains item [0].
I can see in the raw request that all of the items are present, so I guess I have two choices. Maybe someone has a different approach?
1. Operate on the raw Request array in the controller like this:
[HttpPost]
public ActionResult Index(CustomerViewModel customer)
{
var guitars = new List<Guitar>();
var listValues = new List<string>();
var numGuitars = 0;
//Loop through all Request keys in the POST
foreach (string key in Request.Form.AllKeys)
{
//Save any that are part of the Guitar object
if (key.StartsWith("Guitars["))
{
listValues.Add(key);
}
}
//Guitar object has 3 properties so divide by 3 to get total object count
numGuitars = (int)Math.Ceiling(listValues.Count / 3.0);
for (int i = 0; i < numGuitars; i++)
{
var guitarMake = Request["Guitars[" + i + "].Make"];
var guitarModel = Request["Guitars[" + i + "].Model"];
var guitarProductonYear = Request["Guitars[" + i + "].ProductionYear"];
if (!String.IsNullOrEmpty(guitarMake) &&
!String.IsNullOrEmpty(guitarModel) &&
!String.IsNullOrEmpty(guitarProductonYear))
{
var g = new Guitar
{
Make = guitarMake,
Model = guitarModel,
ProductionYear = Int32.Parse(guitarProductonYear)
};
guitars.Add(g);
}
}
2. When a user deletes an item, use jQuery to reassign the list indices so they stay sequential.
3. Anything else?
Form HTML
<div id="guitars_1" style="display: block;">
<input type="text" value="" name="Guitars[0].Make" id="Guitars_0__Make" placeholder="Make">
<input type="text" value="" name="Guitars[0].Model" id="Guitars_0__Model" placeholder="Model">
</div>
<div id="guitars_2" style="display: block;">
<input type="text" value="" name="Guitars[1].Make" id="Guitars_1__Make" placeholder="Make">
<input type="text" value="" name="Guitars[1].Model" id="Guitars_1__Model" placeholder="Model">
</div>
<div id="guitars_3" style="display: block;">
<input type="text" value="" name="Guitars[2].Make" id="Guitars_2__Make" placeholder="Make">
<input type="text" value="" name="Guitars[2].Model" id="Guitars_2__Model" placeholder="Model">
</div>
<!-- Start Add Guitar Row Template -->
<div style="display:none">
<div id="guitarsTemplate">
<div class="formColumn1"><label>Guitar</label></div>
<div class="formColumn2">@Html.TextBoxFor(model => model.Guitars[0].Make, new { Placeholder = "Make" })
<div class="messageBottom">
@Html.ValidationMessageFor(model => model.Guitars[0].Make)
</div>
</div>
<div class="formColumn3">@Html.TextBoxFor(model => model.Guitars[0].Model, new { Placeholder = "Model" })
<div class="messageBottom">
@Html.ValidationMessageFor(model => model.Guitars[0].Model)
</div>
</div>
<div class="formColumn4">@Html.TextBoxFor(model => model.Guitars[0].ProductionYear, new { Placeholder = "Production Year" })
<div class="messageBottom">
@Html.ValidationMessageFor(model => model.Guitars[0].ProductionYear)
</div><a class="icon delete">Delete</a>
</div>
</div>
</div>
<!-- End Add Guitar Row Template -->
JS that clones and deletes items
$(document).ready(function() {
var uniqueId = 1;
var ctr = 0;
$(function() {
$('.js-add-guitar-hyperlink').click(function() {
var copy = $("#guitarsTemplate").clone(true).appendTo("#addGuitarSection").hide().fadeIn('slow');
var guitarDivId = 'guitars_' + uniqueId;
var copyText = copy.html();
copyText = copyText.replace(/Guitars\[0\]/g, 'Guitars[' + ctr + ']');
copyText = copyText.replace('Guitars_0', 'Guitars_' + ctr);
copy.html(copyText);
$('#guitarsTemplate').attr('id', guitarDivId);
var deleteLink = copy.find("a.icon.delete");
deleteLink.on('click', function() {
copy.fadeOut(300, function() { $(this).remove(); }); //fade out the removal
});
$('#' + guitarDivId).find('input').each(function() {
//$(this).attr('id', $(this).attr('id') + '_' + uniqueId);
// $(this).attr('name', $(this).attr('name') + '_' + uniqueId);
});
uniqueId++;
ctr++;
});
});
});

For this kind of dynamic list management in MVC, you could do worse than take a look at the BeginCollectionItem HtmlHelper:
https://www.nuget.org/packages/BeginCollectionItem/
https://github.com/danludwig/BeginCollectionItem
http://blog.stevensanderson.com/2010/01/28/editing-a-variable-length-list-aspnet-mvc-2-style/
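
If you stay with option 1 from the question (reading the raw form keys), note that its loop assumes the indices run from 0 to n-1, which is exactly what breaks after a deletion. A sketch that reads the indices actually posted instead is shown below; PostedIndices is a hypothetical helper, and the key list (including "CustomerName") is invented sample input standing in for Request.Form.AllKeys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class GuitarIndexDemo
{
    // Hypothetical helper: returns the distinct indices found in keys such as "Guitars[2].Make".
    static List<int> PostedIndices(IEnumerable<string> keys, string prefix)
    {
        var pattern = new Regex("^" + Regex.Escape(prefix) + @"\[(\d+)\]\.");
        return keys
            .Select(k => pattern.Match(k))
            .Where(m => m.Success)
            .Select(m => int.Parse(m.Groups[1].Value))
            .Distinct()
            .OrderBy(i => i)
            .ToList();
    }

    static void Main()
    {
        // Keys as they might arrive after the user deleted rows 0 and 2 (sample data).
        var keys = new[]
        {
            "Guitars[1].Make", "Guitars[1].Model", "Guitars[1].ProductionYear",
            "Guitars[3].Make", "Guitars[3].Model", "Guitars[3].ProductionYear",
            "CustomerName"
        };

        foreach (int i in PostedIndices(keys, "Guitars"))
            Console.WriteLine("Guitars[" + i + "]");
    }
}
```

The controller loop could then iterate over these indices instead of counting keys and dividing by three, so gaps in the numbering no longer cause rows to be skipped.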
