search for node using contains in c# - c#

I am trying to find all nodes whose id starts with searchResult (searchResult1, searchResult2, and so on up to searchResult10) in HTML input from my C# program. Here's my code:
var results = hdoc.DocumentNode
.Descendants("div")
.Where(x => x.Attributes.Contains("id") &&
x.Attributes["id"].Value.Contains("\"searchResult")).ToList();
for (int i = 0; i < results.Count; i++)
{
rawdata[i] = results[i].InnerHtml.Trim();
}
My HTML looks like this:
<div id="searchResultTable" class="searchReturnData"> some junk html
<li id="searchResult1" class="searchResult searchResultsData_OFF"> searchResult1 html </li>
<li id="searchResult2" class="searchResult searchResultsData_OFF">searchResult2 html </li>
<li id="searchResult3" class="searchResult searchResultsData_OFF">searchResult3 html </li>
</div>
I want to print only the HTML of searchResult1, searchResult2, and searchResult3, not the junk HTML. How can I do this?
Thanks
Rashmi

If you can use the HtmlAgilityPack to parse the HTML, you can do something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");
var root = doc.DocumentNode;
var a_nodes = root.Descendants("li").Where(c=>c.GetAttributeValue("id","")
.Contains("searchResult")).ToList();
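Since the question asks for ids that start with searchResult, a prefix check may be a closer fit than Contains. The following is a minimal sketch, not a definitive implementation; the class name SearchResultFilter and the method GetResultHtml are hypothetical names introduced only for illustration, and the HTML is assumed to be available as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SearchResultFilter
{
    // Hypothetical helper: returns the trimmed inner HTML of every <li>
    // whose id attribute starts with idPrefix.
    public static List<string> GetResultHtml(string html, string idPrefix = "searchResult")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("li")
            // GetAttributeValue falls back to "" when the id attribute is missing,
            // so no separate null check is needed here.
            .Where(li => li.GetAttributeValue("id", "")
                           .StartsWith(idPrefix, StringComparison.Ordinal))
            .Select(li => li.InnerHtml.Trim())
            .ToList();
    }
}
```

Because only li elements are inspected, the outer div with id searchResultTable is never matched, even though its id also starts with searchResult.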

Related

HTMLAgilityPacker How to get just href value when a contains other HTMLAttributes

I know the question sounds odd, and maybe my head is just at the point of exploding. I am trying to convert some relative URLs to absolute URLs, and the method below works fine. Except that my hrefs contain values of href.
Here is that example which is returned as att.Value from the HtmlAttribute att property ="yuimenubaritemlabel" href="/cgi-bin/Dis :
public string FixLinksForWebsite(string Html, string baseURl)
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(Html);
foreach (HtmlNode link in htmlDoc.DocumentNode.SelectNodes("//a[@href]"))
{
string startLinkText = link.OuterHtml;
HtmlAttribute att = link.Attributes["href"];
if (!att.Value.StartsWith("/"))
continue;
att.Value = $"{baseURl}{att.Value}";
htmlDoc.Text = htmlDoc.Text.Replace(startLinkText, link.OuterHtml);
}
return htmlDoc.Text;
}
Here is the HTML <li class="yuimenubaritem"><a class="yuimenubaritemlabel" href="/cgi-bin/DisplayMenu.pl?Reports&id=-1">Reports <div class="spritedownarrow"></div></a></li>
My logic is pretty simple, any help on how to get the href in a href using HTMLAgilityPack would be fantastic.
Here is a full HTML example
<HTML>
<body>
<li class="yuimenubaritem"><a class="yuimenubaritemlabel" href="/cgi-bin/DisplayMenu.pl?Reports&id=-1">Reports <div class="spritedownarrow"></div></a></li>
</body>
</HTML>
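One possible approach, sketched here under assumptions rather than as a confirmed fix: instead of replacing substrings in htmlDoc.Text, change the href attribute on the parsed node and serialize the document back out. The class name LinkFixer and the method MakeLinksAbsolute are hypothetical, and the guess that the string replacement is what goes wrong has not been verified against the original page.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkFixer
{
    // Hypothetical helper: prefixes every root-relative href with baseUrl.
    public static string MakeLinksAbsolute(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return html;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");
            if (!href.StartsWith("/", StringComparison.Ordinal))
                continue; // leave absolute and other non-root-relative links alone

            // Only the href attribute is touched; class and other attributes stay as they are.
            link.SetAttributeValue("href", baseUrl + href);
        }

        // Serialize the modified node tree instead of patching the original text.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Working on the attribute directly avoids depending on link.OuterHtml matching the original source text character for character.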

Cannot get content of specific div with html agility pack

I'm using HTML Agility Pack to take some data from a website, and I've run into a small problem. I want to get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
The problem is that there are two divs with the classes container middle and details clearfix; I want to get the content of only the specific div pasted above. This div has a dl tag for each tag.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[@class='container middle']");
This returns the wrong result, specifically this:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
this is the complete source code.
Well, you could do the following, for this particular web-page:
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and then work with the HtmlNode via matchDetails[1]. To retrieve other data, you can use similar XPath queries, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
var name = dt[i].InnerHtml;
var value = dd[i].InnerHtml;
Console.WriteLine(name + ": " + value);
}
Of course, you need to add checks against null references (for example, when a node is not found).
Querying for the div with class details clearfix should return the target div element. There is one crucial detail to be aware of, though: a . before / is needed to make the XPath relative to the context element referenced by infoDiv. Otherwise the XPath is evaluated against the root document context (as if it were called on doc.DocumentNode instead of on infoDiv):
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[@class='details clearfix']");
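To make the difference between // and .// concrete, here is a small sketch on made-up markup. The class name XPathContextDemo, the Sample string, and the ids a and b are assumptions for illustration only and do not come from the soccerway page.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathContextDemo
{
    // Hypothetical markup: two divs, each holding a span with the same class.
    public const string Sample =
        "<div id='a'><span class='x'>first</span></div>" +
        "<div id='b'><span class='x'>second</span></div>";

    public static (string Absolute, string Relative) Compare(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode second = doc.DocumentNode.SelectSingleNode("//div[@id='b']");

        // "//" searches from the document root, even though it is called on 'second'.
        string absolute = second.SelectSingleNode("//span[@class='x']").InnerText;

        // ".//" searches only among the descendants of 'second'.
        string relative = second.SelectSingleNode(".//span[@class='x']").InnerText;

        return (absolute, relative);
    }
}
```

Calling XPathContextDemo.Compare(XPathContextDemo.Sample) yields ("first", "second"): the query without the leading dot picks up the span from the first div, which is the same kind of wrong match described in the question.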

HTML Agility Pack - Get div without class or id (C#)

I've been trying to follow some solutions here on StackOverflow but I need some help.
This is the source HTML:
<div class="myclass">
<div style="font-size:2em;"> STRING_N1 </div>
<div> STRING_N2 </div>
</div>
And this is my current code:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlcode);
var res = doc.DocumentNode.SelectNodes("//div[@class='myclass']");
foreach (var item in res)
{
var firstDiv = item.SelectSingleNode("div");
var content1 = firstDiv.ChildNodes[0].InnerText.Trim();
richTextBox1.AppendText(content1.ToString());
}
So far so good: I can extract "STRING_N1" without a problem. However, I can't figure out how to extract STRING_N2, since its div has no class or id.
Thank you.
You can use LINQ to get descendant divs:
var divs = doc.DocumentNode.SelectNodes("//div[@class='myclass']")
.SelectMany(x => x.Descendants("div"));
var contents = divs.Select(x => x.InnerText.Trim());
richTextBox1.AppendText(string.Join(Environment.NewLine, contents));

li in htmlagilitypack c#

I want to get label and strong values from the following li
<div class="property-summary">
<h3>Listing summary</h3>
<ul>
<li>
<label>Reference</label>
<strong>BR-S-4301</strong>
</li>
<li>
<label>Type</label>
<strong>Apartment</strong>
</li>
<li>
<label>City</label>
<strong>Dubai</strong>
</li>
<li>
<label>Community</label>
<strong>Palm Jumeirah</strong>
</li>
<li>
<label>Subcommunity</label>
<strong>Tiara Residences</strong>
</li>
</ul>
</div>
Here is my c# code
var dataNode = rootNode.SelectNodes("//div[normalize-space(@class)='property-summary']");
Now how do I get the values? The line below is not working for me:
var Node = dataNode .SelectSingleNode(".//li/strong");
There are a couple of ways to do it.
1
var labelNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/label");
var strongNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li/strong");
foreach (var node in labelNodes)
{
Debug.WriteLine(node.InnerText.Trim());
}
foreach (var node in strongNodes)
{
Debug.WriteLine(node.InnerText.Trim());
}
2
var liNodes = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']/ul/li");
foreach (var node in liNodes)
{
Debug.WriteLine(node.SelectSingleNode("label").InnerText.Trim());
Debug.WriteLine(node.SelectSingleNode("strong").InnerText.Trim());
}
Check that the nodes exist before using them in any real code.
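As a rough sketch of what such checks could look like, the second variant can be wrapped so that missing nodes are skipped instead of throwing. The class name ListingSummary and the method Parse are hypothetical; the only library behavior relied on is that SelectNodes and SelectSingleNode return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ListingSummary
{
    // Hypothetical helper: maps each <label> text to the <strong> text in the same <li>.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var liNodes = doc.DocumentNode.SelectNodes(
            "//div[normalize-space(@class)='property-summary']/ul/li");
        if (liNodes == null)
            return result; // no matching list items in this document

        foreach (HtmlNode li in liNodes)
        {
            HtmlNode label = li.SelectSingleNode("label");
            HtmlNode strong = li.SelectSingleNode("strong");
            if (label == null || strong == null)
                continue; // skip incomplete items rather than throwing

            result[label.InnerText.Trim()] = strong.InnerText.Trim();
        }
        return result;
    }
}
```

Keeping each label paired with its strong value in one pass also avoids relying on two separate node lists staying in the same order.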
If you want to get all the label tags, you can use
IEnumerable<HtmlNode> labels = dataNode.Descendants("label");
And same for strong tags
IEnumerable<HtmlNode> strongs = dataNode.Descendants("strong");
You can also use:
var dataNode = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='property-summary']")[0];
HtmlNodeCollection strongs = dataNode.SelectNodes(".//li/strong");
HtmlNodeCollection labels = dataNode.SelectNodes(".//li/label");
To get text from strongs or labels use:
foreach (var strong in strongs)
{
string strongText = strong.InnerText.Trim();
}
You may also consider switching to one of these HTML parsing libraries, which provide jQuery-like selector features:
http://nsoup.codeplex.com/
http://github.com/jamietre/csquery

Html Agility Pack: how to parse a webresponse and get a specified html element in c#

I googled my problem and found Html Agility Pack for parsing HTML in C#. But there are no good examples, and I can't get it to work for my purpose. I have an HTML document, and it has a part like this:
<div class="pray-times-holder">
<div class="pray-time">
<div class="labels">
Time1:</div>
04:28:24
</div>
<div class="pray-time">
<div class="labels">
Time2:</div>
06:04:41
</div>
</div>
I want to get the value for Time1 and Time2. e.g. Time1 has value 04:28:24 and Time2 has value 06:04:41 and I want to get these values. Can you help me please?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode
.Descendants("div")
.Where(n => n.Attributes["class"] != null && n.Attributes["class"].Value == "pray-time")
.Select(n => n.InnerText.Replace("\r\n","").Trim())
.ToArray();
This console application code:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class = 'labels']"))
{
Console.WriteLine(node.NextSibling.InnerText.Trim());
}
will output this:
04:28:24
06:04:41
