I want to show a specific section of an HTML page in a text box in a WP7 app (C#). After a bit of searching online I found this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("http://www.positief-project.be/?p=532");
string links = doc.DocumentNode
.Descendants("section")
.Where(section => section.Attributes["class"] != null &&
section.Attributes["class"].Value == "article-content").ToString();
txbContent.Text = links;
This doesn't give an error, but doesn't work either. How can I make it show in the text box?
Is jQuery an option?
HTML
<div class="section">
<div class="article-content">some foo 1</div>
<div class="article-content">some foo 2</div>
<div class="article-content">some foo 3</div>
<div class="article-content">some foo 4</div>
</div>
<br>
<input type="text" id="tbContent" />
jQuery
$(document).ready(function () {
var content = ''; // start from an empty string so the first += does not prepend "undefined"
$('.article-content').each(function(i, obj){
content += obj.innerHTML;
});
$('#tbContent').val(content);
});
See this fiddle http://jsfiddle.net/rodhartzell/Fk2xM/
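If you would rather stay in C# with HtmlAgilityPack, here is a minimal sketch of what might work. It relies on two things: LoadHtml parses a string of markup (it does not download a URL), and calling ToString() on the result of Where(...) gives you the name of the enumerable type rather than the node's text. The class name ArticleContentDemo, the helper ExtractArticleText and the sample markup are made up for illustration; in the app the HTML string would have to come from an HTTP download first.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class ArticleContentDemo
{
    // Hypothetical helper: returns the text of the first
    // <section class="article-content">, or "" if there is none.
    static string ExtractArticleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // expects markup, not a URL

        var section = doc.DocumentNode
            .Descendants("section")
            .FirstOrDefault(s => s.GetAttributeValue("class", "") == "article-content");

        return section == null ? string.Empty : section.InnerText.Trim();
    }

    static void Main()
    {
        // Stand-in markup (an assumption); replace with the downloaded page source.
        string html = "<html><body><section class=\"article-content\">Hello</section></body></html>";
        Console.WriteLine(ExtractArticleText(html));
    }
}
```

The returned string could then be assigned to txbContent.Text.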
My question is: how can I scrape a div nested inside another div with HtmlAgilityPack in C#?
This is my test HTML:
<html>
<div class="all_ads">
<div class="ads__item">
<div class="test">
test 1
</div>
</div>
<div class="ads__item">
<div class="test">
test 2
</div>
</div>
<div class="ads__item">
<div class="test">
test 3
</div>
</div>
</div>
</html>
How do I write a loop that gets all the ads, and then an inner loop that reads the test div inside each ad?
You can select all the ads__item nodes inside all_ads as follows:
var res = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
//div[@class='all_ads']/div[@class='ads__item'] selects every div with class ads__item that is a direct child of the div with class all_ads.
Use this path: //div[contains(@class, 'test')]
It selects every div whose class attribute contains test.
Then read the inner HTML of each selected div, like this:
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string html = File.ReadAllText(@"Path to your html file");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Select every div whose class contains "test" and take its trimmed inner HTML.
        var innerContent = doc.DocumentNode.SelectNodes("//div[contains(@class, 'test')]").Select(x => x.InnerHtml.Trim());
        foreach (var item in innerContent)
            Console.WriteLine(item);
        Console.ReadLine();
    }
}
Output:
test 1
test 2
test 3
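If you need the two loops the question asks for (one over the ads, one over the test divs inside each ad), a sketch along these lines may help. The class name NestedLoopDemo and the inline sample markup are assumptions for illustration; the XPath expressions follow the class names from the question.

```csharp
using System;
using HtmlAgilityPack;

class NestedLoopDemo
{
    static void Main()
    {
        // Shortened stand-in for the HTML in the question.
        string html =
            "<div class='all_ads'>" +
            "<div class='ads__item'><div class='test'>test 1</div></div>" +
            "<div class='ads__item'><div class='test'>test 2</div></div>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Outer loop: every ad directly inside the all_ads container.
        var ads = doc.DocumentNode.SelectNodes("//div[@class='all_ads']/div[@class='ads__item']");
        if (ads == null) return; // SelectNodes returns null when nothing matches

        foreach (var ad in ads)
        {
            // Inner loop: the leading "." keeps the search relative to this ad.
            var tests = ad.SelectNodes(".//div[@class='test']");
            if (tests == null) continue;

            foreach (var test in tests)
                Console.WriteLine(test.InnerText.Trim());
        }
    }
}
```

Without the leading "." the inner XPath would start again from the document root and return the test divs of every ad on each pass.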
I am processing emails in my C# service. I need to extract certain links from them and add them to a DB. I am using HtmlAgilityPack. The div and p tags turn out to be interchangeable in the parsed email. I have to extract the links that appear below the labels 'Scheduler Link', 'Data Path' and 'Link' in the email. After cleaning it up, a sample looks like this:
<html>
<body>
......//contains some other tags which i dont need, may include hrefs but
//i dont need them
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Scheduler link :</div>
<div align="justify" style="margin:0;"></div>
<div style="margin:0;"><a href="https://something.com/requests/26428">
https://something.com/requests/26428</a>
</div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div style="margin:0;"></div>
<div align="justify" style="margin:0;">Data path :</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a>
</div>
<div align="left" style="text-align:justify;margin:0;"><a
href="file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui">
\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a>
</div>
<div align="justify" style="margin:0;"></div>
<div align="justify" style="margin:0;">Link :</div>
<div align="justify" style="margin:0;"><a
href="https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y">
This is some text</a></div>
<div align="justify" style="margin:0 0 5pt 0;">This is another text</div>
......//contains some other tags which i dont need
</body>
</html>
I am looking for the div tag of 'Scheduler Link', 'Data Path' and 'Link' using regular expressions as follows :
HtmlNode schedulerLink = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["scheduler"]).Value.ToString() + "')]]");
HtmlNode dataPath = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["datapath"]).Value.ToString() + "')]]");
HtmlNode link = doc.DocumentNode.SelectSingleNode("//*[text()[contains(.,'" + Regex.Match(body, _keyValuePairs["link"]).Value.ToString() + "')]]");
The div tags are returning me the respective nodes. The number of links present against the three in each email varies and so does the order of the tags. I need to capture the links against each in a list. I am using the following code :
foreach (HtmlNode link in schedulerLink.Descendants())
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!(link.InnerText.Contains("\r\n")))
{
if (link.InnerText.Contains("/"))
{
schedulersList.Add(link.InnerText.Trim());
}
}
}
Descendants() sometimes does not return the correct number of nodes. Also, how do I get the links for each of the 3 labels into 3 different lists, since Descendants() usually returns all the nodes below?
If I understand correctly, you want to capture the content of the first href attribute after a specific string such as Scheduler link. I don't know HtmlAgilityPack, but my approach would be to search the email body with a regex like this:
Scheduler link(?:\s|\S)*?href="([^"]+)
This regex should capture the content of the first href attribute after every occurrence of "Scheduler link" in the mail.
You can try it here: Regex101
To find the other types of links just replace the Scheduler link part with the respective string.
I hope this is helpful.
Additional info about the regex:
Scheduler link matches the string literally
(?:\s|\S)*?href=" is a lazy non-capturing group that matches any characters (including newlines) up to the first occurrence of the literal string href="
([^"]+) captures everything up to, but not including, the next " character
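In C# the pattern above might be applied roughly as follows. This is only a sketch: the class RegexHrefDemo, the helper FirstHrefAfter and the shortened sample body are made up for illustration, and Regex.Escape is added so that a label containing regex metacharacters is still matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexHrefDemo
{
    // Hypothetical helper: returns the first href value that follows
    // the given label, or null if no such href is found.
    static string FirstHrefAfter(string body, string label)
    {
        // Same pattern as above, with the label swapped in.
        string pattern = Regex.Escape(label) + @"(?:\s|\S)*?href=""([^""]+)";
        Match m = Regex.Match(body, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Shortened stand-in for the email body (an assumption).
        string body =
            "<div>Scheduler link :</div>" +
            "<div><a href=\"https://something.com/requests/26428\">x</a></div>" +
            "<div>Data path :</div>" +
            "<div><a href=\"file:///share/one\">y</a></div>";

        Console.WriteLine(FirstHrefAfter(body, "Scheduler link"));
        Console.WriteLine(FirstHrefAfter(body, "Data path"));
    }
}
```

Note that this only returns the first link after each label; a label followed by several links would need Regex.Matches or a different approach.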
Since the hrefs in your question differ from one another,
one way of doing it is as follows:
var html = @"<html> <body> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Scheduler link :</div> <div align='justify' style='margin:0;'></div> <div style='margin:0;'><a href='https://something.com/requests/26428'> https://something.com/requests/26428</a> </div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div style='margin:0;'></div> <div align='justify' style='margin:0;'>Data path :</div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\jui\tui245.5t_2rtfg_tyui</a> </div> <div align='left' style='text-align:justify;margin:0;'><a href='file:///\\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui'> \\mycompany.com\ABC\OPQ1234\tui245.5t_2rtfg_tyui</a> </div> <div align='justify' style='margin:0;'></div> <div align='justify' style='margin:0;'>Link :</div> <div align='justify' style='margin:0;'><a href='https://Thisisanotherlink.abcdef/sites/this/498592/rkjfb/3874y'> This is some text</a></div> <div align='justify' style='margin:0 0 5pt 0;'>This is another text</div> </body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var schedulerNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"something\")]");
var dataPathNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"mycompany\")]");
var linkNodes = document.DocumentNode.SelectNodes("//a[contains(@href, \"Thisisanotherlink\")]");
foreach (var item in schedulerNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in dataPathNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
foreach (var item in linkNodes)
{
Debug.WriteLine(item.GetAttributeValue("href", ""));
Debug.WriteLine(item.InnerText);
}
Hope that helps !!
EDIT ::
var result = document.DocumentNode.SelectNodes("//div//text()[normalize-space()] | //a");
// select all textnodes and a tags
string sch = "Scheduler link :";
string dataLink = "Data path :";
string linkpath = "Link :";
foreach (var item in result)
{
if (item.InnerText.Trim().Contains(sch))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(sch)).Skip(1);
// skip results until we reach the Scheduler label.
Debug.WriteLine("====================Scheduler link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
// if href then add to list TODO
if (subitem.InnerText.Contains(dataLink)) // break when data link appears.
{
break;
}
}
}
if (item.InnerText.Trim().Contains(dataLink))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(dataLink)).Skip(1);
Debug.WriteLine("====================Data link=========================");
foreach (var subitem in processResult)
{
Debug.WriteLine(subitem.GetAttributeValue("href", ""));
if (subitem.InnerText.Contains(linkpath)) // break when the next label ("Link :") appears.
{
break;
}
}
}
if (item.InnerText.Trim().Contains("Link :"))
{
var processResult = result.SkipWhile(x => !x.InnerText.Trim().Equals(linkpath)).Skip(1);
Debug.WriteLine("====================Link=========================");
foreach (var subitem in processResult)
{
var hrefValue = subitem.GetAttributeValue("href", "");
Debug.WriteLine(hrefValue);
if (subitem.InnerText.Contains(dataLink))
{
break;
}
}
}
}
I have explained the logic in the code comments.
Hope that helps
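Because the order of the labels varies between emails, another option is a single pass in document order that remembers the most recent label and files each href under it. The following is a sketch only: LinkGrouper and GroupLinks are hypothetical names, and it assumes each label sits alone in its own text node, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkGrouper
{
    // Hypothetical helper: groups href values under the most recent
    // label seen while walking the document from top to bottom.
    public static Dictionary<string, List<string>> GroupLinks(string html, params string[] labels)
    {
        var groups = new Dictionary<string, List<string>>();
        foreach (var label in labels)
            groups[label] = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string current = null; // no label seen yet, so leading links are ignored
        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                string text = node.InnerText.Trim();
                foreach (var label in labels)
                {
                    if (text.Equals(label, StringComparison.OrdinalIgnoreCase))
                        current = label;
                }
            }
            else if (node.Name == "a" && current != null)
            {
                string href = node.GetAttributeValue("href", string.Empty);
                if (href.Length > 0)
                    groups[current].Add(href);
            }
        }
        return groups;
    }
}
```

Usage would look like LinkGrouper.GroupLinks(html, "Scheduler link :", "Data path :", "Link :"). One limitation: any unrelated link that appears after the last label is filed under that label, so you may still need a stop condition for the trailing part of the email.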
I have tried many approaches but couldn't solve my problem. I don't know how to write the node path for the SelectSingleNode method. I build an XPath in my C# code to reach my node, but when I run the code I get a NullReferenceException because of that path. How should I build the path, or is there another solution?
This is an example of the HTML:
<html>
<body>
<div id="container">
<div id="box">
<div class="box">
<div class="boxContent">
<div class="userBox">
<div class="userBoxContent">
<div class="userBoxElement">
<ul id ="namePart">
<li>
<span class="namePartContent">
</span>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
And this is my C# code:
namespace AgilityTrial
{
class Program
{
static void Main(string[] args)
{
Uri url = new Uri("https://....");
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string path = @"//html/body/div[@id='container']/div[@id='classifiedDetail']"+
"/div[@class='classifiedDetail']/div[@class='classifiedDetailContent']"+
"/div[@class='classifiedOtherBoxes']/div[@class='classifiedUserBox']"+
"/div[@class='classifiedUserContent']/ul[@id='phoneInfoPart']/li"+
"/span[@class='pretty-phone-part show-part']";
var tds = doc.DocumentNode.SelectSingleNode(path);
var date = tds.InnerHtml;
Console.WriteLine(date);
}
}
}
Take your namePartContent span node as an example. To fetch its content you would simply do this:
doc.DocumentNode.SelectSingleNode(".//span[@class='namePartContent']")?.InnerText;
This searches for a single span node whose class is namePartContent, beginning at the root node (in your case <html>). The ?. operator yields null instead of throwing a NullReferenceException when no node matches.
For example:
<div id="outer">
<div id="a">
<div class="b"> 11111111111</div>
<div class="b"> 22222222222222</div>
</div>
</div>
Now I want to match the element whose id is a and replace its content with nothing, but I found I can't, because id="a" is not the outermost div.
This is my C# code; it matches all the way to the last closing tag.
Regex regex = new Regex(@"<div id=""a([\s\S]*) (<\/[div]>+)");
Try this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var divs = doc.DocumentNode.Descendants().Where(x => x.Name == "div" && x.Id == "a");
foreach (var div in divs.ToArray())
{
div.InnerHtml = "";
}
var result = doc.DocumentNode.OuterHtml;
The result I get is:
<div id="outer">
<div id="a"></div>
</div>
I am trying to use webBrowser1 to get hold of a download link via its href, but the problem is that I must find it using its class name.
<body>
<iframe scrolling="no" frameborder="0" allowtransparency="true" tabindex="0" name="twttrHubFrame" style="position: absolute; top: -9999em; width: 10px; height: 10px;" src="http://platform.twitter.com/widgets/hub.html">
<div id="main">
<div id="header">
<div style="float:left;">
<div id="content">
<h1 style="background-image:url('http://static.mp3skull.com/img/bgmen.JPG'); background-repeat:repeat-x;">Rush Mp3 Download</h1>
<a id="bitrate" onclick="document.getElementById('ofrm').submit(); return false;" rel="nofollow" href="">
<form id="ofrm" method="POST" action="">
<div id="song_html" class="show1">
<div class="left">
<div id="right_song">
<div style="font-size:15px;">
<div style="clear:both;"></div>
<div style="float:left;">
<div style="float:left; height:27px; font-size:13px; padding-top:2px;">
<div style="float:left; width:27px; text-align:center;">
<div style="margin-left:8px; float:left;">
<a style="color:green;" target="_blank" rel="nofollow" href="http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3">Download</a>
</div>
<div style="margin-left:8px; float:left;">
<div style="margin-left:8px; float:left;">
<div style="clear:both;"></div>
</div>
<div id="player155580779" class="player" style="float:left; margin-left:10px;"></div>
</div>
<div style="clear:both;"></div>
</div>
<div style="clear:both;"></div>
</div>
I searched all over Google, but I only found PHP examples.
I understand you would do something along the lines of this:
HtmlElement downloadlink = webBrowser1.Document.GetElementById("song_html").All[0];
URL = downloadlink.GetAttribute("href");
but I do not understand how to do it by the class "show1".
Please point me in the right direction with examples and/or a website I can visit so I can learn how to do this as I searched and have no clue.
EDIT: I pretty much need the href link ("http://dc182.4shared.com/img/1011303409/865387c9/dlink__2Fdownload_2F6QmedN8H_3Ftsid_3D20111211-54337-a79f8d10/preview.mp3"), so how would I obtain it?
There is nothing built-in in the WebBrowser control to retrieve an element by class name. Since you know it is going to be an a element the best you can do is get all a elements and search for the one you want:
var links = webBrowser1.Document.GetElementsByTagName("a");
foreach (HtmlElement link in links)
{
if (link.GetAttribute("className") == "show1")
{
//do something
}
}
Extension method for HtmlDocument
It returns a list of the elements that have the given tag and whose className matches the given class name.
Either argument can be left empty, so it can also select elements by tag only or by class name only.
internal static class Utils
{
internal static List<HtmlElement> getElementsByTagAndClassName(this HtmlDocument doc, string tag = "", string className = "")
{
List<HtmlElement> lst = new List<HtmlElement>();
bool empty_tag = String.IsNullOrEmpty(tag);
bool empty_cn = String.IsNullOrEmpty(className);
if (empty_tag && empty_cn) return lst;
HtmlElementCollection elmts = empty_tag ? doc.All : doc.GetElementsByTagName(tag);
if (empty_cn)
{
lst.AddRange(elmts.Cast<HtmlElement>());
return lst;
}
for (int i = 0; i < elmts.Count; i++)
{
if (elmts[i].GetAttribute("className") == className)
{
lst.Add(elmts[i]);
}
}
return lst;
}
}
Usage:
WebBrowser wb = new WebBrowser();
List<HtmlElement> lst_div = wb.Document.getElementsByTagAndClassName("div");// all div elements
List<HtmlElement> lst_err_elmnts = wb.Document.getElementsByTagAndClassName(String.Empty, "error"); // all elements with "error" class
List<HtmlElement> lst_div_err = wb.Document.getElementsByTagAndClassName("div", "error"); // all div's with "error" class
Following these answers, I wrote my own method to hide divs by class name.
I'm sharing it for anyone who needs it.
public void HideDivByClassName(WebBrowser browser, string classname)
{
if (browser.Document != null)
{
var byTagName = browser.Document.GetElementsByTagName("div");
foreach (HtmlElement element in byTagName)
{
if (element.GetAttribute("className") == classname)
{
element.Style = "display:none";
}
}
}
}