I am attempting to use the HTML Agility Pack to look up specific keywords on Google, then walk through the linked nodes until it finds my website's URL string, then parse the InnerHtml of the node I am on for my Google ranking.
I am relatively new to the Agility Pack (as in, I started really looking through it yesterday), so I was hoping I could get some help with it. When I do the search below, my XPath queries fail every time, even if I use something as simple as SelectNodes("//*[@id='rso']"). Am I doing something incorrectly?
private void GoogleScrape(string url)
{
    string[] keys = keywordBox.Text.Split(',');
    for (int i = 0; i < keys.Count(); i++)
    {
        var raw = "http://www.google.com/search?num=100&q=";
        string search = raw + HttpUtility.UrlEncode(keys[i]);
        var webGet = new HtmlWeb();
        var document = webGet.Load(search);
        loadtimeBox.Text = webGet.RequestDuration.ToString();
        var ranking = document.DocumentNode.SelectNodes("//*[@id='rso']");
        if (ranking != null)
        {
            googleBox.Text = "Something";
        }
        else
        {
            googleBox.Text = "Fail";
        }
    }
}
It's not the Agility Pack's fault; it's Google being tricky. If you inspect the _text field of the HtmlDocument in the debugger, you'll find that the <ol> that has id='rso' when you inspect it in a browser does not have any attributes in the downloaded markup, for some reason.
I think in this case you can just search by "//ol", because there is only one <ol> tag in Google's result page at the moment...
UPDATE: I've done further checks. For example when I do this:
using (StreamReader sr =
    new StreamReader(HttpWebRequest
        .Create("http://www.google.com/search?num=100&q=test")
        .GetResponse()
        .GetResponseStream()))
{
    string s = sr.ReadToEnd();
    var m2 = Regex.Matches(s, "\\sid=('[^']+'|\"[^\"]+\")");
    foreach (var x in m2)
        Console.WriteLine(x);
}
The only ids that are returned are: "sflas", "hidden_modes" and "tbpr_12".
To conclude: I've used Html Agility Pack and it's coped pretty well even with malformed html (unclosed <p> and even <li> tags etc.).
Related
Okay, so I have this list of URLs on this webpage, and I am wondering how to grab the URLs and add them to an ArrayList.
http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A
I only want the URLs that are in the list; look at the page to see what I mean. I tried doing it myself, and for whatever reason it takes all of the other URLs except for the ones I need.
http://pastebin.com/a7hJnXPP
Using Html Agility Pack:
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}
If you want only the ones in the list, then the following code should work (this assumes you have already loaded the page into an HtmlDocument called animePage):
List<string> hrefList = new List<string>(); //Make a list cause lists are cool.
foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    //Append animenewsnetwork.com to the beginning of the href value and add it
    // to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}
Breaking the XPath //a[contains(@href, 'id=')] down:
//a Select all <a> nodes...
[contains(@href, 'id=')] ... that have an href attribute whose value contains the text id=.
That should be enough to get you going.
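To see the @href predicate in isolation, here is a minimal sketch that runs the XPath over a small, well-formed sample. It uses System.Xml.Linq from the base class library instead of the Html Agility Pack, purely so that it runs without extra packages; the XPathDemo class and the sample markup are made up for illustration, and the real page would still need an HTML parser because it is not well-formed XML.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathDemo
{
    // Hypothetical sample markup, not taken from the real page.
    public const string Sample =
        "<div>" +
        "<a href='/encyclopedia/anime.php?id=1'>First</a>" +
        "<a href='/about'>About</a>" +
        "<a href='/encyclopedia/anime.php?id=2'>Second</a>" +
        "</div>";

    // Returns absolute URLs for every <a> whose href contains "id=".
    public static List<string> IdLinks(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.XPathSelectElements("//a[contains(@href, 'id=')]")
                  .Select(a => "http://www.animenewsnetwork.com" + (string)a.Attribute("href"))
                  .ToList();
    }
}
```

Calling XPathDemo.IdLinks(XPathDemo.Sample) keeps the two anime.php links and skips /about, because its href does not contain id=.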
As an aside, I would suggest not listing each link in its own messagebox considering there are around 500 links on that page. 500 links = 500 messageboxes :(
I'm currently writing a very basic program that will first go through the HTML of a website to find all RSS links, then put those links into an array and parse the content of each link into an existing XML file.
However, I'm still learning C# and I'm not that familiar with all the classes yet. I have done all this in PHP by writing my own class with file_get_contents() and also by using cURL to do the work, and I managed it in Java as well. Anyhow, I'm trying to accomplish the same results using C#, but I think I'm doing something wrong here.
TL;DR: What's the best way to write the regex to find all RSS links on a website?
So far, my code looks like this:
private List<string> getRSSLinks(string websiteUrl)
{
    List<string> links = new List<string>();
    MatchCollection collection = Regex.Matches(websiteUrl, @"(<link.*?>.*?</link>)", RegexOptions.Singleline);
    foreach (Match singleMatch in collection)
    {
        string text = singleMatch.Groups[1].Value;
        Match matchRSSLink = Regex.Match(text, @"type=\""(application/rss+xml)\""", RegexOptions.Singleline);
        if (matchRSSLink.Success)
        {
            links.Add(text);
        }
    }
    return links;
}
Don't use regex to parse HTML; use an HTML parser instead. See this link for the explanation.
I prefer HtmlAgilityPack for parsing HTML:
using (var client = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(client.DownloadString("http://www.xul.fr/en-xml-rss.html"));
    var rssLinks = doc.DocumentNode.Descendants("link")
        .Where(n => n.Attributes["type"] != null && n.Attributes["type"].Value == "application/rss+xml")
        .Select(n => n.Attributes["href"].Value)
        .ToArray();
}
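Separately from the parser advice, one reason the pattern in the question can never match is that the + in application/rss+xml is a regex quantifier, so rss+xml means "rs, one or more s, then xml". The sketch below only illustrates that point with Regex.Escape; the RssTypePattern class and the sample tag in the usage note are hypothetical, and a parser remains the better tool for the actual job.

```csharp
using System.Text.RegularExpressions;

public static class RssTypePattern
{
    public const string MimeType = "application/rss+xml";

    // Unescaped: "s+" is read as "one or more s", so the literal '+' is never matched.
    public static bool MatchesUnescaped(string input) =>
        Regex.IsMatch(input, "type=\"application/rss+xml\"");

    // Escaped: Regex.Escape turns '+' into "\+", so the MIME type is matched literally.
    public static bool MatchesEscaped(string input) =>
        Regex.IsMatch(input, "type=\"" + Regex.Escape(MimeType) + "\"");
}
```

For a tag such as <link rel="alternate" type="application/rss+xml" href="/feed.xml">, MatchesUnescaped returns false and MatchesEscaped returns true. Note also that the question's code runs its regex over websiteUrl, the address itself, rather than over the downloaded HTML.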
I'm using HTML Agility Pack, and I'm trying to replace the InnerText of some tags like this:
protected void GerarHtml()
{
    List<string> labels = new List<string>();
    string patch = @"C:\EmailsMKT\" +
        Convert.ToString(Session["ssnFileName"]) + ".html";
    DocHtml.Load(patch);
    //var titulos = DocHtml.DocumentNode.SelectNodes("//*[@class='lblmkt']");
    foreach (HtmlNode titulo in
        DocHtml.DocumentNode.SelectNodes("//*[@class='lblmkt']"))
    {
        titulo.InnerText.Replace("test", lbltitulo1.Text);
    }
    DocHtml.Save(patch);
}
The HTML:
<div><label id="titulo1" class="lblmkt">teste</label></div>
Strings are immutable (you should be able to find much documentation on this).
Methods of the String class do not alter the instance, but rather create a new, modified string.
Thus, your call to:
titulo.InnerText.Replace("test", lbltitulo1.Text);
does not alter InnerText, but returns the string you want InnerText to be.
In addition, InnerText is read-only; you'll have to use Text as shown in Set InnerText with HtmlAgilityPack
Try the following line instead (assign the result of the string operation to the property again):
titulo.Text = titulo.Text.Replace("test", lbltitulo1.Text);
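The immutability point can be checked without any HTML at all. The following is a small sketch with made-up names (ImmutabilityDemo and the sample strings are not from the question) showing that a discarded Replace call leaves the original untouched, while assigning the result gives the new text.

```csharp
public static class ImmutabilityDemo
{
    // Returns the string after a discarded Replace call: it is unchanged.
    public static string ReplaceAndDiscard(string original)
    {
        original.Replace("test", "Title"); // the new string is thrown away
        return original;
    }

    // Returns the string after assigning the result of Replace.
    public static string ReplaceAndAssign(string original)
    {
        original = original.Replace("test", "Title");
        return original;
    }
}
```

With the input "test label", ReplaceAndDiscard returns "test label" and ReplaceAndAssign returns "Title label".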
I was able to get the result like this:
HtmlTextNode Hnode = null;
Hnode = DocHtml.DocumentNode.SelectSingleNode("//label[@id='titulo1']//text()") as HtmlTextNode;
Hnode.Text = lbltitulo1.Text;
I am developing a Windows Forms application which is interacting with a web site.
Using a WebBrowser control I am controlling the web site and I can iterate through the tags using:
HtmlDocument webDoc1 = this.webBrowser1.Document;
HtmlElementCollection aTags = webDoc1.GetElementsByTagName("a");
Now, I want to get a particular text from the tag which is below:
Show Assigned<br>
Here I want to get the number 244, which is the value of assignedto in the tag above, and save it into a variable for further use.
How can I do this?
You can try splitting a string by ';' values, and then each string by '=' like this:
string aTag = ...;
foreach(var splitted in aTag.Split(';'))
{
    if(splitted.Contains("="))
    {
        var leftSide = splitted.Split('=')[0];
        var rightSide = splitted.Split('=')[1];
        if(leftSide == "assignedto")
        {
            MessageBox.Show(rightSide); //It should be 244
            //Or...
            int num = int.Parse(rightSide);
        }
    }
}
Other option is to use Regexes, which you can test here: www.regextester.com. And some more info on regexes: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
Hope it helps!
If all cases are similar to this and you don't mind a reference to System.Web in your Windows Forms application, you can do something like this:
using System;

public class Program
{
    static void Main()
    {
        // Sample href text as it appears in the markup (ampersands HTML-encoded)
        string href = "issue?status=-1,1,2,3,4,5,6,7&amp;" +
            "@sort=-activity&amp;@search_text=&amp;@dispname=Show Assigned&amp;" +
            "@filter=status,assignedto&amp;@group=priority&amp;" +
            "@columns=id,activity,title,creator,status&amp;assignedto=244&amp;" +
            "@pagesize=50&amp;@startwith=0";
        href = System.Web.HttpUtility.HtmlDecode(href);
        var querystring = System.Web.HttpUtility.ParseQueryString(href);
        Console.WriteLine(querystring["assignedto"]);
    }
}
This is a simplified example, and first you need to extract the href attribute text, but that should not be complex. Once you have the href attribute text, you can take advantage of the fact that it is basically a query string and reuse the code in .NET that already parses query strings.
To complete the example, to obtain the href attribute text you could do:
HtmlElementCollection aTags = webBrowser.Document.GetElementsByTagName("a");
foreach (HtmlElement element in aTags)
{
    string href = element.GetAttribute("href");
}
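Putting the two pieces together, a slightly more defensive version might look like the sketch below. The HrefQuery class and its method names are made up for illustration. It cuts the href at the first '?' before parsing, since ParseQueryString only strips a leading '?' and would otherwise treat "issue?status" as the first key, and it uses int.TryParse so that a missing or non-numeric value does not throw.

```csharp
using System.Web;

public static class HrefQuery
{
    // Returns the named query parameter from an href, or null if it is absent.
    public static string GetParameter(string href, string name)
    {
        string decoded = HttpUtility.HtmlDecode(href);   // turns "&amp;" into "&"
        int q = decoded.IndexOf('?');
        string query = q >= 0 ? decoded.Substring(q + 1) : decoded;
        return HttpUtility.ParseQueryString(query)[name];
    }

    // Tries to read assignedto as an integer; returns false if it is missing or not numeric.
    public static bool TryGetAssignedTo(string href, out int id) =>
        int.TryParse(GetParameter(href, "assignedto"), out id);
}
```

Inside the foreach loop above, HrefQuery.TryGetAssignedTo(href, out int id) would set id to 244 for the link in the question.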
I want to use the HTML agility pack to parse tables from complex web pages, but I am somehow lost in the object model.
I looked at the link example, but did not find any table data this way.
Can I use XPath to get the tables? I am basically lost after having loaded the data as to how to get the tables. I have done this in Perl before and it was a bit clumsy, but worked. (HTML::TableParser).
I am also happy if one can just shed a light on the right object order for the parsing.
How about something like:
Using HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if you want:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new {Table = table.Id, CellText = cell.InnerText};
foreach(var cell in query) {
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
The simplest way I've found to get the XPath for a particular element is to install the Firebug extension for Firefox. Go to the page and press F12 to bring up Firebug, then right-click the element you want to query and select "Inspect Element"; Firebug will select that element in its inspector. Right-click the element in Firebug and choose "Copy XPath". This gives you the exact XPath query you need to get that element with the HTML Agility Pack.
I know this is a pretty old question but this was my solution that helped with visualizing the table so you can create a class structure. This is also using the HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
var table = doc.DocumentNode.SelectSingleNode("//table");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");
for (int i = 1; i < tableRows.Count; i++)
{
    for (int e = 0; e < columns.Count; e++)
    {
        var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
        Console.Write(columns[e].InnerText + ":" + value.InnerText);
    }
    Console.WriteLine();
}
In my case, there is a single table which happens to be a device list from a router. If you wish to read the table using TR/TH/TD (row, header, data) instead of a matrix as mentioned above, you can do something like the following:
List<TableRow> deviceTable = (from table in document.DocumentNode.SelectNodes(XPathQueries.SELECT_TABLE)
                              from row in table?.SelectNodes(HtmlBody.TR)
                              let rows = row.SelectSingleNode(HtmlBody.TR)
                              where row.FirstChild.OriginalName != null && row.FirstChild.OriginalName.Equals(HtmlBody.T_HEADER)
                              select new TableRow
                              {
                                  Header = row.SelectSingleNode(HtmlBody.T_HEADER)?.InnerText,
                                  Data = row.SelectSingleNode(HtmlBody.T_DATA)?.InnerText
                              }).ToList();
TableRow is just a simple object with Header and Data as properties.
The approach takes care of null-ness and this case:
<tr>
<td width="28%"> </td>
</tr>
which is row without a header. The HtmlBody object with the constants hanging off of it are probably readily deduced but I apologize for it even still. I came from the world where if you have " in your code, it should either be constant or localizable.
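Since the answer does not show its helper types, here is one guess at what they might look like. Every value below is an assumption inferred from how the names are used in the query (TR for rows, T_HEADER and T_DATA for cells, SELECT_TABLE for the table XPath); the author's real definitions may differ.

```csharp
// Simple holder for one header/data pair read from a table row.
public class TableRow
{
    public string Header { get; set; }
    public string Data { get; set; }
}

// Assumed element names; not taken from the original answer.
public static class HtmlBody
{
    public const string TR = "tr";
    public const string T_HEADER = "th";
    public const string T_DATA = "td";
}

// Assumed XPath for selecting every table in the document.
public static class XPathQueries
{
    public const string SELECT_TABLE = "//table";
}
```

With these definitions, the where clause keeps only rows whose first child is a th, so a row like the one above, which has only a td, is filtered out.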
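Line from the above answer:
HtmlDocument doc = new HtmlDocument();
This doesn't work for me in VS 2015 C#: I cannot construct an HtmlDocument. A likely cause in a Windows Forms project is that the name resolves to System.Windows.Forms.HtmlDocument, which has no public constructor, rather than to HtmlAgilityPack.HtmlDocument.
Writing the fully qualified HtmlAgilityPack.HtmlDocument, or loading the page through HtmlAgilityPack.HtmlWeb, should avoid the clash. Check out this link for some sample code.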