I have read that HtmlAgilityPack 1.4 is a great solution for scraping a web page. Being a new programmer, I am hoping I could get some input on this project.
I am doing this as a C# application. The page I am working with is fairly straightforward. The information I need is contained between just two tags, <table class="data"> and </table>.
My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send it to a SQL table.
One twist is that there is also a small PNG image that needs to be grabbed from the src="/partcode/number.
I do not have any completed code that works. I thought this bit of code would tell me if I am heading in the right direction, but even stepping through the debugger I can't see that it does anything. Could someone possibly point me in the right direction on this? The more detailed the better, since it is apparent I have a lot to learn.
Thank you; I would really appreciate it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
namespace Stats
{
    class PartParser
    {
        static void Main(string[] args)
        {
            try
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://localhost");
                // My understanding is this reads the entire page in?
                var tables = doc.DocumentNode.SelectNodes("//table");
                // I assume that this sets up the search for table elements
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();
            }
        }
    }
}
The web code is:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td>
<img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td>
<img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2009, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
</table>
</head>
</html>
The beginning part is off:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
LoadHtml(html) loads an HTML string into the document; I think you want something like this instead:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load("http://stackoverflow.com");
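To make the distinction concrete, here is a minimal sketch; the inline HTML string and the URL are just placeholders:

```csharp
using System;
using HtmlAgilityPack;

class LoadDemo
{
    static void Main()
    {
        // LoadHtml: parses markup that is already in memory as a string.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>hello</p></body></html>");
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p").InnerText); // prints "hello"

        // HtmlWeb.Load: downloads the page at the URL first, then parses it.
        // var webDoc = new HtmlWeb().Load("http://localhost");
    }
}
```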
Here is working code, based on the HTML source you provided. It could be factored further, and I'm not checking for null values (in rows, cells, or the values inside each case). If you have the page at 127.0.0.1, it will work. Just paste it inside the Main method of a console application and try to understand it.
HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");
var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
var cells = row.SelectNodes("./td");
string title = cells[0].InnerText;
var valueRow = cells[2];
switch (title)
{
case "Part-Num":
string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Part-Num:\t" + partNum);
break;
case "Manu-Number":
string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Manu-Num:\t" + manuNumber);
break;
case "Description":
string description = valueRow.InnerText;
Console.WriteLine("Description:\t" + description);
break;
case "Manu-Country":
string manuCountry = valueRow.InnerText;
Console.WriteLine("Manu-Country:\t" + manuCountry);
break;
case "Last Modified":
string lastModified = valueRow.InnerText;
Console.WriteLine("Last Modified:\t" + lastModified);
break;
case "Last Modified By":
string lastModifiedBy = valueRow.InnerText;
Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
break;
}
}
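The original goal was to send these values to a SQL table. As a sketch of that last step (the connection string and the Parts table with its column names are assumptions, not taken from the page), a parameterized insert with ADO.NET could look like this:

```csharp
using System.Data.SqlClient;

static class PartStore
{
    // Hypothetical schema: a Parts table with one column per scraped field.
    const string ConnectionString =
        @"Server=.\SQLEXPRESS;Database=Stats;Integrated Security=true";

    public static void SavePart(string partNum, string manuNumber, string description,
                                string manuCountry, string lastModified, string lastModifiedBy)
    {
        const string sql = @"INSERT INTO Parts
            (PartNum, ManuNumber, Description, ManuCountry, LastModified, LastModifiedBy)
            VALUES (@partNum, @manuNumber, @description, @manuCountry, @lastModified, @lastModifiedBy)";

        using (var connection = new SqlConnection(ConnectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            // Parameterized values avoid SQL injection and quoting problems.
            command.Parameters.AddWithValue("@partNum", partNum);
            command.Parameters.AddWithValue("@manuNumber", manuNumber);
            command.Parameters.AddWithValue("@description", description);
            command.Parameters.AddWithValue("@manuCountry", manuCountry);
            command.Parameters.AddWithValue("@lastModified", lastModified);
            command.Parameters.AddWithValue("@lastModifiedBy", lastModifiedBy);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

You would call SavePart once per page, passing the six strings collected in the switch above.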
Related
I need to pull part of the HTML from an external URL into another page using Agility Pack. I am not sure if I can select a node/element based on id or class name using Agility Pack. So far I have managed to pull the complete page, but I want to target a node/element with a specific id and all its contents.
protected void WebScrapper()
{
HtmlDocument doc = new HtmlDocument();
var url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
var webGet = new HtmlWeb();
doc = webGet.Load(url);
var baseUrl = new Uri(url);
//doc.LoadHtml(doc);
Response.Write(doc.DocumentNode.InnerHtml);
//Response.Write(doc.DocumentNode.Id("acceptance-list-container"));
//var innerContent = doc.DocumentNode.SelectNodes("/div").FirstOrDefault().InnerHtml;
}
When I use Response.Write(doc.DocumentNode.Id("acceptance-list-container")) it generates an error.
When I use the code below, it generates the error System.ArgumentNullException: Value cannot be null.
doc.DocumentNode.SelectNodes("/div[@id='acceptance-list-container']").FirstOrDefault().InnerHtml;
So far nothing works; if you fix one issue, another shows up.
The error you get indicates that the SelectNodes() call didn't find any nodes and returned null. In cases like this, it is useful to inspect the actual HTML by using doc.DocumentNode.InnerHtml.
Your code sample is somewhat messy, and you are probably trying to do too many things at once (what is Response.Write() for, for example?). Try to focus on one thing at a time if possible.
Here is a simple unit test that can get you started:
using HtmlAgilityPack;
using Xunit;
using Xunit.Abstractions;
namespace Scraping.Tests
{
public class ScrapingTests
{
private readonly ITestOutputHelper _outputHelper;
public ScrapingTests(ITestOutputHelper outputHelper)
{
_outputHelper = outputHelper;
}
[Fact]
public void Test()
{
const string url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
var webGet = new HtmlWeb();
HtmlDocument doc = webGet.Load(url);
string html = doc.DocumentNode.InnerHtml;
_outputHelper.WriteLine(html); // use this if you just want to print something
Assert.Contains("acceptance-list-container", html); // use this if you want to automate an assertion
}
}
}
When I tried that the first time, I got some HTML with an iframe. I visited the page in a browser and I was presented with a google captcha. After completing the captcha, I was able to view the page in the browser, but the HTML in the unit test was still different from the one I got in the browser.
Interestingly enough, the HTML in the unit test contains the following:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
It is obvious that this website has some security measures in place in order to block web scrapers. If you manage to overcome this obstacle and get the actual page's HTML in your program, parsing it and getting the parts that you need will be straightforward.
I'm fairly new to the HTML Agility Pack. I've been searching and trying many examples but haven't reached a conclusion yet; I must be doing something wrong, and I hope you can assist me.
My goal is to parse the latest news from a website, including the image, title, and date - pretty simple. I managed to get the image (the background attribute) from the div, but the divs are nested and for some reason I can't access their values. Here is my code:
using System;
using HtmlAgilityPack;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var html = @"https://pristontale.eu/";
HtmlWeb web = new HtmlWeb();
var doc = web.Load(html);
var news = doc.DocumentNode.SelectNodes("//div[contains(@class,'index-article-wrapper')]");
foreach (var item in news){
var image = Regex.Match(item.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
var title = item.SelectSingleNode("//div[#class='article-title']").InnerText;
var date = item.SelectSingleNode("//div[#class='article-date']").InnerText;
Console.WriteLine(image, title, date);
}
}
}
This is what the HTML looks like
<div class="index-article-wrapper" onclick="location.href='article.php?id=2';" style="background-image: url(https://cdn.discordapp.com/attachments/765749063621935104/884439050562461696/1_1.png)">
<div class="meta-wrapper">
<div class="article-date">5 Sep, 2021</div>
<div class="article-title">Server merge v1.264 update</div>
</div>
</div>
Currently it correctly grabs all four news articles, but only the image - how do I get the title and date of each? I have a fiddle here: https://dotnetfiddle.net/BVcAmH
Appreciate the help
I just realized the code has been correct all along; the only flaw was the Console.WriteLine call.
Wrong
Console.WriteLine(image, title, date);
Correct
Console.WriteLine(image + " " + " " + title + " " + date);
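The underlying issue is that the multi-argument overload of Console.WriteLine treats its first argument as a composite format string, so title and date were passed as format arguments that the string never referenced. A small self-contained illustration (the sample values are taken from the HTML above):

```csharp
using System;

class FormatDemo
{
    static void Main()
    {
        string image = "1_1.png", title = "Server merge v1.264 update", date = "5 Sep, 2021";

        // Console.WriteLine(image, title, date) treats `image` as a format
        // string; since it contains no {0}/{1} placeholders, `title` and
        // `date` are silently ignored.

        // String interpolation says what is meant:
        Console.WriteLine($"{image} {title} {date}");

        // Equivalent composite formatting, used correctly:
        Console.WriteLine("{0} {1} {2}", image, title, date);
    }
}
```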
I am using the HTML Agility Pack to validate my HTML. Below is what I am using:
public class MarkupErrors
{
public string ErrorCode { get; set; }
public string ErrorReason { get; set; }
}
public static List<MarkupErrors> IsMarkupValid(string html)
{
var document = new HtmlAgilityPack.HtmlDocument();
document.OptionFixNestedTags = true;
document.LoadHtml(html);
var parserErrors = new List<MarkupErrors>();
foreach(var error in document.ParseErrors)
{
parserErrors.Add(new MarkupErrors
{
ErrorCode = error.Code.ToString(),
ErrorReason = error.Reason
});
}
return parserErrors;
}
So say my input is something like the one shown below :
<h1>Test</h1>
Hello World</h2>
<h3>Missing close h3 tag
So my current function returns a list of the following errors:
- Start tag <h2> was not found
- End tag </h3> was not found
which is fine...
My problem is that I want the entire HTML to be valid, that is, with proper <head> and <body> tags, because this HTML will later be available for preview and download as .html files.
So I was wondering if I could check for this using HTML Agility Pack ?
Any ideas or other options will be appreciated. Thanks
You can check whether there is a HEAD element or a BODY element under an HTML element like this, for example:
bool hasHead = doc.DocumentNode.SelectSingleNode("html/head") != null;
bool hasBody = doc.DocumentNode.SelectSingleNode("html/body") != null;
These checks fail if there is no HTML element, or if there is no HEAD or BODY element directly under it.
Note I don't use this kind of XPATH expression "//head" because it would give a result even if the head was not directly under the HTML element.
I am attempting to fill an ASP.NET page textbox with some predefined text so that when the page is displayed the value is already set. I have tried
protected void Page_PreRender ()
{
mytextbox.Text = somestring;
}
which works fine in the development environment but on the server produces...
System.NullReferenceException: Object reference not set to an instance of an object
The same applies when I try this in Page_Load. As I read the answers to this question, what I am trying should work (in at least one of these places).
Can anyone see what I am doing wrong?
EDIT: more code, as suggested. The C# looks like this:
protected void Page_PreRender (Object sender, EventArgs e)
{
try
{
string [] file_list;
int i = 0;
file_list = Directory.GetFiles(MyProg.Common.GetDirectory(),
MyProg.Common.GetFileNameRoot() + "*.*");
foreach (string filename in file_list)
{
string filenameonly = Path.GetFileName (filename);
if (filenameonly == MyProg.Common.GetFileNameRoot() + "runlog.log")
{
nametextbox.Text = filenameonly;
}
}
}
catch (Exception ex)
{
string mystring = ex.ToString();
errorMessage.Text = "Page Load Error : " + mystring;
}
}
and the ASP.NET page like this...
<%@ Page Language="C#"
    AutoEventWireup="true"
    CodeBehind="MyDialogue.aspx.cs"
    Inherits="MyDialogue" %>
<%@ Register assembly="ComponentArt.Web.UI"
    namespace="ComponentArt.Web.UI"
    tagprefix="ComponentArt" %>
<%@ Register assembly="ComponentArt.Web.Visualization.Charting"
    namespace="ComponentArt.Web.Visualization.Charting"
    tagprefix="cc1" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head runat="server">
</head>
<body>
<form id="myForm" runat="server">
<div style="visibility:hidden">
<asp:TextBox ID="nametextbox"
TextMode="MultiLine"
runat="server"
Visible="true" />
</div>
</form>
</body>
</html>
Did you publish your site, but the file reference to the code-behind stayed in the aspx page?
Are you sure the DLL is in the bin folder?
This should work without complaint. Does the mytextbox control have the runat="server" attribute? You can only access controls from the code-behind if they have the runat="server" attribute.
There could be several areas causing this problem. How sure are you that you've narrowed it down to the textbox itself? Was this code completely bug-free before adding the textbox assignment? I'll post your code below with comments where I think potential null references may occur:
string [] file_list;
int i = 0;
file_list = Directory.GetFiles(MyProg.Common.GetDirectory(),
MyProg.Common.GetFileNameRoot() + "*.*");
// it is possible that file_list is null
// potentially due to an invalid path (missing / perhaps?)
foreach (string filename in file_list)
{
string filenameonly = Path.GetFileName (filename);
// It's possible that the MixedZone.Kernel.Common library
// is experiencing the null reference exception because it
// may not understand what file to get the name root of or
// maybe it is not capable of getting the root for some
// other reason (permissions perhaps?)
if (filenameonly == MixedZone.Kernel.Common.GetFileNameRoot() + "runlog.log")
{
nametextbox.Text = filenameonly;
    }
}
Some possible solutions or safer code:
string [] file_list;
int i = 0;
file_list = Directory.GetFiles(MyProg.Common.GetDirectory(),
MyProg.Common.GetFileNameRoot() + "*.*");
if (file_list == null) throw new Exception("File List is null. Something is wrong.");
foreach (string filename in file_list)
{
string filenameonly = Path.GetFileName (filename);
string fileroot = MixedZone.Kernel.Common.GetFileNameRoot();
if (string.IsNullOrEmpty(fileroot)) throw new Exception("MixedZone Library failed.");
if (filenameonly.Equals(fileroot + "runlog.log", StringComparison.OrdinalIgnoreCase)) // Choose your own string comparison here
{
nametextbox.Text = filenameonly;
    }
}
Run with antivirus disabled on the production server?
Compare .NET versions between production and development?
"Works fine in the development environment but on the server produces..." - so perhaps permissions or missing files?
I have the following scenario:
<a href="test.com">Some text <b>is bolded</b> some is <b>not</b></a>
Now, how do I get the "test.com" part and the text of the anchor, without the bolded parts?
Assuming the following markup:
<html>
<head>
<title>Test</title>
</head>
<body>
<a href="test.com">Some text <b>is bolded</b> some is <b>not</b></a>
</body>
</html>
You could perform the following:
class Program
{
static void Main()
{
var doc = new HtmlDocument();
doc.Load("test.html");
var anchor = doc.DocumentNode.SelectSingleNode("//a");
Console.WriteLine(anchor.Attributes["href"].Value);
Console.WriteLine(anchor.InnerText);
}
}
prints:
test.com
Some text is bolded some is not
Of course, you will probably want to adjust your SelectSingleNode XPath selector by providing a unique id or a class name for the anchor you are trying to fetch:
// assuming <a href="test.com" id="foo">Some text <b>is bolded</b> some is <b>not</b></a>
var anchor = doc.GetElementbyId("foo");
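As a minimal sketch of both options side by side, assuming the anchor carries the hypothetical id "foo" from the comment above:

```csharp
using System;
using HtmlAgilityPack;

class AnchorDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body>" +
            "<a id=\"foo\" href=\"test.com\">Some text <b>is bolded</b> some is <b>not</b></a>" +
            "</body></html>");

        // By id: HtmlAgilityPack indexes element ids directly.
        var byId = doc.GetElementbyId("foo");

        // Or with an equivalent XPath attribute predicate:
        var byXpath = doc.DocumentNode.SelectSingleNode("//a[@id='foo']");

        Console.WriteLine(byId.Attributes["href"].Value); // test.com
        Console.WriteLine(byXpath.InnerText);             // bold tags are flattened into plain text
    }
}
```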