Hi, I am using Html Agility Pack (from NuGet) to scrape a web page and collect all of the URLs on it. The code is shown below. However, the links it returns are just extensions of the actual website, not full URLs like http://www.foo/bar/foobar.com — all I get is "/foobar". Is there a way to get the full URL with the code below?
Thanks!
static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}

public static List<string> ParseLinks(string email)
{
    WebClient webClient = new WebClient();
    byte[] data = webClient.DownloadData(email);
    string download = Encoding.ASCII.GetString(data);
    HashSet<string> list = new HashSet<string>();
    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        list.Add(href);
    }
    return list.ToList();
}
You can check whether the href value is a relative or an absolute URL.
Load the value into a Uri and test whether it is relative; if it is, converting it to absolute is the way to go.
static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}

public static List<string> ParseLinks(string urlToCrawl)
{
    WebClient webClient = new WebClient();
    byte[] data = webClient.DownloadData(urlToCrawl);
    string download = Encoding.ASCII.GetString(data);
    HashSet<string> list = new HashSet<string>();
    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        list.Add(GetAbsoluteUrlString(urlToCrawl, href));
    }
    return list.ToList();
}
A function to convert a relative URL to an absolute one:
static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}
You can't get the complete URL because the href attribute doesn't contain the complete URL. Example:
In your case the page contains relative URLs, so you need to do this:
string href = email + n.Attributes["href"].Value;
This way you will have the full URL. The better solution is to check whether the URL is relative or absolute and, only if it is relative, prepend email to it.
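As a caution, simple concatenation breaks for hrefs like `../page` or `//cdn.example.com/x`; resolving through System.Uri handles those cases as well. A minimal sketch (class and method names here are mine, not from the posts above):

```csharp
using System;

public static class UrlHelper
{
    // Resolve a possibly-relative href against the page it was scraped from.
    public static string GetAbsoluteUrl(string baseUrl, string href)
    {
        var uri = new Uri(href, UriKind.RelativeOrAbsolute);
        if (!uri.IsAbsoluteUri)
            uri = new Uri(new Uri(baseUrl), uri); // combine with the page's base
        return uri.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(GetAbsoluteUrl("https://www.facebook.com", "/foobar"));
        // -> https://www.facebook.com/foobar
        Console.WriteLine(GetAbsoluteUrl("https://www.facebook.com", "https://example.com/x"));
        // -> https://example.com/x
    }
}
```

An already-absolute href passes through unchanged, so the same call works for every link on the page.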
Related
I am doing a small web-scraping project, and I am having a problem with the function that fetches the HTML: the page I inspect in the browser is different from the page the method downloads (for the same URL).
I have tried to improve the code, but to no avail. The same thing happens for i = 2.
static void Main(string[] args)
{
    string prefixurl = "https://www.aaabbbcccdddeee.de/en/do-business-with-finland/finnish-suppliers/finnish-suppliers-results?query=africa";
    for (int i = 1; i < 18; i++)
    {
        string url = prefixurl;
        if (i > 1)
        {
            url = prefixurl + "&page=" + i;
        }
        var links = GetBusinessLinks(url);
        List<Empresa> empresas = GetBusiness(links);
        Export(empresas);
    }
}

static List<string> GetBusinessLinks(string url)
{
    var doc = GetDocument(url);
    var linkNodes = doc.DocumentNode.SelectNodes("/html/body/section/div/div/div/div[2]/div[2]//a");
    // //a[@class=\"btn bf-ghost-button\"]
    var baseUri = new Uri(url);
    var links = new List<string>();
    // The problem is here: on the incomplete page the program finds no nodes
    foreach (var node in linkNodes)
    {
        var link = node.Attributes["href"].Value;
        bool business = link.Contains("companies");
        if (business)
        {
            link = new Uri(baseUri, link).AbsoluteUri;
            links.Add(link);
        }
    }
    return links;
}

static HtmlDocument GetDocument(string url)
{
    // Note: web.Load(url) returns a fresh document, so a separately constructed
    // HtmlDocument (and its encoding option) would be discarded;
    // set the encoding on HtmlWeb instead.
    var web = new HtmlWeb { OverrideEncoding = Encoding.UTF8 };
    return web.Load(url);
}
Your suggestion made me suspect where I should continue looking, thanks.
I used PuppeteerSharp in non-headless mode.
https://betterprogramming.pub/web-scraping-using-c-and-net-d99a085dace2
I have not used the HtmlAgilityPack much and I'm stuck on the following issue.
I'm checking whether the browser supports WebP; if it does, I append a new parameter to the src of each image.
I have that working, but I cannot work out how to return the updated HTML. Any help will be appreciated.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
    bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
    if (!browserSupportsWebP) return htmlText;

    var h = new HtmlDocument();
    h.LoadHtml(htmlText.ToString());

    const string webP = "&quality=80&format=webp";
    if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;

    string imgOuterHtml = string.Empty;
    foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
    {
        var src = image.Attributes["src"].Value.Split('&');
        image.SetAttributeValue("src", src[1] + webP);
        imgOuterHtml = image.OuterHtml;
    }
    //How do I return the updated html here
    return new HtmlString(h.ParsedText);
}
OK, I could not find anything built into the Agility Pack to do what I wanted.
I managed to achieve what I was after using the code below.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
    bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
    if (!browserSupportsWebP) return htmlText;

    var h = new HtmlDocument();
    h.LoadHtml(htmlText.ToString());

    const string webP = "&quality=80&format=webp";
    if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;

    string modifiedHtml = htmlText.ToString();
    List<ReplaceImageValues> images = new List<ReplaceImageValues>();
    foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
    {
        var src = image.Attributes["src"].Value.Split('&');
        string oldSrcValue = image.OuterHtml;
        image.SetAttributeValue("src", src[0] + src[1] + webP);
        string newSrcValue = image.OuterHtml;
        images.Add(new ReplaceImageValues(oldSrcValue, newSrcValue));
    }
    foreach (var newImages in images)
    {
        modifiedHtml = modifiedHtml.Replace(newImages.OldVal, newImages.NewVal);
    }
    return new HtmlString(modifiedHtml);
}
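For what it's worth, the string-replace round trip may not be necessary: HtmlAgilityPack reflects attribute changes in `HtmlNode.OuterHtml`, so the mutated document can be serialized directly from the root node. A sketch under that assumption (the sample markup and the class name are mine; the helper keeps only the WebP-append step, without the browser check or `Split` logic from the posts above):

```csharp
using System;
using HtmlAgilityPack;

public static class WebPRewriter
{
    // Append the WebP query parameters to every img src, then
    // serialize the modified document from its root node.
    public static string AppendWebP(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return html;
        foreach (HtmlNode image in images)
        {
            string src = image.GetAttributeValue("src", "");
            image.SetAttributeValue("src", src + "&quality=80&format=webp");
        }
        // OuterHtml of the root reflects the attribute changes made above.
        return doc.DocumentNode.OuterHtml;
    }
}
```

With this, `AppendWebP("<img src=\"photo.jpg?w=100\">")` returns markup whose src ends in `&quality=80&format=webp`, with no need to collect old/new pairs and run `string.Replace`.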
This is the page I'm using for documentation: https://lichess.org/api#operation/player
I want to get each player's username, rating, and title.
My code:
public class Player
{
    public string username;
    public double rating;
    public string title;
}

HttpClient client = new HttpClient();
client.BaseAddress = new Uri("https://lichess.org/");
HttpResponseMessage response = client.GetAsync("player/top/200/bullet").Result;
Here I'm getting a response, but I have no clue how to take only the properties I need and store them in a list of players.
After a discussion with you on this problem, it turned out that the response you are receiving is an HTML string, so you need to handle this case differently. I played around with the HTML you posted in the comments and was able to parse the string with Html Agility Pack, which can be found here. You can also install it from the NuGet Package Manager in Visual Studio.
Here is a very basic example of the parsing process I tried:
public void ProcessHtml()
{
    List<Player> playersList = new List<Player>();

    //Get your HTML loaded from a URL. Gave me SSL exceptions, so I took a different route
    //var url = "https://lichess.org/player/top/200/bullet";
    //var web = new HtmlWeb();
    //var doc = web.Load(url);

    //Get your HTML loaded from a file in my case
    var doc = new HtmlDocument();
    doc.Load("C:\\Users\\Rahul\\Downloads\\CkBsZtvf.html", Encoding.UTF8);

    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//tbody"))
    {
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            int i = 0;
            Player player = new Player();
            //Since there are 4 cells per tr, pick only what is required based on the loop counter
            foreach (HtmlNode cell in row.SelectNodes("th|td"))
            {
                if (i == 1)
                {
                    player.username = cell.InnerText;
                }
                if (i == 2)
                {
                    player.rating = Convert.ToDouble(cell.InnerText);
                }
                if (i == 3)
                {
                    player.title = cell.InnerText;
                }
                i++;
            }
            playersList.Add(player);
        }
    }
    var finalplayerListCopy = playersList;
}

public class Player
{
    public string username;
    public double rating;
    public string title;
}
After running this, your finalplayerListCopy has a count of 200.
Obviously, you will have to play with the data and tailor it to your needs. I hope this helps you out.
Cheers!
From what I've read in the documentation:
async Task<Player> getPlayerAsync(string path)
{
    Player player = null;
    HttpResponseMessage response = await client.GetAsync(path);
    if (response.IsSuccessStatusCode)
    {
        player = await response.Content.ReadAsAsync<Player>();
    }
    return player;
}

getPlayerAsync("https://lichess.org/player/top/200/bullet");
I have a C# Windows Phone 8.1 app which I am building. Part of the app needs to look for information on a specific web page. One of the fields I need is a URL, which can be found on certain items on the page; however, I am finding that the URL is in a relative-style format:
FullArticle.aspx?a=323495
I am wondering whether there is a way in C#, using HtmlAgilityPack, HttpWebRequest, etc., to find the link to the actual page. A code snippet is below.
private static TileUpdate processSingleNewsItem(HtmlNode newsItemNode)
{
    Debug.WriteLine("");
    var articleImage = getArticleImage(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
    var articleDate = getArticleDate(getNode(newsItemNode, "div", "nw-container-panel-articledate"));
    var articleSummary = getArticleSummary(getNode(newsItemNode, "div", "nw-container-panel-textarea"));
    var articleUrl = getArticleUrl(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
    return new TileUpdate
    {
        Date = articleDate,
        Headline = articleSummary,
        ImagePath = articleImage,
        Url = articleUrl
    };
}

private static string getArticleUrl(HtmlNode parentNode)
{
    var imageNode = parentNode.Descendants("a").FirstOrDefault();
    Debug.WriteLine(imageNode.GetAttributeValue("href", null));
    return imageNode.GetAttributeValue("href", null);
}

private static HtmlNode getNode(HtmlNode parentNode, string nodeType, string className)
{
    var children = parentNode.Elements(nodeType).Where(o => o.Attributes["class"].Value == className);
    return children.First();
}
Would appreciate any ideas or solutions. Cheers!
In my web crawler, here's what I do:

foreach (HtmlNode link in doc.DocumentNode.SelectNodes(@"//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    if (att == null) continue;
    string href = att.Value;
    if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue; // ignore javascript: links on buttons using a tags
    Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
    // Make it absolute if it's relative (urlRoot is the Uri of the page being crawled)
    if (!urlNext.IsAbsoluteUri)
    {
        urlNext = new Uri(urlRoot, urlNext);
    }
    ...
}
I've managed to parse HTML content from a news site (ryfylke.net) and display it in my WP8 app. But how can I parse content from the subpages (the "Read more" links)?
For now, when I click the links the app launches IE and displays the actual site. What I would like to do instead is parse the content from the subpage and display it in the app.
EDIT (this is my current MainPage.xaml.cs):
protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string htmlPage = "";
    using (var client = new HttpClient())
    {
        htmlPage = await client.GetStringAsync("http://ryfylke.net/kategori/nyheter/");
    }
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlPage);
    List<Nyheter> nyheter = new List<Nyheter>();
    foreach (var div in htmlDocument.DocumentNode.SelectNodes("//article[starts-with(@class, 'post-')]"))
    {
        Nyheter newNyheter = new Nyheter();
        newNyheter.Link = div.SelectSingleNode(".//a[@href]").Attributes["href"].Value;
        newNyheter.Bilde = div.SelectSingleNode(".//img[@class='attachment-entry-medium wp-post-image']").Attributes["src"].Value;
        newNyheter.Tittel = div.SelectSingleNode(".//h2[@class='entry-title entry-small-title']").InnerText.Trim();
        newNyheter.Sammendrag = div.SelectSingleNode(".//p[@class='entry-excerpt']").InnerText.Trim();
        nyheter.Add(newNyheter);
    }
    lstNyheter.ItemsSource = nyheter;
}
And I then use public properties like this for the content...
public string Bilde { get; set; }
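For context, the model class behind that binding might look like this (a sketch: only `Bilde` is shown in the post, and the other property names are inferred from the parsing code above):

```csharp
public class Nyheter
{
    public string Link { get; set; }        // href of the "Read more" link
    public string Bilde { get; set; }       // image URL
    public string Tittel { get; set; }      // headline
    public string Sammendrag { get; set; }  // excerpt/summary
}
```

Each `Nyheter` instance then carries everything the list item template needs, and `Link` is what a follow-up `GetStringAsync` call would fetch to parse the subpage in-app.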