I am new to C# and wanted to try building a little scraper to experiment with the language, after watching a YouTube video on the topic. I am trying to scrape bet365.dk (more specifically this link: https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/).
This is my code:
using System;
using System.Net.Http;
using HtmlAgilityPack;

namespace Bet365Scraper
{
    class Program
    {
        static void Main(string[] args)
        {
            GetHtmlAsync();
            Console.ReadLine();
        }

        private static async void GetHtmlAsync()
        {
            var url = "https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/";
            var httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36");
            var html = await httpClient.GetStringAsync(url);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            var htmlBody = htmlDocument.DocumentNode.SelectSingleNode("//body");
            var node = htmlBody.Element("//div[#class='src-ParticipantFixtureDetailsHigher_TeamNames ']");
            Console.WriteLine(node.InnerHtml);
        }
    }
}
I am not sure how to do this. I find the documentation on HTML Agility Pack's site a bit confusing, and I cannot seem to find exactly what I am looking for. Here is what I want to do, using this little piece of HTML from the bet365 site:
<div class="src-ParticipantFixtureDetailsHigher_TeamNames">
<div class="src-ParticipantFixtureDetailsHigher_TeamWrapper ">
<div class="src-ParticipantFixtureDetailsHigher_Team " style="">Færøerne</div>
</div>
<div class="src-ParticipantFixtureDetailsHigher_TeamWrapper ">
<div class="src-ParticipantFixtureDetailsHigher_Team ">Andorra</div>
</div>
</div>
How can I print out both 'Færøerne' and 'Andorra' from the divs in one go? I am aware that I probably need a foreach, but as I said, I'm not too certain how to handle the selectors and such.
I'm not familiar with XPath, but I do know CSS selector syntax (what JavaScript's querySelector uses), so I suggest additionally installing the Fizzler.Systems.HtmlAgilityPack NuGet package.
Then the HtmlNode.QuerySelector() and QuerySelectorAll() extension methods become available; they accept CSS selectors.
I also fixed the HttpClient usage:
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Fizzler.Systems.HtmlAgilityPack; // provides the QuerySelectorAll() extension
using HtmlAgilityPack;

namespace Bet365Scraper
{
    class Program
    {
        // Reuse a single HttpClient instance for the lifetime of the application.
        private static readonly HttpClient httpClient = new HttpClient();

        static async Task Main(string[] args)
        {
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36");
            await GetHtmlAsync("https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/");
            Console.ReadLine();
        }

        private static async Task GetHtmlAsync(string url)
        {
            var html = await httpClient.GetStringAsync(url);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);

            // CSS class selector: matches every team-name div in one pass.
            var nodes = htmlDocument.DocumentNode.QuerySelectorAll(".src-ParticipantFixtureDetailsHigher_Team");
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
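If you'd rather stay with plain HtmlAgilityPack instead of adding Fizzler, the same two team names can be selected with SelectNodes() and XPath. A minimal sketch; the concat/contains idiom is there because the site's class attributes carry trailing spaces (e.g. "src-ParticipantFixtureDetailsHigher_Team "):

// XPath alternative to QuerySelectorAll: match the class as a whole token,
// so "..._TeamNames" and "..._TeamWrapper" are not picked up by accident.
var nodes = htmlDocument.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' src-ParticipantFixtureDetailsHigher_Team ')]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerText); // e.g. "Færøerne" and "Andorra"
    }
}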
Related
I'm trying to call an endpoint to get some info for my website, but when I run GetAsync it only works the first time in the current process.
static async Task Main(string[] args)
{
    await GetUserInfo("srmilton"); // Working
    await GetUserInfo("srmilton"); // Not Working
    while (true)
    {
        var res = await GetUserInfo("srmilton");
    }
}

// Requires: using Newtonsoft.Json.Linq;
public static async Task<(string, string)> GetUserInfo(string username)
{
    string url = "https://www.mywebsite.com/api/user/detail";
    var baseAddress = new Uri(url);
    using (var handler2 = new HttpClientHandler { UseCookies = false })
    using (var client2 = new HttpClient(handler2) { BaseAddress = baseAddress })
    {
        client2.DefaultRequestHeaders.Clear();
        client2.DefaultRequestHeaders.Add("Cookie", "session=zgbEqIjfSC7M7QdTTpHDkpWLt");
        client2.DefaultRequestHeaders.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");
        client2.DefaultRequestHeaders.Add("x-requested-with", "XMLHttpRequest");
        client2.DefaultRequestHeaders.Add("referer", "https://www.mywebsite.com/");

        var result = await client2.GetAsync("");
        string responseString = await result.Content.ReadAsStringAsync();

        dynamic jsonresponse = JObject.Parse(responseString);
        string id = jsonresponse.userInfo.user.id;
        string sec_id = jsonresponse.userInfo.user.secUid;
        return (id, sec_id);
    }
}
The first time the function GetUserInfo is called it returns the correct JSON response from the API, but the second call gets stuck in GetAsync. I have already tried .ConfigureAwait(false), .Result, and even creating the HttpClient just once and reusing it, but it always hangs on the second call.
I don't know what I'm doing wrong; if someone can explain and show the right way to make this work, I'll be thankful.
Solved. I was missing an Accept header in my GetAsync request headers:
client2.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
I'm making a simple website scraper in C# to retrieve party names of supreme court cases (this is public information), as in this sample link: https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/19-8334.html
C# Code:
private static async void GetHtmlAsync(String docket)
{
    var url = "https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/19-8334.html";
    var httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.234");
    var html = await httpClient.GetStringAsync(url);
    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    htmlDocument.LoadHtml(html);
    Console.WriteLine();
}
The problem is that whenever I run this, it successfully gives back the whole HTML file, but without the data I need, which is enclosed in the element.
In browser: [screenshot omitted]
In runtime: [screenshot omitted]
I don't know why, but you should be getting a proper response.
Try the following; it might get you the answer:
var html = httpClient.GetAsync(url).GetAwaiter().GetResult();
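For completeness, a sketch of the question's method with that change applied. Note that GetAsync returns an HttpResponseMessage rather than a string, so the body still has to be read out before it can be parsed:

private static void GetHtml(string docket)
{
    var url = "https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/19-8334.html";
    var httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.234");

    // Block on the async calls instead of await-ing inside an async void method.
    var response = httpClient.GetAsync(url).GetAwaiter().GetResult();
    var html = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();

    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    htmlDocument.LoadHtml(html);
    Console.WriteLine(html.Length); // inspect what actually came back
}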
I am trying to download the HTML from a site and parse it. I am actually interested only in the OpenGraph data in the head section. For most sites, using WebClient, HttpClient, or HtmlAgilityPack works, but on some domains I get a 403, for example: westelm.com.
I have tried setting the headers to be absolutely the same as when I use the browser, but I still get a 403. Here is some code:
string url = "https://www.westelm.com/m/products/brushed-herringbone-throw-t5792/?";
var doc = new HtmlDocument();
using (WebClient client = new WebClient())
{
    client.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36";
    client.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
    client.Headers["Accept-Encoding"] = "gzip, deflate, br";
    client.Headers["Accept-Language"] = "en-US,en;q=0.9";
    doc.Load(client.OpenRead(url));
}
At this point, I am getting a 403.
Am I missing something, or is the site administrator protecting the site from API requests?
How can I make this work? Is there a better way to get OpenGraph data from a site?
Thanks.
I used your question to resolve the same problem. I don't know if you've already fixed this, but I'll tell you how it worked for me.
A page was giving me a 403 for the same reason. The thing is: you need to emulate a "web browser" from the code, sending a lot of headers.
I used one of your headers that I wasn't sending before (Accept-Language).
I didn't use WebClient, though; I used HttpClient to fetch the page:
private static async Task<string> GetHtmlResponseAsync(HttpClient httpClient, string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url));
    request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
    // Only advertise gzip, since the response is decompressed with GZipStream below;
    // advertising deflate/br would break when the server chose one of those.
    request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip");
    request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36");
    request.Headers.TryAddWithoutValidation("Accept-Charset", "UTF-8");
    request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

    using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    if (response == null)
        return string.Empty;

    using var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false);
    using var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress);
    using var streamReader = new StreamReader(decompressedStream);
    return await streamReader.ReadToEndAsync().ConfigureAwait(false);
}
If it helps you, I'm glad. If not, I will leave this answer here to help someone else in the future!
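Since the original goal was the OpenGraph data in the head section, here is a sketch of feeding the returned HTML to HtmlAgilityPack and pulling out the og: meta tags. Which tags exist depends on the page; the selector simply follows the OpenGraph property-prefix convention:

// Sketch: extract OpenGraph <meta property="og:..." content="..."> tags.
var httpClient = new HttpClient();
string html = await GetHtmlResponseAsync(httpClient, "https://www.westelm.com/m/products/brushed-herringbone-throw-t5792/?");

var doc = new HtmlDocument();
doc.LoadHtml(html);

var metaNodes = doc.DocumentNode.SelectNodes("//meta[starts-with(@property, 'og:')]");
if (metaNodes != null)
{
    foreach (var meta in metaNodes)
    {
        var property = meta.GetAttributeValue("property", "");
        var content = meta.GetAttributeValue("content", "");
        Console.WriteLine($"{property} = {content}"); // e.g. og:title = ...
    }
}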
I'm trying to use HttpClient to get a JSON response from an API, but I keep getting an HTML response. In the browser and in Postman I get the result as JSON just by typing in the URL. When using RestSharp I also get the response as JSON. What do I need to add to get the response as JSON? The variable responseString is an HTML string, not a JSON string.
I use .NET Core 3.1.
Here's the code:
class Program
{
    static async Task Main(string[] args)
    {
        var response = await GetResponse();
        System.Console.ReadKey();
    }

    public static async Task<string> GetResponse()
    {
        var client = new HttpClient();
        client.BaseAddress = new Uri("https://musicbrainz.org/ws/2/");
        client.DefaultRequestHeaders.Add("Accept", "application/json");
        using var response = await client.GetAsync(
            "/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json&inc=url-rels+release-groups");
        response.EnsureSuccessStatusCode();
        var responseString = await response.Content.ReadAsStringAsync();
        return responseString;
    }
}
I think the API is looking for a User-Agent.
Try adding:
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36");
As a side note, you might want to consider declaring your HttpClient as static; that is recommended in its documentation. Also, if you want JSON, you need to use the DeserializeObject method:
var result = JsonConvert.DeserializeObject<RootObject>(responseString);
You just need to make sure that you have already installed Newtonsoft.Json from NuGet and added the following to your using directives:
using Newtonsoft.Json;
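A minimal sketch of what that could look like here. RootObject and its properties are placeholders that would need to mirror the actual response; the property names below assume the shape of the MusicBrainz artist JSON ("id", "name"):

using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

// Placeholder type: map only the fields you need from the JSON response.
public class RootObject
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("name")]
    public string Name { get; set; }
}

class Example
{
    // Single shared instance, as recommended by the HttpClient documentation.
    private static readonly HttpClient client = new HttpClient();

    static Example()
    {
        client.DefaultRequestHeaders.Add("Accept", "application/json");
        client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36");
    }

    public static async Task<RootObject> GetArtistAsync()
    {
        var responseString = await client.GetStringAsync(
            "https://musicbrainz.org/ws/2/artist/5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json&inc=url-rels+release-groups");
        return JsonConvert.DeserializeObject<RootObject>(responseString);
    }
}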
So I'm trying to download a list of strings, which are URLs of MP4 videos. This code works when there's only one video to download, yet doesn't if there are multiple in the list.
If there are multiple, it still downloads something, but it's always 154 KB in size with no length. It's essentially corrupted and can't be watched, which is exactly what Windows tells me when I open it.
Can anyone help? Am I not doing something I should be?
public static void DownloadFiles(IList<string> files)
{
    foreach (var file in files)
    {
        // Derive a local file name from the URL (requires using System.IO;).
        DownloadFile(file, Path.GetFileName(new Uri(file).LocalPath));
    }
}

private static void DownloadFile(string url, string fileName)
{
    using (var webClient = new WebClient())
    {
        webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        webClient.DownloadFile(url, fileName);
    }
}