Scraping Website with C# using HTML Request Not Giving Table Data - c#

I'm making a simple website scraper in C# to retrieve party names of Supreme Court cases (this is public information), like in this sample link: https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/19-8334.html
C# Code:
private static async void GetHtmlAsync(String docket)
{
    var url = "https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/19-8334.html";
    var httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.234");
    var html = await httpClient.GetStringAsync(url);
    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    htmlDocument.LoadHtml(html);
    Console.WriteLine();
}
The problem is that whenever I run this, it successfully gives back the whole HTML file, but without the data I need, which is enclosed in the table element.
In the browser: (screenshot omitted)
At runtime: (screenshot omitted)

I don't know why, but you should get a proper response. Try the following; you might get the answer:
var response = httpClient.GetAsync(url).GetAwaiter().GetResult();
var html = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
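Another avenue worth checking: the search.aspx wrapper page may inject the docket table client-side, in which case no amount of header tweaking will put it in the raw HTML. The filename= query parameter suggests the docket content lives at a static path; fetching that path directly is an assumption worth verifying in the browser's network tab. A minimal sketch:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class DocketScraper
{
    static async Task Main()
    {
        // Assumption: the filename= parameter of search.aspx points at a static
        // docket page that can be fetched directly, bypassing any client-side
        // injection done by the wrapper. Verify this URL in your browser first.
        var url = "https://www.supremecourt.gov/docket/docketfiles/html/public/19-8334.html";

        using var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36");
        var html = await httpClient.GetStringAsync(url);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Dump every table cell; tighten the XPath once you know the table's id/class.
        var cells = doc.DocumentNode.SelectNodes("//table//td");
        if (cells != null)                       // SelectNodes returns null on no match
        {
            foreach (var cell in cells)
                Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

If the direct URL also lacks the table, the data is being rendered by JavaScript and you would need a browser-automation tool rather than a plain HTTP client.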


HTML Agility Pack can't get text content from div

I am new to C# and wanted to try to make a little scraper out of it to try out some things. I saw a YouTube video on it. I am trying to scrape bet365.dk (more specifically this link: https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/).
This is my code:
using System;
using System.Net.Http;
using HtmlAgilityPack;

namespace Bet365Scraper
{
    class Program
    {
        static void Main(string[] args)
        {
            GetHtmlAsync();
            Console.ReadLine();
        }

        private static async void GetHtmlAsync()
        {
            var url = "https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/";
            var httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36");
            var html = await httpClient.GetStringAsync(url);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            var htmlBody = htmlDocument.DocumentNode.SelectSingleNode("//body");
            var node = htmlBody.Element("//div[#class='src-ParticipantFixtureDetailsHigher_TeamNames ']");
            Console.WriteLine(node.InnerHtml);
        }
    }
}
I am not sure how to do this, and I find the documentation on HTML Agility Pack's site a bit confusing; I cannot seem to find exactly what I am looking for. Here is what I want to do. This is a little piece of the HTML on the bet365 site:
<div class="src-ParticipantFixtureDetailsHigher_TeamNames">
    <div class="src-ParticipantFixtureDetailsHigher_TeamWrapper ">
        <div class="src-ParticipantFixtureDetailsHigher_Team " style="">Færøerne</div>
    </div>
    <div class="src-ParticipantFixtureDetailsHigher_TeamWrapper ">
        <div class="src-ParticipantFixtureDetailsHigher_Team ">Andorra</div>
    </div>
</div>
How can I print out both 'Færøerne' and 'Andorra' from the divs in one go? I am aware that I probably need a foreach, but as said, I'm not too certain how to do this with the selectors and such.
I'm not familiar with XPath, but I know CSS selector syntax (the same syntax JavaScript's querySelector uses), so I suggest additionally installing the Fizzler.Systems.HtmlAgilityPack NuGet package. Then the HtmlNode.QuerySelector() method will be available; it accepts CSS selectors. I also fixed the HttpClient usage.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

namespace Bet365Scraper
{
    class Program
    {
        private static readonly HttpClient httpClient = new HttpClient();

        static async Task Main(string[] args)
        {
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36");
            await GetHtmlAsync("https://www.bet365.dk/#/AC/B1/C1/D451/F2/Q1/F^12/");
            Console.ReadLine();
        }

        private static async Task GetHtmlAsync(string url)
        {
            var html = await httpClient.GetStringAsync(url);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            var nodes = htmlDocument.DocumentNode.QuerySelectorAll(".src-ParticipantFixtureDetailsHigher_Team");
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
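If you would rather stick with plain HtmlAgilityPack and skip the Fizzler dependency, an XPath class-token test can match the same nodes. A sketch using the HTML fragment from the question (note that bet365's live markup is rendered by JavaScript, so the raw HTML returned to HttpClient may not contain these divs at all):

```csharp
using System;
using HtmlAgilityPack;

class XPathExample
{
    static void Main()
    {
        // Stand-in HTML copied from the question, so the selector can be tested offline.
        var html = @"<div class='src-ParticipantFixtureDetailsHigher_TeamNames'>
            <div class='src-ParticipantFixtureDetailsHigher_TeamWrapper '>
                <div class='src-ParticipantFixtureDetailsHigher_Team '>Færøerne</div>
            </div>
            <div class='src-ParticipantFixtureDetailsHigher_TeamWrapper '>
                <div class='src-ParticipantFixtureDetailsHigher_Team '>Andorra</div>
            </div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Token-exact class match: pad both the attribute value and the target
        // class with spaces so '_Team' doesn't also match '_TeamWrapper' or
        // '_TeamNames' (a plain contains(@class, ...) would).
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' src-ParticipantFixtureDetailsHigher_Team ')]");

        foreach (var node in nodes)
            Console.WriteLine(node.InnerText.Trim());   // prints Færøerne, then Andorra
    }
}
```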

C# WebClient receives 403 when getting html from a site

I am trying to download the HTML from a site and parse it. I am actually only interested in the OpenGraph data in the head section. For most sites, using WebClient, HttpClient, or HtmlAgilityPack works, but for some domains I get a 403, for example westelm.com.
I have tried setting the headers to be exactly the same as they are when I use the browser, but I still get a 403. Here is some code:
string url = "https://www.westelm.com/m/products/brushed-herringbone-throw-t5792/?";
var doc = new HtmlDocument();
using (WebClient client = new WebClient())
{
    client.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36";
    client.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
    client.Headers["Accept-Encoding"] = "gzip, deflate, br";
    client.Headers["Accept-Language"] = "en-US,en;q=0.9";
    doc.Load(client.OpenRead(url));
}
At this point, I am getting a 403.
Am I missing something, or is the site administrator protecting the site from API requests?
How can I make this work? Is there a better way to get OpenGraph data from a site?
Thanks.
I used your question to resolve the same problem. I don't know if you've already fixed this, but I'll tell you how it worked for me.
A page was giving me a 403 for the same reason. The thing is: you need to emulate a web browser from the code, sending a lot of headers.
I used one of your headers that I wasn't using (Accept-Language).
I didn't use WebClient, though; I used HttpClient to fetch the page:
private static async Task<string> GetHtmlResponseAsync(HttpClient httpClient, string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url));
    request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
    request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate, br");
    request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36");
    request.Headers.TryAddWithoutValidation("Accept-Charset", "UTF-8");
    request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

    using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    if (response == null)
        return string.Empty;

    // Assumes the server answered with gzip (we advertised it in Accept-Encoding);
    // a deflate or br response would need a different decompression stream.
    using var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false);
    using var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress);
    using var streamReader = new StreamReader(decompressedStream);
    return await streamReader.ReadToEndAsync().ConfigureAwait(false);
}
If it helps you, I'm glad. If not, I will leave this answer here to help someone else in the future!

How to get json from Secured API

I did an api, for the exemple i will call it: https://testapp.azurewebsites.net.
I did the Authentication / Authorization in Azure for google Facebook and Miscrosoft account.
Then i want to consume it in xamarin for my Android/iOS app, so i did the login button etc but when I'm authentify i can't get the json of my api URL: "https://testapp.azurewebsites.net/api/Test/allCoordinates".
It work's perfectly in my browser and in postman... This is my code in C#:
var requestCoord = new OAuth2Request("GET", new Uri(URL), null, e.Account);
var responseCoord = await requestCoord.GetResponseAsync(); // works for Google userinfo but not for my API...
string coordJson = await responseCoord.GetResponseTextAsync();
var mapTest = JsonConvert.DeserializeObject<List<CustomPin>>(coordJson);
In Postman it works, and I can see this C# code generated by Postman:
var client = new RestClient("https://testapp.azurewebsites.net/api/Test/allCoordinates?access_token=ya29.a0AfH6SMCEGy4tP_zngNEhAcpf31d3O_ZYl7NE9QJjbKrW0KPh-dC7PjNmz-KOCbkySRtuwDJdg2ckhiTaTdIEsONxVhFhK3NpnUk9iITyCB1BnwpWJwNEEivxg0pL93UPP9r4UYf1dHEiTVd63eydfV7HoKlxExMFtS8");
client.Timeout = -1;
var request = new RestRequest(Method.GET);
request.AddHeader("User-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36");
request.AddHeader("Cookie", "ARRAffinity=d2e047f134af60dd8e0802593ad5206002e99e56a6231fee0e85747cfa96ea6f; AppServiceAuthSession=Dh0GnGQjaNoBXKv8r4lM9BoJkAA1UFSLuoDDAVP1qGrPP3ICauM1Glsb+Q7NhU+4m+IuPh5ZqGv2bzU6FtvEqri4Io88RuP6ZzKPayXSJKn4WbkzteU59if76yVY/KSjmwjbdUTC47yO+XO2snKygYlGZ9+pVlgaF/UdmW6OLWDlqPvJ069oSXkkZb/gGV5m6dHzYvfn3PcJ4HJmfEPQDclsRvRYUmpIY11hWcRUiiVx26o/SE+IaytRfWxkGk4g/thMFW3IOFtw09DdGXma/Qik8ANybClwXZ7G/3i1VyHQLM9TnU3UGcjtArLUFVj4T3jNkdaVioxtNQWJcDvwN54OL24eNFMM4Ov7Rbo7t2QtQrW73KxOrG/RyJHvBTHTyhjmAw6Hb7wg7VwcJvpKwcJKFBWH5ntvouFhj/DmCrzBuG/Cz6K+81ocEnHBLsHcx9qHrEBXCU3FlMQbogDcRRo1om78IwK+OxKoY+CzDWAJW3taJLl+jVO6QgFtbyqZKErzxEX1jeVcHTWTBdTImYaiA6zs1KKCSgo+rR3G0GWxvyWt9XCZwZD/5E+MYK3pxWFduKmmsEjSYrCgQ7Yhwe2bQg2bvX2HPScfo+yKVoIzQHNArqDr2NVTaWRUt2zN3GoLzSDxe5YgDjHXyo0ES6mEbEKsKy4dYDD7uRS/rdHRTTHUdih5i169sHvJlj0UFyaU8MV+J/dxbuMmNysqOmzVUU18oWQntE48RN35/Js=");
IRestResponse response = client.Execute(request);
Console.WriteLine(response.Content);
But how do I find the cookie data in Xamarin, and how do I call this URL?
If you have any idea how to make this HTTP GET against a protected web API, it will help me so much!
Thanks a lot. If you need more precision, let me know ;)
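For what it's worth, with Azure App Service authentication ("Easy Auth") you normally don't copy cookies out of Postman; the documented pattern is to exchange the provider token for a site session token and send it in the X-ZUMO-AUTH header. A sketch under that assumption (endpoint names follow Azure's documented /.auth/login/<provider> pattern; verify them for your app, and note the JObject parsing via Json.NET is my choice, not from the question):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class ApiClient
{
    static async Task<string> GetCoordinatesJsonAsync(string providerAccessToken)
    {
        using var client = new HttpClient();

        // 1. Exchange the Google access token for an App Service session token.
        var loginBody = new StringContent(
            "{\"access_token\":\"" + providerAccessToken + "\"}",
            Encoding.UTF8, "application/json");
        var loginResponse = await client.PostAsync(
            "https://testapp.azurewebsites.net/.auth/login/google", loginBody);
        var loginJson = await loginResponse.Content.ReadAsStringAsync();

        // The login response carries {"authenticationToken": "..."}.
        var authToken = JObject.Parse(loginJson)["authenticationToken"].ToString();

        // 2. Call the protected API with the session token instead of cookies.
        client.DefaultRequestHeaders.Add("X-ZUMO-AUTH", authToken);
        return await client.GetStringAsync(
            "https://testapp.azurewebsites.net/api/Test/allCoordinates");
    }
}
```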

WebScraping from aspx page using WebClient C#

I'm trying to crawl data from an aspx page, which has three dropdowns: State, District, and City. They are implemented as dependent dropdowns with server-side postback.
I have all the ids for the State, District, and City. I am writing a console application using WebClient to post all three dropdown ids as form data to the page, but every time it redirects to an error page. Can anyone help me set all the dropdown values with a single POST call?
Code Snippet:
var formValues = new NameValueCollection();
formValues["__VIEWSTATE"] = Extract("__VIEWSTATE", responseString);
formValues["__EVENTVALIDATION"] = Extract("__EVENTVALIDATION", responseString);
formValues["ddlSelectLanguage"] = "en-US";
formValues["ddlState"] = "19";
formValues["DDLDistrict"] = "237";
formValues["DDLVillage"] = "bcab59fd-35d2-e111-882d-001517f1d35c";
client.Headers.Set(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36");
var responseData = client.UploadValues(firstPage, formValues);
responseString = Encoding.ASCII.GetString(responseData);
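One likely reason for the error page: a WebForms postback usually also needs __EVENTTARGET set to the control that "changed", and dependent dropdowns typically require one POST per dropdown, each carrying the fresh __VIEWSTATE from the previous response, rather than a single combined POST. A sketch of that chaining (reusing the Extract helper and control ids from the snippet above; the exact ids and order must be confirmed against the page):

```csharp
// Requires: using System.Collections.Specialized; using System.Net; using System.Text;
// Extract(name, html) is the helper from the question that pulls a hidden field's value.
static string PostDropdown(WebClient client, string url, string html,
                           string eventTarget, NameValueCollection values)
{
    var form = new NameValueCollection
    {
        ["__EVENTTARGET"]     = eventTarget,               // control that triggered the postback
        ["__EVENTARGUMENT"]   = "",
        ["__VIEWSTATE"]       = Extract("__VIEWSTATE", html),
        ["__EVENTVALIDATION"] = Extract("__EVENTVALIDATION", html),
    };
    form.Add(values);                                      // the dropdown selections so far
    var responseData = client.UploadValues(url, form);
    return Encoding.UTF8.GetString(responseData);
}

// Usage sketch: one postback per dropdown, re-reading the hidden fields each time.
// html = PostDropdown(client, firstPage, html, "ddlState",
//     new NameValueCollection { ["ddlSelectLanguage"] = "en-US", ["ddlState"] = "19" });
// html = PostDropdown(client, firstPage, html, "DDLDistrict",
//     new NameValueCollection { ["ddlSelectLanguage"] = "en-US", ["ddlState"] = "19",
//                               ["DDLDistrict"] = "237" });
```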

Call a Web Service reference Request in C#

I am currently trying to figure out how to set up some testing for web services in C#.
I have referenced the web services in my project and have populated the request; I am just wondering how I can call the request method.
Below is the existing code. I am trying to simulate using the AddNewResponder web service. All of the items that the web service asks for are populated below; I just can't seem to figure out how to execute the web service code.
static void Main(string[] args)
{
    int testID = 0;
    // populate the test user with user data
    TestUser tUser = GetUserData(testID);

    // Create Request Body
    RCWS.AddNewResponderRequestBody respRequestBody = new RCWS.AddNewResponderRequestBody();
    respRequestBody.PriorityCode = tUser.PriCode;
    respRequestBody.ClientCode = "TestData";
    respRequestBody.Domain = "TestDomain";
    respRequestBody.IPAddress = "192.168.2.1";
    respRequestBody.Source = "web";
    respRequestBody.OS = "WinNT";
    respRequestBody.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
    respRequestBody.Browser = "Chrome";

    // Create Request
    RCWS.AddNewResponderRequest addNewResp = new RCWS.AddNewResponderRequest(respRequestBody);
}
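A service reference generates a client class alongside the request/response types, and that client exposes one method per operation. A sketch of invoking it; the client class name below is a guess (look for the generated *Client type in the RCWS namespace via Object Browser or IntelliSense):

```csharp
// Hypothetical client name: substitute the actual generated *Client class.
var client = new RCWS.ResponderServiceClient();
try
{
    // The generated operation takes the request object built above and
    // returns the matching response type from the WSDL.
    RCWS.AddNewResponderResponse response = client.AddNewResponder(addNewResp);
    client.Close();
}
catch
{
    client.Abort();   // standard WCF pattern: Abort (not Close) a faulted client
    throw;
}
```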
