HTMLAgilityPack get class innerText - c#

I am trying to get the InnerText of an element that I select by its class.
This is my code:
using (HttpClient clientduplicate = new HttpClient())
{
    clientduplicate.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident / 6.0)");
    using (HttpResponseMessage responseduplicate = await clientduplicate.GetAsync(@"https://www.investing.com/news/stock-market-news/warren-buffett:-i-bought-$12-billion-of-stock-after-trump-won-456954"))
    using (HttpContent contentduplicate = responseduplicate.Content)
    {
        try
        {
            string resultduplicate = await contentduplicate.ReadAsStringAsync();
            var websiteduplicate = new HtmlDocument();
            websiteduplicate.LoadHtml(resultduplicate);
            // Select the article container by its exact class attribute value.
            var titlesduplicate = websiteduplicate.DocumentNode.Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "arial_14 clear WYSIWYG newsPage");
            // Keep only the text that precedes the first HTML comment.
            var match = Regex.Match(titlesduplicate.InnerText, @"(.*?)<!--", RegexOptions.Singleline).Groups[1].Value;
            Debug.WriteLine(match.TrimStart());
        }
        catch (Exception ex1)
        {
            var dialog2 = new MessageDialog(ex1.Message);
            await dialog2.ShowAsync();
        }
    }
}
The problem is that this also returns the text attached to the picture. I can find a workaround, but I was wondering whether there is another approach,
something simpler or faster.
Also, when I use this on other articles/URLs, other minor bugs show up.

There are many ways to do this. One way is to remove the carousel div before getting innerText:
doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.Id.Equals("imgCarousel"))?.Remove();
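
The idea can be sketched end to end as follows. This is a minimal illustration that assumes the HtmlAgilityPack NuGet package is referenced. The `ArticleText` class, the `Extract` method and the inline sample markup are hypothetical stand-ins for the real page; only the `imgCarousel` id and the article class name come from the question. Removing the comment nodes is an extra step that might make the regex on `<!--` unnecessary.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ArticleText
{
    // Hypothetical helper: returns the article text without the carousel caption.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Drop the carousel (picture plus caption) before reading any text.
        doc.GetElementbyId("imgCarousel")?.Remove();

        // Drop HTML comments as well; ToList() avoids modifying the tree while enumerating it.
        foreach (var comment in doc.DocumentNode.Descendants().OfType<HtmlCommentNode>().ToList())
        {
            comment.Remove();
        }

        var article = doc.DocumentNode
            .Descendants("div")
            .FirstOrDefault(d => d.GetAttributeValue("class", "") == "arial_14 clear WYSIWYG newsPage");

        // DeEntitize turns entities such as &amp; back into plain characters.
        return article == null ? string.Empty : HtmlEntity.DeEntitize(article.InnerText).Trim();
    }

    public static void Main()
    {
        // Made-up sample markup standing in for the downloaded page.
        const string sample =
            "<div class=\"arial_14 clear WYSIWYG newsPage\">" +
            "<div id=\"imgCarousel\"><img src=\"x.jpg\"><span>Caption text</span></div>" +
            "<p>Article body.</p><!-- tracking --></div>";

        Console.WriteLine(Extract(sample));
    }
}
```

Returning an empty string when the container is missing also avoids the NullReferenceException that a bare `FirstOrDefault(...).InnerText` would throw on pages with a different layout.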

Related

Unable to fetch data using HttpWebRequest or HtmlAgilityPack

I am trying to make a web scraper in C# for NSE. The code works with other sites, but when run against https://www.nseindia.com/ it fails with the error: An error occurred while sending the request. Unable to read data from the transport connection: Operation timed out.
I have tried two different approaches, Try1() and Try2().
Can anyone please tell me what I am missing in my code?
class Program
{
public void Try1() {
HtmlWeb web = new HtmlWeb();
HttpStatusCode statusCode = HttpStatusCode.OK;
web.UserAgent = GetUserAgent();
web.PostResponse = (request, response) =>
{
if (response != null)
{
statusCode = response.StatusCode;
Console.WriteLine("Status Code: " + statusCode);
}
};
Task<HtmlDocument> task = web.LoadFromWebAsync(GetURL());
HtmlDocument document = task.Result;
}
public void Try2() {
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(GetURL());
request.UserAgent = GetUserAgent();
request.Accept= "*/*;";
using (var response = (HttpWebResponse)(request.GetResponse()))
{
HttpStatusCode code = response.StatusCode;
if (code == HttpStatusCode.OK)
{
using (StreamReader streamReader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(streamReader);
Console.WriteLine("Document Loaded.");
}
}
}
}
private string GetURL() {
// return "https://html-agility-pack.net/";
return "https://www.nseindia.com/";
}
private string GetUserAgent() {
return "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36";
}
}
Your request is missing headers such as Accept, so the server does not respond.
Besides that, I would recommend using HttpClient instead of HttpWebRequest:
public static async Task GetHtmlData(string url)
{
    HttpClient httpClient = new HttpClient();
    using (var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url)))
    {
        request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml, charset=UTF-8, text/javascript, */*; q=0.01");
        request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate, br");
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137");
        request.Headers.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        using (var response = await httpClient.SendAsync(request).ConfigureAwait(false))
        {
            response.EnsureSuccessStatusCode();
            using (var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
            // Assumes the server answers with gzip; a br or deflate response would fail here.
            using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
            using (var streamReader = new StreamReader(decompressedStream))
            {
                var result = await streamReader.ReadToEndAsync().ConfigureAwait(false);
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.OptionFixNestedTags = true;
                htmlDoc.LoadHtml(result);
                Console.WriteLine(result);
                Console.WriteLine("Document Loaded.");
            }
        }
    }
}
Use it like this:
await GetHtmlData("https://www.nseindia.com/");
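
As an alternative to wrapping the response stream in a GZipStream by hand, the handler can be asked to decompress responses itself. The sketch below is only an illustration of that option: `HtmlFetcher`, `CreateClient` and `GetHtmlAsync` are hypothetical names, the header values are examples, and whether this particular site accepts the request is not something the sketch can guarantee.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class HtmlFetcher
{
    // Hypothetical factory: a client whose handler inflates gzip/deflate responses itself.
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            // The handler sends the matching Accept-Encoding header and
            // decompresses the body before the caller reads it.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36");
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "Accept",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        return client;
    }

    // Downloads the page and returns the already-decompressed HTML as a string.
    public static async Task<string> GetHtmlAsync(HttpClient client, string url)
    {
        using (var response = await client.GetAsync(url).ConfigureAwait(false))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Pass the URL to download as the first argument.");
            return;
        }

        using (var client = CreateClient())
        {
            string html = GetHtmlAsync(client, args[0]).GetAwaiter().GetResult();
            Console.WriteLine("Downloaded " + html.Length + " characters.");
        }
    }
}
```

With this setup the returned string can be passed straight to `HtmlDocument.LoadHtml`, and the code no longer depends on the server choosing gzip specifically.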

Browser content different from HttpRequest

Hello, I'm trying to simulate a browser, and this specific URL is not working well:
https://www.btgpactualdigital.com/fundos-de-investimento/spx-nimitz-fic-fim-access
It always returns the same response code. For testing purposes I'm using this simple code:
using (var client = new WebClient())
{
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0";
client.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
client.Headers[HttpRequestHeader.AcceptLanguage] = "en-us,en;q=0.5";
var responseStream = new GZipStream(client.OpenRead(url), CompressionMode.Decompress);
var reader = new StreamReader(responseStream);
textResponse = reader.ReadToEnd();
}
But it always returns a different code from what I can see in the browser. I copied all the headers from the browser to try to simulate the same scenario.
Any suggestions? Thanks!

Get all images from website has block calls

I'm trying to get all images from a page:
public async Task<PhotoURL> GetImagePortal()
{
strLinkPage = "http://www.propertyguru.com.sg/listing/19077438";
var lstString = new List<string>();
int itotal = default(int);
HttpClient client = new HttpClient();
var doc = new HtmlAgilityPack.HtmlDocument();
string strHtml = await client.GetStringAsync(strLinkPage);
doc.LoadHtml(strHtml);
var pageHtml = doc.DocumentNode;
if (pageHtml != null)
{
var projectRoot = pageHtml.SelectSingleNode("//*[contains(@class,'submain')]");
//var projectChild = projectRoot.SelectSingleNode("div/div[2]");
var imgRoot = projectRoot.SelectSingleNode("//*[contains(@class,'white-bg-padding')]");
var imgChilds = imgRoot.SelectNodes("div[1]/div[1]/ul[1]/li");
itotal = imgChilds.Count();
foreach (var imgItem in imgChilds)
{
string linkImage = imgItem.SelectSingleNode("img").Attributes["src"].Value;
lstString.Add(linkImage);
}
}
return await Task.Run(() => new PhotoURL { total = itotal, URL = lstString });
}
At the line
string strHtml = await client.GetStringAsync(strLinkPage);
I get error 405 (Method Not Allowed).
I also tried WebClient and HttpWebRequest.
Please help!
The site requires a User-Agent header. Since you are using a plain HttpClient without any options, the request carries no user agent, so the site does not treat it as a genuine browser request.
Try this:
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
Or use any other user agent string you prefer.

Server returning 500 error when using WebRequest to get XML document

Here is my code to get the xml document from a url that is passed in.
var request = WebRequest.Create(url);
request.Method = "GET";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = 0;
var response = request.GetResponse(); // Error is thrown here
When I copy and paste the url into my browser it works just fine.
Here is the complete xml that is returned
<Model>
<Item>
<Id>7908</Id>
</Item>
</Model>
Is the xml in an incorrect format? I have tried changing the content type to be application/xml but I still get this error.
EDIT=======================================================
I am trying to use webclient using this code:
using (var wc = new System.Net.WebClient())
{
wc.Headers["Method"] = "GET";
wc.Headers["ContentType"] = "text/xml;charset=\"utf-8\"";
wc.Headers["Accept"] = "text/xml, */*";
wc.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-us";
wc.Headers["KeepAlive"] = "true";
wc.Headers["AutomaticDecompression"] = (DecompressionMethods.Deflate | DecompressionMethods.GZip).ToString();
var response = wc.DownloadString(url);
}
The response string is empty!!! Any ideas why this is not returning any result but pasting the url into the browser returns the xml?
I finally got it working. I had to use this code:
using (var wc = new System.Net.WebClient())
{
wc.Headers["Method"] = "GET";
wc.Headers["Accept"] = "application/xml";
var response = wc.DownloadString(url);
}
The key was using the accept header of "application/xml" otherwise the response would come back empty.
This should hopefully do the trick:
try
{
using(var response = (HttpWebResponse)request.GetResponse())
{
// Do things
}
}
catch(WebException e)
{
// Handled!...
}
Try what Joel Lee suggested if this fails.
Why not use a WebClient instead?
public class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
var request = base.GetWebRequest(address);
if (request.GetType() == typeof(HttpWebRequest)){
((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36";
}
return request;
}
}
using(var wc = new MyWebClient()){
var response = wc.DownloadString(url);
//do stuff with response
}

C# WebClient login to accounts.google.com

I am having a very difficult time trying to authenticate to accounts.google.com using WebClient.
I'm using the C# WebClient object to achieve the following.
I'm submitting form fields to https://accounts.google.com/ServiceLoginAuth?service=oz
Here are the POST fields:
service=oz
dsh=-8355435623354577691
GALX=33xq1Ma_CKI
timeStmp=
secTok=
Email=test@test.xom
Passwd=password
signIn=Sign in
PersistentCookie=yes
rmShown=1
When the login page loads, before I submit any data, the response has the following headers:
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=2592000; includeSubDomains
Set-Cookie: GAPS=1:QClFh_dKle5DhcdGwmU3m6FiPqPoqw:SqdLB2u4P2oGjt_x;Path=/;Expires=Sat, 21-Dec-2013 07:31:40 GMT;Secure;HttpOnly
Cache-Control: no-cache, no-store
Pragma: no-cache
Expires: Mon, 01-Jan-1990 00:00:00 GMT
X-Frame-Options: Deny
X-Auto-Login: realm=com.google&args=service%3Doz%26continue%3Dhttps%253A%252F%252Faccounts.google.com%252FManageAccount
Content-Encoding: gzip
Transfer-Encoding: chunked
Date: Thu, 22 Dec 2011 07:31:40 GMT
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
OK, now how do I use the WebClient class to include those headers?
I have tried webClient_.Headers.Add(), but it has limited effect and I always get the login page back.
Below is the class that I use. I would appreciate any help.
Getting the login page:
public void LoginPageRequest(Account acc)
{
var rparams = new RequestParams();
rparams.URL = @"https://accounts.google.com/ServiceLoginAuth?service=oz";
rparams.RequestName = "LoginPage";
rparams.Account = acc;
webClient_.DownloadDataAsync(new Uri(rparams.URL), rparams);
}
void webClient__DownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
{
RequestParams rparams = (RequestParams)e.UserState;
if (rparams.RequestName == "LoginPage")
{
ParseLoginRequest(e.Result, e.UserState);
}
}
Now getting the form fields using HtmlAgilityPack and adding them to the parameters collection:
public void ParseLoginRequest(byte[] data, object UserState)
{
RequestParams rparams = (RequestParams)UserState;
rparams.ClearParams();
ASCIIEncoding encoder = new ASCIIEncoding();
string html = encoder.GetString(data);
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode form = doc.GetElementbyId("gaia_loginform");
rparams.URL = form.GetAttributeValue("action", string.Empty);
rparams.RequestName = "LoginPost";
var inputs = form.Descendants("input");
foreach (var element in inputs)
{
string name = element.GetAttributeValue("name", "undefined");
string value = element.GetAttributeValue("value", "");
if (!name.Equals("undefined")) {
if (name.ToLower().Equals("email"))
{
value = rparams.Account.Email;
}
else if (name.ToLower().Equals("passwd"))
{
value = rparams.Account.Password;
}
rparams.AddParam(name,value);
Console.WriteLine(name + "-" + value);
}
}
webClient_.UploadValuesAsync(new Uri(rparams.URL),"POST", rparams.GetParams,rparams);
}
After I post the data I get the login page back rather than a redirect or a success message.
What am I doing wrong?
After some fiddling around, it looks like the WebClient class is not the best approach to this particular problem.
To achieve this goal I had to drop one level down to WebRequest.
When making the request with HttpWebRequest and reading it with HttpWebResponse, it is possible to set a CookieContainer:
webRequest_ = (HttpWebRequest)HttpWebRequest.Create(rparams.URL);
webRequest_.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
CookieContainer cookieJar = new CookieContainer();
webRequest_.CookieContainer = cookieJar;
string html = string.Empty;
try
{
using (WebResponse response = webRequest_.GetResponse())
{
using (var streamReader = new StreamReader(response.GetResponseStream()))
{
html = streamReader.ReadToEnd();
ParseLoginRequest(html, response,cookieJar);
}
}
}
catch (WebException e)
{
using (WebResponse response = e.Response)
{
HttpWebResponse httpResponse = (HttpWebResponse)response;
Console.WriteLine("Error code: {0}", httpResponse.StatusCode);
using (var streamReader = new StreamReader(response.GetResponseStream()))
Console.WriteLine(html = streamReader.ReadToEnd());
}
}
Then, when making the POST, reuse the same CookieContainer in the following manner:
webRequest_ = (HttpWebRequest)HttpWebRequest.Create(rparams.URL);
webRequest_.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
webRequest_.Method = "POST";
webRequest_.ContentType = "application/x-www-form-urlencoded";
webRequest_.CookieContainer = cookieJar;
var parameters = new StringBuilder();
foreach (var key in rparams.Params)
{
parameters.AppendFormat("{0}={1}&",HttpUtility.UrlEncode(key.ToString()),
HttpUtility.UrlEncode(rparams.Params[key.ToString()]));
}
parameters.Length -= 1;
using (var writer = new StreamWriter(webRequest_.GetRequestStream()))
{
writer.Write(parameters.ToString());
}
string html = string.Empty;
using (WebResponse response = webRequest_.GetResponse())
{
using (var streamReader = new StreamReader(response.GetResponseStream()))
{
html = streamReader.ReadToEnd();
}
}
So this works. The code is not meant for production use and can, and should, be optimized.
Treat it just as an example.
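
The same "one cookie jar for the GET and the POST" idea can also be expressed with HttpClient. The following is a rough sketch under stated assumptions, not a tested login flow: `CookieSession`, `CreateClient` and `LoginAsync` are hypothetical names, the URLs and form fields are supplied by the caller, and whether a given site accepts such a scripted login is outside what this sketch can promise.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class CookieSession
{
    // Hypothetical helper: every request sent through the returned client shares one cookie jar.
    public static HttpClient CreateClient(CookieContainer cookieJar)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookieJar,
            UseCookies = true,
            AllowAutoRedirect = true
        };
        return new HttpClient(handler);
    }

    // GET the login page first (this collects the cookies), then POST the form with the same jar.
    public static async Task<string> LoginAsync(
        HttpClient client, string loginPageUrl, string postUrl, IDictionary<string, string> fields)
    {
        string loginPage = await client.GetStringAsync(loginPageUrl).ConfigureAwait(false);
        // In a real flow the hidden inputs would be parsed out of loginPage
        // and merged into 'fields' before posting.

        using (var body = new FormUrlEncodedContent(fields))
        using (var response = await client.PostAsync(postUrl, body).ConfigureAwait(false))
        {
            return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Pass the login page URL and the POST URL as arguments.");
            return;
        }

        var cookieJar = new CookieContainer();
        using (var client = CreateClient(cookieJar))
        {
            // Placeholder credentials; replace with real form field names and values.
            var fields = new Dictionary<string, string>
            {
                { "Email", "user@example.com" },
                { "Passwd", "password" }
            };
            string html = LoginAsync(client, args[0], args[1], fields).GetAwaiter().GetResult();
            Console.WriteLine("Response length: " + html.Length);
            Console.WriteLine("Cookies stored: " + cookieJar.Count);
        }
    }
}
```

Because the handler owns the CookieContainer, there is no need to attach the jar to each request by hand as with HttpWebRequest.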
This is a quick example written in the answer pane and untested. You will probably need to parse some values out of an initial request and put them into formData. Much of my code follows this kind of process, unless we need to scrape sites like Facebook or Spokeo, where the AJAX makes us use a different approach.
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Net; // WebClient, WebRequest, HttpWebRequest, CookieContainer
using System.Text;
namespace GMailTest
{
class Program
{
private static NameValueCollection formData = new NameValueCollection();
private static CookieAwareWebClient webClient = new CookieAwareWebClient();
static void Main(string[] args)
{
formData.Clear();
formData["service"] = "oz";
formData["dsh"] = "-8355435623354577691";
formData["GALX"] = "33xq1Ma_CKI";
formData["timeStmp"] = "";
formData["secTok"] = "";
formData["Email"] = "test@test.xom";
formData["Passwd"] = "password";
formData["signIn"] = "Sign in";
formData["PersistentCookie"] = "yes";
formData["rmShown"] = "1";
byte[] responseBytes = webClient.UploadValues("https://accounts.google.com/ServiceLoginAuth?service=oz", "POST", formData);
string responseHTML = Encoding.UTF8.GetString(responseBytes);
}
}
public class CookieAwareWebClient : WebClient
{
public CookieAwareWebClient() : this(new CookieContainer())
{ }
public CookieAwareWebClient(CookieContainer c)
{
this.CookieContainer = c;
this.Headers.Add("User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5");
}
public CookieContainer CookieContainer { get; set; }
protected override WebRequest GetWebRequest(Uri address)
{
WebRequest request = base.GetWebRequest(address);
if (request is HttpWebRequest)
{
(request as HttpWebRequest).CookieContainer = this.CookieContainer;
}
return request;
}
}
}
