HttpClient returning special characters but nothing readable - C#

I am trying to download a webpage using async/await and HttpClient, but I am getting only a string full of special characters. The code looks like this:
static async void DownloadPageAsync(string url)
{
    HttpClient client = new HttpClient();
    client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
    client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
    client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
    client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");

    HttpResponseMessage response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();

    var responseStream = await response.Content.ReadAsStreamAsync();
    var streamReader = new StreamReader(responseStream);
    var str = streamReader.ReadToEnd();
}
and the url is
url = @"http://www.nseindia.com/live_market/dynaContent/live_watch/live_index_watch.htm";
When I used
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)");
in place of those four DefaultRequestHeaders lines, I got a 403 error, but this is the NSE site and it is free for all. Please help me get a correct response.
regards
Srivastava

client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
With this header you tell the server that it may compress the response with gzip or deflate. So the response actually is compressed, which explains the unreadable text you get.
If you want plain text, simply don't send the header; then the server won't compress the response. If you remove the line above, you get a normal HTML response.
Alternatively, you can of course keep the header and decompress the response with GZipStream after receiving it. That works like this:
using (var responseStream = await response.Content.ReadAsStreamAsync())
using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
using (var streamReader = new StreamReader(decompressedStream))
{
    var str = streamReader.ReadToEnd();
    Console.WriteLine(str);
}
Ideally, you should check the value of response.Content.Headers.ContentEncoding (or GetValues("Content-Encoding")) to make sure the encoding really is gzip. Since you also accept deflate, you would then use DeflateStream to decode that, and decode nothing at all when the Content-Encoding header is missing.
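For example, a minimal sketch of that check (assuming response is the HttpResponseMessage from the code above, with System.Linq and System.IO.Compression imported):
var responseStream = await response.Content.ReadAsStreamAsync();
var encoding = response.Content.Headers.ContentEncoding.FirstOrDefault();

Stream decoded;
if (encoding == "gzip")
    decoded = new GZipStream(responseStream, CompressionMode.Decompress);
else if (encoding == "deflate")
    decoded = new DeflateStream(responseStream, CompressionMode.Decompress);
else
    decoded = responseStream; // no Content-Encoding header: read the body as-is

using (var streamReader = new StreamReader(decoded))
{
    Console.WriteLine(streamReader.ReadToEnd());
}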

Related

C# WebClient receives 403 when getting html from a site

I am trying to download the HTML from a site and parse it. I am actually interested only in the OpenGraph data in the head section. For most sites WebClient, HttpClient or HtmlAgilityPack works, but for some domains I get a 403, for example westelm.com.
I have tried setting the headers to be exactly the same as they are when I use the browser, but I still get a 403. Here is some code:
string url = "https://www.westelm.com/m/products/brushed-herringbone-throw-t5792/?";
var doc = new HtmlDocument();
using (WebClient client = new WebClient())
{
    client.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36";
    client.Headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
    client.Headers["Accept-Encoding"] = "gzip, deflate, br";
    client.Headers["Accept-Language"] = "en-US,en;q=0.9";
    doc.Load(client.OpenRead(url));
}
At this point, I am getting a 403.
Am I missing something, or is the site administrator protecting the site from API requests?
How can I make this work? Is there a better way to get OpenGraph data from a site?
Thanks.
I used your question to resolve the same problem. I don't know if you have already fixed this, but here is how it worked for me.
A page was giving me a 403 for the same reason. The thing is: you need to emulate a web browser from the code by sending a lot of headers.
I also used one of your headers that I wasn't sending (Accept-Language).
I didn't use WebClient though; I used HttpClient to request the page:
private static async Task<string> GetHtmlResponseAsync(HttpClient httpClient, string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url));
    request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
    request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate, br");
    request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36");
    request.Headers.TryAddWithoutValidation("Accept-Charset", "UTF-8");
    request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

    using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    if (response == null)
        return string.Empty;

    // Note: this assumes the server answered with gzip; check the
    // Content-Encoding response header before picking a decompressor.
    using var responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false);
    using var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress);
    using var streamReader = new StreamReader(decompressedStream);
    return await streamReader.ReadToEndAsync().ConfigureAwait(false);
}
If it helps you, I'm glad. If not, I will leave this answer here to help someone else in the future!
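As a side note: instead of decompressing by hand, you can let the handler do it. A minimal sketch using HttpClientHandler.AutomaticDecompression, which also sets the Accept-Encoding header for you (Brotli, "br", is only decompressed automatically on newer runtimes that support DecompressionMethods.Brotli):
var handler = new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
};
var httpClient = new HttpClient(handler);
string html = await httpClient.GetStringAsync(url); // url as in the question above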

HttpClient can't parse "UTF-8" Content-Type

I am experiencing a known bug in HttpClient: any time the server response declares its charset as "UTF-8" (including the quotes), an exception is triggered:
The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set. ---> System.ArgumentException: '"utf-8"' is not a supported encoding name.
Example code:
HttpClient _client = new HttpClient();
HttpRequestMessage requestMessage = new HttpRequestMessage(HttpMethod.Get, "https://www.facebook.com");
requestMessage.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.4044.55 Safari/537.36");
HttpResponseMessage response = _client.SendAsync(requestMessage).GetAwaiter().GetResult();
What is the usual workaround? I am using .NET Framework 4.6.1.
To work around the referenced issue, read the raw bytes and decode them yourself, bypassing the charset parsing that chokes on the quoted value:
using (var client = new HttpClient())
{
    HttpRequestMessage requestMessage = new HttpRequestMessage(HttpMethod.Get, "https://www.facebook.com");
    HttpResponseMessage response = await client.SendAsync(requestMessage);

    // decode manually instead of calling ReadAsStringAsync()
    byte[] buf = await response.Content.ReadAsByteArrayAsync();
    string content = Encoding.UTF8.GetString(buf);
}
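Alternatively, you can normalize the offending header before reading the content as a string. A sketch, assuming the only problem is the quotes around the charset (e.g. Content-Type: text/html; charset="utf-8"):
var contentType = response.Content.Headers.ContentType;
if (contentType?.CharSet != null)
{
    // strip the quotes so the charset parses as a valid encoding name
    contentType.CharSet = contentType.CharSet.Trim('"');
}
string content = await response.Content.ReadAsStringAsync();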

Not able to access a page from US as a country

I want to access the following URL as if from the US:
url = http://www.tillys.com/product/Say-What/Short-Dresses/SAY-WHAT--Ribbed-Tank-Midi-Dress/Heather-Grey/285111595
I've tried cookies and everything else, but the URL still redirects to the site's home page.
I want to see if there is any way I can access this page. Below is the function I am using:
public static string getUrlContent(string url)
{
    var myHttpWebRequest = (HttpWebRequest)WebRequest.Create(url);
    myHttpWebRequest.Method = "GET";
    myHttpWebRequest.AllowAutoRedirect = true;
    myHttpWebRequest.ContentLength = 0;
    myHttpWebRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
    myHttpWebRequest.Headers.Add("Cookie", "=en%5FUS;");
    myHttpWebRequest.UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36";
    //myHttpWebRequest.Headers.Add("Accept-Encoding", "gzip, deflate, sdch");
    myHttpWebRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
    myHttpWebRequest.Headers.Add("Cookie", "wlcme=true");
    //myHttpWebRequest.CookieContainer = new CookieContainer();
    //myHttpWebRequest.Headers.Add("X-Macys-ClientId", "NavApp");

    var response = (HttpWebResponse)myHttpWebRequest.GetResponse();
    var rmyResponseHeaders = response.Headers;
    Console.WriteLine("Content length is {0}", response.ContentLength);
    Console.WriteLine("Content type is {0}", response.ContentType);

    // Get the stream associated with the response and read it exactly once;
    // a second ReadToEnd() on the same reader would only return an empty string.
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
    var jsonStr = readStream.ReadToEnd();
    Console.WriteLine(jsonStr);
    return jsonStr;
}
If www.tillys.com is using geo-fencing, it will serve different content based on a lookup of the requesting IP address. In that case there's nothing C#, or any other language, can do about it.
You'll need to either proxy your request through a VPN (see How to send WebRequest via proxy?) or deploy your code to a data center in the US. For example, if you use Azure, you can deploy to several data centers throughout the world, including several in the US. Once your code is running in the US, it should be able to access the US version of the page.
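For the proxy route, a minimal sketch (the proxy address below is a placeholder, not a real endpoint; substitute a US-based proxy you actually control):
var myHttpWebRequest = (HttpWebRequest)WebRequest.Create(url);
// hypothetical US proxy endpoint
myHttpWebRequest.Proxy = new WebProxy("http://us-proxy.example.com:8080");
var response = (HttpWebResponse)myHttpWebRequest.GetResponse();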

C# HttpWebRequest POST and GET methods (with CookieContainer)

I'm trying to log in to an Xbox Live page and I'm having problems with that. I have no idea why; I think I set everything properly. Here is my code:
CookieCollection cookies = new CookieCollection();
HttpWebRequest Request = (HttpWebRequest)WebRequest.Create("https://account.xbox.com/en-US/PaymentAndBilling/RedeemCode");
Request.CookieContainer = new CookieContainer();
Request.CookieContainer.Add(cookies);
//Request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";
HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
Response.Cookies.Add(cookies);
Response.Close();

HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create("https://login.live.com/");
getRequest.Method = "POST";
getRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";
getRequest.CookieContainer = new CookieContainer();
getRequest.CookieContainer.Add(cookies);

string postData = String.Format("login=/*dd*/&passwd=/*pass*/");
byte[] byteArray = Encoding.ASCII.GetBytes(postData);
getRequest.ContentLength = byteArray.Length;

Stream newStream = getRequest.GetRequestStream(); // open connection
newStream.Write(byteArray, 0, byteArray.Length);  // send the data
newStream.Close();

HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();
getResponse.Cookies = cookies;
StreamReader sr1 = new StreamReader(getResponse.GetResponseStream());
string sourceCode = sr1.ReadToEnd();
richTextBox1.Text = sourceCode;
sr1.Close();
I would really appreciate any help, or any pointers to where I can find an explanation of CookieContainer, HTTP protocols in C#, etc., as this is my first program working with WebRequests. Thank you in advance.
Okay, you're going to hate me, but if I had any choice I would not use HttpWebRequest; I would use Selenium WebDriver: http://www.seleniumhq.org/projects/webdriver/
It's so easy because it uses a full-blown browser instead of making you maintain cookies yourself. And if you need to run it interactively, or without the GUI, you can use SimpleDriver().
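For reference, a minimal sketch of that approach, assuming the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages are installed:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    // the browser keeps cookies and follows redirects for you
    driver.Navigate().GoToUrl("https://account.xbox.com/en-US/PaymentAndBilling/RedeemCode");
    // fill in the login form via driver.FindElement(By...) here, then:
    string sourceCode = driver.PageSource;
}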

.Net C# : Read attachment from HttpWebResponse

Is it possible to read an image attachment from System.Net.HttpWebResponse?
I have a URL to a Java page which generates images.
When I open the URL in Firefox, the download dialog appears; the content type is application/png. That seems to work.
When I try this in C# with a GET request, I get back the content type text/html and no Content-Disposition header.
Simple Code:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
response.GetResponseStream() is empty.
The same attempt in Java was successful.
Do I have to prepare the web request somehow, or is something else going on?
You probably need to set a User-Agent header.
Run Fiddler and compare the requests.
Writing something in the UserAgent property of the HttpWebRequest does indeed make a difference in a lot of cases. A common practice for web services seems to be to ignore requests with an empty UserAgent.
See: Webmasters: Interpretation of empty User-agent
Simply set the UserAgent property to a non-empty string. You can, for example, use the name of your application, assembly information, impersonate a common UserAgent, or use something else identifying.
Examples:
request.UserAgent = "my example program v1";
request.UserAgent = $"{System.Reflection.Assembly.GetExecutingAssembly().GetName().Name.ToString()} v{System.Reflection.Assembly.GetExecutingAssembly().GetName().Version.ToString()}";
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
And just to give a full working example:
using System.IO;
using System.Net;

void DownloadFile(Uri uri, string filename)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    request.Timeout = 10000;
    request.Method = "GET";
    request.UserAgent = "my example program v1";

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream receiveStream = response.GetResponseStream())
    using (FileStream fileStream = File.Create(filename))
    {
        // stream the response body straight into the output file
        receiveStream.CopyTo(fileStream);
    }
}
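For example, to call it with a hypothetical image URL:
DownloadFile(new Uri("https://example.com/generated/image.png"), "image.png");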
