Webservice sourced images show as ? in certain versions of stock Android Browser - c#

I am attempting to dynamically source images, using an ID rendered into the path when the page binds its data. However, the images show as a blue question mark in a box [?]. The images load fine on iOS, mobile Chrome, older versions of the Android browser (2.3), newer versions of the Android browser (4.2.2), and IE/Firefox/Chrome on desktop. So far the issue appears only on Android 4.0 and 4.1.
This is how I'm trying to load the images:
Ex. <img src="../services/getImage?id=f6c799b2-ff31-4fbc-abc9-31f20d5e69c8">
This request hits a .NET webservice (IHttpAsyncHandler implementation) which looks like this
public virtual UploadedImage getImage(Guid imageId)
{
    string eTag;
    Entities.Image.DTO image = null;
    // image = ... get image entity (lookup elided in the question)
    if (image != null)
    {
        // The ETag is derived from the last-modified timestamp
        eTag = Delta.Crypto.CreateMD5Hash(image.ModifiedDate.ToEpoch().ToString());
        string ifNoneMatch = Request.Headers[HTTPRequestHeaderKeys.IfNoneMatch];
        if (ifNoneMatch.IsNotNullOrEmpty() && ifNoneMatch == eTag)
        {
            this.RespondWithNoUpdate();
            return null;
        }
        if (image.ImageUrl.IsNullOrEmpty() || image.ImageContent == null || image.ImageContent.Length == 0)
        {
            this.RespondWithNotFound();
            return null;
        }
        Response.AddHeader(HTTPResponseHeaderKeys.ETag, eTag);
        return new UploadedImage()
        {
            // MIME type is derived from the file extension, e.g. ".png" -> "image/png"
            contentType = "image/" + System.IO.Path.GetExtension(image.ImageUrl).ToLower().Substring(1),
            fileContents = image.ImageContent,
            fileName = image.ImageUrl
        };
    }
    return null;
}
So we're setting the MIME type from the file extension, which is maybe not 100% reliable, but I have confirmed it to be correct in these cases.
Here is a copy of the Request and successful Response on my desktop Chrome browser
Request:
Accept:image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8,es;q=0.6
Connection:keep-alive
Host:localhost
Referer:http://localhost/delta/events/bigevent/app/event.html
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
Response:
Cache-Control:no-cache
Content-Disposition:inline; filename="2dab739b-a06c-4579-8555-0598d738f858_eventApplayoutContainerEventApplicationlandingScreenImageContainer_background-image.png"
Content-Length:236
Content-Type:image/png
Date:Tue, 11 Feb 2014 19:53:31 GMT
ETag:1c79507d4969ea7534f3068ca1e60be4
Expires:-1
Pragma:no-cache
My only guess is that when requesting an image in this way, the img control does not know the mime type when rendered, and thus is complaining.
Note: The request does succeed on the Android browser when accessing directly (in a separate tab).
Does anyone have any idea what may be causing the [?] and a potential solution? I haven't been able to locate much, if any documentation on the stock browser. If you have a link to some documentation, that would also be much appreciated. Thanks!
EDIT: I should note that resource images with relative paths are loading fine
Ex. <img src="../images/EmptyProfile.png">

I was actually able to figure this one out.
The root of the issue is that the Android browser on those versions does not send an Accept header with the request.
My webservice tries to negotiate a content delivery type based upon the client preferences. There was no default.
Hope this helps someone in the future!
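For anyone hitting the same thing, here is a minimal sketch of negotiation with a default. It is not the original service code: `ImageNegotiation` and `CanServe` are hypothetical names, and the sketch ignores q-values. It relies only on the HTTP rule that a request without an Accept header means the client accepts any media type.

```csharp
using System;
using System.Linq;

public static class ImageNegotiation
{
    // Hypothetical helper: returns true when the stored image type may be
    // served for the given Accept header value.
    public static bool CanServe(string acceptHeader, string imageContentType)
    {
        // Default: no Accept header (as with the affected Android browsers)
        // is treated as "any media type is acceptable".
        if (string.IsNullOrWhiteSpace(acceptHeader))
            return true;

        string contentType = imageContentType.ToLowerInvariant();
        string mainType = contentType.Split('/')[0];

        return acceptHeader
            .Split(',')
            // Keep only the media range and drop parameters such as ";q=0.8"
            .Select(part => part.Split(';')[0].Trim().ToLowerInvariant())
            .Any(range => range == "*/*"
                       || range == mainType + "/*"
                       || range == contentType);
    }
}
```

In a handler this might be called as `ImageNegotiation.CanServe(Request.Headers["Accept"], contentType)` before writing the response.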

Related

GetElementById always returns null, no matter what object/website i try

Trying to automate actions on a website. I followed a tutorial and still no luck, no matter what website, tag or class I try it always gets null.
Saw a similar problem somewhere else where they suggested the website hadn't loaded so should add "while" section, which still didn't work for me.
webBrowser1.Navigate("https://www.simplesite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
var myElement = webBrowser1.Document.GetElementById("_ctl0_Header2017_btnLogin");
myElement.InvokeMember("Click");
This is just a simple Windows Forms application. I've tried the webBrowser1_Navigated and webBrowser1_DocumentCompleted methods too.
The site responds based on how it sees your browser's capabilities, which it determines from the User-Agent value passed in the request header. You need to set the User-Agent string: the one the WebBrowser control sends differs from what Chrome/Firefox/IE send.
I've added the Chrome User-Agent string. You can find other User-Agent strings here, under the Software section:
https://developers.whatismybrowser.com/useragents/explore/
// call navigate and pass the latest Chrome User Agent string
webBrowser1.Navigate("https://www.simplesite.com/", null, null,
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36\r\n");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
// let's just wait a few milliseconds but it would be better
// if we used the DocumentCompleted event
System.Threading.Thread.Sleep(100);
}
var myElement = webBrowser1.Document.GetElementById("_ctl0_Header2017_btnLogin");
myElement.InvokeMember("Click");
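As the comment in the loop says, the DocumentCompleted event is the better place for the click. A rough sketch of that variant follows, assuming a form that owns the control; the class name `LoginClickForm` is made up for illustration, and the element ID and User-Agent string are the ones used above.

```csharp
using System;
using System.Windows.Forms;

public class LoginClickForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };

    public LoginClickForm()
    {
        Controls.Add(webBrowser1);
        // Subscribe before navigating so the event cannot be missed
        webBrowser1.DocumentCompleted += OnDocumentCompleted;
        webBrowser1.Navigate("https://www.simplesite.com/", null, null,
            "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36\r\n");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event also fires for frames; only react to the top-level document
        if (e.Url != webBrowser1.Url)
            return;

        HtmlElement loginButton = webBrowser1.Document?.GetElementById("_ctl0_Header2017_btnLogin");
        if (loginButton == null)
            return; // element not present in this document

        // Unsubscribe so later page loads do not trigger another click
        webBrowser1.DocumentCompleted -= OnDocumentCompleted;
        loginButton.InvokeMember("Click");
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new LoginClickForm());
    }
}
```

This avoids both the Application.DoEvents() loop and the Thread.Sleep call, since the handler only runs once the document is ready.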

C# HttpClient request fails to scrape (both on System.Net and Windows.Web http requests)

I am trying to scrape the news off this site: https://www.livescore.com/soccer/news/
using (Windows.Web.Http.HttpClient client = new Windows.Web.Http.HttpClient())
{
    client.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident / 6.0)");
    using (Windows.Web.Http.HttpResponseMessage response = await client.GetAsync(new Uri(pageURL)))
    using (Windows.Web.Http.IHttpContent content = response.Content)
    {
        try
        {
            string result = await content.ReadAsStringAsync();
            Debug.WriteLine(result);
        }
        catch (Exception ex)
        {
            // A try block needs a catch or finally; log the failure rather than swallow it
            Debug.WriteLine(ex);
        }
    }
}
The response I get contains "Your browser is out of date or some of its features are disabled".
I moved to Windows.Web to handle certificates since I am on UWP, and tried ignoring the following certificate errors:
HttpBaseProtocolFilter filter = new HttpBaseProtocolFilter();
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.Untrusted);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.Expired);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.IncompleteChain);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.WrongUsage);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.InvalidName);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.RevocationInformationMissing);
filter.IgnorableServerCertificateErrors.Add(ChainValidationResult.RevocationFailure);
but I am still getting the same response from the server.
Any idea how to bypass this?
Edit: They do have the old, unsecured server, http://www.livescore.com/, where I guess I can scrape everything, but the news isn't there.
I think the problem is the User-Agent string. You are telling the site that the browser you are using is Internet Explorer 10.
Look at this page, http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer, and try the User-Agent for Internet Explorer 11 (before doing this, open the page in your IE11 browser to check that it works properly).

WinForms WebBrowser control not loading whole site

I'm using WebBrowser in C# to display a website that contains a login form. After I log in successfully, some parts of the website are not shown in the WebBrowser control, but in the default browser the same site displays correctly.
Another issue is that when I click a link in the WebBrowser control, it opens in the default browser (Firefox, Chrome, etc.).
My application sets the username and password automatically for login.
ChangeUserAgent();
var request = (HttpWebRequest)WebRequest.Create(loadLinks(namewebsite));
string useragent = request.UserAgent = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393";
webBrowser2.Navigate(request.RequestUri,null,null, "User-Agent:Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393");
loadauthintication(namewebsite, _username, _password);
This is the loadauthintication method:
using (var webClient = new System.Net.WebClient())
{
    var json = webClient.DownloadString(@"http://example.com/getauthintication.php?name=" + name + "&user=" + user + "&pass=" + pass);
    // Now parse with JSON.Net
    // MessageBox.Show(json);
    string[] array = json.Split('~');
    username = array[0];
    password = array[1];
}
And this is the ChangeUserAgent method:
public void ChangeUserAgent()
{
ua = "Googlebot/2.1 (+http://www.google.com/bot.html)";
UrlMkSetSessionOption(URLMON_OPTION_USERAGENT_REFRESH, null, 0, 0);
UrlMkSetSessionOption(URLMON_OPTION_USERAGENT, ua, ua.Length, 0);
}
The problem with the WebBrowser control is that by default it renders output in IE7 rendering mode. Rick Strahl has written a very detailed post that covers some of the ways you can work around this.
The way I've usually used is to modify the registry by creating a new DWORD value under HKEY_CURRENT_USER\SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION, named for the application's executable (e.g. ConsoleApplication1.exe), and setting it to the value for the version of Internet Explorer I know will be installed in the target environment. For me that's always been IE11, a value of 11001.
NOTE: As Rick mentions in his post, bear in mind that when run through Visual Studio, the name of your exe may/will be ConsoleApplication1.vshost.exe, so you'll need to add a registry value for that too.
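If you would rather set the value from code than edit the registry by hand, something along these lines should work. This is a sketch, not taken from Rick's post: `BrowserEmulation` and `UseIe11ForCurrentProcess` are names invented here, and it assumes it runs before the first WebBrowser control is created.

```csharp
using System.Diagnostics;
using System.IO;
using Microsoft.Win32;

public static class BrowserEmulation
{
    private const string FeatureKey =
        @"HKEY_CURRENT_USER\SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION";

    // Hypothetical helper: registers the running executable for IE11 rendering.
    public static void UseIe11ForCurrentProcess()
    {
        // Resolves to e.g. "ConsoleApplication1.exe", or the vshost name
        // when the app is started through the Visual Studio hosting process
        string exeName = Path.GetFileName(Process.GetCurrentProcess().MainModule.FileName);

        // Registry.SetValue creates the key if it does not exist yet;
        // 11001 is the IE11 value mentioned above
        Registry.SetValue(FeatureKey, exeName, 11001, RegistryValueKind.DWord);
    }
}
```

Because the name is read from the running process, the vshost case from the note is covered without a second entry. HKEY_CURRENT_USER does not need administrator rights.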

How do I get protocol version from System.Web.HttpRequest?

I have an Nginx server reverse-proxying my IIS server running a .NET Runtime 4 Web Forms application.
I'm trying to find out what HTTP version (1.0 or 1.1) Nginx is using when making requests to the IIS.
How do I get the HTTP version from the current request?
I've tried searching and looking through the documentation. Closest I've found is the ProtocolVersion of System.Net.HttpWebRequest but HttpRequest doesn't have the same property and it seems I can't cast it.
(BTW: I realise that the HTTP version rarely matter these days but it's in relation to some debugging I'm doing)
This value is in Request.ServerVariables["SERVER_PROTOCOL"] and holds, for example, "HTTP/1.1".
I was trying to get and analyse extended information from various websites. The only sustainable solution I could find was to use Selenium. With the following piece of code I could get any information, including the protocol version: "h2", "HTTP/1.1", etc.
ChromeOptions options = new ChromeOptions();
// Enable the performance log so network events are recorded
options.SetLoggingPreference("performance", LogLevel.All);
// Create the Chrome driver instance
IWebDriver driver = new ChromeDriver(options);
driver.Navigate().GoToUrl(/*your url goes here*/);
System.Threading.Thread.Sleep(20000);
// Extract the performance logs
var logs = driver.Manage().Logs.GetLog("performance");
for (int i = 0; i < logs.Count; i++)
{
    // Each entry's Message holds the raw JSON of one DevTools event
    string str = logs[i].Message;
    if (str.IndexOf("\"method\":\"Network.responseReceived\"") > -1 &&
        str.IndexOf("\"url\":") > -1)
    {
        int start = str.IndexOf("\"protocol\":");
        if (start > -1)
        {
            // Show the text from "protocol": up to the next comma
            MessageBox.Show(str.Substring(start, str.IndexOf(",", start) - start));
        }
    }
}
As you can see from the following piece of the response, you will be able to get pretty much any kind of information you could possibly need (including the protocol version):
{"message":{"method":"Network.responseReceived","params":{"frameId":"755A984985C3F1263469B348C78A4AA5","loaderId":"2116D4C83A7C6EFD017CC5BC6814FCAB","requestId":"2116D4C83A7C6EFD017CC5BC6814FCAB","response":{"connectionId":41,"connectionReused":false,"encodedDataLength":9595,"fromDiskCache":false,"fromPrefetchCache":false,"fromServiceWorker":false,"headers":{"cache-control":"no-cache, must-revalidate","content-encoding":"gzip","content-language":"en","content-type":"text/html; charset=utf-8","date":"Sun, 19 Apr 2020 12:37:09 GMT","expires":"Sun, 19 Nov 1978 05:00:00 GMT","server":"nginx","status":"200","vary":"Accept-Encoding","x-content-type-options":"nosniff\nnosniff","x-frame-options":"SAMEORIGIN","x-generator":"Drupal 7 (http://drupal.org)","x-powered-by":"PleskLin"},"mimeType":"text/html","protocol":"h2","remoteIPAddress":"195.210.46.29","remotePort":443,"requestHeaders":{":authority":"www.WEBSITENAME.com",":method":"GET",":path":"/",":scheme":"https","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9","accept-encoding":"gzip, deflate, br","accept-language":"en-US,en;q=0.9","sec-fetch-dest":"document","sec-fetch-mode":"navigate","sec-fetch-site":"none","sec-fetch-user":"?1","upgrade-insecure-requests":"1","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36"},"securityDetails":{"certificateId":0,"certificateTransparencyCompliance":"compliant","cipher":"AES_128_GCM","issuer":"Let's Encrypt Authority X3","keyExchange":"ECDHE_RSA","keyExchangeGroup":"P-256","protocol":"TLS 1.2","signedCertificateTimestampList":[{"hashAlgorithm":"SHA-256","logDescription":"Let's Encrypt 'Oak2020' log","logId":"E712F2B0377E1A62FB8EC90C6184F1EA7B37CB561D11265BF3E0F34BF241546E","origin":"Embedded in certificate","signatureAlgorithm":"ECDSA","signatureData":"30440220600B7A7DE2155D200AE2179CE5E297DC6AB9118E57934440C20E25E33C420ADC02201DC0B323CDCA6BF85100E4816B1405BA5BBB2F41EB225CABCBA4CB5C0513E449","status":"Verified","timestamp":1.586426567863e+12},{"hashAlgorithm":"SHA-256","logDescription":"Google 'Argon2020' log","logId":"B21E05CC8BA2CD8A204E8766F92BB98A2520676BDAFA70E7B249532DEF8B905E","origin":"Embedded in certificate","signatureAlgorithm":"ECDSA","signatureData":"3046022100AF260074C39A0F1294C8038BAEE0B85F984C7EC80D10203D6AAAC1BB8B5CDF1D022100ECE351015B9375A3F85CA84EC5CB606A5453AF34AFFDC25C5D32BC938A01FD67","status":"Verified","timestamp":1.586426567862e+12}],"subjectName":"WEBSITENAME.com","validFrom":1586422967,"validTo":1594198967},"securityState":"secure","status":200,"statusText":"","timing":{"connectEnd":399.532,"connectStart":163.118,"dnsEnd":163.118,"dnsStart":163.073,"proxyEnd":-1,"proxyStart":-1,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":2137.345,"requestTime":51622.890236,"sendEnd":399.969,"sendStart":399.759,"sslEnd":399.526,"sslStart":281.492,"workerReady":-1,"workerStart":-1},"url":"https://www.WEBSITENAME.com/"},"timestamp":51625.02969,"type":"Document"}},"webview":"755A984985C3F1263469B348C78A4AA5"}
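Since each log message is JSON, the protocol can also be read with a JSON parser instead of string searching. The sketch below uses System.Text.Json and assumes the message has the message/params/response shape shown above; `PerformanceLogParser` and `GetProtocol` are illustrative names, and the sample string is a trimmed-down stand-in for a real log entry.

```csharp
using System;
using System.Text.Json;

public static class PerformanceLogParser
{
    // Hypothetical helper: returns the protocol of a Network.responseReceived
    // event, or null for other events or when the field is absent.
    public static string GetProtocol(string logMessage)
    {
        using (JsonDocument doc = JsonDocument.Parse(logMessage))
        {
            JsonElement message = doc.RootElement.GetProperty("message");
            if (message.GetProperty("method").GetString() != "Network.responseReceived")
                return null;

            JsonElement response = message.GetProperty("params").GetProperty("response");
            return response.TryGetProperty("protocol", out JsonElement protocol)
                ? protocol.GetString()
                : null;
        }
    }

    public static void Main()
    {
        // Trimmed-down sample with the same nesting as the log entry above
        string sample = "{\"message\":{\"method\":\"Network.responseReceived\"," +
                        "\"params\":{\"response\":{\"protocol\":\"h2\"," +
                        "\"url\":\"https://www.example.com/\"}}}}";
        Console.WriteLine(GetProtocol(sample)); // prints h2
    }
}
```

Reading the nested response.protocol field this way also avoids matching the "protocol":"TLS 1.2" entry under securityDetails by accident.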

Retrieve web page content like a browser

After learning a few things about different technologies, I wanted to make a small project using UWP+NoSQL: a small UWP app that grabs the horoscope and displays it on my Raspberry Pi every morning.
So I took a WebClient and did the following:
WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
But it seems the site detects that the request isn't coming from a browser, since the interesting part is not in the content (and when I check with the browser, it is in the initial HTML, according to Fiddler).
I also tried ScrapySharp, but I got the same result. Any idea why?
(I've already done the UWP part, so I don't want to change the topic of my personal project just because it is detected as a "bot".)
EDIT
It seems I wasn't clear enough. The issue is not that I'm unable to parse the HTML; the issue is that I don't receive the expected HTML when using ScrapySharp/WebClient.
EDIT2
Here is what I retrieve: http://pastebin.com/sXi4JJRG
For example, I don't get the "Star ratings by domain" section or the related images for each star.
You can read the entire content of the web page using the code snippet shown below:
internal static string ReadText(string Url, int TimeOutSec)
{
    using (HttpClient _client = new HttpClient() { Timeout = TimeSpan.FromSeconds(TimeOutSec) })
    {
        _client.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("text/html"));
        // GetAsync returns a Task; block on it because this method is synchronous
        using (HttpResponseMessage _responseMsg = _client.GetAsync(Url).GetAwaiter().GetResult())
        using (HttpContent content = _responseMsg.Content)
        {
            // HttpContent only exposes ReadAsStringAsync, so wait for its result
            return content.ReadAsStringAsync().GetAwaiter().GetResult();
        }
    }
}
Or in a simple way:
public static void DownloadString (string address)
{
WebClient client = new WebClient ();
string reply = client.DownloadString (address);
Console.WriteLine (reply);
}
(re: https://msdn.microsoft.com/en-us/library/fhd1f0sw(v=vs.110).aspx)
Yes, WebClient won't give you the expected result. Many sites use scripts to load content, so to emulate a browser you also need to run the page scripts.
I have never done anything similar, so my answer is purely theoretical.
To solve the problem you need a "headless browser".
I know of two projects for this (I have never tried either of them):
http://webkitdotnet.sourceforge.net/ - it seems to be outdated
http://www.awesomium.com/
Ok, I think I know what's going on: I compared the real output (no fancy user agent strings) to the output as supplied by your pastebin and found something interesting. On line 213, your pastebin has:
<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hov...ck">Forecast Tarot Readings</div>
Mind the data-hov...ck near the end. In the real output, this was:
<li class="dropdown">Astrology
followed by about 600 lines of code, including the aforementioned 'interesting part'. On line 814, it says:
<div class="bot-explore-col-subtitle f14 blocksubtitle black">Forecast Tarot Readings</div>
which, starting with the ck in black, matches up with the rest of the pastebin output. So either pastebin has condensed the output, or the original output was already condensed.
I created a new console application, inserted your code, and got the result I expected, including the 600 lines of html you seem to miss:
static void Main(string[] args)
{
    WebClient client = new WebClient();
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");
    // Save the raw HTML so it can be compared with the browser's output
    File.WriteAllText(@"D:\Temp\source-mywebclient.html", downloadString);
}
My WebClient is from System.Net. Changing the UserAgent has hardly any effect; a couple of links are a bit different.
So, to sum it up: your problem has nothing to do with content that is inserted dynamically after the initial GET, but possibly with WebClient combined with UWP. There's another question regarding WebClient and UWP on the site, "(UWP) WebClient and downloading data from URL", which states that you should use HttpClient. Maybe that's a solution?
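If you want to try the HttpClient route, a minimal sketch could look like this. It is untested against the horoscope site, `HoroscopeFetcher` and `FetchAsync` are made-up names, and it uses System.Net.Http rather than the Windows.Web.Http variant.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HoroscopeFetcher
{
    // One shared client for the lifetime of the app
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: downloads a page as text using the given User-Agent.
    public static async Task<string> FetchAsync(string url, string userAgent)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, url))
        {
            // Browser User-Agent strings do not always pass strict header
            // validation, so add the value without validating it
            request.Headers.TryAddWithoutValidation("User-Agent", userAgent);

            using (HttpResponseMessage response = await Client.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }
}
```

You would call it with the same URL and User-Agent string as in the WebClient snippet, for example `string html = await HoroscopeFetcher.FetchAsync(url, userAgent);`, and compare the result with what the console application produced.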
Some time ago I used http://www.nrecosite.com/phantomjs_wrapper_net.aspx it worked well, and as Anton mentioned it is a headless browser. Maybe it will be some help.
I'm wondering whether the 'interesting part' you expect to see 'in the content' consists of images. Are you aware that you have to retrieve any images separately? The fact that an HTML page contains <img .../> tags does not magically display them as well. As you can see with Fiddler, after retrieving a page, the browser then retrieves all images, style sheets, JavaScript and all other items that are referenced by, but not included in, the page. (You might need to clear the browser cache to see this happen.)
