AG_E_NETWORK_ERROR while downloading some images from web in WindowsPhone - c#

I am using HtmlAgilityPack.
I am downloading articles and images from one web site. 80% of the images download without a problem, but some images throw an error. I can see the name of the error in the ImageFailed event.
I am downloading the image like this:
Image = new BitmapImage(new Uri(img.Attributes["src"].Value));
I have searched Google and found that this is a really baffling problem.

There's a good chance the referrer header is screwing you up. You need to issue the calls yourself (instead of relying on BitmapImage to download the file).
There's a handy snippet/utility that 'extends' XAML and makes it easier to do:
http://blogs.msdn.com/b/swick/archive/2011/08/04/wp7-mango-image-download-with-custom-referer-header.aspx
Edit: Explanation
A lot of sites block requests for images not coming from their sites. That way, if you have http://mysite.com and you link to images in http://cnn.com, they can block images directly linked and redirect them or something.
Now, the reason it works is that the browser controls all calls made from the img tag (or from any other mechanism such as AJAX), and it adds the REFERRER HTTP header saying where the request is coming from (http://mysite.com) - and then the cnn.com code can block it.
In .NET desktop, the Referrer header is not automatically added to the request - that means the call would be blocked by sites that check for an empty referrer, but not by those that don't.
Switch to WP7/8, which is based on Silverlight. In Silverlight, the referrer is the site on which the Silverlight control is hosted. So if you have a SL control running on http://mysite.com and it makes [any] HTTP request, the referrer header will be automatically set for you to http://mysite.com. There's no way to control that AFAIK (for security reasons). Windows Phone, however, while based on SL, does not need to be bound by the same security constraints. Yet when they "ported" the code to Windows Phone, they put some value into the referrer anyway - the value is actually the package location inside the phone (you can see this by using Fiddler). It's literally some path (/apps/storage/[guid]) or something like that - I don't recall the exact value. To fix that, you set the referrer yourself in the HTTP headers when making the request.
Hope that makes it clear.
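For reference, here is a minimal sketch of issuing the request yourself with an explicit referrer and handing the bytes to a BitmapImage. This is not the linked utility; the image URI, referrer value, and the MyImage control are placeholders, and whether the platform lets you set this header is exactly what the linked post is about. Needs System.Net, System.IO, System.Windows, and System.Windows.Media.Imaging.
var request = (HttpWebRequest)WebRequest.Create(imageUri);
request.Headers[HttpRequestHeader.Referer] = "http://site-hosting-the-images.example/";

request.BeginGetResponse(ar =>
{
    using (var response = request.EndGetResponse(ar))
    using (var remote = response.GetResponseStream())
    {
        // Copy to a seekable MemoryStream; SetSource needs to read it freely.
        var buffer = new MemoryStream();
        var chunk = new byte[4096];
        int read;
        while ((read = remote.Read(chunk, 0, chunk.Length)) > 0)
            buffer.Write(chunk, 0, read);
        buffer.Seek(0, SeekOrigin.Begin);

        // UI objects must be touched on the UI thread.
        Deployment.Current.Dispatcher.BeginInvoke(() =>
        {
            var bmp = new BitmapImage();
            bmp.SetSource(buffer);
            MyImage.Source = bmp;
        });
    }
}, null);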

Related

Web scraping attempt at website with flash plugin

I am attempting to scrape a website which has some kind of Flash plugin that loads data after I retrieve the HTML. The following object is received in the page:
<OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" WIDTH="250" HEIGHT="20" id="Preloader"><PARAM NAME="movie" VALUE="/images/preloader.swf">
<PARAM NAME="quality" VALUE="high">
<PARAM NAME="bgcolor" VALUE="#FFFFFF"><EMBED src="/images/preloader.swf" quality="high" bgcolor="#FFFFFF" WIDTH="250" HEIGHT="20" NAME="Preloader" ALIGN="" TYPE="application/x-shockwave-flash" PLUGINSPAGE="http://www.macromedia.com/go/getflashplayer"></EMBED></OBJECT>
I've attempted to locate the data being received in Wireshark, but no luck. My knowledge of this Flash plugin or how it works is nil. I'm guessing the worst-case scenario is that I will not be able to do this.
// Simple GET that returns the page's HTML as a string.
HttpWebRequest mainRequest = (HttpWebRequest)WebRequest.Create(URL);
mainRequest.Method = "GET";
mainRequest.Proxy = null;   // skip proxy auto-detection

using (WebResponse mainResponse = mainRequest.GetResponse())
using (StreamReader dataReader = new StreamReader(mainResponse.GetResponseStream(), System.Text.Encoding.UTF8))
{
    return dataReader.ReadToEnd();
}
Does anyone know a way I can receive this data, or make the WebResponse wait for the data to be injected into the HTML before it is received? Any help would be greatly appreciated.
UPDATE:
It seems I may have jumped the gun a little with the Flash object. I think this is just a loading animation while the table populates. I've been using Fiddler to see what is going on. The page is returned after a request with a loading div and the Flash object contained inside. A few seconds later, when the data is ready, another page is returned with the data. From what I can remember (I'm not at home so cannot confirm right now), the new page has the same request header as the original. There's no JSON or AJAX data in Fiddler. There's no script on the client to cause a refresh that I can see. I do not understand what is causing this to update.
I've briefly looked at the WebBrowser object, but I imagine this will be quite a performance hit when I'm scraping about 200 pages, currently taking a minute or so. I will try the AMF viewer later to confirm that the Flash object is not the source of the update.
I'm guessing that the server is causing this page to be resent when it has the table ready.
If the server is finding the loading div and replacing it with the table of data, would this cause the whole page to be resent? Or wouldn't this show up in the AJAX/JSON data? If it is the server resending the data, how can I keep the response open until it is ready to send the new page?
Thanks. JM.
If the content is being loaded dynamically into the Flash movie, it's very likely occurring over a standard HTTP request. Wireshark may be a little overkill for detecting something like this. I'd recommend using a utility that will capture HTTP, such as Charles, HttpFox, or screen-scraper. Using one of those tools, watch the HTTP requests that occur while the content is loading. Once you determine which request it is, you can likely just replicate it in your code.
That said, I've also seen cases (though not very common) where the data loaded into the Flash movie is done with a binary protocol, which makes things a little more difficult. AMF is often the protocol used in these cases. Charles proxy will detect this protocol, so that may be the tool to use in this case. A while back I wrote a blog post on extracting data that's delivered via AMF. It deals with a Java library, but you may be able to find something equivalent in .NET.
You won't be able to do that with a plain HttpWebRequest because the Flash content isn't running. The response you get back is just the HTML. It requires a browser (or a browser-like object) to actually execute, load that object, and pull down the content. I know there are libraries for executing Javascript, but I don't know of anything that will let you run a Flash plugin outside of a browser.
You might be better off using a WebBrowser object. But even if it will execute the Flash content (I honestly don't know if it will), you might not be able to access it. You'll have to look at the DOM and see.
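If you do try the WebBrowser route, a rough sketch (WinForms; the element id is a placeholder, and this needs to run on an STA thread with a message loop):
// Let the page's script run inside a WebBrowser, then read the DOM once
// loading finishes. Needs System.Windows.Forms.
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (s, e) =>
{
    var table = browser.Document.GetElementById("resultsTable"); // placeholder id
    if (table != null)
        Console.WriteLine(table.OuterHtml);   // or hand it to your parser
};
browser.Navigate("http://the-site-you-are-scraping.example/page");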
Use Firebug and/or TamperData, load your page with Flash as usual, and wait until Flash makes the HTTP POST/GET to get the data.
Flash has three options to get data:
Sockets
HTTP GET
HTTP POST
You can fool this thing any day. Just make sure your request contains all these little things:
Method (GET or POST)
Cookies
Form Values (why? session state, for example)
URL Referrer
User Agent
Custom HTTP-Headers? (some guys might put this in the HTTP request so no one can "fool" the server)
This can make the difference between getting a response with data and getting a default HTML error page.
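To illustrate, a sketch of replicating such a request with HttpWebRequest. Every concrete value below (URL, form fields, headers) is a placeholder you would copy from what Firebug/TamperData shows the browser sending. Needs System.Net, System.IO, System.Text.
var request = (HttpWebRequest)WebRequest.Create("http://example.com/getdata");
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.Referer = "http://example.com/page-hosting-the-flash";
request.UserAgent = "Mozilla/5.0 ...";              // mirror the browser's value
request.CookieContainer = new CookieContainer();    // reuse this container across requests
// request.Headers["X-Custom"] = "...";             // any custom headers you observed

byte[] body = Encoding.UTF8.GetBytes("sessionId=abc123&symbol=XYZ"); // observed form values
request.ContentLength = body.Length;
using (var stream = request.GetRequestStream())
    stream.Write(body, 0, body.Length);

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
    return reader.ReadToEnd();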
One last thing:
If the content is delivered via HTTPS, don't worry; that's just an extra layer somewhere, but still possible.
If the content is delivered via sockets, then forget it.

ASHX renders as broken image

I've got a really vexing problem with an ASHX handler that renders a captcha image. The thing that makes it really vexing is that it was working fine two months ago, and when I went back to it today it had stopped working.
What I've got is a page that throws in a captcha every so often. This is the markup from an example of a challenge:
<img class="challengedtl" src="Challenge.ashx?tkn=0057ea27-4d35-4850-9c6f-7a6fdc9818e2"/>
The GUID references a record in a SQL table that contains the actual content of the captcha as well as the status of the captcha challenge, i.e. has it been processed and if so did the user get it right etc.
On the page where this markup is found, the image displays as a broken JPEG. When I drop a breakpoint in the ASHX ProcessRequest() method I can see that the ASHX is never being called.
When I take the URL out of the src attribute and run it directly from the address bar in my browser, I hit my breakpoint in ProcessRequest and the captcha image is rendered just fine.
I don't believe that my ASHX code is the problem, since it works when I call it directly. The problem seems to be with why the ASHX isn't being called by the main page. Given that this was working in February I am at a loss to explain what is going on.
I know that something has happened to my machine since then. I suspect a Windows Update or a service pack for something. The reason for this is that my captcha processing includes tracking the IP address of the caller. Back when this was working my local host was being registered as 127.0.0.1 (IPv4) but now it is being registered as ::1 (IPv6). Probably a red herring.
Does anyone know what might be causing this or do you have any suggestions for how to troubleshoot this problem?
Is the handler in the same folder as the page containing the html you posted above?
Here are the two key parts:
When I drop a breakpoint in the ASHX ProcessRequest() method I can see that the ASHX is never being called.
and
src="Challenge.ashx?tkn=0057ea27-4d35-4850-9c6f-7a6fdc9818e2"
Put those together, and we can surmise that the path in your src attribute is wrong.
It's just an image tag. If the html loads it will send a request for that resource. Since your breakpoint is not hit, it can only mean that either you aren't testing somewhere that allows breakpoints or that it's sending the request to the wrong place.
It could be as simple as sending the request to the production version of the site, using the wrong scheme (i.e. https vs. http), or missing a folder or port number somewhere. The browser should be able to give you the entire path of the resource - make sure this matches what you expect.
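For example, if the handler lives at the site root while the page is in a subfolder (an assumption worth checking), rooting the path takes the page's folder out of the equation:
<img class="challengedtl" src="/Challenge.ashx?tkn=0057ea27-4d35-4850-9c6f-7a6fdc9818e2"/>
In WebForms you could also generate the URL with ResolveUrl("~/Challenge.ashx?tkn=...") so it resolves correctly regardless of where the page sits.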

Reading the url from the browsers address bar using C# and MVC

I am trying to use domain masking to simulate multi-tenant access to my application. The plan right now is to read the subdomain portion of the domain, i.e. demo.mydomain.com, and load settings from the DB using that name.
The issue I'm having is that Request.Url is getting the request URL - NOT the URL in the browser.
So if I have http://demo.mydomain.com forwarding to http://www.mydomain.com/controllername with masking, Request.Url is grabbing the latter, simply because of how masking works, I assume - by putting the masked site inside of a frame.
Is it even possible to read the URL in the browser's address bar? Thanks.
You probably can get the URL you want, but only at the client side...
So, do this:
Get the browser's URL by using a JavaScript call, like window.location.href.
Post that URL to the server side.
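A minimal sketch of the server side (the action name, route, and session key are all hypothetical; the client posts window.location.href to it):
// Hypothetical MVC action receiving the browser's address-bar URL.
// Client side (JavaScript, e.g. with jQuery):
//   $.post('/Tenant/SetTenant', { browserUrl: window.location.href });
[HttpPost]
public ActionResult SetTenant(string browserUrl)
{
    var host = new Uri(browserUrl).Host;   // e.g. "demo.mydomain.com"
    var subdomain = host.Split('.')[0];    // "demo" - load tenant settings by this name
    Session["Tenant"] = subdomain;
    return new EmptyResult();
}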
Cons:
This is a JavaScript-dependent solution; it will not work with JavaScript disabled.
This is ugly as hell.
Pros:
You probably do not have any other option.

Getting data from a webpage

I have an idea for an App that would really help me out in work but I'm not sure if it's possible.
I want to run a C# desktop application that will ask for a value. When a value is supplied, the application will open a browser, go to a webpage, and enter the value into a form on an online website. The form is then submitted, and a new page is loaded that contains a table of results. I then want to extract the table of results from the page source and write code to parse the result values.
It is not important that the user sees this happen in an actual browser. In other words, if there's a way to do it by reading HTTP requests, then that's great.
The biggest problem I have is getting the values into the form and then retrieving the page source after the form is submitted and the next page loads.
Any help really appreciated.
Thanks
Provided that you're only using this in a legal context:
Usually, web forms are sent via POST request to the web server, specifically some script that handles it. You can look at the HTML code for the form's page and find out the destination for the form (form's action).
You can then use an HttpWebRequest in C# to "pretend you are the form", sending a POST request with all the required parameters in the request body (plus any headers the server expects).
As a result you will get the source code of the destination page as it would be sent to the browser. You can parse this.
This is definitely possible and you don't need to use an actual web browser for this. You can simply use a System.Net.WebClient to send your HTTP request and get an HTTP response.
I suggest using Wireshark (or Firefox + Firebug); it allows you to see HTTP requests and responses. By looking at the HTTP traffic, you can see exactly how you should form your HTTP request and which parameters you should be setting.
You don't need to involve the browser with this. WebClient should do all that you require. You'll need to see what's actually being posted when you submit the form with the browser, and then you should be able to make a POST request using the WebClient and retrieve the resulting page as a string.
The docs for the WebClient constructor have a nice example.
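For example, a sketch with WebClient.UploadValues; the URL and field name are placeholders you would take from the form's HTML (its action and input names) or from Fiddler. Needs System.Net, System.Collections.Specialized, System.Text.
// Post the form fields and read back the results page as a string.
using (var client = new WebClient())
{
    var fields = new NameValueCollection
    {
        { "searchValue", userInput }   // placeholder field name; match the form's <input name="...">
    };
    byte[] raw = client.UploadValues("http://example.com/search", "POST", fields);
    string html = Encoding.UTF8.GetString(raw);
    // parse the results table out of 'html' (e.g. with HtmlAgilityPack)
}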
See e.g. this question for some pointers on at least the data retrieval side. You're going to know a lot more about the http protocol before you're done with this...
Why would you do this through web pages if you don't even want the user to do anything?
Web pages are purely for interaction with users, if you simply want data transfer, use WCF.
@Brian: using Wireshark will result in a very angry network manager; make sure you are actually allowed to use it.

C# WebClient - View source question

I'm using a C# WebClient to post login details to a page and read the all the results.
The page I am trying to load includes Flash (which, in the browser, translates into HTML). I'm guessing it's Flash to avoid being picked up by search engines?
The Flash I am interested in is just text (not an image/video, etc.), and when I "View Selection Source" in Firefox I do actually see the text, within HTML, that I want to see.
(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)
Currently, after I have posted my login details and loaded the HTML back, I see the page which does NOT show the Flash HTML (as if I had viewed source for the whole page).
Thanks in advance,
Jim
PS: I should point out that the POST is actually working, my log in is successful.
Fiddler (or a similar tool) is invaluable for tracking down screen-scraping problems like this. With a normal browser and Fiddler active, look at all the requests being made as you go through the login and navigation process to get to the data you want. Along the way, you will likely see one or more things that your code is doing differently, which the server responds to by showing you different HTML than it shows a real client.
The list of stuff below (think of it as "scraping 101") is what you want to look for. Most of the stuff below is probably stuff you're already doing, but I included everything for completeness.
In order to scrape effectively, you may need to deal with one or more of the following:
cookies and/or hidden fields. when you show up at any page on a site, you'll typically get a session cookie and/or hidden form field which (in a normal browser) would be propagated back to the server on all subsequent requests. You will likely also get a persistent cookie. On many sites, if a request shows up without a proper cookie (or form field for sites using "cookieless sessions"), the site will redirect the user to a "no cookies" UI, a login page, or another undesirable location (from the scraper app's perspective). always make sure you capture the cookies set on the initial request and faithfully send them back to the server on subsequent requests, except if one of those subsequent requests changes a cookie (in which case propagate that new cookie instead).
authentication tokens a special case of above is forms-authentication cookies or hidden fields. make sure you're capturing the login token (usually a cookie) and sending it back.
POST vs. GET this is obvious, but make sure you're using the same HTTP method that a real browser does.
form fields (esp. hidden ones!) I'm sure you're doing this already, but make sure to send all form fields that a real browser does, not just the visible fields. make sure fields are encoded properly.
HTTP headers. you already checked this, but it may make sense to check again just to make sure the (non-cookie) headers are identical. I always start with the exact same headers and then start pulling out headers one by one, and only keep the ones that cause the request to fail or return bogus data. this approach simplifies your scraping code.
redirects. These can either come from the server, or from client script (e.g. "if user doesn't have flash plug-in loaded, redirect to a non-flash page"). See WebRequest: How to find a postal code using a WebRequest against this ContentType="application/xhtml+xml, text/xml, text/html; charset=utf-8"? for a crazy example of how redirection can trip up a screen-scraper. Note that if you're using .NET for scraping, you'll need to use HttpWebRequest (not WebClient) for redirect-dependent scraping, because by default WebClient doesn't provide a way for your code to attach cookies and headers to the second (post-redirect) request. See the thread above for more details.
sub-requests (frames, ajax, flash, etc.) - often, page elements (not the main HTTP requests) will end up fetching the data you want to scrape. you'll be able to figure this out by looking which HTTP response contains the text you want, and then working backwards until you find what on the page is actually making the request for that content. A few sites do really crazy things in sub-requests, like requesting compressed or encrypted text via ajax, and then using client-side script to decrypt it. if this is the case, you'll need to do a bit more work like reverse-engineering what the client script is doing.
ordering - this one is obvious: make HTTP requests in the same order that a browser client does. that doesn't mean you need to make every request (e.g. images). Typically you only need to make the requests which return text/html content type, unless the data you want is not in the HTML and is in an ajax/flash/etc. request.
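Putting a few of those together, a skeleton of the request side might look like this. The values are placeholders; which headers you actually need comes out of the Fiddler comparison described above. Needs System.Net and System.IO.
// Skeleton: shared cookie jar, explicit headers, automatic redirects.
static readonly CookieContainer Cookies = new CookieContainer();

static string Fetch(string url, string referer)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = Cookies;      // cookies captured once, sent on every request
    request.AllowAutoRedirect = true;       // follow server-side redirects
    request.Referer = referer;
    request.UserAgent = "Mozilla/5.0 ...";  // mirror the real browser's value

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
        return reader.ReadToEnd();
}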
(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)
This usually means that the discrepancy is caused by some DOM manipulations via javascript after the page has loaded. Try turning off javascript and see what it looks like.
