Abot not decoding web response properly - c#

I'm using Abot (C#) to crawl a website using the standard settings in their getting started documentation.
After retrieving a web page I can't read the content - it doesn't appear to have been decoded correctly.
If I comment out the Abot code and just use the standard .NET (HttpWebResponse)request.GetResponse() call, I can see the page content correctly.
I want to use Abot for its scraping capabilities though. But as you can see below I get a load of incorrectly decoded content.
Has anyone got any ideas on how I can fix the problem?
EDIT: I'm pretty sure it's something to do with the website, as I don't have the same problem if I run against http://www.google.com
EDIT 2: Here are the headers
WebRequest
User-Agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
Accept: */*
Host: www.<website>.com
Connection: Keep-Alive
WebResponse
Transfer-Encoding: chunked
Connection: keep-alive
Content-Type: text/html; charset=UTF-8
Date: Wed, 29 Jul 2015 12:28:53 GMT
Set-Cookie: __cfduid=de5028c9ea76b127d7aebe40617a7a6b51438172932; expires=Thu, 28-Jul-16 12:28:52 GMT; path=/; domain=.<website>.com; HttpOnly,PHPSESSID=e2ekece8flgs000h6u6kvf66k6; path=/,ct_cookies_test=7a1a1460017221ec70f96f0f2a3cdaac; path=/
X-Powered-By: W3 Total Cache/0.9.4.1
Expires: Wed, 29 Jul 2015 13:28:53 GMT
Cache-Control: max-age=3600, public, must-revalidate, proxy-revalidate
Pragma: public
X-Pingback: http://www.<website>.com/<file>.php
Link: <http://wp.me/P2xmvI-a>; rel=shortlink
Last-Modified: Wed, 29 Jul 2015 12:28:53 GMT
Vary: Accept-Encoding,User-Agent
Server: cloudflare-nginx
CF-RAY: 20d8d37b9fc406be-LHR

If you remove the User-Agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko header, your response will probably be more readable. I'm not an expert, but it looks like the web server encodes responses sent to this user agent differently.
I recommend using Fiddler (http://www.telerik.com/fiddler) to check how web requests are handled. It is very handy for debugging this kind of problem.
[Screenshot: bad content seen in Fiddler]
[Screenshot: correct content seen in Fiddler]

Related

File downloads failing on Android

I need to allow users to download files from our server, and I'd like to serve these files via an ASP.NET MVC 5 controller action. My action looks like this:
public FileContentResult Download(int fileId)
{
    var myContent = GetContentForFile(fileId);
    var myFileMeta = GetFileMeta(fileId);
    if (myContent == null || myFileMeta == null)
        throw new FriendlyException("The file or its associated data could not be found.");
    return File(myContent.Content, myContent.MediaType, myFileMeta.FileName);
}
The above is as simple as I could get it. It works fine on PC and iPhone, but not on Android. Using Fiddler, I can see the following response headers when I try to download one of my files, in this case a JPG file called "1447114384146-643143584.jpg":
HTTP/1.1 200 OK
Cache-Control: private, s-maxage=0
Content-Type: image/jpeg
Server: Microsoft-IIS/8.5
X-AspNetMvc-Version: 5.2
Content-Disposition: attachment; filename=1447114384146-643143584.jpg
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 12 Nov 2015 23:09:00 GMT
Content-Length: 1682868
Note that I don't have any reliable way to know the correct MIME-type - is this an issue and could it explain why the file isn't being downloaded in Android?
To clarify, when I attempt to download any file from the database using Android, I get a toast notification telling me "Download started", but then the download sits in the queue for a while on 0% before eventually just changing to "Failed".
What I've tried
I've tried all manner of things that people have suggested in similar questions, most of which are to do with the content-disposition header or the content-type header. I've tried forcing the content-type header to application/octet-stream for every file, I've tried sending the correct content-type header for the particular file. I've tried manually sending the content-disposition header. I've tried forcing the filename extension to uppercase.
None of the above has worked, in fact none of them have had any impact at all on the problem, for better OR worse. I'm amazed that this is so hard - I feel like I must be missing something obvious?
Additional information
Browser: latest Chrome on Android
OS: Android 5.1 (also occurs for a coworker on their Android phone which is at an earlier Android version (not sure which specifically), so I don't think this is tied to a specific Android version).
Update
After reading this blog entry: http://www.digiblog.de/2011/04/android-and-the-download-file-headers/ I tried following the advice and set my headers exactly as suggested:
HTTP/1.1 200 OK
Cache-Control: private, s-maxage=0
Content-Type: application/octet-stream
Server: Microsoft-IIS/8.5
X-AspNetMvc-Version: 5.2
Content-Disposition: attachment; filename="1447114384146-643143584.JPG"
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 12 Nov 2015 23:42:18 GMT
Content-Length: 1682868
Again, this had no impact on the problem at all.
Further update
I have been able to test on a Marshmallow (Android v6.0) device and the download works. It seems to be a pre-Marshmallow issue.
Sadly this was caused by something very specific to my environment, but I'd like to put the answer here in case anyone else stumbles across this same problem.
It turns out the Android download manager doesn't like underscores in domain names, and our local domain address had an underscore in it. I used the server's IP address instead and everything worked as expected.
For example this: http://www.my_domain.com.au/file.png won't work. This: http://192.168.x.x/file.png does work.
Found as an answer on this question: Trouble downloading file from browser on Android
Disclaimer: I don't have enough rep to add to the comments so I am forced to comment here.
Have you tried different versions of Android using the emulator, or have you only tried using an actual device?
If only on a device, is the code in production, or are you connecting to your local development system through a local wireless connection?
Have you tried to use Chrome Remote Debugging on the device?
https://developers.google.com/web/tools/chrome-devtools/debug/remote-debugging/remote-debugging?hl=en
One way to rule out issues with the setup on your device would be to write a small Android app using Xamarin + RestSharp that does nothing but hit your download URL to see if that works. If it does, that helps point the finger at Chrome itself. If it doesn't, then at least you can run the app with the debugger attached to get better insight into what is happening on the other end.
https://xamarin.com/
https://github.com/restsharp/RestSharp
UPDATE: Response headers as seen by Fiddler when calling a test served by my local machine
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: application/octet-stream
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Content-Disposition: attachment; filename=profile.jpg
Date: Fri, 13 Nov 2015 02:09:23 GMT
Content-Length: 218143
Update: Here are the incoming request server variables
ALL_HTTP=HTTP_CACHE_CONTROL:max-age=0
HTTP_CONNECTION:keep-alive
HTTP_ACCEPT:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
HTTP_ACCEPT_ENCODING:gzip, deflate, sdch
HTTP_ACCEPT_LANGUAGE:en-US,en;q=0.8
HTTP_COOKIE:_ga=GA1.1.420021277.1447377172
HTTP_HOST:192.168.1.2
HTTP_USER_AGENT:Mozilla/5.0 (Linux; Android 5.0.2; HTC One Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36
HTTP_UPGRADE_INSECURE_REQUESTS:1
HTTP_DNT:1
ALL_RAW=Cache-Control: max-age=0
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Cookie: _ga=GA1.1.420021277.1447377172
Host: 192.168.1.2
User-Agent: Mozilla/5.0 (Linux; Android 5.0.2; HTC One Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36
Upgrade-Insecure-Requests: 1
DNT: 1
APPL_MD_PATH=/LM/W3SVC/2/ROOT
APPL_PHYSICAL_PATH=C:\development\rumble-strip\projects\net-framework\RumbleStrip.Website\
AUTH_TYPE=
AUTH_USER=
AUTH_PASSWORD=
LOGON_USER=
REMOTE_USER=
CERT_COOKIE=
CERT_FLAGS=
CERT_ISSUER=
CERT_KEYSIZE=
CERT_SECRETKEYSIZE=
CERT_SERIALNUMBER=
CERT_SERVER_ISSUER=
CERT_SERVER_SUBJECT=
CERT_SUBJECT=
CONTENT_LENGTH=0
CONTENT_TYPE=
GATEWAY_INTERFACE=CGI/1.1
HTTPS=off
HTTPS_KEYSIZE=
HTTPS_SECRETKEYSIZE=
HTTPS_SERVER_ISSUER=
HTTPS_SERVER_SUBJECT=
INSTANCE_ID=2
INSTANCE_META_PATH=/LM/W3SVC/2
LOCAL_ADDR=192.168.1.2
PATH_INFO=/
PATH_TRANSLATED=C:\development\rumble-strip\projects\net-framework\RumbleStrip.Website
QUERY_STRING=
REMOTE_ADDR=192.168.1.5
REMOTE_HOST=192.168.1.5
REMOTE_PORT=54748
REQUEST_METHOD=GET
SCRIPT_NAME=/
SERVER_NAME=192.168.1.2
SERVER_PORT=80
SERVER_PORT_SECURE=0
SERVER_PROTOCOL=HTTP/1.1
SERVER_SOFTWARE=Microsoft-IIS/10.0
URL=/
HTTP_CACHE_CONTROL=max-age=0
HTTP_CONNECTION=keep-alive
HTTP_ACCEPT=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
HTTP_ACCEPT_ENCODING=gzip, deflate, sdch
HTTP_ACCEPT_LANGUAGE=en-US,en;q=0.8
HTTP_COOKIE=_ga=GA1.1.420021277.1447377172
HTTP_HOST=192.168.1.2
HTTP_USER_AGENT=Mozilla/5.0 (Linux; Android 5.0.2; HTC One Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36
HTTP_UPGRADE_INSECURE_REQUESTS=1
HTTP_DNT=1
IS_LOGIN_PAGE=1

Get html result from web page

I am planning to create a mobile application (for fun) that uses the result from this web page (http://consultawebvehiculos.carabineros.cl/index.php). Is there any way to create an instance of a browser in my .NET code, read the result, and publish it through a web service?
Something like this (pseudocode):
var IE = new Browser("http://consultawebvehiculos.carabineros.cl/index.php");
var result = IE.FindElementByID("txtIdentityCar").WriteText(YourIdentityCar);
publishToWebService(result);
Update:
Using Fiddler I can see that the HTTP POST looks something like this:
POST http://consultawebvehiculos.carabineros.cl/index.php HTTP/1.1
Host: consultawebvehiculos.carabineros.cl
Connection: keep-alive
Content-Length: 61
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Origin: http://consultawebvehiculos.carabineros.cl
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
Content-Type: application/x-www-form-urlencoded
Referer: http://consultawebvehiculos.carabineros.cl/index.php
Accept-Encoding: gzip,deflate,sdch
Accept-Language: es-ES,es;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
accion=buscar&txtLetras=CL&txtNumeros1=sk&txtNumeros2=12&vin=
Maybe I need some .NET class like WebClient in order to connect to the PHP page, but I'm not sure.
UPDATE: I finally found the solution by using Fiddler to find all the parameters, together with the code from http://www.hanselman.com/blog/HTTPPOSTsAndHTTPGETsWithWebClientAndCAndFakingAPostBack.aspx
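For anyone landing here, the approach described in the update can be sketched roughly as follows. This is a minimal sketch, not the exact code from the linked post: the class and method names (VehicleLookup, BuildForm, Search) are made up for illustration, the field names are copied from the Fiddler capture above, and decoding the reply as UTF-8 is an assumption (the page may use a different charset).

```csharp
using System.Collections.Specialized;
using System.Net;
using System.Text;

public static class VehicleLookup
{
    // Builds the form fields seen in the Fiddler capture
    // (accion, txtLetras, txtNumeros1, txtNumeros2, vin).
    public static NameValueCollection BuildForm(string letters, string numbers1, string numbers2)
    {
        return new NameValueCollection
        {
            { "accion", "buscar" },
            { "txtLetras", letters },
            { "txtNumeros1", numbers1 },
            { "txtNumeros2", numbers2 },
            { "vin", "" }
        };
    }

    // Posts the form as application/x-www-form-urlencoded and returns the HTML.
    public static string Search(string url, NameValueCollection form)
    {
        using (var client = new WebClient())
        {
            byte[] responseBytes = client.UploadValues(url, "POST", form);
            // Assumption: the response is UTF-8; adjust if the page says otherwise.
            return Encoding.UTF8.GetString(responseBytes);
        }
    }
}
```

Calling VehicleLookup.Search("http://consultawebvehiculos.carabineros.cl/index.php", VehicleLookup.BuildForm("CL", "sk", "12")) would then return the raw HTML, which still has to be parsed (for example with Html Agility Pack).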
If you are just interested in scraping the page, I suggest using Html Agility Pack.
If you also want to display the page, then you could use the WebBrowser control.
We've been using http://htmlunit.sourceforge.net/ for similar tasks. It allows you to send requests and receive the response, status code, etc.
(It's a Java library, so you could either search for a .NET port or use a converter to turn the Java assembly into a .NET assembly. See http://blog.stevensanderson.com/2010/03/30/using-htmlunit-on-net-for-headless-browser-automation/ for guidance. We used the conversion approach.)

HttpListenerResponse and PHP POST requests

Here is how the packets look
HTTP/1.1 200 OK
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Thu, 18 Oct 2012 13:52:49 GMT
Server: LiteSpeed
Connection: close
X-Powered-By: PHP/5.3.10
Content-Type: text/html
Content-Length: 35
And
HTTP/1.1 200 OK
Content-Length: 35
Content-Type: text/html
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-HTTPAPI/2.0
Date: Thu, 18 Oct 2012 14:17:13 GMT
Connection: close
The gzipped output is the same for both, yet the top one (generated by PHP) works and the bottom one (generated by HttpListenerResponse) does not work with a POST request, even though both can be viewed in a browser. I am also not calling across domains or ports.
How do I make the second request work?
I added some headers, removed others, and everything started working! You need to send the Access-Control-Allow-Origin header, or it will only work in IE.
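A minimal sketch of what that fix might look like with HttpListener. The helper name CorsReply.Write is made up for illustration, and the sketch sends an uncompressed body for brevity (the responses in the question were gzipped). Passing "*" as the allowed origin is the most permissive choice; a specific origin is usually safer.

```csharp
using System.Net;
using System.Text;

public static class CorsReply
{
    // Writes a small HTML body and tells browsers that the given origin
    // may read the response.
    public static void Write(HttpListenerResponse response, string body, string allowedOrigin)
    {
        byte[] payload = Encoding.UTF8.GetBytes(body);
        response.ContentType = "text/html";
        response.AddHeader("Access-Control-Allow-Origin", allowedOrigin);
        response.ContentLength64 = payload.Length;
        response.OutputStream.Write(payload, 0, payload.Length);
        response.Close();
    }
}
```

In a listener loop this would be called as CorsReply.Write(context.Response, html, "*") for each HttpListenerContext returned by GetContext().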

How do I return a specific SOAP response from an ASP.Net site?

An external development partner has a service that will post a SOAP request to one of our services. The format of their request is fixed (by them).
We are required to respond with a SOAP message of a fixed format (fixed by them again).
I have created a Generic Handler in ASP.Net that successfully receives their request (which we parse manually and process).
However, they want a response that looks like this:
HTTP/1.1 200 OK
Date: Thu, 01 Apr 2010 09:30:25 GMT
Server: Jetty/5.1.4 (Windows XP/5.1 x86 java/1.5.0_15
Content-Type: multipart/related; boundary=soaptestserver; type="text/xml"; start="<theenvelope>"
SOAPAction: ""
Content-Length: 796
Connection: close
--soaptestserver
Content-ID: <theenvelope>
Content-Transfer-Encoding: 8bit
Content-Type: text/xml; charset=utf-8
Content-Length: 442
<?xml version="1.0" encoding="UTF-8"?><SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/1999/XMLSchema"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"><SOAP-ENV:Body><ns1:processResponse
xmlns:ns1="urn:TripFlow"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><message
href="cid:thecontentmessage"/></ns1:processResponse></SOAP-ENV:Body></SOAP-ENV:Envelope>
--soaptestserver
Content-ID: <thecontentmessage>
Content-Transfer-Encoding: 8bit
Content-Type: text/xml; charset=utf-8
Content-Length: 65
<?xml version="1.0" encoding="UTF-8"?><STATUSLVL>00</STATUSLVL>
--soaptestserver--
I have been so sheltered from raw SOAP by using .Net Webservices / WCF for years, that I have no clue about how to go about making a response like this.
What should I do?
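One possible direction, sketched under assumptions: since the format is fixed, the multipart body can be assembled by hand as a string and written straight to the response from the Generic Handler. The class and method names below (MultipartSoap, BuildBody) are made up for illustration. The boundary and Content-ID values are copied from the sample above. The sample's part length of 65 appears to count the trailing CRLF (the status XML alone is 63 bytes), so this sketch counts it too; that is a guess that should be confirmed with the partner.

```csharp
using System.Text;

public static class MultipartSoap
{
    public const string Boundary = "soaptestserver";

    // Formats one MIME part: boundary line, part headers, blank line, content.
    private static string Part(string contentId, string xml)
    {
        string content = xml + "\r\n";
        var sb = new StringBuilder();
        sb.Append("--").Append(Boundary).Append("\r\n");
        sb.Append("Content-ID: <").Append(contentId).Append(">\r\n");
        sb.Append("Content-Transfer-Encoding: 8bit\r\n");
        sb.Append("Content-Type: text/xml; charset=utf-8\r\n");
        // Assumption: the length includes the trailing CRLF, as the sample suggests.
        sb.Append("Content-Length: ").Append(Encoding.UTF8.GetByteCount(content)).Append("\r\n");
        sb.Append("\r\n");
        sb.Append(content);
        return sb.ToString();
    }

    // Joins the envelope part and the message part, then closes the multipart body.
    public static string BuildBody(string envelopeXml, string messageXml)
    {
        return Part("theenvelope", envelopeXml)
             + Part("thecontentmessage", messageXml)
             + "--" + Boundary + "--\r\n";
    }
}
```

In the handler, the outer headers would then be set on the response itself, for example by assigning the multipart/related string from the sample to context.Response.ContentType before writing the body.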

Removing a line feed from the header of a c# web request

I need to remove the last line feed from an HTTP web request in order to communicate with a JSON-RPC service.
The request that .NET generates looks like this:
POST http://localhost.:8332/ HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; MS Web Services Client Protocol 4.0.30319.1)
Authorization: Basic dGlwa2c6dGlwa2c=
Host: localhost.:8332
Content-Length: 42
Expect: 100-continue
Connection: Keep-Alive

{"id":1,"method":"getinfo","params":[]}
What I need is this (note the missing line feed between the last header value and the start of the JSON content):
POST http://localhost.:8332/ HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; MS Web Services Client Protocol 4.0.30319.1)
Authorization: Basic dGlwa2c6dGlwa2c=
Host: localhost.:8332
Content-Length: 42
Expect: 100-continue
Connection: Keep-Alive
{"id":1,"method":"getinfo","params":[]}
I can't find anything where I could manipulate the header which is actually sent to service.
See http://www.bitcoin.org/smf/index.php?topic=2170.0 for more background on the problem...
I finally resolved my (core) issue. The problem with my communication with the RPC service was that I had not set the content type. The service requires a content type of "application/json-rpc" to work properly.
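For reference, a minimal sketch of setting that content type on an HttpWebRequest. The class and method names (JsonRpcClient, CreateRequest, Call) are made up for illustration, and the credentials are whatever the service expects; none of this is taken from the original code.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class JsonRpcClient
{
    // Creates a POST request with the content type the service expects.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/json-rpc";
        return request;
    }

    // Sends the JSON payload and returns the raw response text.
    public static string Call(string url, string json, ICredentials credentials)
    {
        HttpWebRequest request = CreateRequest(url);
        request.Credentials = credentials;
        byte[] payload = Encoding.UTF8.GetBytes(json);
        request.ContentLength = payload.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(payload, 0, payload.Length);
        }
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The getinfo call from the question would then be sent as JsonRpcClient.Call("http://localhost:8332/", "{\"id\":1,\"method\":\"getinfo\",\"params\":[]}", new NetworkCredential(user, password)), where user and password are placeholders.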
