I'm trying to scrape a web page hosted on a device on my network. I've done this dozens of times with other model devices on the same network. When I browse to the page in IE or Chrome, it's formatted properly and I see the source I'm expecting.
However, when I read the response stream in .NET or request the page in Fiddler, I get source that looks like JavaScript for generating a session rather than the numbers I care about.
I think the device now serves a JavaScript-powered landing page first, which calls back to the printer and only then formats and outputs the real page into my browser. My guess is that requests from Fiddler and from .NET GetResponseStream() never run that JavaScript, so they never get the full data.
Sample
WebRequest ConReq = WebRequest.Create(consumablePage);
WebRequest UseReq = WebRequest.Create(usagePage);
ConReq.Timeout = 15000;
UseReq.Timeout = 20000;
WebResponse ConResp = ConReq.GetResponse();
WebResponse UseResp = UseReq.GetResponse();
Stream Constream = ConResp.GetResponseStream();
StreamReader Consr = new StreamReader(Constream);
Stream Usestream = UseResp.GetResponseStream();
StreamReader Usesr = new StreamReader(Usestream);
string conRead = Consr.ReadToEnd();
string useRead = Usesr.ReadToEnd();
At the end, conRead and useRead both contain:
"<html>\r\n<head>\r\n<script language=\"JavaScript\" type=\"text/javascript\">\r\n<!-- \r\nfunction SetCookie ( inCookieName, inCookieValue, inCookieExpiration)\r\n{\r\n\tdocument.cookie\t\t= inCookieName + \"=\" + escape( inCookieValue ) + \r\n\t\t\t\t\t\t\t( inCookieExpiration ? \"; expires=\" + getExpiryDate(inCookieExpiration) : \"\" ) + \r\n\t\t\t\t\t\t\t\t\"; path=/\";\r\n}\r\n\r\nfunction getExpiryDate(nodays)\r\n{\r\n\tvar UTCstring;\r\n\tToday = new Date();\r\n\tnomilli=Date.parse(Today);\r\n\tToday.setTime(nomilli+nodays*24*60*60*1000);\r\n\tUTCstring = Today.toUTCString();\r\n\treturn UTCstring;\r\n}\r\n\r\nfunction generateSessionID()\r\n{\r\n\tvar \tgetTcpIpAddr = \"10.210.13.138\";\r\n\tvar SESSION_ID =\"SESSION_ID\";\r\n\tvar ipArray = getTcpIpAddr.split(\".\");\r\n\tvar ip = parseInt(ipArray[0], 10) + parseInt(ipArray[1], 10) + parseInt(ipArray[2], 10) + parseInt(ipArray[3], 10);\r\n\tvar d = new Date();\r\n\tID = parseInt((d.getMilliseconds()*ip)/32, 10);\r\n\tSetCookie(SESSION_ID, ID,365);\t//365 - expiry date is 1 year\r\n\twindow.location=window.location.toString();\r\n}\r\n-->\r\n</script>\r\n</head>\r\n<body onLoad=\"generateSessionID()\">\r\n</body>\r\n</html>\r\n"
A Fiddler GET of this page returns only 1075K, while an IE request for the same page returns 6602K.
How can I get a fully parsed response stream back in .Net?
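One possible direction, sketched under assumptions: the script in the response above only sets a SESSION_ID cookie and reloads the page, so a request that already carries an equivalent cookie might be served the real content. The helper below mirrors the arithmetic in generateSessionID(). The class and method names are hypothetical, and whether the device accepts an ID computed this way (or checks it at all) is untested.

```csharp
using System;
using System.Linq;
using System.Net;

// Hypothetical helper, not part of the original code.
public static class SessionCookie
{
    // Mirrors generateSessionID(): sum the four IP octets, multiply by a
    // millisecond value (0-999), divide by 32 and drop the fraction.
    public static int ComputeSessionId(string ipAddress, int milliseconds)
    {
        int octetSum = ipAddress.Split('.').Sum(part => int.Parse(part));
        return (milliseconds * octetSum) / 32;
    }

    // Builds a cookie container holding SESSION_ID for the device.
    // Assumption: the address embedded in the script is the device's own host.
    public static CookieContainer BuildContainer(Uri deviceUri, int milliseconds)
    {
        int id = ComputeSessionId(deviceUri.Host, milliseconds);
        var container = new CookieContainer();
        container.Add(deviceUri, new Cookie("SESSION_ID", id.ToString(), "/"));
        return container;
    }
}
```

To try it, cast ConReq to HttpWebRequest and assign the container to its CookieContainer property before calling GetResponse(), passing DateTime.Now.Millisecond as the millisecond value.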
Related
I am trying to get a table from the web page https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/ using HtmlAgilityPack.
My code so far is
WebClient webClient = new WebClient();
string page = webClient.DownloadString("https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='list_result Result']")
.Descendants("tr")
.Skip(1)
.Where(tr => tr.Elements("td").Count() > 1)
.Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
.ToList();
My problem is that the web page builds the table using JavaScript. When I try to read it, my code throws a null reference exception, because the page that comes back says that I must enable JavaScript.
I also tried using the "GET" method:
string Url = "https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
with the same results.
I have already enabled JavaScript in Internet Explorer and changed the registry as well:
if (Environment.Is64BitOperatingSystem)
Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Wow6432Node\Microsoft\Internet Explorer\MAIN\FeatureControl\FEATURE_BROWSER_EMULATION", true);
else //For 32 bit machine
Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION", true);
If I use a WebBrowser component I can see the web page without problem but I still can't get the table to list.
F12 is your friend in any browser.
Select the Network tab and you'll notice that all of the info is in this file :
https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd201806.xml
(I suppose that the data for July 2018 will be held in a URL named *.dd201807.xml)
Using C# you will need to do a GET for that URL and parse it as XML, no need to use HtmlAgilityPack. You will need to construct the current year concatenated with the current month to pick the right URL.
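A minimal sketch of that URL construction, assuming the file name keeps following the ddYYYYMM pattern guessed at above (only the June 2018 URL has actually been observed). The class and method names are made up for illustration, and no element names are assumed for the XML.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helper; only the June 2018 URL is confirmed by the Network tab.
public static class ExchangeRateFeed
{
    private const string BaseUrl =
        "https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd";

    // Concatenates the four-digit year and two-digit month into the file name.
    public static string BuildUrl(DateTime month)
    {
        return BaseUrl + month.ToString("yyyyMM", CultureInfo.InvariantCulture) + ".xml";
    }

    // Downloads and parses the XML for the given month (performs a GET).
    public static XDocument Load(DateTime month)
    {
        return XDocument.Load(BuildUrl(month));
    }
}
```

Calling ExchangeRateFeed.Load(DateTime.Today) would fetch the current month's file, which you can then inspect with LINQ to XML once you have looked at its structure.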
I can't make it any more fun than that!
WebClient is an HTTP client, not a web browser, so it won't execute JavaScript. What you need is a headless web browser. See this question for a list of headless web browsers; I have not tried any of them, so I cannot give you a recommendation:
Headless browser for C# (.NET)?
I am new here and I hope someone can help me. I am trying to get OAuth2 authentication on twitch.tv working with a small C# program, using the twitch.tv authentication request. Here is my C# code:
var loginURL = "https://api.twitch.tv/kraken/oauth2/authorize?" +
"response_type=code&" +
"client_id=" + clientID +
"&redirect_uri=http://localhost&" +
"state=TWStreamingStateAuthenticated";
this.richTextBox1.Text = loginURL;
string code = get_DownLoadString(loginURL);
this.richTextBox1.Text = code;
This is the part, which does not work. It gives me the Error 400: Bad Request.
WebRequest request = WebRequest.Create("https://api.twitch.tv/kraken/oauth2/token");
request.Method = "POST";
string postData = "client_id=" + clientID +
"&client_secret=" + clientSecret +
"&grant_type=authorization_code" +
"&redirect_uri=http://localhost" +
"&code=" + code +
"&state=TWStreamingStateAuthenticated";
ASCIIEncoding encoding = new ASCIIEncoding();
postData = HttpUtility.UrlEncode(postData);
byte[] byteArray = encoding.GetBytes(postData);
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = byteArray.Length;
Stream datatream = request.GetRequestStream();
datatream.Write(byteArray, 0, byteArray.Length);
datatream.Close();
WebResponse respone = request.GetResponse();
MessageBox.Show(((HttpWebResponse)respone).StatusDescription);
I hope someone can help me.
And here is the get_DownLoadString(string URL) method:
private static string get_DownLoadString(string URL)
{
try
{
string temp = (new WebClient().DownloadString(URL));
return temp;
}
catch (WebException)
{
return null;
}
}
This code doesn't look right to me:
string postData = "client_id=" + clientID +
"&client_secret=" + clientSecret +
"&grant_type=authorization_code" +
"&redirect_uri=http://localhost" +
"&code=" + code +
"&state=TWStreamingStateAuthenticated";
ASCIIEncoding encoding = new ASCIIEncoding();
postData = HttpUtility.UrlEncode(postData);
byte[] byteArray = encoding.GetBytes(postData);
// ...
You are URL-encoding the entire post-data string. This has the effect of converting the & and = signs in the post data to %26 and %3d respectively. When the remote server receives this data, it will scan through it looking for the & and = signs in order to separate out the parameter names and values. Of course, it won't find any, so it will assume you have one big parameter name with no value. The server is probably expecting values for each of the six parameters you are attempting to send, but seeing values for none of them, and this may be why you are getting a 400 Bad Request error.
Instead of URL-encoding the whole string, URL-encode parameter values that may contain characters other than letters and numbers. I would try the following instead:
string postData = "client_id=" + HttpUtility.UrlEncode(clientID) +
"&client_secret=" + HttpUtility.UrlEncode(clientSecret) +
"&grant_type=authorization_code" +
"&redirect_uri=" + HttpUtility.UrlEncode("http://localhost") +
"&code=" + HttpUtility.UrlEncode(code) +
"&state=TWStreamingStateAuthenticated";
ASCIIEncoding encoding = new ASCIIEncoding();
byte[] byteArray = encoding.GetBytes(postData);
// ...
This way, the remote server will still see the & and = characters, and so will be able to pull out the parameter names and values. Because we've URL-encoded the client ID, client secret, URL and code, any characters they contain that may have meaning in a URL will not have that meaning and will be received by the remote server as intended.
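The same idea can be wrapped in a small helper so that every value is encoded individually and the separators are added afterwards. This is only a sketch: FormBody is a hypothetical name, and Uri.EscapeDataString is swapped in for HttpUtility.UrlEncode so the example has no System.Web dependency (it encodes a space as %20 rather than +).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating per-value encoding.
public static class FormBody
{
    // Encodes each name and value separately, then joins the pairs with
    // unencoded '=' and '&' so the server can still split them.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

The resulting string can be passed to encoding.GetBytes(...) exactly like postData above.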
Also, if you are still getting a 400 Bad Request error response, try reading the contents of the response stream, obtained by calling GetResponseStream() on the response. Often that will contain a message that will help you figure out what's gone wrong.
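A sketch of reading that error body, assuming the 400 surfaces as a WebException thrown by GetResponse() (which is how HttpWebRequest reports non-success status codes). ErrorBody is a hypothetical helper name.

```csharp
using System.IO;
using System.Net;

// Hypothetical helper for inspecting the body of a failed request.
public static class ErrorBody
{
    // Returns the text of the error response, or null when the exception
    // carries no response (for example, on a connection failure).
    public static string Read(WebException ex)
    {
        if (ex.Response == null)
        {
            return null;
        }
        using (Stream stream = ex.Response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Wrap the request.GetResponse() call in try/catch (WebException ex) and show ErrorBody.Read(ex) in the message box to see what the server is complaining about.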
Having had a closer look at your code, it seems you have a misunderstanding about how OAuth authentication works. Your get_DownLoadString method will not get the access code you want; it will only get the HTML text of a Twitch login page.
This is how OAuth authentication works:
1. Your app sends the user to a login URL, to allow the user to log in to Twitch.
2. In the web browser, the user then enters their login credentials and submits the page to Twitch.
3. The Twitch API then responds by redirecting the user's web browser to the redirect URL, with a code appended. Your web app then reads this code out of the URL.
If your code is in a web app it will be able to respond to the URL redirected to in step 3. Alternatively, you may be able to use a WebBrowser control (Windows Forms, WPF) to handle the Twitch login, and handle a Navigating event. If the URL being navigated to begins with the redirect URL, grab the code out of the URL, cancel the navigation and hide the login web-browser control.
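The check on the URL being navigated to could be sketched as below. OAuthRedirect is a hypothetical helper; it assumes the code arrives as a query-string parameter named code, matching response_type=code in the authorize request. Wiring it into a Navigating handler is left out.

```csharp
using System;

// Hypothetical helper for pulling the authorization code out of the redirect.
public static class OAuthRedirect
{
    // Returns true and sets 'code' when navigatedUrl starts with the redirect
    // URI and carries a "code" query parameter; otherwise returns false.
    public static bool TryGetCode(string navigatedUrl, string redirectUri, out string code)
    {
        code = null;
        if (!navigatedUrl.StartsWith(redirectUri, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        string query = new Uri(navigatedUrl).Query.TrimStart('?');
        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length == 2 && parts[0] == "code")
            {
                code = Uri.UnescapeDataString(parts[1]);
                return true;
            }
        }
        return false;
    }
}
```

In the Navigating handler you would call TryGetCode with the target URL and "http://localhost"; when it returns true, cancel the navigation and hide the browser control.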
The presence of what appears to be a RichTextBox control, along with your comment about your code being a 'small C# program', makes me think that your code is a Windows Forms or WPF application. If this is the case, then you will need to either:
use a WebBrowser control as I described above,
replace your WinForms/WPF app with a web app, or
get in contact with Twitch to request the use of the password flow (which appears not to require a redirect), and use that instead.
First of all what I want to do is legal (since they let you download the pdf).
I just wanted to make a faster and automatic method of downloading the pdf.
For example: http://www.lasirena.es/article/&path=10_17&ID=782
It has an embedded Flash PDF viewer, and when I download the page's source code, the link to the PDF:
http://issuu.com/lasirena/docs/af_fulleto_setembre_andorra_sense_c?e=3360093/9079351
doesn't show up. The only thing I can find in the source code is this: 3360093/9079351
I tried to find a way to build the pdf link from it, but I can't find the name "af_fulleto_setembre_andorra_sense_c" anywhere...
I've made plenty of automatic downloads like this, but this is the first time that I can't build or get the PDF link. Is it even possible?
I also tried to find links to the JPGs, but without success. Either way (JPG or PDF) is fine...
PS: the Document ID doesn't show on the downloaded source code either.
Thank you.
I thought of a workaround for this. Some might not consider it a solution, but in my case it works fine, because it depends on the ISSUU publisher account.
The solution is to make a request to the ISSUU API for the publisher account I'm interested in.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://api.issuu.com/query?action=issuu.documents.list" +
"&apiKey=Insert Your API Key" +
"&format=json" +
"&documentUsername=User of the account you want to make a request" +
"&pageSize=100&resultOrder=asc" +
"&responseParams=name,documentId,pageCount" +
"&username=Insert your ISSUU username" +
"&token=Insert Your Token here");
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.Accept = "application/json";
try
{
using (WebResponse response = request.GetResponse())
{
var responseValue = string.Empty;
// grab the response
using (var responseStream = response.GetResponseStream())
{
using (var reader = new StreamReader(responseStream))
{
responseValue = reader.ReadToEnd();
}
}
if (responseValue != "")
{
List<string> lista_linkss = new List<string>();
JObject ApiRequest = JObject.Parse(responseValue);
//// get JSON result objects into a list
IList<JToken> results = ApiRequest["rsp"]["_content"]["result"]["_content"].Children()["document"].ToList();
for (int i = 0; i < results.Count(); i++)
{
Folheto folheto = new Folheto();
folheto.name = results[i]["name"].ToString();
folheto.documentId = results[i]["documentId"].ToString();
folheto.pageCount = Int32.Parse(results[i]["pageCount"].ToString());
string _date = Newtonsoft.Json.JsonConvert.SerializeObject(results[i]["uploadTimestamp"], Formatting.None, new IsoDateTimeConverter() { DateTimeFormat = "yyyy-MM-dd hh:mm:ss" }).Replace(@"""", string.Empty);
folheto.uploadTimestamp = Convert.ToDateTime(_date);
if (!lista_nomes_Sirena.Contains(folheto.name))
{
list.Add(folheto);
}
}
}
}
}
catch (WebException ex)
{
// Handle error
}
Pay attention to the "pageSize" parameter: the maximum permitted by the API is 100, so a single request returns at most 100 results. Since the account I'm following has around 240 PDFs, I sent this request twice, once with "resultOrder=asc" and once with "resultOrder=desc".
This allowed me to get the first 100 PDFs and the latest 100 PDFs inserted.
Since I didn't need a history, just the PDFs they will be publishing from now on, the gap in the middle didn't matter.
To finish, my code sends all the document IDs to a SQL database I made. When the program starts, it checks whether each ID has already been downloaded, and downloads the PDF only if it has not.
Hope someone finds this workaround useful.
I am trying to upload from an HTTP stream directly to S3, without storing in memory or as a file first. I am already doing this with Rackspace Cloud Files as HTTP to HTTP, however the AWS authentication is beyond me so am trying to use the SDK.
The problem is the upload stream is failing with this exception:
"This stream does not support seek operations."
I've tried with PutObject and TransferUtility.Upload, both fail with the same thing.
Is there any way to stream into S3 as the stream comes in, rather than buffering the whole thing to a MemoryStream or FileStream?
or is there any good examples of doing the authentication into S3 request using HTTPWebRequest, so I can duplicate what I do with Cloud Files?
Edit: or is there a helper function in the AWSSDK for generating the authorization header?
CODE:
This is the failing S3 part (both methods included for completeness):
string uri = RSConnection.StorageUrl + "/" + container + "/" + file.SelectSingleNode("name").InnerText;
var req = (HttpWebRequest)WebRequest.Create(uri);
req.Headers.Add("X-Auth-Token", RSConnection.AuthToken);
req.Method = "GET";
using (var resp = req.GetResponse() as HttpWebResponse)
{
using (Stream stream = resp.GetResponseStream())
{
Amazon.S3.Transfer.TransferUtility trans = new Amazon.S3.Transfer.TransferUtility(S3Client);
trans.Upload(stream, config.Element("root").Element("S3BackupBucket").Value, container + file.SelectSingleNode("name").InnerText);
//Use EITHER the above OR the below
PutObjectRequest putReq = new PutObjectRequest();
putReq.WithBucketName(config.Element("root").Element("S3BackupBucket").Value);
putReq.WithKey(container + file.SelectSingleNode("name").InnerText);
putReq.WithInputStream(Amazon.S3.Util.AmazonS3Util.MakeStreamSeekable(stream));
putReq.WithMetaData("content-length", file.SelectSingleNode("bytes").InnerText);
using (S3Response putResp = S3Client.PutObject(putReq))
{
}
}
}
And this is how I do it successfully from S3 to Cloud Files:
using (GetObjectResponse getResponse = S3Client.GetObject(new GetObjectRequest().WithBucketName(bucket.BucketName).WithKey(file.Key)))
{
using (Stream s = getResponse.ResponseStream)
{
//We can stream right from s3 to CF, no need to store in memory or filesystem.
var req = (HttpWebRequest)WebRequest.Create(uri);
req.Headers.Add("X-Auth-Token", RSConnection.AuthToken);
req.Method = "PUT";
req.AllowWriteStreamBuffering = false;
if (req.ContentLength == -1L)
req.SendChunked = true;
using (Stream stream = req.GetRequestStream())
{
byte[] data = new byte[32768];
int bytesRead = 0;
while ((bytesRead = s.Read(data, 0, data.Length)) > 0)
{
stream.Write(data, 0, bytesRead);
}
stream.Flush();
stream.Close();
}
req.GetResponse().Close();
}
}
As no-one answering seems to have done it, I spent the time working it out based on guidance from Steve's answer:
In answer to this question "is there any good examples of doing the authentication into S3 request using HTTPWebRequest, so I can duplicate what I do with Cloud Files?", here is how to generate the auth header manually:
string today = String.Format("{0:ddd,' 'dd' 'MMM' 'yyyy' 'HH':'mm':'ss' 'zz00}", DateTime.Now);
string stringToSign = "PUT\n" +
"\n" +
file.SelectSingleNode("content_type").InnerText + "\n" +
"\n" +
"x-amz-date:" + today + "\n" +
"/" + strBucketName + "/" + strKey;
Encoding ae = new UTF8Encoding();
HMACSHA1 signature = new HMACSHA1(ae.GetBytes(AWSSecret));
string encodedCanonical = Convert.ToBase64String(signature.ComputeHash(ae.GetBytes(stringToSign)));
string authHeader = "AWS " + AWSKey + ":" + encodedCanonical;
string uriS3 = "https://" + strBucketName + ".s3.amazonaws.com/" + strKey;
var reqS3 = (HttpWebRequest)WebRequest.Create(uriS3);
reqS3.Headers.Add("Authorization", authHeader);
reqS3.Headers.Add("x-amz-date", today);
reqS3.ContentType = file.SelectSingleNode("content_type").InnerText;
reqS3.ContentLength = Convert.ToInt32(file.SelectSingleNode("bytes").InnerText);
reqS3.Method = "PUT";
Note the added x-amz-date header as HTTPWebRequest sends the date in a different format to what AWS is expecting.
From there it was just a case of repeating what I was already doing.
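For reuse, the signing step above can be pulled into a small helper. This is only a restatement of the same HMAC-SHA1 plus Base64 calculation as a sketch; S3Signer and its method names are hypothetical, and building the string to sign is left to the caller exactly as shown above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper wrapping the signing step from the snippet above.
public static class S3Signer
{
    // HMAC-SHA1 over the UTF-8 string to sign, keyed with the secret,
    // returned as Base64.
    public static string Sign(string secretKey, string stringToSign)
    {
        using (var hmac = new HMACSHA1(Encoding.UTF8.GetBytes(secretKey)))
        {
            byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign));
            return Convert.ToBase64String(hash);
        }
    }

    // Formats the value for the Authorization header: "AWS key:signature".
    public static string AuthorizationHeader(string accessKey, string secretKey, string stringToSign)
    {
        return "AWS " + accessKey + ":" + Sign(secretKey, stringToSign);
    }
}
```

With this in place, authHeader becomes S3Signer.AuthorizationHeader(AWSKey, AWSSecret, stringToSign).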
Take a look at Amazon S3 Authentication Tool for Curl. From that web page:
Curl is a popular command-line tool for interacting with HTTP
services. This Perl script calculates the proper signature, then calls
Curl with the appropriate arguments.
You could probably adapt it or its output for your use.
I think the problem is that according to the AWS Documentation Content-Length is required and you don't know what the length is until the stream has finished.
(I would guess the Amazon.S3.Util.AmazonS3Util.MakeStreamSeekable routine is reading the whole stream into memory to get around this problem which makes it unsuitable for your scenario.)
What you can do is read the file in chunks and upload them using MultiPart upload.
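A sketch of the chunking half of that idea: read the non-seekable stream into fixed-size parts, each of which has a known length and could be handed to a multipart upload. StreamParts is a hypothetical helper and the actual SDK upload calls are deliberately left out. Note that S3 expects every part except the last to be at least 5 MB, so partSize should be chosen accordingly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: splits a forward-only stream into parts of known size.
public static class StreamParts
{
    // Yields byte arrays of exactly partSize bytes, except the final part,
    // which holds whatever remains. Only one part is buffered at a time.
    public static IEnumerable<byte[]> Read(Stream source, int partSize)
    {
        while (true)
        {
            byte[] buffer = new byte[partSize];
            int filled = 0;
            int read;
            // Read may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends.
            while (filled < partSize &&
                   (read = source.Read(buffer, filled, partSize - filled)) > 0)
            {
                filled += read;
            }

            if (filled == 0)
            {
                yield break;
            }
            if (filled < partSize)
            {
                byte[] last = new byte[filled];
                Buffer.BlockCopy(buffer, 0, last, 0, filled);
                yield return last;
                yield break;
            }
            yield return buffer;
        }
    }
}
```

Each yielded array can be wrapped in a MemoryStream, which is seekable, before being passed to whatever upload-part call you use.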
PS, I assume you know the C# source for the AWSSDK for dotnet is on Github.
This is a true hack (which would probably break with a new implementation of the AWSSDK), and it requires knowledge of the length of the file being requested, but it works if you wrap the response stream with this class (a gist), as shown below:
long length = fileLength;
// You can get the file length in several ways. I am uploading from a Dropbox link, so they give me the
// length along with the URL. Alternatively, you can perform a HEAD request and get the Content-Length.
string uri = RSConnection.StorageUrl + "/" + container + "/" + file.SelectSingleNode("name").InnerText;
var req = (HttpWebRequest)WebRequest.Create(uri);
req.Headers.Add("X-Auth-Token", RSConnection.AuthToken);
req.Method = "GET";
using (var resp = req.GetResponse() as HttpWebResponse)
{
using (Stream stream = resp.GetResponseStream())
{
//I haven't tested this path
Amazon.S3.Transfer.TransferUtility trans = new Amazon.S3.Transfer.TransferUtility(S3Client);
trans.Upload(new HttpResponseStream(stream, length), config.Element("root").Element("S3BackupBucket").Value, container + file.SelectSingleNode("name").InnerText);
//Use EITHER the above OR the below
//I have tested this with dropbox data
PutObjectRequest putReq = new PutObjectRequest();
putReq.WithBucketName(config.Element("root").Element("S3BackupBucket").Value);
putReq.WithKey(container + file.SelectSingleNode("name").InnerText);
putReq.WithInputStream(new HttpResponseStream(stream, length));
//These are necessary for really large files to work
putReq.WithTimeout(System.Threading.Timeout.Infinite);
putReq.WithReadWriteTimeout(System.Threading.Timeout.Infinite);
using (S3Response putResp = S3Client.PutObject(putReq))
{
}
}
}
The hack is overriding the Position and Length properties, and returning 0 for Position{get}, noop'ing Position{set}, and returning the known length for Length.
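Since the gist itself is not reproduced here, this is a rough sketch of what such a wrapper might look like, based only on the description above. KnownLengthStream is a hypothetical name, and reporting CanSeek as true and having Seek do nothing are assumptions about what the SDK needs, not something taken from the gist.

```csharp
using System;
using System.IO;

// Hypothetical wrapper: reports a caller-supplied length for a forward-only stream.
public class KnownLengthStream : Stream
{
    private readonly Stream inner;
    private readonly long length;

    public KnownLengthStream(Stream inner, long length)
    {
        this.inner = inner;
        this.length = length;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }   // assumption, see lead-in
    public override bool CanWrite { get { return false; } }

    // The known length supplied by the caller, not measured from the stream.
    public override long Length { get { return length; } }

    // Always reports 0 and ignores writes, as described above.
    public override long Position { get { return 0; } set { } }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return inner.Read(buffer, offset, count);
    }

    // Seeking is not really supported; pretend we are at the start.
    public override long Seek(long offset, SeekOrigin origin) { return 0; }

    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }
}
```

If the supplied length is wrong, the upload will most likely fail or be truncated, which matches the caveats below.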
I recognize that this might not work if you don't have the length or if the server providing the source does not support HEAD requests and Content-Length headers. I also realize it might not work if the reported Content-Length or the supplied length doesn't match the actual length of the file.
In my test, I also supply the Content-Type to the PutObjectRequest, but I don't think that is necessary.
As sgmoore said, the problem is that the HTTP response stream is not seekable, so its length cannot be read from the stream itself. However, HttpWebResponse does expose a ContentLength property, so you can form the HTTP request to S3 yourself instead of using the Amazon library.
Here's another Stackoverflow question that managed to do that with what looks like full code to me.
I am creating a HttpWebRequest object from another aspx page to save the response stream to my data store. The Url I am using to create the HttpWebRequest object has querystring to render the correct output. When I browse to the page using any old browser it renders correctly. When I try to retrieve the output stream using the HttpWebResponse.GetResponseStream() it renders my built in error check.
Why would it render correctly in the browser, but not using the HttpWebRequest and HttpWebResponse objects?
Here is the source code:
Code behind of target page:
protected void Page_Load(object sender, EventArgs e)
{
string output = string.Empty;
if(Request.QueryString["a"] != null)
{
//generate output
output = "The query string value is " + Request.QueryString["a"].ToString();
}
else
{
//generate message indicating the query string variable is missing
output = "The query string value was not found";
}
Response.Write(output);
}
Code behind of page creating HttpWebRequest object
string url = "http://www.mysite.com/mypage.aspx?a=1";
HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
//this if statement was missing from original example
if(User.Length > 0)
{
request.Credentials = new NetworkCredential("myaccount", "mypassword", "mydomain");
request.PreAuthenticate = true;
}
request.UserAgent = Request.UserAgent;
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
Stream resStream = response.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
StreamReader readStream = new StreamReader(resStream, encode, true, 2000);
// Buffer for reading the response 256 characters at a time.
char[] read = new char[256];
int count = readStream.Read(read, 0, read.Length);
string str = Server.HtmlEncode(" ");
while (count > 0)
{
// Dumps the 256 characters on a string and displays the string to the console.
string strRead = new string(read, 0, count);
str = str.Replace(str, str + Server.HtmlEncode(strRead.ToString()));
count = readStream.Read(read, 0, 256);
}
// return what was found
result = str.ToString();
resStream.Close();
readStream.Close();
Update
@David McEwing - I am creating the HttpWebRequest with the full page name. The page is still generating the error output. I updated the code sample of the target page to demonstrate exactly what I am doing.
@Chris Lively - I am not redirecting to an error page, I generate a message indicating the query string value was not found. I updated the source code example.
Update 1:
I tried using Fiddler to trace the HttpWebRequest and it did not show up in the Web Sessions history window. Am I missing something in my source code to get a complete web request and response?
Update 2:
I did not include the following section of code in my example, and it was the culprit. I was setting the Credentials property of the HttpWebRequest with a service account instead of my AD account, which was causing the issue.
I updated my source code example
What webserver are you using? I can remember at one point in my past when doing something with IIS there was an issue where the redirect between http://example.com/ and http://example.com/default.asp dropped the query string.
Perhaps run Fiddler (or a protocol sniffer) and see if there is something happening that you aren't expecting.
Also check if passing in the full page name works. If it does the above is almost certainly the problem.
Optionally, you can try setting the AllowAutoRedirect property of the HttpWebRequest object.
I need to replace the following line of code:
request.Credentials = new NetworkCredential("myaccount", "mypassword", "mydomain");
with:
request.Credentials = System.Net.CredentialCache.DefaultNetworkCredentials;