Here is the code:
List<int> j = new List<int>();
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(result.SiteURL);
webRequest.AllowAutoRedirect = false;   // don't follow redirects, so 3xx codes stay visible
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
j.Add((int)response.StatusCode);        // e.g. 200, 301, ...
What I want to do is get all the response codes, separate them (like 2xx, 3xx, 4xx-5xx), and put them in different lists, because I need their counts - like how many 4xx responses there are, or how many 200 responses there are. Or is there another way to do it?
result.SiteURL is the URL the responses are for. The problem is that the last line of the code doesn't return or get anything. What am I missing here?
Edit: The main problem is that whatever I try, I only get one response code, and that is mostly 200 OK. But for youtube.com (etc.) there should be 74 OK (200) responses, 1 No Content (204) response, and 2 Moved Permanently (301) responses according to https://tools.pingdom.com/#!/fMjhr/youtube.com. How am I going to get them?
You misunderstand the result shown by pingdom.
Pingdom requests a web page just like a browser would: it loads the page itself, as well as all resources referenced by the page - style sheets, scripts, images, etc.
Your code only loads the main HTML page, which has great availability and always returns 200 OK.
If you want to reproduce pingdom's results, you'll need to parse the HTML page and load the page's resources as well. Keep in mind that parsing HTML is a non-trivial task (browser vendors put a lot of effort in it), so you might want to reconsider whether this is worth your time.
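If you do get that far, grouping the status codes by class is the easy part. Here is a minimal sketch, assuming you have already built a resourceUrls collection from the parsed page (the method and variable names are just placeholders); note that HttpWebRequest reports 4xx/5xx responses by throwing a WebException, so those have to be caught to read their codes.

using System.Collections.Generic;
using System.Net;

static Dictionary<string, List<int>> BucketStatusCodes(IEnumerable<string> resourceUrls)
{
    var buckets = new Dictionary<string, List<int>>();    // "2xx" -> [200, 200, ...], etc.

    foreach (string url in resourceUrls)
    {
        int code;
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.AllowAutoRedirect = false;             // keep 3xx codes visible
            using (var response = (HttpWebResponse)request.GetResponse())
                code = (int)response.StatusCode;
        }
        catch (WebException ex)
        {
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null) throw;              // DNS failure, timeout, ... - no status code at all
            code = (int)errorResponse.StatusCode;          // 404, 500, ...
        }

        string key = (code / 100) + "xx";
        if (!buckets.ContainsKey(key))
            buckets[key] = new List<int>();
        buckets[key].Add(code);
    }
    return buckets;
}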
I am working with the shift4shop web API. These guys used to be known as threeDCart, if that helps anyone. It's an eCommerce platform.
We are trying to apply a promotion code to an open cart.
Support has verified there is no API way to do that.
There is a URL that will apply the promotion. This is often emailed to customers so they can apply the promo if they choose to.
We can paste the correct URL into Chrome, Brave, Edge, or Firefox and it correctly applies the promotion.
We used private tabs for the different browser tests, and the browsers were 'cold': we launched the browser and immediately entered the URL.
We think this eliminates the possibility that cookies are necessary.
https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro
However, when I try to do this in C#, I get a response that is redirected to a page that says 'The cart is empty'.
The promotion is not applied.
I am stumped as to how the website would respond differently to the same URL when it comes from a browser as opposed to the C# System.Net library.
Here is the C# code I am using:
using System.Net;

// I really create this using my data, but this is the resulting URL
string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
string result = "";
using (StreamReader rdr = new StreamReader(response.GetResponseStream()))
{
    result = rdr.ReadToEnd();
}
You can also call ".view_cart.asp" with the same parameters and the browsers will cause the promo to be applied.
I have tried setting the method to [ , GET, get ].
There has to be something about the request settings that is preventing this from working.
I do not know what else to try.
Any thoughts are appreciated.
As per shift4shop support, continue_order.asp returns a 302.
The browsers land on continue_order.asp and process that page.
They then continue on to view_order.asp.
The two pages together perform functionality that you cannot get by just calling continue_order.asp.
Thanks to Savoy with shift4Shop for helping on that.
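One thing worth trying, as a sketch rather than a confirmed fix: HttpWebRequest will follow the 302 on its own (AllowAutoRedirect defaults to true), but it only carries cookies across that redirect if you give it a CookieContainer, which browsers always do. Whether the cart is actually keyed to a session cookie is an assumption here; the URL is the same placeholder as above.

using System.IO;
using System.Net;

string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.CookieContainer = new CookieContainer();   // lets cookies set by continue_order.asp reach view_order.asp
req.AllowAutoRedirect = true;                  // follow the 302 on to view_order.asp

string result;
using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
using (StreamReader rdr = new StreamReader(response.GetResponseStream()))
{
    result = rdr.ReadToEnd();                  // should now be the processed cart page
}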
My problem is that I can't get a div's InnerText from a table. I have successfully extracted different kinds of data, but I don't know how to read a div from a table.
In the following picture I've highlighted the div, and I need to get the InnerText from it, in this case the number 3.
Click here for first picture
I'm trying to accomplish this using the following path:
"//div[#class='kal']//table//tr[2]/td[1]/div[#class='cipars']"
But I'm getting the following error:
Click here for Error message picture
Assuming that the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this one out, but I can't get any results.
So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to have a div with the class.
Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.
Your original XPath query does not return an element, this is why you are getting the error. In the event the XPath query you give HtmlAgilityPack does not result in a DOM element, it will return null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[@class='kal']//table//descendant::div[@class='cipars']
This will return all of the calendar items (ie 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, well, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with a class of datums... so to get the "3" value from your question:
//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']
Hopefully this is enough to show the issue at least.
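In case it helps, here is a minimal sketch of running the row query above with HtmlAgilityPack and looping over the matches. It assumes html is the page source you already downloaded (and, as the edit below explains, for this particular site that source does not actually contain the table - read on).

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);   // "html" is assumed to be the page source you already have

// all the day numbers in the second row of the calendar
var cells = doc.DocumentNode.SelectNodes(
    "//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']");

if (cells != null)                          // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode cell in cells)
        Console.WriteLine(cell.InnerText);  // 2, 3, 4, ... 8
}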
Edit
Although you do have an XPath problem, you also have another issue.
The site is built in a strange way. When I hit that URL, the calendar is created by some JavaScript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.
Because this is JavaScript (client-side code), HtmlAgilityPack won't execute it. Therefore HtmlAgilityPack doesn't even "see" the table, and the queries against it come back as "not found" (null).
There are two ways around this.
The first is to use a tool that will run the scripts - that is, load up a browser. A great tool for this is Selenium. This will probably be the better overall solution, because it means all the scripting used by the site will actually be called. You can still use XPath with it, so your queries will not change.
The second way is to send a request to the same web service that the page does. This basically gets back the same HTML the page is getting, so we can use it with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. Just for ease of use, I've stolen the code from this SO question. With this, we can send the same request the page does and get the same HTML back.
So to send some POST data, we write a method like so:
using System.IO;
using System.Net;
using System.Text;

public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // write the form data into the request body
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // read the response body back as a string
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
        {
            webpageContent = reader.ReadToEnd();
        }
    }

    return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? Well, the PHP file we are calling is the web service the page uses, and the POST data is what the page sends. The way I found out what data it sends to the service was by debugging the JavaScript (using Chrome's developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the physical HTML of just the table for the calendar.
What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.
var document = new HtmlDocument();
document.LoadHtml(responseBody);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this would get you over the problem you are having, I am assuming the rest of the site is like this. If that is the case, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.
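If you do end up going the Selenium route, a hedged sketch of what it looks like; the package names (Selenium.WebDriver plus a matching chromedriver) and the start URL are assumptions, so adjust them to your setup.

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    // placeholder address - point this at the actual calendar page
    driver.Navigate().GoToUrl("http://lekcijas.va.lv/");

    // the page builds the calendar with JavaScript, so give it a moment to finish
    // (a WebDriverWait on the 'kal' div would be the cleaner way to do this)
    System.Threading.Thread.Sleep(2000);

    // same XPath as before, but evaluated against the rendered DOM
    var days = driver.FindElements(By.XPath("//div[@class='kal']//table//descendant::div[@class='cipars']"));
    foreach (IWebElement day in days)
        Console.WriteLine(day.Text);
}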
The goal of my program is to grab a webpage and then generate a list of absolute links to the pages it links to.
The problem I am having is that when a page redirects to another page without the program knowing, it makes all the relative links wrong.
For example:
I give my program this link: moodle.pgmb.si/moodle/course/view.php?id=1
On this page, if it finds the link href="signup.php" meaning signup.php in the current directory, it errors because there is no directory above the root.
However this error is invalid because the page's real location is:
moodle.pgmb.si/moodle/login/index.php
Meaning that "signup.php" is linking to moodle.pgmb.si/signup.php which is a valid page, not moodle.pgmb.si/moodle/course/signup.php like my program thinks.
So my question is how is my program supposed to know that the page it received is at another location?
I am doing this in C# using the following code to get the HTML:
WebRequest wrq = WebRequest.Create(address);
WebResponse wrs = wrq.GetResponse();
StreamReader strdr = new StreamReader(wrs.GetResponseStream());
string html = strdr.ReadToEnd();
strdr.Close();
wrs.Close();
You should be able to use the ResponseUri property of the WebResponse class. This will contain the URI of the internet resource that actually provided the response data, as opposed to the resource that was requested. You can then use this URI to build correct links.
http://msdn.microsoft.com/en-us/library/system.net.webresponse.responseuri.aspx
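A minimal sketch of the idea, reusing the code from the question; the Uri class then takes care of resolving a relative href against the address that actually served the page.

using System;
using System.IO;
using System.Net;

WebRequest wrq = WebRequest.Create("http://moodle.pgmb.si/moodle/course/view.php?id=1");
using (WebResponse wrs = wrq.GetResponse())
using (StreamReader strdr = new StreamReader(wrs.GetResponseStream()))
{
    string html = strdr.ReadToEnd();

    Uri actualLocation = wrs.ResponseUri;                  // where the content really came from
    Uri absolute = new Uri(actualLocation, "signup.php");  // resolves relative to that location
    Console.WriteLine(absolute);                           // e.g. http://moodle.pgmb.si/moodle/login/signup.php
}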
What I would do is first check whether each link is absolute or relative by searching for "http://" within it. If it's absolute, you're done. If it's relative, then you need to prepend the path of the page you're scanning.
There are a number of ways you could get the current path: you could Split() it on the slashes ("/"), then recombine all but the last piece. Or you could search for the last occurrence of a slash and then take a substring up to and including that position.
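For instance, something along these lines (a sketch of the substring variant; the example values come from the question):

using System;

string currentPage = "http://moodle.pgmb.si/moodle/course/view.php?id=1";
string link = "signup.php";

string absoluteLink;
if (link.StartsWith("http://") || link.StartsWith("https://"))
{
    absoluteLink = link;                  // already absolute, nothing to do
}
else
{
    // everything up to and including the last slash of the current page's URL
    string basePath = currentPage.Substring(0, currentPage.LastIndexOf('/') + 1);
    absoluteLink = basePath + link;       // http://moodle.pgmb.si/moodle/course/signup.php
}
Console.WriteLine(absoluteLink);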
Edit: Re-reading the question, I'm not sure I am understanding. href="signup.php" is a relative link, which should go to the /signup.php. So the current behavior you mentioned is correct "moodle.pgmb.si/moodle/course/signup.php."
The problem is that, if the URL isn't a relative or absolute URL, then you have no way of knowing where it goes unless you request it. Even then, it might not actually be served from where you think it is located, because it might be implemented as an HTTP redirect or similar on the server side.
So if you want to be exhaustive, what you can do is:
Use your current technique to grab a list of all links on the page.
Attempt to request each of those pages. Then, if you:
Get a 200 response code, all is good - it's there.
Get a 404 response code, you know the page does not exist.
Get a 3xx response code, you know where the web server expects that content to actually originate from.
Your (Http)WebResponse object should have a StatusCode property. Note that you should also handle any possible WebException errors - these too will carry a response with a status code (usually 5xx).
You can also look at the HttpWebResponse Headers property - the Location header.
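A hedged sketch of that exhaustive check, tying the status codes and the Location header together (Probe is a hypothetical helper name):

using System;
using System.Net;

static void Probe(string link)
{
    var request = (HttpWebRequest)WebRequest.Create(link);
    request.AllowAutoRedirect = false;                // keep the 3xx visible instead of following it
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int code = (int)response.StatusCode;
            if (code >= 300 && code < 400)
                Console.WriteLine(link + " actually lives at " + response.Headers["Location"]);
            else
                Console.WriteLine(link + " -> " + code);                       // 200: all good
        }
    }
    catch (WebException ex)
    {
        var errorResponse = ex.Response as HttpWebResponse;
        if (errorResponse != null)
            Console.WriteLine(link + " -> " + (int)errorResponse.StatusCode);  // e.g. 404: does not exist
        else
            Console.WriteLine(link + " -> " + ex.Status);                      // no HTTP response at all
    }
}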
Someone please help - I've been struggling with this lousy problem!
What I'm doing: I have an ASPX page from which I originate a GET and then a POST to an HTTPS page, with a view to logging in to it. I have spent quite a bit of time comparing my GET and POST construction to a browser GET/POST using Fiddler (a protocol analyzer), and my requests are fine.
However, when I log in through the browser, everything works fine and it logs in. When I run my page, I can see the correct GET and POST, but I get a 302 Found 'object moved' error.
Originally I thought this was a cookie issue, but after much experimentation I'm pretty sure this has nothing to do with cookies. I have disabled cookies AND JavaScript in the browser and tried, and the pages work fine without either. I then simulated the exact GET/POST.
This is my situation:
My GET and the browser's GET are EXACTLY THE SAME.
The 200 OK response from the site is EXACTLY the same EXCEPT for three VIEWSTATE variables, which have slightly different lengths (why? why different even if the GET is the same?).
My POST and the browser's POST are EXACTLY the same EXCEPT for the 3 VIEWSTATE variables (I fill them correctly from the GET).
And yet the browser logs in, while I get a 302 Found / object moved error.
A couple of other things -
a) I copied the POST body from a recent browser POST and replaced my POST params with it, and that got me the right response! This indicates that
- my headers are just fine
- my coding setup / environment etc. are fine
- something fishy is in the VIEWSTATE values, which can only be because they were sent to me that way in the first place (there is no corruption in my parsing of the GET VIEWSTATE variables and using them in the POST; that part is perfectly fine)
Update: I have also tried WebClient just to check - no difference, same 302.
Update: The 'object moved' response basically points to an error page which says 'a serious error occurred blah blah' - the POST is causing an error at the server, and the ONLY difference between the good POST (the browser's) and my POST is the VIEWSTATE variables.
So - WHAT AM I DOING WRONG? Why is this cruel world tormenting me?!!
(PS - one other difference in the browser sequence, not sure how it matters)
Browser:
CONNECT
GET
GET (for a favicon, which returns an error)
CONNECT
POST (success)
Me:
CONNECT
GET
POST (flaming failure, 302 - page moved)
and for those who care, here is my POST header construction code:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(URL);
myRequest.UserAgent = chromeUserAgent;
//myRequest.CookieContainer = cCookies;
myRequest.ContentType = "application/x-www-form-urlencoded";
myRequest.Accept = chromeAccept;
myRequest.Referer = url;
myRequest.AllowAutoRedirect = false;
myRequest.Host = "thesitethatskillingme.com";
myRequest.Headers.Add("Origin", "https://thesitethatskillingme.com");
myRequest.Headers.Add("Accept-Encoding", chromeAcceptEncoding);
myRequest.Headers.Add("Accept-Language", chromeAcceptLanguage);
myRequest.Headers.Add("Accept-Charset", chromeAcceptCharset);
myRequest.Headers.Add("Cache-Control", "max-age=0");
myRequest.ServicePoint.Expect100Continue = false;
myRequest.Method = "POST";
myRequest.KeepAlive = true;
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] bData = ascii.GetBytes(data);
myRequest.ContentLength = bData.Length;
using (Stream oStream = myRequest.GetRequestStream())
oStream.Write(bData, 0, bData.Length);
...and then read stream etc. no cookies.
I finally figured it out - and hopefully someone else who chances upon the same problem won't have to go through this again. It's possible that most HTTP gurus and people familiar with web development would never hit it, but a newbie quite well could.
So what was the problem? I had narrowed it down to VIEWSTATE, which I always suspected (see my post above). It turns out that all I had to do was Server.UrlEncode the parsed VIEWSTATE values before putting them into the POST - that's it. It took me all day to get to that.
So, as a lesson to other newcomers:
If you are trying to POST to a page through code and need to send it VIEWSTATE variables that you parsed from the GET, then Server.UrlEncode each value before building the parameters - for example (a sketch follows after these steps):
- do the GET
- read the response stream into a string
- parse the string (I use HtmlAgilityPack - fabulous)
- param1 = name + "=" + Server.UrlEncode(value) + "&"
- POST data = param1 + param2 + ...
- send this in the POST - voila, it works
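A minimal sketch of those steps, assuming the GET response is already in a string. HtmlAgilityPack pulls out the hidden fields, and the values get URL-encoded before they go into the POST body; the login field names at the end are hypothetical, and HttpUtility.UrlEncode is the same operation as Server.UrlEncode outside a page.

using System.Text;
using System.Web;            // HttpUtility
using HtmlAgilityPack;

static string BuildPostData(string getResponseHtml)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(getResponseHtml);

    var postData = new StringBuilder();
    var hiddenFields = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
    if (hiddenFields != null)
    {
        foreach (HtmlNode input in hiddenFields)
        {
            // __VIEWSTATE, __EVENTVALIDATION, etc. - URL-encode the values before POSTing them back
            string name = input.GetAttributeValue("name", "");
            string value = input.GetAttributeValue("value", "");
            postData.Append(name).Append('=').Append(HttpUtility.UrlEncode(value)).Append('&');
        }
    }

    // hypothetical login fields - the real names depend on the page you are posting to
    postData.Append("txtUser=myUser&txtPassword=myPassword");
    return postData.ToString();
}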
Because I had never, ever programmed with HttpWebRequest etc., I started by narrowing down my problem, eliminating cookies, JavaScript, GET construction, and POST construction one by one using Fiddler (a great analyzer tool, free), and then finally did a byte comparison using BeyondCompare; that's when I caught the VIEWSTATE variable modifications.
I learnt a lesson on URL encoding, and hopefully you won't have to!
We are using ExpertPDF to take URLs and turn them into PDFs. Everything we do is in memory, so we build up the request, read the stream into ExpertPDF, and then write the bits to file. All the files we have been requesting so far are plain HTML documents. Our designers update CSS files or change the HTML and re-request the documents as PDFs, but often things are getting cached. Take, for example, if I rename the only CSS file and view the HTML page through a web browser, the page looks broken because the CSS doesn't exist. But if I request that page through the PDF generator, it still looks OK, which means the CSS is cached somewhere. Here's the relevant PDF creation code:
// Create a request
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "IE 8.0";
request.ContentType = "application/x-www-form-urlencoded";
request.Method = "GET";
// Send the request
HttpWebResponse resp = (HttpWebResponse)request.GetResponse();
if (resp.IsFromCache) {
    System.Web.HttpContext.Current.Trace.Write("FROM THE CACHE!!!");
} else {
    System.Web.HttpContext.Current.Trace.Write("not from cache");
}
// Read the response
pdf.SavePdfFromHtmlStream(resp.GetResponseStream(), System.Text.Encoding.UTF8, "Output.pdf");
When I check the trace file, nothing is being loaded from the cache. I checked the IIS log file and found a 200 response coming from the request, even after a file had been updated (I would expect a 302). We've tried putting the No-Cache attribute on all HTML pages, but still no luck. I even turned off all caching at the IIS level. Is there anything in ExpertPDF that might be caching somewhere, or something I can do to the request object to force a hard refresh of all resources?
UPDATE
I put ?foo at the end of my style href links and this updates the CSS every time. Is there a setting somewhere that can prevent stylesheets from being cached so I don't have to use this inelegant solution?
Actually this is a perfectly normal solution, though I would recommend attaching something like the current date and time to the PDF link/file name (as you did for the CSS sheet) rather than foo on your style sheet. As the date and time will ALWAYS change, you will force the download each time. For example:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url + DateTime.Now.ToString().Replace(":", "").Replace("-", "").Replace(" ", ""));
I would venture to guess that what is being cached is not the CSS style sheet, but rather the PDF itself, by the client. Adding the URL variable to your stylesheet is preventing it from being cached. (I think you fixed the problem, but probably not, in my opinion, in the best way.) Try the above tip, and you should not have any file caching problems.
PS. I know you can use DateTime.Now.ToString(formathere) but I am too lazy to look it up right now ;)
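For what it's worth, a minimal sketch of that format-string version, applied to the request URL rather than the style sheet; url here is the same variable from the question's snippet, and "nocache" is just a made-up parameter name.

using System;
using System.Net;

// the sortable timestamp changes every second, so every request gets a URL the caches have never seen
string busted = url + (url.Contains("?") ? "&" : "?")
    + "nocache=" + DateTime.Now.ToString("yyyyMMddHHmmss");
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(busted);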