We are using ExpertPDF to take URLs and turn them into PDFs. Everything we do is in memory: we build up the request, read the response stream into ExpertPDF, and then write the bytes to a file. All the files we have been requesting so far are plain HTML documents. Our designers update CSS files or change the HTML and re-request the documents as PDFs, but often things get cached. For example, if I rename the only CSS file and view the HTML page in a web browser, the page looks broken because the CSS no longer exists. But if I request that page through the PDF generator, it still looks fine, which means the CSS is cached somewhere. Here's the relevant PDF creation code:
// Create a request
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "IE 8.0";
request.ContentType = "application/x-www-form-urlencoded";
request.Method = "GET";

// Send the request
HttpWebResponse resp = (HttpWebResponse)request.GetResponse();
if (resp.IsFromCache) {
    System.Web.HttpContext.Current.Trace.Write("FROM THE CACHE!!!");
} else {
    System.Web.HttpContext.Current.Trace.Write("not from cache");
}

// Read the response
pdf.SavePdfFromHtmlStream(resp.GetResponseStream(), System.Text.Encoding.UTF8, "Output.pdf");
When I check the trace file, nothing is being loaded from the cache. I checked the IIS log file and found a 200 response for the request, even after a file had been updated (I would expect a 304). We've tried putting a no-cache attribute on all the HTML pages, but still no luck. I even turned off all caching at the IIS level. Is there anything in ExpertPDF that might be caching somewhere, or something I can do to the request object to force a hard refresh of all resources?
UPDATE
I put ?foo at the end of my stylesheet href links and this updates the CSS every time. Is there a setting somewhere that can prevent stylesheets from being cached so I don't have to use this inelegant workaround?
Actually this is a perfectly normal solution, though I would recommend attaching something like the current date and time to the PDF link/file name (like you did for the CSS sheet) rather than foo on your style sheet, for example:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url + "?" + DateTime.Now.ToString().Replace(":", "").Replace("-", "").Replace(" ", ""));
Since the date and time will ALWAYS change, you will force the download each time.
I would venture to guess that it's not the CSS style sheet being cached, but rather the PDF being cached by the client. Adding the URL variable to your stylesheet is preventing it from being cached. (I think you fixed the problem, but probably not, in my opinion, in the best way.) Try the above tip, and you should not have any file caching problems.
PS. I know you can use DateTime.Now.ToString(formathere) but I am too lazy to look it up right now ;)
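To make the cache-busting idea a bit more robust, here is a minimal sketch of a helper that appends a timestamp query parameter, using "?" or "&" depending on whether the URL already has a query string. The AppendCacheBuster name and the _ts parameter are made up for illustration, not part of the original code:

// Sketch only: make the requested URL unique on every call.
static string AppendCacheBuster(string url)
{
    string separator = url.Contains("?") ? "&" : "?";
    return url + separator + "_ts=" + DateTime.Now.ToString("yyyyMMddHHmmssfff");
}

// Usage:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(AppendCacheBuster(url));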
Here is the code below:
List<int> j = new List<int>();
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(result.SiteURL);
webRequest.AllowAutoRedirect = false;
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
j.Add((int)response.StatusCode);
What I want to do is get all the response codes, separate them (like 2xx, 3xx, 4xx-5xx), and put them in different lists, because I need their counts, e.g. how many 4xx responses there are or how many 200 responses there are. Or is there another way to do it?
result.SiteURL is the URL for the responses. The problem is that the last line of the code doesn't return or get anything. What am I missing here?
Edit: The main problem is that whatever I try, I only get one response code, and that is mostly 200 OK. But for youtube.com (etc.) there should be 74 OK (200) responses, 1 No Content (204) response, and 2 Moved Permanently (301) responses, according to https://tools.pingdom.com/#!/fMjhr/youtube.com. How am I going to get them?
You misunderstand the result shown by pingdom.
Pingdom requests a web page just like a browser would: it loads the page itself, as well as all resources referenced by the page: style sheets, scripts, images, etc.
Your code only loads the main HTML page, which has great availability and always returns 200 OK.
If you want to reproduce pingdom's results, you'll need to parse the HTML page and load the page's resources as well. Keep in mind that parsing HTML is a non-trivial task (browser vendors put a lot of effort in it), so you might want to reconsider whether this is worth your time.
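As a rough illustration of that approach, here is a sketch that pulls src/href attributes out of the HTML with a simple regular expression (a real HTML parser such as HtmlAgilityPack would be more reliable), requests each resource, and tallies the status codes by class. The regex, the counting scheme, and the example URL are assumptions, not code from the question:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class ResourceStatusCounter
{
    static void Main()
    {
        string pageUrl = "http://www.youtube.com/"; // example page
        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        // Very rough extraction of linked resources.
        MatchCollection matches = Regex.Matches(html,
            "(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

        Dictionary<int, int> countsByClass = new Dictionary<int, int>(); // 2 -> count of 2xx, etc.
        Uri baseUri = new Uri(pageUrl);

        foreach (Match m in matches)
        {
            Uri resourceUri;
            if (!Uri.TryCreate(baseUri, m.Groups[1].Value, out resourceUri))
                continue;
            if (resourceUri.Scheme != Uri.UriSchemeHttp && resourceUri.Scheme != Uri.UriSchemeHttps)
                continue;

            int statusCode;
            try
            {
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(resourceUri);
                req.AllowAutoRedirect = false;
                using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
                {
                    statusCode = (int)resp.StatusCode;
                }
            }
            catch (WebException ex)
            {
                // 4xx/5xx responses surface as exceptions; the response is still attached.
                HttpWebResponse errorResp = ex.Response as HttpWebResponse;
                if (errorResp == null)
                    continue;
                statusCode = (int)errorResp.StatusCode;
            }

            int statusClass = statusCode / 100;
            if (!countsByClass.ContainsKey(statusClass))
                countsByClass[statusClass] = 0;
            countsByClass[statusClass]++;
        }

        foreach (KeyValuePair<int, int> pair in countsByClass)
            Console.WriteLine(pair.Key + "xx: " + pair.Value);
    }
}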
I have an application which logs you on to a website and then downloads some files from that website.
I have been able to download all types of files and save them properly, except those whose link redirects to another page.
For example, if in the source code of the web page the link address is written as
"http://someurl.com/view.php", then this link redirects and the download starts immediately (when we click on the link in a web browser).
I have been able to download this file programmatically using HttpWebRequest
by setting AllowAutoRedirect = true.
The problem arises while saving: I need the extension of the downloaded file (whether it is a Word document, a PDF file, or some other type).
How should I check that?
Some of the code I am using is:
HttpWebRequest request = WebRequest.Create(UriObj) as HttpWebRequest;
request.Proxy = null;
request.CookieContainer = CC;
request.AllowAutoRedirect = true ;
request.ContentType = "application/x-www-form-urlencoded";
When you get a redirected response, the ResponseUri property will give you the URI of the resource that actually responded.
So if http://someurl.com/view.php redirected to http://example.com/foo.doc, then ResponseUri will contain http://example.com/foo.doc.
Understand that the "extension" might not be what you expect. For example, I've seen a URL like http://example.com/document.php return a PDF file. I've seen URLs with ".mp3" extension return image files, etc.
You can check the Content-Type header (the ContentType property on the response), which is usually a more reliable indicator of the actual content of the response, but not a guarantee.
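A minimal sketch of combining the two ideas when saving the download: take the extension from ResponseUri if it has one, otherwise fall back to a small Content-Type lookup. The mapping and the file name are assumptions for illustration (request is the HttpWebRequest from the question's snippet):

// Sketch only: derive a file extension for the downloaded content.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// 1) Try the URI that actually answered (after any redirects).
string extension = System.IO.Path.GetExtension(response.ResponseUri.AbsolutePath);

// 2) Fall back to the Content-Type header if the URI has no usable extension.
if (string.IsNullOrEmpty(extension))
{
    string contentType = response.ContentType; // e.g. "application/pdf"
    if (contentType.StartsWith("application/pdf")) extension = ".pdf";
    else if (contentType.StartsWith("application/msword")) extension = ".doc";
    else if (contentType.StartsWith("image/jpeg")) extension = ".jpg";
    else extension = ".bin"; // unknown type
}

string fileName = "download" + extension; // placeholder name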
I'm using an .aspx page to serve an image file from the file system according to the given parameters.
Server.Transfer(imageFilePath);
When this code runs, the image is served, but no Last-Modified HTTP Header is created.
as opposed to the same file being requested directly by its URL on the same server.
Therefore the browser doesn't issue an If-Modified-Since header and doesn't cache the response.
Is there a way to make the server create the HTTP headers like it normally does for a direct request of a file (an image in this case), or do I have to create the headers manually?
When you make a transfer to the file, the server will return the same headers as it does for an .aspx file, because it's basically executed by the .NET engine.
You basically have two options:
Make a redirect to the file instead, so that the browser makes the request for it.
Set the headers you want, and use Response.BinaryWrite (or similar) to send the file data back in the response.
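A rough sketch of that second option, assuming an image file on disk and ignoring conditional-request handling (the content type is a placeholder; imageFilePath is the variable from the question):

// Sketch only: serve the file manually with explicit headers.
string physicalPath = Server.MapPath(imageFilePath);

Response.Clear();
Response.ContentType = "image/jpeg"; // placeholder; use the real type of the file
Response.AddHeader("Last-Modified",
    System.IO.File.GetLastWriteTimeUtc(physicalPath).ToString("R"));
Response.BinaryWrite(System.IO.File.ReadAllBytes(physicalPath));
Response.End();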
I'll expand on #Guffa's answer and share my chosen solution.
When calling the Server.Transfer method, the .NET engine treats the target like an .aspx page, so it doesn't add the HTTP headers needed (e.g. for caching) when serving a static file.
There are three options:
Using Response.Redirect, so the browser makes the appropriate request
Setting the needed headers and using Response.BinaryWrite to serve the content
Setting the needed headers and calling Server.Transfer
I chose the third option; here is my code:
try
{
    DateTime fileLastModified = File.GetLastWriteTimeUtc(MapPath(fileVirtualPath));

    // Truncate to whole seconds, because the If-Modified-Since header
    // only has one-second resolution.
    fileLastModified = new DateTime(fileLastModified.Year, fileLastModified.Month,
        fileLastModified.Day, fileLastModified.Hour, fileLastModified.Minute,
        fileLastModified.Second);

    if (Request.Headers["If-Modified-Since"] != null)
    {
        DateTime modifiedSince = DateTime.Parse(Request.Headers["If-Modified-Since"]);
        if (modifiedSince.ToUniversalTime() >= fileLastModified)
        {
            Response.StatusCode = 304;
            Response.StatusDescription = "Not Modified";
            return;
        }
    }

    Response.AddHeader("Last-Modified", fileLastModified.ToString("R"));
}
catch
{
    Response.StatusCode = 404;
    Response.StatusDescription = "Not found";
    return;
}

Server.Transfer(fileVirtualPath);
The goal of my program is to grab a webpage and then generate a list of Absolute links with the pages it links to.
The problem I am having is that when a page redirects to another page without the program knowing, all the relative links are resolved incorrectly.
For example:
I give my program this link: moodle.pgmb.si/moodle/course/view.php?id=1
On this page, if it finds the link href="signup.php" meaning signup.php in the current directory, it errors because there is no directory above the root.
However this error is invalid because the page's real location is:
moodle.pgmb.si/moodle/login/index.php
Meaning that "signup.php" is linking to moodle.pgmb.si/signup.php which is a valid page, not moodle.pgmb.si/moodle/course/signup.php like my program thinks.
So my question is how is my program supposed to know that the page it received is at another location?
I am doing this in C# using the following code to get the HTML:
WebRequest wrq = WebRequest.Create(address);
WebResponse wrs = wrq.GetResponse();
StreamReader strdr = new StreamReader(wrs.GetResponseStream());
string html = strdr.ReadToEnd();
strdr.Close();
wrs.Close();
You should be able to use the ResponseUri property of the WebResponse class. This will contain the URI of the internet resource that actually provided the response data, as opposed to the resource that was requested. You can then use this URI to build correct links.
http://msdn.microsoft.com/en-us/library/system.net.webresponse.responseuri.aspx
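For example, a sketch of resolving a relative href against the URI that actually responded, reusing the variable names from the question's snippet:

WebRequest wrq = WebRequest.Create(address);
WebResponse wrs = wrq.GetResponse();

// wrs.ResponseUri is the address after any redirects,
// e.g. http://moodle.pgmb.si/moodle/login/index.php
Uri actualPage = wrs.ResponseUri;

// Resolve a relative link found in the HTML against the real page location.
Uri absoluteLink = new Uri(actualPage, "signup.php");
// absoluteLink -> http://moodle.pgmb.si/moodle/login/signup.php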
What I would do is first check whether each link is absolute or relative by searching for "http://" within it. If it's absolute, you're done. If it's relative, then you need to prepend the path of the page you're scanning.
There are a number of ways you could get the current path: you could Split() it on the slashes ("/") and then recombine all but the last piece, or you could search for the last occurrence of a slash and take a substring up to and including that position.
Edit: Re-reading the question, I'm not sure I am understanding. href="signup.php" is a relative link, which should go to ./signup.php, so the current behavior you mentioned, "moodle.pgmb.si/moodle/course/signup.php", is correct.
The problem is that, whether a URL is relative or absolute, you have no way of knowing where it actually goes unless you request it. Even then, it might not actually be served from where you think it is located, because it might be implemented as an HTTP redirect or similar on the server side.
So if you want to be exhaustive, what you can do is:
Use your current technique to grab a list of all links on the page.
Attempt to request each of those pages. Then if you:
Get a 200 response code, then all is good - it's there.
Get a 404 response code, then you know the page does not exist.
Get a 3xx response code, then you know where the web server expects that content to actually originate from.
Your (Http)WebResponse object has a StatusCode property. Note that you should also handle any possible WebException errors - these too carry a WebResponse with a StatusCode in it (usually 4xx or 5xx).
You can also look at the HttpWebResponse Headers property - specifically the Location header.
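A short sketch of checking one link this way, including the WebException case (linkUrl and the other names are illustrative):

// Sketch only: get the status code for a single link, even when GetResponse() throws.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(linkUrl);
req.AllowAutoRedirect = false;

int statusCode;
string redirectTarget = null;

try
{
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    {
        statusCode = (int)resp.StatusCode;
        redirectTarget = resp.Headers["Location"]; // non-null for 3xx responses
    }
}
catch (WebException ex)
{
    HttpWebResponse errorResp = ex.Response as HttpWebResponse;
    statusCode = errorResp != null ? (int)errorResp.StatusCode : 0; // 0 = no HTTP response at all
}

// statusCode can now be sorted into 2xx / 3xx / 4xx / 5xx buckets as needed.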
Can somebody tell me how to create this kind of URL?
For example, if you look at the URL
http://office.microsoft.com/global/images/default.aspx?assetid=ZA103873861033
you will be redirected to an image.
My question is: though this URL serves an image, its extension is .aspx. How is that possible?
How do I create this kind of URL?
Thanks
This is a common method for displaying an image that's stored as a binary object in a database. One tutorial, among many, can be found here.
Essentially, what they're doing is using the aspx page to accept the URL parameter which tells them what image to fetch from the database. Then in the response they clear all output and headers, set the headers for the image, write the binary data to the response stream, and close the response stream.
So it's not really "redirecting" you to an image. The "page" being requested turns out to be an image resource in the response.
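As a sketch of what such a page's code-behind might look like (the query-string parameter and the data-access helper are hypothetical; the point is the clear-headers-then-write-bytes flow described above):

// Sketch only: Page_Load of an .aspx page that returns image bytes instead of HTML.
protected void Page_Load(object sender, EventArgs e)
{
    string assetId = Request.QueryString["assetid"];     // id of the image to serve
    byte[] imageBytes = LoadImageFromDatabase(assetId);  // hypothetical data-access helper

    Response.Clear();                     // drop any buffered HTML output
    Response.ContentType = "image/jpeg";  // tell the browser this is an image
    Response.BinaryWrite(imageBytes);     // write the raw bytes
    Response.End();                       // stop further page processing
}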
By setting the ContentType in the response from the server
HttpContext.Response.ContentType = "image/jpeg";
The easiest way is to add a generic handler (*.ashx). In the .ashx code-behind you can read the query string and manipulate the response, e.g. with Response.WriteFile(...).
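A minimal sketch of such a generic handler, assuming the image lives on disk (the parameter name and folder are placeholders, and real code should validate the input before mapping it to a path):

// Sketch only: ImageHandler.ashx - serves an image file with the right content type.
using System.Web;

public class ImageHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string name = context.Request.QueryString["name"];         // placeholder parameter
        string path = context.Server.MapPath("~/images/" + name);  // placeholder location

        context.Response.ContentType = "image/jpeg";
        context.Response.WriteFile(path);
    }

    public bool IsReusable
    {
        get { return false; }
    }
}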
File extensions literally have no meaning on the WWW. The thing that correctly describes the content at a particular URL is the content-type/MIME-type. This is delivered in an HTTP header when the URL is requested prior to delivery of the main HTTP payload. Other answers describe how you might correctly set this in ASP.NET.
Aside from all the other answers, they may be doing a Server.Transfer() (so that you don't see it client-side) to the image file. This still means the response headers are being set to the appropriate MIME type, but it also means the image isn't necessarily coming from a database. This technique can be used to hide the actual image URL in an attempt to prevent hotlinking.