I'm working on an existing large site that uses querystring IDs for the different sections (representing physical stores) of the website.
I'd like to be able to implement pathinfo requests for SEO purposes, so I'm looking at URLs like:
http://www.domain.com/cooking-classes.aspx?ID=5 (where 5 would be the ID of the local store)
Is there a way to make this type of URL work?
http://www.domain.com/cooking-classes.aspx?ID=5/chocolate ? I can get the content to work without the querystring, but the existing infrastructure needs the ID to run. I tried:
http://www.domain.com/cooking-classes.aspx/chocolate?ID=5, but the ID comes back incorrectly.
Using http://www.domain.com/cooking-classes.aspx/5/chocolate would mean a rewrite of the page-handling engine.
Am I clutching at straws here? No real way to get PathInfo and Querystring to play nicely with each other?
I'd like to stay away from any IIS mods as we don't have access.
Your last URL is going to yield the best result for search engines; however, you may want to drop the .aspx. You will need to write an HttpHandler or HttpModule to accomplish this. It's actually not as much work as it may seem, and you don't have to change your page at all. Your HttpHandler can do a behind-the-scenes rewrite that preserves the URL the visitor sees. Check out this article on MSDN:
http://msdn.microsoft.com/en-us/library/ms972974.aspx
If you don't need anything super specific, you could use an existing HttpModule like the one mentioned in the post on ScottGu's blog:
http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
He mentions UrlRewriter.net which is open source:
http://urlrewriter.net/
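As a rough sketch of what such a module can look like (the StoreUrlRewriteModule class and the /cooking-classes/5/chocolate URL pattern are my own assumptions, not from those articles), the idea is to catch the friendly URL early and rewrite it back to the existing page, so Request.QueryString["ID"] is still populated and PathInfo carries the keyword:

using System;
using System.Text.RegularExpressions;
using System.Web;

// Hypothetical module: maps /cooking-classes/5/chocolate back to the
// real page, cooking-classes.aspx/chocolate?ID=5, behind the scenes.
public class StoreUrlRewriteModule : IHttpModule
{
    private static readonly Regex StoreUrl =
        new Regex(@"^/cooking-classes/(\d+)/([\w-]+)/?$", RegexOptions.IgnoreCase);

    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            Match m = StoreUrl.Match(ctx.Request.Path);
            if (m.Success)
            {
                // RewritePath(filePath, pathInfo, queryString): the browser
                // keeps the friendly URL while the page sees both the
                // PathInfo segment and the ID querystring it needs.
                ctx.RewritePath("/cooking-classes.aspx",
                                "/" + m.Groups[2].Value,
                                "ID=" + m.Groups[1].Value);
            }
        };
    }

    public void Dispose() { }
}

The module is registered in web.config under <httpModules>, which doesn't require touching IIS, so it fits your no-IIS-mods constraint.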
Our site consists of 3 main pages we call "Start.aspx" and then a content iframe inside of that where the user does nearly all of the site interactions.
Recently though, I've had to implement functionality that will jump between Start.aspx pages in different products and automatically change the content iframe to a specified page.
The actual functionality works just fine, but the issue we're having is that the full querystring is exposed. Because we load all pages in the content iframe, the page URL remains at "Product/Start.aspx" during regular site usage.
However, this new functionality is passing a querystring to Start.aspx (which has appropriate parsers to load the requested page in the content iframe), and we need that URL to remain as "Start.aspx".
So far, I've researched URL Rewriting, which was throwing errors because the landing page for each product is "[Product]/Start.aspx". I've looked at a different URL Rewriting solution, as well as ScottGu's blog post on routing.
The issue is that these solutions seem to be used for simplifying navigation, such as taking "Blogpost.aspx?Year=2013&Month=07&Day=15" and turning it into "Blogpost.aspx/2013/07/15", which really isn't what we're going for. We're not trying to simplify navigation via the URL; we're really just trying to completely hide our querystrings.
What we're going for is turning "[Product]/Start.aspx?frame=Company.aspx?id=1570" into "[Product]/Start.aspx" once the content iframe has what it needs from the initial querystring. We don't need to account for every single page; we just need that to be the overarching rule. 90% of the time it won't be an issue, as most of the work being done doesn't jump from product to product without the user explicitly switching products (which is done in a fashion that specifically uses Response.Redirect("[Product]/Start.aspx")).
Once the content iframe has loaded from the querystring parameters, we don't need them anymore for anything. The rest of the functionality runs through the iframe without any issue.
Am I overthinking this, or am I asking for something that's not really feasible?
As far as literally "removing all of the query string characters" and still being able to pass the querystring values to another page, I do not think that is possible, unless you do it with a Session variable or something like that.
If you're simply worried about sensitive data being displayed in plain text in the query string, there is the option of "encrypting" the query string:
http://www.codeproject.com/Articles/33350/Encrypting-Query-Strings
The query string will still show but it will be "Product/Start.aspx?e0ayfefae0y0someencryptedmess108yfe0ayf0a". The page that receives the query string would decrypt it. So the functionality of the query string is there, but the values are not known to the end user.
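The article ships its own helper classes; purely as a generic illustration of the idea (this is not the article's API), an AES-based version in C# might look like the following. The hard-coded key and IV are placeholders; real code would load them from configuration:

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: turn "frame=Company.aspx&id=1570" into an opaque token for the
// URL and decode it again on the receiving page.
public static class QueryStringCrypto
{
    // Placeholder key material; never hard-code these in real code.
    private static readonly byte[] Key = Encoding.UTF8.GetBytes("0123456789ABCDEF0123456789ABCDEF");
    private static readonly byte[] IV  = Encoding.UTF8.GetBytes("0123456789ABCDEF");

    public static string Encrypt(string query)
    {
        using (Aes aes = Aes.Create())
        using (ICryptoTransform enc = aes.CreateEncryptor(Key, IV))
        {
            byte[] plain  = Encoding.UTF8.GetBytes(query);
            byte[] cipher = enc.TransformFinalBlock(plain, 0, plain.Length);
            // Make the Base64 URL-safe ('=' padding would also need
            // handling in a real implementation).
            return Convert.ToBase64String(cipher).Replace('+', '-').Replace('/', '_');
        }
    }

    public static string Decrypt(string token)
    {
        using (Aes aes = Aes.Create())
        using (ICryptoTransform dec = aes.CreateDecryptor(Key, IV))
        {
            byte[] cipher = Convert.FromBase64String(token.Replace('-', '+').Replace('_', '/'));
            return Encoding.UTF8.GetString(dec.TransformFinalBlock(cipher, 0, cipher.Length));
        }
    }
}

The sending page builds "Start.aspx?e=" + QueryStringCrypto.Encrypt("frame=Company.aspx&id=1570"), and the receiving page decrypts Request.QueryString["e"] to recover the original pairs.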
Since you've tagged this as an ASP.NET question, I'd say the way to go is to keep navigation data in your Session variables.
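A minimal sketch of that approach in the Start.aspx code-behind (the frame parameter comes from the question; the FrameTarget session key and the contentFrame control are placeholder names of mine): stash the value in Session, redirect to the clean URL, and read it back on the clean request:

// In Start.aspx code-behind.
protected void Page_Load(object sender, EventArgs e)
{
    string frame = Request.QueryString["frame"];
    if (!String.IsNullOrEmpty(frame))
    {
        // First hit: remember the target, then bounce to the bare URL so
        // the browser shows just [Product]/Start.aspx.
        Session["FrameTarget"] = frame;
        Response.Redirect(Request.Path);
    }

    // Clean request: feed the remembered target to the iframe.
    string target = Session["FrameTarget"] as string;
    if (target != null)
    {
        contentFrame.Attributes["src"] = target; // contentFrame = <iframe runat="server">
        Session.Remove("FrameTarget");
    }
}

The cost is one extra round trip on the first request, but after the redirect the querystring is gone from the address bar for good.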
Can you use a POST instead of a GET? That way, the data is in the form, rather than the Query String.
As a side note, hiding the parameters as a way of making the URL look nicer and be bookmark-able is fine. If you're doing it for any kind of security reasons, it's very shallow security. It's trivial for a user to see what's being passed in both the form and on the query string and to change and repost those. Security needs to be handled primarily on the server side.
I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (Invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, Javascript, etc. Like for some sites, the links are in PHP, and when my web crawler tries to navigate to these, it fails. One example is a PHP/Javascript accordion link page. How would I go about navigating/parsing these links?
Let's see if I understand your question correctly. I'm aware that this answer is probably inadequate, but if you need a more specific answer I'd need more details.
You're trying to program a web crawler but it cannot crawl URLs that end with .php?
If that's the case, you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on a URI scheme.
In most cases these URLs are just normal HTML, but they could also be a generated image (like a captcha) or a download link for a 700 MB ISO file, and there's no way to be certain without checking the header of the HTTP response from that URL.
Note: If you're writing your own crawler from scratch, you're going to need a good understanding of HTTP.
The first thing your crawler is going to see when it requests a URL is the header, which contains a MIME content type; it tells a browser/crawler how to process and open the data (is it HTML, normal text, an .exe, etc.). You'll probably want to download pages based on the MIME type instead of a URL scheme. The MIME type for HTML is text/html, and you should check for that using the HTTP library you're using before downloading the rest of the content of a URL.
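For instance, a quick sketch of that check in C# with HttpWebRequest, using a HEAD request so only the headers come back:

using System;
using System.Net;

// Sketch: ask for the headers only, and crawl the body just when the
// server says the resource is HTML.
static bool IsHtml(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD"; // headers only, no body
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        // ContentType may carry a suffix, e.g. "text/html; charset=utf-8".
        return response.ContentType.StartsWith("text/html",
            StringComparison.OrdinalIgnoreCase);
    }
}

Some servers don't implement HEAD properly, so a common fallback is to issue a normal GET and stop reading once you've seen the Content-Type header.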
The Javascript problem
Same as above, except that running JavaScript in the crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need JavaScript?
A different solution
If you're willing to learn Python (or already know it) I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use and a lot of problems have already been solved so it could be a good starting point if you're trying to learn more about the technology.
Does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I try to reload it in a web browser. Right now my web crawler cannot download the contents of these pages. Please advise.
It would work the same way whether the page is dynamic or not. A crawler is really only a matter of 3 things:
The URL
The data it sends to the server, if it is a POST method
The cookie, if authentication is required
That's all.
The common problems when writing a crawler:
Guessing the default page wrong (index.html, index.php, default.aspx, etc.); actually the request will work without it for both methods (POST/GET)
A form field name that is not written exactly right
The ASP.NET ViewState hidden field (__VIEWSTATE); this one can be handled easily
Pages generated dynamically by JavaScript; this one is the hardest part, and in many cases even Google still has problems with it
Hope that helps; there's a short sketch putting the three pieces together below.
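Here is the promised sketch in C#; the URLs, field names, and credentials are made up. One method covers all three pieces: the URL, the POST body, and a shared cookie jar:

using System;
using System.IO;
using System.Net;
using System.Text;

// Sketch: fetch a page the way a browser would, sending a POST body
// when needed and carrying cookies across requests.
class CrawlerRequest
{
    static string FetchPage(string url, string postData, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies; // 3. the cookie, shared across requests

        if (postData != null) // 2. the data sent to the server
        {
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            byte[] body = Encoding.UTF8.GetBytes(postData);
            request.ContentLength = body.Length;
            using (Stream s = request.GetRequestStream())
                s.Write(body, 0, body.Length);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        CookieContainer cookies = new CookieContainer();
        // 1. the URL: log in first so the cookie jar is populated, then
        // request the page whose content varies per session.
        FetchPage("http://example.com/login.php", "user=me&pass=secret", cookies);
        Console.WriteLine(FetchPage("http://example.com/dynamic.php", null, cookies));
    }
}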
You might want to look at this question which details how to write a crawler or look at the source code for http://searcharoo.net/ which contains a good crawler (see here).
I have a page with a menu that uses jQuery AJAX calls to populate the page. To reflect any changes, I update the URL with a #... instead of ?... or /.... So a URL that originally reads http://localhost/pages/index/id=1 would look like http://localhost/#pages/index/id=1. If a user bookmarks this and later comes back to the page, I wonder if it's possible to use the second URL in my route decoding, or if I have to load it blank, then use the same JS/Ajax to populate the page?
In my mind it is problematic to use Ajax in these cases if a user copies the link and mails it to a friend with JavaScript disabled.
edit#1: Fixed some spelling.
edit#2: To clarify the question a bit: I want a site where I can do the following:
(a): with javascript turned on, use ajax calls to replace the content of a div (without reloading the page)
(b): with javascript turned on, bookmark the page as it is after the ajax call in (a)
(c): take the URL, send it to a person with noscript turned on, and have the same page as after the ajax call was made.
(a) and (b) work just fine on my page, but (c) is seemingly impossible.
Currently, the only portion of a URL you can update without causing the browser to redirect is the hash. This portion of the URL is not sent to the server in a request and is only available for client-side processing, so it cannot be used to provide a javascript-free way of providing a link.
The issue you are facing is a common one amongst those using AJAX. The best solution I've encountered is to provide a way to view any AJAX-loaded state of every page through a "true" URL, one that will be passed to the server.
This means you have one URL which provides a "snapshot" of a page's state:
http://localhost/pages/index/1/someaction
And an AJAX-specific URL which provides the local state of the page in the client's browser:
http://localhost/pages/index/1#someaction
What you then have to do is provide some means of generating the "snapshot" link to the page from the AJAX version. A "Link to this Page" or "Permanent Link" button is a reasonable option.
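On the server side, the snapshot URL can then be handled with ordinary path parsing. A rough sketch in ASP.NET code-behind, where the segment positions, the contentDiv control, and the Render* helpers are all hypothetical:

// For http://localhost/pages/index/1/someaction, pull the trailing
// action segment out of the path and render the same markup the AJAX
// call would have injected client-side.
protected void Page_Load(object sender, EventArgs e)
{
    string[] segments = Request.Path.Trim('/').Split('/');
    string action = segments.Length > 3 ? segments[3] : "default";

    switch (action)
    {
        case "someaction":
            contentDiv.InnerHtml = RenderSomeAction(); // hypothetical helper
            break;
        default:
            contentDiv.InnerHtml = RenderDefault();    // hypothetical helper
            break;
    }
}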
This is not possible simply because everything that comes after the # sign (the fragment identifier) is never sent to the server; there's no way for the server to ever capture this value, so there's no routing with it.
You could try replacing the '#' with a '?'. This will send the rest of it as a GET variable, so you may need to do some tweaks, such as changing the format to http://localhost/?pages=index&id=1
There are some fancy things you can set up with the web server so that localhost/article/fancystuff is redirected to localhost/article.php?title=fancystuff
There are a lot of ways of allowing an AJAX site to work with bookmarks and the back button. But you should ask yourself whether you really want people to do certain things. Generally, AJAX is used for more advanced web applications that do not map well to the traditional back-and-forth model.
EDIT
Given your additions to the question, I will say that since you want to fully support users who are wary of JavaScript, you will need to make your site work perfectly without any AJAX at all. But you should design it in such a way that the content of pages is included from separate files. This means that when you add in the additional JavaScript, it can load the file and place it more or less directly into the content holder on your page.
You do need to remember that you can't force someone to accept a bookmark or force a change to a bookmark. What you are after may be best served using cookies. Luckily, even fewer people are scared of cookies; hardly anyone disables them unless they are either paranoid or up to something.
I want to hide the guts of my URL programmatically.
I know I can use:
Server.Transfer("url",boolean)
This is not what I want in this case. I would like to be able to manipulate the URL after I get the variables I need.
How would I do this in ASP.NET?
Edit:
My URL:
URL.aspx?st=S&scannum=481854
I want to change it when the page loads to be just URL.aspx, but I need to first get the st and scannum values.
Have you seen this article that covers Url Rewriting in ASP.NET?
I recommend checking out ASP.NET MVC as well. MVC stands for Model View Controller. This framework will use a "controller" to route the end user to "views" that display your data (your "model"). MVC does all the routing for you based on the URL.
If you're passing in variables that you don't want displayed in the URL, why not use POST instead of GET?
You will have to provide more details on what your desired end result is. There are many options for manipulating the URL.
Using POST will allow you to transfer information between pages without littering your URL with extra values. Using encryption will not hide the extra parameters, but will make them unreadable. Using a URL Rewriter you can use regex to have the user enter one URL, but actually load another.
I have answered a question similar to this in the past. I say similar, because I am not sure what exactly you are looking for, but I feel the need to post a link to the other question to see if it will help:
ASP.NET - Building your own routing system
Look at ASP.NET Routing for new apps. Have you tried the HttpContext.RewritePath(String) method, or Cross-Page Posting in ASP.NET Web Pages?
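As a sketch of the cross-page posting option (the control and helper names are hypothetical): the source page posts via <asp:Button PostBackUrl="~/URL.aspx" ...>, so st and scannum travel in the POST body and never appear in the address bar:

// Code-behind of the target page, URL.aspx. The browser's address bar
// shows just /URL.aspx because the values arrived in the form body.
protected void Page_Load(object sender, EventArgs e)
{
    if (PreviousPage != null)
    {
        // stField and scannumField are hypothetical HiddenField controls
        // on the source page.
        HiddenField st      = (HiddenField)PreviousPage.FindControl("stField");
        HiddenField scannum = (HiddenField)PreviousPage.FindControl("scannumField");
        if (st != null && scannum != null)
        {
            LoadRecord(st.Value, scannum.Value); // hypothetical helper
        }
    }
}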
It's not possible to do what I want to do. I would want to change the appearance of my URL in JavaScript without refreshing. If this were possible, hackers would rule the world.