Does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I reload it in a web browser. Right now my web crawler cannot download the contents of these pages. Please advise.
It works the same way whether the page is dynamic or not. A crawler is really only a matter of 3 things:
The URL
The data sent to the server, if it is a POST request
The cookie, if authentication is required
That's all.
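As a minimal sketch of those three pieces using HttpWebRequest (the URL, field names, and credentials below are placeholders, not from any real site):

using System.IO;
using System.Net;
using System.Text;

class CrawlerSketch
{
    static string FetchWithPost()
    {
        // Point 3: cookies collected here are sent back on later requests.
        var cookies = new CookieContainer();

        // Point 1: the URL. (Placeholder address.)
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/login.aspx");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Point 2: the POST data. (Placeholder field names.)
        byte[] postData = Encoding.UTF8.GetBytes("username=user&password=pass");
        request.ContentLength = postData.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(postData, 0, postData.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd(); // the raw HTML, ready for parsing
        }
    }
}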
Common problems when writing a crawler:
Wrongly guessing the default page (index.html, index.php, default.aspx, etc.); actually the request will work without it for both methods (POST/GET).
A form field name that is not written exactly as it appears on the page.
The ASP.NET ViewState hidden field (I forget the exact name), but this can be handled easily; see the sketch after this list.
Pages generated dynamically by JavaScript; this is the hardest part, and in many cases even Google still has problems with it.
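For the ViewState point, a rough sketch (assuming you already have the form page's HTML as a string) is to pull the hidden __VIEWSTATE and __EVENTVALIDATION values out and echo them back in your POST data:

using System.Text.RegularExpressions;

// Sketch: extract an ASP.NET hidden field (e.g. __VIEWSTATE) from the HTML of
// the form page so it can be included in the POST data of the next request.
static string GetHiddenField(string html, string fieldName)
{
    Match match = Regex.Match(html, "id=\"" + fieldName + "\" value=\"(?<val>[^\"]*)\"");
    return match.Success ? match.Groups["val"].Value : string.Empty;
}

// Usage:
// string viewState = GetHiddenField(html, "__VIEWSTATE");
// string eventValidation = GetHiddenField(html, "__EVENTVALIDATION");
// URL-encode both and append them to the POST body.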
Hope that helps.
You might want to look at this question which details how to write a crawler or look at the source code for http://searcharoo.net/ which contains a good crawler (see here).
Our site consists of 3 main pages we call "Start.aspx", each with a content iframe inside it where the user does nearly all of the site interaction.
Recently though, I've had to implement functionality that will jump between Start.aspx pages in different products and automatically change the content iframe to a specified page.
The actual functionality works just fine, but the issue we're having is that the full querystring is exposed. Because we load all pages in the content iframe, the page URL remains at "Product/Start.aspx" during regular site usage.
However, this new functionality is passing a querystring to Start.aspx (which has appropriate parsers to load the requested page in the content iframe), and we need that URL to remain as "Start.aspx".
So far, I've looked into URL Rewriting, which was throwing errors because the landing page for each product is "[Product]/Start.aspx". I've looked at a different URL Rewriting solution, as well as ScottGu's blog post on routing.
The issue is that these solutions seem to be intended for simplifying navigation, such as taking "Blogpost.aspx?Year=2013&Month=07&Day=15" and turning it into "Blogpost.aspx/2013/07/15", which really isn't what we're going for. We're not trying to simplify navigation via the URL; we're really just trying to completely hide our querystrings.
What we're going for is turning "[Product]/Start.aspx?frame=Company.aspx?id=1570" into "[Product]/Start.aspx" once the content iframe has what it needs from the initial querystring. We don't need to account for every single page; we just need that to be the overarching rule. 90% of the time it won't be an issue, as most of the work being done doesn't jump from product to product without the user just switching products (which is done in a fashion that specifically uses Response.Redirect("[Product]/Start.aspx")).
Once the content iframe has loaded from the querystring parameters, we don't need them anymore for anything. The rest of the functionality runs through the iframe without any issue.
Am I overthinking this, or am I asking for something that's not really feasible?
As far as literally "removing all of the query string characters" and still being able to pass the querystring values to another page, I do not think that is possible, unless you do it with a Session variable or something like that.
If you're simply worried about sensitive data being displayed in plain text in the query string, there is the option of "encrypting" the query string:
http://www.codeproject.com/Articles/33350/Encrypting-Query-Strings
The query string will still show, but it will look like "Product/Start.aspx?e0ayfefae0y0someencryptedmess108yfe0ayf0a". The page that receives the query string would decrypt it. So the functionality of the query string is there, but the values are not known to the end user.
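The linked article has its own implementation; as a rough illustration of the general idea only (the key handling here is simplified and this is not the article's code), you could encrypt the value with AES and Base64-encode it for the URL:

using System;
using System.Security.Cryptography;
using System.Text;

// Rough illustration only. In a real app the key and IV would come from
// configuration and be generated securely, not hard-coded.
static class QueryStringCrypto
{
    private static readonly byte[] Key = Encoding.UTF8.GetBytes("0123456789ABCDEF0123456789ABCDEF"); // 32 bytes = AES-256
    private static readonly byte[] IV  = Encoding.UTF8.GetBytes("0123456789ABCDEF");                 // 16 bytes

    public static string Encrypt(string plainText)
    {
        using (var aes = Aes.Create())
        using (var encryptor = aes.CreateEncryptor(Key, IV))
        {
            byte[] input = Encoding.UTF8.GetBytes(plainText);
            return Convert.ToBase64String(encryptor.TransformFinalBlock(input, 0, input.Length));
        }
    }

    public static string Decrypt(string cipherText)
    {
        using (var aes = Aes.Create())
        using (var decryptor = aes.CreateDecryptor(Key, IV))
        {
            byte[] input = Convert.FromBase64String(cipherText);
            return Encoding.UTF8.GetString(decryptor.TransformFinalBlock(input, 0, input.Length));
        }
    }
}

// e.g. Response.Redirect("Start.aspx?e=" + Server.UrlEncode(QueryStringCrypto.Encrypt("frame=Company.aspx?id=1570")));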
Since you've tagged this as an ASP.NET question, I'd say the way to go is to keep navigation data in your Session variables.
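A rough sketch of that approach (the Session key and control names are invented for illustration): stash the frame target in Session before redirecting, then read and clear it when Start.aspx loads, so the URL never carries a query string:

// Where the product jump happens (illustrative names only):
Session["FramePage"] = "Company.aspx?id=1570";
Response.Redirect("~/[Product]/Start.aspx");

// In Start.aspx.cs:
protected void Page_Load(object sender, EventArgs e)
{
    string framePage = Session["FramePage"] as string;
    if (!string.IsNullOrEmpty(framePage))
    {
        contentFrame.Attributes["src"] = framePage; // iframe marked runat="server"
        Session.Remove("FramePage");                // one-shot value
    }
}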
Can you use a POST instead of a GET? That way, the data is in the form, rather than the Query String.
As a side note, hiding the parameters as a way of making the URL look nicer and be bookmark-able is fine. If you're doing it for any kind of security reasons, it's very shallow security. It's trivial for a user to see what's being passed in both the form and on the query string and to change and repost those. Security needs to be handled primarily on the server side.
I have the following problem. My C# application makes use of a WebRequest and reads the HTML code of a URL in order to do a bunch of things later on. While everything works fine, there are some websites that, when visited, redirect you to another website (http://something.com/disclaimer, for example), and after you click yes you go back to the original website.
When I run my app it always only checks the html code of the disclaimer page and never gets to the actual page I asked for. I cannot really find a solution at this point since I can't find anything useful in the short html code of the disclaimer site (that comes before the one I want to check).
Any ideas on how I can skip that and get the code for the website I am actually interested in? Please note that I can't find any HTML redirection indication (META HTTP-EQUIV etc.) in either of the two websites.
Thank you
You can check the StatusCode to decide if you have been redirected. Status codes of 30x will tell you that you've been redirected, in which case you'll need to follow the link in the redirect.
http://msdn.microsoft.com/en-GB/library/system.net.httpwebresponse.statuscode.aspx
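A short sketch of that check with HttpWebRequest (the URL is a placeholder): turn off automatic redirects, look at the status code, and follow the Location header yourself:

using System;
using System.IO;
using System.Net;

// Sketch: detect a 30x response and follow its Location header manually.
static string FetchFollowingOneRedirect(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.AllowAutoRedirect = false; // so the 30x is visible instead of being followed silently

    using (var response = (HttpWebResponse)request.GetResponse())
    {
        int code = (int)response.StatusCode;
        if (code >= 300 && code < 400)
        {
            // Resolve the Location header against the original URL and request it.
            var target = new Uri(new Uri(url), response.Headers["Location"]);
            var follow = (HttpWebRequest)WebRequest.Create(target);
            using (var followResponse = (HttpWebResponse)follow.GetResponse())
            using (var reader = new StreamReader(followResponse.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }

        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}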
I believe you have to analyse the HTML code of such a page with a disclaimer to know which method is used there to make a redirect. Each case can be different: it can be either client-side or server-side redirection. If this disclaimer button submits a form, you will need to make a POST request imitating a button click.
I'm working on an existing large site that uses query string IDs for the different sections (representing physical stores) of the website.
I'd like to be able to implement pathinfo requests for SEO purposes so I'm looking at URLS like:
http://www.domain.com/cooking-classes.aspx?ID=5 (where 5 would be the ID of the local store)
Is there a way to make this type of URL work?
http://www.domain.com/cooking-classes.aspx?ID=5/chocolate ? I can get the content to work without the querystring, however the existing infrastructure needs the ID to run. I tried:
http://www.domain.com/cooking-classes.aspx/chocolate?ID=5, however the ID comes back incorrectly.
Using http://www.domain.com/cooking-classes.aspx/5/chocolate means a rewrite of the page-handling engine.
Am I clutching at straws here? No real way to get PathInfo and Querystring to play nicely with each other?
I'd like to stay away from any IIS mods as we don't have access.
Your last URL is going to yield the best result for search engines; however, you may want to drop the .aspx. You will need to write an HttpHandler or HttpModule to accomplish this. It's actually not as much work as it may seem, and you don't have to change your page at all. Your HttpHandler can do a behind-the-scenes redirect, preserving the URL. Check out this article on MSDN:
http://msdn.microsoft.com/en-us/library/ms972974.aspx
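As a rough sketch of that idea (the URL pattern and class name are invented for illustration), an HttpModule can catch a request like /cooking-classes.aspx/5/chocolate and rewrite it back to the query-string form your existing code expects, without touching IIS:

using System;
using System.Text.RegularExpressions;
using System.Web;

// Sketch of an HttpModule that maps /cooking-classes.aspx/5/chocolate to
// /cooking-classes.aspx?ID=5 behind the scenes, so the friendly URL stays in
// the address bar while the page still receives its ID.
public class StoreUrlModule : IHttpModule
{
    private static readonly Regex Pattern =
        new Regex(@"^/cooking-classes\.aspx/(?<id>\d+)/(?<slug>[^/]+)$", RegexOptions.IgnoreCase);

    public void Init(HttpApplication context)
    {
        context.BeginRequest += (sender, e) =>
        {
            var app = (HttpApplication)sender;
            Match m = Pattern.Match(app.Request.Path);
            if (m.Success)
            {
                app.Context.RewritePath("/cooking-classes.aspx", string.Empty, "ID=" + m.Groups["id"].Value);
            }
        };
    }

    public void Dispose() { }
}

// Register it in web.config under <system.web><httpModules> (classic pipeline)
// or <system.webServer><modules> (integrated pipeline).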
If you don't need anything super specific, you could use an existing HttpModule like the one mentioned in the post on ScottGu's blog:
http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
He mentions UrlRewriter.net which is open source:
http://urlrewriter.net/
I need to write some C# code for grabbing the contents of a web page. The steps look like the following:
Browse to the login page
I have a user name and a password; provide them programmatically and log in
Then you are on the detail page
You have to get some information there, like product ID, description, etc.
Then you need to click (by code) on Detail View
Then you can get the price for that product from there
Once that is done, we can write a detail line into a text file like this:
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK, I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
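A minimal sketch of the WatiN approach (the URL, element names, and credentials are placeholders you'd replace with whatever the real pages use):

using System;
using WatiN.Core;

// Sketch only: drives a real browser through the login and detail pages.
// Every URL, element locator, and field name here is an assumption.
class ProductScraper
{
    [STAThread]
    static void Main()
    {
        using (var browser = new IE("http://example.com/login.aspx"))
        {
            browser.TextField(Find.ByName("username")).TypeText("myUser");
            browser.TextField(Find.ByName("password")).TypeText("myPassword");
            browser.Button(Find.ByName("loginButton")).Click();

            // Now on the detail page: read a field, click through to Detail View, read the price.
            string productId = browser.Span(Find.ById("productId")).Text;
            browser.Link(Find.ByText("Detail View")).Click();
            string price = browser.Span(Find.ById("price")).Text;

            System.IO.File.AppendAllText("products.txt",
                productId + "::" + price + Environment.NewLine);
        }
    }
}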
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
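A minimal sketch of that combination (the URL and the XPath expression are placeholders for whatever identifies the data on the real page):

using System.IO;
using System.Net;
using HtmlAgilityPack;

// Sketch: fetch the page with WebRequest and pick an element out with the HTML Agility Pack.
class PriceGrabber
{
    static string GetPrice(string url)
    {
        string html;
        var request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumes the price is rendered as something like <span id="price">285.00</span>.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@id='price']");
        return node != null ? node.InnerText.Trim() : null;
    }
}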
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with that library: the site I want to get data from has a "captcha" on the login page.
I can enter that value myself if the code can show the image and wait for my input.
Can we achieve that with this library? If so, I'd like to see a sample.
You should be able to achieve this by using two classes in C#, HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text wherein patterns exist.
How to: Send Data Using the WebRequest Class
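As a rough sketch of the regular-expression route (the pattern below is purely illustrative), once the HTML has been downloaded with HttpWebRequest you can pull out a value whose surrounding markup follows a known shape:

using System.Text.RegularExpressions;

// Sketch: assumes the price appears in markup like <span id="price">285.00</span>.
static string ExtractPrice(string html)
{
    Match m = Regex.Match(html, @"<span\s+id=""price""[^>]*>(?<price>[\d.]+)</span>",
                          RegexOptions.IgnoreCase);
    return m.Success ? m.Groups["price"].Value : null;
}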
I have a page with a menu that uses jQuery AJAX calls to populate the page. To reflect any changes, I update the URL with a #... instead of ?... or /... So a URL that originally reads http://localhost/pages/index/id=1 would look like http://localhost/#pages/index/id=1. If a user bookmarks this and later comes back to the page, I wonder if it's possible to use the second URL in my route decoding, or if I have to load it blank and then use the same JS/AJAX to populate the page?
In my mind it is problematic to use Ajax in these cases if a user copies the link and mails it to a friend with JavaScript disabled.
edit#1: Fixed some spelling.
edit#2: To clarify the question a bit: I want a site where I can do the following:
(a): with javascript turned on, use ajax calls to replace the content of a div (without reloading the page)
(b): with javascript turned on, bookmark the page as it is after the ajax call in (a)
(c): take the URL, send it to a person with noscript turned on, and have the same page as after the ajax call was made.
(a) and (b) work just fine on my page, but (c) is seemingly impossible.
Currently, the only portion of a URL you can update without causing the browser to redirect is the hash. This portion of the URL is not sent to the server in a request and is only available for client-side processing, so it cannot be used to provide a javascript-free way of providing a link.
The issue you are facing is a common one amongst those using AJAX. The best solution I've encountered is to provide a way to view any AJAX-loaded state of every page through a "true" URL, one that will be passed to the server.
This means you have one URL which provides a "snapshot" of a page's state:
http://localhost/pages/index/1/someaction
And an AJAX-specific URL which provides the local state of the page in the client's browser:
http://localhost/pages/index/1#someaction
What you then have to do is provide some means of generating the "snapshot" link to the page from the AJAX version. A "Link to this Page" or "Permanent Link" button is a reasonable option.
This is not possible, simply because everything that comes after the # sign (the fragment identifier) is never sent to the server; there's no way for the server to ever capture this value, so no routing with it.
You could try replacing the '#' with a '?'. This will send the rest of it as a GET variable, so you may need to do some tweaks, such as changing the format to http://localhost/?pages=index&id=1
There are some fancy things you can set up with the web server so that localhost/article/fancystuff is redirected to localhost/article.php?title=fancystuff
There are a lot of ways of allowing an AJAX site to work with bookmarks and the back button. But you should ask yourself whether you want people to do certain things. Generally, AJAX is used for more advanced web applications that do not map well to the traditional back-and-forth model.
EDIT
Given your additions to the question, I will say that since you want to fully support users who are wary of JavaScript, you will need to make your site work perfectly without any AJAX at all. But you should design it in such a way that the content of pages is included from separate files. This means that when you add in the additional JavaScript, it can load the file and place it more or less directly into the content holder on your page.
You do need to remember that you can't force someone to accept a bookmark or force a change to a bookmark. What you are after may be best served using cookies. Luckily, even fewer people are scared of cookies; hardly anyone disables them unless they are either paranoid or up to something.