OK, I want to develop a scraping application to download specific text inside a div tag on a website. Let's take for example:
<div class="main_content">WOTEVER GOES IN HERE, GOES IN HERE</div>
How would I go about downloading the text
WOTEVER GOES IN HERE, GOES IN HERE
I understand I would need to use WebClient() with
.DownloadFile(sourceFileAddress, destinationFilePath);
Thank you.
HTTP requests work on a per-"resource" basis, and here the resource is a file: you can't download just some text from a page, you need to download the file and parse it.
If the file is, e.g., very big and you know the div is near the beginning, you might consider using TCP/IP sockets and handling the request and response manually (parsing on the fly), but I doubt that would give you any real benefit.
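For instance, here's a minimal sketch of the download-then-parse approach, using WebClient to fetch the page and the HtmlAgilityPack library (my assumption; any HTML parser will do) to pull the text out of the div. The URL is a placeholder:

using System;
using System.Net;
using HtmlAgilityPack; // third-party parser, installed via NuGet

// Download the whole page; HTTP gives us no smaller unit than the file.
var html = new WebClient().DownloadString("http://example.com/page.html");
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Select the div by its class and print just its inner text.
var div = doc.DocumentNode.SelectSingleNode("//div[@class='main_content']");
if (div != null)
    Console.WriteLine(div.InnerText); // WOTEVER GOES IN HERE, GOES IN HERE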
So, as for the input:
I am using C# and Selenium WebDriver to automate some verification on a website. The browser is IE9.
The steps that I am working on:
I have a table that was generated by an AJAX query. When I click the print button, it returns a file to download that can be printed.
The issue is that I need to catch the link to the file that is offered for download, and I have run out of ideas for how to do that.
So I would be grateful for any advice from skilled users =).
Updated 08/01/14:
Oops, sorry, I forgot to say that there is no link. Actually, the button click triggers either a JS or AJAX request that creates a document; only then is the link generated and the IE Open/Save dialog displayed.
Updated
Link HTML
<a id="ucRadGrid_lnkPrintPDF" onclick="ucRadGrid.print();" href="javascript:__doPostBack('ucRadGrid','PDF')">
What I do is never allow my WebDriver to manage a download. Instead, I use pure C# to download the file for me.
You can do this simply by finding the link's href attribute and downloading that. Here is the idea in code:
var href = driver.FindElement(By.Id("download_link")).GetAttribute("href");
DownloadFile(href, "my file.ext");
From there, you can do what you need to. Validate the text using pure C#, etc.
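One possible shape for that DownloadFile helper, assuming the file sits behind the browser's session (the cookie-copying is the key trick; the names here are illustrative, and driver is the IWebDriver from the snippet above):

using System.Linq;
using System.Net;

void DownloadFile(string url, string localPath)
{
    using (var client = new WebClient())
    {
        // Hand the WebDriver's session cookies to WebClient so the
        // server treats this request as the logged-in browser session.
        var cookies = driver.Manage().Cookies.AllCookies;
        client.Headers.Add(HttpRequestHeader.Cookie,
            string.Join("; ", cookies.Select(c => c.Name + "=" + c.Value)));
        client.DownloadFile(url, localPath);
    }
}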
EDIT After your comment below:
What you can do is find the URL of the resource you want to download. That might even require playing around with the JS on the page. Find the function that downloads the file, and either execute that code, or parse the function for the URL, and then call DownloadFile.
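For example, if the page's script ends up storing the generated document URL somewhere reachable, you can pull it out with Selenium's JavaScript executor. window.pdfUrl here is purely an assumption about where that URL might live; you'd have to dig through the page's JS to find the real spot:

// Run JS inside the page and bring the result back into C#.
var js = (IJavaScriptExecutor)driver;
var href = (string)js.ExecuteScript("return window.pdfUrl;");
DownloadFile(href, "report.pdf");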
I have some web pages with a lot of tags. I want to download only the part of the page source where the tag is a span and its class name is a specific value.
Is it possible to download just a part of the page (source code) and not the whole page?
I know that I can do it with a WebBrowser control (for example, navigate to my destination page, search for the specific tag, and get its source code).
But with that, I must first get the whole page and only afterwards extract the specific tag.
Is there any way (for example, with the WebClient class) to download just the source code of my specific tag with the specific class name?
No, the HTTP protocol doesn't have any facility to do what you need (the only thing one can do is request a certain Range, but that requires you to know exactly where the data is, so that doesn't help here); you will have to download the entire page and then parse out what you need.
I'm afraid you cannot download just parts of a page; you need to load the whole page first. To make it easier, though, you can parse the HTML into XML and then work with that, which is a lot simpler.
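For example, with the HtmlAgilityPack library (my assumption; it parses real-world HTML into a DOM you can query like XML), you still transfer the whole page but keep only the one tag. The URL and class name are placeholders:

using System;
using HtmlAgilityPack; // third-party parser, installed via NuGet

// The whole page still comes over the wire; we just discard most of it.
var doc = new HtmlWeb().Load("http://example.com/page.html");
var span = doc.DocumentNode.SelectSingleNode("//span[@class='Something']");
if (span != null)
    Console.WriteLine(span.OuterHtml); // just that tag's source code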
There is a website that keeps updating live information about bus timings in Helsinki.
I want to parse the live information from the website and display it on my WP7 phone. The user enters a bus stop number, and the WP7 app should show the buses/trams currently at that stop.
Is there any way I could obtain the real time information from the website?
If you look at the source of the website (http://www.omatlahdot.fi/omatlahdot/web?command=fullscreen&stop=1020455) -- in IE, right-click on the page and select View Source -- you'll see that there's really very little in the actual source file; in particular, none of the data is there. All of the hard work is done by the referenced JavaScript file scripts/fullscreen_header.js (full path: http://www.omatlahdot.fi/omatlahdot/scripts/fullscreen_header.js). You'll want to download that .js file and study how it retrieves data with AJAX calls. Start with the reloadPage function.
You can make those same calls (e.g., using WebClient) to retrieve the data into your application. If you want to extract the data from the returned HTML, I'd consider parsing it simply as a string, since I assume it has a very regular structure; dragging in a general-purpose HTML parser would probably be overkill.
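A rough sketch of what that looks like on WP7, where WebClient only offers the asynchronous, event-based calls. I'm reusing the page URL from above; the actual AJAX endpoint and parameters would come from studying the .js file, and the parsing step is left as a stub:

using System;
using System.Net;

var client = new WebClient();
client.DownloadStringCompleted += (s, e) =>
{
    if (e.Error == null)
    {
        string html = e.Result;
        // pull the departure rows out of the returned markup here
    }
};
client.DownloadStringAsync(new Uri(
    "http://www.omatlahdot.fi/omatlahdot/web?command=fullscreen&stop=1020455"));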
Alternatively, you might find out whether omatlahdot.fi provides the data as JSON or XML feeds, so you don't have to "screen-scrape" the HTML. I don't read Finnish, so I can't help you with that. Look around on their website (maybe a section called "dev" or "api") or send them an email inquiry.
Please let us know how it works out!
I want to download an image from a cartoon website, and my app is WinForms, not WebForms.
So let's say there is an image on a.html.
Normally, when I click from the previous page and am redirected to this page,
a placeholder image saying "image is loading", let's say A.jpg, appears in that block.
After 5 seconds, the real one, let's say B.jpg, is displayed.
So what I get is only the placeholder image rather than the one, B.jpg, that I actually want.
So... how should I do it?
Thanks in advance.
P.S. I posted this question more than 48 hours ago and have only received a few answers, none of which solve my problem.
I am wondering why only 2 people have posted answers.
Is my question not clear?
If so, please let me know.
Thanks
EDIT: Original answer removed since I misunderstood the question entirely.
What you want to do is basically HTML scraping: using the actual HTML of the page to discover where files are hosted, and then downloading them. Because I'm not sure whether there are any legal reasons that would prevent you from downloading the image files in this manner, I'm just going to outline an approach rather than provide working samples. In other words, use this information at your own risk.
Using Fiddler2 while browsing with Firefox, you should be able to find the domain and full URL that one of the images is downloaded from. Basically, just start Fiddler2, navigate to the site in Firefox, and then look for the biggest file that gets downloaded. That will tell you exactly where the image is coming from.
Next, take a look at the HTML source code of the page you are viewing. The way this particular site works, it looks like it hides the previous/next downloads in a SWF or something, but you can find the URLs in the page's JavaScript. Look for a JavaScript array called picArr.
To download these from a WinForms app, I would use the WebRequest class. Create a request for each image URL and save the response to disk.
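As a sketch, assuming imageUrl is one of the entries you pulled out of picArr and the output path is a placeholder:

using System.IO;
using System.Net;

// Request the image and stream the response bytes straight to disk.
var request = WebRequest.Create(imageUrl);
using (var response = request.GetResponse())
using (var input = response.GetResponseStream())
using (var output = File.Create(@"C:\images\B.jpg"))
{
    input.CopyTo(output); // Stream.CopyTo needs .NET 4+
}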
I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs such as "/extra/url/to/base.html" and "#" links), but I also need to handle PHP, JavaScript, etc. For some sites, the links are PHP pages, and when my web crawler tries to navigate to them, it fails. One example is a PHP/JavaScript accordion link page. How would I go about navigating/parsing these links?
Let's see if I understood your question correctly. I'm aware this answer may be inadequate, but if you need a more specific answer I'd need more details.
You're trying to write a web crawler, but it cannot crawl URLs that end in .php?
If that's the case, you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on a URI scheme.
In most cases these URLs return just normal HTML, but they could also be a generated image (like a CAPTCHA) or a download link for a 700 MB ISO file - and there's no way to be certain without checking the headers of the HTTP response from that URL.
Note: if you're writing your own crawler from scratch, you're going to need a good understanding of HTTP.
The first thing your crawler sees when it requests a URL is the response header, which contains a MIME content type - it tells a browser/crawler how to process and open the data (is it HTML, plain text, an .exe, etc.). You'll probably want to decide what to download based on the MIME type instead of the URL scheme. The MIME type for HTML is text/html, and you should check for that using your HTTP library before downloading the rest of the content of a URL.
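A rough sketch of that check, sending a HEAD request first so only HTML bodies are ever downloaded. Some servers mishandle HEAD, so treat this as a starting point; url is whatever the crawler dequeued:

using System.Net;

var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "HEAD"; // headers only, no body
using (var response = (HttpWebResponse)request.GetResponse())
{
    // e.g. "text/html; charset=utf-8" for a normal page
    if (response.ContentType.StartsWith("text/html"))
    {
        // safe to GET and parse this URL as a page
    }
}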
The Javascript problem
Same as above, except that running JavaScript in a crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need JavaScript?
A different solution
If you're willing to learn Python (or already know it), I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use, and a lot of problems have already been solved, so it could be a good starting point if you're trying to learn more about the technology.