I am trying to extract images and some text off the following site http://bit.ly/16jFeyA
Web Forms, C#, Visual Studio, HtmlAgilityPack
Encoding works fine with WebClient only; setting wb.Document.Encoding = "GB2312" on the browser doesn't work, but that's not important.
The site uses lazy loading for its images. The WebBrowser loads the page properly, with the images and their info, but when I extract using either WebClient or wb.DocumentText, it does not capture the full information; some of it is missing, especially the image links.
Is there any way around this? I am trying to extract images and product info.
Extracted using wb.DocumentText after scrolling down to force the images to load (due to lazy loading) - http://notepad.cc/share/EjW3tFCffO
wb = webBrowser
Thanks in advance!
You need to use something which knows how to evaluate and execute client-side JavaScript, such as a headless browser. PhantomJS should suffice.
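If you want to stay in C#, one way to wire that in is to shell out to the PhantomJS executable, let it render the page (including the lazy-loaded images), dump the rendered HTML to standard output, and then parse that with HtmlAgilityPack. Here is a minimal sketch of the idea; save_rendered.js is a hypothetical PhantomJS script you would write to load the URL passed to it, wait for the lazy loading to fire, and print page.content, and the URL and executable path are placeholders, not taken from your question.

```csharp
using System;
using System.Diagnostics;
using HtmlAgilityPack;

class PhantomScrapeSketch
{
    static void Main()
    {
        // Assumption: save_rendered.js is your own PhantomJS script that loads the
        // URL passed as an argument, waits for lazy-loaded content, then writes
        // page.content to stdout. phantomjs.exe is assumed to be on the PATH.
        var psi = new ProcessStartInfo
        {
            FileName = "phantomjs.exe",
            Arguments = "save_rendered.js http://example.com/product-page",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        string renderedHtml;
        using (var phantom = Process.Start(psi))
        {
            renderedHtml = phantom.StandardOutput.ReadToEnd();
            phantom.WaitForExit();
        }

        // Parse the fully rendered markup, which now contains the real image URLs.
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            foreach (var img in images)
                Console.WriteLine(img.GetAttributeValue("src", string.Empty));
        }
    }
}
```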
Related
Currently, I have an aspx page in VB that launches a RadHTMLChart, and I want to grab the SVG code of that chart. However, since the chart is rendered client-side, I have to launch this aspx page and then grab the SVG code from a second aspx page during postback. Currently, I am using Server.Execute("firstpage.aspx") to grab the SVG code, but this does not work. I want to use the SVG to generate a PDF document, but the Server.Execute command seems to run in the background and the code that comes after it does not wait for it to finish first, hence I am not grabbing the SVG content. Does anyone know of another way to grab this SVG content?
It seems you need all of this done on the server, so you need to find a way to launch a browser on the server, get the needed data from it, and close it. I think a tool called PhantomJS can do this for you, so you can give it a crack (it's free, I think).
Here is an example of exporting an HtmlChart, but it relies on user interaction (which can be automated via some scripts) and it needs the page opened on the client machine: http://www.telerik.com/support/code-library/exporting-radhtmlchart-to-png-and-pdf. Anyway, it may still be helpful for ideas and to show how to get the SVG string.
I am currently building a little application based on WatiN that logs in to a website and then goes through a series of URLs to download PDF files.
The website uses a lot of JavaScript to load the PDFs embedded in the HTML.
The program works fine for now but is very slow, since WatiN doesn't handle downloads very efficiently (it uses Firefox's download system and slowly types the filename before saving).
I would like to know if there is a better framework for web scraping that provides the same support for Ajax sites but a better/faster way to download files.
I've looked all around the web and found out about Selenium, but it doesn't present itself as more efficient than WatiN when it comes to file downloading.
Thanks in advance for your help.
You could write a Google Chrome extension using these two APIs as the main engine:
https://developer.chrome.com/extensions/webRequest.html
to know when and how to authenticate and when to start the download, and:
https://developer.chrome.com/extensions/downloads.html
to start the download of the file.
Whatever is missing from these two APIs for you to achieve your goal, you can compensate for with a custom content script - a JavaScript file that is injected into the page opened by the extension - and, for example, hook into jQuery's .ready event to initialize the scraping.
This will definitely be faster than WatiN, since WatiN adds a layer of abstraction on top of talking to the browser directly.
This is what I have done:
I have loaded a PDF file in the web browser.
Now I want to select text from that file and paste it into a text box.
Can anyone help me?
I'm pretty sure that this is going to be prohibitively difficult, if not impossible, to do.
The browser does not 'run' the PDF; it acts as a host for the PDF application, which ends up sharing its main window. After that, control of the cursor etc. passes to the PDF application, and the browser is effectively no longer aware of what happens inside it. If the PDF application being used exposes COM interfaces for manipulating the cursor/text selection (doubtful), then it's possible to script against those interfaces from client script - but you won't be able to actually run any script in that window because the browser is showing a PDF, not a web page.
It might be possible if you hosted the web control in a Windows Forms application, but even so, I wouldn't know where to start on that one.
If your goal is to extract text from the PDF, then you're probably better off pushing it through a .NET PDF library. A quick Google search will yield some suitable libraries.
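For example, if you save the PDF to disk first (with WebClient or similar), a library such as iTextSharp can extract the text without involving the browser at all. A minimal sketch, assuming iTextSharp 5.x and a hypothetical local file path:

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class PdfTextSketch
{
    static void Main()
    {
        // Hypothetical path: the PDF is assumed to have been downloaded already.
        const string path = @"C:\temp\downloaded.pdf";

        var text = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            // Walk every page and collect its plain text.
            for (int page = 1; page <= reader.NumberOfPages; page++)
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
        }
        finally
        {
            reader.Close();
        }

        // In a WinForms app you could then do: textBox1.Text = text.ToString();
        Console.WriteLine(text.ToString());
    }
}
```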
If your PDF file has form elements, then the file can be submitted to a URL.
Check this link; it might help:
Can a PDF fillable form post itself to an HTTPS URL?
Earlier, I used System.Diagnostics.ProcessStartInfo to pass a website URL and open it in Internet Explorer.
Now, I have the code of an HTML page in a database. I am working on a Windows application, and I need to render that code in a browser when a button in the application is clicked. What is the best .NET library to perform this task?
I looked at the Process.Start() function, but it takes an HTML file name. In my situation, I don't have an HTML file.
Have a look at embedding the WebBrowser control into your application.
You can call the NavigateToString method and pass the HTML source from your database as a string for it to render.
Since you're using WPF, there's a nice guide on how to integrate a WebBrowser control into your application.
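A minimal sketch of that idea in WPF code-behind; the browser element name and the LoadHtmlFromDatabase helper are hypothetical placeholders for your own XAML and data-access code:

```csharp
using System.Windows;

public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();

        // Hypothetical helper standing in for your real database call.
        string html = LoadHtmlFromDatabase();

        // Assumes the XAML contains <WebBrowser x:Name="browser" />.
        // NavigateToString renders the markup directly; no temporary file is needed.
        browser.NavigateToString(html);
    }

    private string LoadHtmlFromDatabase()
    {
        // Placeholder content; replace with your actual data access.
        return "<html><body><h1>Hello from the database</h1></body></html>";
    }
}
```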
I need to scrape a remote html page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a javascript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are the ones that I want.
What I want is the ability to get this information by just using a URL and not the bookmarklet. The issue is that by using the URL and trying something like HttpWebRequest and getting the HTML on the server, I will not have the location values, since the page wasn't rendered in a browser. I need the locations of images and links to help me determine which images and links I want.
So how can I get html from a remote site on the server AND use the rendered location values of the dom elements to help me locate images and links?
As you indicate, doing this purely through inspection of the html is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.
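If you do go the WebBrowser route, a rough sketch of getting rendered positions looks like the following. It assumes a WinForms WebBrowser driven from an STA console app and a placeholder URL, and it walks each image's OffsetParent chain to approximate its absolute position on the page:

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

class RenderedPositionSketch
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (s, e) =>
        {
            foreach (HtmlElement img in browser.Document.GetElementsByTagName("img"))
            {
                // Walk the OffsetParent chain to turn the element-relative
                // rectangle into page-absolute coordinates.
                Rectangle rect = img.OffsetRectangle;
                var position = new Point(rect.Left, rect.Top);
                for (var parent = img.OffsetParent; parent != null; parent = parent.OffsetParent)
                    position.Offset(parent.OffsetRectangle.Left, parent.OffsetRectangle.Top);

                Console.WriteLine("{0} at ({1},{2}) size {3}x{4}",
                    img.GetAttribute("src"), position.X, position.Y, rect.Width, rect.Height);
            }
            Application.ExitThread();
        };

        // Placeholder URL, not taken from the question.
        browser.Navigate("http://example.com/product-page");
        Application.Run(); // pump messages so the control can load and render
    }
}
```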
You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.
You can download it from http://htmlagilitypack.codeplex.com/
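A minimal sketch of that combination, with a placeholder URL; keep in mind this only gives you the raw markup, not rendered coordinates, as the other answer points out:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class ScrapeSketch
{
    static void Main()
    {
        // Placeholder URL, not taken from the question.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/product-page");
        request.UserAgent = "Mozilla/5.0"; // some sites refuse requests without a user agent

        string html;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Image sources.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
            foreach (var img in images)
                Console.WriteLine("img:  " + img.GetAttributeValue("src", string.Empty));

        // Link targets.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (var a in links)
                Console.WriteLine("link: " + a.GetAttributeValue("href", string.Empty));
    }
}
```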