I need to scrape a remote HTML page looking for images and links. I need to find the image that is "most likely" the product image on the page, and the links that are "near" that image. I currently do this with a JavaScript bookmarklet, which gives me the rendered x/y coordinates of images and links so I can decide which ones I want.
What I want is the ability to get this information from just a URL, without the bookmarklet. The issue is that if I fetch the HTML on the server with something like HttpWebRequest, I will not have location values, since the page was never rendered in a browser. I need the locations of images and links to determine which ones I want.
So how can I get HTML from a remote site on the server AND use the rendered location values of the DOM elements to help me locate images and links?
As you indicate, doing this purely through inspection of the HTML is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this.
You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.
You can download it from http://htmlagilitypack.codeplex.com/
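As a rough, language-neutral illustration of the "download, then parse" approach, here is a minimal sketch in Python using only the standard library (in C# the parsing step would be done with HtmlAgilityPack instead). The names `ImageLinkCollector` and `collect` are made up for this example. Note that it only sees the markup: it gives document order, not rendered x/y positions, so "near" can at best be approximated by how close two elements are in the source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collects absolute image and link URLs in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def collect(html, base_url):
    """Return (images, links) found in an already-downloaded HTML string."""
    parser = ImageLinkCollector(base_url)
    parser.feed(html)
    return parser.images, parser.links
```

Calling `collect(html, page_url)` on the downloaded page returns two lists of absolute URLs that can then be filtered with whatever heuristics you use today.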
I have images stored in a SQL Server database with datatype image. I want to retrieve them, convert them to bitmaps, and use them to create an ASP.NET Web Forms image gallery for an online shopping web site.
Should I use an <asp:Repeater> control, an <asp:GridView>, or a DataList control?
I don't want to use an image path stored in the database.
It's sad that the experts here are asking you questions as if you already knew how to solve your problem, when it is clear from your question that you don't. Let me try to give you a little background and direction, and I think you will be able to get closer to solving what you want to do.
Your images are just blobs in your SQL Server database, which has no direct connection to the web. The only way to show an image is to put it into an img tag whose src= points at a location on your web server. So the location you choose on the web server must, instead of reading a file from the server's file system, grab the image from the database and stream those image bytes back to the browser for the img tag on the page.
There are multiple ways to do that in ASP.NET. The easiest is a handler or .ashx file (I don't even know if those are supported anymore).
At any rate, here is a link that might help. You might also try googling something like "display image from sql server on asp.net" and see what else comes up. Obviously, lots of people do this, and you will too soon.
Good Luck.
https://www.aspsnippets.com/Articles/Display-image-from-database-in-Image-control-without-using-Generic-Handler-in-ASPNet.aspx
I have a .NET Server Control app that simply returns some HTML. I also need to embed several picture files into the assembly so that the HTML file can use them as its src= for each of them.
We will simply have a .HTML file that lives in the project as an embedded resource and the server control code will read this html and serve it up. Within THAT html, we will need to have all the picture src links (as well as CSS, js, etc) to point back to embedded resource files.
Does anyone know what code I would put in the HTML file for the pictures to make it point back to the embedded picture file?
I have to do this on a grand scale... hundreds of times. I really would like a programmatic approach to doing this so I can write a wrapper and never have to touch it again when we update the server control with new html, picture files, etc.
One might imagine a way to do this at compile time where I can loop through the embedded files with GetManifestResourceNames and then replace() the src links with the HTTP resource links I suppose?
Thank you for any guidance!
Hm, your question covers quite a few aspects. Let me repeat it to see if I got it: you have an assembly with a raw HTML file in it. This file references some items which are to be found within this same assembly, and you want to have them served to the client upon request as well.
One possible solution might be this.
Instead of a raw HTML file, use a templated one. Then feed all available resource names as proper URLs into the templating engine to replace the placeholders. You may want to look at DotLiquid for this.
Create an HTTP handler for each file type you want to serve. Inside the handler, you pull the item from the resources of the DLL and serve it.
Alternatively, if those resources are rather small, you may want to have a look at the data URI scheme, to save the extra requests and omit the handler. With this you could replace the placeholders with the data URIs directly, and serve a single HTML file with everything in it in the first place.
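The data URI variant can be sketched roughly as follows, in Python for brevity. This is only an illustration: the `{{name}}` placeholder syntax is a plain string replacement invented for this example, not DotLiquid itself, and `to_data_uri` and `fill_placeholders` are hypothetical helper names. In the real control, the bytes would come from the assembly's embedded resources.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def fill_placeholders(template: str, resources: dict) -> str:
    """Replace each {{name}} placeholder (hypothetical syntax) with a data URI.

    `resources` maps a resource name to a (bytes, mime_type) pair.
    """
    html = template
    for name, (data, mime_type) in resources.items():
        html = html.replace("{{" + name + "}}", to_data_uri(data, mime_type))
    return html
```

With a template line such as `<img src="{{logo.png}}">`, the output is a single HTML document that needs no further requests. The trade-off is size: base64 inflates the payload by about a third, so this suits small resources only.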
Another choice is to have your .NET Server Control app check for optional GET arguments and return the image instead of the HTML.
Your original HTML request might be a simple:
GET netServerApp
Which returns the HTML with normal embedded links.
The HTML image links in the HTML might look like this:
<img src="netServerApp?src=Image1.svg">
or the like. Your server app would then return the appropriate image, instead of the HTML.
It means several round trips to get everything, but that is normal for HTML anyway.
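The dispatch logic described above might look roughly like this. It is a sketch in Python rather than server control code; `RESOURCES`, `PAGE`, and `handle_get` are names invented for the example, and the in-memory dictionary stands in for the assembly's embedded resources.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for embedded resources: name -> (MIME type, bytes).
RESOURCES = {
    "Image1.svg": ("image/svg+xml", b"<svg xmlns='http://www.w3.org/2000/svg'/>"),
}
PAGE = b'<img src="netServerApp?src=Image1.svg">'


def handle_get(url):
    """Return (status, content_type, body) for a GET request URL."""
    query = parse_qs(urlparse(url).query)
    name = query.get("src", [None])[0]
    if name is None:
        # No src argument: serve the HTML page itself.
        return 200, "text/html", PAGE
    if name not in RESOURCES:
        # Only known resource names are served.
        return 404, "text/plain", b"not found"
    mime_type, body = RESOURCES[name]
    return 200, mime_type, body
```

Looking the name up in a fixed table, rather than using it to build a file or resource path, keeps callers from requesting arbitrary content through the `src` argument.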
I am trying to extract images and some text from the following site: http://bit.ly/16jFeyA
Web Forms, C#, Visual Studio, HtmlAgilityPack
Encoding works correctly only with WebClient; setting wb.Document.Encoding = "GB2312"; on the WebBrowser doesn't work, but that's not important.
The site uses lazy loading for images. The WebBrowser loads the page properly, with the images and their info, but when I extract the HTML using either WebClient or wb.DocumentText, I don't get the full information. Some of it is missing, especially the image links.
Is there any way around this? I am trying to extract images and product info.
Extracted using wb.DocumentText after scrolling down to force image to load(due to lazy load) - http://notepad.cc/share/EjW3tFCffO
wb = webBrowser
Thanks in advance!
You need to use something which knows how to evaluate and execute client-side JavaScript, such as a headless browser. PhantomJS should suffice.
I'm using ExpertPDF to convert a couple of web pages to PDF, and there's one that I'm having difficulties with. This page only renders content when info is POSTed to it, and the content is text and a PNG graph (the graph is the most important piece).
I tried creating a page with a form that auto-submits in the body onload='' event. If I go to this page, it auto-posts to the 3rd party page and I get the page as I expect. But it appears ExpertPDF won't take a 'snapshot' if the page is redirected.
I tried using HttpWebRequest/HttpWebResponse and WebClient, but have only been able to retrieve the HTML, which doesn't include the PNG graph.
Any idea how I can create a MemoryStream that includes the HTML AND the PNG graph, or post to the page and then somehow send ExpertPDF to that URL to take a snapshot of the posted results?
Help is greatly appreciated. I've spent too much time on this one *sniff*.
Thanks!
In HTML/HTTP the web page (the HTML) is a separate resource from any images it includes. So you would need to parse the HTML and find the URL that points to your graph, and then make a second request to that URL to get the image. (This is unless the page spits the image out inline, which is pretty rare, and if that were the case you probably wouldn't be asking.)
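To make the "parse, then make a second request" step concrete, here is a rough Python sketch that finds each image URL, fetches it through a caller-supplied function, and inlines the bytes as a data URI so that the HTML string carries its own images. Several things here are assumptions: the `inline_images` name, the `fetch` callback (which returns a `(bytes, mime_type)` pair and would wrap a second WebClient-style request), and the simple regular expression, which only handles double-quoted `src` attributes. Whether a particular PDF converter renders data URIs is something you would need to check.

```python
import base64
import re
from urllib.parse import urljoin

# Matches <img ... src="..."> with a double-quoted src attribute only.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, page_url, fetch):
    """Replace each img src with a data URI built from the fetched bytes.

    `fetch(url)` is supplied by the caller and returns (bytes, mime_type).
    """
    def replace(match):
        # The image is a separate resource: resolve its URL, then fetch it.
        absolute = urljoin(page_url, match.group(2))
        data, mime_type = fetch(absolute)
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime_type};base64,{encoded}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

If the graph is only available within the session that made the POST, the `fetch` function would have to reuse that session's cookies.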
From a quick look at ExpertPDF's FAQ page, there's a question that deals specifically with your problem. I would recommend you take a look at that.
** UPDATE **
Take a look at the second FAQ question:
Q: When I convert a HTML string to PDF, the external CSS files and images are not applied in the rendered PDF document.
You can take the original (single) response from your WebClient and convert that into a string and pass that string to ExpertPDF based on the answer to that question.
I have HTML content coming from a database that gets rendered to a web page (like a CMS). In the HTML content, I want to allow ~/ in the paths of images and links and have the system resolve those URLs (as ResolveUrl does). So what is an efficient method to do this? I will be using this C# across Web Forms and MVC. Thanks for any help or advice.
One option might be to use the Html Agility Pack to parse the HTML and extract the URLs, then resolve them and update the HTML with the resolved URLs.
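For comparison, the rewrite itself can be sketched without a full parser. The Python below uses a regular expression instead of the Html Agility Pack, which is simpler but less robust against unusual markup. `resolve_app_relative` and its `app_root` parameter (the application's virtual root, such as `/myapp`) are hypothetical names for this example.

```python
import re

# Matches src= or href= followed by an opening quote and the "~/" prefix.
APP_RELATIVE = re.compile(r'((?:src|href)\s*=\s*["\'])~/', re.IGNORECASE)


def resolve_app_relative(html, app_root):
    """Rewrite ~/ at the start of src/href values to the application root."""
    root = "/" + app_root.strip("/")
    if not root.endswith("/"):
        root += "/"
    # A function replacement avoids escape handling in the root string.
    return APP_RELATIVE.sub(lambda m: m.group(1) + root, html)
```

Because the pattern only fires on `src` and `href` attributes that begin with `~/`, other occurrences of `~/` in the body text are left alone.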