I have HTML content coming from a database that gets rendered to a web page (like a CMS). In the HTML content, I want to allow ~/ in the paths of images and links so the system can resolve them with ResolveUrl. What is an efficient method to do this? I will be using this C# across Web Forms and MVC. Thanks for any help or advice.
One option might be to use the Html Agility Pack to parse the HTML, extract the URLs, resolve them, and update the HTML with the resolved values.
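A minimal sketch of that idea, assuming the HTML arrives as a string and you are running under System.Web (so VirtualPathUtility is available); the class and method names are just placeholders:

using System;
using System.Web;
using HtmlAgilityPack;

public static class TildeUrlResolver
{
    // Replaces ~/ paths in img src and a href attributes with application-absolute URLs.
    public static string ResolveTildeUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//img[@src] | //a[@href]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            string attr = node.Name.Equals("img", StringComparison.OrdinalIgnoreCase) ? "src" : "href";
            string value = node.GetAttributeValue(attr, "");
            if (value.StartsWith("~/"))
            {
                // VirtualPathUtility works in both Web Forms and MVC.
                node.SetAttributeValue(attr, VirtualPathUtility.ToAbsolute(value));
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}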
Specs: Win7 64, VS 2010, .NET 4.0, NCrawler library
I'm writing a crawler that will extract some data from an online shop. The application extracts the URLs fine, and I can navigate to every item in the shop properly. The problem is that every "propertyBag" object that keeps all the page data of the product is in text form. I was wondering if there is a way to read the contents of a specific tag like <description>Text</description> from this "propertyBag", or if there is another way to do it.
Thanks.
You need an HTML parser like HtmlAgilityPack (http://htmlagilitypack.codeplex.com/) to extract the required information.
But I would recommend using Abot (https://code.google.com/p/abot/) as the web crawler. It is an actively developed, free, open-source web crawler written in C#.
Abot has built-in HTML parsers like HtmlAgilityPack (extract elements via XPath) and CsQuery (extract elements via CSS selectors).
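Whichever crawler you use, the pattern is the same: get the raw page HTML as a string and query it with the parser. A rough sketch with HtmlAgilityPack, assuming pageHtml holds the downloaded HTML (how you obtain it from the property bag depends on your crawler setup), and assuming the shop really does emit a <description> element; adjust the XPath to the actual markup:

using HtmlAgilityPack;

public static class ProductPageReader
{
    public static string GetDescription(string pageHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        // Finds a <description> element anywhere in the document.
        var node = doc.DocumentNode.SelectSingleNode("//description");
        return node != null ? node.InnerText : null;
    }
}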
I have a .NET Server Control app that simply returns some HTML. I also need to embed several picture files into the assembly so that the HTML file can use them as the src for each image.
We will simply have an .HTML file that lives in the project as an embedded resource, and the server control code will read this HTML and serve it up. Within THAT HTML, all the picture src links (as well as CSS, js, etc.) will need to point back to embedded resource files.
Does anyone know what code I would put in the HTML file for the pictures to make it point back to the embedded picture file?
I have to do this on a grand scale... hundreds of times. I really would like a programmatic approach to doing this so I can write a wrapper and never have to touch it again when we update the server control with new html, picture files, etc.
One might imagine a way to do this at compile time where I can loop through the embedded files with GetManifestResourceNames and then replace() the src links with the HTTP resource links I suppose?
Thank you for any guidance!
Hm, your question covers quite a few aspects. Let me repeat it to see if I got it right: you have an assembly with a raw HTML file in it. This file references some items, which are found within the same assembly, and you want to have them served to the client upon request as well.
One possible solution might be this.
Instead of a raw HTML file, use a templated one. Then feed all available resource names as proper URLs into the templating engine to replace the placeholders. You may want to look at DotLiquid for this.
Create an HTTP handler for each file type you want to serve. Inside the handler, pull the item from the resources of the DLL and serve it.
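A minimal sketch of such a handler, assuming the images are embedded in the same assembly and the resource name is passed in the query string (the handler name and content type are placeholders):

using System.Reflection;
using System.Web;

public class EmbeddedImageHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // e.g. /EmbeddedImage.ashx?name=MyAssembly.Images.Logo.png
        string name = context.Request.QueryString["name"];
        Assembly asm = typeof(EmbeddedImageHandler).Assembly;

        using (var stream = asm.GetManifestResourceStream(name))
        {
            if (stream == null)
            {
                context.Response.StatusCode = 404;
                return;
            }
            context.Response.ContentType = "image/png"; // pick per file extension in real code
            stream.CopyTo(context.Response.OutputStream);
        }
    }
}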
Alternatively, if those resources are rather small, you may want to have a look at the data URI scheme, to save the extra requests and omit the handler. With this you could replace the placeholders with the data URIs directly, and serve a single HTML file with everything in it in the first place.
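The data URI variant can be done in one pass over the manifest resources. A rough sketch; the {{ResourceName}} placeholder convention and the PNG-only filter are assumptions:

using System;
using System.IO;
using System.Reflection;

public static class EmbeddedHtmlBuilder
{
    // Replaces {{Resource.Name}} placeholders in the HTML template with data: URIs.
    public static string BuildHtml(string htmlTemplate)
    {
        Assembly asm = typeof(EmbeddedHtmlBuilder).Assembly;

        foreach (string resourceName in asm.GetManifestResourceNames())
        {
            if (!resourceName.EndsWith(".png", StringComparison.OrdinalIgnoreCase))
                continue; // map other extensions to their MIME types as needed

            using (var stream = asm.GetManifestResourceStream(resourceName))
            using (var ms = new MemoryStream())
            {
                stream.CopyTo(ms);
                string dataUri = "data:image/png;base64," + Convert.ToBase64String(ms.ToArray());
                htmlTemplate = htmlTemplate.Replace("{{" + resourceName + "}}", dataUri);
            }
        }
        return htmlTemplate;
    }
}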
Another choice is to have your .NET Server Control app check for optional GET arguments and return the image instead of the HTML.
Your original HTML request might be a simple:
GET netServerApp
Which returns the HTML with normal embedded links.
The image links in the HTML might look like this:
<img src="netServerApp?src=Image1.svg">
or the like. Your server app would then return the appropriate image, instead of the HTML.
It means several round trips to get everything, but that is normal for HTML anyway.
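A rough sketch of that branch on the server side, assuming a Web Forms page behind netServerApp and embedded resources whose names match the query value (the class, namespace prefix, and content type are placeholders):

using System;
using System.Web.UI;

public class NetServerAppPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string src = Request.QueryString["src"];
        if (string.IsNullOrEmpty(src))
            return; // no ?src= argument: fall through and render the embedded HTML as usual

        var asm = GetType().Assembly;
        // Resource names are namespace-qualified; "MyServerControls" is a placeholder.
        using (var stream = asm.GetManifestResourceStream("MyServerControls." + src))
        {
            if (stream == null)
            {
                Response.StatusCode = 404;
            }
            else
            {
                Response.ContentType = "image/svg+xml"; // choose per file extension
                stream.CopyTo(Response.OutputStream);
            }
            Response.End();
        }
    }
}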
I have set up a .NET HTTP module to capture the HTML output of a page. I am looking for the quickest way to do the following:
Search through all the images (i.e. img tags and input controls of type image)
Find those that have a relative source path
Manipulate the path by converting it from relative to absolute (I pass the absolute path to it)
Update the html source
Output the manipulated HTML source to the user's browser
Any suggestions as to the best and most performant way of doing this? I am developing in C#.
You may want to have a look at http://htmlagilitypack.codeplex.com; it makes parsing and modifying HTML content really easy without having to resort to a bunch of regular expressions.
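A sketch of the rewrite step with HtmlAgilityPack, assuming you already have the captured HTML and the absolute base URL as strings (wiring this into a response filter inside the HTTP module is left out):

using System;
using HtmlAgilityPack;

public static class ImagePathRewriter
{
    public static string MakeImageUrlsAbsolute(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // img tags plus <input type="image"> controls
        var nodes = doc.DocumentNode.SelectNodes("//img[@src] | //input[@type='image'][@src]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            string src = node.GetAttributeValue("src", "");
            bool isAbsolute = src.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                           || src.StartsWith("https://", StringComparison.OrdinalIgnoreCase)
                           || src.StartsWith("//");
            if (!isAbsolute)
            {
                node.SetAttributeValue("src", new Uri(new Uri(baseUrl), src).ToString());
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}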
I need to scrape a remote html page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a javascript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are the ones that I want.
What I want is the ability to get this information using just a URL, without the bookmarklet. The issue is that by using the URL and trying something like HttpWebRequest and getting the HTML on the server, I will not have location values, since the page wasn't rendered in a browser. I need the location of the images and links to help me determine which ones I want.
So how can I get html from a remote site on the server AND use the rendered location values of the dom elements to help me locate images and links?
As you indicate, doing this purely through inspection of the html is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.
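If you do go down the WebBrowser route despite that, a rough sketch of reading rendered positions looks something like this. The URL is a placeholder and this is not production code: the control needs an STA thread and a message pump, and frame-heavy pages fire DocumentCompleted more than once.

using System;
using System.Drawing;
using System.Windows.Forms;

static class RenderedPositions
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            foreach (HtmlElement img in browser.Document.Images)
            {
                // OffsetRectangle is relative to OffsetParent; walk up the chain
                // to approximate the position within the page.
                Rectangle rect = img.OffsetRectangle;
                for (var parent = img.OffsetParent; parent != null; parent = parent.OffsetParent)
                    rect.Offset(parent.OffsetRectangle.Location);

                Console.WriteLine("{0} -> {1}", img.GetAttribute("src"), rect);
            }
            Application.ExitThread();
        };
        browser.Navigate("http://example.com/product-page"); // placeholder URL
        Application.Run(); // pump messages until the document has loaded
    }
}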
You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.
You can download it from http://htmlagilitypack.codeplex.com/
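For example (the URL is a placeholder; which nodes you select depends on what "most likely the product image" means for your pages):

using System.IO;
using System.Net;
using HtmlAgilityPack;

public static class PageScraper
{
    // Downloads a page and loads it into an HtmlAgilityPack document.
    public static HtmlDocument Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());
            return doc;
        }
    }
}

// Usage:
// var doc = PageScraper.Download("http://example.com/product-page");
// var images = doc.DocumentNode.SelectNodes("//img[@src]");
// var links  = doc.DocumentNode.SelectNodes("//a[@href]");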
I want to develop an ASP.NET page through which I can specify the URL of any page that contains links to many files and directories. I want to download them all, similar to the DownThemAll plugin for Firefox.
i.e.
"MyPage.htm" file contains many links to files/directories located on the same server.
now I want to write a function which can download all these file if I provide
"www.mycustomdomain.com\Mypage.htm" as input.
I hope question is clear.
Fetch the web page as HTML. Google "c# fetch file from web"; the first link will give you the idea.
Then find the links with regular expressions.
An example regex pattern for links on www.x.com would be:
(http://www.x.com/.*?)
(But it is better if you also include the A tag in your regex pattern.)
And download the files as shown in:
http://www.csharp-examples.net/download-files/
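Putting those pieces together, a rough sketch; the page URL, target folder, and the regex (which includes the A tag as suggested above) are all assumptions:

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkDownloader
{
    public static void DownloadAll()
    {
        string pageUrl = "http://www.mycustomdomain.com/MyPage.htm"; // hypothetical input
        string targetDir = @"C:\Downloads";                          // hypothetical target folder

        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);

            // href values inside A tags that point back at the same domain
            var matches = Regex.Matches(html,
                @"<a[^>]+href=""(http://www\.mycustomdomain\.com/.*?)""",
                RegexOptions.IgnoreCase);

            foreach (Match m in matches)
            {
                string fileUrl = m.Groups[1].Value;
                string fileName = Path.Combine(targetDir,
                    Path.GetFileName(new Uri(fileUrl).LocalPath));
                client.DownloadFile(fileUrl, fileName);
            }
        }
    }
}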
I hope I understand your question: you have an HTM file with a list of links, these links point to specific files on a remote server, and you want to download all the files.
There is no fail-proof way to do this.
Check this question: "How do you parse an HTML in vb.net". Even though it is for VB.NET, it is related to what you asked. You can get an array of links and then start downloading the files.
You can use the My.Computer.Network.DownloadFile method to download the remote file and save it to a location of your choice.
This is not a fail-proof method, because if a download requires authentication it will download the HTML page instead [usually the login page].