I have set up a .NET HTTP module to capture the HTML output of a page. I am looking for the quickest way to do the following:
Search through all the images (i.e. img tags and input controls of type image)
Find those that have a relative source path
Manipulate the path by converting it from relative to absolute (I pass the absolute path to it)
Update the html source
Output the manipulated HTML source to the user's browser
Any suggestions as to the best and most performant way of doing this? I am developing in C#.
You may want to have a look at http://htmlagilitypack.codeplex.com. It makes parsing and modifying HTML content really easy without having to resort to a bunch of RegEx.
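As a rough illustration of that suggestion, here is a minimal sketch of the rewrite step using the Html Agility Pack (a NuGet package, not part of the base class library). The class and method names are made up for this example, and it assumes the module has already captured the page output as a string and knows the absolute base URL.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package

static class ImagePathRewriter
{
    // Hypothetical helper: rewrites the src of <img> and <input type="image">
    // elements so that relative paths become absolute URLs under baseUri.
    public static string MakeImageSourcesAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        // Note: the type='image' comparison is case-sensitive.
        var nodes = doc.DocumentNode.SelectNodes(
            "//img[@src] | //input[@type='image'][@src]");
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            string src = node.GetAttributeValue("src", "");
            if (src.Length == 0) continue; // leave empty src values alone

            // Combining with the base leaves already-absolute URLs unchanged
            // and resolves relative ones against baseUri.
            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
                node.SetAttributeValue("src", absolute.AbsoluteUri);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Whether this is fast enough for every response would need measuring; parsing the whole document on each request has a cost, so caching the rewritten output may be worth considering.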
Related
I have a .NET Server Control app that simply returns some HTML. I also need to embed several picture files into the assembly so that the HTML file can use them as its src= for each of them.
We will simply have a .HTML file that lives in the project as an embedded resource and the server control code will read this html and serve it up. Within THAT html, we will need to have all the picture src links (as well as CSS, js, etc) to point back to embedded resource files.
Does anyone know what code I would put in the HTML file for the pictures to make it point back to the embedded picture file?
I have to do this on a grand scale... hundreds of times. I really would like a programmatic approach to doing this so I can write a wrapper and never have to touch it again when we update the server control with new html, picture files, etc.
One might imagine a way to do this at compile time where I can loop through the embedded files with GetManifestResourceNames and then replace() the src links with the HTTP resource links I suppose?
Thank you for any guidance!
Hm, your question covers quite a few aspects. Let me repeat it to see if I got it: you have an assembly with a raw HTML file in it. This file references some items which are found within the same assembly, and you want to have them served to the client upon request as well.
One possible solution might be this.
Instead of a raw HTML file, use a templated one. Then feed all available resource names as proper URLs into the templating engine to replace the placeholders. You may want to look at DotLiquid for this.
Create an HTTP handler for each file type you want to serve. Inside the handler, you pull the item from the resources of the DLL and serve it.
Alternatively, if those resources are rather small, you may want to look into the data URI scheme to save the extra requests and omit the handler. With this you could replace the placeholders with the data URIs directly, and serve a single HTML file with everything in it in the first place.
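To make the data URI variant more concrete, here is a minimal sketch that uses plain string replacement instead of a templating engine. The `{{resourceName}}` placeholder convention, the class name, and the `mimeTypeFor` callback are assumptions made for this example; they are not part of DotLiquid or of any framework.

```csharp
using System;
using System.IO;
using System.Reflection;

static class EmbeddedHtmlBuilder
{
    // Encodes raw bytes as a data URI of the form data:<mime>;base64,<payload>.
    public static string ToDataUri(byte[] bytes, string mimeType)
    {
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(bytes);
    }

    // Reads an embedded resource into memory; returns null if the name is unknown.
    public static byte[] ReadResource(Assembly assembly, string resourceName)
    {
        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null) return null;
            using (var buffer = new MemoryStream())
            {
                stream.CopyTo(buffer);
                return buffer.ToArray();
            }
        }
    }

    // Replaces each hypothetical {{resourceName}} placeholder in the template
    // with a data URI built from the embedded resource of that name.
    public static string InlineResources(
        string template, Assembly assembly, Func<string, string> mimeTypeFor)
    {
        foreach (string name in assembly.GetManifestResourceNames())
        {
            string placeholder = "{{" + name + "}}";
            if (!template.Contains(placeholder)) continue;

            byte[] bytes = ReadResource(assembly, name);
            template = template.Replace(
                placeholder, ToDataUri(bytes, mimeTypeFor(name)));
        }
        return template;
    }
}
```

Because the loop is driven by `GetManifestResourceNames`, new pictures added to the assembly are picked up without touching the wrapper, as long as the HTML uses the matching placeholder.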
Another choice is to have your .NET Server Control app check for optional GET arguments and return the image instead of the HTML.
Your original HTML request might be a simple:
GET netServerApp
Which returns the HTML with normal embedded links.
The HTML image links in the HTML might look like this:
<img src="netServerApp?src=Image1.svg">
or the like. Your server app would then return the appropriate image, instead of the HTML.
It means several round trips to get everything, but that is normal for HTML anyway.
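As a small sketch of the dispatch decision this implies, the helper below picks a response content type from the optional `src` argument. The class name and the list of file types are assumptions for illustration; the actual request handling and reading of the embedded file are left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ResourceRequest
{
    // Illustrative mapping only; extend it with whatever file types you embed.
    static readonly Dictionary<string, string> ContentTypes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".svg", "image/svg+xml" },
            { ".png", "image/png" },
            { ".jpg", "image/jpeg" },
            { ".css", "text/css" },
            { ".js",  "application/javascript" },
        };

    // No src argument means the caller wants the HTML page itself.
    // Returns null for file types that are not in the mapping, so the
    // caller can answer with an error instead of guessing.
    public static string ContentTypeFor(string src)
    {
        if (string.IsNullOrEmpty(src)) return "text/html";

        string type;
        return ContentTypes.TryGetValue(Path.GetExtension(src), out type)
            ? type
            : null;
    }
}
```

Only serving names that actually appear among the assembly's embedded resources would be a sensible extra check, since the `src` value comes from the client.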
I'm using the following code to load a website:
webContent.Navigate(new Uri(linkURL));
I want to cache all of the content (HTML, styles, JS) so that the site can be read offline.
I tried downloading the HTML source, CSS files and JS files using WebClient, replacing the references to those files in the HTML, and saving the result to a file named "index.htm", but it did not work well.
Can you suggest a way to resolve this issue? Thank you.
The only way to do this is to download ALL the relevant files (including those referenced inside the HTML, e.g. images, CSS, JS, etc.) and save them ALL to Isolated Storage with a matching file and folder structure.
The important point is that you also need to update all the paths within the content so that they point to relative paths that match where files have been saved.
You can then load the HTML from IsolatedStorage.
This is potentially a lot of work. I'd recommend exploring other options if possible first. Also remember to manage the files stored in IsolatedStorage appropriately so you don't just keep adding files there indefinitely.
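One piece of this, deciding where each downloaded file should live, can be sketched as a simple mapping from URL to relative storage path. The class name and the host/path folder layout are assumptions for this example; query strings are ignored, so two URLs that differ only in their query would collide.

```csharp
using System;

static class OfflinePathMapper
{
    // Hypothetical layout: <host>/<url path>, with index.htm for directory URLs.
    public static string ToLocalPath(Uri url)
    {
        string path = url.AbsolutePath.TrimStart('/');

        // A URL that ends in a slash (or has no path) gets a default file name.
        if (path.Length == 0 || path.EndsWith("/"))
            path += "index.htm";

        // Undo percent-encoding so the file name matches the readable name.
        return url.Host + "/" + Uri.UnescapeDataString(path);
    }
}
```

The same function can be used twice: once to choose the file name when saving, and once to compute the replacement path when rewriting references in the HTML, which keeps the two in agreement.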
I have HTML content coming from a database that gets rendered to a web page (like a CMS). In the HTML content, I want to allow ~/ in the paths of images and links and have the system resolve them as ResolveUrl would. What is an efficient method to do this? I will be using this C# across Web Forms and MVC. Thanks for any help or advice.
One option might be to use the Html Agility Pack to parse the HTML and extract the URLs, then resolve them and update the HTML with the resolved URLs.
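A minimal sketch of that approach follows, assuming the Html Agility Pack NuGet package is referenced. The class name and the `appRoot` parameter (for example "/" or "/myapp") are assumptions for this example; inside an ASP.NET application, `VirtualPathUtility.ToAbsolute` could do the resolution instead of the string concatenation used here.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package

static class AppRelativeUrlResolver
{
    // Hypothetical helper: replaces a leading "~/" in src and href attributes
    // with the application root passed in by the caller.
    public static string Resolve(string html, string appRoot)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no element has a "~/" URL.
        var nodes = doc.DocumentNode.SelectNodes(
            "//*[starts-with(@src,'~/') or starts-with(@href,'~/')]");
        if (nodes == null) return html;

        // Normalize so that "/" and "/myapp" both end in exactly one slash.
        string root = appRoot.TrimEnd('/') + "/";

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in new[] { "src", "href" })
            {
                string value = node.GetAttributeValue(name, "");
                if (value.StartsWith("~/", StringComparison.Ordinal))
                    node.SetAttributeValue(name, root + value.Substring(2));
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Since the helper only depends on a string and a root path, the same code can be called from both Web Forms and MVC.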
I need to scrape a remote html page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a javascript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are the ones that I want.
What I want is the ability to get this information by just using a URL and not the bookmarklet. The issue is that by using the URL and trying something like HttpWebRequest to get the HTML on the server, I will not have location values, since the page wasn't rendered in a browser. I need the location of images and links to help me determine the images and links that I want.
So how can I get html from a remote site on the server AND use the rendered location values of the dom elements to help me locate images and links?
As you indicate, doing this purely through inspection of the html is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.
You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.
You can download it from http://htmlagilitypack.codeplex.com/
Has anybody successfully used the SautinSoft HTML to RTF DLL which has images with a UNC path?
When we use the component to transform an HTML document with images whose src attribute points to a UNC path, the images are missing from the resulting RTF document.
When navigating to the HTML page directly - with UNC paths as the source - the images are displaying correctly.
This is no longer relevant, as we've moved on to using Aspose.Words as the tool to export the HTML to RTF format. From first impressions it appears to be (a) a lot more flexible and (b) easier to read, and (c) it has proper documentation.
My name is Max from SautinSoft. The current version of the HTML to RTF .Net component supports UNC paths. Thanks for reporting this issue!
Try wkhtmltopdf
http://code.google.com/p/wkhtmltopdf/