In my scenario I want to download the HTML of a page (any page on the Internet) programmatically, but I also want all of the images in the HTML to be embedded as base64 (not referenced).
In other words, instead of :
<img src='/images/delete.gif' />
I want the downloaded html to look like this:
<img src="..." />
This way I don't need to go through the process of storing all images in directories, etc, etc.
Do any of you have an idea how this can be done? Or know of a plugin that does this efficiently?
Well, you'd need to:
Download the original HTML
Find each img element in the HTML (for instance using the HTML agility pack) and for each one:
If it's already using a data URL, ignore it
Otherwise:
Download the image
Encode it in Base64 using Convert.ToBase64String
Replace the original img tag with one using the base64 version (either in the original string, or via a DOM representation)
Save the final HTML to disk
Are any of these steps causing you a particular problem? You could potentially make it quicker by downloading the images in parallel, but I'd get a serial version working first.
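The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: it uses a regular expression instead of the HTML Agility Pack so that it has no dependencies, the names ImageEmbedder, EmbedImages and GuessMimeType are made up for this example, and the download parameter is a hypothetical hook (for instance uri => new WebClient().DownloadData(uri)). The MIME type is guessed from the file extension, which is an assumption; the Content-Type response header would be more reliable.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ImageEmbedder
{
    // Captures: 1 = everything up to the opening quote, 2 = the quote, 3 = the src value.
    private static readonly Regex ImgSrc = new Regex(
        "(<img\\b[^>]*?\\ssrc\\s*=\\s*)([\"'])(.*?)\\2",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Builds a data URI from raw bytes and a MIME type.
    public static string ToDataUri(byte[] bytes, string mimeType)
    {
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(bytes);
    }

    // Crude guess based on the file extension (assumption, see above).
    public static string GuessMimeType(Uri imageUri)
    {
        switch (Path.GetExtension(imageUri.AbsolutePath).ToLowerInvariant())
        {
            case ".png": return "image/png";
            case ".gif": return "image/gif";
            case ".svg": return "image/svg+xml";
            default: return "image/jpeg";
        }
    }

    // Replaces every img src that is not already a data URL with an embedded one.
    public static string EmbedImages(string html, Uri pageUri, Func<Uri, byte[]> download)
    {
        return ImgSrc.Replace(html, m =>
        {
            string src = m.Groups[3].Value;
            if (src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                return m.Value; // already embedded, leave it alone

            Uri absolute = new Uri(pageUri, src); // resolve relative URLs against the page
            string dataUri = ToDataUri(download(absolute), GuessMimeType(absolute));
            return m.Groups[1].Value + m.Groups[2].Value + dataUri + m.Groups[2].Value;
        });
    }
}
```

You would call it with the downloaded page source and the page's own URL, then write the returned string to disk. A DOM-based version (for example with the HTML Agility Pack) would be more robust against unusual markup such as unquoted attributes or HTML-encoded URLs.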
Instead of using an HTML page with images as base64-encoded strings in the src attribute, you might consider using the MHTML format. Most browsers support the format, and it embeds all external resources (including images).
var msg = new CDO.MessageClass();
msg.MimeFormatted = true;
msg.CreateMHTMLBody("http://www.google.com", CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");
var stream = msg.GetStream();
var mhtml = stream.ReadText(stream.Size);
Use a regular expression (regex) to extract URLs from img tags, translate them to absolute URLs using the Uri class, then use WebClient to download the target images. After that it's just a case of using Convert.ToBase64String to produce the Base64.
When converting from docx to html you may specify the output path for any images
org.docx4j.Docx4J.toHTML(wordMLPackage, imageDirPath, imageTargetUri, fos2);
and the resulting html document references images via files:
<img height="22" id="rId7" src="..cc6bcedf-2770-45ad-8e81-610bbd8746ceimage1.png" width="42">
Instead I would like the converter to embed the files as base64. Is this possible?
You can write your own ConversionImageHandler implementation to do that.
The default implementation HTMLConversionImageHandler writes images to files.
To use your image handler, specify it via htmlSettings.setImageHandler
You do not need a custom ConversionImageHandler to achieve this.
You can simply set imageDirPath to an empty string and the images will be embedded
org.docx4j.Docx4J.toHTML(wordMLPackage, "", "", fos2);
This works because org.docx4j.model.images.AbstractConversionImageHandler (from which HTMLConversionImageHandler derives) already handles this case for you.
I am searching for a solution to convert HTML to PDF with external CSS support. I downloaded the trial version of the Winnovative Toolkit Total v11.14, and tried out the demo application for the method public byte[] GetPdfBytesFromHtmlString (string htmlString, string urlBase). The PDF files are generated, but the CSS is not applied.
Note: I tried the same input HTML string and base URL in the demo site. It's working fine, so I don't know why it's not working in my system. The demo application is shared in v11.14 ZIP files.
Input provided for this method:
htmlString = HTML source of the url 'http://www.winnovative-software.com/'
urlBase = "http://www.winnovative-software.com/"
Are you using a proxy to access the Internet? If so, you should set the HtmlToPdfConverter.ProxyOptions object properties in your code.
I have C# code for fetching images from URLs like http://i.imgur.com/QvkaduU.jpg but how would I fetch the image from web pages like this: http://imgur.com/gallery/QvkaduU?
Is there any "easy" way to do this, or will I have to fetch the HTML and write a C# parser that looks through the HTML for the image that is bigger than all the others?
Let me clear this up. If you paste http://imgur.com/gallery/QvkaduU (the HTML version) into, for example, Facebook's status update field, it will find the main image and make a thumbnail out of it. This is exactly the behavior I'm looking for. The question is, how is this done? Do I have to write my own HTML parser or is there an easy way to get this?
There is no easy way to get a "good" thumbnail image for an arbitrary URL.
Facebook's algorithm for doing so is fairly complex. Page developers are able to give it a hint by adding various meta tags to the <head>, including:
<meta property="og:image" content="http://url_to_your_image_here" />
or
<link rel="image_src" href="http://www.code-digital.co.uk/preview.jpg" />
(more on this)
... so if you wanted to replicate Facebook's algorithm, you would need to fetch the page source, parse it for any "hints" like the one above (you'd better check that I haven't missed any other "hint" formats), and come up with a fallback algorithm if the page doesn't include one of those.
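As a rough illustration of the "parse it for hints" step, here is a minimal sketch that looks for the two hint formats shown above. It is not Facebook's algorithm; the class and method names are invented for this example, and the regular expressions assume that property/rel appears before content/href inside the tag, which real pages do not guarantee.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThumbnailHints
{
    // <meta property="og:image" content="..."> (attribute order is an assumption)
    private static readonly Regex OgImage = new Regex(
        "<meta[^>]+property\\s*=\\s*[\"']og:image[\"'][^>]*content\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // <link rel="image_src" href="..."> (attribute order is an assumption)
    private static readonly Regex ImageSrc = new Regex(
        "<link[^>]+rel\\s*=\\s*[\"']image_src[\"'][^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the hinted image URL, or null when the page offers neither hint.
    public static string FindHintedImage(string html)
    {
        Match m = OgImage.Match(html);
        if (!m.Success) m = ImageSrc.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

When FindHintedImage returns null, you would fall back to whatever heuristic you choose, such as picking the largest image on the page.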
A more realistic solution would be to use someone else's URL -> thumbnail system.
If you like Facebook's version, I think you should be able to request Facebook's thumbnail for a given URL via their API.
Other services which offer this sort of thing are:
http://webthumb.bluga.net/home (not free)
http://immediatenet.com/thumbnail_api.html (free, may have restrictive TOS)
https://www.google.com/search?q=get+thumbnail+for+url
If the QvkaduU part is always the same between the html page and the image, could you just do a string replacement?
"http://imgur.com/gallery/QvkaduU".Replace("imgur.com/gallery","i.imgur.com") + ".jpg";
I would fetch the whole HTML source, use a regex to collect every <img ... src="..."> value as well as every inline CSS < ... style="... background-image: ...;"> URL, and temporarily download all the files behind those links. Then I would (try to convert each one to a Bitmap and) check its pixel size; the largest picture should be the one you want.
A quick search will show you how to check the pixel size and convert images.
The regex to get all image links from an HTML source should be
<img[^>]+src=\"([^"]+)\".*?>|<[^>]+style=\"[^"]*background-image:\s*url\(\s*'?([^')]+)\s*'?\)\s*;?.*?> (not tested, but pretty sure)
The URL will be in capture group 1 (img src) or capture group 2 (background-image). Also, don't forget to resolve relative links against the current page URL.
You're already on the right track: the most reliable way is to fetch the HTML, parse it, look for images, and then rank them by position and size. For instance, if the first image you find is big enough to make the thumbnail, use it; if it is too small, move on to the next image, and so on. It would also be advisable to use an image plugin like Timthumb (I think I've seen an ASP.NET version at some point) and cache the images, so that once you've looked up the thumbnail for a website you can serve the image(s) from the cache instead.
Can you try to do something like this?
public void ProcessRequest(HttpContext context)
{
    // load the image here (into a byte[] named imageData)
    // ....
    // and send it to the browser
    context.Response.OutputStream.Write(imageData, 0, imageData.Length);
}
You can also try what they are talking about here. I tried it and it worked like a charm.
http://www.dotnetspider.com/resources/42565-Download-images-from-URL-using-C.aspx
Can you try this?
public Bitmap getImageFromURL(String sURL)
{
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(sURL);
myRequest.Method = "GET";
HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();
System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(myResponse.GetResponseStream());
myResponse.Close();
return bmp;
}
Taken from:
How to get an image to a pictureBox from an URL? (Windows Mobile)
I want to know how to create an image object from the src information in an email. I already manage to read the inbox, parse the HTML of each message, and extract the "src = foo" value of every image in the email. My question is how to then create an image from the information taken from the src attribute in the HTML. I need this object in order to store it in a SharePoint picture library.
Not sure about how to put it in SharePoint, but assuming you have a src in an extractedSrc variable:
WebClient webClient = new WebClient();
webClient.DownloadFile(extractedSrc, localFileName);
I guess there are two basic cases you have to consider: 1. the src attribute points to an external image (i.e. an image stored on a web site); 2. the src attribute points to an image attached to the email.
For case 1, you need to download the image from the external server, and then you can save it in your SharePoint library.
For case 2, you have to decode the attachment sections of the email to extract the file data, and then you can save it to your library.
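Telling the two cases apart could look something like the sketch below. It assumes that attached images are referenced with a cid: URL (the usual convention for inline attachments in HTML email) and that external images use http or https; the enum and class names are invented for this example, and anything else is reported as Unknown rather than guessed at.

```csharp
using System;

public enum ImageSource { External, Attachment, Unknown }

public static class EmailImages
{
    // Classifies an extracted src value into one of the two cases above.
    public static ImageSource Classify(string src)
    {
        // Case 2: assumed to be a Content-ID reference to a MIME part of the email.
        if (src.StartsWith("cid:", StringComparison.OrdinalIgnoreCase))
            return ImageSource.Attachment;

        // Case 1: an absolute http(s) URL that can be downloaded, e.g. with WebClient.
        Uri uri;
        if (Uri.TryCreate(src, UriKind.Absolute, out uri) &&
            (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            return ImageSource.External;

        return ImageSource.Unknown;
    }
}
```

For Attachment results you would then look up the MIME part whose Content-ID matches the text after "cid:"; how to do that depends on the mail library you use to read the inbox.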
I am writing a SharePoint timer job, which needs to pull the content of a web page, and send that HTML as an email.
I am using HttpWebRequest and HttpWebResponse objects to pull the content.
The emailing functionality works fine except for one problem.
The web page which serves up the content of my email contains images.
When the HTML of the page is sent as an email, the image URLs inside the HTML code are all relative URLs; they are not resolved to absolute URLs.
How do I resolve the image URLs to their absolute paths inside the web page content?
Is there any straightforward way to do this? I don't want to run a Regex over the HTML code to replace all relative URLs with absolute URLs.
Try adding a base element to the head of the HTML document you retrieve. As its href attribute, use the URL of the page you are retrieving.
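A minimal sketch of that idea, assuming the retrieved document has a head element: the helper below (its name is made up for this example) inserts a base tag immediately after the opening head tag and leaves the HTML unchanged if it finds none.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class BaseElement
{
    // Matches the opening <head> tag (but not <header>).
    private static readonly Regex HeadOpen = new Regex("<head\\b[^>]*>", RegexOptions.IgnoreCase);

    // Inserts <base href="pageUrl" /> right after the first opening <head> tag.
    public static string AddBase(string html, string pageUrl)
    {
        string baseTag = "<base href=\"" + WebUtility.HtmlEncode(pageUrl) + "\" />";
        return HeadOpen.Replace(html, m => m.Value + baseTag, 1);
    }
}
```

Whether the receiving mail client honours the base element is something you would need to test; if it does not, rewriting each src individually is the fallback.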
Found this cool Codeplex tool called HtmlAgilityPack.
http://www.codeplex.com/htmlagilitypack
Using this API, we can parse Html like we can parse XML documents. We can also query and search nodes using XPath.
I used the following code snippet to fix the Image URLs
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlMessage);
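// Select all img nodes that have a src attribute.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection imgNodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
if (imgNodes != null)
{
    foreach (HtmlNode node in imgNodes)
    {
        // Assumes every src is relative to the web application root.
        string imgUrl = node.Attributes["src"].Value;
        node.Attributes["src"].Value = webAppUrl + imgUrl;
    }
}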
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
htmlDoc.OptionOutputAsXml = false;
htmlDoc.Save(sw);
htmlMessage = sb.ToString();
I've run into this problem a few times, and I don't think there is any magic-wand method out there to do it all for you. HtmlAgilityPack does a good job of collecting the content you need, but you will have to interpret it yourself. For example, getting the list of HtmlNodes that match "//img" could return any of the following items:
<img src="http://www.adg2435.com/pictures/pic.jpg"/> //absolute url
<img src="coolpicture.jpg"/> //relative to the page
<img src="pictures/pic.jpg"/>
<img src="./pictures/pic.jpg"/>
It is up to you to figure out which types of links are going to show up on the given webpage.
You also need to account for things like this: (Truncate your image url after the extension ".jpg")
<img src="/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"/>
So, I find it handy to keep a few things on hand at any given time:
The source URL for the entire page
The domain for the given url (to do things like say "does the given src contain the domain?")
This is how you would get the domain of the source link:
Uri domainUri = new Uri(fullUrl);
string domainUrl = domainUri.GetLeftPart(UriPartial.Authority);
You may also want the base path of the page (e.g. "http://www.mysite.com/pictures/"), since page-relative src values resolve against it.
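If you keep the full page URL on hand, the Uri class can do most of this resolution for you. The sketch below is only an illustration; the helper names are invented and the page URL in the usage note is a made-up example.

```csharp
using System;

public static class ImageUrls
{
    // Resolves src against the page URL; absolute src values pass through unchanged.
    public static string ToAbsolute(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src).AbsoluteUri;
    }

    // Drops the query string and fragment, i.e. everything after the ".jpg".
    public static string WithoutQuery(string absoluteUrl)
    {
        return new Uri(absoluteUrl).GetLeftPart(UriPartial.Path);
    }
}
```

With a page at http://www.mysite.com/pictures/index.html, ToAbsolute turns "coolpicture.jpg" into http://www.mysite.com/pictures/coolpicture.jpg and "/pictures/pic.jpg" into http://www.mysite.com/pictures/pic.jpg, while an already absolute src is returned as-is.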
I don't want to run a Regex over the html code to replace all relative URLs with absolute URLS.
Too bad, because that's the only way you'll get the images to show up. Would you rather download all the images and embed them in the email too?