I am writing a SharePoint timer job, which needs to pull the content of a web page, and send that HTML as an email.
I am using HttpWebRequest and HttpWebResponse objects to pull the content.
The emailing functionality works fine except for one problem.
The web page which serves up the content of my email contains images.
When the HTML of the page is sent as an email, the image URLs inside it are all relative; they are not resolved to absolute URLs.
How do I resolve the image URLs to their absolute paths inside the web page content?
Is there any straightforward way to do this? I don't want to run a Regex over the HTML code to replace all relative URLs with absolute URLs.
Try adding a base element to the head of the HTML document you retrieve, with its href attribute set to the URL of the page you are retrieving.
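For example, here is a rough sketch of injecting such a base element with HtmlAgilityPack; pageUrl is an assumed variable holding the address of the retrieved page, and htmlMessage the HTML you pulled down:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlMessage);
HtmlAgilityPack.HtmlNode head = doc.DocumentNode.SelectSingleNode("//head");
if (head != null)
{
    // <base href="..."> tells the renderer to resolve relative URLs against pageUrl
    HtmlAgilityPack.HtmlNode baseNode = HtmlAgilityPack.HtmlNode.CreateNode("<base href=\"" + pageUrl + "\" />");
    head.PrependChild(baseNode);
}
htmlMessage = doc.DocumentNode.OuterHtml;
Bear in mind that support for the base element varies between email clients, so test against the clients you care about.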
Found this cool Codeplex tool called HtmlAgilityPack.
http://www.codeplex.com/htmlagilitypack
Using this API, you can parse HTML much as you would parse an XML document, and you can query and search nodes using XPath.
I used the following code snippet to fix the image URLs:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlMessage);

// This selects all the image nodes (SelectNodes returns null when there are none)
HtmlNodeCollection imgNodes = htmlDoc.DocumentNode.SelectNodes("//img");
if (imgNodes != null)
{
    foreach (HtmlNode node in imgNodes)
    {
        // Prefix the web application URL; this assumes the src values are
        // server-relative (i.e. they start with "/")
        string imgUrl = node.Attributes["src"].Value;
        node.Attributes["src"].Value = webAppUrl + imgUrl;
    }
}

StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
htmlDoc.OptionOutputAsXml = false;
htmlDoc.Save(sw);
htmlMessage = sb.ToString();
I've run into this problem a few times, and I don't think there is any magic-wand method out there to do it all for you. HtmlAgilityPack does a good job of aggregating the content you need, but you will have to decipher it yourself. For example, the list of HtmlNodes matching "//img" could return any of the following items:
<img src="http://www.adg2435.com/pictures/pic.jpg"/> //absolute url
<img src="coolpicture.jpg"/> //relative to the page
<img src="pictures/pic.jpg"/>
<img src="./pictures/pic.jpg"/>
It is up to you to figure out which types of links are going to show up on the given webpage.
You also need to account for things like this (truncate the image URL after the ".jpg" extension):
<img src="/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"/>
So, I find it handy to keep a few things on hand at any given time:
The source URL for the entire page
The domain of the given URL (so you can ask things like "does the given src already contain the domain?")
This is how you would get the domain of the source link:
Uri domainUri = new Uri(fullUrl);
string domainUrl = domainUri.GetLeftPart(UriPartial.Authority); // e.g. "http://www.mysite.com"
You may also want the base path of the page (e.g. "http://www.mysite.com/pictures/") for resolving page-relative links.
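For what it's worth, the Uri class can do most of this resolution for you. A rough sketch, assuming fullUrl holds the address of the page you downloaded and htmlDoc is the HtmlAgilityPack document loaded from its HTML:
Uri pageUri = new Uri(fullUrl);
HtmlNodeCollection imgNodes = htmlDoc.DocumentNode.SelectNodes("//img");
if (imgNodes != null)
{
    foreach (HtmlNode img in imgNodes)
    {
        string src = img.GetAttributeValue("src", null);
        if (string.IsNullOrEmpty(src))
            continue;
        // new Uri(baseUri, relative) copes with "pic.jpg", "/pictures/pic.jpg" and
        // "./pictures/pic.jpg", and leaves URLs that are already absolute untouched
        img.SetAttributeValue("src", new Uri(pageUri, src).AbsoluteUri);
    }
}
Query strings such as the "?45823593&..." example above survive this resolution, so trim them afterwards if you need a bare file name.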
"I don't want to run a Regex over the html code to replace all relative URLs with absolute URLs."
Too bad, because that's the only way you'll get the images to show up. Would you rather download all the images and embed them in the email too?
Background Info
I am currently working on a C# Web API that will return selected image URLs as Base64. I already have the functionality that performs the Base64 conversion; however, I am getting a large amount of text that also includes image URLs, which I need to crop out of the string and hand to my conversion function. I read up on a library ("HtmlAgilityPack") that should make this task easy, but when I use it I get "HtmlDocument.cs" not found. I am not submitting a document, though; I am sending it a string which is HTML. I read the documentation and it is supposed to work with a string as well, but it is not working for me. This is the code using HtmlAgilityPack.
NON WORKING CODE
foreach (var item in returnList)
{
    if (item.Content.Contains("~~/picture~~"))
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(item.Content); // fails here: Load expects a file path, but item.Content is an HTML string
    }
}
Error message from HtmlAgilityPack (screenshot omitted): "HtmlDocument.cs" not found.
Question
I am receiving a string of HTML from SharePoint. This HTML string may be tokenized with heading tokens and/or picture tokens. I am trying to isolate and retrieve the URL from the img src attribute. I understand that regex may be impractical, but I would consider working with a regex expression if one is available to retrieve the URL from the img src.
Sample String
Bullet~~Increased Cash Flow</li><li>~~/Document Text Bullet~~Tax Efficient Organizational Structures</li><li>~~/Document Text Bullet~~Tax Strategies that Closely Align with Business Strategies</li><li>~~/Document Text Bullet~~Complete Knowledge of State and Local Tax Obligations</li></ul><p>~~/Document Heading 2~~is the firm of choice</p><p>~~/Document Text~~When it comes to accounting and advisory services is the unique firm of choice. As a trusted advisor to our clients, we bring an integrated client service approach with dedicated industry experience. Dixon Hughes Goodman respects the value of every client relationship and provides clients throughout the U.S. with an unwavering commitment to hands-on, personal attention from our partners and senior-level professionals.</p><p>~~/Document Text~~of choice for clients in search of a trusted advisor to deal with their state and local tax needs. Through our leading best practices and experience, our SALT professionals offer quality and ease to the client engagement. We are proud to provide highly comprehensive services.</p>
<p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/map-al.jpg" alt="map al" style="width:611px;height:262px;" />
<br></p><p><br></p><p>
~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/Firm_Telescope_Illustration.jpg" alt="Firm_Telescope_Illustration.jpg" style="margin:5px;width:155px;height:155px;" /> </p><p></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div>
Important
I am working with an HTML string, not a file.
The issue you are having is that C# is looking for a file and, since it is not finding it, it tells you so. This is not an error that will break your app; it is just telling you that the file is not found, and the library will then read the string you gave it. The documentation can be found here: https://htmlagilitypack.codeplex.com/SourceControl/latest#Trunk/HtmlAgilityPackDocumentation.shfbproj. The code below is a cookie-cutter model that anyone can use.
Important
C# is looking for a file which cannot be found, because a string is supplied instead. That is the message you are getting; however, your code will still work, in accordance with the documentation provided, and the message will not affect it.
Example Code
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(yourHtmlString); // LoadHtml takes the HTML string itself; use Load for a file path
foreach (HtmlNode img in htmlDocument.DocumentNode.SelectNodes("//img"))
{
    HtmlAttribute att = img.Attributes["src"];
    Uri imgUrl = new System.Uri(siteUrl + att.Value); // build your absolute url (siteUrl is your site's root)
}
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].+?>", RegexOptions.IgnoreCase).Groups[1].Value;
I am trying to access certain nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one.
I am confused about how to access the secondary HTML document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect is a ScrapySharp extension method (see the remarks below)
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
As an example, the snippet above scrapes the Presented By field.
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions, which I find easier.
The URL you are scraping from is your problem. I am scraping from the last GET request performed after the page has loaded, which you can find by using the Firefox developer tools to analyze the site's network requests and responses (screenshot omitted).
I could not identify what triggers this HTTP request in the end; it may be JavaScript code, or it may be one of the frame HTML documents requested by the main, frame-enabled document.
If you only have a couple of URLs like this to scrape, then even manually extracting the correct URL would be an option.
I'm loading a web page into my WebView, and I can access its raw HTML as text. The page has several video elements embedded within it, and I want to get their locations as a list of strings so I can download them separately.
How would I go about doing this?
You can use the HTML Agility Pack for parsing:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(rawText);

// Collect the src of every <source> element inside a <video> element
var videoSources = new List<string>();
var videoSourceNodes = document.DocumentNode.SelectNodes("//video/source");
if (videoSourceNodes != null)
{
    foreach (var node in videoSourceNodes)
    {
        videoSources.Add(node.Attributes["src"].Value);
    }
}
Converting the relative paths to absolute URLs is up to you.
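If you need that too, a minimal sketch using the Uri class (pageUrl is an assumed variable holding the address currently loaded in the WebView):
Uri baseUri = new Uri(pageUrl);
string absoluteUrl = new Uri(baseUri, path).AbsoluteUri; // resolves both relative and already-absolute src values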
In my scenario I want to download the HTML of a page (any page on the Internet) programmatically, but I also want all of the images in the HTML to be embedded in Base64 format (not referenced).
In other words, instead of :
<img src='/images/delete.gif' />
I want the downloaded html to look like this:
<img src="data:image/gif;base64,R0lGODl..." />
This way I don't need to go through the process of storing all images in directories, etc, etc.
Does any of you have any idea how this can be done? Or any plugin to do this efficiently?
Well, you'd need to:
Download the original HTML
Find each img element in the HTML (for instance using the HTML agility pack) and for each one:
If it's already using a data URL, ignore it
Otherwise:
Download the image
Encode it in Base64 using Convert.ToBase64String
Replace the original img tag with one using the base64 version (either in the original string, or via a DOM representation)
Save the final HTML to disk
Are any of these steps causing you a particular problem? You could potentially make it quicker by downloading the images in parallel, but I'd get a serial version working first.
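A rough sketch of those steps, using WebClient and HtmlAgilityPack; pageUrl and the output file name are assumptions of this example, and the content-type handling is deliberately naive:
// requires references to System.Net and HtmlAgilityPack
using (WebClient client = new WebClient())
{
    string html = client.DownloadString(pageUrl);          // 1. download the original HTML
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
    if (imgs != null)
    {
        foreach (HtmlNode img in imgs)                      // 2. for each img element...
        {
            string src = img.GetAttributeValue("src", "");
            if (src.StartsWith("data:"))                    //    ...already a data URL: ignore it
                continue;

            Uri imageUri = new Uri(new Uri(pageUrl), src);  //    resolve relative srcs
            byte[] bytes = client.DownloadData(imageUri);   //    download the image
            string base64 = Convert.ToBase64String(bytes);  //    encode it

            // naive content-type guess; a real version would inspect the response headers
            string mime = imageUri.AbsolutePath.EndsWith(".png") ? "image/png" : "image/jpeg";
            img.SetAttributeValue("src", "data:" + mime + ";base64," + base64);
        }
    }

    doc.Save("page-with-embedded-images.html");             // 3. save the final HTML to disk
}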
Instead of using an HTML page with images as Base64-encoded strings in the src attribute, you might consider the MHTML format. Most browsers support the format, and it embeds all external resources (including images).
// CDO.MessageClass comes from the "Microsoft CDO for Windows 2000" COM library
var msg = new CDO.MessageClass();
msg.MimeFormatted = true;
msg.CreateMHTMLBody("http://www.google.com", CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");
var stream = msg.GetStream();
var mhtml = stream.ReadText(stream.Size);
Use a regular expression (regex) to extract URLs from img tags, translate them to absolute URLs using the Uri class, then use WebClient to download the target images. After that it's just a case of using Convert.ToBase64String to produce the Base64.
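If you go that route, a minimal sketch combining those pieces (original_text, pageUrl and the naive regex and content type are assumptions of this example):
string embedded = Regex.Replace(original_text,
    "<img[^>]+?src=[\"'](?<url>[^\"']+)[\"'][^>]*>",
    match =>
    {
        // resolve the extracted src against the page address
        Uri absolute = new Uri(new Uri(pageUrl), match.Groups["url"].Value);
        using (WebClient client = new WebClient())
        {
            string base64 = Convert.ToBase64String(client.DownloadData(absolute));
            // content type is guessed here; inspect the response headers in real code
            return match.Value.Replace(match.Groups["url"].Value, "data:image/jpeg;base64," + base64);
        }
    },
    RegexOptions.IgnoreCase);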
From here, I am trying to get stock quote data at an interval of every 10 minutes.
I used WebClient to download the page content and regular expressions for parsing. This works fine for other URLs, but for this particular URL my parsing code does not work.
I think the problem is with JavaScript: when I load the page in a browser, it takes some extra time after the page content loads to plot the data, so the site may be using a client-side script for this page. Can anyone help me, please?
HTML Agility Pack will save you tons of headaches. Try it instead of using regexps to parse HTML.
For what it's worth, in the page you link to the quote data is indeed in JavaScript code; check http://www.nseindia.com/js/getquotedata.js and http://www.nseindia.com/js/quote_data.js
As per Vinko Vrsalovic's answer, the HTML Agility Pack is your friend. Here is a sample:
WebClient client = new WebClient();
string source = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(source);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//*[@href]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        if (node.Attributes.Contains("class")
            && node.Attributes["class"].Value.Contains("StockData"))
        {
            // Here is our info
        }
    }
}