C# Html Agility Pack: getting div and span nodes

This is the HTML document I am trying to extract the highlighted data from.
I have read a lot on this site but was unable to find a solution that was helpful.
I tried using
var nodes = doc.DocumentNode.SelectNodes(table_title + "/tbody/tr/td");
var headers = nodes.Elements("span").Select(d => d.InnerText.Trim());
foreach (var this_header in headers)
{
    Console.WriteLine(this_header);
}
This does not give me the correct information. How do I find the specific content I am looking for?

What is this /tbody/tr/td? There is no table at all.
You have to pass a unique selector (XPath, CSS, or id) to SelectNodes.
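For instance, a minimal sketch; the id "propertyData" here is a hypothetical placeholder for whatever uniquely identifies the element in your actual document:
var container = doc.DocumentNode.SelectSingleNode("//div[@id='propertyData']");
// SelectNodes returns null when nothing matches, so guard before iterating.
var spans = container?.SelectNodes(".//span");
if (spans != null)
{
    foreach (var span in spans)
    {
        Console.WriteLine(span.InnerText.Trim());
    }
}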

Related

Extracting string from Html page using C#

I have a source HTML page and I want to do the following:
extract a specific string from the whole HTML page and save the chosen string in a new HTML page;
create a database on MySQL with 4 columns;
import the data from the HTML page to the table on MySQL.
I would be thankful and grateful if someone could help me with this, since my knowledge of C# is far from perfect.
You could use this code:
HttpClient http = new HttpClient();
// I have put ebay.com here; you could use any site.
var response = await http.GetByteArrayAsync("http://www.ebay.com");
string source = Encoding.UTF8.GetString(response, 0, response.Length);
source = WebUtility.HtmlDecode(source);
HtmlDocument Nodes = new HtmlDocument();
Nodes.LoadHtml(source);
In the Nodes object, you will have all the DOM elements of the HTML page.
You could use LINQ to filter out whatever you need.
Example :
List<HtmlNode> RequiredNodes = Nodes.DocumentNode.Descendants()
    .Where(x => x.GetAttributeValue("class", "").Contains("List-Item")).ToList();
You will probably need to install the Html Agility Pack NuGet package or download it from the project site.
Hope this helps.
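For the MySQL part of the question, here is a minimal sketch, assuming the MySql.Data NuGet package and a hypothetical table "pages" with four columns (the connection string, table, and column names are placeholders; adjust to your own schema):
using MySql.Data.MySqlClient;

// Hypothetical title extraction; "source" is the decoded HTML from above.
string extractedTitle = Nodes.DocumentNode.SelectSingleNode("//title")?.InnerText ?? "";
using (var conn = new MySqlConnection("server=localhost;database=scrape;uid=user;pwd=pass"))
{
    conn.Open();
    var cmd = new MySqlCommand(
        "INSERT INTO pages (url, title, body, scraped_at) VALUES (@url, @title, @body, @at)",
        conn);
    cmd.Parameters.AddWithValue("@url", "http://www.ebay.com");
    cmd.Parameters.AddWithValue("@title", extractedTitle);
    cmd.Parameters.AddWithValue("@body", source);
    cmd.Parameters.AddWithValue("@at", DateTime.UtcNow);
    cmd.ExecuteNonQuery();
}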

HTML parsing from C#

I'm trying to parse some HTML files which don't always have the exact same format. Nevertheless, I've been able to find some patterns which are common to all the files.
For example, this is one of the files:
https://www.sec.gov/Archives/edgar/data/63908/000006390816000103/mcd-12312015x10k.htm#sFBA07EFA89A85B6DB59920A55B5021BC
I've seen that all the files I need have a unique tag whose InnerText equals "Financial Statements and Supplementary Data". I cannot search directly for that string, as it appears repeatedly throughout the text. I used this code to find that tag:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
if (link.InnerText.Contains("Financial Statements"))
{
}
}
I was wondering if there's any way to get the position of this tag in the HTML string so I can get the data I need by doing:
dataNeeded = html.substring(indexOf<a>Tag);
Thanks a lot
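One possible approach is HtmlAgilityPack's StreamPosition property, which reports a node's character offset in the markup the parser consumed; a sketch under that assumption (the offsets line up with the text you loaded, so read the raw HTML yourself and substring the same string):
string html = File.ReadAllText(m_strFilePath); // or download the page first
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
string dataNeeded = null;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    if (link.InnerText.Contains("Financial Statements"))
    {
        // StreamPosition is the offset of this node in the loaded html string.
        dataNeeded = html.Substring(link.StreamPosition);
        break;
    }
}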

C# HtmlAgilityPack: webscrape an HTML node inside the first HTML node

I am trying to access some div nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one.
I am confused about how to access the secondary HTML document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack, and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to the information needed to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the Presented By field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link").FirstOrDefault();
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.InnerText);
}
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions - something I find easier to do.
The url you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which you can find by using the Firefox developer tools to analyze the site's network requests and responses.
I could not yet identify what triggers this HTTP request in the end (it may be JavaScript code, or one of the frame HTMLs requested in the main, frame-enabled document).
If you only have a couple of urls like this to scrape, then even manually extracting the correct url is an option.
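If you want to find that inner document programmatically instead, one sketch (the frame structure is an assumption based on the remarks above) is to load the frame-enabled main page and follow the src attribute of each frame:
var web = new HtmlWeb();
var mainUrl = "http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes";
var main = web.Load(mainUrl);
var frames = main.DocumentNode.SelectNodes("//frame|//iframe");
if (frames != null)
{
    foreach (var frame in frames)
    {
        // Resolve each frame's src against the main page url and load it.
        var src = frame.GetAttributeValue("src", "");
        var frameUrl = new Uri(new Uri(mainUrl), src).AbsoluteUri;
        var inner = web.Load(frameUrl);
        // ...search inner.DocumentNode for the divs you need...
    }
}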

Get video(s) location from raw HTML text

I'm loading a web page into my WebView, and I can access its raw HTML as text. The page has several video elements embedded within it, and I want to get their locations as a list of strings so I can download them separately.
How would I go about doing this ?
You can use Html Agility Pack for parsing:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(rawText);
var videoSourceNodes = document.DocumentNode.SelectNodes("//video/source");
var videoPaths = new List<string>();
foreach (var node in videoSourceNodes)
{
    videoPaths.Add(node.Attributes["src"].Value);
}
Converting relative paths to absolute URLs is up to you.
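A minimal sketch of that conversion, assuming you know the URL of the page you loaded (the base URL here is a made-up example):
// Resolve each possibly-relative src against the page's url.
var baseUri = new Uri("http://example.com/videos/page.html"); // hypothetical page url
foreach (var path in videoPaths)
{
    string absoluteUrl = new Uri(baseUri, path).AbsoluteUri;
    // download absoluteUrl separately...
}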

Sending a web page by email programmatically. Image URLs not resolved

I am writing a SharePoint timer job, which needs to pull the content of a web page, and send that HTML as an email.
I am using HttpWebRequest and HttpWebResponse objects to pull the content.
The emailing functionality works fine except for one problem.
The web page which serves up the content of my email contains images.
When the HTML of the page is sent as an email, the image URLs inside it are all relative; they are not resolved to absolute URLs.
How do I resolve the image URLs to their absolute paths inside the web page content?
Is there any straightforward way to do this? I don't want to run a regex over the HTML to replace all relative URLs with absolute URLs.
Try adding a base element to the head of the HTML document you retrieve, with its href attribute set to the URL of the page you are retrieving.
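A sketch of that idea with HtmlAgilityPack; pageUrl stands for whatever URL you pulled the content from:
var doc = new HtmlDocument();
doc.LoadHtml(htmlMessage);
var head = doc.DocumentNode.SelectSingleNode("//head");
if (head != null)
{
    // <base href="..."> makes mail clients resolve relative urls against pageUrl.
    var baseNode = doc.CreateElement("base");
    baseNode.SetAttributeValue("href", pageUrl);
    head.PrependChild(baseNode);
}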
Found this cool Codeplex tool called HtmlAgilityPack.
http://www.codeplex.com/htmlagilitypack
Using this API, we can parse Html like we can parse XML documents. We can also query and search nodes using XPath.
I used the following code snippet to fix the Image URLs
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlMessage);
//This selects all the Image Nodes
HtmlNodeCollection hrefNodes = htmlDoc.DocumentNode.SelectNodes("//img");
foreach (HtmlNode node in hrefNodes)
{
string imgUrl = node.Attributes["src"].Value;
node.Attributes["src"].Value = webAppUrl + imgUrl;
}
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
htmlDoc.OptionOutputAsXml = false;
htmlDoc.Save(sw);
htmlMessage = sb.ToString();
I've run into this problem a few times, and I don't think there is any magic-wand method out there to do it all for you. HtmlAgilityPack does a good job of aggregating the content you need, but you will have to decipher it yourself. For example, the list of HtmlNodes returned for "//img" could contain any of the following items:
<img src="http://www.adg2435.com/pictures/pic.jpg"/> //absolute url
<img src="coolpicture.jpg"/> //relative to the page
<img src="pictures/pic.jpg"/>
<img src="./pictures/pic.jpg"/>
It is up to you to figure out which types of links are going to show up on the given webpage.
You also need to account for things like this: (Truncate your image url after the extension ".jpg")
<img src="/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"/>
So, I find it handy to keep a few things on hand at any given time:
The source URL for the entire page
The domain for the given url (to do things like say "does the given src contain the domain?")
This is how you would get the domain of the source link:
Uri domainUri = new Uri(fullUrl);
string domainUrl = domainUri.GetLeftPart(UriPartial.Authority);
Potentially, you may also want the base path of the page (e.g. "http://www.mysite.com/pictures/").
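In practice, System.Uri will resolve every src shape listed above against the page URL, and GetLeftPart can truncate the query string after the extension; a short sketch (the page URL is a made-up example):
var pageUri = new Uri("http://www.adg2435.com/gallery/page.html"); // hypothetical page
string[] srcs =
{
    "http://www.adg2435.com/pictures/pic.jpg",
    "coolpicture.jpg",
    "pictures/pic.jpg",
    "./pictures/pic.jpg",
    "/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"
};
foreach (string src in srcs)
{
    var resolved = new Uri(pageUri, src);
    // GetLeftPart(UriPartial.Path) drops the query string after ".jpg".
    Console.WriteLine(resolved.GetLeftPart(UriPartial.Path));
}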
I don't want to run a Regex over the html code to replace all relative URLs with absolute URLS.
Too bad, because that's the only way you'll get the images to show up. Would you rather download all the images and embed them in the email too?
