I've seen that the Html Agility Pack can come in handy, but I don't understand how it works. This is the code I have right now; it extracts the headings' content successfully, but it also picks up unneeded content.
driver.Manage().Window.Maximize();
driver.Navigate().GoToUrl(response);
string sourcePage = driver.PageSource;
Regex regexHeadings = new Regex("(?<=\\>)(?!\\<)(.*)(?=\\<)(?<!\\>)");
foreach (Match match in regexHeadings.Matches(sourcePage))
{
    h1Keywords.Add(match.Value);
    colorOutput(ConsoleColor.White, match.Value);
}
I'd recommend using the Html Agility Pack with XPath or CSS selectors.
See this cheatsheet for help: https://devhints.io/xpath
Quick example:
var url = "https://devhints.io/xpath";
var web = new HtmlWeb();
var doc = web.Load(url);
foreach (var heading in doc.DocumentNode.SelectNodes("//h1"))
{
    Console.WriteLine(heading.InnerText);
}
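One thing to watch for: SelectNodes returns null (not an empty collection) when nothing matches, so a slightly more defensive version of that loop might look like this:
var headings = doc.DocumentNode.SelectNodes("//h1");
if (headings != null) // SelectNodes yields null when there is no match
{
    foreach (var heading in headings)
    {
        Console.WriteLine(heading.InnerText);
    }
}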
What I need to do: extract the information in the From, To, Cc and Subject fields and remove them from an HTML file, without using any 3rd-party library (HtmlAgilityPack, etc.).
What I am having trouble with: what approach should I take to get those fields (from, to, subject, cc) out of the HTML tags?
Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email @sampleemail.com, but I think that is a bad approach, since in some HTML files there will be a lot of
"<p class=MsoNormal>" tags. For the removal of the from, to, cc and subject I just used string.Remove(indexOf, count) (I counted the characters from indexOf to lastIndexOf) and it worked.
Sample tag containing the From information:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HTMLAgilityPack is your friend. Simply use an XPath query like //p[@class='MsoNormal'] to get those tags' content from the HTML:
public static void Main()
{
    var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We can use Regex to write this simple parser, but remember that it cannot cover all cases for a complicated HTML document.
public static void MainFunc()
{
    string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";
    var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
    Console.WriteLine(result);
}
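If you also need the individual field values rather than just the stripped text, a minimal sketch along the same lines (assuming each label and its value end up in the same paragraph, as in the sample; the pattern is illustrative, not a general HTML parser):
var text = Regex.Replace(str, "<[^>]+>", "");
var m = Regex.Match(text, @"(From|To|Cc|Subject):\s*(.+)");
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value + " => " + m.Groups[2].Value.Trim());
}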
I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. Though I have looked up help documents on many websites, I just can't solve the problem. In addition, I use C# in Visual Studio 2005, and I can't speak English fluently, so I will give my sincere thanks to anyone who can write some helpful code.
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}
So you can imagine that for img/@src, you just replace each a with img, and href with src.
You might even be able to simplify to:
foreach (HtmlNode node in doc.DocumentNode
    .SelectNodes("//a[@href] | //img[@src]"))
{
    list.Add(node.GetAttributeValue("href", null) ?? node.GetAttributeValue("src", ""));
}
For relative URL handling, look at the Uri class.
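For example (a minimal sketch; the page address is a placeholder):
var baseUri = new Uri("https://example.com/some/page.html");
var absolute = new Uri(baseUri, "../images/logo.png");
Console.WriteLine(absolute); // https://example.com/images/logo.png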
The example in the accepted answer didn't compile for me with the latest version, so I tried something else:
private List<string> ParseLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    return nodes == null
        ? new List<string>()
        : nodes.Select(n => n.Attributes["href"].Value).ToList();
}
This works for me.
Maybe I am too late here to post an answer. The following worked for me (MainImageNode is a node I had already selected; note the ?.Value, since FirstOrDefault returns the attribute itself):
var MainImageString = MainImageNode.Attributes.Where(i => i.Name == "src").FirstOrDefault()?.Value;
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;
Source:
https://html-agility-pack.net/select-nodes
You also need to take into account the document's base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).
For more information check:
<base>: The Document Base URL element page on MDN
The Protocol-relative URL article by Paul Irish
What are the recommendations for html tag? discussion on StackOverflow
Uri Constructor (Uri, Uri) page on MSDN
Uri class doesn't handle the protocol-relative URL discussion on StackOverflow
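Putting those together, a hedged sketch (the page URL is a placeholder, and the protocol-relative case is handled by hand because a bare "//host/path" string cannot be parsed on its own):
var pageUri = new Uri("https://example.com/page.html"); // placeholder: where the document came from
// Prefer <base href="..."> as the base URI when the document declares one.
var baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
if (baseNode != null)
    pageUri = new Uri(pageUri, baseNode.Attributes["href"].Value);
// Resolve an extracted href, handling protocol-relative values explicitly.
string href = "//www.foo.com/bar/";
Uri resolved = href.StartsWith("//")
    ? new Uri(pageUri.Scheme + ":" + href) // => https://www.foo.com/bar/
    : new Uri(pageUri, href);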
Late post, but here's a 2021 update to the accepted answer (it fixes the refactoring that HtmlAgilityPack made):
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// The XPath below gets images.
// It is specific to a site. Yours will vary ...
string command = "//a[contains(concat(' ', @class, ' '), ' product-card ')]//img";
List<string> listImages = new();
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
{
    // Using "data-src" below, but it may be "src" for you
    listImages.Add(node.Attributes["data-src"].Value);
}
How do I parse a complete HTML web page, not just specific nodes, using the HTML Agility Pack or any other technique?
I am using this code, but it only parses specific nodes; I need the complete page parsed into neat and clean contents.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
Use SelectNodes("//*"). The asterisk is the wildcard selector, and "//*" will match every element on the page.
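For example, a quick sketch that lists the name of every element on the page:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    Console.WriteLine(node.Name);
}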
Currently I use .NET's WebBrowser.Document.Images() to do this. It requires the WebBrowser to load the document, which is messy and takes up resources.
According to this question, XPath is better than a regex at this.
Anyone know how to do this in C#?
If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.
Otherwise you can try this function, which will return all image links from the HTML source:
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        links.Add(new Uri(href)); // note: new Uri(...) throws for relative src values
    }
    return links;
}
And you can use it like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
    }
}
The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there; how much of it is really well formed? I needed to do something similar: parse out all links in a document and (in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).
Here's a snippet for iterating over links in a document:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// Run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
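Since the original goal was rewriting the links, the update step might look like this (a sketch; Rewrite is a hypothetical helper standing in for whatever transformation you need):
foreach (HtmlNode linkNode in linkNodes)
{
    HtmlAttribute attrib = linkNode.Attributes["href"];
    attrib.Value = Rewrite(attrib.Value); // hypothetical rewrite helper
}
doc.Save(@"C:\Sample.Rewritten.HTM"); // persist the modified document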
Original Blog Post
If all you need is images, I would just use a regular expression. Something like this should do the trick:
Regex rg = new Regex(#"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
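Assuming html holds the page source, pulling the captured src values out is then just (a sketch):
foreach (Match m in rg.Matches(html))
{
    Console.WriteLine(m.Groups[1].Value); // group 1 is the src value
}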
If it's valid XHTML, you could do this:
XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/#src");
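SelectNodes with an attribute path returns the src attribute nodes themselves, so reading the values is just:
foreach (XmlNode attr in results)
{
    Console.WriteLine(attr.Value); // the raw src text
}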