Html Agility Pack - Select Divs inside Div - c#

I'm fairly new to the HTML Agility Pack. I've been searching and trying many examples but haven't reached a solution yet, so I must be doing something wrong. I hope you can assist me.
My goal is to parse the latest news from a website, including image, title and date, which should be pretty simple. I managed to get the image (from the background-image in the style attribute) of the outer div, but the divs are nested and for some reason I can't access the values of the inner ones. Here is my code:
using System;
using HtmlAgilityPack;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var html = @"https://pristontale.eu/";
HtmlWeb web = new HtmlWeb();
var doc = web.Load(html);
var news = doc.DocumentNode.SelectNodes("//div[contains(@class,'index-article-wrapper')]");
foreach (var item in news){
var image = Regex.Match(item.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
var title = item.SelectSingleNode("//div[@class='article-title']").InnerText;
var date = item.SelectSingleNode("//div[@class='article-date']").InnerText;
Console.WriteLine(image, title, date);
}
}
}
This is what the HTML looks like
<div class="index-article-wrapper" onclick="location.href='article.php?id=2';" style="background-image: url(https://cdn.discordapp.com/attachments/765749063621935104/884439050562461696/1_1.png)">
<div class="meta-wrapper">
<div class="article-date">5 Sep, 2021</div>
<div class="article-title">Server merge v1.264 update</div>
</div>
</div>
Currently it correctly finds all 4 news articles but only prints the image. How do I get the title and date of each? I have a fiddle here: https://dotnetfiddle.net/BVcAmH
Appreciate the help

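I just realized the scraping code was not the problem; the only flaw was the Console.WriteLine call. The overload Console.WriteLine(string, object, object) treats its first argument as a composite format string, and because the image URL contains no {0} or {1} placeholders, title and date were silently dropped.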
Wrong
Console.WriteLine(image, title, date);
Correct
Console.WriteLine(image + " " + " " + title + " " + date);
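A separate point worth checking in the question's loop: an XPath that starts with // is evaluated from the document root even when it is called on item, so every iteration can return the first article's title and date. Prefixing the expression with a dot (.//) makes it relative to the current node. The sketch below illustrates the rule with System.Xml's XmlDocument as a stand-in so that it runs without any package; the same XPath rule applies to HtmlAgilityPack's SelectSingleNode. The second article and the names RelativeXPathDemo and Titles are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RelativeXPathDemo
{
    // Two articles; the second one is a made-up placeholder.
    const string Markup = @"<root>
<div class='index-article-wrapper'><div class='meta-wrapper'>
<div class='article-date'>5 Sep, 2021</div>
<div class='article-title'>Server merge v1.264 update</div>
</div></div>
<div class='index-article-wrapper'><div class='meta-wrapper'>
<div class='article-date'>1 Jan, 2000</div>
<div class='article-title'>Placeholder second article</div>
</div></div>
</root>";

    // Runs the given title XPath once per article wrapper and collects the results.
    public static List<string> Titles(string titleXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        var titles = new List<string>();
        foreach (XmlNode item in doc.SelectNodes("//div[contains(@class,'index-article-wrapper')]"))
        {
            titles.Add(item.SelectSingleNode(titleXPath).InnerText);
        }
        return titles;
    }

    public static void Main()
    {
        // Absolute path: searched from the document root, so the first title repeats.
        Console.WriteLine(string.Join(" | ", Titles("//div[@class='article-title']")));
        // Relative path: searched below each item, so each article yields its own title.
        Console.WriteLine(string.Join(" | ", Titles(".//div[@class='article-title']")));
    }
}
```

In the question's code this would mean using item.SelectSingleNode(".//div[@class='article-title']") and the equivalent for article-date.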

Related

Proper way to output HTML to the page

I realize this is probably a fundamental thing I should know, but I am teaching myself C# and ASP.NET, so I am a little lost at this point.
I have a stored procedure, which will return around 700-800 image URLs.
I need to build HTML like this for all the 800 image URLs and return the HTML to the page:
<div class="tile">
<img src="source.png" height="100" width="100" />
</div>
This is my code currently:
if (reader.HasRows)
{
string flagwallcontent = ""; //using a string to build html
while (reader.Read())
{
flagwallcontent = flagwallcontent + "<div class='tile'>";
flagwallcontent = flagwallcontent + "<img src='" + reader.GetString(0) + "' height='100' width='100'/>";
flagwallcontent = flagwallcontent + "</div>";
}
FlagWallLiteral.Text = flagwallcontent; //returning html to asp literal
}
I feel that this is not an efficient way to do it: using a string to build the HTML and returning it through an ASP.NET Literal. Can you suggest what the best way would be?
One way you could do it is to have an ASP.NET Panel (which renders as a <div>) to be the container for each of X hundred images you will add, like this:
if (reader.HasRows)
{
while (reader.Read())
{
// Create new image control
var newImage = new Image ();
newImage.ImageUrl = reader.GetString(0);
newImage.Height = 100;
newImage.Width = 100;
// Add new image to panel
Panel1.Controls.Add (newImage);
}
}
This allows ASP.NET to render the HTML while working with the strongly typed C# API.
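If you would rather keep the Literal approach from the question, a StringBuilder avoids creating a new string on every concatenation, which matters more as the row count grows. This is a minimal sketch, not part of the answer above: the class name TileHtml and the method Build are made up for illustration, and encoding each URL with WebUtility.HtmlEncode is an assumption about what the data needs.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class TileHtml
{
    // Builds one <div class="tile"> per URL using a single growing buffer.
    public static string Build(IEnumerable<string> imageUrls)
    {
        var sb = new StringBuilder();
        foreach (string url in imageUrls)
        {
            sb.Append("<div class=\"tile\">");
            // Encode the URL so quotes or ampersands in the data cannot break the attribute.
            sb.Append("<img src=\"").Append(WebUtility.HtmlEncode(url)).Append("\" height=\"100\" width=\"100\" />");
            sb.Append("</div>");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "source.png" is the sample file name from the question.
        Console.WriteLine(Build(new[] { "source.png" }));
    }
}
```

In the page code, the URLs would come from reader.GetString(0) inside the while loop, and the result would be assigned to FlagWallLiteral.Text once at the end.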

VB.NET / C#.Net MSHTML: Unable to get "name" attribute from Outerhtml after using "setAttribute('name',value)" for certain elements

I am developing a WYSIWYG application for internal use at my company, with custom integration with the company's existing tools.
I am unable to get the "name" attribute of certain elements, especially INPUT elements, when reading the HTML string via ".OuterHtml".
Code example:
Dim inElem as windows.forms.htmlElement = hdoc.CreateElement("INPUT")
inElem.Id = "txt01"
inElem.setAttribute("name", inElem.Id)
inElem.setAttribute("type", "text")
inElem.setAttribute("placeholder","text here....")
' Append the created element to the HTML body
hdoc.Body.AppendChild(inElem)
--> Getting html string:
** hdoc.body.getElementById("txt01").OuterHtml => "<input id=txt01 placeholder='text here....'></input>"
--> What I really want is:
** hdoc.body.getElementById("txt01").OuterHtml => "<input id=txt01 placeholder='text here....' type='text' name='txt01'></input>"
Note that it is not only the name attribute that is missing; others are too (e.g. TYPE).
Could anyone help me with this?
Solution attempted:
For Each inputEle As Windows.Forms.HtmlElement In hdoc.Body.GetElementsByTagName("input")
CType(inputEle.DomElement, mshtml.IHTMLInputElement).name = inputEle.Id
Next
** FAILED ** :(
ULTIMATE SOLUTION:
Use HTML Agility Pack:
----------------------
Dim inputEle3 As HtmlAgilityPack.HtmlNode = new_wb.CreateElement("input")
inputEle3.Attributes.Add("id", "txt01")
inputEle3.Attributes.Add("name", inputEle3.Id)
inputEle3.Attributes.Add("type", "text")
inputEle3.Attributes.Add("placeholder", "text here ....")
RESULT:
-------
inputEle3.OuterHtml => <input id="txt01" name="txt01" type="text" placeholder="text here ...." >
It works now, provided I use HtmlAgilityPack.dll :(
Microsoft mshtml sucks! :(
This is what worked for me. Forgive me using dynamic datatype, I do not have the mshtml library on my Visual Studio for some reason.
private void Form1_Load(object sender, EventArgs e)
{
this.webBrowser1.Navigate("about:blank");
this.webBrowser1.Document.Write("<INPUT id='hell' class='blah' placeholder='text here' name='hell' type='text'></INPUT>");
dynamic htmldoc = webBrowser1.Document.DomDocument as dynamic;
dynamic node = htmldoc.getElementById("hell") as dynamic;
string x = node.OuterHtml; //gets name but not type
string s = node.GetAttribute["type"]; //gets type
string name = node.GetAttribute["name"]; //gets name
}
So OuterHtml per se did not include the attribute, but calling the GetAttribute method did work. Hopefully this helps.

Replacing A Node Using HtmlAgilityPack Throwing Strange Error

I have a webpage that displays a table the user can edit. After the edits are made I want to save the table as a .html file that I can convert to an image later. I am doing this by overriding the render method. However, I want to remove two buttons and a DropDownList from the final version so that I just get the table by itself. Here is the code I am currently trying:
protected override void Render(HtmlTextWriter writer)
{
using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new StringWriter()))
{
base.Render(htmlwriter);
string renderedContent = htmlwriter.InnerWriter.ToString();
string output = renderedContent.Replace(@"<input type=""submit"" name=""viewReport"" value=""View Report"" id=""viewReport"" />", "");
output = output.Replace(@"<input type=""submit"" name=""redoEdits"" value=""Redo Edits"" id=""redoEdits"" />", "");
var doc = new HtmlDocument();
doc.LoadHtml(output);
var query = doc.DocumentNode.Descendants("select");
foreach (var item in query.ToList())
{
var newNodeStr = "<div></div>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
File.WriteAllText(currDir + "\\outputFile.html", output);
writer.Write(renderedContent);
}
}
Where I have adapted this solution found in another SO post about replacing nodes with HtmlAgilityPack:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
and here is the rendered HTML I am trying to alter:
<select name="Archives" onchange="javascript:setTimeout('__doPostBack(\'Archives\',\'\')', 0)" id="Archives" style="width:200px;">
<option selected="selected" value="Dashboard_Jul-2012">Dashboard_Jul-2012</option>
<option value="Dashboard_Jun-2012">Dashboard_Jun-2012</option>
</select>
The two calls to Replace are working as expected and removing the buttons. However this line:
var query = doc.DocumentNode.Descendants("select");
is throwing this error:
Method not found: 'Int32 System.Environment.get_CurrentManagedThreadId()'.
Any advice is appreciated.
Regards.
It looks like you are using the .NET 4.5 build of the Agility Pack in a project targeting .NET 4.0 or lower; Environment.CurrentManagedThreadId was only added in .NET 4.5. Either change the DLL reference to the build compiled for your framework version, or retarget your project to .NET 4.5 (if you're using VS 2012, that is).
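Once the reference matches the framework, one more detail in the question's Render method may be worth a look: the file is written from the output string, which was captured before the nodes were replaced, so the edited document would never reach disk. A minimal sketch of the replace-then-serialize step follows, assuming the HtmlAgilityPack package is referenced; the class name SelectStripper and the method ReplaceSelects are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SelectStripper
{
    // Replaces every <select> with an empty <div> and returns the edited markup.
    public static string ReplaceSelects(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (HtmlNode item in doc.DocumentNode.Descendants("select").ToList())
        {
            HtmlNode replacement = HtmlNode.CreateNode("<div></div>");
            item.ParentNode.ReplaceChild(replacement, item);
        }

        // Serialize the modified document rather than reusing the input string.
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<p>Archive:</p><select name=\"Archives\" id=\"Archives\"><option value=\"Dashboard_Jul-2012\">Dashboard_Jul-2012</option></select>";
        Console.WriteLine(ReplaceSelects(html));
    }
}
```

In the Render override, the value passed to File.WriteAllText would then be the returned string (or doc.DocumentNode.OuterHtml) instead of output.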

Scraping a webpage with C# and HTMLAgility

I have read that HTMLAgility 1.4 is a great solution to scraping a webpage. Being a new programmer I am hoping I could get some input on this project.
I am doing this as a C# Windows Forms application. The page I am working with is fairly straightforward. The information I need sits between just 2 tags, <table class="data"> and </table>.
My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, Last Modified By, out of the page and send the data to a SQL table.
One twist is that there is also a small PNG picture that needs to be grabbed from the src="/partcode/number.
I do not have any completed code that works. I thought this bit of code would tell me if I am heading in the right direction, but even stepping through it in the debugger I can't see that it does anything. Could someone point me in the right direction on this? The more detail the better, since it is apparent I have a lot to learn.
Thank you I would really appreciate it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
namespace Stats
{
class PartParser
{
static void Main(string[] args)
{
try
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
//My understanding this reads the entire page in?
var tables = doc.DocumentNode.SelectNodes("//table");
// I assume that this sets up the search for words containing table
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
Console.ReadKey();
}
}
}
}
The web code is:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td>
<img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td>
<img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2009, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
</table>
<head/>
</html>
The beginning part is off:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
LoadHtml(html) parses an HTML string that you pass in; it does not download anything, so here the URL itself is parsed as the page content. I think you want something like this instead:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load("http://stackoverflow.com");
Here is working code based on the HTML source you provided. It could be refactored, and I'm not checking for null values (in rows, cells, or the values inside each case). If the page is served at 127.0.0.1, it will work. Just paste it into the Main method of a console application and try to understand it.
HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");
var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
var cells = row.SelectNodes("./td");
string title = cells[0].InnerText;
var valueRow = cells[2];
switch (title)
{
case "Part-Num":
string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Part-Num:\t" + partNum);
break;
case "Manu-Number":
string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Manu-Num:\t" + manuNumber);
break;
case "Description":
string description = valueRow.InnerText;
Console.WriteLine("Description:\t" + description);
break;
case "Manu-Country":
string manuCountry = valueRow.InnerText;
Console.WriteLine("Manu-Country:\t" + manuCountry);
break;
case "Last Modified":
string lastModified = valueRow.InnerText;
Console.WriteLine("Last Modified:\t" + lastModified);
break;
case "Last Modified By":
string lastModifiedBy = valueRow.InnerText;
Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
break;
}
}
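The answer above skips null checks, and SelectNodes returns null rather than an empty collection when nothing matches. The sketch below is one way those checks could look, assuming the HtmlAgilityPack package is referenced. It parses an inline string with LoadHtml so it runs without a web server, and collects label/value pairs into a dictionary instead of a switch. The names PartTable and Parse are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PartTable
{
    // Returns label -> value for each row of <table class="data">.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml takes markup, not a URL

        var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
        if (rows == null) return result; // no matching table or rows

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue; // skip malformed rows

            string label = cells[0].InnerText.Trim();
            // Rows with an image carry the value in its alt attribute.
            HtmlNode img = cells[2].SelectSingleNode("./img[@alt]");
            string value = img != null
                ? img.GetAttributeValue("alt", "")
                : cells[2].InnerText.Trim();
            result[label] = value;
        }
        return result;
    }

    public static void Main()
    {
        string html = "<table class=\"data\">"
            + "<tr><td>Part-Num</td><td width=\"50\"></td><td><img src=\"/partcode/number/072140\" alt=\"072140\"/></td></tr>"
            + "<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>"
            + "</table>";
        foreach (var pair in Parse(html))
        {
            Console.WriteLine(pair.Key + ":\t" + pair.Value);
        }
    }
}
```

The dictionary can then be mapped onto SQL parameters, and the img src attribute can be read the same way as alt if the picture itself needs to be downloaded.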

Reading links in header using WebKit.NET

I am trying to figure out how to read header links using C#.NET. I want to get the edit link from Browser1 and put it in browser 2. My problem is that I can't figure out how to get at attributes, or even the link tags for that matter. Below is what I have now.
using System.Xml.Linq;
...
string source = webKitBrowser1.DocumentText.ToString();
XDocument doc = new XDocument(XDocument.Parse(source));
webKitBrowser2.Navigate(doc.Element("link").Attribute("href").Value.ToString());
This would work except that XML is stricter than HTML, and right off the bat the parser complains that it was expecting "doctype" to be uppercase.
I finally figured it out, so I will post it for anyone who has the same question.
string site = webKitBrowser1.Url.Scheme + "://" + webKitBrowser1.Url.Authority;
WebKit.DOM.Document doc = webKitBrowser1.Document;
WebKit.DOM.NodeList links = doc.GetElementsByTagName("link");
WebKit.DOM.Element link;
string editlink = "none";
foreach (var item in links)
{
link = (WebKit.DOM.Element)item;
if (link.Attributes["rel"].NodeValue == "edit") { editlink = link.Attributes["href"].NodeValue; }
}
if (editlink != "none") { webKitBrowser2.Navigate(site + editlink); }
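For the original attempt of parsing DocumentText, an HTML-tolerant parser avoids the XML strictness problem. The following is only a sketch of that alternative, assuming the HtmlAgilityPack package is referenced; it is not what the answer above does. The names EditLinkFinder and FindEditLink, and the sample markup, are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class EditLinkFinder
{
    // Returns the href of the first <link rel="edit">, or null if there is none.
    public static string FindEditLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of lowercase doctype and unclosed tags

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//link[@rel='edit']");
        return link == null ? null : link.GetAttributeValue("href", null);
    }

    public static void Main()
    {
        // Hypothetical page source standing in for webKitBrowser1.DocumentText.
        string html = "<!doctype html><html><head><link rel='stylesheet' href='/site.css'><link rel='edit' href='/edit/42'></head><body></body></html>";
        Console.WriteLine(FindEditLink(html) ?? "none");
    }
}
```

The returned path would still need the scheme and authority prepended before navigating, as in the answer above.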
