I'm trying to program an API for discord and I need to retrieve two pieces of information out of the HTML code of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N), specifically the URL of the Character Picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the name of the character (in this case Youji Kudou). Afterwards I need to save those two pieces of information to JSON.
I am using HtmlAgilityPack for this, but I can't quite figure it out. The following is my first attempt:
public static void Main()
{
    var html = "https://myanimelist.net/character/214";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
    foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
    {
        Console.WriteLine(node.InnerHtml);
    }
}
Unfortunately, this produces no output. If I followed the path correctly (which is probably my first mistake), it should be "tr/td/div/a/img". It runs without errors, yet I get no output.
My second attempt is:
public static void Main()
{
    var html = "https://myanimelist.net/character/214";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var script = htmlDoc.DocumentNode.Descendants()
        .Where(n => n.Name == "tr/td/a/img")
        .First().InnerText;
    // Return the data of spect and stringify it into a proper JSON object
    var engine = new Jurassic.ScriptEngine();
    var result = engine.Evaluate("(function() { " + script + " return src; })()");
    var json = JSONObject.Stringify(engine, result);
    Console.WriteLine(json);
    Console.ReadKey();
}
But this also doesn't work.
How can I extract the required information?
EDIT:
So, I've gotten quite a bit further now, and I've found a solution for finding the link. It was rather simple. But now I'm stuck on finding the name of the character. The website is structured the same for every other link (only the last number changes), so I want to process many different pages via a for loop. Here's how I tried to do it:
for (int i = 1; i <= 1000; i++)
{
    HtmlWeb web = new HtmlWeb();
    var html = "https://myanimelist.net/character/" + i;
    var htmlDoc = web.Load(html);
    foreach (var item in htmlDoc.DocumentNode.SelectNodes("//*[@src]"))
    {
        string n;
        n = item.GetAttributeValue("src", "");
        foreach (var item2 in htmlDoc.DocumentNode.SelectNodes("//*[@src and @alt='" + n + "']"))
        {
            Console.WriteLine(item2.GetAttributeValue("src", ""));
        }
    }
}
In the first foreach I try to search for the name, which always appears at the same position (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91, and to be specific: http://prntscr.com/o1xzbk), but I haven't found out how yet, since the structure in the HTML doesn't have any body type I can follow. The second foreach loop searches for the URL, which works by now, and n should give me the name, so I can figure it out for each different character.
I was able to extract the character name and image from https://myanimelist.net/character/214 using the following method:
public static CharacterData ExtractCharacterNameAndImage(string url)
{
    //Use the following if you are OK with hardcoding the structure of <div> elements.
    //var tableXpath = "/html/body/div[1]/div[3]/div[3]/div[2]/table";
    //Use the following if you are OK with hardcoding the fact that the relevant table comes first.
    var tableXpath = "/html/body//table";
    var nameXpath = "tr/td[2]/div[4]";
    var imageXpath = "tr/td[1]/div[1]/a/img";
    var htmlDoc = new HtmlWeb().Load(url);
    var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
    var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
    var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
    return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
Where CharacterData is defined as follows:
public class CharacterData
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}
Afterwards, the character data can be serialized to JSON using any of the tools from How to write a JSON file in C#?, e.g. json.net:
var url = "https://myanimelist.net/character/214";
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
Which outputs
{
"Name": "Youji Kudou",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
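If you'd rather avoid the Json.NET dependency, the same output shape can be produced with System.Text.Json, which ships with .NET Core 3.0 and later. This is just a sketch mirroring the values shown above; `WriteIndented` plays the role of `Formatting.Indented`:

```csharp
using System;
using System.Text.Json;

public class CharacterData
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        var data = new CharacterData
        {
            Name = "Youji Kudou",
            ImageUrl = "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
            Url = "https://myanimelist.net/character/214"
        };
        // WriteIndented = true corresponds to Json.NET's Formatting.Indented.
        var json = JsonSerializer.Serialize(data, new JsonSerializerOptions { WriteIndented = true });
        Console.WriteLine(json);
    }
}
```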
If you would prefer the Name to include the Japanese in parentheses, replace GetDirectInnerText() with just InnerText, which results in:
{
"Name": "Youji Kudou (工藤耀爾)",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
Alternatively, if you prefer you could pull the character name from the document title:
var title = string.Concat(htmlDoc.DocumentNode.SelectNodes("/html/head/title").Select(n => n.InnerText.Trim()));
var index = title.IndexOf("- MyAnimeList.net");
if (index >= 0)
title = title.Substring(0, index).Trim();
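Wrapped as a helper, the title-trimming step looks like this (a sketch; `CleanTitle` is a hypothetical name, and the "- MyAnimeList.net" suffix is taken from the page titles shown above):

```csharp
using System;

public static class TitleHelper
{
    // Strips the trailing "- MyAnimeList.net" site suffix from a page title, if present.
    public static string CleanTitle(string title)
    {
        var index = title.IndexOf("- MyAnimeList.net", StringComparison.Ordinal);
        return index >= 0 ? title.Substring(0, index).Trim() : title.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(CleanTitle("Youji Kudou (工藤耀爾) - MyAnimeList.net"));
        // → Youji Kudou (工藤耀爾)
    }
}
```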
How did I determine the correct XPath strings?
Firstly, using Firefox 66, I opened the debugger and loaded https://myanimelist.net/character/214 in the window with the debugging tools visible.
Next, following the instructions from How to find xpath of an element in firefox inspector, I selected the Youji Kudou (工藤耀爾) node and copied its XPath, which turned out to be:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
I then tried to select this node using SelectNodes()... and got a null result. But why? To determine this I created a debugging routine that would break the path into successively longer portions and determine where the failure occurs:
static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
    Console.WriteLine("\nInput path: " + xpath);
    var splitPath = xpath.Split('/');
    for (int i = 2; i <= splitPath.Length; i++)
    {
        if (splitPath[i - 1] == "")
            continue;
        var thisPath = string.Join("/", splitPath, 0, i);
        Console.Write("Testing \"{0}\": ", thisPath);
        var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
        Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
    }
}
This output the following:
Input path: /html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
Testing "/html": result count = 1
Testing "/html/body": result count = 1
Testing "/html/body/div[1]": result count = 1
Testing "/html/body/div[1]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]": result count = null
As you can see, something goes wrong selecting the <tbody> path element. Manual inspection of the InnerHtml returned by selecting /html/body/div[1]/div[3]/div[3]/div[2]/table revealed that, for some reason, the server is not including the <tbody> tag when returning HTML to the HtmlWeb object -- possibly due to some difference in request header(s) provided by Firefox vs HtmlWeb. Once I omitted the tbody path element I was able to query for the character name successfully using:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]
A similar process provided the following working path for the image:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img
Since the two queries are finding contents in the same <table>, in my final code I selected the table only once in a separate step, and removed some of the hardcoding as to the specific nesting of <div> elements.
Demo fiddle here.
Alright, to finish it up, I've rounded off the code, gratefully assisted by dbc, and implemented it nearly completely into the project. In case someone later has an identical question, here it goes. This outputs the character names, links and images for a defined range of IDs, writes them into a JSON file, and could be adapted for other websites.
using System;
using System.Linq;
using Newtonsoft.Json;
using HtmlAgilityPack;
using System.IO;

namespace SearchingHTML
{
    public class CharacterData
    {
        public string Name { get; set; }
        public string ImageUrl { get; set; }
        public string Url { get; set; }
    }

    public class Program
    {
        public static CharacterData ExtractCharacterNameAndImage(string url)
        {
            var tableXpath = "/html/body//table";
            var nameXpath = "tr/td[2]/div[4]";
            var imageXpath = "tr/td[1]/div[1]/a/img";
            var htmlDoc = new HtmlWeb().Load(url);
            var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
            var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
            var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
            return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
        }

        public static void Main()
        {
            int max = 10000;
            string fileName = @"C:\Users\path of your file.json";
            Console.WriteLine("Environment version: " + Environment.Version);
            Console.WriteLine("Json.NET version: " + typeof(JsonSerializer).Assembly.FullName);
            Console.WriteLine("HtmlAgilityPack version: " + typeof(HtmlDocument).Assembly.FullName);
            Console.WriteLine();
            for (int i = 6; i <= max; i++)
            {
                try
                {
                    var url = "https://myanimelist.net/character/" + i;
                    var htmlDoc = new HtmlWeb().Load(url);
                    var data = ExtractCharacterNameAndImage(url);
                    var json = JsonConvert.SerializeObject(data, Formatting.Indented);
                    Console.WriteLine(json);
                    using (TextWriter tsw = new StreamWriter(fileName, true))
                    {
                        tsw.WriteLine(json);
                    }
                }
                catch (Exception)
                {
                    // Skip character IDs that fail to load or parse.
                }
            }
        }
    }
}
/*******************************************************************************************************************************
*****************************************************IF TESTING IS REQUIRED****************************************************
*******************************************************************************************************************************
*
* static void TestSelect(HtmlDocument htmlDoc, string xpath)
* {
*     Console.WriteLine("\nInput path: " + xpath);
*     var splitPath = xpath.Split('/');
*     for (int i = 2; i <= splitPath.Length; i++)
*     {
*         if (splitPath[i - 1] == "")
*             continue;
*         var thisPath = string.Join("/", splitPath, 0, i);
*         Console.Write("Testing \"{0}\": ", thisPath);
*         var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
*         Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
*     }
* }
*******************************************************************************************************************************
*********************************************FOR TESTING ENTER THIS INTO MAIN CLASS********************************************
*******************************************************************************************************************************
*
* var url2 = "https://myanimelist.net/character/256";
* var data2 = ExtractCharacterNameAndImage(url2);
* var json2 = JsonConvert.SerializeObject(data2, Formatting.Indented);
* Console.WriteLine(json2);
* var nameXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
* var imageXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div[1]/a/img";
* TestSelect(htmlDoc, nameXpathFromFirefox);
* TestSelect(htmlDoc, imageXpathFromFirefox);
* var nameXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]";
* var imageXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img";
* TestSelect(htmlDoc, nameXpathFromFirefoxFixed);
* TestSelect(htmlDoc, imageXpathFromFirefoxFixed);
*******************************************************************************************************************************
*******************************************************************************************************************************
*******************************************************************************************************************************
*/
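One caveat with the loop above: appending each serialized object to the same file produces a sequence of separate JSON objects, not one valid JSON document. If a single parseable file is wanted, an alternative is to collect the results in a list and serialize it once after the loop. The sketch below shows the idea with System.Text.Json to keep it dependency-free here; Json.NET's SerializeObject on the list works the same way:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public class CharacterData
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        var results = new List<CharacterData>();
        // Inside the scraping loop you would do: results.Add(data);
        results.Add(new CharacterData
        {
            Name = "Youji Kudou",
            ImageUrl = "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
            Url = "https://myanimelist.net/character/214"
        });
        // One serialization call after the loop yields a single JSON array: [ { ... }, ... ]
        var json = JsonSerializer.Serialize(results, new JsonSerializerOptions { WriteIndented = true });
        Console.WriteLine(json);
        // File.WriteAllText(fileName, json); // write the file once instead of appending per item
    }
}
```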
I'm trying to modify the code below to scrape jobs from www.itoworld.com/careers. The jobs are in a table format, and I return all the <td> values.
I believe it comes from the line:
var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling
However, I want it to write:
<a class="std-btn" href="http://www.itoworld.com/office-manager/">Office Manager</a>
Currently it is writing
<a href='http://www.itoworld.com/office-manager/' target='_blank'>Office ManagerOffice & AdminCambridgeFind out more</a>
I plan on 'brute force' modifying the output to remove unnecessary extras but was hoping there is a smarter way to do this. Is there a way for example to remove the second and third ParentNode after they have been called? (So they do not get written?)
public string ExtractIto()
{
    string sUrl = "http://www.itoworld.com/careers/";
    GlobusHttpHelper ghh = new GlobusHttpHelper();
    List<Links> link = new List<Links>();
    bool Next = true;
    int count = 1;
    string html = ghh.getHtmlfromUrl(new Uri(string.Format(sUrl)));
    HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
    hd.LoadHtml(html);
    var hn = hd.DocumentNode.SelectSingleNode("//*[@class='btn-wrapper']");
    var hnc = hn.SelectNodes(".//a");
    foreach (var node in hnc)
    {
        try
        {
            var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling;
            Links l = new Links();
            l.Name = ParseHtmlContainingText(parentnode.InnerText);
            l.Link = node.GetAttributeValue("href", "");
            link.Add(l);
        }
        catch { }
    }
    string Xml = getXml(link);
    return WriteXml(Xml);
}
For completeness below is the definition of ParseHtmlContainingText
public string ParseHtmlContainingText(string htmlString)
{
    return Regex.Replace(Regex.Replace(WebUtility.HtmlDecode(htmlString), @"<[^>]+>|\u00A0", ""), @"\s{2,}", " ").Trim();
}
You just need to create a "name node" and use that for your parse method.
I tested with this code and it worked for me.
var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling;
var nameNode = parentnode.FirstChild;
Links l = new Links();
l.Name = ParseHtmlContainingText(nameNode.InnerText);
l.Link = node.GetAttributeValue("href", "");
<div id="targetSummaryCount" class="large-text" data-reactid=".0.0.0.3.0.0.1.0.0.0.0.$target_key_30.1.1.0">
<span data-reactid=".0.0.0.3.0.0.1.0.0.0.0.$target_key_30.1.1.0.0">0</span>
<span data-reactid=".0.0.0.3.0.0.1.0.0.0.0.$target_key_30.1.1.0.1">/</span>
<span data-reactid=".0.0.0.3.0.0.1.0.0.0.0.$target_key_30.1.1.0.2">20</span>
</div>
I want to get the '0' and '20' values from the above html, using the div id 'targetSummaryCount'. Is there any way to do this in Selenium C# without using xpath?
You can use a CSS selector:
driver.FindElements(By.CssSelector("#targetSummaryCount > span"))
This would match all span elements directly under the element with id="targetSummaryCount".
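If you only need to sanity-check the extraction against the static snippet from the question (which happens to be well-formed XML), the same selection can be reproduced without a browser using System.Xml.Linq. This is only a sketch for the fragment above, with the data-reactid attributes omitted for brevity; against the live page you would use the Selenium call instead:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class Demo
{
    public static void Main()
    {
        var html = @"<div id=""targetSummaryCount"" class=""large-text"">
            <span>0</span><span>/</span><span>20</span>
        </div>";
        var div = XElement.Parse(html);
        // Direct child spans, in document order, mirroring "#targetSummaryCount > span".
        var spans = div.Elements("span").Select(s => s.Value.Trim()).ToList();
        Console.WriteLine(spans.First()); // first span: "0"
        Console.WriteLine(spans.Last());  // last span: "20"
    }
}
```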
I have posted the solution in Java; please convert it to C#.
WebElement divSummary = driver.findElementById("targetSummaryCount");
List<WebElement> targetKeys = divSummary.findElements(By.tagName("span"));

// For getting the first and last element
targetKeys.get(0);
targetKeys.get(targetKeys.size() - 1);

// For maintaining a map of elements with text 0 or 20
HashMap<Integer, List<WebElement>> elementmap = new HashMap<Integer, List<WebElement>>();
List<WebElement> webElements1 = new ArrayList<WebElement>();
List<WebElement> webElements2 = new ArrayList<WebElement>();
for (WebElement elem : targetKeys) {
    if (elem.getText().trim().equals("0")) {
        webElements1.add(elem);
    }
    else if (elem.getText().trim().equals("20")) {
        webElements2.add(elem);
    }
}
elementmap.put(0, webElements1);
elementmap.put(20, webElements2);
// elementmap.get(0) contains the span tags with text "0" under the targetSummaryCount div
// elementmap.get(20) contains the span tags with text "20" under the targetSummaryCount div
To get the "0" and "20" with an XPath:
string value_00 = driver.FindElementByXPath("//div[@id='targetSummaryCount']/span[1]").Text;
string value_20 = driver.FindElementByXPath("//div[@id='targetSummaryCount']/span[3]").Text;
// or
var elements = driver.FindElementsByXPath("//div[@id='targetSummaryCount']/span");
string value_00 = elements[0].Text;
string value_20 = elements[2].Text;
And with a Css Selector:
string value_00 = driver.FindElementByCssSelector("#targetSummaryCount > span:nth-child(1)").Text;
string value_20 = driver.FindElementByCssSelector("#targetSummaryCount > span:nth-child(3)").Text;
// or
var elements = driver.FindElementsByCssSelector("#targetSummaryCount > span:nth-child(odd)");
string value_00 = elements[0].Text;
string value_20 = elements[1].Text;
You can try this.
var spanCount = driver.FindElements(By.XPath("//*[@id='targetSummaryCount']/span")).Count;
var myResult = "";
for (int spanIndex = 1; spanIndex <= spanCount; spanIndex++) // XPath indices are 1-based
{
    var spanText = driver.FindElement(By.XPath("//*[@id='targetSummaryCount']/span[" + spanIndex + "]")).Text;
    if (spanText != "/")
    {
        myResult += "span:" + spanIndex + " spanText:" + spanText;
    }
}
You can try this. It fetches the span that is a child of the div with id targetSummaryCount and whose text is 0 or 20.
var span_00 = driver.FindElementByXPath("//div[@id='targetSummaryCount']/span[text()='0']");
var span_20 = driver.FindElementByXPath("//div[@id='targetSummaryCount']/span[text()='20']");
I'm trying to get a grip on AngleSharp by returning a specific part of an html.
So far that's my Code:
using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://www.planetradio.de/music/trackfinder.html");
    var parser = new HtmlParser();
    var document = parser.Parse(htmlCode);
    var blueListItemsLinq = document.All.Where(m => m.LocalName == "span id" && m.ClassList.Contains("headerTracklistCurrentSongArtist"));
    foreach (var item in blueListItemsLinq)
        label.Content = item;
}
What i want it to return is the current Artist which should be in the html under:
<div id="headerTracklistCurrentSong">
<span id="headerTracklistCurrentSongArtist">Olly Murs</span>
<span id="headerTracklistCurrentSongTitle">Kiss Me</span>
But I seem to have made a mistake, so I'd be glad if someone could help me here and explain it to me.
Thanks in advance to everyone answering. :)
The span elements are parsed as AngleSharp.Dom.Html.HtmlSpanElement
so your query should be:
var blueListItemsLinq = document.All.Where(m => m.LocalName == "span" && m.Id == "headerTracklistCurrentSongArtist");
Then you can get the text/values like this:
foreach (var item in blueListItemsLinq)
{
    label.Content = item.TextContent; // "Olly Murs"
    var child = item.FirstChild as AngleSharp.Dom.Html.IHtmlAnchorElement;
    var text = child.Text; // "Olly Murs"
    var path = child.PathName; // "/music/trackfinder.html"
}
UPDATED
Since the current artist names are shown in the table at "planetradio.de/music/trackfinder.html", you can get the names like this:
var hitfinderTable = document.All.Where(m => m.Id == "hitfindertable").First() as AngleSharp.Dom.Html.IHtmlTableElement;
foreach (var row in hitfinderTable.Rows)
{
    var artistName = row.Cells[2].TextContent;
}
I am trying to loop through my client side Html table contents on my c# server side. Setting the Html table to runat="server" is not an option because it conflicts with the javascript in use.
I use ajax to pass my clientside html table's InnerHtml to my server side method. I thought I would be able to simply create an HtmlTable variable in C# and set the InnerHtml property, when I quickly realized this is not possible because I got the error {"'HtmlTable' does not support the InnerHtml property."}
For simplicity, let's say my InnerHtml string passed from client to server is:
string myInnerHtml = "<colgroup>col width="100"/></colgroup><tbody><tr><td>hello</td></tr></tbody>"
I followed a post from another stack overflow question but can not quite get it working.
Can someone point out my errors?
string myInnerHtml = "<colgroup>col width="100"/></colgroup><tbody><tr><td>hello</td></tr></tbody>"
HtmlTable table = new HtmlTable();
System.Text.StringBuilder sb = new System.Text.StringBuilder(myInnerHtml);
System.IO.StringWriter tw = new System.IO.StringWriter(sb);
HtmlTextWriter hw = new HtmlTextWriter(tw);
table.RenderControl(hw);
for (int i = 0; i < table.Rows.Count; i++)
{
    for (int c = 0; c < table.Rows[i].Cells.Count; i++)
    {
        // get cell contents
    }
}
hope this helps,
string myInnerHtml = @"<table>
<colgroup>col width='100'/></colgroup>
<tbody>
<tr>
<td>hello 1</td><td>hello 2</td>
</tr>
<tr>
<td>hello 3</td><td>hello 4</td>
</tr>
</tbody>
</table>";
DataSet ds = new DataSet();
ds.ReadXml(new XmlTextReader(new StringReader(myInnerHtml)));
var tr = ds.Tables["tr"];
var td = ds.Tables["td"];
foreach (DataRow trRow in tr.Rows)
    foreach (DataRow tdRow in td.AsEnumerable().Where(x => (int)x["tr_Id"] == (int)trRow["tr_Id"]))
        Console.WriteLine(tdRow["tr_Id"] + " | " + tdRow["td_Text"]);
output
=========================
//0 | hello 1
//0 | hello 2
//1 | hello 3
//1 | hello 4
Ended up using the HTMLAgilityPack. I found it a lot more readable and manageable for diverse situations.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body><table>" + innerHtml + "</table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes(".//tr"))
    {
        foreach (HtmlNode cell in row.SelectNodes("td"))
        {
            var divExists = cell.SelectNodes("div");
            if (divExists != null)
            {
                foreach (HtmlNode div in cell.SelectNodes("div"))
                {
                    string test = div.InnerText + div.Attributes["data-id"].Value;
                }
            }
        }
    }
}