Parsing Nodes with HTML AgilityPack

I'm trying to get information from this page: http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets
The rows look like this when inspecting the elements:
I've tried this code, but it returns null every time for every node:
public class ItemSetsTransmog
{
public string ItemSetName { get; set; }
public string ItemSetId { get; set; }
}
public partial class Fmain : Form
{
DataTable Table;
HtmlWeb web = new HtmlWeb();
public Fmain()
{
InitializeComponent();
initializeItemSetTransmogTable();
}
private async void Fmain_Load(object sender, EventArgs e)
{
int PageNum = 0;
var itemsets = await ItemSetTransmogFromPage(0);
while (itemsets.Count > 0)
{
foreach (var itemset in itemsets)
Table.Rows.Add(itemset.ItemSetName, itemset.ItemSetId);
itemsets = await ItemSetTransmogFromPage(PageNum++);
}
}
private async Task<List<ItemSetsTransmog>> ItemSetTransmogFromPage(int PageNum)
{
String url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets";
if (PageNum != 0)
url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets:75+" + PageNum.ToString();
var doc = await Task.Factory.StartNew(() => web.Load(url));
var NameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
var IdNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
// if these are null it means the name/score nodes couldn't be found on the html page
if (NameNodes == null || IdNodes == null)
return new List<ItemSetsTransmog>();
var ItemSetNames = NameNodes.Select(node => node.InnerText);
var ItemSetIds = IdNodes.Select(node => node.InnerText);
return ItemSetNames.Zip(ItemSetIds, (name, id) => new ItemSetsTransmog() { ItemSetName = name, ItemSetId = id }).ToList();
}
private void initializeItemSetTransmogTable()
{
Table = new DataTable("ItemSetTransmogTable");
Table.Columns.Add("ItemSetName", typeof(string));
Table.Columns.Add("ItemSetId", typeof(string));
ItemSetTransmogDataView.DataSource = Table;
}
}
}
Why doesn't my code load any of these nodes? How can I fix it?

Your code does not load these nodes because they do not exist in the HTML that HTML Agility Pack pulls back. Most of the markup you have shown is probably generated by JavaScript. You can check this by inspecting the doc.ParsedText property in your ItemSetTransmogFromPage() method.
HTML Agility Pack is an HTTP client and parser; it does not run scripts. If you really need to get the data this way, you will need a "headless browser" such as Optimus to retrieve the page (caveat: I have not used this library, though a NuGet package appears to exist), and then probably use HTML Agility Pack to parse and query the resulting markup.
The other alternative might be to parse the JSON that exists on this page, if it provides the data you need (although this appears unlikely).
Small note: I think the id in your XPath should be "tab-transmog-sets" instead of "tab - transmog - sets".

Related

Looping through HtmlNodes and collecting data gives me the same result every time

I have an async method which calls a mapper for turning HTML string into an IEnumerable:
public async Task<IEnumerable<MovieRatingScrape>> GetMovieRatingsAsync(string username, int page)
{
var response = await _httpClient.GetAsync($"/betyg/{username}?p={page}");
response.EnsureSuccessStatusCode();
var html = await response.Content.ReadAsStringAsync();
return new MovieRatingsHtmlMapper().Map(html);
}
...
public class MovieRatingsHtmlMapper : HtmlMapperBase<IEnumerable<MovieRatingScrape>>
{
// In reality, this method belongs to base class with signature T Map(string html)
public IEnumerable<MovieRatingScrape> Map(string html)
{
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
return Map(htmlDocument);
}
public override IEnumerable<MovieRatingScrape> Map(HtmlDocument item)
{
var movieRatings = new List<MovieRatingScrape>();
var nodes = item.DocumentNode.SelectNodes("//table[@class='list']/tr");
foreach (var node in nodes)
{
var title = node.SelectSingleNode("//td[1]/a")?.InnerText;
movieRatings.Add(new MovieRatingScrape
{
Date = DateTime.Parse(node.SelectSingleNode("//td[2]")?.InnerText),
Slug = node.SelectSingleNode("//td[1]/a[starts-with(@href, '/film/')]")?
.GetAttributeValue("href", null)?
.Replace("/film/", string.Empty),
SwedishTitle = title,
Rating = node.SelectNodes($"//td[3]/i[{XPathHasClass("fa-star")}]").Count
});
}
return movieRatings;
}
}
The resulting list movieRatings contains copies of the same object, but when I look at the HTML, or debug and view the HtmlNode node, the rows differ as they are supposed to.
Either I'm blind to something really obvious, or I am hitting some async issue that I do not grasp. Any ideas? I should be getting 50 unique objects out of this call, but I am getting the first one 50 times.
Thank you in advance, Viktor.
Edit: Adding some screenshots to show my predicament. Look at locals InnerHtml (node) and title for item 1 and 2 of the foreach loop.
Edit 2: Managed to reproduce on .NET Fiddle: https://dotnetfiddle.net/A2I4CQ

You need to use .// and not // in the XPath expressions you run against each row node.
Here is the fixed Fiddle: https://dotnetfiddle.net/dZkSRN
// searches anywhere in the document, even when called on a child node
.// searches only within the current node
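The difference is easy to reproduce on a small fragment. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup and the RelativeXPathDemo class are hypothetical stand-ins for the real page and mapper.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical sample markup standing in for the real ratings table.
    public const string SampleHtml =
        "<table class='list'>" +
        "<tr><td><a href='/film/alpha'>Alpha</a></td></tr>" +
        "<tr><td><a href='/film/beta'>Beta</a></td></tr>" +
        "</table>";

    // Runs cellXPath once per table row and collects the link text it finds.
    public static List<string> Titles(string html, string cellXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@class='list']/tr");
        if (rows == null)
        {
            return titles; // no rows matched
        }

        foreach (var row in rows)
        {
            // "//td[1]/a" restarts at the document root on every iteration;
            // ".//td[1]/a" stays inside the current row.
            titles.Add(row.SelectSingleNode(cellXPath)?.InnerText);
        }
        return titles;
    }
}
```

Calling RelativeXPathDemo.Titles(RelativeXPathDemo.SampleHtml, "//td[1]/a") gives "Alpha" for both rows, while passing ".//td[1]/a" gives "Alpha" and then "Beta".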

I am not sure how to describe this, but I think your issue is here:
//table[@class='list']/tr
specifically the //
I experienced the same thing while looking for a span. I had to use something similar:
var nodes = htmlDoc.DocumentNode.SelectNodes("//li[@class='itemRow productItemWrapper']");
foreach(HtmlNode node in nodes)
{
var nodeDoc = new HtmlDocument();
nodeDoc.LoadHtml(node.InnerHtml);
string name = nodeDoc.DocumentNode.SelectSingleNode("//span[@class='productDetailTitle']").InnerText;
}

HtmlAgilityPack search url link

I am creating a Windows Forms application for a group of friends, and I'm using HtmlAgilityPack in it.
I need to find all versions of the TacO add-on, which are listed like this:
<li><a href='https://www.dropbox.com/s/nks140nf794tx77/GW2TacO_034r.zip?dl=0'>Download Build 034.1866r</a></li>
Additionally, I need to check for the latest version so that I can download the file from its URL, as in the code below:
public static bool Tacoisuptodate(string Version)
{
// Load HtmlDocuments
var doc = new HtmlWeb().Load("http://www.gw2taco.com/");
var body = doc.DocumentNode.SelectNodes("//body").Single();
// Filter the document down to the part we are interested in
//SelectNodes("//div"))
foreach (var node in doc.DocumentNode.SelectNodes("//div"))
{
// Check for null value
var classeValue = node.Attributes["class"]?.Value;
var idValue = node.Attributes["id"]?.Value;
var hrefValue = node.Attributes["href"]?.Value;
// We search <div class="widget LinkList" id="LinkList1" into home page >
if (classeValue == "widget LinkList" && idValue == "LinkList1")
{
foreach(HtmlNode content in node.SelectNodes("//li"))
{
Debug.Write(content.GetAttributeValue("href", false));
}
}
}
return false;
}
If somebody could help me, I would really appreciate it.

A single XPath is enough.
var xpath = "//h2[text()='Downloads']/following-sibling::div[@class='widget-content']/ul/li/a";
var doc = new HtmlAgilityPack.HtmlWeb().Load("http://www.gw2taco.com/");
var downloads = doc.DocumentNode.SelectNodes(xpath)
.Select(li => new
{
href = li.Attributes["href"].Value,
name = li.InnerText
})
.ToList();

System.ArgumentNullException when trying to access span with Xpath (C#)

I've been trying to get a program working that pulls various stock stats from Google Finance. So far I have not been able to get information out of spans. For now I have hardcoded direct access to the Apple stock.
Link to Apple stock: https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=NgItWIG1GIftsAHCn4zIAg
What I can't understand is that I get the correct output when I try it in the Chrome console with the following command:
$x("//*[@id=\"appbar\"]//div//div//div//span");
This is my current code in Visual Studio 2015 with Html Agility Pack installed (I suspect the fault is in currDocNodeCompanyName):
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
var histDoc = web.Load("https://www.google.com/finance/historical?q=NASDAQ%3AAAPL&ei=q9IsWNm4KZXjsAG-4I7oCA.html");
var histDocNode = histDoc.DocumentNode.SelectNodes("//*[@id=\"prices\"]//table//tr//td");
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
var currDocNodeCurrency = currDoc.DocumentNode.SelectNodes("//*[@id=\"ref_22144_elt\"]//div//div");
var currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
var histDocText = histDocNode.Select(node => node.InnerText);
var currDocCurrencyText = currDocNodeCurrency.Select(node => node.InnerText);
var currDocCompanyName = currDocNodeCompanyName.Select(node => node.InnerText);
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(currDocCompanyName.Take(2).ToString());
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I have been trying the XPath expression [text] and received output that I can work with in the Chrome console, but not in VS. I have also been experimenting with a foreach loop, as a few people suggested to others.
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
///same as before
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
HtmlNodeCollection currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
///Same as before
List <string> blaList = new List<string>();
foreach (HtmlNode x in currDocNodeCompanyName)
{
blaList.Add(x.InnerText);
}
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(blaList[1]);
result.Add(blaList[2]);
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I would really appreciate it if anyone could point me in the right direction.

If you check the contents of currDoc.DocumentNode.InnerHtml, you will notice that there is no element with the id "appbar". The result is therefore correct: the XPath matches nothing.
I suspect that the HTML element you're trying to find is generated by a script (JavaScript, for example). That explains why you can see it in the browser but not in the HtmlDocument object: HtmlAgilityPack does not execute scripts, it only downloads and parses the raw source.
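The ArgumentNullException itself comes from calling Select on the null that SelectNodes returns when nothing matches. A minimal sketch of a guard, assuming the HtmlAgilityPack NuGet package is referenced; the SafeSelect helper and its name are hypothetical, not part of the library:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SafeSelect
{
    // Returns the inner text of every node matching xpath,
    // or an empty list when the XPath matches nothing.
    public static List<string> InnerTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when there is no match.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText).ToList();
    }
}
```

An empty result from such a helper is then a hint to inspect the downloaded HTML rather than the XPath, since the element may simply not be in the raw source.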

Editing XML node inner text only if has specified value

Hello, I need your help.
I'm not very skilled in C# and I have been stuck on this for about 6 hours. If anyone knows the answer, please help me. Thanks.
I have XML like this:
<COREBASE>
<AGENT>
<AGENT_INDEX>1</AGENT_INDEX>
<AGENT_PORTER_INDEX>
</AGENT_PORTER_INDEX>
<AGENT_NAME>John</AGENT_NAME>
<AGENT_SURNAME>Smith</AGENT_SURNAME>
<AGENT_MOBILE_NUMBER>777777777</AGENT_MOBILE_NUMBER>
</AGENT>
<AGENT>
<AGENT_INDEX>2</AGENT_INDEX>
<AGENT_PORTER_INDEX>1
</AGENT_PORTER_INDEX>
<AGENT_NAME>Charles</AGENT_NAME>
<AGENT_SURNAME>Bukowski</AGENT_SURNAME>
<AGENT_MOBILE_NUMBER>99999999</AGENT_MOBILE_NUMBER>
</AGENT>
</COREBASE>
I need to select an agent by index in a Windows Forms combo box and then edit and save his attributes to the XML. I found out how to edit and save, but for some reason the changes are saved to the first agent and overwrite his attributes in the XML, not those of the selected agent.
I will be glad for any help.
private void buttonEditAgent_Click(object sender, EventArgs e)
{
XmlDocument AgentBaseEdit = new XmlDocument();
AgentBaseEdit.Load("AgentBase.xml");
XDocument AgentBase = XDocument.Load("AgentBase.xml");
var all = from a in AgentBase.Descendants("AGENT")
select new
{
agentI = a.Element("AGENT_INDEX").Value,
porterI = a.Element("AGENT_PORTER_INDEX").Value,
agentN = a.Element("AGENT_NAME").Value,
agentS = a.Element("AGENT_SURNAME").Value,
agentM = a.Element("AGENT_MOBILE_NUMBER").Value,
};
foreach (var a in all)
{
if ("" == textBoxEditAgentIndex.Text.ToString())
{
MessageBox.Show("You must fill Agent Index field !!", "WARNING");
}
else
{
// AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_INDEX").InnerText == textBoxEditAgentIndex.Text
if (a.agentI == textBoxEditAgentIndex.Text.ToString())
{
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_INDEX").InnerText = textBoxEditAgentIndex.Text;
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_PORTER_INDEX").InnerText = textBoxEditAgentPorterIndex.Text;
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_NAME").InnerText = textBoxEditAgentName.Text;
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_SURNAME").InnerText = textBoxEditAgentSurname.Text;
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/AGENT_MOBILE_NUMBER").InnerText = textBoxEditAgentMobile.Text;
AgentBaseEdit.Save("AgentBase.xml");
ClearEditAgentTxtBoxes();
}
}
}
}
Am I on the right track and just can't see the door, or am I totally wrong? Thanks, all. Miko
OK, I tried it this way, but it didn't change the inner text:
string agentIndex = comboBoxEditAgentI.SelectedItem.ToString();
XmlDocument AgentBaseEdit = new XmlDocument();
AgentBaseEdit.Load("AgentBase.xml");
XDocument AgentBase = XDocument.Load("AgentBase.xml");
var xElemAgent = AgentBase.Descendants("AGENT")
.First(a => a.Element("AGENT_INDEX").Value == agentIndex);
xElemAgent.Element("AGENT_MOBILE_NUMBER").Value = textBoxEditAgentMobile.Text;
xElemAgent.Element("AGENT_SURNAME").Value = textBoxEditAgentSurname.Text;
AgentBaseEdit.Save("AgentBase.xml");

It would be easier if you used LINQ to XML.
int agentIndex = 2;
XDocument xDoc = XDocument.Load(filename);
var xElemAgent = xDoc.Descendants("AGENT")
.First(a => a.Element("AGENT_INDEX").Value == agentIndex.ToString());
//or
//var xElemAgent = xDoc.XPathSelectElement(String.Format("//AGENT[AGENT_INDEX='{0}']",agentIndex));
xElemAgent.Element("AGENT_MOBILE_NUMBER").Value = "5555555";
xDoc.Save(filename);
PS: required namespaces: System.Xml.XPath and System.Xml.Linq
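The same idea can be wrapped in a small helper that copes with an index that does not exist. This is only a sketch under assumptions: the AgentEditor class and SetMobile method are hypothetical names, and only the mobile number is updated.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class AgentEditor
{
    // Updates AGENT_MOBILE_NUMBER of the agent whose AGENT_INDEX equals agentIndex.
    // Returns false, leaving the document untouched, when no such agent exists.
    public static bool SetMobile(XDocument doc, string agentIndex, string newMobile)
    {
        var agent = doc.Descendants("AGENT")
            .FirstOrDefault(a => (string)a.Element("AGENT_INDEX") == agentIndex);
        if (agent == null)
        {
            return false;
        }

        // SetElementValue replaces the child's value (or adds the child if it is missing).
        agent.SetElementValue("AGENT_MOBILE_NUMBER", newMobile);
        return true;
    }
}
```

A caller would load the file with XDocument.Load, call AgentEditor.SetMobile(doc, "2", "5555555"), and then call Save on that same XDocument. The second attempt in the question edits the XDocument but saves the separate, unmodified XmlDocument, which would explain why the file did not change.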

It does not work because you are explicitly selecting the first agent on each iteration of the loop:
AgentBaseEdit.SelectSingleNode("COREBASE/AGENT/...")
But you can do it more easily by reading and changing within the same XML document. Here I'm only changing the agent name, replacing it with "test 1", "test 2", ...
XDocument AgentBase = XDocument.Load("AgentBase.xml");
int i = 0;
foreach (XElement el in AgentBase.Descendants("AGENT")) {
el.Element("AGENT_NAME").Value = "test " + ++i;
// ...
}
AgentBase.Save("AgentBase.xml");
UPDATE
However, I suggest that you separate the logic involving the XML handling from the form. Start by creating an Agent class:
public class Agent
{
public string Index { get; set; }
public string PorterIndex { get; set; }
public string Name { get; set; }
public string Surname { get; set; }
public string Mobile { get; set; }
}
Then create an interface defining the needed functionality for an agent repository. The advantage of this interface is that it will make it easier later to switch to another kind of repository like a relational database.
public interface IAgentRepository
{
IList<Agent> LoadAgents();
void Save(IEnumerable<Agent> agents);
}
Then create a class that handles the agents. Here is a suggestion:
public class AgentXmlRepository : IAgentRepository
{
private string _xmlAgentsFile;
public AgentXmlRepository(string xmlAgentsFile)
{
_xmlAgentsFile = xmlAgentsFile;
}
public IList<Agent> LoadAgents()
{
XDocument AgentBase = XDocument.Load(_xmlAgentsFile);
var agents = new List<Agent>();
foreach (XElement el in AgentBase.Descendants("AGENT")) {
var agent = new Agent {
Index = el.Element("AGENT_INDEX").Value,
PorterIndex = el.Element("AGENT_PORTER_INDEX").Value,
Name = el.Element("AGENT_NAME").Value,
Surname = el.Element("AGENT_SURNAME").Value,
Mobile = el.Element("AGENT_MOBILE_NUMBER").Value
};
agents.Add(agent);
}
return agents;
}
public void Save(IEnumerable<Agent> agents)
{
var xDocument = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("COREBASE",
agents.Select(a =>
new XElement("AGENT",
new XElement("AGENT_INDEX", a.Index),
new XElement("AGENT_PORTER_INDEX", a.PorterIndex),
new XElement("AGENT_NAME", a.Name),
new XElement("AGENT_SURNAME", a.Surname),
new XElement("AGENT_MOBILE_NUMBER", a.Mobile)
)
)
)
);
xDocument.Save(_xmlAgentsFile);
}
}
The form can now concentrate on the editing logic. The form does not even need to know what kind of repository to use if you inject the repository in the form constructor (of course, the form constructor must declare a parameter of type IAgentRepository):
var myAgentForm = new AgentForm(new AgentXmlRepository("AgentBase.xml"));
myAgentForm.Show();
UPDATE #2
Note that you cannot change a single item within an XML file. You must load all the agents, make an edit and then rewrite the whole file, even if you edited only one agent.
To do this, you can use my LoadAgents method, then pick an agent from the returned list, edit the agent and finally write the agents list back to the file with my Save method. You can find an agent in the list with LINQ:
Agent agent = agents
.FirstOrDefault(a => a.Index == x);
This returns null if an agent with the required index does not exist. Since Agent is a reference type, you don't have to write it back to the list. The list keeps a reference to the same agent as the variable agent.

Reach functionality from other class c#

update
I'm writing a Silverlight application and I have the following class "Home". In this class I read an .xml file and write its contents to a ListBox. In another class, Overview, I want to show the same .xml file. I know it would be silly to write the same code again as in the class "Home".
The problem is how to reach this data.
My question is: how can I reuse the method LoadXMLFile() from another class?
The code:
// Read the .xml file in the class "Home"
public void LoadXMLFile()
{
WebClient xmlClient = new WebClient();
xmlClient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(XMLFileLoaded);
xmlClient.DownloadStringAsync(new Uri("codeFragments.xml", UriKind.RelativeOrAbsolute));
}
private void XMLFileLoaded(object sender, DownloadStringCompletedEventArgs e)
{
if (e.Error == null)
{
string xmlData = e.Result;
XDocument xDoc = XDocument.Parse(xmlData);
var tagsXml = from c in xDoc.Descendants("Tag") select c.Attribute("name");
List<Tag> lsTags = new List<Tag>();
foreach (string tagName in tagsXml)
{
Tag oTag = new Tag();
oTag.name = tagName;
var tags = from d in xDoc.Descendants("Tag")
where d.Attribute("name").Value == tagName
select d.Elements("oFragments");
var tagXml = tags.ToArray()[0];
foreach (var tag in tagXml)
{
CodeFragments oFragments = new CodeFragments();
oFragments.tagURL = tag.Attribute("tagURL").Value;
//Tags.tags.Add(oFragments);
oTag.lsTags.Add(oFragments);
}
lsTags.Add(oTag);
}
//List<string> test = new List<string> { "a","b","c" };
lsBox.ItemsSource = lsTags;
}
}

Create a class to read the XML file, and reference it from your other classes in order to use it. Say you call it XmlFileLoader; you would use it like this in the other classes:
var xfl = new XmlFileLoader();
var data = xfl.LoadXMLFile();
If I were you, I would make the LoadXMLFile function take a Uri parameter to make it more reusable:
var data = xfl.LoadXMLFile(uriToDownload);

You could create a class whose single responsibility is loading XML and returning it, leaving the class that calls your LoadXmlFile method to determine how to handle the resulting XML.
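One way to sketch that split is to move only the parsing into a shared class, so that each page keeps its own download handler and passes e.Result to it. The TagXmlParser name is hypothetical, and the Tag and CodeFragments shapes below are assumptions inferred from how the question's code uses them.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Assumed shapes of the question's model classes (their definitions are not shown there).
public class CodeFragments
{
    public string tagURL { get; set; }
}

public class Tag
{
    public string name { get; set; }
    public List<CodeFragments> lsTags = new List<CodeFragments>();
}

// Hypothetical shared parser: turns the downloaded XML text into Tag objects.
public static class TagXmlParser
{
    public static List<Tag> Parse(string xmlData)
    {
        var xDoc = XDocument.Parse(xmlData);
        return xDoc.Descendants("Tag").Select(t =>
        {
            var tag = new Tag { name = (string)t.Attribute("name") };
            foreach (var fragment in t.Elements("oFragments"))
            {
                tag.lsTags.Add(new CodeFragments { tagURL = (string)fragment.Attribute("tagURL") });
            }
            return tag;
        }).ToList();
    }
}
```

With this in place, both Home and Overview could set their ItemsSource from TagXmlParser.Parse(e.Result) inside their own DownloadStringCompleted handlers, and the parser stays free of any UI dependency.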
