I'm using Html Agility Pack to fetch a webpage.
I want to collect the text of every element of the following form:
<li data-address="...">TEXT I AM LOOKING FOR</li>
I tried this code:
var web = new HtmlWeb();
var doc = web.Load(url);
var nodes1 = doc.DocumentNode.SelectNodes("//[@data-address]");
var nodes2 = doc.DocumentNode.SelectNodes("//[@data-address={0}]");
Both threw an exception: Expression must evaluate to a node-set.
How can I correct my selector?
I'm not an XPath expert by any means, but I suspect you want:
// Note the *
var nodes1 = doc.DocumentNode.SelectNodes("//*[@data-address]");
In other words "any element with a data-address attribute"
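As a rough illustration of that selector, the sketch below runs it against a small inline HTML string (made up for this example) instead of a fetched page. It also guards against SelectNodes returning null, which Html Agility Pack does when nothing matches, and reads each attribute value with GetAttributeValue.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the fetched page.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<ul>" +
            "<li data-address='1 Main St'>First</li>" +
            "<li>No address</li>" +
            "<li data-address='2 Side Rd'>Second</li>" +
            "</ul>");

        // "//*[@data-address]" matches any element that has the attribute.
        var nodes = doc.DocumentNode.SelectNodes("//*[@data-address]");

        // SelectNodes returns null rather than an empty collection on no match.
        if (nodes == null)
        {
            Console.WriteLine("No matching elements.");
            return;
        }

        foreach (var node in nodes)
        {
            // The second argument is the default used when the attribute is missing.
            var address = node.GetAttributeValue("data-address", "");
            Console.WriteLine(address + ": " + node.InnerText);
        }
    }
}
```

To match one specific value instead, the predicate can compare the attribute, for example "//*[@data-address='1 Main St']".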
Hello. How can I get HTML content, such as a shoutbox or just the username of the connected user, in C#?
Example: <p><?php echo USER['name'] ?></p>
In C#, how can I get the value of the p element?
You should use an HTML parser such as Html Agility Pack. Regex is not a good choice for parsing HTML, because HTML is neither strict nor regular in its format.
You can use the code below to retrieve it with HtmlAgilityPack:
// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// "//p" selects every p element in the document.
var itemList = doc.DocumentNode.SelectNodes("//p")
    .Select(p => p.InnerText)
    .ToList();
Give the P element an ID, and reference it by:
string contents = yourId.Text;
with your html like this:
<p id="yourId"></p>
I'm trying to grab the p tag inside a form tag, but the result is null:
string html = "<form id='foo123'> <p> loll </p> </form>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectNodes("//form[contains(@id, 'foo')]"); //.Count = 1
var p = node[0].SelectSingleNode("./p"); // p is null
How do I fix this?
This is a known issue where the Agility Pack incorrectly fixes the nesting of tags. You can work around it by calling the following before loading the document:
HtmlNode.ElementsFlags.Remove("form");
See: http://htmlagilitypack.codeplex.com/workitem/23074
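A minimal sketch of that workaround, reusing the HTML from the question. The key detail is that the flag is removed before the document is parsed. ElementsFlags is static, so the change affects every document loaded afterwards in the process. Behavior may differ between Html Agility Pack versions, so treat this as something to verify against the version in use.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Remove the special handling of <form> before parsing anything.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml("<form id='foo123'> <p> loll </p> </form>");

        var form = doc.DocumentNode.SelectSingleNode("//form[contains(@id, 'foo')]");

        // With the flag removed, <p> should be parsed as a child of <form>.
        var p = form == null ? null : form.SelectSingleNode("./p");

        Console.WriteLine(p == null ? "p not found" : p.InnerText.Trim());
    }
}
```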
How can I get the XPath from a clicked HtmlElement in the WebBrowserControl?
This is how I retrieve the clicked HtmlElement:
System.Windows.Forms.HtmlDocument document = this.webBrowser1.Document;
document.MouseUp += new HtmlElementEventHandler(this.htmlDocument_Click);
private void htmlDocument_Click(object sender, HtmlElementEventArgs e)
{
HtmlElement element = this.webBrowser1.Document.GetElementFromPoint(e.ClientMousePosition);
}
I want to click specific elements (price, article number, description, etc) on a website and get their XPath expressions.
Thank you!
XPath is not a standard feature of HTML (unlike XML). If you want an element's XPath that you can later use with Html Agility Pack, you have at least two options:
1. Walk up the element's DOM ancestry tree using HtmlElement.Parent and construct the XPath manually.
2. Use Html Agility Pack itself and do something like this (untested):
HtmlElement element = this.webBrowser1.Document.GetElementFromPoint(e.ClientMousePosition);
var savedId = element.Id;
var uniqueId = Guid.NewGuid().ToString();
element.Id = uniqueId;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(element.Document.GetElementsByTagName("html")[0].OuterHtml);
element.Id = savedId;
var node = doc.GetElementbyId(uniqueId);
var xpath = node.XPath;
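The first option (walking up with HtmlElement.Parent) could be sketched roughly as follows. XPathBuilder and BuildXPath are hypothetical names introduced for this illustration, and the code is untested against a live WebBrowser control. It assumes that Children contains only element nodes and that Parent returns null above the root element. The browser's DOM can also differ from what Html Agility Pack parses from the raw source (for example, inserted tbody elements), so the resulting path may need adjusting.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper, not part of WinForms or Html Agility Pack.
static class XPathBuilder
{
    public static string BuildXPath(HtmlElement element)
    {
        var segments = new List<string>();

        // Walk from the clicked element up to the root.
        for (var current = element; current != null; current = current.Parent)
        {
            var tag = current.TagName.ToLowerInvariant();
            var index = 1; // XPath positions are 1-based

            var parent = current.Parent;
            if (parent != null)
            {
                // Count preceding siblings that share the same tag name.
                foreach (HtmlElement sibling in parent.Children)
                {
                    if (sibling == current) break;
                    if (sibling.TagName.ToLowerInvariant() == tag) index++;
                }
            }

            segments.Insert(0, tag + "[" + index + "]");
        }

        return "/" + string.Join("/", segments);
    }
}
```

Inside the click handler from the question, this would be called as XPathBuilder.BuildXPath(element).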
How can I get a List<string> from this text:
For the iPhone do the following:<ul><li>Go to AppStore</li><li>Search by him</li><li>Download</li></ul>
The list should consist of:
Go to AppStore,
Search by him,
Download
Load the string into the Html Agility Pack, then select the inner text of every li element.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("following:<ul><li>Go to AppStore</li><li>Search by him</li><li>Download</li></ul>");
var uls = doc.DocumentNode.Descendants("li").Select(d => d.InnerText);
foreach (var ul in uls)
{
Console.WriteLine(ul);
}
Wrap in an XML root element and use LINQ to XML:
var xml = "For the iPhone do the following:<ul><li>Go to AppStore</li><li>Search by him</li><li>Download</li></ul>";
xml = "<r>"+xml+"</r>";
var ele = XElement.Parse(xml);
var lists = ele.Descendants("li").Select(e => e.Value).ToList();
lists then contains:
Go to AppStore
Search by him
Download
Why does this pick all of my <li> elements in my document?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<Page>();
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes("//li");
What I want is to get all <li> elements in the <div> with an id of "myTrips".
It's a bit confusing because you expect SelectNodes to run only on the div with id "myTrips". However, calling SelectNodes("//li") on that node performs another search from the top of the document.
I fixed this by combining the two steps into one expression, but that only works on a page that has a single div with the id "myTrips". The query looks like this:
doc.DocumentNode.SelectNodes("//div[@id='myTrips']//li");
var liOfTravels = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']")
.SelectNodes(".//li");
Note the dot in the second line. In this regard HtmlAgilityPack relies entirely on XPath syntax, but the result is non-intuitive, because these two queries are effectively the same:
doc.DocumentNode.SelectNodes("//li");
some_deeper_node.SelectNodes("//li");
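To make the difference concrete, here is a small sketch with made-up markup: one li outside the div and two inside it. An expression starting with // is absolute and starts from the document root regardless of the node it is called on. An expression starting with .// is relative to the context node.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical page: one <li> outside the div, two inside it.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<ul><li>outside</li></ul>" +
            "<div id='myTrips'><ul><li>Paris</li><li>Rome</li></ul></div>");

        var div = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");

        // Absolute path: searches the whole document, even when called on div.
        var absolute = div.SelectNodes("//li");

        // Relative path: searches only the descendants of div.
        var relative = div.SelectNodes(".//li");

        Console.WriteLine(absolute.Count); // 3
        Console.WriteLine(relative.Count); // 2
    }
}
```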
Creating a new node can be beneficial in some situations and lets you use the xpaths more intuitively. I've found this useful in a couple of places.
var myTripsDiv = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");
var myTripsNode = HtmlNode.CreateNode(myTripsDiv.InnerHtml);
var liOfTravels = myTripsNode.SelectNodes("//li");
You can do this with a LINQ query:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var travelList = new List<HtmlNode>();
foreach (var matchingDiv in doc.DocumentNode.DescendantNodes().Where(n=>n.Name == "div" && n.Id == "myTrips"))
{
travelList.AddRange(matchingDiv.DescendantNodes().Where(n=> n.Name == "li"));
}
I hope it helps
This seems counterintuitive to me as well. If you run SelectNodes on a particular node, I would expect it to search only beneath that node, not the whole document.
Anyway, OP, if you change this line:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("//li");
to:
var liOfTravels =
doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']").SelectNodes("li");
I think you'll be OK. I've just had the same issue and that fixed it for me. I'm not sure, though, whether the li would have to be a direct child of the node you have.
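On the direct-child question: in XPath a bare li is shorthand for child::li, so it matches only direct children of the context node, while .//li matches at any depth. A small sketch with made-up markup, where the li elements sit one level down inside a ul:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup: the <li> elements are grandchildren of the div.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='myTrips'><ul><li>Paris</li><li>Rome</li></ul></div>");

        var div = doc.DocumentNode.SelectSingleNode("//div[@id='myTrips']");

        // "li" looks only at direct children of div; there are none here,
        // so SelectNodes returns null.
        var direct = div.SelectNodes("li");

        // ".//li" looks at every descendant of div.
        var nested = div.SelectNodes(".//li");

        Console.WriteLine(direct == null ? 0 : direct.Count); // 0
        Console.WriteLine(nested == null ? 0 : nested.Count); // 2
    }
}
```

So the bare li form only works when the list items are immediate children of the selected node.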