I have HTML content that I am storing as an XML document (using HTML Agility Pack). I know some XPath, but I am not able to zero in on the exact content I need.
In my example below, I am trying to extract the "src" and "alt" text from the large image. This is my example:
<html>
<body>
....
<div id="large_image_display">
<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
</div>
....
<div id="small_image_display">
<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
</div>
</body>
</html>
What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got but it is wrong. Mostly pseudo code at this point:
\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt
Any help in getting this right would be greatly appreciated.
The following xpath will get you to the src attributes for the img tags:
'//html/body/div/img[@class="photo"]/@src'
And similarly this will get you to the alt attributes:
'//html/body/div/img[@class="photo"]/@alt'
From there you can get to the attribute text. If you want to only find the ones that match 'large_image_display' then you would filter it further like this:
'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'
Use the following XPath expressions:
/html/body/div[@id='large_image_display']/img/@src
and
/html/body/div[@id='large_image_display']/img/@alt
Always try to avoid using the // abbreviation, because it may result in very inefficient evaluation (causes the whole (sub)tree to be scanned).
In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
Your major problem was that in your expressions you were using \ and \\ and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation.
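For readers using HTML Agility Pack itself, the expressions above might be applied roughly as follows. This is only a minimal sketch: the names LargeImageExample and GetLargeImage are made up for illustration, and it selects the img element and then reads its attributes (rather than selecting @src directly) on the assumption that SelectSingleNode hands back element nodes.

```csharp
using System;
using HtmlAgilityPack;

public static class LargeImageExample
{
    // Hypothetical helper: returns (src, alt) of the large image, or null if it is not found.
    public static Tuple<string, string> GetLargeImage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the <img> element; its attributes are read from the node afterwards.
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "/html/body/div[@id='large_image_display']/img[@class='photo']");
        if (img == null) return null;

        return Tuple.Create(
            img.GetAttributeValue("src", ""),
            img.GetAttributeValue("alt", ""));
    }

    public static void Main()
    {
        const string html = @"<html><body>
<div id=""large_image_display"">
<img class=""photo"" src=""images/KC0763_l.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
<div id=""small_image_display"">
<img class=""photo"" src=""images/KC0763_s.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
</body></html>";

        var result = GetLargeImage(html);
        Console.WriteLine(result.Item1);
        Console.WriteLine(result.Item2);
    }
}
```

The second argument to GetAttributeValue is the fallback used when the attribute is missing, so a missing alt yields an empty string here instead of an exception.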
I'm generating an XML document that will be parsed as XHTML using XDocument. In some parts of it I have lists formatted as:
<root>
<div>
<span>Item 1</span>
</div>
<div>
<span>Item 2</span>
</div>
</root>
The whitespace between <div> and <span> (and respective terminators) is messing up my CSS. Is it possible to force it to NOT insert white-space in those cases, generating something like:
<root>
<div><span>Item 1</span></div>
<div><span>Item 2</span></div>
</root>
SaveOptions.DisableFormatting does work, but then the file becomes a pain for a human to read. So I need something else.
I think I found an answer, I will leave it here for others to comment and find possible bugs before accepting it.
I inserted a blank XText as the first element in the div and made XDocument understand it as mixed content (or something like that) and produce the output that I need.
div.AddFirst(new XText(""));
Does anyone have documentation on why it doesn't format mixed content and if that is indeed what is happening?
BTW, it has to be an empty XText; just passing an empty string, as below, doesn't work:
div.AddFirst("");
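To make the workaround easier to try out, here is a small self-contained sketch of it. The class name InlineDivExample and the Build method are hypothetical. My guess (an assumption, not something I have documentation for) is that the indenting writer stops adding whitespace inside an element once a text node has been written there, which would explain why the empty XText is enough.

```csharp
using System;
using System.Xml.Linq;

public static class InlineDivExample
{
    // Builds the list from the question; each <div> gets an empty XText as its first node.
    public static string Build()
    {
        var root = new XElement("root");
        foreach (var item in new[] { "Item 1", "Item 2" })
        {
            var div = new XElement("div", new XElement("span", item));
            div.AddFirst(new XText("")); // the workaround described above
            root.Add(div);
        }

        // ToString() uses the default, indented formatting (SaveOptions.None).
        return root.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```

Removing the AddFirst line and running it again is a quick way to compare the two layouts side by side.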
Can I use CsQuery to find an HTML element that has a certain attribute with a certain value?
Say I have a page containing something like this:
<html>
<body>
<div align="left">something</div>
</body>
</html>
Can I then get the whole line out by searching for a div with the attribute align set to the value left? Or even just get the HTML element, and then read the value of the attribute from it?
As always, thanks for the help and time.
I haven't used CsQuery myself, but judging from the docs it accepts CSS selectors, so this should work:
div[align='left']
EDIT:
After being assured that this is in response to a client side operation, in the script it should look like this:
var rows = query["div[align='left']"];
This is how you look up elements by tag and attribute selectors: put the attribute name in square brackets after the tag name, followed by the value it should equal.
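A slightly fuller sketch of the same lookup is below. It assumes the CsQuery package and that its CQ.Create, selector indexer, Attr and Text members behave like their jQuery counterparts. AlignLeftExample and FindLeftAlignedText are hypothetical names used only for this illustration.

```csharp
using System;
using CsQuery;

public static class AlignLeftExample
{
    // Hypothetical helper: returns the text of the first div with align="left", or null.
    public static string FindLeftAlignedText(string html)
    {
        CQ dom = CQ.Create(html);

        // Tag name, then the attribute and its required value in square brackets.
        CQ rows = dom["div[align='left']"];
        return rows.Length > 0 ? rows.First().Text() : null;
    }

    public static void Main()
    {
        const string html =
            "<html><body><div align=\"left\">something</div></body></html>";

        Console.WriteLine(FindLeftAlignedText(html));

        // Reading the attribute value back from the matched element.
        Console.WriteLine(CQ.Create(html)["div[align]"].Attr("align"));
    }
}
```

The selector div[align] (without a value) matches any div that has the attribute at all, which covers the second half of the question.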
I have an XML document that looks like
<a>foo<b>bar</b></a>
Creating an XDocument with the above XML, then using
doc.Descendants("a").First().Value
results in "foobar" rather than "foo" as I expected.
How can I get just the text of <a /> without having to subtract the value of <b /> from it?
Thanks in advance!
<a> actually contains two nodes: a text node and the b element. You can filter the child nodes of a down to those of type XText:
var xml = "<a>foo<b>bar</b></a>";
var document = XDocument.Parse(xml);
Console.WriteLine(document.Descendants("a").First().Nodes().OfType<XText>().First().Value);
This seems like questionable XML (although mixed content like this is well-formed). Maybe you should use an attribute instead, something like this:
<a name="foo">
<b>bar</b>
</a>
This isn't quite an answer, but I would question the validity of your XML in this example.
Consider the XML-compatible markup for a HTML paragraph with a hyperlink:
<p>Go to the <a href="...">StackOverflow front page</a></p>
The contents of the paragraph (the value of <p> ) is still the full sentence rather than just the words "go to the ".
Also, given your example, what happens if you have more text after <b>bar</b>?
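On that last point, one option is to concatenate every text node that is a direct child of the element instead of taking only the first one. The sketch below is just one way to do it. DirectTextExample and DirectText are hypothetical names, and the extra "baz" text is added to the sample purely to illustrate the case.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DirectTextExample
{
    // Concatenates only the text nodes that are direct children of the element.
    public static string DirectText(XElement element)
    {
        return string.Concat(element.Nodes().OfType<XText>().Select(t => t.Value));
    }

    public static void Main()
    {
        var a = XElement.Parse("<a>foo<b>bar</b>baz</a>");

        Console.WriteLine(a.Value);       // text of all descendants: foobarbaz
        Console.WriteLine(DirectText(a)); // direct text only: foobaz
    }
}
```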
I have some HTML like the following:
<div class="class1 class2 class3">
<div class="class4 class5">
<span class="class6">GOAL STRING</span>
</div>
</div>
Now I want to find that GOAL STRING using HtmlAgilityPack.
How can I do that?
[With LINQ and without LINQ; please show both ways.]
Thanks in advance.
Well you can use xpath to get the span directly.
var text = document.DocumentNode.SelectSingleNode("//div[@class='class1 class2 class3']/div[@class='class4 class5']/span[@class='class6']").InnerText;
This is a good resource for XPath, specifically the table in the middle of the page:
http://www.codeproject.com/Articles/9494/Manipulate-XML-data-with-XPath-and-XmlDocument-C
Also on Google Chrome you can right click -> inspect element and then right click the element that shows up on the tree and click copy as Xpath to get a starting point. These expressions can usually be simplified.
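Since the question asked for both styles, here is a rough sketch of a LINQ version next to a shortened XPath version. It assumes the span's class attribute is exactly "class6", as in the sample. GoalStringExample and its two methods are hypothetical names.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class GoalStringExample
{
    // LINQ: walk all <span> descendants and filter on the class attribute.
    public static string FindWithLinq(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants("span")
            .Where(s => s.GetAttributeValue("class", "") == "class6")
            .Select(s => s.InnerText)
            .FirstOrDefault();
    }

    // XPath: a shortened form of the expression above that matches the span only.
    public static string FindWithXPath(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@class='class6']");
        return span == null ? null : span.InnerText;
    }

    public static void Main()
    {
        const string html =
            "<div class=\"class1 class2 class3\"><div class=\"class4 class5\">" +
            "<span class=\"class6\">GOAL STRING</span></div></div>";

        Console.WriteLine(FindWithLinq(html));
        Console.WriteLine(FindWithXPath(html));
    }
}
```

Both versions return null when nothing matches, which avoids the NullReferenceException that calling InnerText directly on a failed SelectSingleNode would cause.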
I'm learning how to use the library for the first time and would like some help.
Consider I have this somewhere in my HTMLDocument:
<h1>Casablanca
<span>(2010) <span class="pro-link">More at <strong>IMDbPro</strong> »</span><span class="title-extra"></span></span>
</h1>
How can I extract just the Casablanca text, not the span?
Also, am I correct in thinking that the HtmlNode.InnerText is the text inside of a Div?
Well, there will be a TextNode as the first child of the H1 node.
YourH1Node.FirstChild.InnerText or something similar...
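Spelled out a little more, that might look like the sketch below. TitleExample and OwnText are hypothetical names, and the markup is a trimmed version of the snippet from the question. Note that InnerText on the h1 itself would also include the text of the nested spans, which is why the first child is used instead.

```csharp
using System;
using HtmlAgilityPack;

public static class TitleExample
{
    // Hypothetical helper: returns the leading text of the first <h1>, without nested elements.
    public static string OwnText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1");
        if (h1 == null || h1.FirstChild == null) return null;

        // The first child is the text node before the <span>; Trim drops the line break.
        return h1.FirstChild.InnerText.Trim();
    }

    public static void Main()
    {
        const string html =
            "<h1>Casablanca\n<span>(2010) <span class=\"pro-link\">More at <strong>IMDbPro</strong></span></span>\n</h1>";

        Console.WriteLine(OwnText(html));
    }
}
```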