How can I extract things from Divs using HTMLAgilityPack? - c#

I'm learning how to use the library for the first time and would like some help.
Suppose I have this somewhere in my HtmlDocument:
<h1>Casablanca
<span>(2010) <span class="pro-link">More at <strong>IMDbPro</strong> »</span><span class="title-extra"></span></span>
</h1>
How can I extract just the Casablanca text, without the contents of the span?
Also, am I correct in thinking that the HtmlNode.InnerText is the text inside of a Div?

Well, there will be a TextNode as the first child of the H1 node.
YourH1Node.FirstChild.InnerText or something similar...
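As a minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced and the markup from the question is loaded from a string (the variable names here are illustrative, not from the original post):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Markup copied from the question.
        string html = @"<h1>Casablanca
<span>(2010) <span class=""pro-link"">More at <strong>IMDbPro</strong> »</span><span class=""title-extra""></span></span>
</h1>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1");

        // The first child of <h1> is the text node that precedes the <span>.
        // Trim() removes the line break that follows the title.
        string title = h1.FirstChild.InnerText.Trim();
        Console.WriteLine(title); // Casablanca

        // InnerText on the element itself concatenates the text of all
        // descendants, so it also contains the year and the IMDbPro link text.
        Console.WriteLine(h1.InnerText);
    }
}
```

This also covers the second question: InnerText is the text of a node and everything nested inside it, for any element, not only a div.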

Related

Export contenteditable div data to Word causes blank line

I have a contenteditable div where the user enters data. When they enter a line break, each browser stores the data differently. When I export this data to Word using HtmlToOpenXml, it adds a blank line to the content, and I want to avoid that so the HTML page and the Word document look the same.
One option is to replace the tags <br>, <div>, <p> with blank and then replace </div> and </p> with <br/> in the C# code using RegEx. But I do not know all the markup that different browsers produce for a contenteditable div, so this implementation may not cover every case.
What is the best way to address this, or is there an open source tool/dll that helps with this issue?
e.g. ContentEditable div actual data in browsers looks like below
Chrome -
line1<div>line2</div><div>line3</div>
IE Edge-
<div>line1</div><div>line22</div><div>line3<br></div>
FireFox - I read it uses <p> </p> instead of <div> </div>
Safari - ????
A Solution I found:
You could use RegEx, which I highly recommend in C# for parsing information.
Based on the formatting, you could narrow down which browser produced the markup and then parse its output into a common form. This will not be easy, but cross-platform work rarely is. I would give an example of how this could be done, but RegEx honestly takes a good amount of work, and it would take quite a bit of code to show how to parse the markup and work out which browser it came from.
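A rough sketch of such a normalization, under the assumption that the markup contains only the <div>, <p>, and <br> tags shown in the question. SplitLines is a hypothetical helper name, and the approach skips browser detection entirely by reducing every variant to a list of lines:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: splits contenteditable markup into its visible lines.
    static string[] SplitLines(string html)
    {
        // A <br> directly before a closing </div> or </p> is only a placeholder; drop it.
        string s = Regex.Replace(html, @"<br\s*/?>\s*(?=</(div|p)>)", "", RegexOptions.IgnoreCase);
        // Every remaining <div>, </div>, <p>, </p>, or <br> marks a line boundary.
        s = Regex.Replace(s, @"</?(div|p)\b[^>]*>|<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // Empty entries come from adjacent boundaries such as </div><div>.
        return s.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Sample strings taken from the question.
        string chrome = "line1<div>line2</div><div>line3</div>";
        string edge = "<div>line1</div><div>line22</div><div>line3<br></div>";

        Console.WriteLine(string.Join("<br/>", SplitLines(chrome))); // line1<br/>line2<br/>line3
        Console.WriteLine(string.Join("<br/>", SplitLines(edge)));   // line1<br/>line22<br/>line3
    }
}
```

Note that this sketch also discards lines the user left blank on purpose, and it has not been checked against Safari output. An HTML parser would be more robust than regular expressions once attributes or nested formatting appear.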

Why does my XPath get stuck on first result when I add /text()?

Here's a simulation of the HTML I am trying to use my XPath on:
<div class="stream-links">
<div>
<a href="...">value I need</a>
</div>
<div>
<a href="...">value I need</a>
</div>
<div>
<a href="...">value I need</a>
</div>
</div>
Now, when I use the XPath pattern //div[@class='stream-links']/div/a in my browser, it selects the <a ...> node. Every time I press Enter it selects the next one, but when I use the pattern //div[@class='stream-links']/div/a/text() it gets stuck on the text of the first <a ...> node, so pressing Enter does not move to the next. (I'm using the Firebug plugin in Firefox to inspect elements, by the way.)
I'm coding a program in C#, and the number of divs under the parent div varies, so I can't use //div[@class='stream-links']/div[number here]/a/text() because I need to get all of them.
My code for using the XPath is HtmlNodeCollection NODECOL1 = MEDOC.DocumentNode.SelectNodes("//div[@class='stream-links']/div/a[1]");
So my questions are:
1) Is there a particular reason Firebug doesn't jump to the next <a...> or is it a 'bug' on the plugin's side?
2) Will my code work nevertheless or do I need to approach it in another way?
There're a few things not right with the rest of my code so I can't see if that part of my code actually works or not, wouldn't ask question 2 if I could test it myself right now.
For your HTML, this XPath selects three a elements:
//div[@class='stream-links']/div/a
This XPath selects three text nodes:
//div[@class='stream-links']/div/a/text()
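This XPath selects the first a child of each inner div, so for your HTML it also selects three a elements:
//div[@class='stream-links']/div/a[1]
My code for using the Xpath is HtmlNodeCollection NODECOL1 =
MEDOC.DocumentNode.SelectNodes("//div[@class='stream-links']/div/a[1]");
1) Is there a particular reason Firebug doesn't jump to the next
<a...> or is it a 'bug' on the plugins side?
The [1] predicate is evaluated per parent div, so //div[@class='stream-links']/div/a[1] selects the first a element inside each div. To select only the first match overall, wrap the path in parentheses: (//div[@class='stream-links']/div/a)[1].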
2) Will my code work nevertheless or do I need to approach it in
another way?
There's a few things not right with the rest of my code so I can't see
if that part of my code actually works or not, wouldn't ask question 2
if I could test it myself right now.
That's not a reasonable question to ask given what you've shown us. Perhaps knowing what the above XPaths return will help you answer it for yourself.
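One way to check the C# side without Firebug is to run the XPath against a small sample. The sketch below assumes the HtmlAgilityPack NuGet package; the href values and link texts are placeholders invented for illustration, since the question does not show them:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Placeholder markup with the same shape as the question's HTML.
        string html = @"<div class=""stream-links"">
<div><a href=""#"">first</a></div>
<div><a href=""#"">second</a></div>
<div><a href=""#"">third</a></div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <a> under the inner divs, however many divs there are.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//div[@class='stream-links']/div/a");

        // SelectNodes returns null rather than an empty collection when nothing matches.
        if (links != null)
        {
            foreach (HtmlNode a in links)
            {
                // InnerText gives the link text, so /text() is not needed in the XPath.
                Console.WriteLine(a.InnerText.Trim());
            }
        }
    }
}
```

Selecting the a elements and reading InnerText in code keeps the XPath identical to the one that behaved as expected in the browser.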

can you find html element by attribute with csquery

Can I use CsQuery to find an HTML element that has a certain attribute with a certain value?
So if I have a page containing something like this:
<html>
<body>
<div align="left">something</div>
</body>
</html>
Can I then get the whole line by searching for a div whose align attribute has the value left? Or even just get the HTML element and then read the value of the attribute?
As always, thanks for the help and time.
I haven't used CsQuery myself, but judging from the docs you can use CSS selectors, so this should work:
div[align='left']
EDIT:
After being assured that this is in response to a client side operation, in the script it should look like this:
var rows = query["div[align='left']"];
This is how you look up elements by tag and attribute: put the attribute name in brackets after the tag name, followed by the value you want to match.
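A slightly fuller sketch of the same lookup, assuming the CsQuery NuGet package and the page from the question held in a string. CQ.Create, the selector indexer, Attr, and Text are the only CsQuery members used; the variable names are illustrative:

```csharp
using System;
using CsQuery;

class Program
{
    static void Main()
    {
        // Markup copied from the question.
        string html = @"<html>
<body>
<div align=""left"">something</div>
</body>
</html>";

        CQ dom = CQ.Create(html);

        // CSS attribute selector: div elements whose align attribute equals "left".
        CQ leftDivs = dom["div[align='left']"];

        Console.WriteLine(leftDivs.Length);        // number of matching elements
        Console.WriteLine(leftDivs.Attr("align")); // attribute value of the first match
        Console.WriteLine(leftDivs.Text());        // text content of the selection
    }
}
```

To find every div that merely has the attribute, regardless of its value, the selector div[align] works the same way.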

Grabbing text of a single element in XDocument (C#)

I have an XML document that looks like
<a>foo<b>bar</b></a>
Creating an XDocument with the above XML, then using
doc.Descendants(new XName("a")).First().Value
results in "foobar" rather than "foo" as I expected.
How can I get just the text of <a /> itself, without having to subtract the value of <b /> from it?
Thanks in advance!
<a> actually contains two nodes, a text node and the b element. You can filter a's children to those of type XText:
using System;
using System.Linq;
using System.Xml.Linq;
var xml = "<a>foo<b>bar</b></a>";
var document = XDocument.Parse(xml);
Console.WriteLine(document.Descendants("a").First().Nodes().OfType<XText>().First().Value);
This seems like questionable XML. Maybe you should use an attribute instead, something like this:
<a name="foo">
<b>bar</b>
</a>
This isn't quite an answer, but I would question the validity of your XML in this example.
Consider the XML-compatible markup for an HTML paragraph with a hyperlink:
<p>Go to the <a href="...">StackOverflow front page</a></p>
The contents of the paragraph (the value of <p> ) is still the full sentence rather than just the words "go to the ".
Also, given your example, what happens if you have more text after <b>bar</b>?
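To illustrate that last point, here is a small sketch using only System.Xml.Linq. The trailing "baz" text is added purely for illustration and is not part of the original XML:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Same shape as the question's XML, with extra text after <b> (illustrative).
        XElement a = XDocument.Parse("<a>foo<b>bar</b>baz</a>").Root;

        // Only the first direct text node.
        Console.WriteLine(a.Nodes().OfType<XText>().First().Value); // foo

        // All direct text nodes, skipping child elements.
        Console.WriteLine(string.Concat(a.Nodes().OfType<XText>().Select(t => t.Value))); // foobaz

        // Value concatenates every descendant text node.
        Console.WriteLine(a.Value); // foobarbaz
    }
}
```

Which of the three is right depends on whether text after the child element should count as part of the value of <a>.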

Extract content with XPath?

I have html content that I am storing as an XML document (using HTML Agility Pack). I know some XPath, but I am not able to zero into the exact content I need.
In my example below, I am trying to extract the "src" and "alt" text from the large image. This is my example:
<html>
<body>
....
<div id="large_image_display">
<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
</div>
....
<div id="small_image_display">
<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
</div>
</body>
</html>
What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got but it is wrong. Mostly pseudo code at this point:
\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt
Any help in getting this right would be greatly appreciated.
The following xpath will get you to the src attributes for the img tags:
'//html/body/div/img[@class="photo"]/@src'
And similarly this will get you to the alt attributes:
'//html/body/div/img[@class="photo"]/@alt'
From there you can get to the attribute text. If you want to only find the ones that match 'large_image_display' then you would filter it further like this:
'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'
Use the following XPath expressions:
/html/body/div[@id='large_image_display']/img/@src
and
/html/body/div[@id='large_image_display']/img/@alt
Always try to avoid using the // abbreviation, because it may result in very inefficient evaluation (causes the whole (sub)tree to be scanned).
In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
Your major problem was that in your expressions you were using \ and \\ and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation.
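In C#, one possible way to apply this with HTML Agility Pack (assuming the HtmlAgilityPack NuGet package is referenced) is to select the img element and read both attributes from it, rather than selecting each attribute with its own XPath. This is a sketch with illustrative variable names, using the markup from the question minus the elided parts:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Markup from the question, without the elided "...." sections.
        string html = @"<html>
<body>
<div id=""large_image_display"">
<img class=""photo"" src=""images/KC0763_l.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
<div id=""small_image_display"">
<img class=""photo"" src=""images/KC0763_s.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
</body>
</html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element once, then read its attributes.
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "/html/body/div[@id='large_image_display']/img");

        // SelectSingleNode returns null when nothing matches.
        if (img != null)
        {
            // The second argument is the fallback used when the attribute is missing.
            Console.WriteLine(img.GetAttributeValue("src", "")); // images/KC0763_l.jpg
            Console.WriteLine(img.GetAttributeValue("alt", "")); // Circles t-shirt - Navy
        }
    }
}
```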
