Grabbing text of a single element in XDocument (C#)

I have an XML document that looks like
<a>foo<b>bar</b></a>
Creating an XDocument from the above XML, then using
doc.Descendants("a").First().Value
results in "foobar" rather than "foo" as I expected.
How can I get just the text of <a> itself, without having to strip the value of <b /> out of it?
Thanks in advance!

<a> actually contains two nodes: a text node and the b element. You can filter the child nodes of a down to those of type XText:
var xml = "<a>foo<b>bar</b></a>";
var document = XDocument.Parse(xml);
Console.WriteLine(document.Descendants("a").First().Nodes().OfType<XText>().First().Value);
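If the element has text on both sides of a child element, First() only picks up the leading piece. A minimal sketch of one way to handle that case, assuming you want every direct text node concatenated (OwnText is a hypothetical helper name, not part of System.Xml.Linq):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OwnTextExample
{
    // Hypothetical helper: joins only the text nodes that are direct
    // children of the element, skipping text inside child elements.
    public static string OwnText(XElement element) =>
        string.Concat(element.Nodes().OfType<XText>().Select(t => t.Value));

    public static void Main()
    {
        var a = XDocument.Parse("<a>foo<b>bar</b>baz</a>").Root;

        Console.WriteLine(a.Value);    // prints "foobarbaz" (all descendant text)
        Console.WriteLine(OwnText(a)); // prints "foobaz" (direct text nodes only)
    }
}
```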

This seems like kind of invalid XML. Maybe you should try an attribute instead, something like this:
<a name="foo">
<b>bar</b>
</a>

This isn't quite an answer, but I would question the validity of your XML in this example.
Consider the XML-compatible markup for a HTML paragraph with a hyperlink:
<p>Go to the <a href="...">StackOverflow front page</a></p>
The contents of the paragraph (the value of <p> ) is still the full sentence rather than just the words "go to the ".
Also, given your example, what happens if you have more text after <b>bar</b>?

Related

How to prevent tags from becoming separated by white-space?

I'm generating an XML document that will be parsed as XHTML using XDocument. In some parts of it I have lists formatted as:
<root>
<div>
<span>Item 1</span>
</div>
<div>
<span>Item 2</span>
</div>
</root>
The whitespace between <div> and <span> (and respective terminators) is messing up my CSS. Is it possible to force it to NOT insert white-space in those cases, generating something like:
<root>
<div><span>Item 1</span></div>
<div><span>Item 2</span></div>
</root>
SaveOptions.DisableFormatting does work, but then the file becomes painful for a human to read. So I need something else.

I think I found an answer. I will leave it here for others to comment on and to find possible bugs before accepting it.
I inserted an empty XText as the first node in the div, which made XDocument treat it as mixed content (or something like that) and produce the output that I need.
div.AddFirst(new XText(""));
Does anyone have documentation on why it doesn't format mixed content, and on whether that is indeed what is happening?
BTW, it has to be an empty XText node; passing an empty string, as below, doesn't work:
div.AddFirst("");
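To compare the two outputs side by side, here is a small sketch of the trick described above. It assumes the default indented output of XElement.ToString(); the Render method and its addEmptyText parameter are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class MixedContentExample
{
    // Builds the <root><div><span>...</span></div>...</root> tree from the
    // question, optionally putting an empty XText first in each div.
    public static string Render(bool addEmptyText)
    {
        var root = new XElement("root");
        foreach (var label in new[] { "Item 1", "Item 2" })
        {
            var div = new XElement("div", new XElement("span", label));
            if (addEmptyText)
            {
                // The empty text node is what the answer above relies on
                // to make the writer treat the div as mixed content.
                div.AddFirst(new XText(""));
            }
            root.Add(div);
        }
        return root.ToString(); // default SaveOptions, i.e. indented
    }

    public static void Main()
    {
        Console.WriteLine(Render(addEmptyText: false));
        Console.WriteLine(Render(addEmptyText: true));
    }
}
```

Only the div elements are affected, so the rest of the document keeps its normal indentation.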

Selenium: Get Element, which is only Text

I'm trying to find a piece of text via Selenium that sits directly in the HTML, without an element of its own.
This can look something like this:
<br>
Uploaded.net
<img class="bbCodeImage LbImage" />
<br>
I found the image after the text, but even so I can't navigate to the text.
I went to the img element, then tried:
var des2 = ele.FindElement(ByProxy.XPath("preceding-sibling::*"));
Interestingly enough, this returns the br element and not the text right above the image. I also tried to brute-force it and get all elements containing this text:
var des2 = thread.FindElements(ByProxy.XPath("descendant::*[contains(text(), \"Uploaded.net\")]")).SelectMany(f => f.FindElements(ByProxy.XPath("descendant::*")));
foreach(var ele in des2)
{
Debug.WriteLine(ele.Text);
}
So I read all descendants of the elements containing the mentioned text and iterate over them, but none of them has its Text set.
Am I missing something crucial here?

Selenium can't return a text node; FindElement only ever returns elements. You could, however, get the text with a piece of JavaScript:
string text = (string)((IJavaScriptExecutor)driver).ExecuteScript(
"return arguments[0].previousSibling.textContent.trim();", ele);

I don't think there is any obvious solution to this, but I can offer a very roundabout one.
Get the page source of the page -- driver.PageSource in C# (driver.getPageSource() in Java).
Split the page source by the img tag. Then split the first element of that split by the br tag. The last element of the resulting array should now be the text.
If you have control over the development of this page, someone should fix the markup.
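The splitting steps above might look roughly like this. This is a sketch that works on a plain string, so it assumes the page source has already been fetched and that the tags appear literally as "<img" and "<br>"; TextBeforeImage is a hypothetical helper name.

```csharp
using System;

public static class PageSourceSplitExample
{
    // Returns the text between the last <br> and the first <img in the source.
    public static string TextBeforeImage(string pageSource)
    {
        // Step 1: keep everything before the first img tag.
        string beforeImg = pageSource.Split(new[] { "<img" }, StringSplitOptions.None)[0];

        // Step 2: split that part by br tags and take the last piece.
        string[] parts = beforeImg.Split(new[] { "<br>" }, StringSplitOptions.None);
        return parts[parts.Length - 1].Trim();
    }

    public static void Main()
    {
        string source = "<br>\nUploaded.net\n<img class=\"bbCodeImage LbImage\" />\n<br>";
        Console.WriteLine(TextBeforeImage(source)); // prints "Uploaded.net"
    }
}
```

String splitting like this is brittle: it breaks if the page writes <br/> or <BR>, or if an earlier img tag appears in the source.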

Find element with xpath in c# selenium webdriver

I have this markup in my page source. I can find the h2 element by its inner text (yahoo), and I want to get the nearest article element that contains that h2.
<article>
<div></div>
<div></div>
<header>
<a></a>
<a>
<h2>yahoo</h2>
</a>
</header>
</article>
The XPath I wrote is: //h2[text()='yahoo']//..//..
but it doesn't work.

How about this xpath:
//h2[. = 'yahoo']/ancestor::article[1]

One possible XPath to get <article> elements containing an <h2> element whose text equals yahoo:
//article[.//h2='yahoo']

If you want to interact with the article, use the following expression:
//article[header//h2[text()='yahoo']]

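Please try this XPath: //h2[text()='yahoo']/../../..
Explanation:
1st /.. takes you to the anchor (a) tag.
2nd /.. takes you to the header tag.
3rd /.. takes you where you want to be, i.e. the article tag.
Use a single slash before each .. because //.. also walks through descendants before stepping up, so it can select more nodes than intended.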
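To check expressions like these without a browser, one option is to run them against the sample markup with System.Xml.XPath on an XDocument. This is only a sketch: it is not the Selenium API, the markup is a condensed copy of the question's sample, and Find is a hypothetical helper. The expressions use the . = 'yahoo' form of the predicate and single-slash parent steps.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class AncestorXPathExample
{
    // Condensed copy of the markup from the question.
    public const string Markup =
        "<article><div/><div/><header><a/><a><h2>yahoo</h2></a></header></article>";

    // Returns the first element matched by the expression, or null.
    public static XElement Find(string xpath) =>
        XDocument.Parse(Markup).XPathSelectElement(xpath);

    public static void Main()
    {
        string[] candidates =
        {
            "//h2[. = 'yahoo']/ancestor::article[1]", // nearest article ancestor
            "//article[.//h2 = 'yahoo']",             // article containing the h2
            "//h2[. = 'yahoo']/../../..",             // three explicit parent steps
        };

        foreach (var xpath in candidates)
        {
            // Each expression should report "article" for this markup.
            Console.WriteLine($"{xpath} -> {Find(xpath)?.Name}");
        }
    }
}
```

The ancestor:: form keeps working if the nesting depth between h2 and article changes, whereas the parent-step form depends on exactly three levels.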

Extract content with XPath?

I have html content that I am storing as an XML document (using HTML Agility Pack). I know some XPath, but I am not able to zero into the exact content I need.
In my example below, I am trying to extract the "src" and "alt" text from the large image. This is my example:
<html>
<body>
....
<div id="large_image_display">
<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
</div>
....
<div id="small_image_display">
<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
</div>
</body>
</html>
What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got but it is wrong. Mostly pseudo code at this point:
\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt
Any help in getting this right would be greatly appreciated.

The following XPath will get you to the src attributes for the img tags:
'//html/body/div/img[@class="photo"]/@src'
And similarly this will get you to the alt attributes:
'//html/body/div/img[@class="photo"]/@alt'
From there you can get to the attribute text. If you want to only find the ones that match 'large_image_display' then you would filter it further like this:
'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'

Use the following XPath expressions:
/html/body/div[@id='large_image_display']/img/@src
and
/html/body/div[@id='large_image_display']/img/@alt
Always try to avoid using the // abbreviation, because it may result in very inefficient evaluation (causes the whole (sub)tree to be scanned).
In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
Your major problem was that in your expressions you were using \ and \\ and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation.
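As a quick way to try these expressions, here is a sketch that evaluates them with System.Xml.XPath against a condensed copy of the sample. It assumes the markup is well-formed XML and uses XDocument as a stand-in, so it does not show HTML Agility Pack's own API; AttributeValue is a hypothetical helper.

```csharp
using System;
using System.Collections;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class AttributeXPathExample
{
    // Condensed copy of the markup from the question.
    public const string Markup = @"<html><body>
<div id=""large_image_display""><img class=""photo"" src=""images/KC0763_l.jpg"" alt=""Circles t-shirt - Navy"" /></div>
<div id=""small_image_display""><img class=""photo"" src=""images/KC0763_s.jpg"" alt=""Circles t-shirt - Navy"" /></div>
</body></html>";

    // Evaluates an XPath that selects attributes and returns the first value.
    public static string AttributeValue(string xpath)
    {
        // For a node-set result, XPathEvaluate returns an enumerable of nodes.
        var result = (IEnumerable)XDocument.Parse(Markup).XPathEvaluate(xpath);
        return result.Cast<XAttribute>().First().Value;
    }

    public static void Main()
    {
        Console.WriteLine(AttributeValue("/html/body/div[@id='large_image_display']/img/@src"));
        // prints "images/KC0763_l.jpg"
        Console.WriteLine(AttributeValue("/html/body/div[@id='large_image_display']/img/@alt"));
        // prints "Circles t-shirt - Navy"
    }
}
```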

How can I extract things from Divs using HTMLAgilityPack?

I'm learning how to use the library for the first time and would like some help.
Suppose I have this somewhere in my HtmlDocument:
<h1>Casablanca
<span>(2010) <span class="pro-link">More at <strong>IMDbPro</strong> »</span><span class="title-extra"></span></span>
</h1>
How can I extract just the Casablanca text, not the contents of the span?
Also, am I correct in thinking that the HtmlNode.InnerText is the text inside of a Div?

Well, there will be a text node as the first child of the h1 node.
Try YourH1Node.FirstChild.InnerText.Trim() or something similar; the Trim() removes the line break that follows "Casablanca".
