Can't get InnerText values from a web page using Html Agility Pack XPath in C#

So this is my code, guys.
I'm trying to get the text inside a span and store it locally. I'm using Html Agility Pack and trying to retrieve the text with XPath, but the query doesn't return any nodes and the result is null.
This is the page I'm trying to get the text from: https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7
Specifically, the "Denominación o razón social" text.
namespace ObtencionDatosSatBeta
{
public partial class Form1 : Form
{
DataTable table;
public Form1()
{
InitializeComponent();
}
private void InitTable()
{
table = new DataTable("tabladedatosTable");
table.Columns.Add("Variable", typeof(string));
table.Columns.Add("Contenido", typeof(string));
//table.Rows.Add("Super Mario 64", "84%");
tabladedatos.DataSource = table;
}
private async void Form1_Load(object sender, EventArgs e)
{
InitTable();
HtmlWeb web = new HtmlWeb();
var doc = await Task.Factory.StartNew(() => web.Load("https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7"));
var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
var innerTexts = nodes.Select(node => node.InnerText);
}
private void tabladedatos_CellContentClick(object sender, DataGridViewCellEventArgs e)
{
}
}
}
Any idea?
var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
The line of code above is the one that appears as null.

Use this XPath, which gets the first span under the element with the following ID: ubicacionForm:j_idt10:0:j_idt11:j_idt14_data
(//*[@id='ubicacionForm:j_idt10:0:j_idt11:j_idt14_data']//span)[1]
You can work out a selector in several different ways by copying the HTML from Chrome DevTools (Ctrl + Shift + J on Windows/Linux, Cmd + Option + J on macOS)
and then pasting the HTML into Xpather (Xpather.com), where you can play around with your XPath.
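As a minimal sketch of the idea above, the following shows the XPath with the `@id` attribute test and a guard for the case where nothing matches (Html Agility Pack's SelectNodes returns null rather than an empty collection). The helper name GetSpanTexts, the sample markup, and the company name are assumptions made for illustration; the real page would be loaded with HtmlWeb as in the question, and the generated JSF IDs on the live page may differ from the one used here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanTextExtractor
{
    // Hypothetical helper: returns the trimmed text of every <span>
    // found under the element whose id equals elementId.
    public static List<string> GetSpanTexts(string html, string elementId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath attribute tests use '@'; the id must match exactly,
        // so no spaces may appear after the colons.
        var nodes = doc.DocumentNode.SelectNodes($"//*[@id='{elementId}']//span");

        // SelectNodes returns null (not an empty list) when nothing matches.
        if (nodes == null)
            return new List<string>();

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample markup, standing in for the downloaded page.
        const string html =
            "<table><tbody id=\"ubicacionForm:j_idt10:0:j_idt11:j_idt14_data\">" +
            "<tr><td><span>Denominación o razón social:</span></td>" +
            "<td><span>EXAMPLE SA DE CV</span></td></tr>" +
            "</tbody></table>";

        var texts = SpanTextExtractor.GetSpanTexts(
            html, "ubicacionForm:j_idt10:0:j_idt11:j_idt14_data");

        foreach (var text in texts)
            Console.WriteLine(text);
    }
}
```

Returning an empty list instead of null keeps a later `Select` or `foreach` from throwing when the ID on the page has changed.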

Related

I wish to replace text in a web page in C#

Here is the code that I am using
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
IHTMLDocument2 doc2 = webBrowser1.Document.DomDocument as IHTMLDocument2;
StringBuilder html = new StringBuilder(doc2.body.outerHTML);
String substitution = "<span style='background-color: rgb(255, 255, 0);'> sensor </span>";
html.Replace("sensor", substitution);
doc2.body.innerHTML = html.ToString();
}
It works, but afterwards I cannot use the form or the web browser.
I have tried adding
webBrowser1.Document.Write(html.ToString()); //after doc2 at the end
But the web page displayed is not formatted correctly.
I would be grateful for help getting this fixed.

You first need to find your element in the HTMLDocument DOM and then set its innerHTML property to the relevant HTML.
There are a variety of ways to do this, including injecting JavaScript or using HtmlAgilityPack.
The following code uses the GetElementsByTagName DOM function to iterate over the span tags in the document on this site: https://www.w3schools.com/html/
It replaces the content of every span whose text contains "Tutorial" with the HTML snippet you provided.
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var elements = webBrowser1.Document.GetElementsByTagName("span");
foreach (HtmlElement element in elements)
{
if(string.IsNullOrEmpty(element.InnerText))
continue;
if (element.InnerText.Contains("Tutorial"))
{
element.InnerHtml = "<span style='background-color: rgb(255, 255, 0);'> sensor </span>";
}
}
}

Selenium C# - how to use element by "nobr"

I have this HTML code:
<nobr class="ms-crm-Form-Title-Data autoellipsis">
Text - Some Text
I would like to get the text value with the Selenium driver. How can I do that? I have tried with a CssSelector:
[FindsBy(How = How.CssSelector, Using = "nobr[class = ms-crm-Form-Title-Data autoellipsis]")]
public IWebElement ApplicationNumberLabel { get; set; }
but I'm getting a "Could not find element" error.
Thanks for your help.

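If you want to find an element with Selenium in C#, you do it like this:
IWebElement element = driver.FindElement(By.Id("TheId"));
In this case By.Id is the selector; XPath, CSS, tag name, ID, and several other selector types are also available. To match on multiple CSS classes you can use XPath on the full class attribute, as in the code that follows, or a CSS selector that chains the classes, such as nobr.ms-crm-Form-Title-Data.autoellipsis.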
IWebDriver driver = new FirefoxDriver();
private void button1_Click(object sender, EventArgs e)
{
driver.Navigate().GoToUrl("file:///C:/Users/notmyname/Desktop/test.html");
IWebElement element = driver.FindElement(By.XPath("//nobr[@class='ms-crm-Form-Title-Data autoellipsis']"));
textBox1.Text = element.Text;
}

Get links of a html document in order

I want to get all the links of an HTML document. This isn't a problem, but apparently it puts all the links in alphabetical order before storing them in an array one by one. I want to have the links in their original order (not alphabetical).
So is there any way to get the first link found, store it, then the second one, and so on? I already tried using HtmlAgilityPack and the WebBrowser control methods, but both order them alphabetically. The original order is important for later purposes.
I heard that it might be possible with Regex, but I've found plenty of answers saying that you shouldn't use it for HTML parsing. So how can I do it?
Here's the WebBrowser control code I tried to use to get the links and store them in an array:
private void btnGet_Click(object sender, EventArgs e)
{
HtmlWindow mainFrame = webFl.Document.Window.Frames["mainFrame"];
HtmlElementCollection links = mainFrame.Document.Links;
foreach (HtmlElement link in links)
{
string linkText = link.OuterHtml;
if (linkText.Contains("puzzle"))
{
arr[i] = linkText;
i++;
}
}
}
Thank you in advance,
Opak

You can get the correct order by walking the DOM tree using HTML DOM API. The following code does this. Note, I use dynamic to access DOM API. That's because WebBrowser's HtmlElement.FirstChild/HtmlElement.NextSibling don't work for this purpose, as they return null for DOM text nodes.
private void btnGet_Click(object sender, EventArgs e)
{
Action<object> walkTheDom = null;
var links = new List<object>();
// element.FirstChild / NextSibling don't work as they stop at DOM text nodes
walkTheDom = (element) =>
{
dynamic domElement = element;
if (domElement.tagName == "A")
links.Add(domElement);
for (dynamic child = domElement.firstChild; child != null; child = child.nextSibling)
{
if (child.nodeType == 1) // Element node?
walkTheDom(child);
}
};
walkTheDom(this.webBrowser.Document.Body.DomElement);
string html = links.Aggregate(String.Empty, (a, b) => a + ((dynamic)b).outerHtml + Environment.NewLine);
MessageBox.Show(html);
}
[UPDATE] If you really need to get a list of HtmlElement objects for <A> tags, instead of dynamic native elements, that's still possible with a little trick using GetElementById:
private void btnGet_Click(object sender, EventArgs e)
{
// element.FirstChild / NextSibling don't work because they stop on DOM text nodes
var links = new List<HtmlElement>();
var document = this.webBrowser.Document;
dynamic domDocument = document.DomDocument;
Action<dynamic> walkTheDom = null;
walkTheDom = (domElement) =>
{
if (domElement.tagName == "A")
{
// get HtmlElement for the found <A> tag
string savedId = domElement.id;
string uniqueId = domDocument.uniqueID;
domElement.id = uniqueId;
links.Add(document.GetElementById(uniqueId));
if (savedId != null)
domElement.id = savedId;
else
domElement.removeAttribute("id");
}
for (var child = domElement.firstChild; child != null; child = child.nextSibling)
{
if (child.nodeType == 1) // is an Element node?
walkTheDom(child);
}
};
// walk the DOM for <A> tags
walkTheDom(domDocument.body);
// show the found tags
string combinedHtml = links.Aggregate(String.Empty, (html, element) => html + element.OuterHtml + Environment.NewLine);
MessageBox.Show(combinedHtml);
}

Saving Xml to a Document C#

Unlike what I've been able to find on here, I want to maintain the syntax within my XML document, and serialization doesn't touch on that. I want to be able to add another "task" tag to the XML document. Loading the information isn't a problem, I've had to deal with that before, but this is.
Main Program:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;
using System.IO;
namespace ToDoList
{
public partial class Form1 : Form
{
string title; //the variable for the title textbox value to be stored in
string details; //the variable for the details textbox value to be stored in
string itemstr; //the variable for title and details to be merged in
public Form1()
{
InitializeComponent();
}
public void Form1_Load(object sender, EventArgs e)
{
optionsbtn.Text = "Options"; //make the options button's text options
var items = ToDochkbox.Items; //create a private "var" items symbolizing the Checkbox's items array
XDocument xmlDoc = XDocument.Load("tasksdoc.xml"); //load the xml document (in bin or release)
var q = from c in xmlDoc.Descendants("root") //go "within" the <root> </root> tag in the file
select (string)c.Element("task"); //find the first <task></task> tag
foreach (string N in q) //now cycle through all the <task></task> tags and per cycle save them to string "N"
{
items.Add(N); //add the item to the checkbox list
}
}
public void addbtn_Click(object sender, EventArgs e)
{
var items = ToDochkbox.Items; //create a private "var" items symbolizing the Checkbox's items array
title = Addtb.Text; //set the title string to equal the title textbox's contents
details = detailstb.Text; //set the details string to equal the detail textbox's contents
itemstr = title +" - " + details; //set a variable to equal the title string, a - with spaces on each end, and then the details string
items.Add(itemstr); //add the variable itemstr (above) to the the checkbox list
}
private void optionsbtn_Click(object sender, EventArgs e)
{
new options().Show();//show the options form
}
private void aboutToolStripMenuItem_Click(object sender, EventArgs e)
{
new options().Show();//show the options form
}
private void saveToolStripMenuItem_Click(object sender, EventArgs e)
{
}
public void loadToolStripMenuItem_Click(object sender, EventArgs e)
{
optionsbtn.Text = "Options"; //make the options button's text options
var items = ToDochkbox.Items; //create a private "var" items symbolizing the Checkbox's items array
XDocument xmlDoc = XDocument.Load("tasksdoc.xml"); //load the xml document (in bin or release)
var q = from c in xmlDoc.Descendants("root") //go "within" the <root> </root> tag in the file
select (string)c.Element("task"); //find the first <task></task> tag
foreach (string N in q) //now cycle through all the <task></task> tags and per cycle save them to string "N"
{
items.Add(N); //add the item to the checkbox list
}
}
}
}
And My XML Document:
<root>
<task>First Task - Create a Task</task>
</root>

The class that you could use to serialize:
public class MyClass
{
[XmlElement("task")]
public List<string> Tasks { get; set; }
}
Placing the XmlElementAttribute on a collection type will cause each element to be serialized without being wrapped in a node for the list.
Xml with XmlElementAttribute:
<root>
<task>First Task - Create a Task</task>
<task>SecondTask - Create a Task</task>
<task>ThirdTask - Create a Task</task>
</root>
Xml without XmlElementAttribute:
<root>
<Tasks>
<Task>First Task - Create a Task</Task>
<Task>SecondTask - Create a Task</Task>
<Task>ThirdTask - Create a Task</Task>
</Tasks>
</root>
I answered another question about serializing lists in a similar way a few days ago. Check out that question and then the answer; it might be what you are trying to do.
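If serialization is more than is needed, one alternative (not the approach from the answer above, just a sketch using the XDocument class the question already loads) is to append a new task element to the root and save the document back. The helper name AddTask and the second task's text are assumptions for illustration; the file name tasksdoc.xml comes from the question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TaskFile
{
    // Hypothetical helper: appends <task>text</task> as the last child of the root element.
    public static void AddTask(XDocument doc, string text)
    {
        doc.Root.Add(new XElement("task", text));
    }
}

public static class Program
{
    public static void Main()
    {
        // Parsed from a string here so the sketch is self-contained;
        // in the application this would be XDocument.Load("tasksdoc.xml").
        var xmlDoc = XDocument.Parse(
            "<root><task>First Task - Create a Task</task></root>");

        TaskFile.AddTask(xmlDoc, "Second Task - Save the list");

        // Elements("task") enumerates every <task> child in document order,
        // whereas Element("task") returns only the first one.
        foreach (string task in xmlDoc.Root.Elements("task").Select(t => (string)t))
            Console.WriteLine(task);

        // To persist the change, write the document back to disk:
        // xmlDoc.Save("tasksdoc.xml");
    }
}
```

In the form from the question, the AddTask call would go in addbtn_Click and the Save call in saveToolStripMenuItem_Click.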

C# - XmlNodeList - Getting inner xml/text between description tags without HTML

Right now I've got a list box that shows RSS article titles/urls of an RSS feed. The title and URL extraction were no problem, but now I'm trying to have the description appear in a rich text box whenever the article title is selected in the list box. I can successfully get the description to show up in the text box, but it's always followed by a bunch of extra html. Example:
There's a silly rumor exploding on the Internet this weekend, alleging that Facebook is shutting down on March 15 because CEO Mark Zuckerberg "wants his old life back," and desires to "put an end to all the madness."<div class="feedflare">
<img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:gIN9vFwOqvQ" border="0"></img>
Code:
private void button1_Click(object sender, EventArgs e)
{
{
XmlTextReader rssReader = new XmlTextReader(txtUrl.Text);
XmlDocument rssDoc = new XmlDocument();
rssDoc.Load(rssReader);
XmlNodeList titleList = rssDoc.GetElementsByTagName("title");
XmlNodeList urlList = rssDoc.GetElementsByTagName("link");
descList = rssDoc.GetElementsByTagName("description");
for (int i = 0; i < titleList.Count; i++)
{
lvi = rowNews.Items.Add(titleList[i].InnerXml);
lvi.SubItems.Add(urlList[i].InnerXml);
}
}
}
private void rowNews_SelectedIndexChanged(object sender, EventArgs e)
{
if (rowNews.SelectedIndices.Count <= 0)
{
return;
}
int intselectedindex = rowNews.SelectedIndices[0]; // Get index of article title
txtDesc.Text=(descList[intselectedindex].InnerText);
// Get description array index that matched list index
}

You can strip the HTML using the approach from "Using C# regular expressions to remove HTML tags".
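A minimal sketch of that regex approach, assuming the description is simple, well-behaved markup like the feed example in the question: remove anything that looks like a tag, then decode HTML entities. The helper name StripHtml and the sample string are made up for illustration. A regex is not a general HTML parser, so this will misbehave on input such as a literal '<' inside the text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes tag-like substrings and decodes entities.
    public static string StripHtml(string input)
    {
        // Drop everything from '<' up to the next '>'.
        string withoutTags = Regex.Replace(input, "<[^>]*>", string.Empty);

        // Turn entities such as &amp; or &quot; back into characters.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample, shaped like the feed description in the question.
        string description =
            "Facebook is not shutting down.<div class=\"feedflare\">" +
            "<img src=\"http://example.com/a.gif\" border=\"0\"></img></div>";

        Console.WriteLine(HtmlText.StripHtml(description));
    }
}
```

In the question's code, the call would wrap the existing value: txtDesc.Text = HtmlText.StripHtml(descList[intselectedindex].InnerText);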

You can use InnerText instead of InnerXml. This returns only the text content of your child nodes, without any XML markup.
