Here is the code that I am using:
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
IHTMLDocument2 doc2 = webBrowser1.Document.DomDocument as IHTMLDocument2;
StringBuilder html = new StringBuilder(doc2.body.outerHTML);
String substitution = "<span style='background-color: rgb(255, 255, 0);'> sensor </span>";
html.Replace("sensor", substitution);
doc2.body.innerHTML = html.ToString();
}
It works, but afterwards I cannot interact with the form or the web browser.
I have tried adding
webBrowser1.Document.Write(html.ToString()); //after doc2 at the end
but then the web page is not formatted correctly.
I would be grateful for help getting this fixed.
You first need to find your element in the HtmlDocument DOM and then set its InnerHtml property to the relevant HTML.
There are a variety of ways to do this, including injecting JavaScript or using HtmlAgilityPack.
The following code uses the GetElementsByTagName DOM function to iterate over the span tags in the document on this site: https://www.w3schools.com/html/
It replaces the content of every span whose text includes "Tutorial" with the HTML snippet you provided.
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var elements = webBrowser1.Document.GetElementsByTagName("span");
foreach (HtmlElement element in elements)
{
if(string.IsNullOrEmpty(element.InnerText))
continue;
if (element.InnerText.Contains("Tutorial"))
{
element.InnerHtml = "<span style='background-color: rgb(255, 255, 0);'> sensor </span>";
}
}
}
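If the goal is to highlight a word wherever it occurs in the page text, rather than to replace whole spans, one option is to rewrite only the text between tags so that attribute values and tag names are left alone. The following is a minimal sketch under that assumption. The Highlighter class and its Highlight method are hypothetical names, not part of the WebBrowser API. The tag pattern is deliberately naive: it also rewrites text inside script and style blocks, and the match is case-sensitive.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Wraps every occurrence of `term` found in text runs (outside of tags)
    // in a yellow-background span. Hypothetical helper for illustration.
    public static string Highlight(string html, string term)
    {
        const string open = "<span style='background-color: rgb(255, 255, 0);'>";
        const string close = "</span>";

        // Splitting on a capturing group keeps the tags in the result,
        // so the array alternates: text, tag, text, tag, ...
        string[] parts = Regex.Split(html, "(<[^>]*>)");

        for (int i = 0; i < parts.Length; i += 2) // even indexes are text runs
        {
            parts[i] = Regex.Replace(
                parts[i],
                Regex.Escape(term),
                m => open + m.Value + close);
        }

        return string.Concat(parts);
    }
}
```

Inside DocumentCompleted this could be applied as webBrowser1.Document.Body.InnerHtml = Highlighter.Highlight(webBrowser1.Document.Body.InnerHtml, "sensor"); note that it reads and writes InnerHtml, so the body element is not nested inside itself.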
So this is my code.
I'm trying to get the text inside a span and store it locally. I'm using Html Agility Pack and trying to retrieve the text using XPath, but the query doesn't retrieve any nodes and the result is null.
This is the page I'm trying to get the text from: https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7
Specifically the "Denominación o razón social" text.
namespace ObtencionDatosSatBeta
{
public partial class Form1 : Form
{
DataTable table;
public Form1()
{
InitializeComponent();
}
private void InitTable()
{
table = new DataTable("tabladedatosTable");
table.Columns.Add("Variable", typeof(string));
table.Columns.Add("Contenido", typeof(string));
//table.Rows.Add("Super Mario 64", "84%");
tabladedatos.DataSource = table;
}
private async void Form1_Load(object sender, EventArgs e)
{
InitTable();
HtmlWeb web = new HtmlWeb();
var doc = await Task.Factory.StartNew(() => web.Load("https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7"));
var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
var innerTexts = nodes.Select(node => node.InnerText);
}
private void tabladedatos_CellContentClick(object sender, DataGridViewCellEventArgs e)
{
}
}
}
Any idea?
var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
The line of code above is the one that returns null.
Use this XPath, which gets the first span under the element with the following ID: ubicacionForm:j_idt10:0:j_idt11:j_idt14_data
(//*[@id='ubicacionForm:j_idt10:0:j_idt11:j_idt14_data']//span)[1]
You can select the element in several different ways by copying the HTML from Chrome's developer tools (Ctrl + Shift + J on Windows, Cmd + Option + J on macOS).
Then paste the HTML into Xpather (Xpather.com), where you can experiment with your XPath.
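To check an expression like this locally, the same XPath can be run against a small, well-formed fragment. The sketch below uses System.Xml's XmlDocument as a stand-in for HtmlAgilityPack so that it needs no external package. The XPathDemo class and the sample markup are made up for illustration, and the real page's structure may differ. Keep in mind that HtmlAgilityPack's SelectNodes returns null when nothing matches, so a null check is needed before calling Select on the result.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Returns the text of the first span under the element with the given id,
    // or null when no such span exists. Hypothetical helper for illustration.
    public static string FirstSpanText(string xml, string id)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // @id selects the id attribute; (...)[1] takes the first match overall.
        XmlNode node = doc.SelectSingleNode("(//*[@id='" + id + "']//span)[1]");
        return node?.InnerText;
    }
}
```

For example, calling FirstSpanText on a table body whose id is ubicacionForm:j_idt10:0:j_idt11:j_idt14_data and which contains two spans yields the text of the first one.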
I want to get all the links of an HTML document. This isn't a problem in itself, but apparently the links are sorted alphabetically before they are stored in an array one by one. I want to have the links in their original order (not alphabetical).
So is there any way to get the first link found, store it, then the second one, and so on? I already tried using HtmlAgilityPack and the WebBrowser control methods, but both order them alphabetically. The original order is important for later purposes.
I heard that it might be possible with Regex, but I've found plenty of answers saying that you shouldn't use it for HTML parsing. So how can I do it?
Here's the Webbrowser-Control code, I tried to use to get the links and store them into an array:
private void btnGet_Click(object sender, EventArgs e)
{
HtmlWindow mainFrame = webFl.Document.Window.Frames["mainFrame"];
HtmlElementCollection links = mainFrame.Document.Links;
foreach (HtmlElement link in links)
{
string linkText = link.OuterHtml;
if (linkText.Contains("puzzle"))
{
arr[i] = linkText;
i++;
}
}
}
Thank you in advance,
Opak
You can get the correct order by walking the DOM tree using HTML DOM API. The following code does this. Note, I use dynamic to access DOM API. That's because WebBrowser's HtmlElement.FirstChild/HtmlElement.NextSibling don't work for this purpose, as they return null for DOM text nodes.
private void btnGet_Click(object sender, EventArgs e)
{
Action<object> walkTheDom = null;
var links = new List<object>();
// element.FirstChild / NextSibling don't work as they stop at DOM text nodes
walkTheDom = (element) =>
{
dynamic domElement = element;
if (domElement.tagName == "A")
links.Add(domElement);
for (dynamic child = domElement.firstChild; child != null; child = child.nextSibling)
{
if (child.nodeType == 1) // Element node?
walkTheDom(child);
}
};
walkTheDom(this.webBrowser.Document.Body.DomElement);
string html = links.Aggregate(String.Empty, (a, b) => a + ((dynamic)b).outerHtml + Environment.NewLine);
MessageBox.Show(html);
}
[UPDATE] If you really need to get a list of HtmlElement objects for <A> tags, instead of dynamic native elements, that's still possible with a little trick using GetElementById:
private void btnGet_Click(object sender, EventArgs e)
{
// element.FirstChild / NextSibling don't work because they stop on DOM text nodes
var links = new List<HtmlElement>();
var document = this.webBrowser.Document;
dynamic domDocument = document.DomDocument;
Action<dynamic> walkTheDom = null;
walkTheDom = (domElement) =>
{
if (domElement.tagName == "A")
{
// get HtmlElement for the found <A> tag
string savedId = domElement.id;
string uniqueId = domDocument.uniqueID;
domElement.id = uniqueId;
links.Add(document.GetElementById(uniqueId));
if (savedId != null)
domElement.id = savedId;
else
domElement.removeAttribute("id");
}
for (var child = domElement.firstChild; child != null; child = child.nextSibling)
{
if (child.nodeType == 1) // is an Element node?
walkTheDom(child);
}
};
// walk the DOM for <A> tags
walkTheDom(domDocument.body);
// show the found tags
string combinedHtml = links.Aggregate(String.Empty, (html, element) => html + element.OuterHtml + Environment.NewLine);
MessageBox.Show(combinedHtml);
}
Hi, I am working on a data scraping application in C#.
I want to get all the displayed text but not the HTML tags.
Here's my code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;
This InnerText is returning some tags and scripts as well, but I only want the display text that's visible to the user.
Please help me.
Thanks
I believe this will solve your problem:
Method 1 – In Memory Cut and Paste
Use WebBrowser control object to process the web page, and then copy the text from the control…
Use the following code to download the web page:
//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed
wb.DocumentCompleted +=
new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = urlPath;
Use the following event code to process the downloaded web page text:
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb = (WebBrowser)sender;
wb.Document.ExecCommand("SelectAll", false, null);
wb.Document.ExecCommand("Copy", false, null);
textResultsBox.Text = CleanText(Clipboard.GetText());
}
Method 2 – In Memory Selection Object
This is a second method of processing the downloaded web page text. It seems to take just a bit longer (very minimal difference). However, it avoids using the clipboard and the limitations associated with that.
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{ //Create the WebBrowser control and IHTMLDocument2
WebBrowser wb = (WebBrowser)sender;
IHTMLDocument2 htmlDocument =
wb.Document.DomDocument as IHTMLDocument2;
//Select all the text on the page and create a selection object
wb.Document.ExecCommand("SelectAll", false, null);
IHTMLSelectionObject currentSelection = htmlDocument.selection;
//Create a text range and send the range's text to your text box
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
textResultsBox.Text = range.text;
}
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.
The XmlDocument object will load / process HTML files with only 3 simple lines of code:
XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com"); // the page must be well-formed XML (XHTML), otherwise Load throws
string allText = document.InnerText;
There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.
To remove javascript and css:
foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
style.Remove();
To remove comments (untested):
foreach(var comment in doc.DocumentNode.Descendants().Where(n => n.NodeType == HtmlNodeType.Comment).ToArray())
comment.Remove();
For removing all HTML tags from a string you can use (requires System.Text.RegularExpressions):
string output = Regex.Replace(inputString, "<[^>]*>", "");
For removing a specific tag:
string output = Regex.Replace(inputString, "(?i)<td[^>]*>", "");
Hope it helps :)
I have the following code that I managed to come up with:
private void button1_Click(object sender, EventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (var o = new OpenFileDialog())
{
if (o.ShowDialog() == DialogResult.OK)
doc.Load(o.FileName);
}
foreach (HtmlAgilityPack.HtmlAttribute att in doc.DocumentNode.Attributes)
{
label1.Text += Environment.NewLine +
att.Name + " " + att.Value;
}
}
But it's not doing anything. There are no errors, no exceptions, and it compiles and runs. But, as you can see, from inside the foreach loop, it is supposed to keep adding found attributes and their values to the label1.Text control, but it isn't. Nothing happens!
Am I doing something wrong? Can someone please help?
Thank you
By iterating over doc.DocumentNode.Attributes, you are trying to get the attributes of the root node (DocumentNode), which is only a placeholder containing your <html> tag (and possibly some adjacent nodes like comments and white space). The root node has no attributes of its own, so the loop body never runs.
What are you trying to extract exactly?
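If the intent is to list the attributes of every element in the file, one possible approach is to walk the descendants of DocumentNode and read each element's Attributes collection. This is a minimal sketch assuming the HtmlAgilityPack NuGet package is referenced. The AttributeLister class and the output format are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AttributeLister
{
    // Returns one "tag: name = value" line per attribute found on any element,
    // in document order. Hypothetical helper for illustration.
    public static List<string> ListAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Skip text and comment nodes; only elements carry attributes.
            if (node.NodeType != HtmlNodeType.Element)
                continue;

            foreach (HtmlAttribute att in node.Attributes)
                lines.Add(node.Name + ": " + att.Name + " = " + att.Value);
        }
        return lines;
    }
}
```

In the button handler, the result could be shown with label1.Text = string.Join(Environment.NewLine, AttributeLister.ListAttributes(System.IO.File.ReadAllText(o.FileName)));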
Right now I've got a list box that shows RSS article titles/urls of an RSS feed. The title and URL extraction were no problem, but now I'm trying to have the description appear in a rich text box whenever the article title is selected in the list box. I can successfully get the description to show up in the text box, but it's always followed by a bunch of extra html. Example:
There's a silly rumor exploding on the Internet this weekend, alleging that Facebook is shutting down on March 15 because CEO Mark Zuckerberg "wants his old life back," and desires to "put an end to all the madness."<div class="feedflare">
<img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:gIN9vFwOqvQ" border="0"></img>
Code:
private void button1_Click(object sender, EventArgs e)
{
{
XmlTextReader rssReader = new XmlTextReader(txtUrl.Text);
XmlDocument rssDoc = new XmlDocument();
rssDoc.Load(rssReader);
XmlNodeList titleList = rssDoc.GetElementsByTagName("title");
XmlNodeList urlList = rssDoc.GetElementsByTagName("link");
descList = rssDoc.GetElementsByTagName("description");
for (int i = 0; i < titleList.Count; i++)
{
lvi = rowNews.Items.Add(titleList[i].InnerXml);
lvi.SubItems.Add(urlList[i].InnerXml);
}
}
}
private void rowNews_SelectedIndexChanged(object sender, EventArgs e)
{
if (rowNews.SelectedIndices.Count <= 0)
{
return;
}
int intselectedindex = rowNews.SelectedIndices[0]; // Get index of article title
txtDesc.Text=(descList[intselectedindex].InnerText);
// Get description array index that matched list index
}
You can strip the HTML using the approach from "Using C# regular expressions to remove HTML tags".
You can use InnerText instead of InnerHtml. This returns only the text content of the child nodes, without any XML markup.
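In a feed like the one in the question, the description usually arrives as escaped HTML, so InnerText can still contain tags such as the feedflare div. A regex-based cleanup along the lines suggested above might look like the following sketch. The FeedText class and ToPlainText method are hypothetical names, and the pattern is a simple tag stripper, not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedText
{
    // Removes tags, decodes HTML entities, and collapses whitespace.
    // Hypothetical helper for illustration.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words do not run together.
        string withoutTags = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &quot; and &amp; back into characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

The selection handler would then assign txtDesc.Text = FeedText.ToPlainText(descList[intselectedindex].InnerText);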