How to parse a dynamically updating webpage in C#

I am trying to parse the number shown on this page:
https://www.edf.org/embed/methane-counters
I have tried WebBrowser, WebClient ... etc., with no good result. Every time I try something new, the HTML returned contains this (the area where the number is shown):
<strong id="methane"></strong>
... as you see, there is no number between the <strong> tags. Just in case, this is the latest code I have tried, which still does not work:
using (WebBrowser myWebBrowser = new WebBrowser()) {
    myWebBrowser.ScriptErrorsSuppressed = true;
    myWebBrowser.Navigate(myURL);
    while (myWebBrowser.ReadyState != WebBrowserReadyState.Complete)
        Application.DoEvents();
    myContent = myWebBrowser.Document.Body.InnerHtml;
    myContent = myWebBrowser.DocumentText;
}
... neither of the last two calls returns the HTML with the number in it.
Any ideas on how to get the proper content of this page?
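The counter on that page is filled in by JavaScript after the page loads, so anything that only fetches the raw HTML (WebClient, HttpClient) or reads the DOM before the scripts have run will see an empty <strong> element. One way around this is to drive a real browser and wait for the value to appear. Below is a minimal sketch using Selenium WebDriver; the packages (Selenium.WebDriver, Selenium.Support, plus a chromedriver on the PATH) and the 15-second timeout are assumptions, not part of the original question:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class MethaneScraper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://www.edf.org/embed/methane-counters");

            // Poll until the page's scripts have written a value into the element;
            // Until() keeps retrying as long as the lambda returns null.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            string number = wait.Until(d =>
            {
                string text = d.FindElement(By.Id("methane")).Text;
                return string.IsNullOrWhiteSpace(text) ? null : text;
            });

            Console.WriteLine(number);
        }
    }
}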

Related

How to select an element based on ID and get all the InnerHtml using HTML Agility Pack

I need to pull part of the HTML from an external URL into another page using Agility Pack. I am not sure if I can select a node/element based on id or class name using Agility Pack. So far I have managed to pull the complete page, but I want to target a node/element with a specific id and all its contents.
protected void WebScrapper()
{
    HtmlDocument doc = new HtmlDocument();
    var url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
    var webGet = new HtmlWeb();
    doc = webGet.Load(url);
    var baseUrl = new Uri(url);
    //doc.LoadHtml(doc);
    Response.Write(doc.DocumentNode.InnerHtml);
    //Response.Write(doc.DocumentNode.Id("acceptance-list-container"));
    //var innerContent = doc.DocumentNode.SelectNodes("/div").FirstOrDefault().InnerHtml;
}
When I use Response.Write(doc.DocumentNode.Id("acceptance-list-container")) it generates an error.
When I use the code below, it generates the error System.ArgumentNullException: Value cannot be null.
doc.DocumentNode.SelectNodes("/div[@id='acceptance-list-container']").FirstOrDefault().InnerHtml;
So far nothing works; if you fix one issue, another one shows up.
The error you get indicates that the SelectNodes() call didn't find any nodes and returned null. In cases like this, it is useful to inspect the actual HTML via doc.DocumentNode.InnerHtml.
Your code sample is somewhat messy, and you are probably trying to do too many things at once (what is Response.Write() there for, for example?). Try to focus on one thing at a time if possible.
Here is a simple unit test that can get you started:
using HtmlAgilityPack;
using Xunit;
using Xunit.Abstractions;

namespace Scraping.Tests
{
    public class ScrapingTests
    {
        private readonly ITestOutputHelper _outputHelper;

        public ScrapingTests(ITestOutputHelper outputHelper)
        {
            _outputHelper = outputHelper;
        }

        [Fact]
        public void Test()
        {
            const string url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
            var webGet = new HtmlWeb();
            HtmlDocument doc = webGet.Load(url);
            string html = doc.DocumentNode.InnerHtml;
            _outputHelper.WriteLine(html); // use this if you just want to print something
            Assert.Contains("acceptance-list-container", html); // use this if you want to automate an assertion
        }
    }
}
When I tried that the first time, I got some HTML with an iframe. I visited the page in a browser and was presented with a Google captcha. After completing the captcha I was able to view the page in the browser, but the HTML in the unit test was still different from the one I got in the browser.
Interestingly enough, the HTML in the unit test contains the following:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
It is obvious that this website has security measures in place to block web scrapers. If you manage to overcome this obstacle and get the actual page's HTML in your program, parsing it and extracting the parts you need will be straightforward.
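Once you do get the real HTML, the extraction itself is short with HtmlAgilityPack. A minimal sketch, assuming a string html that already holds the real page markup (obtained by whatever means gets past the protection):

using System;
using HtmlAgilityPack;

// ...

var doc = new HtmlDocument();
doc.LoadHtml(html);

// GetElementbyId (lowercase 'b') is HtmlAgilityPack's id lookup.
HtmlNode container = doc.GetElementbyId("acceptance-list-container");
if (container != null)
{
    Console.WriteLine(container.InnerHtml);
}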

Selenium WebDriver and C# (VS): Look for a specific header string

I've been trying, without luck, to use IJavaScriptExecutor to find a specific header string in a page. Here's the HTML code from the page:
<div class="wrap">
    <h2>Edit Page <a href="http://www.webtest.bugrit.net/wordpress/wp-admin/post-new.php?post_type=page" class="add-new-h2">Add New</a></h2>
    <div id...
The text I need to check for is the "Edit Page" string.
This is the closest I've come, which isn't very close:
var element = FFDriver.Instance.FindElements(By.ClassName("add-new-h2"));
IJavaScriptExecutor js = FFDriver.Instance as IJavaScriptExecutor;
if (js != null) {
    string innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", element);
    //System.Windows.Forms.MessageBox.Show(innerHtml);
    if (innerHtml.Equals("Edit Page")) {
        return true;
    } else {
        return false;
    }
}
Now, I realize that the text I should expect to get from that code isn't the exact string "Edit Page". But shouldn't it return something? When I enable the MessageBox line, the innerHtml string is empty.
Or, of course, if someone knows another, possibly easier, way to check for the existence of a specific string inside a specific HTML tag, I'm all ears.
Your query returns the <a> element, not the <h2>, and the <a> doesn't contain the "Edit Page" string.
Try finding the parent <h2> element instead (this only works if the class name add-new-h2 is unique; otherwise you will get the first match):
var element = FFDriver.Instance.FindElement(By.XPath(".//a[@class='add-new-h2']/.."));
var containsText = element.Text.Contains("Edit Page");
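Another option, if all you need is a yes/no check, is to put the text condition into the XPath itself and match the <h2> directly. A sketch under the same FFDriver setup as above:

// Matches an <h2> whose text contains "Edit Page"; FindElements returns
// an empty collection (instead of throwing) when nothing matches.
var headers = FFDriver.Instance.FindElements(By.XPath("//h2[contains(text(), 'Edit Page')]"));
bool found = headers.Count > 0;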

Get GeckoFx firefox browser control iframe html not accessible

I am using the GeckoFX 22 C# web browser control but cannot manage to access tags within an iframe. When I check the Gecko InnerHtml, it seems that although the iframe tag shows up in the HTML, its contents do not.
This is the code I used to get the inner HTML of the browser control, which just shows the iframe tag as empty (when it should have another document inside it):
GeckoHtmlElement element = null;
var geckoDomElement = webBrowser.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
    element = (GeckoHtmlElement)geckoDomElement;
    var innerHtml = element.InnerHtml;
}
Previously I used code similar to the code below to access individual elements which works fine:
GeckoDocument checkDoc = (GeckoDocument)webBrowser.Window.Document;
var x = (checkDoc.GetElementsByTagName("a").Where(b => b.Id == "ipt-form-format-aside").First());
I am able to get individual elements and change their values/trigger events etc. without problems in the main HTML document, but it is impossible to get at the elements of anything inside an iframe. I think perhaps the iframe has not been loaded yet, or something like that. Is there a way to force the control to wait for the iframe to load before attempting to access its elements?
string content = null;
var iframe = webBrowser.Document.GetElementsByTagName("iframe").FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
if (iframe != null)
{
    var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
    if (html != null)
        content = html.OuterHtml;
}
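To address the "wait for the iframe to load" part of the question: one approach is to defer the lookup until the browser raises its DocumentCompleted event. This is a sketch, assuming GeckoFX 22's GeckoWebBrowser exposes DocumentCompleted as a plain event (check your version's exact signature):

// Only touch the iframe after the document has finished loading.
webBrowser.DocumentCompleted += (sender, e) =>
{
    var iframe = webBrowser.Document
        .GetElementsByTagName("iframe")
        .FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;

    if (iframe != null && iframe.ContentDocument != null)
    {
        var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
        if (html != null)
            Console.WriteLine(html.OuterHtml); // or store it for later use
    }
};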
I'm just posting this for anyone else that might get this problem
foreach (GeckoIFrameElement _E in geckoWebBrowser1.Document.GetElementsByTagName("iframe"))
{
    if (_E.GetAttribute("class") == "testClass")
    {
        var innerHTML = _E.ContentDocument;
        foreach (GeckoHtmlElement _A in innerHTML.GetElementsByTagName("input"))
        {
            _A.SetAttribute("value", "Test");
        }
    }
}
I had a similar problem, so I used
checkDoc.Window.Frames(1)
instead of
checkDoc.GetElementsByTagName("iframe")
The value within the parentheses (1 here) depends on your frame's index.

XPath to get youtube video c#

The code works, but I cannot get the exact link from the page, so nothing is returned; an exception is thrown because the XPath does not find anything. The page is as follows, and I need to get the video via the path in the SelectSingleNode call.
Please help me build the correct XPath to get the link of the YouTube video.
My source code:
private void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    var data = e.Document.DocumentNode.SelectSingleNode("//html/head/link")
        .Attributes["href"].Value.ToString();
    MessageBox.Show(data);
    Uri obj = new Uri(data);
    Web.Source = obj;
    Web.Visibility = Visibility.Visible;
}
If you click "Inspect Element" on this page, you will find the YouTube link the way I describe in SelectSingleNode. I just left it there so you can find the link and help me, but that string is not correct.
This code gets another link; I need to get the real link of the YouTube video. I tried this XPath string but it does not work: "//html/body/div/iframe/html/head"
There are multiple child LINK elements in the HEAD. You need to identify which one you want, similar to this: e.Document.DocumentNode.SelectSingleNode("//html/head/link[@id='someId']")
How about //iframe[contains(@src, "youtube")]//link[@rel="canonical"]?
It returns <link rel="canonical" href="http://www.youtube.com/watch?v=byp94CCWKSI"/>.
I am guessing, based on the method signature, that you are using HtmlAgilityPack... which does not retrieve the content of iframes. You will need to issue a separate request for the content of the iframe:
var hwMainPage = new HtmlWeb();
var hdMainPage = hwMainPage.Load(@"http://www.unnu.com/jason-derulo/the-other-side");

var iframeUri = hdMainPage.DocumentNode
    .SelectSingleNode("//iframe[contains(@src, \"youtube\")]")
    .Attributes["src"].Value;

var hwIframe = new HtmlWeb();
var hdIframe = hwIframe.Load(iframeUri);

var videoCanonicalUri = hdIframe.DocumentNode
    .SelectSingleNode("//link[@rel=\"canonical\"]")
    .Attributes["href"].Value;

// videoCanonicalUri == http://www.youtube.com/watch?v=byp94CCWKSI

How to feed WebBrowser control and manipulate the HTML document?

Good day.
I have a question about displaying HTML documents in a Windows Forms application. The app I'm working on should display information from the database in HTML format. I will try to describe the actions I have taken (and which failed):
1) I tried to load a "virtual" HTML page that exists only in memory and dynamically change its parameters (webbMain is a WebBrowser control):
public static string CreateBookHtml()
{
    StringBuilder sb = new StringBuilder();
    //Declaration
    sb.AppendLine(@"<?xml version=""1.0"" encoding=""utf-8""?>");
    sb.AppendLine(@"<?xml-stylesheet type=""text/css"" href=""style.css""?>");
    sb.AppendLine(@"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.1//EN"" ""http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"">");
    sb.AppendLine(@"<html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"">");
    //Head
    sb.AppendLine(@"<head>");
    sb.AppendLine(@"<title>Exemplary document</title>");
    sb.AppendLine(@"<meta http-equiv=""Content-Type"" content=""application/xhtml+xml; charset=utf-8"" />");
    sb.AppendLine(@"</head>");
    //Body
    sb.AppendLine(@"<body>");
    sb.AppendLine(@"<p id=""paragraph"">Example.</p>");
    sb.AppendLine(@"</body>");
    sb.AppendLine(@"</html>");
    return sb.ToString();
}
void LoadBrowser()
{
    this.webbMain.Navigate("about:blank");
    this.webbMain.DocumentText = CreateBookHtml();
    HtmlDocument doc = this.webbMain.Document;
}
This failed: doc.Body is null, and doc.GetElementById("paragraph") returns null too, so I cannot change the paragraph's InnerText property.
Furthermore, this.webbMain.DocumentText is "\0"...
2) I tried to create an HTML file in a specified folder, load it into the WebBrowser, and then change its parameters. The HTML is the same as created by the CreateBookHtml() method:
private void LoadBrowser()
{
    this.webbMain.Navigate("HTML\\BookPage.html");
    HtmlDocument doc = this.webbMain.Document;
}
This time this.webbMain.DocumentText contains the HTML data read from the file, but doc.Body returns null again, and I still cannot get an element using the GetElementById() method. Of course, once I have the text I could try a regex to extract specific fields, or other tricks, to achieve the goal, but I wonder: is there a simple way to manipulate the HTML? For me, the ideal way would be to create the HTML text in memory, load it into the WebBrowser control, and then dynamically change its parameters using IDs. Is it possible? Thanks for the answers in advance, best regards,
Paweł
I worked with the WebBrowser control some time ago and, like you, wanted to load HTML from memory, and I hit the same problem: the body was null. After some investigation, I noticed that the Navigate and NavigateToString methods work asynchronously, so the control needs a little time to load the document; it is not available right after the call to Navigate. So I did something like this (wbChat is the WebBrowser control):
wbChat.NavigateToString("<html><body><div>first line</div></body></html>");
DoEvents();
where DoEvents() is implemented as:
[SecurityPermissionAttribute(SecurityAction.Demand, Flags = SecurityPermissionFlag.UnmanagedCode)]
public void DoEvents()
{
    DispatcherFrame frame = new DispatcherFrame();
    Dispatcher.CurrentDispatcher.BeginInvoke(DispatcherPriority.Background,
        new DispatcherOperationCallback(ExitFrame), frame);
    Dispatcher.PushFrame(frame);
}
and it worked for me: after the DoEvents call, I could obtain a non-null body:
mshtml.IHTMLDocument2 doc2 = (mshtml.IHTMLDocument2)wbChat.Document;
mshtml.HTMLDivElement div = (mshtml.HTMLDivElement)doc2.createElement("div");
div.innerHTML = "some text";
mshtml.HTMLBodyClass body = (mshtml.HTMLBodyClass)doc2.body;
if (body != null)
{
    body.appendChild((mshtml.IHTMLDOMNode)div);
    body.scrollTop = body.scrollHeight;
}
else
    Console.WriteLine("body is still null");
I don't know if this is the right way to do it, but it fixed the problem for me; maybe it helps you too.
Later Edit:
public object ExitFrame(object f)
{
    ((DispatcherFrame)f).Continue = false;
    return null;
}
The DoEvents method is necessary on WPF. For System.Windows.Forms one can use Application.DoEvents().
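For Windows Forms specifically, an alternative to pumping messages is to handle the WebBrowser control's DocumentCompleted event and only touch the DOM after it fires. A minimal sketch reusing the CreateBookHtml() method from the question (webBrowser1 is an illustrative control name):

// WinForms: defer DOM access until the document has actually loaded.
webBrowser1.DocumentCompleted += (sender, e) =>
{
    HtmlDocument doc = webBrowser1.Document;
    HtmlElement paragraph = doc.GetElementById("paragraph");
    if (paragraph != null)
        paragraph.InnerText = "Changed from code.";
};
webBrowser1.DocumentText = CreateBookHtml();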
Another way to do the same thing is:
webBrowser1.DocumentText = "<html><body>blabla<hr/>yadayada</body></html>";
This works without any extra initialization.
