The code runs, but I cannot get the right link from the page. The XPath does not find the node I want, so instead of returning a value the code throws an exception. The page is as follows, and I need the video link located at the path passed to "SelectSingleNode".
Please help me build the correct XPath to get the link of the YouTube video.
My source code:
private void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
var data = e.Document.DocumentNode.SelectSingleNode("//html/head/link")
.Attributes["href"].Value.ToString();
MessageBox.Show(data);
Uri obj = new Uri(data);
Web.Source = obj;
Web.Visibility = Visibility.Visible;
}
If you click "Inspect Element" on this page, you will find the YouTube link at the location I describe in "SelectSingleNode". I left that path there only so you can find the link and help me; the string itself is not correct.
This code gets a different link. I need the real link of the YouTube video. I tried this XPath string, but it does not work: "//html/body/div/iframe/html/head"
There are multiple child LINK elements in the HEAD. You need to identify which one you want via an attribute, similar to this: e.Document.DocumentNode.SelectSingleNode("//html/head/link[@id='someId']")
How about //iframe[contains(@src, "youtube")]//link[@rel="canonical"]?
Returns <link rel="canonical" href="http://www.youtube.com/watch?v=byp94CCWKSI"/>.
I am guessing based off of the method signature that you are using HtmlAgilityPack... which does not retrieve the content of IFrames. You will need to issue a separate request for the content of the IFrame:
var hwMainPage = new HtmlWeb();
var hdMainPage = hwMainPage.Load(@"http://www.unnu.com/jason-derulo/the-other-side");
var iframeUri = hdMainPage.DocumentNode
.SelectSingleNode("//iframe[contains(@src, \"youtube\")]")
.Attributes["src"].Value;
var hwIframe = new HtmlWeb();
var hdIframe = hwIframe.Load(iframeUri);
var videoCanonicalUri = hdIframe.DocumentNode
.SelectSingleNode("//link[@rel=\"canonical\"]")
.Attributes["href"].Value;
// videoCanonicalUri == http://www.youtube.com/watch?v=byp94CCWKSI
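One thing to watch for: SelectSingleNode returns null when nothing matches, and an iframe's src may be relative or protocol-relative ("//host/path"), which HtmlWeb.Load cannot fetch as-is. Here is a minimal sketch of a defensive lookup, assuming HtmlAgilityPack is referenced. The class name, method name and sample markup are made up for illustration and are not taken from the real page.

```csharp
using System;
using HtmlAgilityPack;

public static class IframeLocator
{
    // Returns the absolute URI of the first iframe whose src mentions "youtube",
    // or null when the page has no such iframe. (Hypothetical helper.)
    public static Uri FindYoutubeIframe(string pageHtml, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        // SelectSingleNode returns null when nothing matches, so check before dereferencing.
        var iframe = doc.DocumentNode.SelectSingleNode("//iframe[contains(@src, 'youtube')]");
        if (iframe == null)
            return null;

        var src = iframe.GetAttributeValue("src", string.Empty);

        // Resolves relative and protocol-relative values against the page URI.
        return new Uri(pageUri, src);
    }

    public static void Main()
    {
        // Hypothetical markup, not copied from the real page.
        const string html =
            "<html><body><div><iframe src=\"//www.youtube.com/embed/byp94CCWKSI\"></iframe></div></body></html>";
        var pageUri = new Uri("http://www.unnu.com/jason-derulo/the-other-side");

        Console.WriteLine(FindYoutubeIframe(html, pageUri));
    }
}
```

The resulting URI can then be passed to a second HtmlWeb.Load call to read the canonical link from the iframe's own document.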
I need to pull part of the HTML from an external URL into another page using HtmlAgilityPack. I am not sure whether I can select a node/element by id or class name with it. So far I have managed to pull the complete page, but I want to target the node/element with a specific id, along with all its contents.
protected void WebScrapper()
{
HtmlDocument doc = new HtmlDocument();
var url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
var webGet = new HtmlWeb();
doc = webGet.Load(url);
var baseUrl = new Uri(url);
//doc.LoadHtml(doc);
Response.Write(doc.DocumentNode.InnerHtml);
//Response.Write(doc.DocumentNode.Id("acceptance-list-container"));
//var innerContent = doc.DocumentNode.SelectNodes("/div").FirstOrDefault().InnerHtml;
}
When I use Response.Write(doc.DocumentNode.Id("acceptance-list-container")), it generates an error.
When I use the code below, it throws System.ArgumentNullException: Value cannot be null.
doc.DocumentNode.SelectNodes("/div[@id='acceptance-list-container']").FirstOrDefault().InnerHtml;
So far nothing works: whenever I fix one issue, another one shows up.
The error you get indicates that the SelectNodes() call didn't find any nodes and returned null. In cases like this, it is useful to inspect the actual HTML by using doc.DocumentNode.InnerHtml.
Your code sample is somewhat messy and you are probably trying to do too many things at once (what is Response.Write() for example?). You should try to focus on one thing at a time if possible.
Here is a simple unit test that can get you started:
using HtmlAgilityPack;
using Xunit;
using Xunit.Abstractions;
namespace Scraping.Tests
{
public class ScrapingTests
{
private readonly ITestOutputHelper _outputHelper;
public ScrapingTests(ITestOutputHelper outputHelper)
{
_outputHelper = outputHelper;
}
[Fact]
public void Test()
{
const string url = @"https://www.itftennis.com/en/tournament/w15-valencia/esp/2022/w-itf-esp-35a-2022/acceptance-list/";
var webGet = new HtmlWeb();
HtmlDocument doc = webGet.Load(url);
string html = doc.DocumentNode.InnerHtml;
_outputHelper.WriteLine(html); // use this if you just want to print something
Assert.Contains("acceptance-list-container", html); // use this if you want to automate an assertion
}
}
}
When I tried that the first time, I got some HTML with an iframe. I visited the page in a browser and I was presented with a google captcha. After completing the captcha, I was able to view the page in the browser, but the HTML in the unit test was still different from the one I got in the browser.
Interestingly enough, the HTML in the unit test contains the following:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
It is obvious that this website has some security measures in place in order to block web scrapers. If you manage to overcome this obstacle and get the actual page's HTML in your program, parsing it and getting the parts that you need will be straightforward.
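For the parsing step itself, here is a minimal sketch of selecting an element by id, assuming the real HTML has already been obtained and HtmlAgilityPack is referenced. The class name, method name and sample markup are hypothetical. Note that "//div" searches the whole document, whereas "/div" only matches a div sitting directly at the document root, which is one reason the original expression found nothing.

```csharp
using System;
using HtmlAgilityPack;

public static class ContainerReader
{
    // Returns the inner HTML of the div with the given id, or null if there is none.
    // Assumes the id contains no quote characters, since it is spliced into the XPath.
    public static string GetInnerHtmlById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var node = doc.DocumentNode.SelectSingleNode("//div[@id='" + id + "']");
        return node?.InnerHtml;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the downloaded page.
        const string html =
            "<html><body><div id=\"acceptance-list-container\"><p>Player A</p></div></body></html>";

        var inner = GetInnerHtmlById(html, "acceptance-list-container");
        Console.WriteLine(inner ?? "(container not found)");
    }
}
```

Checking for null before reading InnerHtml avoids the exception from the question when the container is missing from the response.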
Is there a way to extract the user-agent string that the WebView control uses? If so, I would greatly appreciate it if anyone can give me a method to do so. Using the following does not seem to work:
var userAgent = new StringBuilder(256);
int length = 0;
UrlMkGetSessionOption(UrlMonOptionUserAgent, userAgent, userAgent.Capacity - 1, ref length, 0);
I take that back, using UrlMkGetSessionOption as mentioned in the code above does work.
I currently use this method, adapted from a method given for windows phone originally. It gives the correct result, and gets it straight from a real instance of a WebView object, so gives me more confidence in it having the correct value.
private static string s_userAgent = null;
// Get the default UserAgent which webviews use on this platform.
public async Task<string> GetUserAgent()
{
if (s_userAgent == null)
{
const string Html = @"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"">
<html>
<head>
<script language=""javascript"" type=""text/javascript"">
function notifyUA() { window.external.notify(navigator.userAgent); }
</script>
</head>
<body onload=""notifyUA();""></body>
</html>";
SemaphoreSlim signal_done = new SemaphoreSlim(0, 1);
var wv = new WebView();
wv.ScriptNotify += (sender, args) =>
{
s_userAgent = args.Value;
// set signal, to show we've done
signal_done.Release();
};
wv.NavigateToString(Html);
// wait for signal
await signal_done.WaitAsync();
Debug.WriteLine("GetUserAgent() called. User agent from WebView: \n{0}", s_userAgent);
}
return s_userAgent;
}
This started as a comment but became too long. To expand on @Rexfelis's own answer:
I've found that there can be a difference in what UrlMkGetSessionOption returns depending on where you are in the application lifecycle and if a WebView has been initialized yet in a XAML view.
If you call it before component initialization, it will be missing WebView/3.0 (at least in Windows 10); after initialization it will have that text, which results in the same string as the answer by @SimonTillson.
If you need to know the right user agent before component initialization, you have to new up a WebView and navigate before querying UrlMkGetSessionOption; e.g. var wv = new WebView(); wv.NavigateToString(...);. It seems that the user agent is modified on first navigation to include WebView/3.0.
I am using the GeckoFX 22 c# web browser control but cannot manage to access tags within an iframe. When I check the gecko innerhtml it seems that although the iframe tag shows in the html, the contents of it do not.
This is the code I used to get the inner html of the browser control which just shows the iframe tag as empty (when it should have another doc inside of it):
GeckoHtmlElement element = null;
var geckoDomElement = webBrowser.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
element = (GeckoHtmlElement)geckoDomElement;
var innerHtml = element.InnerHtml;
}
Previously I used code similar to the code below to access individual elements which works fine:
GeckoDocument checkDoc = (GeckoDocument)webBrowser.Window.Document;
var x = (checkDoc.GetElementsByTagName("a").Where(b => b.Id == "ipt-form-format-aside").First());
I am able to get individual elements and change their values, trigger events, etc. without problems in the main HTML document, but I cannot get at any elements inside an iframe. I think perhaps the iframe has not been loaded yet, or something like that. Is there a way to force the control to wait for the iframe to load before attempting to access its elements?
string content = null;
var iframe = webBrowser.Document.GetElementsByTagName("iframe").FirstOrDefault() as Gecko.DOM.GeckoIFrameElement;
if(iframe != null)
{
var html = iframe.ContentDocument.DocumentElement as GeckoHtmlElement;
if (html != null)
content = html.OuterHtml;
}
I'm just posting this for anyone else that might get this problem
foreach (GeckoIFrameElement _E in geckoWebBrowser1.Document.GetElementsByTagName("iframe"))
{
if (_E.GetAttribute("class") == "testClass")
{
var innerHTML = _E.ContentDocument;
foreach (GeckoHtmlElement _A in innerHTML.GetElementsByTagName("input"))
{
_A.SetAttribute("value", "Test");
}
}
}
I had a similar problem, so I did this:
checkDoc.Window.Frames(1)
instead of
checkDoc.GetElementsByTagName("iframe")
The value within the parentheses (i.e. 1 here) depends on the index of your frame.
Good day
I have a question about displaying HTML documents in a Windows Forms application. The app that I'm working on should display information from the database in HTML format. I will try to describe the actions that I have taken (and which failed):
1) I tried to load a "virtual" HTML page that exists only in memory and dynamically change its parameters (webbMain is a WebBrowser control):
public static string CreateBookHtml()
{
StringBuilder sb = new StringBuilder();
//Declaration
sb.AppendLine(@"<?xml version=""1.0"" encoding=""utf-8""?>");
sb.AppendLine(@"<?xml-stylesheet type=""text/css"" href=""style.css""?>");
sb.AppendLine(@"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.1//EN""
""http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"">");
sb.AppendLine(@"<html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"">");
//Head
sb.AppendLine(@"<head>");
sb.AppendLine(@"<title>Exemplary document</title>");
sb.AppendLine(@"<meta http-equiv=""Content-Type"" content=""application/xhtml+xml;
charset=utf-8"" />");
sb.AppendLine(@"</head>");
//Body
sb.AppendLine(@"<body>");
sb.AppendLine(@"<p id=""paragraph"">Example.</p>");
sb.AppendLine(@"</body>");
sb.AppendLine(@"</html>");
return sb.ToString();
}
void LoadBrowser()
{
this.webbMain.Navigate("about:blank");
this.webbMain.DocumentText = CreateBookHtml();
HtmlDocument doc = this.webbMain.Document;
}
This failed, because doc.Body is null, and doc.GetElementById("paragraph") returns null too. So I cannot change the paragraph's InnerText property.
Furthermore, this.webbMain.DocumentText is "\0"...
2) I tried to create an HTML file in a specified folder, load it into the WebBrowser and then change its parameters. The HTML is the same as that created by the CreateBookHtml() method:
private void LoadBrowser()
{
this.webbMain.Navigate("HTML\\BookPage.html");
HtmlDocument doc = this.webbMain.Document;
}
This time this.webbMain.DocumentText contains the HTML data read from the file, but doc.Body returns null again, and I still cannot get an element using the GetElementById() method. Of course, since I have the text, I could try a regex to get the specified fields, or use other tricks to achieve the goal, but I wonder: is there a simple way to manipulate the HTML? For me, the ideal way would be to create the HTML text in memory, load it into the WebBrowser control, and then dynamically change its parameters using IDs. Is it possible? Thanks in advance for the answers, best regards,
Paweł
Some time ago I worked with the WebBrowser control and, like you, wanted to load HTML from memory, but I had the same problem: the body was null. After some investigation, I noticed that the Navigate and NavigateToString methods work asynchronously, so the control needs a little time to load the document; the document is not available right after the call to Navigate. So I did something like this (wbChat is the WebBrowser control):
wbChat.NavigateToString("<html><body><div>first line</div></body></html>");
DoEvents();
where DoEvents() is implemented as:
[SecurityPermissionAttribute(SecurityAction.Demand, Flags = SecurityPermissionFlag.UnmanagedCode)]
public void DoEvents()
{
DispatcherFrame frame = new DispatcherFrame();
Dispatcher.CurrentDispatcher.BeginInvoke(DispatcherPriority.Background,
new DispatcherOperationCallback(ExitFrame), frame);
Dispatcher.PushFrame(frame);
}
and it worked for me, after the DoEvents call, I could obtain a non-null body:
mshtml.IHTMLDocument2 doc2 = (mshtml.IHTMLDocument2)wbChat.Document;
mshtml.HTMLDivElement div = (mshtml.HTMLDivElement)doc2.createElement("div");
div.innerHTML = "some text";
mshtml.HTMLBodyClass body = (mshtml.HTMLBodyClass)doc2.body;
if (body != null)
{
body.appendChild((mshtml.IHTMLDOMNode)div);
body.scrollTop = body.scrollHeight;
}
else
Console.WriteLine("body is still null");
I don't know if this is the right way of doing this, but it fixed the problem for me, maybe it helps you too.
Later Edit:
public object ExitFrame(object f)
{
((DispatcherFrame)f).Continue = false;
return null;
}
The DoEvents method is necessary on WPF. For System.Windows.Forms one can use Application.DoEvents().
Another way to do the same thing is:
webBrowser1.DocumentText = "<html><body>blabla<hr/>yadayada</body></html>";
This works without any extra initialization.
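Since loading is asynchronous, another option in Windows Forms is to wait for the control's DocumentCompleted event before touching the DOM, rather than pumping messages. Below is a minimal sketch under that assumption. The form class, handler name and replacement text are made up for illustration; the control name and the "paragraph" id are taken from the question.

```csharp
using System;
using System.Windows.Forms;

public class BookForm : Form
{
    private readonly WebBrowser webbMain = new WebBrowser { Dock = DockStyle.Fill };

    public BookForm()
    {
        Controls.Add(webbMain);
        // Subscribe before loading so the event is not missed.
        webbMain.DocumentCompleted += OnDocumentCompleted;
    }

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);
        // Setting DocumentText starts an asynchronous load of the in-memory HTML.
        webbMain.DocumentText =
            "<html><body><p id=\"paragraph\">Example.</p></body></html>";
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The document has been parsed by now, so element lookups can succeed.
        HtmlElement paragraph = webbMain.Document.GetElementById("paragraph");
        if (paragraph != null)
            paragraph.InnerText = "Updated after load.";
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new BookForm());
    }
}
```

The event can fire more than once for pages that contain frames, so a real handler may need to check which document finished loading.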
I have a need to verify a specific hyperlink exists on a given web page. I know how to download the source HTML. What I need help with is figuring out if a "target" url exists as a hyperlink in the "source" web page.
Here is a little console program to demonstrate the problem:
public static void Main()
{
var sourceUrl = "http://developer.yahoo.com/search/web/V1/webSearch.html";
var targetUrl = "http://developer.yahoo.com/ypatterns/";
Console.WriteLine("Source contains link to target? Answer = {0}",
SourceContainsLinkToTarget(
sourceUrl,
targetUrl));
Console.ReadKey();
}
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
string content;
using (var wc = new WebClient())
content = wc.DownloadString(sourceUrl);
return content.Contains(targetUrl); // Need to ensure this is in the href of an <a> tag!
}
Notice the comment on the last line. I can see whether the target URL exists in the HTML of the source URL, but I need to verify that the URL is inside the href attribute of an <a> tag. This way I can validate that it is actually a hyperlink, instead of just text.
I'm hoping someone will have a kick-ass regular expression or something I can use.
Thanks!
Here is the solution using the HtmlAgilityPack:
private static bool SourceContainsLinkToTarget(string sourceUrl, string targetUrl)
{
    var doc = (new HtmlWeb()).Load(sourceUrl);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
        return false;
    foreach (var link in links)
        if (link.GetAttributeValue("href", string.Empty).Equals(targetUrl))
            return true;
    return false;
}
The best way is to use a web scraping library with a built-in DOM parser, which will build an object tree out of the HTML and let you explore it programmatically for the link element you are looking for. There are many available, for example Beautiful Soup (Python), scrapi (Ruby) or Mechanize (Perl). For .NET, try the HTML Agility Pack: http://htmlagilitypack.codeplex.com/
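A page may also link to the target with a relative href such as "/ypatterns/", which an exact string comparison would miss. Here is a minimal sketch, assuming HtmlAgilityPack is referenced, that resolves each href against the page URL before comparing. The class name, method name and sample markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkChecker
{
    // True when some <a href> in the HTML resolves to the target URI.
    public static bool ContainsLinkTo(string html, Uri pageUri, Uri target)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no anchor with an href exists.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return false;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);
            // Resolve relative hrefs against the page; skip values that are not valid URIs.
            if (Uri.TryCreate(pageUri, href, out Uri resolved) && resolved == target)
                return true;
        }
        return false;
    }

    public static void Main()
    {
        var pageUri = new Uri("http://developer.yahoo.com/search/web/V1/webSearch.html");
        var target = new Uri("http://developer.yahoo.com/ypatterns/");

        // Hypothetical markup: the URL appears once as plain text and once as a relative link.
        const string html =
            "<html><body><p>http://developer.yahoo.com/ypatterns/</p>" +
            "<a href=\"/ypatterns/\">Patterns</a></body></html>";

        Console.WriteLine(ContainsLinkTo(html, pageUri, target));
    }
}
```

Comparing Uri values instead of raw strings also ignores differences in host-name casing, which may or may not be what you want.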