c# access all html nodes

I am trying to access all the nodes of a web page. Here is some of my code:
string Url = "http://quickfind.kassad.in/profile/euw/exploit4/";
string text1 = "";
HtmlAgilityPack.HtmlWeb web;
HtmlAgilityPack.HtmlDocument doc;
web = new HtmlWeb();
doc = web.Load(Url);
text1 = doc.DocumentNode
.SelectNodes("//*[@id=\"games\"]/div[2]/div[1]/strong/text()")[0]
.InnerText;
This code does not work. I can access the games node, but I can't access its child nodes. I tried InnerHtml, but it doesn't contain the child nodes either. How can I access those nodes? I also tried WebBrowser.DocumentText, with the same result.

Looking at the source of the URL, it looks like <div id="games"></div> gets populated after the page has loaded.
It makes an additional call to: http://quickfind.kassad.in/ahnlab_hs_sys/euw/AcquisitionServiceGate/RGN.aspx
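To see why the XPath finds nothing, here is a minimal sketch that assumes the HtmlAgilityPack package is referenced. The helper name FirstTextOrNull and the sample HTML strings in the usage note are invented for illustration and are not taken from the real page. It shows that the same query returns null for an empty container and a value once the container is populated.

```csharp
using HtmlAgilityPack;

public static class GamesProbe
{
    // Returns the trimmed text of the first node matching xpath,
    // or null when nothing in the supplied HTML matches.
    public static string FirstTextOrNull(string html, string xpath)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null (it does not throw) when there is no match.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }
}
```

Calling FirstTextOrNull with "<div id=\"games\"></div>" and the XPath "//*[@id='games']/div[2]/div[1]/strong" yields null, which is what the downloaded page looks like before any script runs. If the second request returns HTML, its response body could be passed to the same helper, though that endpoint may expect parameters or headers that are not shown here.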

Related

Can't parse body of page

I am trying to parse some href values from a page. My code looks like this:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[contains(@class,'companyWrap')]");
string target = "";
foreach (HtmlNode link in collection)
{
target = target +"\n"+ link.Attributes["href"].Value;
}
On this page, doc.ParsedText has an empty body, <body id="root" class="root"></body>, but if I open the page in a browser I can see the elements of the body. Can you tell me where the problem is?

If you view the source of the URL you are trying to parse (https://www.firmy.cz/Auto-moto), you can see that the body is empty.
It seems like the page is loading the content through JavaScript on the client side and will thus not be available for you to parse.

c# web scraping to get URL from html

I am trying to scrape a website and get a URL from it. I am using HtmlAgilityPack and the code below:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://putlocker.ist/scorpion-season-1-episode-1/");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//*[@id='vplayer_media']/video"))
{
string whatever = link.OuterHtml;
textBox1.Text = whatever;
}
I used Opera's developer tools to copy the XPath, which came out as:
//*[@id="vplayer_media"]/video
I couldn't use it as-is because of the double quotes, so I replaced it with
@"//*[@id=""vplayer_media""]/video"
but I get the error:
Object reference not set to an instance of an object
What am I doing wrong?

Escape the double quotes in your XPath:
"//*[@id=\"vplayer_media\"]/video"
Or double the double quotes inside a verbatim string literal:
@"//*[@id=""vplayer_media""]/video"
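As a hedged illustration of the quoting options, the sketch below assumes the HtmlAgilityPack package. The class and method names are hypothetical. It shows that the escaped and verbatim literals are the same string, and it guards against SelectNodes returning null when nothing matches, which is one way an "Object reference not set" error can arise in a foreach loop.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathQuoting
{
    // Three ways to write the same query in C# source.
    public const string Escaped = "//*[@id=\"vplayer_media\"]/video";
    public const string Verbatim = @"//*[@id=""vplayer_media""]/video";
    public const string SingleQuoted = "//*[@id='vplayer_media']/video";

    // Returns the OuterHtml of every match, or an empty list when there is none.
    public static List<string> OuterHtmlOfMatches(string html, string xpath)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        // HtmlAgilityPack returns null, not an empty collection, for no matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return results;

        foreach (HtmlNode node in nodes)
            results.Add(node.OuterHtml);
        return results;
    }
}
```

Escaped and Verbatim compare equal as strings. SingleQuoted differs textually but selects the same nodes, and it avoids escaping altogether.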

How to get hidden InnerHtml of web page that set by javascript?

I know that I can get the source of a web page with this code:
browser.DocumentText;
Some of the page's data is filled in by JavaScript through innerHTML. It does not appear in browser.DocumentText, although it is visible in the rendered page.
How can I get the source of the data that JavaScript adds to the page?

If you know what type of tag contains the inner HTML you want to get at, you could do something like this (this example loops through the div tags, but you could do p, or table cells, or whatever):
HtmlElementCollection collection = browser.Document.GetElementsByTagName("div");
foreach (HtmlElement element in collection) {
string html = element.InnerHtml;
string text = element.InnerText;
// do something with the HTML or text here...
}
Or if you know the specific ID of the element you want to get, use:
HtmlElement element = browser.Document.GetElementById("someId123");
if(null != element) // do something with it...

You could give HtmlAgilityPack a try and follow this answer.
HtmlWeb webGet = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = webGet.Load(url);

Extract image sources from a web page, where img tags might be added when the page is rendered by JavaScript etc.

I want to extract the src of all the images in a web page in C#/ASP.NET.
I am using:
WebClient client = new WebClient();
string mainSource = client.DownloadString(URL);
and searching the mainSource string for <img> tags.
This method seems to work correctly, but only if all the images (<img> tags) are present in the raw source code of the web page.
Image tags rendered by JavaScript etc. are not picked up by this process.
Is there another way to do this?

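Try this:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//img[@src]"))
{
// The src attribute of each <img> element present in the downloaded HTML
string src = link.GetAttributeValue("src", "");
}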
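A slightly fuller sketch of the same idea, assuming the HtmlAgilityPack package. The helper name ImageSources is hypothetical. It collects the src values into a list and tolerates pages with no matching images. Like the loop above, it only sees img elements present in the HTML string it is given, so tags added later by script would need a rendered DOM instead.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageScanner
{
    // Returns the src attribute of every <img> element that has one.
    public static List<string> ImageSources(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();
        // SelectNodes returns null when no element matches the query.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
            return sources;

        foreach (HtmlNode img in images)
            sources.Add(img.GetAttributeValue("src", ""));
        return sources;
    }
}
```

The string passed in could be the mainSource value downloaded with WebClient in the question.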

Can I read an iframe through WebClient (I want the outer HTML)?

My program reads a target web page that contains, somewhere in its body, the iframe that I want to read.
My HTML source:
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the source as a string:
public static string get_url_source(string url)
{
using (WebClient client = new WebClient())
{
return client.DownloadString(url);
}
}
My problem is that I want to get the source of the iframe while reading the page source, as would happen in normal browsing.
Can I do this only with the WebBrowser class, or is there a way to do it with WebClient or some other class?
The real question:
How can I get the outer HTML given a URL? Any approach is welcome.

After getting the source of the site, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
Then make a second call to get_url_source with that URL.
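One detail worth handling before the second call is that an iframe's src may be relative to the page. The following is a minimal sketch, assuming the HtmlAgilityPack package. The class and method names are hypothetical. It resolves the first iframe's src against the page URL with System.Uri.

```csharp
using System;
using HtmlAgilityPack;

public static class IframeLocator
{
    // Returns the absolute URL of the first <iframe> in html,
    // or null when the page has no iframe with a src attribute.
    public static string AbsoluteIframeUrl(string pageUrl, string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode iframe = doc.DocumentNode.SelectSingleNode("//iframe[@src]");
        if (iframe == null)
            return null;

        string src = iframe.GetAttributeValue("src", "");
        if (string.IsNullOrEmpty(src))
            return null;

        // An absolute src is returned unchanged; a relative one is
        // resolved against the URL of the page that contains it.
        return new Uri(new Uri(pageUrl), src).AbsoluteUri;
    }
}
```

The returned string can then be passed to get_url_source from the question to download the iframe's own document.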

Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlWeb().Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
iframeSource.Add(get_url_source(node.Attributes["src"].Value));
If you are targeting a single iframe, try to identify it by its id attribute or something else, so that you retrieve only one source:
String iframeSource = null;
HtmlDocument doc = new HtmlWeb().Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
// Just an example check; you could use a different approach...
if (node.GetAttributeValue("id", "") == "targetframe")
iframeSource = get_url_source(node.Attributes["src"].Value);
}

I found the answer after some searching, and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.
