Can I read an iframe through WebClient (I want the outer HTML)? - c#

My program reads a web page that has, somewhere in its body, an iframe that I want to read.
My HTML source:
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the source as a string:
public static string get_url_source(string url)
{
using (WebClient client = new WebClient())
{
return client.DownloadString(url);
}
}
My problem is that I want to get the source of the iframe as well when reading the page source, as a browser would during normal browsing.
Can I do this only with the WebBrowser class, or is there a way to do it with WebClient or another class?
The real question:
How can I get the outer HTML given a URL? Any approach is welcome.

After getting the source of the site, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
Then make a second call to get_url_source with that URL.
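One detail the second call may need: an iframe's src is often relative (for example /frame.html), while WebClient needs an absolute URL. A minimal sketch, assuming a hypothetical helper named ResolveFrameUrl (it is not part of the original code), that resolves the src against the URL of the containing page using System.Uri:

```csharp
using System;

public static class FrameUrlHelper
{
    // Hypothetical helper: resolves an iframe src (absolute or relative)
    // against the URL of the page that contains the iframe.
    public static string ResolveFrameUrl(string pageUrl, string frameSrc)
    {
        var baseUri = new Uri(pageUrl);
        // Uri(Uri, string) uses an absolute src as-is and
        // resolves a relative one against the base URI.
        return new Uri(baseUri, frameSrc).AbsoluteUri;
    }
}
```

With this, the second call would look something like get_url_source(FrameUrlHelper.ResolveFrameUrl(url, src)).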

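Parse your source using HTML Agility Pack and then collect the source of every iframe:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
// LoadHtml expects markup, so download the page first.
doc.LoadHtml(get_url_source(url));
// Note: SelectNodes returns null when the page has no iframe.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
iframeSource.Add(get_url_source(node.Attributes["src"].Value));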
If you are targeting a single iframe, try to identify it by its ID attribute or something similar so that you retrieve only one source:
String iframeSource = null;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
// Just an example check; you could use a different approach.
// GetAttributeValue avoids a NullReferenceException on iframes without an id.
if (node.GetAttributeValue("id", "") == "targetframe")
iframeSource = get_url_source(node.Attributes["src"].Value);
}

I found the answer after some searching, and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.

Related

Can't parse body of page

I am trying to parse some hrefs from a page. My code looks like this:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[contains(@class,'companyWrap')]");
string target = "";
foreach (HtmlNode link in collection)
{
target = target +"\n"+ link.Attributes["href"].Value;
}
On this page, doc.ParsedText has an empty body, <body id="root" class="root"></body>, but if I open the page in a browser I see elements in the body. Can you tell me where the problem is?
If you view the source of the URL you are trying to parse (https://www.firmy.cz/Auto-moto), you can see that the body is empty.
It seems like the page is loading the content through JavaScript on the client side and will thus not be available for you to parse.

c# web scraping to get URL from html

I am trying to scrape a website and get a URL from it. I am using HtmlAgilityPack and the code below:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://putlocker.ist/scorpion-season-1-episode-1/");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//*[@id='vplayer_media']/video"))
{
string whatever = link.OuterHtml;
textBox1.Text = whatever;
}
I used Opera's developer tools to copy the XPath, which came out as this:
//*[@id="vplayer_media"]/video
I couldn't use it as-is because of the double quotes, so I replaced it with
@"//*[@id=""vplayer_media""]/video"
but I get the error:
Object reference not set to an instance of an object
What am I doing wrong?
Escape the double quotes in your XPath:
"//*[@id=\"vplayer_media\"]/video"
Or use doubled double quotes in a verbatim string literal:
@"//*[@id=""vplayer_media""]/video"
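Quoting alone may not explain the exception. By default, HtmlAgilityPack's SelectNodes returns null rather than an empty collection when the XPath matches nothing, and a foreach over null throws "Object reference not set to an instance of an object". A minimal sketch of guarding against that; the class name, method name, and any sample HTML passed in are made up for illustration:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathGuard
{
    // Hypothetical helper: returns the OuterHtml of every node matching xpath,
    // or an empty list when nothing matches.
    public static List<string> OuterHtmlOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            // No match: return the empty list instead of iterating over null.
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.OuterHtml);
        }
        return result;
    }
}
```

If the list comes back empty for a page where the browser shows the element, the element is probably being added by JavaScript after the page loads and is not in the downloaded HTML.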

What is the fastest way to get an HTML document node using XPath and the HtmlAgilityPack?

In my application I need to get the URL of the image of a blog post. To do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
string imageUrl = string.Empty;
using (WebClient client = new WebClient())
{
string htmlString = client.DownloadString(postUrl);
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString);
string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
imageUrl = node.GetAttributeValue("src", string.Empty);
}
return imageUrl;
}
The problem is that this is too slow. When I ran some tests, I noticed that it takes about three seconds to extract the URL of the image from the given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried using the absolute XPath of the element I want to load, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"].Value;

With HtmlAgilityPack, verify that element on webpage exists

Let's say I'm on http://google.com, and I want to verify that there is an element with id="hplogo" that exists on the page (which there is, it's the Google logo).
I want to use HtmlAgilityPack, so I write something like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("http://google.com");
var foo = (from bar in doc.DocumentNode.DescendantNodes()
where bar.GetAttributeValue("id", null) == "hplogo"
select bar).FirstOrDefault();
if (foo == null)
{
HasSucceeded = 1;
MessageBox.Show("not there");
}
else
{
MessageBox.Show("it's there");
}
return HasSucceeded;
}
Which should return the "it's there" message because it is there. But it doesn't. What am I doing wrong?
The LoadHtml(html) method loads a string that contains the HTML content to parse; it does not take the URL of a resource to load. So you are loading the string "http://google.com" and trying to find the logo in it, which of course gives you the "not there" result.
You can use WebClient to download resource content:
WebClient client = new WebClient();
string html = client.DownloadString("http://google.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlAgilityPack - How to understand page redirected and load redirected page

Using HtmlAgilityPack and C# 4.0, how can you determine whether a page is being redirected? I am using this method to load the page:
HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);
Here is an example of a redirect result, I suppose.
Returned inner HTML:
<meta http-equıv="refresh" content="0;URL=http://www.pratikev.com/fractalv33/pratikEv/pages/home.jsp">
c# 4.0
In this case, parsing the HTML is the best way.
var page = "...";
var doc = new HtmlDocument();
doc.Load(page);
var root = doc.DocumentNode;
var select = root.SelectNodes("//meta[contains(@content, 'URL')]");
try
{
// select is null when nothing matches, so indexing it throws and control falls to the catch.
var target = select[0].Attributes["content"].Value.Split('=')[1];
Console.WriteLine("has redirect..");
Console.WriteLine(target);
}
catch
{
Console.WriteLine("no redirect found in the HTML");
}
Assuming the document is relatively well-formed, I suppose you could do something like this:
static string GetMetaRefreshUrl(string sourceUrl)
{
var web = new HtmlWeb();
var doc = web.Load(sourceUrl);
var xpath = "//meta[@http-equiv='refresh' and contains(@content, 'URL')]";
var refresh = doc.DocumentNode.SelectSingleNode(xpath);
if (refresh == null)
return null;
var content = refresh.Attributes["content"].Value;
return Regex.Match(content, @"\s*URL\s*=\s*([^ ;]+)").Groups[1].Value.Trim();
}
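The parsing step on its own can be checked without any network access. A small sketch that applies the same regular expression to a content value such as the one in the question; the class and method names are hypothetical, and the case-insensitive option is an added assumption so that a lowercase url= also matches:

```csharp
using System.Text.RegularExpressions;

public static class MetaRefresh
{
    // Hypothetical helper: extracts the target URL from the content attribute
    // of a meta refresh tag, e.g. "0;URL=http://example.com/".
    // Returns null when the value contains no URL part.
    public static string ExtractUrl(string content)
    {
        Match m = Regex.Match(content, @"\s*URL\s*=\s*([^ ;]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

A content value that holds only a delay, such as "5", yields null, which the caller can treat as "refresh without redirect".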
