C# web scraping to get a URL from HTML

I am trying to scrape a website and get a URL from it. I am using HtmlAgilityPack and the code below:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://putlocker.ist/scorpion-season-1-episode-1/");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//*[@id='vplayer_media']/video"))
{
    string whatever = link.OuterHtml;
    textBox1.Text = whatever;
}
I used Opera's developer tools to copy the XPath, which came out as this:
//*[@id="vplayer_media"]/video
I couldn't use it directly because of the double quotes, so I replaced it with
@"//*[@id=""vplayer_media""]/video"
but I get the error:
Object reference not set to an instance of an object
What am I doing wrong?

Escape the double quotes in your XPath:
"//*[@id=\"vplayer_media\"]/video"
Or use doubled double quotes in a verbatim string literal:
@"//*[@id=""vplayer_media""]/video"

Related

Get webpage source code with alt key code symbols using ASP.NET C#

I'm trying to get a webpage's source code using HtmlAgilityPack. This is my code to get the source code and fill it into a multiline textbox:
var url = "http://www.example.com";
var web = new HtmlWeb();
var doc = web.Load(url);
sourcecodetxt.Text = doc.DocumentNode.OuterHtml;
The code is working fine, but if my webpage has some "alt code symbols" then the symbol comes out as other characters, e.g. ★ comes out as â˜….
My question is how to get the original symbol. Sorry for my bad English. Thanks in advance.
Try using WebClient and HtmlDocument's Load() method so you can specify the encoding:
WebClient client = new WebClient();
HtmlDocument doc = new HtmlDocument();
doc.Load(client.OpenRead("http://www.example.com"), Encoding.UTF8);
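If you would rather keep DownloadString, WebClient can also be told which encoding to use when decoding the response; a sketch, assuming the page really is served as UTF-8:
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8; // decode the downloaded bytes as UTF-8
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(client.DownloadString("http://www.example.com"));
sourcecodetxt.Text = doc.DocumentNode.OuterHtml;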

What is the fastest way to get an HTML document node using XPath and the HtmlAgilityPack?

In my application I need to get the URL of a blog post's featured image. To do this I'm using the HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
    string imageUrl = string.Empty;
    using (WebClient client = new WebClient())
    {
        string htmlString = client.DownloadString(postUrl);
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(htmlString);
        string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
        HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
        imageUrl = node.GetAttributeValue("src", string.Empty);
    }
    return imageUrl;
}
The problem is that this is too slow; in some tests I noticed that it takes about three seconds to extract the URL of the image on a given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried to use the absolute XPath of the element I want to load, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);

// Find the first <div> whose class contains "featured_image", then its <img> child.
var featureDiv = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(_ => _.Attributes.Contains("class")
                      && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"].Value;
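For what it's worth, most of those three seconds are almost certainly spent on the HTTP download rather than on the XPath query, so a shorter XPath will not buy much. A quick sketch (reusing postUrl from the question) to time the two phases separately:
var sw = System.Diagnostics.Stopwatch.StartNew();
string html;
using (WebClient client = new WebClient())
{
    html = client.DownloadString(postUrl); // network download
}
Console.WriteLine("Download: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode("//div[contains(@class, 'featured_image')]/img"); // parse + query
Console.WriteLine("Parse + XPath: " + sw.ElapsedMilliseconds + " ms");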

Extract image sources from a web page, where img tags might be added when the page is rendered by JavaScript etc.

I want to extract the "src" of all the images in a web page in C#/ASP.NET.
I am using:
WebClient client = new WebClient();
string mainSource = client.DownloadString(URL);
and searching the mainSource string for "<img".
This method seems to work correctly, but only if all the image (<img>) tags are present in the raw source code of the web page.
Image tags rendered by JavaScript etc. are not picked up by this process.
Is there another way to do this?
Try this out
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    string src = link.GetAttributeValue("src", string.Empty); // the image URL
}
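Note that HtmlWeb only sees the raw HTML the server returns, so this has the same limitation the question describes: images injected by JavaScript will not be in the document. To capture those you need something that actually runs the scripts; a sketch using the WinForms WebBrowser control (the same approach as the iframe answer below, assuming a form with a webBrowser1 control):
webBrowser1.Navigate("http://www.example.com");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
    Application.DoEvents(); // crude wait for rendering to finish

// Parse the rendered DOM, which includes script-injected elements.
HtmlAgilityPack.HtmlDocument rendered = new HtmlAgilityPack.HtmlDocument();
rendered.LoadHtml(webBrowser1.Document.Body.OuterHtml);
var imgs = rendered.DocumentNode.SelectNodes("//img[@src]");
if (imgs != null)
{
    foreach (HtmlNode img in imgs)
    {
        string src = img.GetAttributeValue("src", string.Empty);
    }
}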

Can I read an iframe through WebClient (I want the outer HTML)?

My program is reading a web page that contains, somewhere in the body, an iframe that I want to read.
My HTML source:
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the source as a string:
public static string get_url_source(string url)
{
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
My problem is that I want to get the source of the iframe as well when reading the page source, as a normal browser would.
Can I do this only by using the WebBrowser class, or is there a way to do it with WebClient or even another class?
The real question:
How can I get the outer HTML given a URL? Any approach is welcome.
After getting the source of the site, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe").Attributes["src"].Value;
then make a second call to get_url_source
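Putting the two steps together, a sketch reusing the question's get_url_source helper:
string outerHtml = get_url_source("http://www.example.com/page-with-iframe");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(outerHtml);

// The first request returns only the outer page; the iframe's content needs its own request.
string iframeUrl = doc.DocumentNode.SelectSingleNode("//iframe").Attributes["src"].Value;
string iframeHtml = get_url_source(iframeUrl);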
Parse your source using HTML Agility Pack and then:
List<string> iframeSource = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
    iframeSource.Add(get_url_source(node.Attributes["src"].Value));
If you are targeting a single iframe, try to identify it using its id attribute or something else so you only retrieve one source:
string iframeSource = null;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
    // Just an example check; you could use different approaches...
    if (node.Attributes["id"].Value == "targetframe")
        iframeSource = get_url_source(node.Attributes["src"].Value);
}
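One caveat: node.Attributes["id"] is null for any iframe without an id attribute, so the comparison above would throw. GetAttributeValue avoids that:
// GetAttributeValue returns the fallback instead of throwing when the attribute is absent.
if (node.GetAttributeValue("id", "") == "targetframe")
    iframeSource = get_url_source(node.GetAttributeValue("src", ""));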
Well, I found the answer after some searching and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
    Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.
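The Application.DoEvents() busy-wait works, but it blocks the UI thread and can re-enter event handlers. A gentler pattern (a sketch assuming the same webBrowser1 control) is to subscribe to DocumentCompleted:
webBrowser1.DocumentCompleted += (sender, e) =>
{
    // Fires when a document finishes loading (more than once for pages with frames).
    string outerHtml = webBrowser1.Document.Body.OuterHtml;
};
webBrowser1.Navigate("http://www.mysite.com/");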

HtmlAgilityPack stripping out HTML

I'm sure this question has been asked before and I've looked, but I can't find the answer; maybe I am just doing something wrong.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(indivdualfix[0]);
HtmlWeb hwObject = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldocObject = hwObject.Load(indivdualfix[0]);
HtmlNode body = htmldocObject.DocumentNode.SelectSingleNode("//body");
body.Attributes.Remove("style");
foreach (var a in body.Attributes.ToArray())
    a.Remove();
string bodywork = body.InnerHtml; // InnerHtml is already a string
The string still contains all the HTML markup. I might be missing something really small here. What needs to be done to remove all the HTML markup, basically?
Use body.InnerText, not body.InnerHtml.
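For illustration, a quick self-contained example of the difference (any small HTML fragment works):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("<body><p>Hello <b>world</b></p></body>");
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
Console.WriteLine(body.InnerHtml); // <p>Hello <b>world</b></p>
Console.WriteLine(body.InnerText); // Hello world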
