Html Agility Pack Load method issue - c#

I am using the Html Agility pack. When the Load method of HtmlDocument class is passed the URL like "http://www.stackoverflow.com" it says the URI is not in correct format.
doc.Load(TextBoxUrl.Text, Encoding.UTF8 );
the url I try is this http://www.stackoverflow.com/questions/846994/how-to-use-html-agility-pack

HAP can not load from url, only from file or from a string. Use WebClient or HttpWebRequest to get the page.
For example:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (var wc = new WebClient())
{
doc.LoadHtml(wc.DownloadString(TextBoxUrl.Text));
}

Related

Get webpage source code with alt key code symbols using asp.net c#

I'm trying to get webpage source code using htmlagilitypack. This is my code to get source code and fill into multiline textbox:
var url = "http://www.example.com";
var web = new HtmlWeb();
var doc = web.Load(url);
sourcecodetxt.Text = doc.ToString();
code is working fine but if my webpage have some "Alt Codes Symbols" then symbol changed with some characters eg: ★ changed with ★
My question is how to get original symbol. Sorry for my bad english. Thanks in advance.
Try using WebClient and HtmlDocument's Load() method so you can specify the encoding:
WebClient client = new WebClient();
HtmlDocument doc = new HtmlDocument();
doc.Load(client.OpenRead("http://www.example.com"), Encoding.UTF8);

Can't parse body of page

I am trying parse some href from one page, my code looks like:
WebClient webClient = new WebClient();
string htmlCode = webClient.DownloadString("https://www.firmy.cz/Auto-moto");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[contains(#class,'companyWrap')]");
string target = "";
foreach (HtmlNode link in collection)
{
target = target +"\n"+ link.Attributes["href"].Value;
}
On this page my doc.ParsedText haven't body <body id="root" class="root">
</body> but if i go to page i see elements of body. Can u tell me where is a problem?
Blockquote
If you view the source of the URL you are trying to parse (https://www.firmy.cz/Auto-moto), you can see that the body is empty.
It seems like the page is loading the content through JavaScript on the client side and will thus not be available for you to parse.

Empty Xml Document response from Api

I need information from imdb unoffical api "omdbapi".I am sending link in correct but when I get response the document is null.I am using htmlagiltypack.what am I doing wrong?
here is direct link:http://www.omdbapi.com/?i=tt2231253&plot=short&r=xml
string url = "http://www.omdbapi.com/?i=" + ImdbID + "&plot=short&r=xml";
HtmlWeb source = new HtmlWeb();
HtmlDocument document = source.Load(url);
Its no Html but a XML document you expect. Try this instead:
string url = "http://www.omdbapi.com/?i=tt2231253&plot=short&r=xml";
WebClient wc = new WebClient();
XDocument doc = XDocument.Parse(wc.DownloadString(url));
Console.WriteLine(doc);

Extract image sources from a web page, where img tags might be added when the page is rendered by javascript etcq

I want to extract the "" of all the images in a web page in C#/asp.net.
I am using:
WebClient client = new WebClient();
string mainSource = client.DownloadString(URL);
and searching mainSource string for "".
This method seems to work correctly, but only if all the images(" tags) are present in raw source code of the web page.
The image tags rendered by javascript etc are not being scanned in the above process.
Is there another way to do this?
Try this out
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//img[#src]"))
{
}

Can i read iframe through WebClient (i want the outer html)?

Well my program is reading a web target that somewhere in the body there is the iframe that i want to read.
My html source
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
in my program i have a method that is returning the source as a string
public static string get_url_source(string url)
{
using (WebClient client = new WebClient())
{
return client.DownloadString(url);
}
}
My problem is that i want to get the source of the iframe when it's reading the source, as it would do in normal browsing.
Can i do this only by using WebBrowser Class or there is a way to do it within WebClient or even another class?
The real question:
How can i get the outer html given a url? Any appoach is welcomed.
After getting the source of the site, you can use HtmlAgilityPack to get the url of the iframe
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
.Attributes["src"].Value;
then make a second call to get_url_source
Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
doc.Load(url);
foreach (HtmlNode node in doc.DocumentElement.SelectNodes("//iframe"))
iframeSource.Add(get_url_source(mainiFrame.Attributes["src"]));
If you are targeting a single iframe, try to identify it using ID attribute or something else so you can only retrieve one source:
String iframeSource;
HtmlDocument doc = new HtmlDocument();
doc.Load(url);
foreach (HtmlNode node in doc.DocumentElement.SelectNodes("//iframe"))
{
// Just an example for check, but you could use different approaches...
if (node.Attributes["id"].Value == 'targetframe')
iframeSource = get_url_source(node.Attributes["src"].Value);
}
Well i found the answer after some search and this is what i wanted
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
//You can use here OuterHtml too.

Categories

Resources