I'm sure this question has been asked before, and I've looked, but I can't find the answer; maybe I am just doing something wrong.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(indivdualfix[0]);
HtmlWeb hwObject = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldocObject = hwObject.Load(indivdualfix[0]);
HtmlNode body = htmldocObject.DocumentNode.SelectSingleNode("//body");
body.Attributes.Remove("style");
foreach (var a in body.Attributes.ToArray())
    a.Remove();
string bodywork = body.InnerHtml.ToString();
The string bodywork still returns all the HTML markup. I might be missing something really small here. What needs to be done to remove all the HTML markup, basically?
Use body.InnerText, not body.InnerHtml.
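A minimal sketch of the difference, using a made-up HTML fragment:

```csharp
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("<body style=\"color:red\"><p>Hello <b>world</b></p></body>");

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

// InnerHtml keeps the child markup: "<p>Hello <b>world</b></p>"
string withTags = body.InnerHtml;

// InnerText returns only the concatenated text nodes: "Hello world"
string textOnly = body.InnerText;
```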
I am trying to scrape a website and get a URL from it, I am using htmlagilitypack and the code below:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://putlocker.ist/scorpion-season-1-episode-1/");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//*[@id='vplayer_media']/video"))
{
    string whatever = link.OuterHtml;
    textBox1.Text = whatever;
}
I used opera's developer tools to copy the XPath which came out as this:
//*[@id="vplayer_media"]/video
I couldn't use it because of double quotes so I replaced it with
@"//*[@id=""vplayer_media""]/video"
but I get the error:
Object reference not set to an instance of an object
What am I doing wrong?
Escape the double quotes in your XPath:
"//*[@id=\"vplayer_media\"]/video"
Or use doubled double quotes in a verbatim string literal:
@"//*[@id=""vplayer_media""]/video"
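For completeness, a guarded version of the loop (a sketch; SelectNodes returns null when nothing matches, which is exactly what produces the "Object reference not set" error when the XPath is wrong or the element only exists after JavaScript runs):

```csharp
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://putlocker.ist/scorpion-season-1-episode-1/");

// SelectNodes returns null, not an empty collection, when no node matches.
var nodes = doc.DocumentNode.SelectNodes(@"//*[@id=""vplayer_media""]/video");
if (nodes == null)
{
    // No match: the XPath is wrong, or the <video> tag is injected by script
    // at runtime and is not present in the raw HTML that HtmlAgilityPack sees.
}
else
{
    foreach (HtmlNode link in nodes)
        textBox1.Text = link.OuterHtml;
}
```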
I am not able to get the XPath right. I am trying to get the poster image of an IMDb movie, but it just doesn't seem to work. This is my code:
// Getting the node
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id=\"title - overview - widget\"]/div[2]/div[3]/div[1]/a/img");
// Getting the attribute data
HtmlAttributeCollection attr = node.Attributes;
The attribute collection is null every time. The XPath does not work and I don't know why; it looks right to me.
You can use a simpler XPath:
var url = "http://www.imdb.com/title/tt0816692/";
using (var client = new HttpClient())
{
    var html = await client.GetStringAsync(url);
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var img = doc.DocumentNode.SelectSingleNode("//img[@title='Trailer']")
                ?.Attributes["src"]?.Value;
    // or
    var poster = doc.DocumentNode.SelectSingleNode("//div[@class='poster']//img")
                   ?.Attributes["src"]?.Value;
}
My program reads a web page, and somewhere in the body there is an iframe that I want to read.
My html source
<html>
...
<iframe src="http://www.mysite.com" ></iframe>
...
</html>
In my program I have a method that returns the page source as a string:
public static string get_url_source(string url)
{
    using (WebClient client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
My problem is that I want to get the source of the iframe as well while reading the page, as would happen in normal browsing.
Can I do this using only the WebBrowser class, or is there a way to do it with WebClient or even another class?
The real question:
How can I get the outer HTML, given a URL? Any approach is welcome.
After getting the source of the page, you can use HtmlAgilityPack to get the URL of the iframe:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var src = doc.DocumentNode.SelectSingleNode("//iframe")
             .Attributes["src"].Value;
Then make a second call to get_url_source.
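Putting both steps together (a sketch that reuses the get_url_source helper from the question and guards against pages without an iframe):

```csharp
string html = get_url_source("http://www.mysite.com/");

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// SelectSingleNode returns null if the page contains no iframe.
HtmlNode frame = doc.DocumentNode.SelectSingleNode("//iframe");
if (frame != null)
{
    string iframeHtml = get_url_source(frame.Attributes["src"].Value);
}
```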
Parse your source using HTML Agility Pack and then:
List<String> iframeSource = new List<String>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
    iframeSource.Add(get_url_source(node.Attributes["src"].Value));
If you are targeting a single iframe, try to identify it by its ID attribute or something else, so you retrieve only one source:
String iframeSource = null;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(get_url_source(url));
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//iframe"))
{
    // Just an example of a check, but you could use different approaches...
    if (node.Attributes["id"] != null && node.Attributes["id"].Value == "targetframe")
        iframeSource = get_url_source(node.Attributes["src"].Value);
}
Well, I found the answer after some searching, and this is what I wanted:
webBrowser1.Url = new Uri("http://www.mysite.com/");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
    Application.DoEvents();
string InnerSource = webBrowser1.Document.Body.InnerHtml;
// You can use OuterHtml here too.
I am trying to scrape This Website.
The XPath expression below works fine with the FirePath Firebug extension:
html/body/table/tbody/tr[3]/td
But with the same XPath expression, the code below gives me null:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("http://www.edb.gov.hk/templates/sch_list_print.asp?district=cw");
var collection= doc.DocumentNode.SelectNodes("html/body/table/tbody/tr[3]/td");
Can anyone help me with this? Thanks.
This works. Looking at the source of the page you are trying to scrape, there is no tbody inside the table (browsers insert tbody when rendering, which is why FirePath shows it).
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("http://www.edb.gov.hk/templates/sch_list_print.asp?district=cw");
var collection= doc.DocumentNode.SelectNodes("html/body/table/tr[3]/td");
Change your XPath to:
html/body/table/tr[3]/td
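An alternative is a descendant-axis expression that works whether or not a tbody is present (a sketch; the `//` step matches rows at any depth under the table, so the same XPath holds for both the rendered DOM FirePath sees and the raw HTML HtmlAgilityPack parses):

```csharp
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.edb.gov.hk/templates/sch_list_print.asp?district=cw");

// //table//tr matches the rows with or without an intervening <tbody>,
// because browsers add <tbody> at render time but the raw HTML lacks it.
var collection = doc.DocumentNode.SelectNodes("//table//tr[3]/td");
```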
Is there a way to parse an HTML string in .NET code-behind with DOM-style traversal, i.e. something like
GetElementsByTagName("abc").GetElementsByTagName("tag")
I have this code chunk:
private void LoadProfilePage()
{
    string sURL = "http://www.abcd1234.com/abcd1234";

    WebRequest wrGETURL = WebRequest.Create(sURL);
    //WebProxy myProxy = new WebProxy("myproxy", 80);
    //myProxy.BypassProxyOnLocal = true;
    //wrGETURL.Proxy = WebProxy.GetDefaultProxy();

    Stream objStream = wrGETURL.GetResponse().GetResponseStream();
    if (objStream != null)
    {
        StreamReader objReader = new StreamReader(objStream);
        string sLine = objReader.ReadToEnd();
        if (String.IsNullOrEmpty(sLine) == false)
        {
            ....
        }
    }
}
You can use the excellent HTML Agility Pack.
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Take a look at using the Html Agility Pack
Example of its use:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
You can use the HTML Agility Pack and a little XPath (it can even download the document for you):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.abcd1234.com/abcd1234");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//abc//tag");
I've used the HTML Agility Pack to do this exact thing and I think it's great. It has been really helpful to me.
Maybe this can help: What is the best way to parse html in C#?