Scrape text from the web with HtmlAgilityPack - C#

I'm having real trouble locating the text on this website with the node. I've tried all sorts of XPaths inside the SelectNodes brackets. Does anyone have any ideas?
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.LoadFromBrowser("https://app.box.com/s/v2l2cd1mwhemijbigv88nyfk592rjei0");
HtmlNode[] nodes = doc.DocumentNode.SelectNodes("//*[starts-with(local-name(),'bcpr9')]").ToArray();
foreach (HtmlNode item in nodes)
{
textBox1.Text = item.InnerText;
}

Your code will only put the text from the last node into the text box, because you overwrite it on each iteration of the loop. Try this instead:
textBox1.Text += item.InnerText;
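A minimal, self-contained sketch of the fix (assuming the HtmlAgilityPack NuGet package; the HTML and class names here are invented for illustration, and console output stands in for the text box):

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span class='bcpr9a'>Hello</span><span class='bcpr9b'>World</span></div>");

        // Collect the text of every matched node instead of overwriting it each time.
        var sb = new StringBuilder();
        var nodes = doc.DocumentNode.SelectNodes("//span");
        if (nodes != null)   // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode item in nodes)
                sb.Append(item.InnerText).Append(' ');
        }
        Console.WriteLine(sb.ToString().Trim());
    }
}
```

Appending with += in a loop works too, but a StringBuilder (or string.Join) avoids building many intermediate strings.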

Related

How read content of a span tag using HtmlAgilityPack?

I'm using HtmlAgilityPack to scrape data from a link (site). There are many p, header, and span tags on the site. I need to scrape data from one particular span tag.
var webGet = new HtmlWeb();
var document = webGet.Load(URL);
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
{
string strData = node.InnerText.Trim();
}
I tried keying off the parent tag, but that did not work for all kinds of URLs. Please help me fix it.
What is the error?
You can start by fixing this:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
it should be:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("//span"))
But I want exact data. For example, there are many span tags in the source, such as <span>abc</span>, <span>def</span>, <span>pqr</span>, <span>xyz</span>, and I want the result "pqr". Is there any option to get it by a count of a particular tag, or by index?
If you want to get, for example, the third span tag from the root:
doc.DocumentNode.SelectSingleNode("//span[3]")
If you want to get the node containing the text "pqr":
doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
You can use SelectNodes for the latter to get all span tags containing "pqr" in the text.
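The two selections above can be tried against a small inline document (a sketch; the HTML is invented for illustration):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span>abc</span><span>def</span><span>pqr</span><span>xyz</span></div>");

        // By position: XPath indices are 1-based, so [3] is the third span.
        var third = doc.DocumentNode.SelectSingleNode("//span[3]");
        Console.WriteLine(third.InnerText);   // pqr

        // By content: match the span whose text contains "pqr".
        var byText = doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
        Console.WriteLine(byText.InnerText);  // pqr
    }
}
```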

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique? I am using this code, but it only parses specific nodes; I need the complete page parsed into neat and clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
You can also do SelectNodes("//*"). The '*' (asterisk) is the wildcard selector, and with the // axis it selects every element node on the page (a bare "*" would only select the direct children of the context node).
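A self-contained sketch of the text-node approach (assuming HtmlAgilityPack; the sample HTML is made up, and Select requires System.Linq):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body><h1>Title</h1><p>First  </p><p></p><p>Second</p></body>");

        // Only text nodes with non-whitespace content survive the predicate.
        var textNodes = doc.DocumentNode
            .SelectNodes("//text()[normalize-space()]")
            .Select(t => t.InnerText.Trim());

        Console.WriteLine(string.Join("|", textNodes));  // Title|First|Second
    }
}
```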

XML Parsing with HtmlAgilityPack

I'm parsing XML with HtmlAgilityPack in a WebService worker role, but something is wrong. When I select the child node "link", I get an empty string. The XML looks like:
<link>
http://www.webtekno.com/google/google-ve-razer-dan-oyun-konsolu.html
</link>
My code for getting the link from the RSS is:
HtmlNodeCollection nodeList = doc.DocumentNode.SelectNodes("//item");
foreach (HtmlNode node in nodeList)
{
string newsUri = node.ChildNodes["link"].InnerText;
}
I think I get an empty string because the link node contains a newline before and after the link. How can I get the link from the node?
Put this line before loading the HtmlDocument:
HtmlNode.ElementsFlags["link"] = HtmlElementFlag.Closed;
That is all. By default, the flag for link is HtmlElementFlag.Empty, so it is treated like the meta and img tags (as if it had no inner content).
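In context, the fix looks like this sketch (the RSS fragment and URL are invented for illustration):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Without this, HtmlAgilityPack treats <link> as an empty element
        // (like <img>) and discards its inner text. It is a static setting,
        // so it must run before the document is parsed.
        HtmlNode.ElementsFlags["link"] = HtmlElementFlag.Closed;

        var doc = new HtmlDocument();
        doc.LoadHtml("<rss><item><link>http://example.com/news.html</link></item></rss>");

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//item"))
        {
            // Trim removes the newlines surrounding the URL inside the element.
            string newsUri = node.ChildNodes["link"].InnerText.Trim();
            Console.WriteLine(newsUri);
        }
    }
}
```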

HtmlAgilityPack HtmlNodeCollection returning NULL, but it shouldn't

I made a simple program for fetching YouTube usernames from comments. This is the code:
string html;
using (var client = new WebClient())
{
html = client.DownloadString("http://www.youtube.com/watch?v=ER5EnjskCvE");
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
List<string> data = new List<string>();
HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//*[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img");
foreach (HtmlNode node in nodeCollection)
{
data.Add(node.GetAttributeValue("alt",null));
}
But I have a problem: my nodeCollection is returning null. For the XPath, I used the Copy XPath option in Chrome's developer tools (F12).
Try replacing "*" with "div":
"/html/body//div[@id='comments-view']/ul[1]/li[1]/a/span/span/span/span/img"
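Whichever XPath you use, note two things: SelectNodes returns null (not an empty collection) when nothing matches, and an XPath copied from Chrome reflects the JavaScript-rendered DOM, which may not exist in the HTML that WebClient actually downloads. A sketch with the null guard (the HTML is invented):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='comments-view'><ul><li><a><img alt='user1'/></a></li></ul></div>");

        // Guard against null: SelectNodes returns null when the XPath matches
        // nothing, which is the usual cause of the NullReferenceException here.
        var nodes = doc.DocumentNode.SelectNodes("//*[@id='comments-view']//img");
        if (nodes == null)
        {
            Console.WriteLine("no matches - the XPath may not fit the downloaded HTML");
            return;
        }
        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.GetAttributeValue("alt", ""));
    }
}
```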

HTML Agility Pack get all input fields

I found some code on the internet that finds all the href attributes and changes them to google.com, but how can I tell the code to find all the input fields and put custom text in them?
This is the code I have right now:
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
att.Value = "http://www.google.com";
}
doc.Save("file.htm");
Please, can someone help me? I can't seem to find any information about this on the internet :(
Change the XPath selector to //input to select all the input nodes:
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
{
input.SetAttributeValue("value", "some text"); // creates the attribute if it doesn't exist
}
Your current code selects all a elements (that have an href attribute): "//a[@href]".
You want it to select all input elements: "//input".
Of course, the inner part of the loop will need to change to match what you are looking for.
I suggest you read up on XPath.
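A self-contained sketch of the input rewrite (assuming HtmlAgilityPack; the form HTML is invented). SetAttributeValue creates the value attribute when an input doesn't already have one, a case where indexing Attributes["value"] would return null:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<form><input name='user'/><input name='city' value='old'/></form>");

        foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
        {
            // Adds the attribute if absent, updates it otherwise.
            input.SetAttributeValue("value", "some text");
        }

        // Both inputs now carry value="some text".
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```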
