I found some code on the internet that finds all the href attributes and changes them to google.com, but how can I tell the code to find all the input fields and put custom text in them?
This is the code I have right now:
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
att.Value = "http://www.google.com";
}
doc.Save("file.htm");
Please, can someone help me? I can't seem to find any information about this on the internet :(
Change the XPath selector to //input to select all the input nodes:
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
{
HtmlAttribute att = input.Attributes["value"];
if (att != null) att.Value = "some text"; // skip inputs that have no value attribute
}
Your current code selects all a elements that have an href attribute: "//a[@href]".
You want it to select all input elements: "//input".
Of course, the inner part of the loop will need to change to match what you are looking for.
I suggest you read up on XPath.
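Keep in mind that Attributes["value"] is null for inputs that have no value attribute, and SelectNodes returns null when nothing matches. As a sketch of an alternative (reusing the doc from the code above), SetAttributeValue sets the value on every input and creates the attribute when it is missing:
var inputs = doc.DocumentNode.SelectNodes("//input");
if (inputs != null) // SelectNodes returns null when no input elements are found
{
    foreach (HtmlNode input in inputs)
    {
        // creates the value attribute if the input does not already have one
        input.SetAttributeValue("value", "some text");
    }
}
doc.Save("file.htm");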
I'm having real trouble locating the text from this website with the node.
I've tried all sorts of XPaths inside the SelectNodes call; does anyone have any ideas?
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.LoadFromBrowser("https://app.box.com/s/v2l2cd1mwhemijbigv88nyfk592rjei0");
HtmlNode[] nodes = doc.DocumentNode.SelectNodes("//@*[starts-with(local-name(),'bcpr9')]").ToArray();
foreach (HtmlNode item in nodes)
{
textBox1.Text = item.InnerText;
}
Your code will only put the text from the last node into the text box because you are overwriting it on each iteration of the loop. Try this:
textBox1.Text += item.InnerText;
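If you would rather build the string once instead of appending inside the loop, a small sketch (assuming the same nodes array and a using System.Linq directive):
// join the InnerText of every matched node, one per line
textBox1.Text = string.Join(Environment.NewLine, nodes.Select(n => n.InnerText));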
I'm trying to pull text from a "div" and exclude everything else. Can you help me, please?!
<div class="article">
<div class="date">01.01.2000</div>
<div class="news-type">Breaking News</div>
"Here is the location of the text i would like to pull"
</div>
When I pull the "article" class I get everything, but I don't know how to exclude class="date", class="news-type", and everything inside them.
Here is the code I use:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]"))
{
name_text.Text += node.InnerHtml.Trim();
}
Thank you!
Another way would be to use the XPath text()[normalize-space()] to get the non-empty, direct-child text nodes from the div elements:
var divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'article')]");
foreach (HtmlNode div in divs)
{
var node = div.SelectSingleNode("text()[normalize-space()]");
Console.WriteLine(node.InnerText.Trim());
}
Output:
"Here is the location of the text i would like to pull"
You want the ChildNodes that are of type HtmlTextNode. Untested suggested code:
var textNodes = node.ChildNodes.OfType<HtmlTextNode>();
if (textNodes.Any())
{
name_text.Text += string.Join(string.Empty, textNodes.Select(tn => tn.InnerHtml));
}
I'm using HtmlAgilityPack to scrape data from a link (site). There are many p tags, header tags, and span tags in the site. I need to scrape data from a particular span tag.
var webGet = new HtmlWeb();
var document = webGet.Load(URL);
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
{
string strData = node.InnerText.Trim();
}
I tried using a keyword on the parent tag, but that did not work for all kinds of URLs.
Please help me fix it.
What is the error?
You can start by fixing this:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("\\span"))
it should be:
foreach (HtmlNode node in document.DocumentNode.SelectNodes("//span"))
But I want the exact data. For example, there are many span tags in the source, such as <span>abc</span>, <span>def</span>, <span>pqr</span>, <span>xyz</span>, and I want the result to be "pqr". Is there any option to get it by the count of a particular tag, or by index?
If you want to get, for example, the third span tag in the document:
doc.DocumentNode.SelectSingleNode("(//span)[3]")
If you want to get the node containing the text "pqr":
doc.DocumentNode.SelectSingleNode("//span[contains(text(),'pqr')]");
You can use SelectNodes for the latter to get all span tags containing "pqr" in the text.
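For instance, a short sketch (the URL is only a placeholder) that prints every span whose text contains "pqr":
var webGet = new HtmlWeb();
var document = webGet.Load("http://example.com/page.html"); // placeholder URL

// SelectNodes returns null when nothing matches, so guard against that
var matches = document.DocumentNode.SelectNodes("//span[contains(text(),'pqr')]");
if (matches != null)
{
    foreach (HtmlNode span in matches)
    {
        Console.WriteLine(span.InnerText.Trim());
    }
}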
How do I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses specific nodes; I need the complete page parsed into neat and clear content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes, use something like:
var textNodes = doc.DocumentNode.SelectNodes("//text()").Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]").Select(t => t.InnerText);
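If the page contains script or style blocks, their contents come back as text nodes too. A sketch of one way to filter them out in the XPath (what counts as "neat and clear" is an assumption here):
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space() and not(ancestor::script) and not(ancestor::style)]")
    .Select(t => t.InnerText.Trim());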
Another option is SelectNodes("//*"). The asterisk is the wildcard selector, and with the // prefix it matches every element node in the page.
I'm parsing XML with HtmlAgilityPack in a WebService worker role, but something is wrong. When I select the child node "link" I get an empty string.
The XML looks like:
<link>
http://www.webtekno.com/google/google-ve-razer-dan-oyun-konsolu.html
</link>
My code for getting the link from the RSS is:
HtmlNodeCollection nodeList = doc.DocumentNode.SelectNodes("//item");
foreach (HtmlNode node in nodeList)
{
string newsUri = node.ChildNodes["link"].InnerText;
}
I think I get an empty string because the link node contains a newline before and after the link. How can I get the link inside the node?
Put this line before loading the HtmlDocument:
HtmlNode.ElementsFlags["link"] = HtmlElementFlag.Closed;
That is all.
By default, its value is HtmlElementFlag.Empty, so it is treated like the meta and img tags...
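Putting it together, a rough sketch of the order things need to happen in (the file path is a placeholder):
// must run before the document is parsed, so <link> is treated as a container element
HtmlNode.ElementsFlags["link"] = HtmlElementFlag.Closed;

HtmlDocument doc = new HtmlDocument();
doc.Load("feed.xml"); // placeholder path to the RSS document

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//item"))
{
    // Trim strips the newlines that surround the URL inside the link element
    string newsUri = node.ChildNodes["link"].InnerText.Trim();
    Console.WriteLine(newsUri);
}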