Html Agility Pack, how to add console output to a List - C#

I am trying to output some strings from a certain website, and I want to add each row to a List.
The code that produces the output I am looking for is this:
string url = "https://thepiratebay.org/search/rick%20and%20morty/0/99/0";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
//upload list
List<string> uploadList = new List<string>();
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@id='searchResult']/tr/td/font[@class='detDesc']"))
{
    var input = node.InnerHtml;
    // The [^0-9] expression matches any character that is NOT a digit; each run is replaced with a space
    input = Regex.Replace(input, "([^0-9]+)", " ");
    Console.WriteLine(input);
}
I need to store every row in a list so I can process each element of the data, but I can't manage to turn doc.DocumentNode.SelectNodes("//table[@id='searchResult']/tr/td/font[@class='detDesc']") into an array.

You can add your input to the list by calling
uploadList.Add(input);
inside the foreach loop instead of printing it to the console, and then read it back from the list later.
And you might want to select the tr, td and font children as described in:
https://stackoverflow.com/a/15004032/2960293
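Putting the two pieces together, here is a minimal, self-contained sketch of the loop body. The sample detDesc strings are made up for illustration; in the real code each row would come from node.InnerHtml:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Made-up rows standing in for the node.InnerHtml values scraped from the page
var rows = new List<string>
{
    "Uploaded 05-12&nbsp;2017, Size 623.29&nbsp;MiB, ULed by someUser",
    "Uploaded 11-03&nbsp;2017, Size 1.12&nbsp;GiB, ULed by otherUser"
};

var uploadList = new List<string>();
foreach (var row in rows)
{
    // Replace every run of non-digit characters with a single space,
    // then trim so each entry is just the numbers from the row
    var numbersOnly = Regex.Replace(row, "([^0-9]+)", " ").Trim();
    uploadList.Add(numbersOnly);
}

Console.WriteLine(string.Join(" | ", uploadList));
```

Each element of uploadList can then be split on spaces and processed further.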

Related

Store links into variable instead of text file

I am at a very early point on the C# learning curve. I have code that stores web links in a text file. How can I store them in a variable instead, so I can loop through them later in the code and access each one separately?
string pdfLinksUrl = "https://www.nordicwater.com/products/waste-water/";

// Load HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);

// select all <a> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// select all href attribute values starting with the product URL (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.ToLower().StartsWith("https://www.nordicwater.com/product/")
              select href;

// write all matching links to file
System.IO.File.WriteAllLines(@"c:\temp\pdflinks.txt", pdfUrls.ToArray());
pdfUrls already holds all of your URLs; you are using it when you write them to the file.
You can use a foreach loop to iterate over the URLs easily:
foreach (string url in pdfUrls.ToArray()) {
    Console.WriteLine($"PDF URL: {url}");
}
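If you want the links in a reusable variable rather than a file, you can materialize the query with ToList(). A minimal sketch of that step, with made-up URLs standing in for the hrefs scraped from the page:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up hrefs standing in for the attribute values scraped with HAP
var hrefs = new[]
{
    "https://www.nordicwater.com/product/meva-screen/",
    "https://example.com/unrelated/",
    "https://www.nordicwater.com/product/zickert-scraper/"
};

// Same filter as the LINQ query above, but stored in a List<string>
// instead of being written to a file
List<string> pdfUrls = hrefs
    .Where(h => h.ToLower().StartsWith("https://www.nordicwater.com/product/"))
    .ToList();

foreach (string url in pdfUrls)
{
    Console.WriteLine($"PDF URL: {url}");
}
```

Once materialized, the list can be indexed and iterated as often as needed without re-running the query.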

Parse Compelete Web Page

How can I parse a complete HTML web page, rather than specific nodes, using Html Agility Pack or any other technique?
I am using the code below, but it only parses specific nodes; I need the complete page parsed into neat and clean content.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    list.Add(node.InnerText);
}
To get all descendant text nodes, use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()")
    .Select(t => t.InnerText);
To get all non-empty descendant text nodes:
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);
Use SelectNodes("//*"). The asterisk is the wildcard selector, and with the // prefix it matches every element on the page (a bare "*" only matches the direct children of the context node).
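The text-node approach works the same on an in-memory document. A self-contained sketch (requires the HtmlAgilityPack NuGet package; the HTML string here is made up for illustration):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Note the whitespace-only paragraph: it produces a text node with no content
doc.LoadHtml("<div><h1>Title</h1><p>First <b>bold</b></p><p>   </p></div>");

// normalize-space() filters out whitespace-only text nodes,
// leaving only text with actual content
var textNodes = doc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText);

var joined = string.Join("|", textNodes);
Console.WriteLine(joined);
```

This flattens the whole page into its visible text fragments regardless of which elements contain them.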

Use HtmlAgilityPack to parse HTML variable, not HTML document?

I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
    var web = new HtmlWeb();
    var doc = web.Load("http://www.stackoverflow.com");
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerHtml);
    }
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of the HtmlDocument class.
"Document" is simply a name; it's not really a document (or doesn't have to be).
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
foreach (var node in doc.DocumentNode.SelectNodes("//u")) {
    Console.WriteLine(node.InnerHtml);
}
I've used this exact same thing to parse html chunks in variables.

SelectSingleNode returns the wrong result on a foreach

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"noprint res\"]/div");
if (nodes != null)
{
    foreach (HtmlNode data in nodes)
    {
        // Works but not what I want
        MessageBox.Show(data.InnerHtml);
        // Should work? But does not
        MessageBox.Show(data.SelectSingleNode("//span[@class=\"pp-place-title\"]").InnerText);
    }
}
I am trying to parse the results of an HTML page. The initial node set for the foreach works just as expected and gives me 10 items, which matches what I need.
Inside the foreach, outputting the inner HTML of the data item displays the correct data, but outputting the SelectSingleNode result always displays the data from the first item of the foreach. Is that normal behavior, or am I doing something wrong?
In order to resolve the issue I had to create a new html inside the foreach for every data item like this:
HtmlAgilityPack.HtmlDocument innerDoc = new HtmlAgilityPack.HtmlDocument();
innerDoc.LoadHtml(data.InnerHtml);
// Select what I need
MessageBox.Show(innerDoc.DocumentNode.SelectSingleNode("//span[@class=\"pp-place-title\"]").InnerText);
Then I get the correct per item data.
The page I was trying to get data from was http://maps.google.com/maps?q=consulting+loc:+US if you want to try it and see what happens for yourself.
Basically I am reading the left side column for company names and the above happens.
By starting your XPath expression with //, you are searching the entire document that contains the data node.
Use a relative expression such as ".//span[@class=\"pp-place-title\"]" (note the leading dot) to search only within data.
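A minimal sketch of the difference (requires the HtmlAgilityPack NuGet package; the HTML here is made up): with // the search restarts from the document root on every iteration, while .// stays inside the current node:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
    <div class='item'><span class='title'>First</span></div>
    <div class='item'><span class='title'>Second</span></div>");

foreach (HtmlNode item in doc.DocumentNode.SelectNodes("//div[@class='item']"))
{
    // Absolute path: always matches the first span in the whole document
    var absolute = item.SelectSingleNode("//span[@class='title']").InnerText;
    // Relative path: matches only within the current <div>
    var relative = item.SelectSingleNode(".//span[@class='title']").InnerText;
    Console.WriteLine($"{absolute} vs {relative}");
}
```

This is the behavior described in the question: the absolute expression returns "First" on every iteration, while the relative one tracks the current item.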

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag for example.
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have Html Agility Pack downloaded and referenced in my program; all I need is the paragraphs. There may be a variable number of paragraphs, and there are many different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; // that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
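The LINQ-free version works the same on an in-memory document. A self-contained sketch (requires the HtmlAgilityPack NuGet package; the HTML is adapted from the question):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div id='body_text'><p>Hi</p><p>Help Me Please</p><p>Thankyou</p></div>");

// Elements("p") returns only the direct <p> children of the div
var paragraphs = doc.GetElementbyId("body_text").Elements("p");

var joined = string.Join(", ", paragraphs.Select(p => p.InnerText));
Console.WriteLine(joined);
```

The joined string (or each InnerText individually) can then be written to a .txt file with File.WriteAllText or File.WriteAllLines.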
