Store links in a variable instead of a text file - C#

I am at a very early stage of learning C#. I have code that stores web links in a text file. How can I store them in a variable so I can loop through them later in the code and access each one separately?
string pdfLinksUrl = "https://www.nordicwater.com/products/waste-water/";
// Load HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);
// select all <A> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// select all href attribute values that start with the product URL prefix (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.ToLower().StartsWith("https://www.nordicwater.com/product/")
              select href;
// write all matching links to file
System.IO.File.WriteAllLines(@"c:\temp\pdflinks.txt", pdfUrls.ToArray());

pdfUrls already holds all of your URLs; it is the variable you pass when writing them to the file.
You can use a foreach loop to go through the URLs easily:
foreach (string url in pdfUrls.ToArray()) {
    Console.WriteLine($"PDF URL: {url}");
}
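If the links are needed several times later in the code, it may help to run the query once and keep the results in a List<string>. A minimal sketch, assuming the hrefs array below stands in for the href values read from the page (the sample URLs and the name productLinks are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the href values read from the page.
string[] hrefs =
{
    "https://www.nordicwater.com/product/example-a/",
    "https://www.nordicwater.com/about/",
    "https://www.nordicwater.com/Product/example-b/"
};

// ToList() executes the query once and keeps the results in memory.
List<string> productLinks = hrefs
    .Where(h => h.StartsWith("https://www.nordicwater.com/product/",
                             StringComparison.OrdinalIgnoreCase))
    .ToList();

// The list can be indexed or looped over as often as needed.
Console.WriteLine(productLinks[0]);
foreach (string link in productLinks)
{
    Console.WriteLine(link);
}
```

Without ToList() or ToArray(), the LINQ query is re-evaluated every time it is enumerated, so storing the result once is usually the simpler choice.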

Related

HtmlAgilityPack doesn't get nodes

I am using HtmlAgilityPack in C#. When I try to select the images in a div near the bottom of the page at the URL below, the selector finds nothing, although I think the selector is correct.
The code is in the fiddle. Thanks.
https://dotnetfiddle.net/NNIC3X
var url = "https://fotogaleri.haberler.com/unlu-sarkici-imaj-degistirdi-gorenler-gozlerine/";
//I will get the images src values in .col-big div at this url.
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big']//*/img");
//I am selecting all images in div.col-big, but nothing is found.
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
Your XPath is wrong because there is no div tag whose class attribute has the value 'col-big'. There is, however, a div tag whose class attribute has the value 'col-big pull-left'. So try:
var htmlNodes = doc.DocumentNode.SelectNodes("//div[@class='col-big pull-left']//*/img");
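An exact match on the full class string breaks as soon as the page adds or reorders class names. A common XPath 1.0 idiom is to match a single class token instead. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the inline markup is a made-up stand-in for the real page:

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup mimicking a div that carries two class names.
var doc = new HtmlDocument();
doc.LoadHtml(
    "<div class='col-big pull-left'>" +
    "<span><img src='a.jpg'></span><span><img src='b.jpg'></span>" +
    "</div>");

// Match 'col-big' as a whole token inside the space-separated class list.
var images = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' col-big ')]//img");

// SelectNodes returns null when nothing matches, so check before looping.
if (images != null)
{
    foreach (var img in images)
    {
        Console.WriteLine(img.GetAttributeValue("src", ""));
    }
}
```

Padding the class value with spaces avoids false matches on longer names such as 'col-bigger', which a plain contains(@class, 'col-big') would accept.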

Html agility pack, how to add console output to List

I am trying to output some strings from a certain website and I want to add each row to a List.
The output I am looking for looks like this:
string url = "https://thepiratebay.org/search/rick%20and%20morty/0/99/0";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
//upload list
List<string> uploadList = new List<string>();
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@id='searchResult']/tr/td/font[@class='detDesc']"))
{
var input = node.InnerHtml.ToString();
//The [^0-9] expression is used to find any character that is NOT a digit, will replace with empty string
input = Regex.Replace(input, "([^0-9]+)"," ");
Console.WriteLine(input);
}
I need to store every row in a list in order to process each element of data, and I can't manage to turn doc.DocumentNode.SelectNodes("//table[@id='searchResult']/tr/td/font[@class='detDesc']") into an array.
You can add your input to the list by using
uploadList.Add(input);
in the foreach loop instead of printing it to the console and then somehow reading it from there again.
And you might want to select the tr, td and font children as described in:
https://stackoverflow.com/a/15004032/2960293
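Put together, the loop could look roughly like the sketch below. The rows array is a made-up stand-in for the node.InnerHtml value of each matched node, so the example runs without loading the page:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical strings standing in for node.InnerHtml of each matched row.
string[] rows =
{
    "Uploaded 04-12 2019, Size 1.2 GiB",
    "Uploaded 11-30 2018, Size 700 MiB"
};

var uploadList = new List<string>();
foreach (string row in rows)
{
    // Replace every run of non-digit characters with one space, then trim the ends.
    string digits = Regex.Replace(row, "[^0-9]+", " ").Trim();
    uploadList.Add(digits);
}

// Each element can now be processed separately.
foreach (string entry in uploadList)
{
    Console.WriteLine(entry);
}
```

If an array is really needed afterwards, uploadList.ToArray() converts the list.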

Parse Complete Web Page

How can I parse a complete HTML web page, not just specific nodes, using HTML Agility Pack or any other technique?
I am using this code, but it only parses a specific node. I need to parse the complete page and get neat, clean contents.
List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
list.Add(node.InnerText);
}
To get all descendant text nodes use something like
var textNodes = doc.DocumentNode.SelectNodes("//text()").
Select(t=>t.InnerText);
To get all non empty descendant text nodes
var textNodes = doc.DocumentNode.
SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText);
Alternatively, use SelectNodes("//*"). The '*' (asterisk) is the wildcard selector; on its own it matches only the child elements of the current node, while "//*" matches every element on the page.

Use HtmlAgilityPack to parse HTML variable, not HTML document?

I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of the HtmlDocument class.
"Document" is simply a name; it's not really a document (or doesn't have to be).
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
// "//u" selects the underline elements asked about in the question
foreach (var node in doc.DocumentNode.SelectNodes("//u")) {
    Console.WriteLine(node.InnerHtml);
}
I've used this exact same approach to parse HTML chunks in variables.

C#, Html Agility, Selecting every paragraph within a div tag

How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</P>
<p>Thankyou</P>
I have downloaded Html Agility Pack and referenced it in my program. All I need is the paragraphs. There may be a variable number of paragraphs, and there are loads of different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; //that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit LINQ:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");
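The question also asks about writing the text to a .txt file, which the answers above do not show. A minimal sketch of that last step, assuming the HtmlAgilityPack NuGet package is referenced; the inline markup and the output file name body_text.txt are placeholders chosen for illustration:

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup standing in for the real page.
var doc = new HtmlDocument();
doc.LoadHtml("<div id='body_text'><p>Hi</p><p>Help Me Please</p><p>Thankyou</p></div>");

// GetElementbyId returns null when no element has that id.
var div = doc.GetElementbyId("body_text");
string[] lines = div == null
    ? Array.Empty<string>()
    : div.Elements("p").Select(p => p.InnerText.Trim()).ToArray();

// Write one paragraph per line to a file in the temp folder (placeholder location).
string path = Path.Combine(Path.GetTempPath(), "body_text.txt");
File.WriteAllLines(path, lines);
Console.WriteLine($"Wrote {lines.Length} paragraphs to {path}");
```

If a single string is preferred over one line per paragraph, string.Join(Environment.NewLine, lines) combined with File.WriteAllText would do the same job.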
