Select specific text from website page

Select specific text from website page - c#

I get a webpage content using this code:
static void Main(string[] args)
{
using (var client = new WebClient())
{
var pageContent = client.DownloadString("http://www.modern-railways.com");
Console.WriteLine(pageContent);
Console.ReadLine();
}
}
This is what I get:
…….News: <span class='articleTitle'>Victoria Metrolink improvement begins</span></a></h1><p><a href='/view_article.asp?ID=7541&pubID=37&t=0&s=0&sO=both&p=1&i=10' class='summaryText' data-ajax='false'>Published 13 February 2014, 11:28</a></p><div class='articleContent ui-widget ui-widget-content ui-helper-clearfix ui-corner-all '….
I need to capture every "articleTitle" and its published date from pageContent, which contains several of them. How can I do that? I need some direction.

You can use a regular expression for this:
var regex = new Regex(@"<span class='articleTitle'>(.+?)</span>");
var match = regex.Match(pageContent);
var result = match.Groups[1].Value;
The above code returns the first title and works only as long as the tag is built in exactly the same way every time. To get every title on the page, loop over all matches:
foreach (Match itemMatch in regex.Matches(pageContent))
{
    var articleTitle = itemMatch.Groups[1].Value;
    // TODO: do what you need with articleTitle (e.g. add it to a list)
}
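The question also asks for the published date, which the snippet above does not capture. A minimal sketch of one way to get both, assuming every article follows the same markup as the sample in the question (the named groups title and date and the inline sample string are illustrative, not part of the original answer):

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample markup copied from the output shown in the question.
        string pageContent =
            "News: <span class='articleTitle'>Victoria Metrolink improvement begins</span></a></h1>" +
            "<p><a href='/view_article.asp?ID=7541&pubID=37&t=0&s=0&sO=both&p=1&i=10' " +
            "class='summaryText' data-ajax='false'>Published 13 February 2014, 11:28</a></p>";

        // Capture the title, then lazily skip ahead to the next "Published ..." link text.
        // Singleline lets '.' cross line breaks in the downloaded page.
        var regex = new Regex(
            @"<span class='articleTitle'>(?<title>.+?)</span>.*?>Published (?<date>[^<]+)</a>",
            RegexOptions.Singleline);

        foreach (Match m in regex.Matches(pageContent))
        {
            Console.WriteLine(m.Groups["title"].Value + " | " + m.Groups["date"].Value);
        }
        // prints: Victoria Metrolink improvement begins | 13 February 2014, 11:28
    }
}
```

Like the original pattern, this breaks as soon as the site changes its markup, so an HTML parser is usually the sturdier choice for anything long-lived.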

Related

Iterate through web pages and download PDFs

I have code that crawls through all PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:
System.NullReferenceException HResult=0x80004003 Message=Object reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
It points to ProductListPage in foreach (HtmlNode src in ProductListPage).
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong, though.
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in section (related products). They are: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;
namespace NW_Crawler
{
class Program
{
static void Main(string[] args)
{
{
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
Console.WriteLine("Here are the links:" + ProductListPage);
foreach (HtmlNode src in ProductListPage)
{
htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
// Thread.Sleep(5000); // wait some time
HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
if (LinkTester != null)
{
foreach (var dllink in LinkTester)
{
string LinkURL = dllink.Attributes["href"].Value;
Console.WriteLine(LinkURL);
string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
var DLClient = new WebClient();
// Thread.Sleep(5000); // wait some time
DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}
}
}
}
}
}
}

I made a few changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is missing, Attributes["href"] is null and reading .Value throws "Object reference not set to an instance of an object".
Check that ProductListPage is not null before iterating.
ExtractFilename included the leading / in the name. Add + 1 to the LastIndexOf("/") result in the Substring call to skip that last /.
Move on to the next iteration if the href is null or empty in either loop.
Changed the product list query from //a[@class='ap-area-link']//a to //a[@class='ap-area-link']. You were searching for an <a> inside the <a> tag, which matches nothing, so SelectNodes returns null. If you still want to query it that way, the first if statement (ProductListPage != null) guards against the error.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
// Select the product links themselves, not <a> elements nested inside them
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null) // SelectNodes returns null when nothing matches
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue; // skip links without an href
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue; // skip download entries without an href
                // + 1 skips the last '/' so only the file name remains
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
    }
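To illustrate the file name fix in isolation, here is a small sketch using the PDF URL from the question. The Path.GetFileName variant is an alternative suggested here, not part of the answer above:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // PDF link taken from the question
        string linkUrl = "https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf";

        // Original code: the result still starts with '/'
        string withSlash = linkUrl.Substring(linkUrl.LastIndexOf("/"));

        // Fixed code: + 1 skips the '/'
        string withoutSlash = linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);

        // Alternative: let the framework split the path part of the URL
        string viaPath = Path.GetFileName(new Uri(linkUrl).LocalPath);

        Console.WriteLine(withSlash);    // /S1126-MRS-brochure-EN.pdf
        Console.WriteLine(withoutSlash); // S1126-MRS-brochure-EN.pdf
        Console.WriteLine(viaPath);      // S1126-MRS-brochure-EN.pdf
    }
}
```

Going through Uri also keeps a query string such as ?dl=0 out of the file name, which plain Substring would not.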

The XPath that you used seems to be incorrect. I loaded the web page in a browser, searched for that XPath, and got no results. After replacing it with //a[@class='ap-area-link'] I was able to find matching elements.

HtmlAgilityPack search url link

I am creating a Windows Forms application for a group of friends, and I'm using HtmlAgilityPack in it.
I need to find all versions of the TacO add-on, which are listed like this:
<li><a href='https://www.dropbox.com/s/nks140nf794tx77/GW2TacO_034r.zip?dl=0'>Download Build 034.1866r</a></li>
Additionally, I need to check for the latest version so that I can download the file from its URL. This is my code so far:
public static bool Tacoisuptodate(string Version)
{
// Load HtmlDocuments
var doc = new HtmlWeb().Load("http://www.gw2taco.com/");
var body = doc.DocumentNode.SelectNodes("//body").Single();
// Narrow the document down to the part we are interested in
//SelectNodes("//div"))
foreach (var node in doc.DocumentNode.SelectNodes("//div"))
{
// Check for null value
var classeValue = node.Attributes["class"]?.Value;
var idValue = node.Attributes["id"]?.Value;
var hrefValue = node.Attributes["href"]?.Value;
// We are looking for <div class="widget LinkList" id="LinkList1"> on the home page
if (classeValue == "widget LinkList" && idValue == "LinkList1")
{
foreach(HtmlNode content in node.SelectNodes("//li"))
{
Debug.Write(content.GetAttributeValue("href", false));
}
}
}
return false;
}
If somebody could help me, I would really appreciate it.

A single XPath is enough.
var xpath = "//h2[text()='Downloads']/following-sibling::div[@class='widget-content']/ul/li/a";
var doc = new HtmlAgilityPack.HtmlWeb().Load("http://www.gw2taco.com/");
var downloads = doc.DocumentNode.SelectNodes(xpath)
    .Select(a => new
    {
        href = a.Attributes["href"].Value,
        name = a.InnerText
    })
    .ToList();
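The query above throws if the page layout changes, because SelectNodes returns null when nothing matches. A minimal sketch of a null-safe variant follows. It assumes the HtmlAgilityPack NuGet package is installed, and the inline HTML is a hypothetical stand-in for the real page, modelled on the markup quoted in the question:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the real page, based on the question's snippet
        string html =
            "<div class='widget LinkList' id='LinkList1'><h2>Downloads</h2>" +
            "<div class='widget-content'><ul>" +
            "<li><a href='https://www.dropbox.com/s/nks140nf794tx77/GW2TacO_034r.zip?dl=0'>Download Build 034.1866r</a></li>" +
            "</ul></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var xpath = "//h2[text()='Downloads']/following-sibling::div[@class='widget-content']/ul/li/a";
        var nodes = doc.DocumentNode.SelectNodes(xpath);

        // Guard against a null result before using LINQ on it
        var downloads = (nodes ?? Enumerable.Empty<HtmlNode>())
            .Select(a => new
            {
                href = a.GetAttributeValue("href", string.Empty),
                name = a.InnerText
            })
            .ToList();

        foreach (var d in downloads)
            Console.WriteLine(d.name + " -> " + d.href);
    }
}
```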

System.ArgumentNullException when trying to access span with Xpath (C#)

I've been trying to get a program working that retrieves various stock statistics from Google Finance. So far I have not been able to get any information out of spans. For now I have hardcoded direct access to the Apple stock.
Link to Apple stock: https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=NgItWIG1GIftsAHCn4zIAg
What I can't understand is that I get the correct output when I try it in the Chrome console with the following command:
$x("//*[@id=\"appbar\"]//div//div//div//span");
This is my current code in Visual Studio 2015 with Html Agility Pack installed (I suspect the fault is in currDocNodeCompanyName):
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
var histDoc = web.Load("https://www.google.com/finance/historical?q=NASDAQ%3AAAPL&ei=q9IsWNm4KZXjsAG-4I7oCA.html");
var histDocNode = histDoc.DocumentNode.SelectNodes("//*[@id=\"prices\"]//table//tr//td");
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
var currDocNodeCurrency = currDoc.DocumentNode.SelectNodes("//*[@id=\"ref_22144_elt\"]//div//div");
var currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
var histDocText = histDocNode.Select(node => node.InnerText);
var currDocCurrencyText = currDocNodeCurrency.Select(node => node.InnerText);
var currDocCompanyName = currDocNodeCompanyName.Select(node => node.InnerText);
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(currDocCompanyName.Take(2).ToString());
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I have tried the XPath expression [text] and got output I can work with in the Chrome console, but not in VS. I have also experimented with a foreach loop, as was suggested to others.
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
///same as before
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
HtmlNodeCollection currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
///Same as before
List <string> blaList = new List<string>();
foreach (HtmlNode x in currDocNodeCompanyName)
{
blaList.Add(x.InnerText);
}
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(blaList[1]);
result.Add(blaList[2]);
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I would really appreciate if anyone could point me in the right direction.

If you check the contents of currDoc.DocumentNode.InnerHtml, you will notice that there is no element with the id "appbar". The result is therefore correct: the XPath matches nothing, so SelectNodes returns null.
I suspect that the HTML element you're trying to find is generated by a script (JavaScript, for example). That explains why you can see it in the browser but not in the HtmlDocument object: HtmlAgilityPack does not run scripts, it only downloads and parses the raw source code.
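A small sketch of the failure mode, assuming the HtmlAgilityPack NuGet package. The inline HTML is a made-up stand-in for a page whose "appbar" element only exists after scripts run in a browser:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Made-up raw source: the element with id "appbar" is not in the markup,
        // a browser would only create it by running the script.
        string rawSource =
            "<html><body><div id='static'>server-rendered</div>" +
            "<script>/* would build #appbar in a browser */</script></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(rawSource);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@id='appbar']//span");

        if (nodes == null)
        {
            // Calling nodes.Select(...) here would throw ArgumentNullException,
            // which matches the exception named in the question title.
            Console.WriteLine("No match: the element is not in the raw HTML.");
            return;
        }

        foreach (string text in nodes.Select(n => n.InnerText))
            Console.WriteLine(text);
    }
}
```

Checking the SelectNodes result for null before any LINQ call turns the exception into a condition the program can handle.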

Issue in Parsing Json image in C#

I am using .NET with C# and trying to parse JSON from a web service. I have done it with text, but I have a problem parsing an image. Here is the URL I am getting the JSON from:
http://collectionking.com/rest/view/items_in_collection.json?args=122
And this is my code to parse it:
using (var wc = new WebClient()) {
JavaScriptSerializer js = new JavaScriptSerializer();
var result = js.Deserialize<ck[]>(wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json args=122"));
foreach (var i in result) {
lblTitle.Text = i.node_title;
imgCk.ImageUrl = i.["main image"];
lblNid.Text = i.nid;
Any help would be great.
Thanks in advance.
PS: It returns the Title and Nid but not the Image.
My class is as follows:
public class ck
{
public string node_title;
public string main_image;
public string nid; }

Your problem is that you are setting ImageUrl to something like <img typeof="foaf:Image" src="http://... and not to an actual URL. You will need to parse main image further and extract the URL to show it correctly.
Edit
This was a tough nut to crack because of the whitespace in the property name. The only solution I could find was to remove the whitespace before parsing the string. It's not a very nice solution, but I couldn't find any other way using the built-in classes. You might be able to solve it properly using JSON.Net or some other library, though.
I also added a regular expression to extract the URL for you, though there is no error checking whatsoever here, so you'll need to add that yourself.
using (var wc = new WebClient()) {
JavaScriptSerializer js = new JavaScriptSerializer();
var result = js.Deserialize<ck[]>(wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json?args=122").Replace("\"main image\":", "\"main_image\":")); // Replace the name "main image" with "main_image" to deserialize it properly, also fixed missing ? in url
foreach (var i in result) {
lblTitle.Text = i.node_title;
string realImageUrl = Regex.Match(i.main_image, @"src=""(.*?)""").Groups[1].Value; // Extract the value of the src attribute to get the actual url; yields an empty string if there is no src attribute and throws if main_image is null
imgCk.ImageUrl = realImageUrl;
lblNid.Text = i.nid;
}
}
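Since the snippet above leaves error checking to the reader, here is one possible sketch of a checked extraction. The helper name ExtractSrc and the example.com URLs are illustrative assumptions, not part of the original answer:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns the src value, or null when it cannot be found
    static string ExtractSrc(string tag)
    {
        if (string.IsNullOrEmpty(tag))
            return null; // Regex.Match would throw on a null input

        Match m = Regex.Match(tag, @"src=""(.*?)""");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(ExtractSrc("<img typeof=\"foaf:Image\" src=\"http://example.com/a.jpg\" />"));
        // prints: http://example.com/a.jpg

        Console.WriteLine(ExtractSrc("<img alt=\"none\" />") ?? "(no src)");
        // prints: (no src)
    }
}
```

The caller can then decide what to do when null comes back, for example hiding the image control instead of assigning an empty ImageUrl.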

Try This
private static string ExtractImageFromTag(string tag)
{
int start = tag.IndexOf("src=\""),
end = tag.IndexOf("\"", start + 6);
return tag.Substring(start + 5, end - start - 5);
}
private static string ExtractTitleFromTag(string tag)
{
int start = tag.IndexOf(">"),
end = tag.IndexOf("<", start + 1);
return tag.Substring(start + 1, end - start - 1);
}
It may help

Replacing A Node Using HtmlAgilityPack Throwing Strange Error

I have a webpage that displays a table the user can edit. After the edits are made I want to save the table as a .html file that I can convert to an image later. I am doing this by overriding the render method. However, I want to remove two buttons and a DropDownList from the final version so that I just get the table by itself. Here is the code I am currently trying:
protected override void Render(HtmlTextWriter writer)
{
using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new StringWriter()))
{
base.Render(htmlwriter);
string renderedContent = htmlwriter.InnerWriter.ToString();
string output = renderedContent.Replace(@"<input type=""submit"" name=""viewReport"" value=""View Report"" id=""viewReport"" />", "");
output = output.Replace(@"<input type=""submit"" name=""redoEdits"" value=""Redo Edits"" id=""redoEdits"" />", "");
var doc = new HtmlDocument();
doc.LoadHtml(output);
var query = doc.DocumentNode.Descendants("select");
foreach (var item in query.ToList())
{
var newNodeStr = "<div></div>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
File.WriteAllText(currDir + "\\outputFile.html", output);
writer.Write(renderedContent);
}
}
I adapted this solution, found in another SO post about replacing nodes with HtmlAgilityPack:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
and here is the rendered HTML I am trying to alter:
<select name="Archives" onchange="javascript:setTimeout('__doPostBack(\'Archives\',\'\')', 0)" id="Archives" style="width:200px;">
<option selected="selected" value="Dashboard_Jul-2012">Dashboard_Jul-2012</option>
<option value="Dashboard_Jun-2012">Dashboard_Jun-2012</option>
</select>
The two calls to Replace are working as expected and removing the buttons. However this line:
var query = doc.DocumentNode.Descendants("select");
is throwing this error:
Method not found: 'Int32 System.Environment.get_CurrentManagedThreadId()'.
Any advice is appreciated.
Regards.

It seems you are using the .NET 4.5 build of the Agility Pack in a project that targets a framework version lower than 4.5. Either change the DLL reference to the build compiled for your framework version, or retarget your project to .NET 4.5 (if you're using VS 2012, that is).
