Get content from subpages using HTML Agility Pack for WP8


I've managed to parse HTML (content) from a news site (ryfylke.net) and display it in my WP8 app. But how can I parse content from subpages (the "Read more" links)?
For now, when I click the links, the app launches IE and displays the actual site. What I would like to do instead is parse the content from that page and display it in the app.
EDIT (This is my current MainPage.xaml.cs)
protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string htmlPage = "";
    using (var client = new HttpClient())
    {
        htmlPage = await client.GetStringAsync("http://ryfylke.net/kategori/nyheter/");
    }
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlPage);
    List<Nyheter> nyheter = new List<Nyheter>();
    // XPath selects attributes with '@'; each news item is an <article> whose class starts with "post-"
    foreach (var div in htmlDocument.DocumentNode.SelectNodes("//article[starts-with(@class, 'post-')]"))
    {
        Nyheter newNyheter = new Nyheter();
        newNyheter.Link = div.SelectSingleNode(".//a[@href]").Attributes["href"].Value;
        newNyheter.Bilde = div.SelectSingleNode(".//img[@class='attachment-entry-medium wp-post-image']").Attributes["src"].Value;
        newNyheter.Tittel = div.SelectSingleNode(".//h2[@class='entry-title entry-small-title']").InnerText.Trim();
        newNyheter.Sammendrag = div.SelectSingleNode(".//p[@class='entry-excerpt']").InnerText.Trim();
        nyheter.Add(newNyheter);
    }
    lstNyheter.ItemsSource = nyheter;
}
I then expose the content through public string properties like this:
public string Bilde { get; set; }

Related

HtmlDocument get a incomplete page

I am doing a small web scraping project, and I am having a problem with the function that fetches the HTML. The page I inspect in the browser is different from the page my method downloads (for the same URL).
I have tried changing the encoding handling, but to no avail. The same thing happens for "i=2".
static void Main(string[] args)
{
string prefixurl = "https://www.aaabbbcccdddeee.de/en/do-business-with-finland/finnish-suppliers/finnish-suppliers-results?query=africa";
for (int i = 1; i < 18; i++)
{
string url = prefixurl;
if (i > 1)
{
url = prefixurl + "&page=" + i;
}
var doc = GetDocument(url);
var links = GetBusinessLinks(url);
List<Empresa> empresas = GetBusiness(links);
Export(empresas);
}
}
static List<string> GetBusinessLinks(string url)
{
var doc = GetDocument(url);
var linkNodes = doc.DocumentNode.SelectNodes("/html/body/section/div/div/div/div[2]/div[2]//a");
// //a[@class=\"btn bf-ghost-button\"]
var baseUri= new Uri(url);
var links = new List<string>();
//The problem is here: in the incomplete page the program finds no nodes, so linkNodes is null
foreach (var node in linkNodes)
{
var link = node.Attributes["href"].Value;
bool business = link.Contains("companies");
if (business)
{
link = new Uri(baseUri, link).AbsoluteUri;
links.Add(link);
}
}
return links;
}
static HtmlDocument GetDocument(string url)
{
var web = new HtmlWeb();
HtmlDocument doc = new HtmlDocument()
{
OptionDefaultStreamEncoding = Encoding.UTF8
};
doc = web.Load(url);
return doc;
}

Your suggestion pointed me to where I should keep looking, thanks.
I ended up using PuppeteerSharp in non-headless mode.
https://betterprogramming.pub/web-scraping-using-c-and-net-d99a085dace2

How to read specific values from response

This is the page I'm using for documentation https://lichess.org/api#operation/player
I want to get each player's username, rating, and title.
My code.
public class Player {
public string username;
public double rating;
public string title;
}
HttpClient client = new HttpClient();
client.BaseAddress = new Uri("https://lichess.org/");
HttpResponseMessage response = client.GetAsync("player/top/200/bullet").Result;
Here I'm getting a response, but I have no clue how to take only the properties that I need and store them in a list of players.

After a discussion with you on this problem, it turned out that the response you are receiving is an HTML string, so you need to handle this case differently. I played around with the HTML that you posted in the comments and was able to parse it with HTML Agility Pack, which can be found here. You can also install this package from the NuGet Package Manager in Visual Studio.
Here is a very basic example of the parsing process that I tried out:
public void ProcessHtml()
{
List<Player> playersList = new List<Player>();
//Get your HTML loaded from a URL. Giving me SSL exceptions so took a different route
//var url = "https://lichess.org/player/top/200/bullet";
//var web = new HtmlWeb();
//var doc = web.Load(url);
//Get your HTML loaded as a file in my case
var doc = new HtmlDocument();
doc.Load("C:\\Users\\Rahul\\Downloads\\CkBsZtvf.html", Encoding.UTF8);
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//tbody"))
{
foreach (HtmlNode row in table.SelectNodes("tr"))
{
int i = 0;
Player player = new Player();
//Each tr has 4 cells, so pick only the ones we need based on the cell index
foreach (HtmlNode cell in row.SelectNodes("th|td"))
{
if(i==1)
{
player.username = cell.InnerText;
}
if(i==2)
{
player.rating = Convert.ToDouble(cell.InnerText);
}
if(i==3)
{
player.title = cell.InnerText;
}
i++;
}
playersList.Add(player);
}
}
var finalplayerListCopy = playersList;
}
public class Player
{
public string username;
public double rating;
public string title;
}
After running this, your finalplayerListCopy has a count of 200.
Obviously, you will have to play with the data and tailor it to your needs. I hope this helps you out.
Cheers!

From what I've read in the documentation:
async Task<Player> getPlayerAsync(string path)
{
Player player= null;
HttpResponseMessage response = await client.GetAsync(path);
if (response.IsSuccessStatusCode)
{
player = await response.Content.ReadAsAsync<Player>();
}
return player;
}
getPlayerAsync("https://lichess.org/player/top/200/bullet");
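If the endpoint is asked for JSON instead of HTML, the three fields can be picked out without an HTML parser. The following is only a sketch under assumptions: the JSON shape (a "users" array whose entries carry "username", an optional "title", and a rating nested under "perfs") is assumed and should be checked against the linked API documentation, and PlayerJson.Parse is a hypothetical helper name. The HTTP request itself is left out; the sketch only parses a string that has already been downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public class Player
{
    public string username;
    public double rating;
    public string title;
}

public static class PlayerJson
{
    // Hypothetical helper: extracts username, rating and title from a JSON string.
    // ASSUMED shape: {"users":[{"username":"...","title":"...","perfs":{"<perfType>":{"rating":123}}}]}
    public static List<Player> Parse(string json, string perfType)
    {
        var players = new List<Player>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement user in doc.RootElement.GetProperty("users").EnumerateArray())
            {
                var player = new Player
                {
                    username = user.GetProperty("username").GetString(),
                    rating = user.GetProperty("perfs").GetProperty(perfType).GetProperty("rating").GetDouble(),
                    // Not every player has a title, so treat the property as optional.
                    title = user.TryGetProperty("title", out JsonElement t) ? t.GetString() : null
                };
                players.Add(player);
            }
        }
        return players;
    }
}
```

Calling PlayerJson.Parse(responseBody, "bullet") on such a response would give a List<Player> that can be bound or filtered directly. GetProperty throws if a required field is missing, which is deliberate here so that a wrong assumption about the shape shows up immediately.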

html agility pack url scraping-- getting full html link

Hi, I am using the HTML Agility Pack NuGet package to scrape a web page and get all of the URLs on it. The code is shown below. However, the links in the output are only paths relative to the website rather than full URLs like http://www.foo/bar/foobar.com. All I get is "/foobar". Is there a way to get the full URLs with the code below?
Thanks!
static void Main(string[] args)
{
List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}
public static List<string> ParseLinks(string email)
{
WebClient webClient = new WebClient();
byte[] data = webClient.DownloadData(email);
string download = Encoding.ASCII.GetString(data);
HashSet<string> list = new HashSet<string>();
var doc = new HtmlDocument();
doc.LoadHtml(download);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var n in nodes)
{
string href = n.Attributes["href"].Value;
list.Add(href);
}
return list.ToList();
}

You can check whether the href value is a relative or an absolute URL.
Load the link into a Uri and test whether it is relative. If it is, convert it to an absolute URL.
static void Main(string[] args)
{
List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}
public static List<string> ParseLinks(string urlToCrawl)
{
WebClient webClient = new WebClient();
byte[] data = webClient.DownloadData(urlToCrawl);
string download = Encoding.ASCII.GetString(data);
HashSet<string> list = new HashSet<string>();
var doc = new HtmlDocument();
doc.LoadHtml(download);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var n in nodes)
{
string href = n.Attributes["href"].Value;
list.Add(GetAbsoluteUrlString(urlToCrawl, href));
}
return list.ToList();
}
A function to convert a relative URL to an absolute one:
static string GetAbsoluteUrlString(string baseUrl, string url)
{
var uri = new Uri(url, UriKind.RelativeOrAbsolute);
if (!uri.IsAbsoluteUri)
uri = new Uri(new Uri(baseUrl), uri);
return uri.ToString();
}

You can't get the complete URL because the href attribute doesn't contain it. Example:
In your case the page contains relative URLs. You need to do this:
string href = email + n.Attributes["href"].Value;
This gives you the full URL. A better solution is to check whether each URL is relative or absolute and prepend email only when it is relative.
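Plain concatenation only works when the page URL has no path of its own and the href starts with a slash. A minimal sketch of letting System.Uri do the resolution instead is shown here; UrlResolver and ToAbsolute are hypothetical names, and the example.com addresses are placeholders.

```csharp
using System;

public static class UrlResolver
{
    // Resolves an href against the URL of the page it was found on.
    // An href that is already absolute is returned as-is, so no separate check is needed.
    public static string ToAbsolute(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        return new Uri(baseUri, href).AbsoluteUri;
    }
}
```

For a page at https://www.example.com/news/index.html, an href of "/foobar" resolves against the site root and an href of "article.html" resolves against the /news/ folder, while concatenating the two strings would produce an invalid address in both cases.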

C# Web Crawler/Parser/Spider [closed]

I'm new to C# and WinForms. I want to create a web crawler (parser) that can parse web pages and show them hierarchically. I also don't know how to make the bot crawl to a specific hyperlink depth.
So I think I have 2 questions:
How do I make the bot crawl to a specified link depth?
How do I show all hyperlinks hierarchically?
P.S. It would be great to have code samples.
P.P.S. I have 1 button = button1; and 1 richtextbox = richTextBox1;
Here is my code. I know it's very ugly... (all the code is in one button handler):
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
//Go to this URL (it must be set before the request is created):
string url = UrlTextBox.Text = "http://www.yahoo.com";
if (!(url.StartsWith("http://") || url.StartsWith("https://")))
url = "http://" + url;
//Declaration
HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
Match m;
string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
List<string> savedUrls = new List<string>();
List<string> titles = new List<string>();
//Scrape Whole Html code:
string s = sr.ReadToEnd();
try
{
// Get Urls:
m = Regex.Match(s, anotherTest,
RegexOptions.IgnoreCase | RegexOptions.Compiled,
TimeSpan.FromSeconds(1));
while (m.Success)
{
savedUrls.Add(m.Groups[1].ToString());
m = m.NextMatch();
}
// Get TITLES:
Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
if (m2.Success)
{
titles.Add(m2.Groups[1].Value);
}
//Show Title:
richTextBox1.Text += titles[0] + "\n";
//Show Urls:
TrimUrls(ref savedUrls);
}
catch (RegexMatchTimeoutException)
{
Console.WriteLine("The matching operation timed out.");
}
sr.Close();
}
private void TrimUrls(ref List<string> urls)
{
List<string> d = urls.Distinct().ToList();
foreach (var v in d)
{
if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
{
richTextBox1.Text += v + "\n";
}
}
}
}
}
And one more question:
Does anybody know how to save it in XML as a tree?

I would also highly recommend you the HTML Agility Pack.
With the Html Agility Pack you can do something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var urls = new List<String>();
// Select only anchors that have an href; SelectNodes returns null when nothing matches
foreach (var x in doc.DocumentNode.SelectNodes("//a[@href]"))
{
urls.Add(x.Attributes["href"].Value);
}
Edit:
You can do something like this, but please add some exception handling to it.
public class ParsResult
{
public ParsResult Parent { get; set; }
public String Url { get; set; }
public Int32 Depth { get; set; }
}
private readonly List<ParsResult> _results = new List<ParsResult>();
private Int32 _maxDepth = 5;
public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
if (depth >= _maxDepth) return;
String html;
using (var wc = new WebClient())
html = wc.DownloadString(urlToCheck ?? parent.Url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
var aNods = doc.DocumentNode.SelectNodes("//a");
if (aNods == null || !aNods.Any()) return;
foreach (var aNode in aNods)
{
var url = aNode.Attributes["href"];
if (url == null)
continue;
var result = new ParsResult
{
Depth = depth,
Parent = parent,
Url = url.Value
};
_results.Add(result);
Console.WriteLine("{0} - {1}", depth, result.Url);
Foo(depth: depth + 1, parent: result);
}
}

If you need to parse structured data like this (XHTML), take a look at XPath: http://msdn.microsoft.com/en-us/library/ms256086.aspx
(You should also put your logic into dedicated objects rather than leaving it in the GUI layer. You will appreciate it later.)
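For the follow-up question about saving the result in XML as a tree, one possible approach is to turn the parent links recorded during the crawl into nested elements with System.Xml.Linq. This is only a sketch under assumptions: ParsResult is repeated here with the same shape as in the answer above so the block stands alone, CrawlXml.ToXml and the element names "crawl" and "page" are made up for illustration, and the input list is assumed to hold every parent before its children, which is the order a depth-first crawl records them in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Same shape as the ParsResult class in the answer above.
public class ParsResult
{
    public ParsResult Parent { get; set; }
    public string Url { get; set; }
    public int Depth { get; set; }
}

public static class CrawlXml
{
    // Hypothetical helper: builds nested <page> elements from the Parent references.
    public static XElement ToXml(IEnumerable<ParsResult> results)
    {
        var root = new XElement("crawl");
        var elements = new Dictionary<ParsResult, XElement>();
        foreach (var result in results)
        {
            var element = new XElement("page",
                new XAttribute("url", result.Url),
                new XAttribute("depth", result.Depth));
            elements[result] = element;

            // Attach to the parent's element if it was already seen, otherwise to the root.
            XElement parentElement;
            if (result.Parent != null && elements.TryGetValue(result.Parent, out parentElement))
                parentElement.Add(element);
            else
                root.Add(element);
        }
        return root;
    }
}
```

The returned XElement can be written to disk with its Save method, and the same nesting can be walked to fill a TreeView if the hierarchy should also be shown in the form.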

Build HtmlGenericControl from a string of full html

I want to be able to add attributes to a string of html without having to build a parser to handle the html. In one specific case, I want to be able to extract the id of the html or insert an id to the html server side.
Say I have:
string stringofhtml = "<img src=\"someimage.png\" alt=\"the image\" />";
I would like to be able to do something like:
HtmlGenericControl htmlcontrol = new HtmlGenericControl(stringofhtml);
htmlcontrol.Attributes["id"] = "newid";
OR
string theid = htmlcontrol.Attributes["id"];
This is just a way that I can access/add attributes of the html strings that I have.

You can do this:
HtmlGenericControl ctrl = new HtmlGenericControl();
ctrl.InnerHtml = "<img src=\"someimage.png\" alt=\"the image\" />";
You could always use a LiteralControl too, instead of an HtmlGenericControl:
LiteralControl lit = new LiteralControl(stringOfHtml);

I do not think there is a control available that provides the functionality you are looking for.
Below I have used the HTML Agility Pack to parse and query the HTML, and created a new control that subclasses the Literal control.
This control accepts an HTML string, checks that it contains at least one element, and provides access to get and set that element's attributes.
Example usage
string image = "<img src=\"someimage.png\" alt=\"the image\" />";
HtmlControlFromString htmlControlFromString = new HtmlControlFromString(image);
htmlControlFromString.Attributes["id"] = "image2";
string id = htmlControlFromString.Attributes["id"];
Control
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI.WebControls;
using HtmlAgilityPack;
public class HtmlControlFromString : Literal
{
private HtmlDocument _document = new HtmlDocument();
private HtmlNode _htmlElement;
public AttributesCollection Attributes { get; set; }
public HtmlControlFromString(string html)
{
_document.LoadHtml(html);
if (_document.DocumentNode.ChildNodes.Count > 0)
{
_htmlElement = _document.DocumentNode.ChildNodes[0];
Attributes = new AttributesCollection(_htmlElement);
Attributes.AttributeChanged += new EventHandler(Attributes_AttributeChanged);
SetHtml();
}
else
{
throw new InvalidOperationException("Argument does not contain a valid html element.");
}
}
void Attributes_AttributeChanged(object sender, EventArgs e)
{
SetHtml();
}
void SetHtml()
{
Text = _htmlElement.OuterHtml;
}
}
public class AttributesCollection
{
public event EventHandler AttributeChanged;
private HtmlNode _htmlElement;
public string this[string attribute]
{
get
{
HtmlAttribute htmlAttribute = _htmlElement.Attributes[attribute];
return htmlAttribute == null ? null : htmlAttribute.Value;
}
set
{
HtmlAttribute htmlAttribute = _htmlElement.Attributes[attribute];
if (htmlAttribute == null)
{
htmlAttribute = _htmlElement.OwnerDocument.CreateAttribute(attribute);
htmlAttribute.Value = value;
_htmlElement.Attributes.Add(htmlAttribute);
}
else
{
htmlAttribute.Value = value;
}
EventHandler attributeChangedHandler = AttributeChanged;
if (attributeChangedHandler != null)
attributeChangedHandler(this, new EventArgs());
}
}
public AttributesCollection(HtmlNode htmlElement)
{
_htmlElement = htmlElement;
}
}
Hope this helps.
