Scrape InnerText from body of website in C#

I'm trying to gather the data from this website: http://services.runescape.com/m=hiscore_oldschool/index_lite.ws?player=f2pshrympy. Here is my code:
using HtmlAgilityPack;
using System;
var webGet = new HtmlWeb();
var document = webGet.Load("http://services.runescape.com/m=hiscore_oldschool/index_lite.ws?player=f2pshrympy");
var bodyText = document.DocumentNode.SelectNodes("/html/body/text()");
Console.WriteLine(bodyText);
Console.ReadLine();
When the program is run, nothing is printed to the console and there are no errors.
[screenshot of the empty console]
I'm guessing that nothing is being found with the XPath "/html/body/text()"; any ideas how I can go about fixing this?

Your page is pure text, so you don't need a tool like HtmlAgilityPack to parse it. Just download it and use it.
using (var wc = new WebClient()) // WebClient lives in System.Net
{
    var bodyText = wc.DownloadString("http://services.runescape.com/m=hiscore_oldschool/index_lite.ws?player=f2pshrympy");
    Console.WriteLine(bodyText);
}
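If you then want the individual skill values rather than the raw text, the response can be split into rows. A minimal sketch, assuming each line of the lite hiscores output is a comma-separated entry of the form rank,level,experience (the exact column layout is an assumption here):

using System;
using System.Net;

using (var wc = new WebClient())
{
    var bodyText = wc.DownloadString("http://services.runescape.com/m=hiscore_oldschool/index_lite.ws?player=f2pshrympy");

    // Split the response into one row per skill and pull the columns apart.
    var lines = bodyText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var line in lines)
    {
        var parts = line.Trim().Split(',');
        if (parts.Length >= 3) // assumed layout: rank, level, experience
            Console.WriteLine($"rank={parts[0]}, level={parts[1]}, xp={parts[2]}");
    }
}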

Related

How to get all Wikipedia hyperlinks from a webpage using HttpClient in C#

I want to get all hyperlinks from a Wikipedia page that lead to another Wikipedia page, in C#.
For example:
In the screenshot above you can see that I only want to get the links that lead to another Wiki article (red rectangles), even though there are other links on the page. I have written a function that scrapes every link on the page and returns a HashSet of them; its body is as follows:
private async Task<HashSet<string>> GetPages(CrawlerPage page)
{
    var client = new HttpClient();
    client.DefaultRequestHeaders.Add("User-Agent", "C# console program");
    var htmlContent = await client.GetStringAsync(page.mainLink);

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(htmlContent);

    var programmerLinks = htmlDoc.DocumentNode
        .Descendants("li")
        .Where(node => !node.GetAttributeValue("class", "").Contains("tocsection")).ToList();

    HashSet<string> wikiLinks = new();
    foreach (var link in programmerLinks)
    {
        if (link.FirstChild.Attributes.Count > 0)
            wikiLinks.Add("https://en.wikipedia.org/" + link.FirstChild.Attributes[0].Value);
    }
    return wikiLinks;
}
The function works fine, but it scrapes everything. Have a look at the screenshot below:
You can see that the things in the red rectangles are the links I want to get; the rest is junk (links I don't need).
I figured out that all of these links are under a <p> tag in the HTML, and the links themselves are in <a href> elements, but I still cannot figure out how to get only these specific links.
Can you tell me how I can get the desired links?
Thanks!
I tried to come up with something like the code below, which should get only the items you need within /wiki/.
I took the liberty of using HtmlAgilityPack since it's a well-documented library.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using System.Net.Http;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string url = "https://en.wikipedia.org/wiki/Axis_powers";
        string result = "";

        // Download the raw HTML of the article.
        using (HttpClient client = new HttpClient())
        {
            using (HttpResponseMessage response = client.GetAsync(url).Result)
            {
                using (HttpContent content = response.Content)
                {
                    result = content.ReadAsStringAsync().Result;
                }
            }
        }

        // Keep only relative /wiki/ links, i.e. links to other articles.
        var links = ParseLinks(result).Where(x => x.Contains("/wiki/") && !x.Contains("https://")).ToList();
        foreach (var link in links)
        {
            Console.WriteLine(link);
        }

        // Local function: collect the attribute values of every anchor that has an href.
        List<string> ParseLinks(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
                r => r.Attributes.ToList().ConvertAll(
                    i => i.Value)).SelectMany(j => j).ToList();
        }
    }
}
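Since the asker noted that the wanted links sit inside <p> tags, the XPath can also be tightened so that only article-body anchors are selected. A sketch meant to drop into the Main above (which already has the needed usings), assuming the HTML has been downloaded into a string named html; the colon check for skipping namespaced pages is an assumption:

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Only anchors inside paragraphs whose href starts with /wiki/.
var nodes = doc.DocumentNode.SelectNodes("//p//a[starts-with(@href, '/wiki/')]");

var wikiLinks = new HashSet<string>();
if (nodes != null)
{
    foreach (var a in nodes)
    {
        var href = a.GetAttributeValue("href", "");
        if (!href.Contains(":")) // skip namespaced pages such as /wiki/File:...
            wikiLinks.Add("https://en.wikipedia.org" + href);
    }
}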

Get webpage source code with alt key code symbols using asp.net c#

I'm trying to get a webpage's source code using HtmlAgilityPack. This is my code to get the source code and fill it into a multiline textbox:
var url = "http://www.example.com";
var web = new HtmlWeb();
var doc = web.Load(url);
sourcecodetxt.Text = doc.ToString();
The code is working fine, but if my webpage has some "Alt Code Symbols" then the symbol is replaced with some other characters, e.g. ★ comes back as garbled text.
My question is how to get the original symbol. Sorry for my bad English. Thanks in advance.
Try using WebClient and HtmlDocument's Load() method so you can specify the encoding:
WebClient client = new WebClient();
HtmlDocument doc = new HtmlDocument();
doc.Load(client.OpenRead("http://www.example.com"), Encoding.UTF8);
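For completeness, a self-contained sketch with the namespaces spelled out (WebClient is in System.Net, Encoding in System.Text); sourcecodetxt is the textbox from the question:

using System.Net;
using System.Text;
using HtmlAgilityPack;

// Download the page and let HtmlAgilityPack decode it as UTF-8,
// so a character such as ★ survives instead of turning into mojibake.
using (var client = new WebClient())
using (var stream = client.OpenRead("http://www.example.com"))
{
    var doc = new HtmlDocument();
    doc.Load(stream, Encoding.UTF8);

    // OuterHtml returns the decoded source; doc.ToString() does not.
    sourcecodetxt.Text = doc.DocumentNode.OuterHtml;
}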

Get Page Main Content using the URL

I need to be able to get a page's main content from a certain URL.
A very good example of what I need to do is the following: http://embed.ly/docs/explore/preview?url=http%3A%2F%2Fedition.cnn.com%2F2012%2F08%2F20%2Fworld%2Fmeast%2Fflight-phobia-boy-long-way-home%2Findex.html%3Fiid%3Darticle_sidebar
I am using ASP.NET with the C# language.
Parsing HTML pages and guessing the main content is not an easy process. I would recommend using NReadability and HtmlAgilityPack.
Here is an example of how it could be done. The main text is always in a div with id readInner after NReadability has transcoded the page.
string url = "http://.......";
var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);
if (b)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
var text = doc.DocumentNode.SelectSingleNode("//div[#id='readInner']")
.InnerText;
}
Man, I guess it's done using the WebClient class or the WebRequest class. With it you can download all the content of the page; then, using any data mining algorithm, you can get the information you want.

WebRequest using Mozilla Firefox

I need to have access to the HTML of a Facebook page, to extract some data from it. So, I need to create a WebRequest.
For example, my code worked well for other sites, but for Facebook I must be logged in to be able to access the HTML.
How can I use Firefox data to create a WebRequest for the Facebook page?
I tried this:
List<string> HTML_code = new List<string>();
WebRequest request = WebRequest.Create(URL);
using (WebResponse response = request.GetResponse())
using (StreamReader stream = new StreamReader(response.GetResponseStream()))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        HTML_code.Add(line);
    }
}
...but the resulting HTML is the HTML of the Facebook home page when I am not logged in.
If what you are trying to do is retrieve the number of likes from a Facebook page, you can use Facebook's Graph API service. Just to keep it simple, this is what I basically did in the code:
Retrieve the Facebook page's data. In this case I used the Coke page's data since it was an example FB had listed.
Parse the returned JSON using Json.NET. There are other ways to do this, but this keeps it simple, and you can get Json.NET over at CodePlex. The documentation I looked at for my code was from this page in the docs. Their documentation will also help you with parsing and serializing even more JSON if you need to.
Then that basically translates into this code. Just note that I left out all the fancy exception handling to keep it simple, as networking is not always reliable! Also, don't forget to include the Json.NET library in your project!
Usings:
using System.IO;
using System.Net;
using Newtonsoft.Json.Linq;
Code:
string url = "https://graph.facebook.com/cocacola";
WebClient client = new WebClient();
string jsonData = string.Empty;
// Load the Facebook page info
Console.WriteLine("Connecting to Facebook...");
using (Stream data = client.OpenRead(url))
{
    using (StreamReader reader = new StreamReader(data))
    {
        jsonData = reader.ReadToEnd();
    }
}
// Get number of likes from Json data
JObject jsonParsed = JObject.Parse(jsonData);
int likes = (int)jsonParsed.SelectToken("likes");
// Write out the result
Console.WriteLine("Number of Likes: " + likes);

How can I get website text without using a web browser?

I tried to use a WebBrowser control so that I could get the text from it. But instead of getting the text, it downloads the file to my computer.
How can I get this text without using a web browser?
Thanks
You can use a WebClient:
string output = string.Empty;
using (WebClient wc = new WebClient())
{
    output = wc.DownloadString("http://stackoverflow.com");
}
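On newer .NET versions HttpClient is generally preferred over WebClient; a minimal sketch of the same download, assuming an async Main:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        {
            // Same idea as DownloadString, but asynchronous.
            string output = await http.GetStringAsync("http://stackoverflow.com");
            Console.WriteLine(output);
        }
    }
}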
