C# Scrape data from wiki page (screen-scraping)


I want to scrape a Wiki page. Specifically, this one.
My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).
For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page
SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).
My code so far is (copied and edited from various websites)...
WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!
HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;
List<string> anchorTags = new List<string>();
foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
    anchorTags.Add(att);
}
However, I am getting an "ArgumentException was unhandled - Illegal characters in path" error.
What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution, I'd be glad to comply.

What's wrong with the code? To be blunt, everything. :P
The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.
The contents of the page (the part we're interested in) look something like this:
<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
<!-- ... -->
<b>SBS8987B</b>
(SLBP 192/194*)
<br>
<b>SBS8988Z</b>
(SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<br>
<b>SBS8989X</b>
(SLBP SP)
<br>
<!-- ... -->
</p>
Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";
    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();
    // Create an HtmlDocument from the contents found at the given url
    var doc = web.Load(url);
    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id 'Deployments' (the header)
        + "/following-sibling::p[1]" // move to the first `p` element after the header (where the actual content is)
        + "/b"; // select the `b` elements
    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments // from the list of registration numbers
        where b.InnerText == reg // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)
    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();
    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);
    return decoded;
}
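As for the exception in the question: the StreamReader(string) constructor treats its argument as a file path, so handing it the downloaded HTML is what triggers "Illegal characters in path". When the markup is already in a string, HtmlDocument.LoadHtml parses it directly. Here is a minimal sketch of that, run against a shortened, made-up copy of the markup shown above rather than the live page. The DeploymentLookup class, the FindDeployment method, and the sample string are illustrative names, not part of HtmlAgilityPack.

```csharp
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class DeploymentLookup
{
    // Hypothetical sample that mirrors the structure shown above; not the live page.
    public const string SampleHtml =
        "<h2><span id=\"Deployments\" class=\"mw-headline\">Deployments</span></h2>" +
        "<p><b>SBS8987B</b> (SLBP 192/194*)<br>" +
        "<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk (2nd Gen)<br>" +
        "<b>SBS8989X</b> (SLBP SP)<br></p>";

    public static string FindDeployment(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses markup held in a string (no file path involved)

        var nodes = doc.DocumentNode.SelectNodes(
            "//h2[span/@id='Deployments']/following-sibling::p[1]/b");
        if (nodes == null) return null; // SelectNodes returns null when nothing matches

        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null) return null;

        // The description is the text node that follows the <b> element, if any.
        var description = match.NextSibling?.InnerText ?? string.Empty;
        return WebUtility.HtmlDecode(reg + description).Trim();
    }
}
```

Calling DeploymentLookup.FindDeployment(DeploymentLookup.SampleHtml, "SBS8988Z") returns the decoded line for that bus, and an unknown registration returns null instead of throwing.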

Related

crawling price gives null , HtmlAgilityPack (C#)

I'm trying to get stock data from a website with a web crawler as a hobby project. I got the link to work and I got the Name of the stock, but I can't get the price. I don't know how to handle the HTML code. Here is my code:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants("div").Where(n => n.GetAttributeValue("class", "").Equals("Flexbox__StyledFlexbox-sc-1ob4g1e-0 eYavUv Row__StyledRow-sc-1iamenj-0 foFHXj Rows__AlignedRow-sc-1udgki9-0 dnLFDN")).ToList();
var stocks = new List<Stock>();
foreach (var div in divs)
{
    var stock = new Stock()
    {
        Name = div.Descendants("a").Where(a=>a.GetAttributeValue("class","").Equals("Link__StyledLink-sc-apj04t-0 foCaAq NameCell__StyledLink-sc-qgec4s-0 hZYbiE")).FirstOrDefault().InnerText,
        changeInPercent = div.Descendants("span").Where((a)=>a.GetAttributeValue("class", "").Equals("Development__StyledDevelopment-sc-hnn1ri-0 kJLDzW")).FirstOrDefault()?.InnerText
    };
    stocks.Add(stock);
}
foreach (var stock in stocks)
{
    Console.WriteLine(stock.Name + " ");
}
I got the Name correct, but I don't really know how to get the ChangeInPercent. I will paste the HTML code below.
The top highlight shows where I got the name from, and the second one is the "span" I want. I want the -4.70.
I'm a little bit confused about how to get the data with my code. I tried everything. My changeInPercent property is a string.
It has to be the code somehow...

There's probably an easier way to select a single attribute/node than the way you're doing it right now.
If you know the exact XPath expression to select the node you're looking for, then you can do the following:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var changeInPercent = htmlDocument.DocumentNode
.SelectSingleNode("//foo/bar")
.InnerText;
Getting the right XPath expression (the //foo/bar example above) is the tricky part, but it can be found quite easily using your browser's dev tools. You can navigate to the desired element and just copy its XPath expression. See here for a sample on how to copy the expression.
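One thing to watch for with this approach: SelectSingleNode returns null when the expression matches nothing, so reading .InnerText straight off the result can throw. The sketch below adds a null check and matches on part of the class name from the question instead of the whole generated string. The sample markup is invented for illustration, and whether the "Development__StyledDevelopment" prefix stays stable on the real site is an assumption. PriceChangeReader and ReadChange are made-up names.

```csharp
using HtmlAgilityPack;

public static class PriceChangeReader
{
    // Hypothetical markup; the real page's structure may differ.
    public const string SampleHtml =
        "<div><a class=\"NameCell__StyledLink-sc-qgec4s-0\">Example AB</a>" +
        "<span class=\"Development__StyledDevelopment-sc-hnn1ri-0 kJLDzW\">-4.70</span></div>";

    public static string ReadChange(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match on a fragment of the class attribute rather than the full value.
        var span = doc.DocumentNode.SelectSingleNode(
            "//span[contains(@class, 'Development__StyledDevelopment')]");

        // Null-conditional access: yields null when the span was not found.
        return span?.InnerText.Trim();
    }
}
```

If ReadChange returns null against the real page, the expression is not matching the downloaded markup, which is worth checking before suspecting the rest of the code.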

How to get a certain value in HTML code with C#

I want to find a specific value in HTML code using a WebBrowser in C#. For example, given this HTML:
<span class="pulser " data-dollari="164.843956376000000" eq_toman="_XcUOV" pulser-change="_OiuVD" pre-dollari="164.964899983000000">$164.97</span>
I need to get the value "164.964899983000000", as well as other values from the HTML.

If I understand you correctly, you want to get an element from a site and read its attribute values, such as 'pre-dollari'.
For C#, you can use ScrapySharp, a library that simulates a web browser and lets you scrape page contents. You can use it alongside HtmlAgilityPack to traverse the elements effectively.
So for your case, it could look like this.
// get your Url
Uri url = new Uri("http://yoursite.com"); // placeholder; an absolute URI needs a scheme
// open up the browser
ScrapingBrowser browser = new ScrapingBrowser();
// navigate to your page
WebPage page = browser.NavigateToPage(url, HttpVerb.Post, "", null);
// find your element, convert to a list and take the first result [0]
HtmlNode node2 = page.Find("span", By.Class("pulser")).ToList()[0];
// and now you can get the attribute by name and put it in a variable
string attributeValue = node2.GetAttributeValue("pre-dollari", "not found");
// attributeValue = 164.964899983000000
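If the HTML is already available as a string (for example, taken from the WebBrowser control), HtmlAgilityPack on its own is enough to read the attribute. A minimal sketch, using the span from the question as sample input. PulserReader and ReadAttribute are illustrative names only.

```csharp
using HtmlAgilityPack;

public static class PulserReader
{
    // The span from the question, held in a string for illustration.
    public const string SampleHtml =
        "<span class=\"pulser \" data-dollari=\"164.843956376000000\" eq_toman=\"_XcUOV\" " +
        "pulser-change=\"_OiuVD\" pre-dollari=\"164.964899983000000\">$164.97</span>";

    public static string ReadAttribute(string html, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // contains() is used because the class value has a trailing space.
        var span = doc.DocumentNode.SelectSingleNode("//span[contains(@class, 'pulser')]");
        if (span == null) return null; // no matching element

        // Second argument is the default returned when the attribute is absent.
        return span.GetAttributeValue(attributeName, string.Empty);
    }
}
```

The same method reads any of the other attributes, e.g. ReadAttribute(html, "data-dollari").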

How to get the query string from the URL to my scraper

I'm currently building a scraper that gets data from an airline's website.
https://www.norwegian.com/uk/booking/flight-tickets/farecalendar/?D_City=OSL&A_City=RIX&TripType=1&D_Day=17&D_Month=201910&dFare=57&IncludeTransit=false&CurrencyCode=GBP&mode=ab#/?origin=OSL&destination=RIX&outbound=2019-10&adults=1&direct=true&oneWay=true&currency=GBP
My objective is to get a link for each of these calendar days (from 1 to 31).
I am using an HTTP analyser, and if I pass a query it returns this in the Query String window:
/pixel;r:1875159210;labels=_fp.event.Default;rf=0;a=p-Sne09sHM2G2M2;url=https://www.norwegian.com/uk/ipc/availability/avaday?AdultCount=1&A_City=RIX&D_City=OSL&D_Month=201910&D_Day=17&IncludeTransit=false&TripType=1&CurrencyCode=GBP&dFare=57&mode=ab;ref=https://www.norwegian.com/uk/booking/flight-tickets/farecalendar/?D_City=OSL&A_City=RIX&TripType=1&D_SelectedDay=01&D_Day=01&D_Month=201910&IncludeTransit=false&CurrencyCode=GBP&mode=ab;fpan=0;fpa=P0-2049656399-1568351608065;ns=0;ce=1;qjs=1;qv=4c19192-20180628134937;cm=;je=0;sr=1920x1080x24;enc=n;dst=1;et=1568366731754;tzo=-60;ogl=
How do I pass each of these queries to a scraper?
EDIT: I should've probably said that I need the program to loop through each flight and change the day (in this case from 1 to 31) in the URL.
My scraper is pretty basic, it can do basic websites that have links and it can show things like Titles, Articles, etc..
I should probably add that my aim is to display the destination, prices, time for travel, etc... which are something that I would know how to do.
Hope you can understand this. Thanks!
This is what I currently have and I will modify it to suit my needs.
public void ScrapeData(string page)
{
    var web = new HtmlWeb();
    var doc = web.Load(page);
    var Articles = doc.DocumentNode.SelectNodes("//*[@class = 'article-single']");
    foreach (var article in Articles)
    {
        var header = HttpUtility.HtmlDecode(article.SelectSingleNode(".//li[@class = 'article-header']").InnerText);
        var description = HttpUtility.HtmlDecode(article.SelectSingleNode(".//li[@class = 'article-copy']").InnerText);
        Debug.Print($"Title: {header} \n + Description: {description}");
        _entries.Add(new EntryModel { Title = header, Description = description });
    }
}

That URL returns a calendar comprised of buttons with the fare info and day on them, so you'll have to parse the returned HTML to find the individual day and then the fare from that cell.
So it seems easy to hit the URL, then loop through each table cell in the calendar section for the sub-divs in the DOM that contain the relevant day and fare info. Fortunately they have an aria-label for both these items so they are easy to locate.
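For the looping part of the question (changing the day from 1 to 31), one option is to generate the URLs up front and pass each one to the scraper. The sketch below builds them from the avaday URL captured in the question. The parameter names come from that URL; dropping dFare and assuming the site accepts the remaining parameters for every day are guesses. FareCalendarUrls and ForMonth are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class FareCalendarUrls
{
    // Path and parameter names are copied from the URL captured in the question.
    // dFare is left out because it looks like a per-day fare; that is a guess.
    private const string Template =
        "https://www.norwegian.com/uk/ipc/availability/avaday" +
        "?AdultCount=1&A_City={0}&D_City={1}&D_Month={2}&D_Day={3:00}" +
        "&IncludeTransit=false&TripType=1&CurrencyCode=GBP&mode=ab";

    public static List<string> ForMonth(string origin, string destination, int year, int month)
    {
        var urls = new List<string>();
        int yearMonth = year * 100 + month;                  // 2019 and 10 become 201910
        int daysInMonth = DateTime.DaysInMonth(year, month); // avoids requesting day 31 in short months

        for (int day = 1; day <= daysInMonth; day++)
        {
            urls.Add(string.Format(CultureInfo.InvariantCulture, Template,
                Uri.EscapeDataString(destination), Uri.EscapeDataString(origin), yearMonth, day));
        }
        return urls;
    }
}
```

Each entry of FareCalendarUrls.ForMonth("OSL", "RIX", 2019, 10) could then be handed to a method like ScrapeData(page) from the question, with the XPath expressions adjusted to the calendar markup.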

Get website's html to a textbox

I was trying this but keep getting the error that gecko doesn't contain a definition for innerHTML..
GeckoElement g2element = null;
g2element = (GeckoElement)mainbrowsersrc.Document.GetElementByTagName("html");
rich1.Text = g2element.InnerHtml; // 48.066
or
rich1.Text = mainbrowsersrc.Document.GetElementsByTagName("html").innerHtml;

If you need the HTML of the entire page, then you should go with
(mainbrowsersrc.Document.DocumentElement as Gecko.DOM.GeckoHtmlHtmlElement)?.InnerHtml;
Please notice that the error that you get is because there is no method .GetElementByTagName(name); - the method is called GetElementsByTagName(name) - plural form.
This is because the tag name is not unique and the method returns a collection of elements with the same tag name - for example a collection of li (list item) elements.
Consequently, if you want to get a particular element by tag name, you should do something like:
string html = mainbrowsersrc.Document.GetElementsByTagName("html").FirstOrDefault().InnerHtml;
//or
html = mainbrowsersrc.Document.GetElementsByTagName("html")[0].InnerHtml;

C# grab urls using htmlagility

Okay, so I have this list of URLs on this webpage. How do I grab the URLs and add them to an ArrayList?
http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A
I only want the URLs which are in the list, look at it to see what I mean. I tried doing it myself and for whatever reason it takes all of the other URLs except for the ones I need.
http://pastebin.com/a7hJnXPP

Using Html Agility Pack
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}

If you want only the ones in the list, then the following code should work (this is assuming you have the page loaded into an HtmlDocument already)
List<string> hrefList = new List<string>(); //Make a list cause lists are cool.
foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    //Append animenewsnetwork.com to the beginning of the href value and add it
    // to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}
Breaking the XPath //a[contains(@href, 'id=')] down:
//a Select all <a> nodes...
[contains(@href, 'id=')] ... that have an href attribute containing the text id=.
That should be enough to get you going.
As an aside, I would suggest not listing each link in its own messagebox considering there are around 500 links on that page. 500 links = 500 messageboxes :(
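Prepending the host by hand works when every href starts with a slash. A slightly more forgiving alternative is to let System.Uri resolve each href against the page URL, which also handles hrefs relative to the current folder. A minimal sketch: LinkResolver and ToAbsolute are made-up names, and the href values in the usage note are hypothetical, not taken from the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkResolver
{
    // Resolves each href against the URL of the page it was found on.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        return hrefs
            .Where(h => !string.IsNullOrWhiteSpace(h))     // skip missing or empty hrefs
            .Select(h => new Uri(baseUri, h).AbsoluteUri)  // relative -> absolute
            .Distinct()                                    // drop duplicates, keep first-seen order
            .ToList();
    }
}
```

For example, with the list page as pageUrl, hypothetical hrefs such as "/encyclopedia/anime.php?id=1" and "anime.php?id=2" both resolve to full http://www.animenewsnetwork.com/encyclopedia/... URLs.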
