I'm currently building a scraper that gets data from an airline's website.
https://www.norwegian.com/uk/booking/flight-tickets/farecalendar/?D_City=OSL&A_City=RIX&TripType=1&D_Day=17&D_Month=201910&dFare=57&IncludeTransit=false&CurrencyCode=GBP&mode=ab#/?origin=OSL&destination=RIX&outbound=2019-10&adults=1&direct=true&oneWay=true&currency=GBP
My objective is to get a link from each of these calendar days (from 1 to 31).
I am using an HTTP analyser, and if I pass a query it returns this in the Query String window:
/pixel;r:1875159210;labels=_fp.event.Default;rf=0;a=p-Sne09sHM2G2M2;url=https://www.norwegian.com/uk/ipc/availability/avaday?AdultCount=1&A_City=RIX&D_City=OSL&D_Month=201910&D_Day=17&IncludeTransit=false&TripType=1&CurrencyCode=GBP&dFare=57&mode=ab;ref=https://www.norwegian.com/uk/booking/flight-tickets/farecalendar/?D_City=OSL&A_City=RIX&TripType=1&D_SelectedDay=01&D_Day=01&D_Month=201910&IncludeTransit=false&CurrencyCode=GBP&mode=ab;fpan=0;fpa=P0-2049656399-1568351608065;ns=0;ce=1;qjs=1;qv=4c19192-20180628134937;cm=;je=0;sr=1920x1080x24;enc=n;dst=1;et=1568366731754;tzo=-60;ogl=
How do I pass each of these queries to a scraper?
EDIT: I should probably have said that I need the program to loop through each flight and change the day (in this case from 1 to 31) in the URL.
My scraper is pretty basic: it can handle simple websites that have links, and it can show things like titles, articles, etc.
I should probably add that my aim is to display the destination, prices, travel time, etc., which is something I already know how to do.
Hope you can understand this. Thanks!
This is what I currently have and I will modify it to suit my needs.
public void ScrapeData(string page)
{
    var web = new HtmlWeb();
    var doc = web.Load(page);
    var Articles = doc.DocumentNode.SelectNodes("//*[@class = 'article-single']");
    foreach (var article in Articles)
    {
        var header = HttpUtility.HtmlDecode(article.SelectSingleNode(".//li[@class = 'article-header']").InnerText);
        var description = HttpUtility.HtmlDecode(article.SelectSingleNode(".//li[@class = 'article-copy']").InnerText);
        Debug.Print($"Title: {header} \n + Description: {description}");
        _entries.Add(new EntryModel { Title = header, Description = description });
    }
}
That URL returns a calendar comprised of buttons with the fare info and day on them, so you'll have to parse the returned HTML to find the individual day and then the fare from that cell.
So it seems easy to hit the URL, then loop through each table cell in the calendar section for the sub-divs in the DOM that contain the relevant day and fare info. Fortunately they have an aria-label for both these items so they are easy to locate.
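If you do want one request per day, as the edit to the question describes, the loop could look roughly like the sketch below. It is only a sketch: the class and method names are made up for illustration, the parameter list is copied from the availability URL in the captured request, and it assumes the server accepts that URL with only D_Day changed (the dFare parameter is left out because it appears to be specific to one day).

```csharp
using System;
using System.Collections.Generic;

public static class FareCalendarUrls
{
    // Taken from the "url=" part of the captured request in the question.
    private const string BaseUrl =
        "https://www.norwegian.com/uk/ipc/availability/avaday";

    // Returns one URL per day of the given month, varying only D_Day.
    // Assumption: the remaining parameters can stay fixed.
    public static List<string> ForMonth(int year, int month, string from, string to)
    {
        var urls = new List<string>();
        int days = DateTime.DaysInMonth(year, month);
        for (int day = 1; day <= days; day++)
        {
            urls.Add($"{BaseUrl}?AdultCount=1&A_City={to}&D_City={from}" +
                     $"&D_Month={year}{month:D2}&D_Day={day:D2}" +
                     "&IncludeTransit=false&TripType=1&CurrencyCode=GBP&mode=ab");
        }
        return urls;
    }
}
```

Each URL returned by FareCalendarUrls.ForMonth(2019, 10, "OSL", "RIX") can then be passed to a method like ScrapeData from the question.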
I'm trying to get stock data from a website with a web crawler as a hobby project. I got the link to work and I got the name of the stock, but I can't get the price. I don't know how to handle the HTML. Here is my code:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants("div").Where(n => n.GetAttributeValue("class", "").Equals("Flexbox__StyledFlexbox-sc-1ob4g1e-0 eYavUv Row__StyledRow-sc-1iamenj-0 foFHXj Rows__AlignedRow-sc-1udgki9-0 dnLFDN")).ToList();
var stocks = new List<Stock>();
foreach (var div in divs)
{
    var stock = new Stock()
    {
        Name = div.Descendants("a").Where(a => a.GetAttributeValue("class", "").Equals("Link__StyledLink-sc-apj04t-0 foCaAq NameCell__StyledLink-sc-qgec4s-0 hZYbiE")).FirstOrDefault().InnerText,
        changeInPercent = div.Descendants("span").Where(a => a.GetAttributeValue("class", "").Equals("Development__StyledDevelopment-sc-hnn1ri-0 kJLDzW")).FirstOrDefault()?.InnerText
    };
    stocks.Add(stock);
}
foreach (var stock in stocks)
{
    Console.WriteLine(stock.Name + " ");
}
I got the Name correct, but I don't really know how to get the changeInPercent. I will paste the HTML below.
The top highlight shows where I got the name from, and the second one is the "span" I want. I want the -4.70.
I'm a little bit confused when it comes to getting the data with my code. I tried everything. My changeInPercent property is a string.
It has to be the code somehow...
There's probably an easier way to select a single attribute/node than the way you're doing it right now.
If you know the exact XPath expression to select the node you're looking for, then you can do the following:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var changeInPercent = htmlDocument.DocumentNode
    .SelectSingleNode("//foo/bar")
    .InnerText;
Getting the right XPath expression (the //foo/bar example above) is the tricky part, but it can be found quite easily using your browser's dev tools. You can navigate to the desired element and just copy its XPath expression. See here for a sample on how to copy the expression.
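As a rough illustration for the span in the question, the sketch below matches on part of the class name instead of the full string. It assumes the span's class contains the prefix Development__StyledDevelopment, as in the question's code, and that the remaining hashed parts of the class may differ; the class and method names are made up for the example.

```csharp
using HtmlAgilityPack;

public static class StockParsing
{
    // Returns the trimmed text of the first span whose class attribute
    // contains the given prefix, or null when no such span exists.
    public static string GetChangeInPercent(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches,
        // so the null-conditional operator avoids an exception.
        var node = doc.DocumentNode.SelectSingleNode(
            "//span[contains(@class, 'Development__StyledDevelopment')]");
        return node?.InnerText.Trim();
    }
}
```

A null result means the downloaded HTML contains no matching span, which is worth checking before assuming the selection code is at fault.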
I'm a beginner programmer working on a small web scraper in C#. The purpose is to take a hospital's public website, grab the data for each doctor (their department, phone and diploma info), and display it in a DataGridView. It's a public website, and as far as I can tell, the website's robots.txt allows this, so I left everything in the code as it is.
I am able to grab each piece of data (name, department, phone, diploma) separately, and can successfully display them in a text box.
// THIS WORKS:
string text = "";
foreach (var nodes in full)
{
    text += nodes.InnerText + "\r\n";
}
textBox1.Text = text;
However, when I try to pass the data on to the data grid view using a class, the foreach loop only goes through the first name and fills the data grid with that.
foreach (var nodes in full)
{
    var Doctor = new Doctor
    {
        Col1 = full[0].InnerText,
        Col2 = full[1].InnerText,
        Col3 = full[2].InnerText,
        Col4 = full[3].InnerText,
    };
    Doctors.Add(Doctor);
}
I spent a good few hours looking for solutions, but none of what I've found has worked, and I'm at the point where I can't decide if I messed up the foreach loop somehow, or if I'm not doing something according to HTML Agility Pack's rules. Iterating works for the text box, but not when filling the grid. Changing full[0] to nodes[0] or nodes.InnerText doesn't seem to solve it either.
link to public gist file (where you can see my whole code)
screenshot
Thank you for the help in advance!
The problem is how you're selecting the nodes from the page. full contains all individual names, departments etc. in a flat list, which means full[0] is the name of the first doctor while full[4] is the name of the next. Your foreach loop doesn't take that into account, as you (for every node) always access full[0] to full[3], so you only ever read the properties of the first doctor.
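To make the indexing concrete, here is a minimal sketch of walking such a flat list four items at a time. It assumes every doctor contributes exactly four values in a fixed order; DoctorRow and FlatListGrouping are made-up names for the example, and plain strings stand in for the InnerText of each node.

```csharp
using System.Collections.Generic;

public class DoctorRow
{
    public string Col1 { get; set; }
    public string Col2 { get; set; }
    public string Col3 { get; set; }
    public string Col4 { get; set; }
}

public static class FlatListGrouping
{
    // Groups a flat list of cell texts into rows of four consecutive values.
    // Trailing values that do not fill a complete row are ignored.
    public static List<DoctorRow> ToRows(IReadOnlyList<string> cells)
    {
        var rows = new List<DoctorRow>();
        for (int i = 0; i + 3 < cells.Count; i += 4)
        {
            rows.Add(new DoctorRow
            {
                Col1 = cells[i],
                Col2 = cells[i + 1],
                Col3 = cells[i + 2],
                Col4 = cells[i + 3],
            });
        }
        return rows;
    }
}
```

This works only as long as no doctor is missing a field; a single missing value shifts every following row.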
To make your code more readable I'd split it up a bit to first make a list of all the card-elements for each doctor and then select the individual parts within the loop:
var web = new HtmlWeb();
var doc = web.Load("https://klinikaikozpont.unideb.hu/doctor_finder");
const string doctorListItem = "div[contains(@class, 'doctor-list-item-model')]";
const string cardContent = "div[contains(@class, 'card-content')]";
var doctorCards = doc.DocumentNode.SelectNodes($"//{doctorListItem}/{cardContent}");
var doctors = new List<Doctor>();
foreach (var card in doctorCards)
{
    var name = card.SelectSingleNode("./h3")?.InnerText;
    const string departmentNode = "div[contains(@class, 'department-name')]";
    var department = card.SelectSingleNode($"./{departmentNode}/p")?.InnerText;
    // other properties...
    doctors.Add(new Doctor { NameAndTitle = name, Department = department });
}
// I took the liberty to make this class easier to understand
public class Doctor
{
    public string NameAndTitle { get; set; }
    public string Department { get; set; }
    // Add other properties
}
Check out the code in action.
I'm very new to testing and have no training in automated tests, so please bear with me if I say stupid things; I'll try the best I can.
Basically, I am trying to assert that a specific employee in the employee list has the status of 'leaver'.
This is what I have tried (and other variations with the different classes):
Assert.Equal("image-tile__badge background-color--status-leaver ng-star-inserted", Driver.FindElement(By.XPath("//*[contains(@class,'image-tile__content-header') and contains(text(),'End Date, Contract') and contains(@class, 'image-tile__badge')]")).GetAttribute("Class"));
Assert.Equal("image-tile__badge background-color--status-leaver ng-star-inserted", Driver.FindElement(By.XPath("//*[contains(@class,'image-tile__content-header') and contains(text(),'End Date, Contract')]")).FindElement(By.XPath("//*[contains(@class, 'image-tile__badge')]")).GetAttribute("Class"));
The last one finds the element when the status is 'new', but when I change the employee status to 'leaver', it still returns 'new', so it is possibly looking at another employee with a 'new' status.
Hopefully this is enough info, let me know if more is needed (this is my first ever post!)
HTML code in image below
[HTML code on Chrome]
[1]: https://i.stack.imgur.com/kUxkf.png
Summary: I'm trying to assert that the employee "End Date, Contract" has the status of leaver (i.e. the leaver class "image-tile__badge background-color--status-leaver ng-star-inserted").
Thanks everyone for their help!
One of my devs managed to take @noldor's example and modify it a bit, so here's what ended up working for me:
var newElmList1 = Driver.FindElements(By.CssSelector("div.background-color--status-leaver")).ToList();
List<string> newNames1 = new List<string>();
foreach (var newElm in newElmList1)
{
    var newName1 = newElm.FindElement(By.XPath(".."))
        .FindElement(By.CssSelector("div.image-tile__content-header")).Text;
    newNames1.Add(newName1);
}
if (!newNames1.Contains("End Date, Contract"))
{
    throw new Exception("Exception Error on leaver Person");
}
As per your screenshot, I feel it's better to try using XPath:
var elmList = Driver.FindElements(By.XPath("//div[contains(text(),'leaver')]")).ToList();
I hope it helps.
Thank You.
According to your screenshot, you can find all elements with the leaver-specific class like this:
var leaverElmList = Driver.FindElements(By.CssSelector("div.background-color--status-leaver")).ToList();
List<string> leaverNames = new List<string>();
foreach (var leaverElm in leaverElmList)
{
    // Step up to the badge's parent, then read the header text inside it
    var leaverName = leaverElm.FindElement(By.XPath(".."))
        .FindElement(By.CssSelector("div.image-tile__content-header"))
        .Text;
    leaverNames.Add(leaverName);
}
The "End Date, Contract" header is not inside the div that carries the leaver class. That div's direct parent is the image-tile div, so the code steps up to the parent and looks for the header from there.
I want to scrape a Wiki page. Specifically, this one.
My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).
For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page
SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).
My code so far is (copied and edited from various websites)...
WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!
HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;
List<string> anchorTags = new List<string>();
foreach (HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
    anchorTags.Add(att);
}
However, I am getting an unhandled ArgumentException: Illegal characters in path.
What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution, I'd be glad to comply.
What's wrong with the code? To be blunt, everything. :P
The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.
The contents of the page (the part we're interested in) looks something like this:
<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
<!-- ... -->
<b>SBS8987B</b>
(SLBP 192/194*)
<br>
<b>SBS8988Z</b>
(SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<br>
<b>SBS8989X</b>
(SLBP SP)
<br>
<!-- ... -->
</p>
Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";
    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();
    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);
    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
        + "/following-sibling::p[1]" // move to the first `p` element (where the actual content is in) after the header
        + "/b"; // select the `b` elements
    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments // from the list of registration numbers
        where b.InnerText == reg // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)
    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();
    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);
    return decoded;
}
I have a C# ASP.NET MVC project. I am doing a search and want to access the search results on the details page of one of the search results.
This is so that I can have < prev | next > links on the detail pages that will link to the next property in the last search results.
My approach so far is to put the search results object into a session variable, but I can't figure out the code to actually access it. When I do a watch on Session["SearchResults"] below, I can see the records in the Result View, which seems to hold an array.
I know someone is going to tell me I'm thinking about this all wrong, and I can't wait to be enlightened.
Someone suggested I should just store the last search results on the repository as a public property, would that be a better option? Or can someone recommend an altogether better way of doing what I need to do?
This is my controller
public ActionResult Search(int? page, Search search)
{
    search.regusApi = Convert.ToBoolean(ConfigurationManager.AppSettings["regusApiLiveInventory"]);
    Session["SearchResults"] = MeetingRoomRepositoryWG.Search(search).AsPagination(page ?? 1, 7);
    return View(new SearchResultsWG { SearchCriteria = search, Locations = MeetingRoomRepositoryWG.Search(search).AsPagination(page ?? 1, 7) });
}
public ActionResult NiceDetails(String suburb, String locationName, int id)
{
    // Here I want to access the session variable
    return View(MeetingRoomRepositoryWG.RoomDetails(id).First());
}
Here is the code from the repository:
public static List<Location> Search(Search search)
{
    String LongLatString = search.LongLat;
    LongLatString = LongLatString.Substring(1, LongLatString.Length - 2);
    var LonLatSplit = LongLatString.Split(',');
    var latitude = Convert.ToDecimal(LonLatSplit[0]);
    var longitude = Convert.ToDecimal(LonLatSplit[1]);
    using (var context = new MyContext())
    {
        var query = context.Locations.Include("Location_LonLats").ToList();
        // OrderBy does not sort in place; it returns a new sequence, so return that
        return query.OrderBy(x => (Convert.ToDecimal(x.Location_LonLats.Lat) - latitude) * (Convert.ToDecimal(x.Location_LonLats.Lat) - latitude)
            + (Convert.ToDecimal(x.Location_LonLats.Lon) - longitude) * (Convert.ToDecimal(x.Location_LonLats.Lon) - longitude)).ToList();
    }
}
I'm not sure how large the data set you search against is, but it's better not to store search results at all. Doing so scales very poorly and easily becomes a resource hog. Why store, say, 500 pages of search results if the user only ends up looking at 1 or 3?
How long are you going to hold these potentially large result sets in session storage? For how many users?
Just do a page-based search, essentially redoing the search for each "next" click the client makes. A good index, or something like Lucene.net, can help if your searches are too slow.
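A minimal sketch of what redoing the search per page could look like, using LINQ's Skip and Take. The Paging class and GetPage method are made-up names for the example, and it assumes the results arrive in a stable order so that page N always means the same items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Paging
{
    // Returns the items on a 1-based page of an already ordered sequence.
    // A page past the end yields an empty list.
    public static List<T> GetPage<T>(IEnumerable<T> orderedResults, int page, int pageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        return orderedResults
            .Skip((page - 1) * pageSize) // skip all earlier pages
            .Take(pageSize)              // keep only the requested page
            .ToList();
    }
}
```

For prev/next links on a details page, the same idea works with a page size of 1: pass the search criteria and the item's position in the query string and fetch only that item. With Entity Framework, applying Skip and Take to the IQueryable before calling ToList lets the database do the paging instead of loading every row first.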