Html Agility Pack - reading div InnerText in table - c#

My problem is that I can't get a div's InnerText from a table. I have successfully extracted other kinds of data, but I don't know how to read a div inside a table.
In the following picture I've highlighted the div whose InnerText I need; in this case it is the number 3.
Click here for first picture
I'm trying to accomplish this using the following path:
"//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']"
But I'm getting the following error:
Click here for Error message picture
Assuming that the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this one out, but I can't get any results.

So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to have a div with the class.
Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.
Your original XPath query does not match any element, which is why you are getting the error. When the XPath query you give HtmlAgilityPack matches nothing in the DOM, it returns null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[@class='kal']//table//descendant::div[@class='cipars']
This will return all of the calendar items (i.e. 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with the class datums, so to get the "3" value from your question:
//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']
Hopefully this is enough to show the issue at least.
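Since a query that matches nothing comes back as null, it is worth checking the result before touching InnerText. Here is a minimal sketch of that check, assuming the HtmlAgilityPack NuGet package is referenced. The inline HTML is a made-up, heavily simplified stand-in for the real calendar, and ReadCell is a hypothetical helper name, not something from the question:

```csharp
using System;
using HtmlAgilityPack;

static class CalendarXPathSketch
{
    // Hypothetical helper: returns the trimmed text of the first match,
    // or null when the XPath matches nothing.
    public static string ReadCell(HtmlDocument document, string xpath)
    {
        HtmlNode node = document.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    static void Main()
    {
        // Made-up markup for illustration only; the real page is more complex.
        const string html =
            "<div class='kal'><table><tr>" +
            "<td class='datums'><div class='cipars'>2</div></td>" +
            "<td class='datums'><div class='cipars'>3</div></td>" +
            "</tr></table></div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Targets the second 'datums' cell of the first row.
        string hit = ReadCell(document,
            "//div[@class='kal']//table//tr[1]/td[@class='datums'][2]/div[@class='cipars']");

        // There is no fifth row, so this yields null instead of throwing.
        string miss = ReadCell(document,
            "//div[@class='kal']//table//tr[5]/td[1]/div[@class='cipars']");

        Console.WriteLine(hit ?? "(no match)");
        Console.WriteLine(miss ?? "(no match)");
    }
}
```

The same null check applies to SelectNodes, which also returns null rather than an empty collection when nothing matches.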
Edit
Although you do have an XPath problem, you also have another issue.
The site is built in an unusual way. When I hit that URL, the calendar is created by some JavaScript calling an XML web service (written in PHP) that then generates the full table used for the calendar.
Because this is JavaScript (client-side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack never "sees" the table, and the queries against it come back as "not found" (null).
There are two ways around this. The first is to use a tool that runs the scripts, that is, to drive a real browser. A great tool for this is Selenium. This will probably be the better overall solution, because all the scripting used by the site will actually run. You can still use XPath with it, so your queries will not change.
The second way is to send a request to the same web service that the page calls. This gets back the same HTML that the page receives, which you can then use with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. For ease of use I've borrowed the code from this SO question. With this, we can send the same request the page does, and get the same HTML back.
So to send some POST data, we write a method like so:
// Requires: using System.IO; using System.Net; using System.Text;
public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // Write the form data into the request body
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // Read the whole response body as a string
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
        {
            webpageContent = reader.ReadToEnd();
        }
    }

    return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? The PHP file we are calling is the same web service the page calls, and the POST data is the same data the page sends. I found out what data it sends to the service by debugging the JavaScript (using Chrome's Developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the raw HTML of just the table for the calendar.
What do we do with it now? We load it into HtmlAgilityPack, because it is able to accept plain HTML.
var document = new HtmlDocument();
document.LoadHtml(responseBody);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this would get you over the problem you are having, I am assuming the rest of the site works the same way. If so, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.
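For the Selenium route, a rough sketch could look like the following. It assumes the Selenium.WebDriver package (plus Selenium.Support on older versions, for WebDriverWait) and a local Chrome with a matching driver. The address of the calendar page is not given in the question, so it is passed in as a command-line argument here, and the 10-second timeout is an arbitrary choice:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class SeleniumCalendarSketch
{
    static void Main(string[] args)
    {
        // Supplied by the caller; the real page address is not part of the question.
        string pageUrl = args[0];
        const string xpath =
            "//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']";

        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(pageUrl);

            // Wait for the script-generated table to show up instead of querying immediately.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            IWebElement cell = wait.Until(d => d.FindElement(By.XPath(xpath)));

            Console.WriteLine(cell.Text);
        }
    }
}
```

The XPath is unchanged from the HtmlAgilityPack version; only the thing evaluating it differs, since the browser has already run the page's scripts by the time the query executes.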

Related

HTTPWebRequest acts differently from same request sent through browser

I am working with the Shift4Shop web API. They used to be known as 3dcart, if that helps anyone. It's an eCommerce platform.
We are trying to apply a promotion code to an open cart.
Support has verified there is no way to do that through the API.
There is a URL that will apply the promotion. This is often emailed to customers so they can apply the promo if they choose to.
We can paste the correct URL into Chrome, Brave, Edge, or Firefox and it correctly applies the promotion.
We used private tabs for the different browser tests and the browsers were 'cold': we launched the browser and immediately entered the URL.
We think this eliminates the possibility that pre-existing cookies are necessary.
https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro
However, when I try to do this in C#, I get a response that has been redirected to a page that says 'The cart is empty'.
The promotion is not applied.
I am stumped as to how the website would respond differently to the same URL when it comes from a browser as opposed to the C# System.Net library.
Here is the C# code I am using:
using System.IO;
using System.Net;
// I really create this using my data, but this is the resulting URL
string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
string result = "";
using (StreamReader rdr = new StreamReader(response.GetResponseStream()))
{
    result = rdr.ReadToEnd();
}
You can also call ".view_cart.asp" with the same parameters and the browsers will cause the promo to be applied.
I have tried setting the method to [ , GET, get ].
There has to be something about the request settings that is preventing this from working.
I do not know what else to try.
Any thoughts are appreciated.
As per Shift4Shop support, continue_order.asp responds with a 302.
The browsers land on continue_order.asp and process that page.
They then continue on to view_order.asp.
The two pages together perform functionality that you cannot get by just calling continue_order.asp.
Thanks to savoy at Shift4Shop for helping on that.
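One possible reason the two-page sequence behaves differently from code, and this is my assumption rather than something support confirmed: HttpWebRequest follows the 302 on its own by default, but it only carries cookies from the first page over to the redirected request when a CookieContainer has been assigned. A sketch of the request with that set, reusing the URL from the question:

```csharp
using System;
using System.IO;
using System.Net;

static class PromoRequestSketch
{
    static void Main()
    {
        string url = "https://www.mywebsite.com/continue_order.asp?orderkey=CDC886A7O4Srgyn278668&ApplyPromo=40pro";

        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        req.AllowAutoRedirect = true;                // already the default, shown for clarity
        req.CookieContainer = new CookieContainer(); // keeps cookies set by the first page for the redirected request

        using (var response = (HttpWebResponse)req.GetResponse())
        using (var rdr = new StreamReader(response.GetResponseStream()))
        {
            // ResponseUri shows where the request ended up after any redirects.
            Console.WriteLine(response.ResponseUri);
            string result = rdr.ReadToEnd();
            Console.WriteLine(result.Length);
        }
    }
}
```

If the cart still comes back empty, comparing the request headers against what the browser sends (in its developer tools) would be the next thing to check.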

HTTP Response Codes C#

Here is the code down below
List<int> j = new List<int>();
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(result.SiteURL);
webRequest.AllowAutoRedirect = false;
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
j.Add((int)response.StatusCode);
What I want to do is get all the response codes, separate them (2xx, 3xx, 4xx-5xx) and put them in different lists, because I need counts such as how many 4xx responses or how many 200 responses there are. Or is there another way to do it?
result.SiteURL is the URL being requested. The problem is that the last line of the code doesn't return or get anything. What am I missing here?
Edit: The main problem is that whatever I try, I only get one response code, and that is mostly 200 OK. But for youtube.com (etc.) there should be 74 OK (200) responses, 1 No Content (204) response and 2 Moved Permanently (301) responses according to https://tools.pingdom.com/#!/fMjhr/youtube.com. How am I going to get them?
You misunderstand the result shown by Pingdom.
Pingdom requests a web page just like a browser would: it loads the page itself, as well as all resources referenced by the page: style sheets, scripts, images, etc.
Your code only loads the main HTML page, which has great availability and always returns 200 OK.
If you want to reproduce Pingdom's results, you'll need to parse the HTML page and load the page's resources as well. Keep in mind that parsing HTML is a non-trivial task (browser vendors put a lot of effort into it), so you might want to reconsider whether this is worth your time.
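Once you do have a list of codes from several requests, splitting them into 2xx/3xx/4xx/5xx buckets is the easy part. A small sketch, with sample codes typed in by hand and a helper name (GroupByClass) that is mine, not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class StatusCodeBuckets
{
    // Groups status codes by class: 200 and 204 go to "2xx", 301 to "3xx", and so on.
    public static Dictionary<string, List<int>> GroupByClass(IEnumerable<int> statusCodes)
    {
        return statusCodes
            .GroupBy(code => (code / 100) + "xx")
            .ToDictionary(group => group.Key, group => group.ToList());
    }

    static void Main()
    {
        // Hand-written sample; in practice each value would come from (int)response.StatusCode.
        var codes = new List<int> { 200, 200, 204, 301, 301, 404, 500 };

        foreach (var bucket in GroupByClass(codes).OrderBy(b => b.Key))
        {
            Console.WriteLine(bucket.Key + ": " + bucket.Value.Count);
        }
    }
}
```

Note that HttpWebRequest.GetResponse throws a WebException for 4xx and 5xx responses, so those codes have to be read from the exception's Response property rather than from the normal return value.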

Getting specific information from a website C# windows store app

I am trying to get specific information from a website. Right now, as you can see in my code, the HTML source of the website ends up in "responseText". I know I could pick it apart with if statements, but that would be really tedious. I'm a newbie, so I have no idea what I'm doing here, and I'm sure there must be an easier way to retrieve information from a website. This is C# for Windows Store, so I can't use WebClient. This code gets the string, but isn't there a way I can remove the HTML and leave only the variables or something? I just want to do this for one webpage, and I know the variables I want because I looked at the HTML of the page. Isn't there a way to request a list of variables with their values from the website? I'm kind of lost here. So basically I just want to get specific information from a website in C#, for a Windows Store app.
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];
// prepare the web page we will be asking for
HttpClient searchClient;
searchClient = new HttpClient();
searchClient.MaxResponseContentBufferSize = 256000;
HttpResponseMessage response = await searchClient.GetAsync(url);
response.EnsureSuccessStatusCode();
responseText = await response.Content.ReadAsStringAsync();
This codes get the string but isn't there is a way I can remove the html code and only leave the variables or something?
What "variables"? You get the HTML - that's the response from the web server. If you want to strip that HTML, that's up to you. You might want to use HTML Tidy to make it more pleasant to work with, but the business of extracting relevant information from HTML is up to you. HTML isn't designed to be machine-readable as a raw information source - it's meant to be mark-up to present to humans.
You should investigate whether the information is available in a more machine-friendly source, with no presentation information etc. For example, there may be some way of getting the data as JSON or XML.

How to obtain data from a website and process it in C#

I would like to make a console application that can query a website for data, process it, and then display it, e.g. access http://data.mtgox.com/ and then display the rate for currency X to currency Y.
I am able to get a large string of text via WebClient and StreamReader (though I don't really understand them), and I imagine that I could trim down the string to what I want and then loop the query with a delay to update the data without running the program again. However, I'm only speculating, and it seems like there would be a more efficient way of accessing data than this. Am I missing something?
EDIT - The general consensus seems to be to use JSON to do this, which is exactly what I was looking for! Thanks guys!
It looks like that site provides an API serving up JSON data. If you don't know what JSON is you need to look into that, but basically you could create object models representing this JSON. If you have the latest version of VS2012 you can copy the JSON and right click, hit paste special, then paste as class. This will automatically generate models for you. You then contact the API, retrieve the JSON, deserialize it into your models, and do whatever you want from there.
A better way:
string url = "http://data.mtgox.com/";
HttpWebRequest myWebRequest = (HttpWebRequest)WebRequest.Create(url);
myWebRequest.Method = "GET"; // This can also be "POST"
string myPageSource = string.Empty;
// The using blocks close the response and the reader even if reading throws
using (HttpWebResponse myWebResponse = (HttpWebResponse)myWebRequest.GetResponse())
using (StreamReader myWebSource = new StreamReader(myWebResponse.GetResponseStream()))
{
    myPageSource = myWebSource.ReadToEnd();
}
// Do something with the data in myPageSource
Once you have the data, look at JSON.NET to parse it.
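To give an idea of the deserialization step, here is a minimal sketch using JSON.NET (the Newtonsoft.Json package). The RateQuote model and the sample JSON are invented for illustration; the real API's field names and structure are not shown in the question, so the model would have to be adapted to whatever the service actually returns:

```csharp
using System;
using Newtonsoft.Json;

// Hypothetical model; the real JSON shape is an assumption.
class RateQuote
{
    [JsonProperty("currency")]
    public string Currency { get; set; }

    [JsonProperty("last")]
    public decimal Last { get; set; }
}

static class JsonParseSketch
{
    // Turns a JSON string into a typed object.
    public static RateQuote Parse(string json)
    {
        return JsonConvert.DeserializeObject<RateQuote>(json);
    }

    static void Main()
    {
        // Stand-in for myPageSource; invented sample data.
        string myPageSource = "{\"currency\":\"USD\",\"last\":123.45}";

        RateQuote quote = Parse(myPageSource);
        Console.WriteLine(quote.Currency + " " + quote.Last);
    }
}
```

Working with a typed model like this is usually less fragile than trimming the raw string by hand.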

crawling / scraping a search form based webpages

I want to crawl/scrape a webpage which has a form.
To be precise, the URL is:
http://lafayetteassessor.com/propertysearch.cfm
The problem is that I want to run a search and save the result page.
My search string will always give a unique page, so result count won't be a problem.
The search there doesn't put its parameters in the URL (unlike, e.g., Google, where the search URL contains the parameters). How can I search from the starting page (as above) and get the result page?
Please give me some ideas.
I am using C#/.NET.
If you look at the forms on that page, you will notice that they use the POST method, rather than the GET method. As I'm sure you know, GET forms pass their parameters as part of the URL, eg mypage?arg1=value&arg2=value
However, for POST requests, you need to pass the parameters as the request body. It takes the same format, it's just passed in differently. To do this, use code similar to this:
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(theURL);
myRequest.Method = "POST";
myRequest.ContentType = "application/x-www-form-urlencoded";
using (TextWriter body = new StreamWriter(myRequest.GetRequestStream()))
{
    body.Write("arg1=value1&arg2=value2");
}
WebResponse theResponse = myRequest.GetResponse();
// do stuff with the response
Don't forget that you still need to escape the arguments, etc.
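As a sketch of the escaping step, Uri.EscapeDataString can be applied to each name and value before joining them. The field names below are hypothetical (the real ones come from the form's input elements), and BuildFormBody is just an illustrative helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FormEncodingSketch
{
    // Builds "key1=value1&key2=value2" with every key and value percent-encoded.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(field =>
            Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value)));
    }

    static void Main()
    {
        // Hypothetical field names and values, chosen to include characters that need escaping.
        var fields = new[]
        {
            new KeyValuePair<string, string>("owner", "Smith & Sons"),
            new KeyValuePair<string, string>("parcel", "12/34")
        };

        // Prints: owner=Smith%20%26%20Sons&parcel=12%2F34
        Console.WriteLine(BuildFormBody(fields));
    }
}
```

The resulting string is what would be written to the request stream in place of the hard-coded "arg1=value1&arg2=value2".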
