I am trying to get specific information from a website. Right Now I have this html string as you can see my code, the html source code of the website is placed in "responseText". I know I can do this with If's statement but it would be really tedious. I'm a newbie so I have no idea what I'm doing with this. I'm sure there must be another easier way to retrieve information from a website... This is c# for windows store so I can't use webclient. This codes get the string but isn't there is a way I can remove the html code and only leave the variables or something? I just want to do this for a webpage and I know the variables I want because I looked at the html code of the webpage. Isn't it a way to request a list of variables with its information from the website? I'm just kind of lost here. So basically I just want to get specific information from a website in c#, I'm making an app for windows store.
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];
// prepare the web page we will be asking for
HttpClient searchClient;
searchClient = new HttpClient();
searchClient.MaxResponseContentBufferSize = 256000;
HttpResponseMessage response = await searchClient.GetAsync(url);
response.EnsureSuccessStatusCode();
responseText = await response.Content.ReadAsStringAsync();
This codes get the string but isn't there is a way I can remove the html code and only leave the variables or something?
What "variables"? You get the HTML - that's the response from the web server. If you want to strip that HTML, that's up to you. You might want to use HTML Tidy to make it more pleasant to work with, but the business of extracting relevant information from HTML is up to you. HTML isn't designed to be machine-readable as a raw information source - it's meant to be mark-up to present to humans.
You should investigate whether the information is available in a more machine-friendly source, with no presentation information etc. For example, there may be some way of getting the data as JSON or XML.
Related
Is there a way to get the fully rendered html of a web page using WebClient instead of the page source? I'm trying to scrape some data from the page's html. My current code is like this:
WebClient client = new WebClient();
var result = client.DownloadString("https://somepageoutthere.com/");
//using CsQuery
CQ dom = result;
var someElementHtml = dom["body > main];
WebClient will only return the URL you requested. It will not run any javacript on the page (which runs on the client) so if javascript is changing the page DOM in any way, you will not get that through webclient.
You are better off using some other tools. Look for those that will render the HTML and javascript in the page.
I don't know what you mean by "fully rendered", but if you mean "with all data loaded by ajax calls", the answer is: no, you can't.
The data which is not present in the initial html page is loaded through javascript in the browser, and WebClient has no idea what javascript is, and cannot interpret it, only browsers do.
To get this kind of data, you need to identify these calls (if you don't know the url of the data webservice, you can use tools like Fiddler), simulate/replay them from your application, and then, if successful, get response data, and extract data from it (will be easy if data comes as json, and more tricky if it comes as html)
better use http://html-agility-pack.net
it has all the functionality to scrap web data and having good help on the site
I'm trying to do some web scraping from a simple form in C#.
My issue is trying to figure out the action to post to and how to work out the post params.
The form I am trying to submit has:
<form method="post" action="./"
As the page sits at www.foobar.com I am creating a WebRequest object in my C# code and posting to this address.
The other issue with this is that I am not sure of the post values as the inputs only have ids not names:
<input name="ctl00$MainContent$txtSearchName" type="text" maxlength="8" id="MainContent_txtSearchName" class="input-large input-upper">
So I read this: c# - programmatically form fill and submit login, amongst others and my code looks like this:
var httpRequest = WebRequest.Create("https://www.foobar.com/");
var values = "SearchName=Foo&SearchLastName=Bar";
byte[] send = Encoding.Default.GetBytes(values);
httpRequest.Method = "POST";
httpRequest.ContentType = "application/x-www-form-urlencoded";
httpRequest.ContentLength = send.Length;
Stream sout = httpRequest.GetRequestStream();
sout.Write(send, 0, send.Length);
sout.Flush();
sout.Close();
WebResponse res = httpRequest.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string returnvalue = sr.ReadToEnd();
File.WriteAllText(#"C:\src\test.html", returnvalue);
However, the resulting html page that is created does not show the search results, it shows the initial search form.
I am assuming the post is failing. My questions are around post I am making.
Does action="./" mean it posts back to the same page?
Do I need to submit all the form values (or can I get away with only submitting one or two)?
Is there any way to infer what the correct post parameter names are from the form?
Or am I missing something completely about web scraping and submitting forms in server side code?
What I would suggest is not doing all of this work manually, but letting your computer take a bit of the workload. You can use a tool such as Fiddler and the Fiddler Request To Code Plugin in order to programmatically generate the C# code for duplicating the web request. You can then modify it to take whatever dynamic input you may need.
If this isn't the route you'd like to take, you should make sure that you are requesting this data with the correct cookies (if applicable) and that you are supplying ALL POST data, no matter how menial it may seem.
My problem is that I can't get div InnerText from table. I have successfully extraced different kind of data, but i don't know how to read div from table.
In following picture I've highlighted div, and I need to get InnerText from it, in this case - number 3.
Click here for first picture
I'm trying to accomplish this using following path:
"//div[#class='kal']//table//tr[2]/td[1]/div[#class='cipars']"
But I'm getting following Error:
Click here for Error message picture
Assuming that rest of the code is written correctly, could anyone point me in the right direction ? I have been trying to figure this one out, but i can't get any results.
So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to have a div with the class.
Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.
Your original XPath query does not return an element, this is why you are getting the error. In the event the XPath query you give HtmlAgilityPack does not result in a DOM element, it will return null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[#class='kal']//table//descendant::div[#class='cipars']
This will return all of the calendar items (ie 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[#class='kal']//table//tr[3]/descendant::div[#class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, well, you'll have to make an assumption on the source code of the website. It looks like that every "cipars" div has an ancestor of a td with a class datums....so to get the "3" value from your question:
//div[#class='kal']//table//tr[3]//td[#class='datums'][2]/div[#class='cipars']
Hopefully this is enough to show the issue at least.
Edit
Although you do have an XPath problem, you also have another issue.
The site is created very strangely. The calendar is loaded in a strange way. When I hit that URL, the calendar is created by some Javascript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.
Due to the fact this is Javascript (client side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack doesn't even "see" the table. Hence the queries against it come back as "not found" (null).
Ways around this: 1) Use a tool that will call the scripts. By this, I mean load up a Browser. A great tool to use for this is called Selenium. This will probably be the better overall solution because it means all the scripting used by the site will actually be called. You can still use XPath with it, so your queries will not change.
The second way is to send a request off to the same web service that the page does. This is to basically get back the same HTML that the page is getting, and using that with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. Just for ease of use I've stolen the code from this SO question. With this, we can send the same request the page is, and get the same HTML back.
So to send some POST data, we generate a method like so.....
public static string SendPost(string url, string postData)
{
string webpageContent = string.Empty;
byte[] byteArray = Encoding.UTF8.GetBytes(postData);
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "POST";
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.ContentLength = byteArray.Length;
using (Stream webpageStream = webRequest.GetRequestStream())
{
webpageStream.Write(byteArray, 0, byteArray.Length);
}
using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
{
using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
{
webpageContent = reader.ReadToEnd();
}
}
return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? Well the php file we are calling is the web service the page is, and the POST data is too. The way I found out what data it sends to the service is by debugging the Javascript (using Chrome's Developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the physical HTML of just the table for the calendar.
What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.
var document = new HtmlDocument();
document.LoadHtml(webpageContent);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[#class='kal']//table//tr[3]//td[#class='datums'][2]/div[#class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this would get you over the problem you are having, I am assuming the rest of the site is like this. If this is the case, you may still be able to work around it using technique above, but tools like Selenium were created for this very reason.
I would like to make a console application that can query a website for data, process it, and then display it. ie- Access http://data.mtgox.com/ and then display the rate for currency X to currency Y.
I am able to get a large string of text via WebClient and StreamReader (though I don't really understand them), and I imagine that I could trim down the string to what I want and then loop the query with a delay to enable updating of the data without running the program again. However I'm only speculating and it seems like there would be a more efficient way of accessing data than this. Am I missing something?
EDIT - The general consensus seems to be to use JSON to do this; which is exactly was I was looking for! Thanks guys!
It looks like that site provides an API serving up JSON data. If you don't know what JSON is you need to look into that, but basically you could create object models representing this JSON. If you have the latest version of VS2012 you can copy the JSON and right click, hit paste special, then paste as class. This will automatically generate models for you. You then contact the API, retrieve the JSON, deserialize it into your models, and do whatever you want from there.
A better way:
string url = "http://data.mtgox.com/";
HttpWebRequest myWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
myWebRequest.Method = "GET"; // This can also be "POST"
HttpWebResponse myWebResponse = (HttpWebResponse)myWebRequest.GetResponse();
StreamReader myWebSource = new StreamReader(myWebResponse.GetResponseStream());
string myPageSource = string.Empty;
myPageSource= myWebSource.ReadToEnd();
myWebResponse.Close();
// Do something with the data in myPageSource
Once you have the data, look at JSON.NET to parse it
I need to call .php page from my aspx.cs page.but I don't want to load the page.I just want to call the page and the page will give me XLM response that I need to store in DB.I am trying this with the Ajax,but according to this link.We are not be able to call cross domain page from ajax.
In short I want to read the data from php page using asp.net code.
can anybody please help me out.
Update :: Is the P3P policy will usefull in case of cross domain page calling.
I got the solutions,thanks for your help.
create a new WebClient object
WebClient client = new WebClient();
string url = "http://testurl.com/test.php";
create a byte array for holding the returned data
byte[] html = client.DownloadData(url);
use the UTF8Encoding object to convert the byte
array into a string
UTF8Encoding utf = new UTF8Encoding();
get the converted string
string mystring = utf.GetString(html);
If I understand you correctly, your problem is that you want to do a cross-domain ajax call - which is not possible. The way to go around this is to make a call to your own backend which then fetches the data from the other site and sensd it back to the browser. Remember to do any needed safety check in the back end - depending on how much you trust the other domain of cource... (but even if you trust it 100%, it may be hacked or have some other problems that makes it return something else than what you tink it returns)