In summary, what I'm trying to do is "open" a page with driver.Navigate().GoToUrl("http://somepage.com") and then immediately block the response from "http://somepage.com/something.asmx/GetStuff", so that I can verify that some element has a given class before the response is loaded, e.g. driver.FindElement(By.CssSelector("button.some-button")).GetAttribute("class").Contains("disabled"), and then verify that the "disabled" class is removed once the response has loaded.
Is something like this possible, and if so, how do I go about it?
My question is similar to Selenium Webdriver c# without waiting for page to load in what it's trying to achieve.
Cast your instance of IWebDriver (FirefoxDriver, ChromeDriver, etc.) to IJavaScriptExecutor and replace the jQuery $.ajax() method with a stub, such as:
var driver = Driver as IJavaScriptExecutor;
driver.ExecuteScript("window.originalAjax = $.ajax;"); // keep a reference to the real $.ajax
driver.ExecuteScript("$.ajax = function() {};");       // replace it with an empty stub
// navigate to page, check class
driver.ExecuteScript("$.ajax = window.originalAjax;"); // restore the real $.ajax
So when the page calls into $.ajax, it will hit a blank method.
This has the downside that you cannot easily get the request to 'continue' after blocking it, as no request was ever created. You would have to refresh the page without doing the above steps which could give some sort of false positive.
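Put together, a minimal sketch might look like this (the selector comes from the question; note that ExecuteScript runs against the currently loaded document, so jQuery must already be available when the stub is installed):

var js = (IJavaScriptExecutor)Driver;

// Keep a reference to the real implementation, then stub it out.
js.ExecuteScript("window.originalAjax = $.ajax;");
js.ExecuteScript("$.ajax = function() {};");

// Whatever triggers the GetStuff call now hits the empty stub, so the
// button should still carry its initial "disabled" class.
bool isDisabled = Driver
    .FindElement(By.CssSelector("button.some-button"))
    .GetAttribute("class")
    .Contains("disabled");

// Restore the real $.ajax once the check is done.
js.ExecuteScript("$.ajax = window.originalAjax;");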
While using RedirectResult Redirect(string url) with "about:blank", it returns an error that the site cannot be reached, and it loops infinitely inside my controller method.
Can you please advise how I could handle the about:blank case?
A redirect is basically done by declaring two things on the headers: the URL and the status code.
So the very basics of a redirect are these two lines of code, which for "about:blank" are forbidden for security reasons and respond with ERR_UNSAFE_REDIRECT:
Response.StatusCode = 301;
Response.RedirectLocation = "about:blank";
So this cannot be done using a redirect and header declarations.
There is a way around it: use JavaScript and run this line:
<script>window.location.href = 'about:blank';</script>
Alternatively, consider creating an empty page and redirecting to that existing empty page, avoiding about:blank entirely.
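If it helps, the JavaScript approach can be served straight from a controller action (a minimal sketch; the action name is illustrative):

public ActionResult BlankRedirect()
{
    // Serve a tiny HTML payload whose script navigates to about:blank,
    // instead of issuing a 301 to about:blank, which the browser rejects.
    return Content(
        "<script>window.location.href = 'about:blank';</script>",
        "text/html");
}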
Is there a way to get the fully rendered html of a web page using WebClient instead of the page source? I'm trying to scrape some data from the page's html. My current code is like this:
WebClient client = new WebClient();
var result = client.DownloadString("https://somepageoutthere.com/");
//using CsQuery
CQ dom = result;
var someElementHtml = dom["body > main"];
WebClient will only return the raw source of the URL you requested. It will not run any JavaScript on the page (that runs on the client), so if JavaScript is changing the page DOM in any way, you will not get that through WebClient.
You are better off using other tools. Look for those that will render the HTML and JavaScript in the page.
I don't know what you mean by "fully rendered", but if you mean "with all data loaded by ajax calls", the answer is: no, you can't.
The data which is not present in the initial html page is loaded through javascript in the browser, and WebClient has no idea what javascript is, and cannot interpret it, only browsers do.
To get this kind of data, you need to identify these calls (if you don't know the URL of the data web service, you can use tools like Fiddler), simulate/replay them from your application, and then, if successful, get the response data and extract what you need from it (easy if the data comes as JSON, trickier if it comes as HTML).
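For instance, if Fiddler shows the page fetching JSON from some endpoint, you can replay that call directly (a hedged sketch; the endpoint URL is hypothetical and the JSON parsing assumes the Newtonsoft.Json package):

using (var client = new WebClient())
{
    // Call the data endpoint that the page's JavaScript would have called.
    string json = client.DownloadString("https://somepageoutthere.com/api/data"); // hypothetical endpoint

    // Pull the fields you need out of the JSON payload.
    var payload = Newtonsoft.Json.Linq.JObject.Parse(json);
    var value = payload["someField"]; // hypothetical field name
}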
Better to use the Html Agility Pack (http://html-agility-pack.net).
It has all the functionality you need to scrape web data, and there is good help available on its site.
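As a minimal sketch of what that looks like (assuming the HtmlAgilityPack NuGet package; the URL and element are from the question's example):

// HtmlWeb fetches the page and parses it in one step.
var web = new HtmlAgilityPack.HtmlWeb();
var dom = web.Load("https://somepageoutthere.com/");

// Html Agility Pack uses XPath rather than CSS selectors,
// so "body > main" becomes "//body/main".
var someElementHtml = dom.DocumentNode.SelectSingleNode("//body/main")?.InnerHtml;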
My problem is that I can't get a div's InnerText from a table. I have successfully extracted different kinds of data, but I don't know how to read the div from the table.
In the following picture I've highlighted the div, and I need to get the InnerText from it, in this case the number 3.
Click here for first picture
I'm trying to accomplish this using the following XPath:
"//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']"
But I'm getting the following error:
Click here for Error message picture
Assuming that the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this one out, but I can't get any results.
So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to contain a div with the 'cipars' class.
Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.
Your original XPath query does not return an element, this is why you are getting the error. In the event the XPath query you give HtmlAgilityPack does not result in a DOM element, it will return null.
Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:
//div[@class='kal']//table//descendant::div[@class='cipars']
This will return all of the calendar items (i.e. 1 through 30).
However, to get all the items in a particular row, you can just stick that tr into the query:
//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']
This would return 2 to 8 (the second row of calendar items).
To target a specific one, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with a class of "datums", so to get the "3" value from your question:
//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']
Hopefully this is enough to show the issue at least.
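Given a loaded HtmlDocument, iterating the results of the first query might look like this (a minimal sketch; as the edit below explains, the document must actually contain the table for this to return anything):

var nodes = document.DocumentNode.SelectNodes(
    "//div[@class='kal']//table//descendant::div[@class='cipars']");

// SelectNodes returns null when nothing matches, so guard before looping.
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerText); // 1 through 30
    }
}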
Edit
Although you do have an XPath problem, you also have another issue.
The site is built in an unusual way: when I hit that URL, the calendar is created by some JavaScript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.
Because this is JavaScript (client-side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack doesn't even "see" the table, which is why the queries against it come back as "not found" (null).
There are two ways around this:
1) Use a tool that will execute the scripts. By this, I mean load up a browser. A great tool to use for this is Selenium. This will probably be the better overall solution, because it means all the scripting used by the site will actually be run. You can still use XPath with it, so your queries will not change.
2) Send a request off to the same web service that the page does. This basically gets back the same HTML that the page is getting, so you can use it with HtmlAgilityPack. How do we do that?
Well, you can easily POST data to a web service using C#. Just for ease of use, I've stolen the code from this SO question. With this, we can send the same request the page does and get the same HTML back.
So to send some POST data, we generate a method like so:
// Requires: using System.IO; using System.Net; using System.Text;
public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    // Build a form-encoded POST request, matching what the page's script sends.
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // Write the POST body.
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // Read the response body back as a string.
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
    {
        webpageContent = reader.ReadToEnd();
    }

    return webpageContent;
}
We can call it like so:
string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");
How did I get this? Well, the PHP file we are calling is the same web service the page calls, and the POST data is the same too. The way I found out what data it sends to the service was by debugging the JavaScript (using Chrome's developer console), but you may notice it's pretty much the same thing that is in the URL. That seems to be intentional.
The responseBody that is returned is the physical HTML of just the table for the calendar.
What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.
var document = new HtmlDocument();
document.LoadHtml(responseBody);
Now, we stick that original XPath in:
var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");
Now, we print out what should hopefully be "3":
Console.WriteLine(node.InnerText);
My output, running it locally, is indeed: 3.
However, although this gets you past the problem you are having, I am assuming the rest of the site is like this. If that's the case, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.
I am developing a local server using self-hosted ServiceStack. I hardcoded a demo webpage and allow it to be accessed at localhost:8080/page:
public class PageService : IService<Page>
{
    public object Execute(Page request)
    {
        var html = System.IO.File.ReadAllText(@"demo_chat2.html");
        return html;
    }
}

// set route
public override void Configure(Container container)
{
    Routes
        .Add<Page>("/page")
        .Add<Hello>("/hello")
        .Add<Hello>("/hello/{Name}");
}
It works fine in Chrome/Firefox/Opera; however, IE treats the URL as a download request and prompts "Do you want to open or save page from localhost?"
What shall I do to make IE treat the URL as a web page? (I already added a doctype declaration to the demo page, but that does not stop IE from treating it as a download.)
Edit:
OK, I used Fiddler to check the response when accessing localhost. The responses that IE and Firefox get are exactly the same, and in the headers the content type is written as:
Content-Type: text/html,text/html
Firefox treats this content type as text/html; however, IE does not recognize this content type (it only recognizes a single text/html)!
So this leads me to believe that this is due to a bug in ServiceStack.
Solution
One solution is to explicitly set the content type:
return new HttpResult(
new MemoryStream(Encoding.UTF8.GetBytes(html)), "text/html");
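In context, the service method would then look something like this (a sketch combining the code above; requires using System.IO and System.Text):

public class PageService : IService<Page>
{
    public object Execute(Page request)
    {
        var html = System.IO.File.ReadAllText(@"demo_chat2.html");

        // Wrap the HTML in an HttpResult with a single explicit
        // "text/html" content type so IE renders it instead of
        // offering a download.
        return new HttpResult(
            new MemoryStream(Encoding.UTF8.GetBytes(html)), "text/html");
    }
}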
I don't know what your exact problem is, but if you want to serve an HTML page, there is a different way to do that.
ServiceStack supports the Razor engine as a plugin, which is useful if you'd like to serve HTML pages, and you can also bind data with them. This different approach is explained at razor.servicestack.net. This may be useful. Let me know if you need any additional details.
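For reference, enabling the Razor plugin is a one-liner in your AppHost's Configure method (a sketch assuming the ServiceStack.Razor package):

public override void Configure(Container container)
{
    // Register the Razor view engine so .cshtml pages can be
    // served and bound to response DTOs.
    Plugins.Add(new RazorFormat());
}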
I am using this code:
HttpWebResponse objHttpWebResponse = (HttpWebResponse)objHttpWebRequest.GetResponse();
return new StreamReader(objHttpWebResponse.GetResponseStream()).ReadToEnd();
I get the page content successfully, but my problem is that there is some dynamic content populated by JavaScript functions on the page, and it seems the content is fetched before those functions finish executing, so those parts of the page are returned without data. Is there any way to solve this, i.e. to wait until the page is completely loaded, including all content?
Edit:
Regarding "#ElDog" answer, i tried the following code but with no luck to:
WebBrowser objWebBrowser = new WebBrowser();
objWebBrowser.DocumentCompleted += objWebBrowser_DocumentCompleted;
objWebBrowser.Navigate(url);
and in the DocumentCompleted event I executed the following code:
string content = ((WebBrowser)(sender)).Document.Body.InnerHtml;
But still the JavaScript functions didn't execute.
HttpWebRequest is not going to execute JavaScript at all. It just gives you what a web browser receives in the initial response. To execute JavaScript you would need a web browser emulation in your code.
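One thing worth checking with the WebBrowser attempt above: the control needs an STA thread with a running message pump, or DocumentCompleted never fires. A minimal sketch of that (the URL is a placeholder; scripts that fetch data asynchronously may still need extra waiting):

using System;
using System.Windows.Forms;

class Program
{
    [STAThread]
    static void Main()
    {
        string html = null;
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // By this point the document has loaded and inline scripts
            // have had a chance to run.
            html = browser.Document.Body.InnerHtml;
            Application.ExitThread(); // stop the message loop below
        };

        browser.Navigate("https://somepageoutthere.com/");

        // Without a message pump, navigation never completes and
        // DocumentCompleted never fires.
        Application.Run();

        Console.WriteLine(html);
    }
}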