Parsing HTML in C# that is updating constantly

I have a webpage that displays some data using AJAX queries, and I need to parse some of that data in a C# program.
The problem is that the page's source code doesn't show the data, because it is generated by an AJAX script that modifies the DOM.
If I select everything on the webpage and do "Inspect Element" in Chrome, I get the full HTML code with the data I want to extract, which sits in various tables.
What I've tried is calling webBrowser1.Navigate("www.site.com") and then, in my webBrowser1_DocumentCompleted() event, doing this:
var name = webBrowser1.Document.GetElementById("table_1_r_7_c_2");
The problem is that webBrowser1 doesn't return the full HTML code, as some of it is generated by the AJAX queries.
Does anyone know how I could achieve this in C#?

The DocumentCompleted event is a bit misleading, because it can fire more than once per page, for example for frames and AJAX-driven requests. You can do something like the following to check that it's the actual page that has loaded, or some variant that looks for specific requests.
private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (e.Url.AbsolutePath == webBrowser1.Url.AbsolutePath)
    {
        // page loaded
    }
}
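Even with that check, content fetched by AJAX can arrive after DocumentCompleted fires. A minimal sketch of one workaround, polling the DOM with a WinForms Timer until the element exists (the element ID is the one from the question; the interval and timeout are arbitrary assumptions):
private void OnPageLoaded()
{
    var timer = new System.Windows.Forms.Timer { Interval = 500 };
    int attempts = 0;
    timer.Tick += (s, args) =>
    {
        // keep checking until the AJAX script has inserted the element
        var cell = webBrowser1.Document.GetElementById("table_1_r_7_c_2");
        if (cell != null)
        {
            timer.Stop();
            string name = cell.InnerText; // the AJAX-generated value
        }
        else if (++attempts > 20) // give up after ~10 seconds
        {
            timer.Stop();
        }
    };
    timer.Start();
}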


Adding a search parameter to the URL to enable direct search from the address bar

I have a few old sites to which I want to add routing parameters. They weren't built with MVC, so there is no Global.asax with the handy MVC route settings.
Currently I have a page at the URL abc.com/xyz that has a search function. I can input a query, which sends me to another page that has the same URL. I want to make it so that some variation of the URL abc.com/xyz?search='what_You_Query' gives me the searched page. Right now that URL sends me to the page where I input my query.
The website is coded in C# and HTML and saved in .aspx files. The webpages also make use of JavaScript.
I'd appreciate any help I can get.
Edit: It seems there was some confusion; there is a search box on the webpage that lets the user enter a query. What I want is to let users link directly to a searched page.
You'd need to capture that on page load: check the query string (https://msdn.microsoft.com/en-us/library/ms524784%28v=vs.90%29.aspx) and, if search is in it, redirect to the search page.
MODIFIED TO INCLUDE MORE DETAILS
I'm assuming you're working with Web Forms (Microsoft's alternative to MVC). You would need to add a server-side (http://www.seguetech.com/blog/2013/05/01/client-side-server-side-code-difference) Page_Load event (https://msdn.microsoft.com/en-us/library/6w2tb12s.aspx). The code would look something like this:
protected void Page_Load(object sender, EventArgs e)
{
    if (Request.QueryString["search"] != null)
        Response.Redirect("/search?search=" + Server.UrlEncode(Request.QueryString["search"]), true);
}
Please note I have not tested the code and am going by memory a bit here - but that should do the trick.
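For the receiving side, which the answer doesn't show, the search page could read the same query-string key when it loads and run the query directly, which makes a URL like abc.com/xyz?search=term linkable. A minimal sketch, where txtSearch and RunSearch are hypothetical stand-ins for the page's existing search box and search logic:
protected void Page_Load(object sender, EventArgs e)
{
    // a direct GET with ?search=... runs the query immediately
    if (!IsPostBack && Request.QueryString["search"] != null)
    {
        string term = Request.QueryString["search"];
        txtSearch.Text = term; // pre-fill the existing search box (hypothetical control)
        RunSearch(term);       // hypothetical method wrapping the page's existing search code
    }
}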

Scraping data dynamically generated by JavaScript in an HTML document using C#

How can I scrape data that is dynamically generated by JavaScript in an HTML document using C#?
Using WebRequest and HttpWebResponse from the .NET class library, I'm able to get the whole HTML source code as a string, but the difficulty is that the data I want isn't contained in the source code; it is generated dynamically by JavaScript.
On the other hand, if the data I want were already in the source code, I'd be able to get it easily using regular expressions.
I have downloaded HtmlAgilityPack, but I don't know whether it handles the case where items are generated dynamically by JavaScript...
Thank you very much!
When you make the WebRequest, you're asking the server to give you the page file; that file's content hasn't yet been parsed/executed by a web browser, so the JavaScript on it hasn't yet done anything.
You need to use a tool that executes the JavaScript on the page if you want to see what the page looks like after being parsed by a browser. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx
The WebBrowser control can navigate to and load the page, and you can then query its DOM, which will have been altered by the JavaScript on the page.
EDIT (example):
Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;

// optional but I use this because it stops javascript errors breaking your scraper
webBrowserControl.ScriptErrorsSuppressed = true;

// you want to start scraping after the document is finished loading so do it in the function you pass to this handler
webBrowserControl.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowserControl_DocumentCompleted);

webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");
    foreach (HtmlElement div in divs)
    {
        // do something with each div
    }
}
You could also take a look at a tool like Selenium for scraping pages that use JavaScript:
http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
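For reference, a minimal sketch of the Selenium route, assuming the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages are installed; the URL and element ID are placeholders, not values from the question:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Scraper
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run without a visible browser window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // PageSource reflects the DOM after the page's JavaScript has run
            string renderedHtml = driver.PageSource;

            // or query a specific element once the scripts have populated it
            IWebElement element = driver.FindElement(By.Id("someDynamicElement"));
            System.Console.WriteLine(element.Text);
        }
    }
}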

Setting Content in RadEditor

I have visited Telerik's website and viewed their demos, etc.,
but I am having problems trying to load HTML content into the RadEditor.
I have a Button_Click event where I get my HTML string and then set it on the RadEditor. The RadEditor is inside a RadWindow and only becomes visible when the button is clicked.
protected void btnSubmitHtml_Click(object sender, EventArgs e)
{
    RadEditor1.Content = "<p>hello there</p>";
    RadWindow1.Visible = true;
}
This doesn't show the HTML inside the RadEditor, for some odd reason. I suspect the page life cycle is involved in this problem.
Are there any suggestions to solve this?
I have encountered this problem multiple times and never found a "proper" resolution.
However, a great workaround is to simply set the content from the client side via an injected script. The end result is the same, and if you can tolerate the ten-millisecond delay, it is worth considering.
EDIT after a comment requested a reference
Basically, all you need is to get an instance of the editor using the ASP.NET AJAX $find function. It takes the HTML ID of the root of the rendered object and returns the client-side view model, if one exists.
The $(setEditorInitialContent) call at the end assumes that jQuery is present and delays execution of the function until page load.
<telerik:radeditor runat="server" ID="RadEditor1">
    <Content>
        Here is sample content!
    </Content>
</telerik:radeditor>

<script type="text/javascript">
    function setEditorInitialContent() {
        var editor = $find("<%=RadEditor1.ClientID%>"); // get a reference to the RadEditor client object
        editor.set_html("HEY THIS IS SOME CONTENT INTO YOUR EDITOR!!!!");
    }

    $(setEditorInitialContent);
</script>
Take a look here to see how to get a RadEditor to work in a RadWindow: http://www.telerik.com/help/aspnet-ajax/window-troubleshooting-radeditor-in-radwindow.html.
In short, here is what you need in the OnClientShow event of the RadWindow:
function OnClientShow()
{
    $find("<%=RadEditor1.ClientID %>").onParentNodeChanged();
}
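For completeness, a sketch of how that handler might be attached in markup; the IDs match the snippets above, but the exact RadWindow configuration is an assumption rather than something taken from the linked article:
<telerik:RadWindow runat="server" ID="RadWindow1" OnClientShow="OnClientShow">
    <ContentTemplate>
        <telerik:RadEditor runat="server" ID="RadEditor1"></telerik:RadEditor>
    </ContentTemplate>
</telerik:RadWindow>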
To edit HTML code only, you can add this property to the RadEditor:
EnableTextareaMode="true"
I suspect that the way the control tries to interpret the HTML might be one of the problems. The other thing that may be causing this is the page life cycle.

Is there a straightforward way to retrieve text that is rendered by the browser but is not hard-coded in the actual HTML file?

I'm trying to retrieve data from a webpage, but I cannot do it by making a web request and parsing the resulting HTML file, because the actual text I'm trying to retrieve is not in the HTML file! I imagine that this text is pulled in by some script, and for that reason it isn't in the HTML file. For all I know I'm looking at the wrong data, but assuming my theory is correct: is there a straightforward way to retrieve whatever text is displayed by the browser (Firefox or IE), rather than attempting to fetch the text from the HTML file?
Assuming you are referring to text that has been generated using JavaScript in the browser:
you can use PhantomJS to achieve this: http://phantomjs.org/
It is essentially a headless browser that will process JavaScript.
You may need to run it as an external program, but I'm sure you can do that from C#.
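A minimal sketch of launching PhantomJS as an external process from C#, assuming phantomjs is on the PATH and that a save_page.js script exists which prints the rendered HTML to standard output (both names are placeholders):
using System.Diagnostics;

private string GetRenderedHtml(string url)
{
    var psi = new ProcessStartInfo
    {
        FileName = "phantomjs",            // assumes phantomjs is on the PATH
        Arguments = "save_page.js " + url, // hypothetical script that prints the rendered HTML
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var process = Process.Start(psi))
    {
        string renderedHtml = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return renderedHtml;
    }
}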
Your other option would be to open the web page in a WebBrowser object, which should execute the scripts; then you can get the HtmlDocument object and go from there.
Take a look at this example...
private void test()
{
    WebBrowser wBrowser1 = new WebBrowser();
    wBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wBrowser1_DocumentCompleted);
    wBrowser1.Url = new Uri("Web Page URL");
}

void wBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlDocument document = (sender as WebBrowser).Document;
    // get elements and values accordingly.
}
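For instance, once the document is available, the rendered text (including script-generated content) can be read from the body; a one-line sketch:
string renderedText = document.Body.InnerText; // text as rendered, after scripts have run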
