I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows:
Get the URL from the user.
Load the web page in the embedded IE browser control in WinForms.
Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page.
When the user wishes to persist the location (the HTML DOM location), it has to be saved to the DB, so that the data at that location can be fetched on subsequent visits.
Assume the loaded website is a price-listing site where the quoted rate keeps changing; the idea is to persist the DOM hierarchy so that I can traverse it next time.
I would be able to do this if all the HTML elements had id attributes, but where the id is null I am not able to accomplish it.
Could someone suggest a workable approach (a bare-minimum code snippet if possible)?
Even some online resources would be helpful.
thanks,
vijay
One approach is to build a stack of tags/styles/ids down to the element you want to select.
From the element you want, traverse up to the nearest element with an id. That way you get rid of most of the top header and similar noise. Then build a sequence to look for (see the sketch below).
Example:
<html>
<body>
<!-- lots of html -->
<div id="main">
<div>
<span>
<div class="pricearea">
<table> <!-- with price data -->
For the example you would store in your DB a sequence of: [id=main],div,span,div,table or perhaps div[class=pricearea],table.
Styles/classes can also be used to build your path. You can match on a tag, an attribute of a tag, or a combination; you want it as accurate as possible with as few elements as possible, to make it robust.
If the layout seldom changes, this would let you navigate to the same location each time.
I would also suggest using HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.
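As a concrete illustration, here is a minimal sketch of the path-building idea using HTML Agility Pack (the method name BuildPath is mine; adapt the matching rules to taste):
// Requires System.Collections.Generic and the HtmlAgilityPack package.
// Walks from the target node up to the nearest ancestor with an id,
// collecting tag names along the way.
static string BuildPath(HtmlAgilityPack.HtmlNode target)
{
    var steps = new List<string>();
    for (var node = target; node != null; node = node.ParentNode)
    {
        string id = node.GetAttributeValue("id", null);
        if (id != null)
        {
            // The nearest id is a stable anchor; stop climbing here.
            steps.Add("[id=" + id + "]");
            break;
        }
        steps.Add(node.Name);
    }
    steps.Reverse();
    return string.Join(",", steps); // e.g. "[id=main],div,span,div,table"
}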
Screen scraping is fun, but it's difficult to get it 100% for all pages. Good luck!
After a bit of googling, I came across a fairly simple solution. Below is a sample snippet.
if (webBrowser.Document != null)
{
    // Get the underlying HTML DOM of the loaded page.
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    // Fetch the user's current selection in the page.
    IHTMLSelectionObject selection = htmlDoc.selection;
    IHTMLTxtRange range = (IHTMLTxtRange)selection.createRange();
    // Identify the element containing the selection.
    IHTMLElement parentElement = range.parentElement();
    targetSourceIndex = parentElement.sourceIndex;
    //dataLocation = range.parentElement().id;
    MessageBox.Show(range.text);
}
I used an embedded WebBrowser control in a WinForms application, which loads the HTML DOM of the current web page.
The IHTMLElement instance exposes a property named 'sourceIndex', which assigns a unique index to each HTML element in source order.
One can store this sourceIndex in the DB and query for the content at that location using the following code.
if (webBrowser.Document != null)
{
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLElement targetElement = null;
    // node.InnerText holds the sourceIndex persisted earlier (here, read back from an XML file).
    int persistedIndex = int.Parse(node.InnerText);
    foreach (IHTMLElement domElement in htmlDoc.all)
    {
        if (domElement.sourceIndex == persistedIndex)
        {
            targetElement = domElement;
            break;
        }
    }
    MessageBox.Show(targetElement.innerText);
}
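For completeness, a rough sketch of the storing side; the table and column names ("ScrapeTargets", "Url", "SourceIndex") are placeholders, assuming a SQL Server backend:
// Requires System.Data.SqlClient.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "INSERT INTO ScrapeTargets (Url, SourceIndex) VALUES (@url, @idx)", conn))
{
    cmd.Parameters.AddWithValue("@url", webBrowser.Url.ToString());
    cmd.Parameters.AddWithValue("@idx", targetSourceIndex);
    conn.Open();
    cmd.ExecuteNonQuery();
}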
Related
So I'm trying to scrape a website using AngleSharp and want to access a particular button that is nested deep in the site. I have logged the parsed document HTML with document.DocumentElement.OuterHtml, but can only see so far into the document:
<div class="l-propertySearch-paginationAndSearchFooter" data-test="pagination">
<div data-bind="component: 'pagination'"></div>
</div>
</div>
However, when I inspect the page in the browser, I can see the additional layers necessary to access the button: the div with the data-bind attribute "component: 'pagination'" opens up further in the inspector, but those layers don't appear in the log, which is why, I suspect, I can't retrieve the element.
I've experimented with document.QuerySelectorAll("button") and get back a list of buttons, but not the one I'm after; it's like the particular block I want doesn't exist. Any ideas what I'm doing wrong?
As far as I understand, the button you are looking for is created with JavaScript and does not exist in the original source code. That is the reason you can't access it with AngleSharp. Right-click on the website and click View page source (Ctrl + U in Chrome) and look for your button there; that is what AngleSharp sees, not the HTML inside the element inspector.
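For instance, a minimal sketch (assuming AngleSharp's default loader; the URL is a placeholder) showing that only the server-returned markup is parsed:
// Requires the AngleSharp NuGet package; run inside an async method.
var config = AngleSharp.Configuration.Default.WithDefaultLoader();
var context = AngleSharp.BrowsingContext.New(config);
var document = await context.OpenAsync("https://example.com");
// Only buttons present in the static markup are found here;
// script-generated buttons will be missing.
var buttons = document.QuerySelectorAll("button");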
I'm trying to get to this element: //*[@id="table-matches"]/table on this page: http://www.oddsportal.com/matches/soccer/20140221/
I want to get the table that contains the matches; it starts under the "Kick off time" tab. The element I'm looking for is 'table class="table-main"', and it is inside the element 'div id="table-matches" style="display: block;"'.
I tried getting this document with HtmlAgilityPack in C#, and I can find the div element, but it has no child nodes (there should be a table child node). If I try to get the table, the result is null. Here is the code:
var webGet = new HtmlWeb();
var document = webGet.Load("http://www.oddsportal.com/matches/soccer/20140221/");
var div = document.DocumentNode.SelectNodes("//div[@id='table-matches']");
var table = document.DocumentNode.SelectNodes("//*[@id='table-matches']/table");
var table2 = document.DocumentNode.SelectNodes("//table");
So the div variable contains the div element (but with no child nodes), table is null, and table2 contains 4 elements, but none of them is the desired table.
I figured there was a problem with HtmlAgilityPack and tried to fetch the whole web page with Python. I saved the whole HTML document to a text file, searched it, and found the div element, but it is empty: there is no table element inside. Why is that? Why can I see the table element in Chrome or Internet Explorer, but when I download the HTML there is no such element?
Here is the Python code:
import urllib

url = urllib.urlopen("http://www.oddsportal.com/matches/")
document = url.read()
htmlOddsPortal = open("htmlOddsPortal.txt", "w")
htmlOddsPortal.write(document)
htmlOddsPortal.close()  # close to flush everything to disk
Here is the element in the final text document:
<div id="table-matches"></div> <!-- END PAGE BODY -->
The table is loaded with JavaScript (probably via AJAX), so you won't get it with webGet.Load(); you only get the HTML that the server returns in the response.
You can check this in Chrome: open DevTools (F12), click Settings, check Disable JavaScript, then refresh the page. You will see blank content.
I had the same problem, but I was working in Java and used HtmlUnit to solve it. There is probably a similar tool for C#, or you can check whether HtmlAgilityPack can make asynchronous calls, or use something like the WebBrowser component (see the sketch below).
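In C#, a rough sketch with the WinForms WebBrowser control (untested; it needs an STA thread with a message loop, and DocumentCompleted can still fire before late AJAX calls finish):
var browser = new System.Windows.Forms.WebBrowser();
browser.ScriptErrorsSuppressed = true;
browser.DocumentCompleted += (s, e) =>
{
    // By now the control has executed the page's JavaScript.
    var div = browser.Document.GetElementById("table-matches");
    if (div != null)
        Console.WriteLine(div.InnerHtml);
};
browser.Navigate("http://www.oddsportal.com/matches/soccer/20140221/");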
We are supplied with HTML 'wrapper' files from the client, which we need to insert our content into, and then render the HTML.
Before we render the HTML with our content inserted, I need to add a few tags to the <head> section of the client's wrapper, such as references to our script files, CSS and some meta tags.
So what I'm doing is
string html = File.ReadAllText(wrapperLocation, Encoding.GetEncoding("iso-8859-1"));
and now I have the complete HTML. I then search for a pre-defined content well in that string and insert our content into that, and render it.
How can I create an instance of an HTML document and modify the <head> section as required?
edit: I don't want to reference System.Windows.Forms so WebBrowser is not an option.
I haven't tried this library myself, but this would probably fit the bill: http://htmlagilitypack.codeplex.com/
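Something along these lines should do it (an untested sketch; the stylesheet path and meta tag are placeholders):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var head = doc.DocumentNode.SelectSingleNode("//head");
// Append a stylesheet reference and a meta tag to the client's wrapper.
head.AppendChild(HtmlAgilityPack.HtmlNode.CreateNode(
    "<link rel=\"stylesheet\" href=\"/css/ours.css\" />"));
head.AppendChild(HtmlAgilityPack.HtmlNode.CreateNode(
    "<meta name=\"generator\" content=\"ours\" />"));
string result = doc.DocumentNode.OuterHtml;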
You can use https://github.com/jamietre/CsQuery to edit an HTML DOM.
// Create a DOM from an HTML string...
var dom = CQ.Create(html);
// ...or, alternatively, load one from a URL.
var dom = CQ.CreateFromUrl("http://www.jquery.com");
dom.Select("div > span")
   .Eq(1)
   .Text("Change the text content of the 2nd span child of each div");
Just select the head and add to it.
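For example (a small sketch; the script path is a placeholder):
var dom = CQ.Create(html);
// Select the head and append tags to it, jQuery-style.
dom["head"].Append("<script src=\"/js/ours.js\"></script>");
string updated = dom.Render();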
I use the WebBrowser control as host, and navigate/alter the document through its Document property.
Nice documentation and samples at the link above.
Are you using MasterPages?
This seems like the most obvious use of them.
The MasterPage has <asp:ContentPlaceHolder> controls for all the points where you want the content to go.
In our app we have a base controller that overrides the View() overloads so that it reads the name of the MasterPage from web.config. That way, customising the app is as simple as creating a new MasterPage, and from a controller's point of view there is no code change, since the base class handles the MasterPage/web.config plumbing.
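Something like this (a sketch only; the "MasterPageName" appSettings key is our own convention, not a framework feature):
// Requires System.Web.Mvc and System.Configuration.
public class BaseController : Controller
{
    protected override ViewResult View(string viewName, string masterName, object model)
    {
        // Ignore the supplied master name and read it from web.config instead.
        masterName = ConfigurationManager.AppSettings["MasterPageName"];
        return base.View(viewName, masterName, model);
    }
}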
I couldn't get an automated solution to this, so it came down to a hack:
public virtual void PopulateCssTag(string tags)
{
    // 'tags' is a pre-composed string containing all the tags I need.
    this.Wrapper = this.Wrapper.Replace("</head>", tags + "</head>");
}
How can I iterate over the HTML nodes of a web page and get the CSS text of each node? I need something like what Firebug does: if you click on a node, it gives you the complete list of all CSS text associated with that node (even inherited styles).
My main problem is not actually iterating over the HTML nodes; I am doing that with the Html Agility Pack library. I just need to get the complete CSS for each node.
P.S. I am sorry, I should have explained that I want to do this in C# (not JavaScript).
I found the following code snippet useful; it works for every element in the page, and the 'currentStyle' property shows the element's computed style:
HTMLDocument doc = (HTMLDocument)axWebBrowser1.Document;
var body = (HTMLBody)doc.body;
var childs = (IHTMLDOMChildrenCollection)body.childNodes;
var currentElementType = (HTMLBody)childs.item(0);
// currentStyle exposes the element's computed style.
var width = currentElementType.currentStyle.width;
Note that, as in my previous post, axWebBrowser1 is a WebBrowser control.
If you want the current styles for an element, look into getComputedStyle(), but if you want the inheritance too then you may have to implement the style cascade. Firebug does quite a lot of work behind the scenes to generate what you see!
You can get the CSS text from the style attribute like this:
node.getAttribute('style')
Or, if you want the style object, you can iterate through the keys and values in
node.style
If you want to grab the entire computed style of the element and not just the CSS applied in the style attribute, read this article on computed and cascaded styles.
You can use the WebBrowser control in C# to access the HTML document object and cast its body tag as follows:
HTMLDocument doc = (HTMLDocument)axWebBrowser1.Document;
var body = (HTMLBody)doc.body;
But before that you should add a COM reference to MSHTML in your project.
Here you can access body.currentStyle, which shows all of its styles, whether from CSS or inline.
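For instance (illustrative only):
// A couple of the computed values exposed by currentStyle (MSHTML):
var bgColor = body.currentStyle.backgroundColor;
var fontSize = body.currentStyle.fontSize;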
You can try the for (property in objName) operator, as seen here.
I'm not sure you can simply get "all" CSS properties using JavaScript, to be honest; you could look into the [DOMNode].currentStyle, [DOMNode].style and document.defaultView.getComputedStyle thingamajiggies. They should contain the 'current' style the element has. What you could then do is keep an array of all CSS properties you want to test and loop through them with a function of your own that gets each CSS property using the aforementioned methods (depending on the browser). I usually try DOMNode.style[property] first, as this is "inline" JavaScript and always wins over everything, then I sniff whether the browser uses the .currentStyle method or .getComputedStyle and use the correct one.
It's not perfect and you might need to clean up some things (height: auto; versus the actual current height; some browsers might return RGB colours instead of hex), etc.
So, yes, I don't know of anything prefab that you can use in JavaScript.
Hi, I tried to read a page using HttpWebRequest, like this:
string lcUrl = "http://www.greatandhra.com";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252); // Windows default Code Page
StreamReader loResponseStream =
new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
mydiv.InnerHtml = lcHtml;
// Response.Write(lcHtml);
loWebResponse.Close();
loResponseStream.Close();
I am able to read that page and bind it to mydiv, but when I click on any of the links in that div, nothing is displayed, because my application doesn't contain the entire site. What should I do now?
Can somebody copy my code and test it, please?
Nagu
I'm fairly sure you can't insert a full page into a DIV without breaking something. In fact, the whole head tag may be getting skipped altogether (and any JavaScript code there may not be run). Considering what you seem to want to do, I suggest you use an IFRAME with a dynamic src, which will also hopefully take some pressure off your server (which would no longer be in charge of fetching the HTML to be mirrored).
If you really want a whole page of HTML embedded in another, then the IFRAME tag is probably the one to use, rather than the DIV.
Rather than having to create a web request and all that code to retrieve the remote page, you can just set the src attribute of the IFRAME to point to the page you want it to display.
For example, something like this in markup:
<iframe src="<%=LcUrl %>" frameborder="0"></iframe>
where LcUrl is a property on your code-behind page that exposes the string lcUrl from your sample.
Alternatively, you could make the IFRAME runat="server" and set its src property programmatically (or even inject the innerHTML in a way similar to your code sample if you really wanted to).
The code you are putting inside .InnerHtml of the div contains the entire page (including <html>, <body>, </html> and </body>), which can cause a myriad of problems in any number of browsers.
I would either move to an iframe, or consider parsing the remote site's HTML and displaying a transformed version (i.e. strip the HTML, BODY, and META tags, rewrite some link URLs, etc.).
But when I click on any one of the links in that div it is not displaying any result
Probably because the links in the downloaded page are relative. If you just copy the HTML into a DIV in your page, the browser considers the links relative to the current URL: it doesn't know the origin of this content. I think the solution is to parse the downloaded HTML and convert relative URLs in href attributes to absolute ones.
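A sketch of that conversion with HtmlAgilityPack (the base URI here is the site from the question):
var baseUri = new Uri("http://www.greatandhra.com/");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(lcHtml);
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        var href = link.GetAttributeValue("href", "");
        // Uri.TryCreate resolves relative paths against the base URI.
        if (Uri.TryCreate(baseUri, href, out var absolute))
            link.SetAttributeValue("href", absolute.ToString());
    }
}
string rewritten = doc.DocumentNode.OuterHtml;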
If you want to embed it, you need to strip everything but the body part. That means you have to parse your string lcHtml for <body...> and remove everything up to and including the body tag, and likewise strip away everything from </body> onwards. Then you need to find all occurrences of <a href="..."> that do not start with http:// and prefix them with http://www.greatandhra.com, or set <base href="http://www.greatandhra.com/"> in your head section.
If you don't want to embed, simply clear the response buffer and stream the lcHtml string back to the browser.
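The non-embedding route is as simple as this (a minimal sketch, assuming ASP.NET WebForms):
Response.Clear();       // drop anything already buffered for this response
Response.Write(lcHtml); // send the fetched page through unchanged
Response.End();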
Sounds like what you are trying to do is display a different site embedded in your own. For this to work by dropping it into a div, you would have to extract the code between the body tags, as it wouldn't be valid with html and head in the middle of another page.
The links won't work because you've taken that page out of context in your site, so you'd also have to rewrite any relative links (i.e. those that don't start with http) to point to a page on your site that fetches the other site's page and displays it back in your site, or you could prepend the other site's URL to all the relative links so they link back to that site.