I'm a bit confused about how to extract specific href links from an HTML page. There are certainly plenty of examples out there, but they seem to cover either grabbing an href when there's just one on the page, or gathering all of the links.
So I currently push the HTML document into a text file using HttpWebRequest, HttpWebResponse, and StreamReader.
Here's the little sample I'm working with; it just downloads the page at the URL of my choice and saves it to a text file.
protected void btnURL_Click(object sender, EventArgs e)
{
    string url = txtboxURL.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    //lblResponse.Text = sr.ReadToEnd();
    string urldata = sr.ReadToEnd();
    if (File.Exists(@"C:\Temp\test.txt"))
    {
        File.Delete(@"C:\Temp\test.txt");
    }
    File.Create(@"C:\Temp\test.txt").Close();
    File.WriteAllText(@"C:\Temp\test.txt", urldata);
    sr.Close();
    response.Close();
}
I can search the entire text file for an href, but there are a lot of them on each page, and the ones I'm looking for are sectioned inside a <nav> tag, and then they are all in <div> tags with the same class, sort of like this:
<nav class="deptVertNav">
<div class="acTrigger">
<a href="*this is what I need to get*" ....
....
</a>
</div>
<div class="acTrigger">
<a href="*etc*" ....
....
</a>
</div>
<div class="acTrigger">
<a href="*etc*" ....
....
</a>
</div>
</nav>
Essentially I'm trying to create a text crawler/scraper to retrieve links. The current pages I'm working with start at a main page with links down the side in a navigation bar. Those links in the navigation bar are what I want to get at, so that I can download each of those pages' content and then retrieve the real data I'm looking for. So this is all just one big parse job, and I am terrible at parsing. If I can figure out how to parse this first main page, then I will be able to parse the sub-pages.
I don't want anyone to just give me the answer; I just want to know what a good method of parsing would be in this situation, i.e. how do I narrow the parse down to just those tags, and then what would be a good dynamic way to store those links so I can access them later? I hope this makes sense.
EDIT: Well, I am now attempting to use HtmlAgilityPack, with much confusion. To my knowledge this will retrieve all of the <div class="acTrigger"> nodes within the page I load:
var divs = html.DocumentNode.SelectNodes("//div[@class='acTrigger']");
The next question is how I get inside each <div> tag and into the <a> tag, then retrieve the href value and store it.
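My best guess so far is to loop over those nodes, drill into each one's <a>, and collect the href values into a list, something along these lines (untested; the links list is just my own name for it):
var links = new List<string>();
if (divs != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in divs)
    {
        // each acTrigger div should contain one anchor; grab its href if it has one
        var anchor = node.SelectSingleNode(".//a[@href]");
        if (anchor != null)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
    }
}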
Instead of trying to manually parse the text file, I would recommend loading the HTML into an HtmlDocument (https://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument(v=vs.110).aspx) or a WebBrowser control (https://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(v=vs.110).aspx). This gives you access to the elements already parsed. From there you can easily find all of the DIV elements with the appropriate class, and then the A element inside each of them.
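For example, with a WebBrowser control that has already navigated to the page, a rough sketch might look like this (webBrowser is just an assumed control name, and acTrigger is the class from the question):
var links = new List<string>();
foreach (HtmlElement div in webBrowser.Document.GetElementsByTagName("div"))
{
    // only the divs carrying the acTrigger class are interesting
    if (div.GetAttribute("className") == "acTrigger")
    {
        foreach (HtmlElement anchor in div.GetElementsByTagName("a"))
        {
            links.Add(anchor.GetAttribute("href"));
            break; // only the first anchor in each div is needed
        }
    }
}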
Take a look at the Selenium WebDriver library, then grab the URLs as needed.
IWebElement anchorUrl1 = driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[1]/a[1]"));
string url1 = anchorUrl1.GetAttribute("href"); // .Text would only return the visible link text
IWebElement anchorUrl2 = driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[2]/a[1]"));
string url2 = anchorUrl2.GetAttribute("href");
If all you want to do is click on them, then:
driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[1]/a[1]")).Click();
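And if you want to collect every link in that nav in one pass rather than indexing them one by one, something like this sketch should work (untested):
var navLinks = new List<string>();
foreach (IWebElement anchor in driver.FindElements(By.XPath("//nav[@class='deptVertNav']/div/a")))
{
    // href holds the actual URL rather than the visible link text
    navLinks.Add(anchor.GetAttribute("href"));
}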
Related
My program needs to get the RSS link and then go read the RSS feed.
I found that when I parse down to the layer <div id="titleRSS_7224" class="rss"></div>,
the <a> inside of it is gone.
I'm using the HtmlAgilityPack.
I can see the <a> in Google Chrome:
<div id="titleRSS_7224" class="rss">
<a title="RSS 2.0" target="_blank" rel="nofollow" href="/rss/media/bz0xMiZmbHBsPTIxMjEzNjYsMjAsODQwLDAmZng9.rss"></a>
</div>
My code is:
HtmlDocument temNode = new HtmlDocument();
string temStr = page.DocumentNode.SelectSingleNode(longPath).InnerHtml;
temNode.LoadHtml(page.DocumentNode.SelectSingleNode(longPath).InnerHtml);
Then I check both temStr and temNode, and the <a> is not in there.
I got another idea, which is to do:
HtmlNode temNode = page.DocumentNode.SelectSingleNode("//a[@title='RSS 2.0']");
This works.
But I just want to know why the first method does not work.
Perhaps if you select the single node itself instead of its InnerHtml, you'll be able to enumerate its ChildNodes?
Just spitballing though, as I'm not familiar with that API.
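Something along these lines, maybe (untested, reusing the longPath variable from the question):
var container = page.DocumentNode.SelectSingleNode(longPath);
foreach (HtmlNode child in container.ChildNodes)
{
    if (child.Name == "a")
    {
        // the href of the RSS anchor
        string rssHref = child.GetAttributeValue("href", string.Empty);
    }
}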
I need to write a page (it can use PHP or .NET) that will display the unmodified HTML for an element of another page.
The other page may not have valid HTML, but we want it to be returned unmodified. We will not be selecting based on the invalid elements, but will select their parent element and need them returned unmodified.
An example HTML page that my page will be fetching:
<body>
<div>
<p>test1</p>
<br>
<p>test2
<p>test3</p>
</div>
</body>
So far, everything I have tried attempts to fix the HTML: it makes the <br> in the example self-closing, and the second paragraph tag gets closed.
Is there anything out there that can do this?
Thanks!
I am using WatiN and trying to scrape an image URL from a web page, based on the field's class. Viewing the site's code, the image's info displays like this:
//images code
<div class="doc-banner-icon">
<img src="https://website.com/image.jpg">
</div>
//text code
<div id="doc-original-text">
Once upon a time, in a land far far away...
</div>
What I want to do is use a WatiN call to find that img link. I thought I could use something like the Find.ByClass() call to find specifically that area of the code, but I can't seem to figure out how to get the line of text contained within that class. When I use Find.ById() on a different field and convert it to a string, it pulls the text content of that area. Below is what I am trying:
using (myIE)
{
    //loads the website
    myIE.GoTo(txtbxWeblink.Text);
    string infoText = myIE.Div(Find.ByClass("doc-banner-icon")).ToString();
    //This will successfully return the text field's text.
    string imageText = myIE.Div(Find.ById("doc-original-text")).ToString();
}
EDIT - It appears that I may need to use a different call on myIE; there are also myIE.Image, myIE.Link, etc. I don't know much about all this yet, so I'm not sure if Div is the right call here.
Try this...
string infoText = myIE.Div(Find.ByClass("doc-banner-icon")).Images.First().Src;
string imageText = myIE.Div(Find.ById("doc-original-text")).Text;
Hi, I tried to read a page using HttpWebRequest like this:
string lcUrl = "http://www.greatandhra.com";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252); // Windows default Code Page
StreamReader loResponseStream =
new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
mydiv.InnerHtml = lcHtml;
// Response.Write(lcHtml);
loWebResponse.Close();
loResponseStream.Close();
I am able to read that page and bind it to mydiv. But when I click on any of the links in that div, it does not display any result, because my application doesn't contain the entire site. So what should I do now?
Can somebody copy my code and test it, please?
Nagu
I'm fairly sure you can't insert a full page into a DIV without breaking something. In fact, the whole head tag may be getting skipped altogether (and any JavaScript code there may not be run). Considering what you seem to want to do, I suggest you use an IFRAME with a dynamic src, which will also hopefully lift some pressure off your server (it wouldn't be in charge of fetching the HTML to be mirrored anymore).
If you really want a whole page of HTML embedded in another, then the IFRAME tag is probably the one to use, rather than the DIV.
Rather than having to create a web request and have all that code to retrieve the remote page, you can just set the src attribute of the IFRAME to point to the page you want it to display.
For example, something like this in markup:
<iframe src="<%=LcUrl %>" frameborder="0"></iframe>
where LcUrl is a property on your code-behind page, that exposes your string lcUrl from your sample.
Alternatively, you could make the IFRAME runat="server" and set its src property programmatically (or even inject the innerHTML in a way similar to your code sample, if you really wanted to).
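For example, something like this (a rough sketch; contentFrame is just a name I've made up):
<iframe id="contentFrame" runat="server" frameborder="0"></iframe>
And in the code-behind:
// an iframe marked runat="server" shows up as an HtmlGenericControl,
// so you can point it at the remote page instead of fetching the HTML yourself
contentFrame.Attributes["src"] = lcUrl;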
The code you are putting inside the .InnerHtml of the div contains the entire page (including <html>, <body>, </html> and </body>), which can cause a myriad of problems with any number of browsers.
I would either move to an iframe, or consider some sort of parsing of the HTML from the remote site and displaying a transformed version (i.e. strip the HTML, BODY, and META tags, replace some link URLs, etc.).
"But when i click on any one of links in that div it is not displaying any result"
Probably because the links in the downloaded page are relative. If you just copy the HTML into a DIV in your page, the browser considers the links relative to the current URL: it doesn't know about the origin of this content. I think the solution is to parse the downloaded HTML and convert the relative URLs in the href attributes to absolute URLs.
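One way to do that conversion, using HtmlAgilityPack for the parsing (just a sketch, reusing lcUrl and lcHtml from the question):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(lcHtml);
var baseUri = new Uri(lcUrl);
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        // resolve each href against the original page's address
        string href = anchor.GetAttributeValue("href", string.Empty);
        anchor.SetAttributeValue("href", new Uri(baseUri, href).ToString());
    }
}
lcHtml = doc.DocumentNode.OuterHtml;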
If you want to embed it, you need to strip everything but the body part. That means you have to parse your string lcHtml for <body...> and remove everything before and including the body tag. You must also strip away everything from </body> onwards. Then you need to parse the string for all occurrences of <a href="..."> that do not start with http:// and prepend http://www.greatandhra.com, or set <base href="http://www.greatandhra.com/"> in your head section.
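A crude sketch of that stripping step (no error handling; lcHtml and mydiv come from the question):
// keep only what sits between <body ...> and </body>
int bodyOpen = lcHtml.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
int bodyStart = lcHtml.IndexOf('>', bodyOpen) + 1;
int bodyEnd = lcHtml.IndexOf("</body>", StringComparison.OrdinalIgnoreCase);
mydiv.InnerHtml = lcHtml.Substring(bodyStart, bodyEnd - bodyStart);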
If you don't want to embed, simply clear the response buffer and stream the lcHtml string back to the browser.
Sounds like what you are trying to do is display a different site embedded in your site. For this to work by dropping it into a div, you would have to extract the code between the body tags, as the markup wouldn't be valid with html and head tags in the middle of another page.
The links won't work because you've now taken that page out of context in your site, so you'd also have to rewrite any links on the page that are relative (i.e. don't start with http) to point to a page on your site which will then fetch the other site's page and display it back in your site. Alternatively, you could add the URL of the site you're grabbing to the beginning of all the relative links so they link back to that site.
I am looking to develop a Web scraper in C# Windows Forms. What I am trying to accomplish is as follows:
Get the URL from the user.
Load the Web page in the IE UI control (embedded browser) in WinForms.
Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page.
When the user wishes to persist the location (the HTML DOM location), it has to be persisted into the DB, so that the user may use that location to fetch the data at that location during subsequent visits.
Assume that the loaded website is a price-listing site and the quoted rate keeps changing; the idea is to persist the DOM hierarchy so that I can traverse it next time.
I would be able to do this if all the HTML elements had id attributes. In the case where the id is null, I am not able to accomplish this.
Could someone suggest a valid approach for this (a bare-minimum code snippet if possible)?
It would be helpful even if you can just share some online resources.
thanks,
vijay
One approach is to build a stack of tags/styles/id down to the element which you want to select.
From the element you want, traverse up to the nearest id element. This way you will get rid of most of the top header etc. Then build a sequence to look for.
Example:
<html>
<body>
<!-- lots of html -->
<div id="main">
<div>
<span>
<div class="pricearea">
<table> <!-- with price data -->
For the example you would store in your DB a sequence of: [id=main],div,span,div,table, or perhaps div[class=pricearea],table.
Styles/classes might also be used to create your path. It's your choice whether to look for a tag, an attribute of a tag, or a combination. You want it as accurate as possible, with as few elements as possible, to make it robust.
If the layout seldom changes, this would let you navigate to the same location each time.
I would also suggest you perhaps use HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.
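For instance, with HTML Agility Pack the stored sequence could be replayed as an XPath query; a minimal sketch assuming the example layout above (pageHtml is just an assumed variable holding the page source):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageHtml);
// replay the stored path: [id=main] ... div[class=pricearea] -> table
var priceTable = doc.DocumentNode.SelectSingleNode(
    "//div[@id='main']//div[@class='pricearea']//table");
if (priceTable != null)
{
    // the table with the price data
    Console.WriteLine(priceTable.InnerText);
}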
Screen scraping is fun, but it's difficult to get it 100% for all pages. Good luck!
After a bit of googling, I encountered a fairly simple solution. Attached below is the sample snippet.
if (webBrowser.Document != null)
{
    IHTMLDocument2 HtmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument; // loads the HTML DOM
    IHTMLSelectionObject selection = HtmlDoc.selection; // fetches the currently selected HTML element
    IHTMLTxtRange range = (IHTMLTxtRange)selection.createRange();
    IHTMLElement parentElement = range.parentElement(); // identifies the parent element
    targetSourceIndex = parentElement.sourceIndex;
    //dataLocation = range.parentElement().id;
    MessageBox.Show(range.text); //range.parentElement().sourceIndex
}
I used an embedded WebBrowser control in a WinForms application, which loads the HTML DOM of the current web page.
The IHTMLElement instance exposes a property named 'sourceIndex', which assigns a unique index to each of the HTML elements.
One can store this sourceIndex in the DB and query for the content at that location using the following code:
if (webBrowser.Document != null)
{
    IHTMLDocument2 HtmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLElement targetElement = null;
    foreach (IHTMLElement domElement in HtmlDoc.all)
    {
        if (domElement.sourceIndex == int.Parse(node.InnerText)) // fetching the persisted data from the XML file
        {
            targetElement = domElement;
            break;
        }
    }
    MessageBox.Show(targetElement.innerText); //range.parentElement().sourceIndex
}