How can I find and scrape by Class using WatiN? - c#

I am using WatiN and trying to scrape an image URL from a weblink, based on the field's class. Viewing the site's code, the image's info displays as this:
//images code
<div class="doc-banner-icon">
<img src="https://website.com/image.jpg">
</div>
//text code
<div id="doc-original-text">
Once upon a time, in a land far far away...
</div>
What I want to do is use a WatiN call to find that img link. I thought I could use something like the Find.ByClass() call to find specifically that area of the code, but I can't seem to figure out how to get the line of text contained within that class. When I use Find.ById() on a different field and convert it to a string, it pulls the text content of that area. Below is what I am trying.
using (myIE)
{
    //loads the website
    myIE.GoTo(txtbxWeblink.Text);
    string infoText = myIE.Div(Find.ByClass("doc-banner-icon")).ToString();
    //This will successfully return the text field's text.
    string imageText = myIE.Div(Find.ById("doc-original-text")).ToString();
}
EDIT - It appears that I may need to use a different call on myIE, there is also myIE.Image, myIE.Link etc, I don't know much about this all still so not sure if Div is the right call here.

Try this...
string infoText = myIE.Div(Find.ByClass("doc-banner-icon")).Images.First().Src;
string imageText = myIE.Div(Find.ById("doc-original-text")).Text;
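Putting the answer together with the question's snippet, a minimal sketch (assuming WatiN 2.x, and a using System.Linq; directive so First() works on the Images collection; myIE and txtbxWeblink come from the question):

```csharp
using (myIE)
{
    myIE.GoTo(txtbxWeblink.Text);

    // Find.ByClass matches the div's class attribute; the <img> inside
    // is reached through the div's Images collection, and Src holds the URL.
    var bannerDiv = myIE.Div(Find.ByClass("doc-banner-icon"));
    string imageUrl = bannerDiv.Exists ? bannerDiv.Images.First().Src : null;

    // For the text div, .Text returns the element's inner text,
    // whereas .ToString() only describes the WatiN wrapper object.
    string storyText = myIE.Div(Find.ById("doc-original-text")).Text;
}
```

The Exists check guards against the div not being found before dereferencing it; without it, WatiN would throw when the class isn't present on the page.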

Related

Change text in HTML that I'm grabbing from a text file

What I'm doing is grabbing the HTML code for my header from a text file.
But once a User logs in I want it to say Welcome "Username" at the top, which is a dropdown to account settings, cart, etc...
So since I'm inserting the HTML into a DIV on page load, I don't actually have access to any of the elements inside in C#.
How would I go about doing this? Is there any way to access something like a (p id="name")'s inner text after it's loaded in from the text file?
Would like to do this with C# not JS please.
Edit: I have a work around for now, but I am still interested in better answers.
headerText = headerText.Replace("::Username::", Session["Username"] as string);
Here is my code for grabbing the HTML and pasting it in.
string headerText = File.ReadAllText(Server.MapPath("~/global/header.html"));
string footerText = File.ReadAllText(Server.MapPath("~/global/footer.html"));
headerText = headerText.Replace("::Username::", Session["Username"] as string);
divHeader.InnerHtml = headerText;
divFooter.InnerHtml = footerText;
To be more clear, is there any way to access something like
<asp:Panel ID="panelAccount" runat="server">
which is stored in another HTML file.
I have done similar work in C# before; I used HtmlAgilityPack for this kind of work. Later I started using AngleSharp, since it has very good CSS-selector support.
Try AngleSharp and you can modify the HTML tags like you do in jQuery.
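For illustration, a rough AngleSharp sketch of that jQuery-style workflow, applied to the question's headerText (a sketch under assumptions, not a drop-in solution):

```csharp
using AngleSharp.Html.Parser;

// Parse the header fragment that was read from the text file...
var parser = new HtmlParser();
var document = parser.ParseDocument(headerText);

// ...then use a CSS selector, jQuery-style, to find the element
// and set its text before handing the HTML to the div.
var nameElement = document.QuerySelector("p#name");
if (nameElement != null)
    nameElement.TextContent = "Welcome " + (Session["Username"] as string);

divHeader.InnerHtml = document.Body.InnerHtml;
```

QuerySelector accepts the same selectors you would pass to jQuery, which is what makes this approach feel familiar.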
As much as I liked Adrian's response, I found a better way to do what I needed: I could do it with a MasterPage. I'll definitely be using Adrian's answer for a load of other things though, so it still holds valid.
Then in the C# file associated with the master page you create a property like this.
public Panel PanelAccount
{
get { return panelAccount; }
set { panelAccount = value; }
}
Then from the regular WebForm, you can call "PanelAccount" to access that property.
Here is a tutorial for how to do that.
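From the content page, the call looks something like this (SiteMaster is an assumed name for the master page's class; substitute your own):

```csharp
// Cast Page.Master to the master page type to reach the exposed property.
var master = (SiteMaster)Page.Master;
master.PanelAccount.Visible = true;
```

If the WebForm declares <%@ MasterType VirtualPath="~/Site.master" %>, Page.Master is strongly typed and the cast becomes unnecessary.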
Thanks for everyone's downvotes with no input, you guys are stars!

Retrieving certain href links from html C#

I'm a bit confused on how to extract specific href links from an HTML page. There are certainly a good amount of examples, but they seem to cover either gathering an href when there's just one on the page, or gathering all the links.
So I currently push the HTML document into a text file using HttpWebRequest, HttpWebResponse, and StreamReader.
Here's my little sample I'm working with, this just downloads the URL of my choice and saves it to a text file.
protected void btnURL_Click(object sender, EventArgs e)
{
    string url = txtboxURL.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    //lblResponse.Text = sr.ReadToEnd();
    string urldata = sr.ReadToEnd();
    if (File.Exists(@"C:\Temp\test.txt"))
    {
        File.Delete(@"C:\Temp\test.txt");
    }
    File.Create(@"C:\Temp\test.txt").Close();
    File.WriteAllText(@"C:\Temp\test.txt", urldata);
    sr.Close();
    response.Close();
}
I can search the entire text file for a href, but there are a lot of them on each page, and the ones I'm looking for are sectioned in a <nav> tag, and then they are all in <div> tags with the same class, sort of like this:
<nav class="deptVertNav">
<div class="acTrigger">
<a href="*this is what I need to get*" ....
....
</a>
</div>
<div class="acTrigger">
<a href="*etc*" ....
....
</a>
</div>
<div class="acTrigger">
<a href="*etc*" ....
....
</a>
</div>
</nav>
Essentially I'm trying to create a text crawler/scraper to retrieve links. The current pages I'm working with start at a main page with links down the side on a navigation bar. Those links in the navigation bar are what I want to get to so I may download each of those page's content, and then retrieve the real data I'm looking for. So this is all just one big parse job, and I am terrible at parsing. If I can figure out how to parse this first main page then I will be able to parse the sub pages.
I don't want anyone to just give me the answer, I just want to know what a good method of parsing would be in this situation, i.e. how do I narrow the parse down to just those tags, and then what would be a good dynamic way to store those links so I can access them later? I hope this makes sense.
EDIT: Well I am now attempting to use HtmlAgilityPack with much confusion. To my knowledge this will retrieve all the nodes that are a <div class="acTrigger"> that are within the page I load:
var div = html.DocumentNode.SelectNodes("//div[@class='acTrigger']");
The next question is how I get inside the <div> tag and into the <a> tag, and then retrieve the href value, and store it.
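Continuing the edit above, one way to drill from each matched <div> into its <a> with HtmlAgilityPack looks roughly like this (a sketch assuming the same html document object; note that XPath attribute tests use @class):

```csharp
var links = new List<string>();
var divs = html.DocumentNode.SelectNodes("//div[@class='acTrigger']");
if (divs != null)   // SelectNodes returns null when nothing matches
{
    foreach (var div in divs)
    {
        // .//a finds the first <a> descendant of this particular div
        var anchor = div.SelectSingleNode(".//a");
        if (anchor != null)
            links.Add(anchor.GetAttributeValue("href", string.Empty));
    }
}
```

A plain List<string> is also a reasonable answer to the "dynamic way to store those links" part: you can loop over it later to fetch each sub page.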
Instead of trying to manually parse the text file, I would recommend placing the HTML in a HtmlDocument control (https://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument(v=vs.110).aspx) or WebBrowser control (https://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(v=vs.110).aspx). This allows you to access the elements already parsed. From there you can easily find all DIV elements with the appropriate class, and then the A element inside of that.
Take a look at the Selenium Web Driver library. Then grab the urls as needed.
IWebElement anchorUrl1 = driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[1]/a[1]"));
string urlText1 = anchorUrl1.GetAttribute("href");
IWebElement anchorUrl2 = driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[2]/a[1]"));
string urlText2 = anchorUrl2.GetAttribute("href");
If all you want to do is click on them, then:
driver.FindElement(By.XPath("//nav[@class='deptVertNav']/div[1]/a[1]")).Click();

Code in C# that creates redirect buttons in HTML

I am trying to create a C# string builder that will later build an html page based on information in a database and user input.
What I am having trouble with is creating two buttons in the string where one button redirects to page a.asp and the other button redirects to b.asp
I have tried multiple methods but none seem to work. Here is my latest version of code but I might be way off track:
in my page.asp.cs file:
responseString +=
"<div>"
+"<table><tr>"
+"<td><button id='submitSave' type='submit'>Save</button></td>"
+ "<td><button id='continueBatch' onserverclick=\"OnClickButton\" type='submit' runat'server'>Continue Batch</button></td>"
+ "<td><button id='submitDelete' onclick='confirmDialog()' type='button'>Delete</button></td>"
+ "</tr></table></div>";
and it points to the method also in page.asp.cs:
public void OnClickButton()
{
//Redirect to New
Response.Redirect(String.Format("New.aspx?fmtypeP={0}&formverP={1}", FormTypeS, FormVersionS));
}
and last but not least I have in the page.asp page the following:
<form action="save.aspx" method="POST">
I know the form action will need to change but I just wanted to let you know I currently had it in place.
Am I working in the right direction? Is there an easier way to accomplish my task? If not what am I doing wrong?
I would rely on ajax to solve this problem if possible. One solution would be to use onclick and call an existing javascript method like you seem to have done for the confirmDialog() on the delete button. Maybe use jquery's wrapper for the ajax call: http://api.jquery.com/jquery.ajax/. Let the server method return the redirect link and use window.location = the return value in the success-method.
Or if you know what the redirect links should be when building the string you can make the redirect directly: onclick='window.location = "your redirect link"'
You cannot use server-side controls like that. In your code they are just added to the response stream and never compiled. You need to add the controls to your aspx page, or create them in code-behind as actual controls via new(), not as strings.
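A sketch of that second option: creating the redirect button as a real control in code-behind (assumes a PlaceHolder named buttonArea on the page; FormTypeS and FormVersionS come from the question):

```csharp
// Create the button as an actual server control so its Click event
// runs on the server, then add it to a container on the page.
var continueButton = new Button { ID = "continueBatch", Text = "Continue Batch" };
continueButton.Click += (s, args) =>
    Response.Redirect(String.Format("New.aspx?fmtypeP={0}&formverP={1}",
                                    FormTypeS, FormVersionS));
buttonArea.Controls.Add(continueButton);
```

Note that dynamically created controls must be re-created on every postback (typically in Page_Init), or their events will not fire.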

Pull specific IDs from HTML code C#

I am making a program where you can log into Facebook, go to someone's wall and post on it. I have all of this working if I manually enter the ID that I go and find in the HTML for the textview box where you write the status, and the Post button you click to post. My problem is that the ID for this textbox and the post button changes every single time you go to the page or the page refreshes. So you can't just code in the ID to get it, like this: HtmlElement element = browser.Document.GetElementById("some-ID"); I need to somehow parse the HTML code and pull out the specific IDs for the textview and the button every time the page refreshes. I have looked into Html Agility Pack and it hasn't really helped me any. Can anyone help?
EDIT
Here is the line of code I am trying to get the ID out of for the text area where you post. You can see that it is just a random garble of letters and some numbers at the end. It changes every time you load the page. And I haven't looked at the facebook api would it help any?
<DIV class="innerWrap"><TEXTAREA aria-expanded="false" onkeydown='Bootloader.loadComponents(["control-textarea"], function() { TextAreaControl.getInstance(this) }.bind(this)); ' id="ujkzdk454" class="DOMControl_placeholder uiTextareaAutogrow input mentionsTextarea textInput" title="Write something..." role="textbox" aria-owns="typeahead_list_ujkzdk453" name="xhpc_message" autocomplete="off" aria-autocomplete="list" placeholder="Write something..." aria-label="Write something...">Write something...</TEXTAREA></DIV></DIV></DIV><INPUT
I agree you should probably be using the Facebook API, but anyway, how about something like this with jQuery?
var textarea = $('textarea', '.innerWrap');
var button = $('.innerWrap').next('input');
and if you still want the ids
var textareaId = textarea.attr('id')
var buttonId = button.attr('id')

HttpWebRequest reads only the homepage

Hi, I tried to read a page using HttpWebRequest like this:
string lcUrl = "http://www.greatandhra.com";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252); // Windows default Code Page
StreamReader loResponseStream =
new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
mydiv.InnerHtml = lcHtml;
// Response.Write(lcHtml);
loWebResponse.Close();
loResponseStream.Close();
I am able to read that page and bind it to mydiv, but when I click on any of the links in that div it does not display any result, because my application doesn't contain the entire site. What should I do now?
Can somebody copy my code and test it, please?
Nagu
I'm fairly sure you can't insert a full page in a DIV without breaking something. In fact the whole head tag may be getting skipped altogether (and any javascript code there may not be run). Considering what you seem to want to do, I suggest you use an IFRAME with a dynamic src, which will also hopefully lift some pressure off your server (which wouldn't be in charge of fetching the html to be mirrored anymore).
If you really want a whole page of HTML embedded in another, then the IFRAME tag is probably the one to use, rather than the DIV.
Rather than having to create a web request and have all that code to retrieve the remote page, you can just set the src attribute of the IFRAME to point to the page you want it to display.
For example, something like this in markup:
<iframe src="<%=LcUrl %>" frameborder="0"></iframe>
where LcUrl is a property on your code-behind page, that exposes your string lcUrl from your sample.
Alternatively, you could make the IFRAME runat="server" and set its src property programmatically (or even inject the innerHTML in a way similar to your code sample if you really wanted to).
The code you are putting inside .InnerHtml of the div contains the entire page (including <html>, <body>, </html> and </body>), which can cause a myriad of problems with any number of browsers.
I would either move to an iframe, or consider some sort of parsing of the remote site's HTML and displaying a transformed version (i.e. strip the HTML, BODY, META tags, replace some link URLs, etc.).
But when i click on any one of links in that div it is not displaying any result
Probably because the links in the download page are relative... If you just copy the HTML into a DIV in your page, the browser considers the links relative to the current URL : it doesn't know about the origin of this content. I think the solution is to parse the downloaded HTML, and convert relative URLs in href attributes to absolute URLs
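The relative-to-absolute conversion mentioned here can be done with System.Uri; a small sketch (the href value is a hypothetical example of a relative link found in the downloaded HTML):

```csharp
// Resolve a relative href against the downloaded page's address.
var baseUri = new Uri("http://www.greatandhra.com");
string href = "/telugu/news.aspx";   // hypothetical relative link from the page
string absolute = new Uri(baseUri, href).AbsoluteUri;
// absolute is "http://www.greatandhra.com/telugu/news.aspx"
```

You would run each href you find through this before writing the HTML into your div, so the browser no longer resolves the links against your own URL.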
If you want to embed it, you need to strip everything but the body part. That means you have to parse your string lcHtml for <body...> and remove everything up to and including the body tag. You must also strip away everything from </body> onwards. Then you need to parse the string for all occurrences of <a href="..."> that do not start with http:// and prefix them with http://www.greatandhra.com, or set <base href="http://www.greatandhra.com/"> in your head section.
If you don't want to embed, simply clear the response buffer and stream the lcHtml string back to the browser.
Sounds like what you are trying to do is display a different site embedded in your site. For this to work by dropping it into a div you would have to extract the code between the body tags as it wouldn't be valid with html and head in the middle of another page.
The links won't work because you've now taken that page out of context in your site. So you'd also have to rewrite any links on the page that are relative (i.e. don't start with http) to point to a page on your site which will then fetch the other site's page and display it back in your site; or you could add the URL of the site you're grabbing to the beginning of all the relative links so they link back to that site.
