I am in the same situation as the person who asked this question. I need to get some data from a website and save it as a string.
My problem is that the website I need to get the data from requires the user to be logged in to view it.
My plan was to have the user go to the website using the WebBrowser control, log in, and then, once on the right page, click a button that automatically saves the data.
I want to use a method similar to the one in the top answer to the other question that I linked to at the start.
string data = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
I tried doing things like this:
string data = webBrowser1.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
But you can't call "webBrowser1.DocumentNode.SelectNodes".
I also saw that the answer to the other question uses HtmlAgilityPack, but I tried to download it and I have no idea what to do with it.
I'm not the best with C#, so please don't post overly complicated answers, or at least try to make them understandable.
Thanks in advance :)
Here is an example of HtmlAgilityPack usage:
public string GetData(string htmlContent)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(htmlContent);
    if (htmlDoc.DocumentNode != null)
    {
        // SelectNodes returns null when nothing matches the XPath, so check before indexing.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]");
        if (nodes != null && !string.IsNullOrEmpty(nodes[0].InnerText))
            return nodes[0].InnerText;
    }
    return null;
}
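To connect this to the WebBrowser control from the question, here is a minimal sketch of a form with a button that hands the loaded page to HtmlAgilityPack. The form layout, the placeholder URL, and the "//title" XPath are assumptions for illustration only. Note that DocumentText may not reflect changes that scripts make after the page has loaded.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical form: the user logs in inside the embedded browser,
// navigates to the right page, then clicks the button.
class ScrapeForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly Button saveButton = new Button { Text = "Save data", Dock = DockStyle.Top };

    public ScrapeForm()
    {
        Controls.Add(browser);
        Controls.Add(saveButton);
        browser.Navigate("http://example.com/"); // placeholder URL, replace with the real site

        saveButton.Click += (s, e) =>
        {
            // Parse the HTML of the page currently loaded in the control.
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(browser.DocumentText);

            // "//title" is only an example XPath; SelectSingleNode returns null on no match.
            var node = doc.DocumentNode.SelectSingleNode("//title");
            MessageBox.Show(node != null ? node.InnerText : "Not found");
        };
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new ScrapeForm());
    }
}
```

In an existing WinForms project you would only need the body of the click handler, using your own webBrowser1 and the XPath for the element you want.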
Edit: If you want to emulate actions in a browser, I would suggest using Selenium instead of the regular WebBrowser control. You can download it from http://www.seleniumhq.org/ or install it with NuGet. This is a good question on how to use it: How do I use Selenium in C#?.
I have a problem with a browse button and switching to the file dialog. I cannot simply send the file path string to my file path control, because it is read-only and the actual file path input is a control hidden behind it.
Here's my code
driver.FindElement(By.Id("browseButton")).Click();
driver.SwitchTo().ActiveElement().SendKeys(filepath);
The code above fills in my file path control, as I can see in the UI. But the open file dialog stays open, and I do not know how to close it and submit my upload.
Uploading files in Selenium can be a pain, to say the least. The real problem comes from the fact that it does not support dialog boxes such as file upload and download.
I go over this in an answer to another question, so I will just copy/paste my answer from there here. The code examples should actually be relevant in your case, since you are using C#:
Copied from previous answer on question here:
Selenium Webdriver doesn't really support this. Interacting with non-browser windows (such as native file upload dialogs and basic auth dialogs) has been a topic of much discussion on the WebDriver discussion board, but there has been little to no progress on the subject.
I have, in the past, been able to work around this by capturing the underlying request with a tool such as Fiddler2, and then just sending the request with the specified file attached as a byte blob.
If you need cookies from an authenticated session, WebDriver.manage().getCookies() should help you in that aspect.
edit: I have code for this somewhere that worked, I'll see if I can get ahold of something that you can use.
public RosterPage UploadRosterFile(String filePath){
Face().Log("Importing Roster...");
LoginRequest login = new LoginRequest();
login.username = Prefs.EmailLogin;
login.password = Prefs.PasswordLogin;
login.rememberMe = false;
login.forward = "";
login.schoolId = "";
//Set up request data
String url = "http://www.foo.bar.com" + "/ManageRoster/UploadRoster";
String javaScript = "return $('#seasons li.selected') .attr('data-season-id');";
String seasonId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);
javaScript = "return Foo.Bar.data.selectedTeamId;";
String teamId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);
//Send Request and parse the response into the new Driver URL
MultipartForm form = new MultipartForm(url);
form.SetField("teamId", teamId);
form.SetField("seasonId", seasonId);
form.SendFile(filePath,LoginRequest.sendLoginRequest(login));
String response = form.ResponseText.ToString();
String newURL = StaticBaseTestObjs.RemoveStringSubString("http://www.foo.bar.com" + response.Split('"')[1].Split('"')[0],"amp;");
Face().Log("Navigating to URL: "+ newURL);
Driver().GoTo(new Uri(newURL));
return this;
}
Where MultiPartForm is:
MultiPartForm
And LoginRequest/Response:
LoginRequest
LoginResponse
The code above is in C#, but there are equivalent base classes in Java that will do what you need them to do to mimic this functionality.
The most important part of all of that code is the MultiPartForm.SendFile method, which is where the magic happens.
One of the many ways to do that is to remove the disabled attribute and then use the usual Selenium SendKeys() call:
public void test(string path)
{
By byId = By.Id("removeAttribute");
const string removeAttribute = @"document.getElementById('browseButton').removeAttribute('disabled');";
((IJavaScriptExecutor)driver).ExecuteScript(removeAttribute);
driver.FindElement(byId).Clear();
driver.FindElement(byId).SendKeys(path);
}
You can use this Auto IT Script to Handle File Upload Option.
Auto IT Script for File Upload:
AutoItSetOption("WinTitleMatchMode","2") ; set the select mode to
Do
Sleep ("1000")
until WinExists("File Upload")
WinWait("File Upload")
WinActivate("File Upload")
ControlFocus("File Upload","","Edit1")
Sleep(2000)
ControlSetText("File Upload" , "", "Edit1", $CmdLineRaw)
Sleep(2000)
ControlClick("File Upload" , "","Button1");
Build and compile the above script, place the EXE somewhere on disk, and call it when you need it.
Call it right after you click the Browse button:
Process p = System.Diagnostics.Process.Start(txt_Browse.Text + "\\File Upload", DocFileName);
p.WaitForExit();
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have they want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. I need to (using asp and c# in visual studio 2012) get the correct game page on metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning c++ along with c# and asp so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks
This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);
string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
Make sure you add some error handling, because web scraping breaks as soon as the site changes the HTML structure of the page.
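As a rough sketch of that kind of error handling: SelectSingleNode returns null when the XPath no longer matches, so a small helper can return a fallback value instead of throwing. The class and method names here are made up for illustration.

```csharp
using HtmlAgilityPack;

static class SafeScrape
{
    // Returns the trimmed inner text of the first node matching the XPath,
    // or the fallback when the page layout changed and nothing matches.
    public static string TextOrDefault(HtmlDocument doc, string xpath, string fallback)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? fallback : node.InnerText.Trim();
    }
}
```

With this, something like SafeScrape.TextOrDefault(doc, xpath, "n/a") could replace the direct SelectNodes(...)[0].InnerText calls, and a missing score would show up as "n/a" rather than crash the page.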
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System.Net;
using System.IO;
using System.Text; // needed for Encoding.UTF8
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
response = request.GetResponse();
reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
result = reader.ReadToEnd();
}
catch (Exception ex)
{
// handle error
MessageBox.Show(ex.Message);
}
finally
{
if (reader != null)
reader.Close();
if (response != null)
response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: <meta name="og:title" content="In a World...">
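For a tag format like that, one way to pull the values out of the downloaded string is a regular expression. This is only a sketch: it assumes the attribute order shown above (name or property first, then content) and double-quoted values, and the MetaTags class is a name I made up. A real HTML parser is more robust if the markup varies.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class MetaTags
{
    // Matches <meta name="og:xxx" content="yyy"> (or property="og:xxx").
    private static readonly Regex OgTag = new Regex(
        @"<meta\s+(?:name|property)=""(og:[^""]+)""\s+content=""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns a dictionary such as { "og:title" -> "..." } for every og: tag found.
    public static Dictionary<string, string> Parse(string html)
    {
        var tags = new Dictionary<string, string>();
        foreach (Match m in OgTag.Matches(html))
        {
            // Decode entities such as &amp; in the content value.
            tags[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        return tags;
    }
}
```

With the result string from the request above, MetaTags.Parse(result)["og:title"] would give the page title, provided the tag is present.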
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it is familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser. It's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
[Selector("#BirdthDate")]
[Converter(typeof(DateTimeConverter))]
public DateTime BirdthDate { get; set; }
}
// ...
PersonModel person = WebContentParser.Parse<PersonModel>(html);
Nuget link
I want to make a small application that reads the title of the YouTube video currently open in my Firefox or Chrome browser and saves it to a .txt file on my computer.
I need an idea of how to accomplish this. Is it somehow possible to access the tabs open in Firefox or Chrome from C#?
Do you understand me? I want to somehow parse the data from the selected browser tab and save it to a .txt file.
Would I have to use greasemonkey scripts for this?
If the tab is currently active then you could do this in C#:
string browser = "Firefox"; //or change to chrome/iexplore
var browserProc = Process.GetProcessesByName(browser)
.Where(b => b.MainWindowTitle.Contains("YouTube"))
.FirstOrDefault();
if (browserProc != null)
{
string mainTitle = browserProc.MainWindowTitle;
}
You can then parse the relevant parts of mainTitle if you need to.
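As a sketch of that parsing step, assuming the window title looks like "<video title> - YouTube - <browser name>" (the exact layout depends on the browser and version, so treat that as an assumption), you could cut the title at the last " - YouTube" and write it to a file. The TitleSaver class is a hypothetical helper.

```csharp
using System;
using System.IO;

static class TitleSaver
{
    // Keeps everything before the last " - YouTube"; returns the input unchanged
    // when that marker is not found.
    public static string ExtractVideoTitle(string mainTitle)
    {
        int idx = mainTitle.LastIndexOf(" - YouTube", StringComparison.Ordinal);
        return idx > 0 ? mainTitle.Substring(0, idx) : mainTitle;
    }

    // Writes the extracted title to the given text file, overwriting it.
    public static void Save(string mainTitle, string path)
    {
        File.WriteAllText(path, ExtractVideoTitle(mainTitle));
    }
}
```

Inside the if block above you could then call TitleSaver.Save(mainTitle, "title.txt").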
You could use Win32 API calls to do this. FindWindowEx, GetWindowText, etc.
http://msdn.microsoft.com/en-us/library/windows/desktop/ff468919(v=vs.85).aspx
I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, a script runs and collects the data I need in a different format, so my method ends up looking for something that has completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how any of the email addresses is displayed with each option. I think what is happening is that the first shows you the initial load of the page, and the second shows you the final page HTML. WebClient only lets me do option #1.
Here is the code, which only returns option #1. Oh, and I need to do this from a console application. Thank you!
private static string GetReader(string site)
{
    WebClient client = new WebClient();
    StreamReader reader;
    try
    {
        // Downloads the raw response only; no scripts are executed.
        Stream data = client.OpenRead(site);
        reader = new StreamReader(data);
    }
    catch
    {
        return "";
    }
    return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using Selenium-WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is easy to learn, and it helps with testing as well as with this!
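For anyone landing here, a minimal console sketch of the PageSource approach might look like the following. It assumes the Selenium.WebDriver NuGet package and a Chrome driver are installed; ChromeDriver is just one choice of browser. Scripts that finish late may need an explicit wait before PageSource is read, which this sketch leaves out.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class FinalSource
{
    static void Main()
    {
        // Disposing the driver closes the browser when the block ends.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.augsburg.edu/history/fac_listing.html");

            // PageSource reflects the DOM as the browser currently has it,
            // not just the initial download.
            string html = driver.PageSource;
            Console.WriteLine(html.Length);
        }
    }
}
```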
I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores off rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem, so I thought I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
Ok so it appears that my xpaths have tbody's in them. When I remove these tbodys manually from the xpath, HTMLAgilityPack can handle it fine.
I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
Unless my XPath knowledge is heaps flawed (probably), I think the problem is with the /tbody node in your XPath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(@"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = @"//table[@id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine.. returns a
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the html I couldn't find a /tbody.
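That matches how browsers behave: they add a tbody element to the DOM for a table even when the source HTML has none, and browser tools build their XPaths from that DOM, whereas HtmlAgilityPack parses the markup as written. If you copy XPaths from the browser a lot, a small helper can strip the tbody steps. This is only a sketch with a made-up class name, and it would break an XPath for a page whose source really does contain tbody tags.

```csharp
using System;
using System.Text.RegularExpressions;

static class XPathCleaner
{
    // Removes "/tbody" and "/tbody[n]" steps from a browser-generated XPath.
    public static string StripTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }
}
```

For the XPath in the question, XPathCleaner.StripTbody(...) leaves the table, tr and td steps in place and drops only the tbody ones.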