It seems that I'm encountering quite a few problems in a simple attempt to parse some HTML. As practice, I'm writing a multi-threaded web crawler that starts with a list of sites to crawl. This list gets handed down through a few classes, which should eventually return the content of the sites back to my system. This seems rather straightforward, but I've had no luck with either of the following tasks:
A. Convert the content of a website (in string format, from an HttpWebRequest stream) to an HtmlDocument by using the HtmlDocument.Write() method (it seems I cannot create a new instance of an HtmlDocument, which doesn't make much sense to me).
or
B. Collect an HtmlDocument via a WebBrowser instance.
Here is my code as it exists; any advice would be great...
public void Start()
{
    if (this.RunningThread == null)
    {
        Console.WriteLine("Executing SiteCrawler for " + SiteRoot.DnsSafeHost);
        this.RunningThread = new Thread(this.Start);
        this.RunningThread.SetApartmentState(ApartmentState.STA);
        this.RunningThread.Start();
    }
    else
    {
        try
        {
            WebBrowser BrowserEmulator = new WebBrowser();
            BrowserEmulator.Navigate(this.SiteRoot);
            HtmlElementCollection LinkCollection = BrowserEmulator.Document.GetElementsByTagName("a");
            List<PageCrawler> PageCrawlerList = new List<PageCrawler>();
            foreach (HtmlElement Link in LinkCollection)
            {
                PageCrawlerList.Add(new PageCrawler(Link.GetAttribute("href"), true));
            }
            return;
        }
        catch (Exception e)
        {
            throw new Exception("Exception encountered in SiteCrawler: " + e.Message);
        }
    }
}
This code seems to do nothing when it passes over the Navigate method. I've attempted allowing it to open in a new window, which pops a new instance of IE and proceeds to navigate to the specified address, but not before my program steps over the Navigate call. I've tried waiting for the browser to be 'not busy', but it never seems to pick up the busy attribute anyway. I've tried creating a new document via Browser.Document.OpenNew() so that I might populate it with data from a WebRequest stream; however, as I'm sure you can guess, I get a null reference exception when I try to reach through the Document portion of that statement. I've done some research and this appears to be the only way to create a new HtmlDocument.
As you can see, this method is intended to kick off a PageCrawler for every link in a specified page. I am sure that I could parse through the HTML character by character to find all of the links, after using an HttpWebRequest and collecting the data from the stream, but this is far more work than should be necessary.
If anyone has any advice it would be greatly appreciated. Thank you.
If this is a console application, then it will not work since the console application doesn't have a message pump (which is required for the WebBrowser to process messages).
If you run this in a Windows Forms application, then you should handle the DocumentCompleted event:
WebBrowser browserEmulator = new WebBrowser();
browserEmulator.DocumentCompleted += OnDocumentCompleted;
browserEmulator.Navigate(this.SiteRoot);
Then implement the method that handles the event:
private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = sender as WebBrowser;
    if (wb.Document != null)
    {
        List<string> links = new List<string>();
        foreach (HtmlElement element in wb.Document.GetElementsByTagName("a"))
        {
            links.Add(element.GetAttribute("href"));
        }
        foreach (string link in links)
        {
            Console.WriteLine(link);
        }
    }
}
If you want to run this in a console application, then you need to use a different method for downloading pages. I would recommend that you use the WebRequest/WebResponse and then use the HtmlAgilityPack to parse the HTML. The HtmlAgilityPack will generate an HtmlDocument for you and you can get the links from there.
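As a rough sketch of that approach (the URL is only a placeholder, and it assumes the HtmlAgilityPack package is referenced), collecting the links from a page in a console application might look something like this:
using System;
using HtmlAgilityPack;

class LinkDumper
{
    static void Main()
    {
        // HtmlWeb wraps the WebRequest/WebResponse plumbing and returns a parsed HtmlDocument.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com"); // placeholder URL

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (var anchor in anchors)
            {
                Console.WriteLine(anchor.GetAttributeValue("href", string.Empty));
            }
        }
    }
}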
Additionally, if you're interested in learning more about building scalable web crawlers, then check out the following links:
How to crawl billions of pages?
Designing a web crawler
Good luck!
I am wondering if it's possible to display the first image of a Google picture search in a Visual Studio Windows Forms app.
The way I imagine this would work is that a person enters a string, then the app googles the string, copies the first image, and then displays it in the app itself.
Thank you.
EDIT: Please consider that I am a beginner in C# programming, so if you are going to use some difficult code or suggest using some APIs, could you please explain in more detail how to do so? Thank you.
Short answer, Yes.
We know the URL to get an image is
https://www.google.co.uk/search?q=plane&tbm=isch&site=imghp
On the Form, create a PictureBox (call it pbImage), a TextBox (call it tbSearch), and a Button (call it btnLookup).
Using the NuGet Package Manager (Tools -> NuGet.. -> Manage..), select Browse and search for HtmlAgilityPack. Click your project on the right and then click Install.
When we send a request to Google using System.Net.WebClient, no JavaScript is executed (although this can be done with some trickery using the WinForms WebBrowser).
Because there is no JavaScript, the page is rendered differently from what you are used to. Inspecting the page without JavaScript shows the following structure:
Within the document body there is a table with a class called 'images_table'.
Within that table we can find several img elements.
Here is a code listing:
private void btnLookup_Click(object sender, EventArgs e)
{
    string templateUrl = @"https://www.google.co.uk/search?q={0}&tbm=isch&site=imghp";
    //check that we have a term to search for.
    if (string.IsNullOrEmpty(tbSearch.Text))
    {
        MessageBox.Show("Please supply a search term");
        return;
    }
    else
    {
        using (WebClient wc = new WebClient())
        {
            //lets pretend we are IE8 on Vista.
            wc.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)");
            string result = wc.DownloadString(String.Format(templateUrl, new object[] { tbSearch.Text }));
            //we have valid markup, this will change from time to time as google updates.
            if (result.Contains("images_table"))
            {
                HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                doc.LoadHtml(result);
                //lets create a linq query to find all the img's stored in that images_table class.
                /*
                 * Essentially we search for the table with the images_table class, and then get all images that have a valid src containing images?
                 * which is the string used by google
                 * eg https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQmGxh15UUyzV_HGuGZXUxxnnc6LuqLMgHR9ssUu1uRwy0Oab9OeK1wCw
                 */
                var imgList = from tables in doc.DocumentNode.Descendants("table")
                              from img in tables.Descendants("img")
                              where tables.Attributes["class"] != null && tables.Attributes["class"].Value == "images_table"
                              && img.Attributes["src"] != null && img.Attributes["src"].Value.Contains("images?")
                              select img;
                byte[] downloadedData = wc.DownloadData(imgList.First().Attributes["src"].Value);
                if (downloadedData != null)
                {
                    //load the downloaded bytes into a memory stream and create an image from it.
                    System.IO.MemoryStream ms = new System.IO.MemoryStream(downloadedData, 0, downloadedData.Length);
                    pbImage.Image = Image.FromStream(ms);
                }
            }
        }
    }
}
Using System.Net.WebClient, a request is sent to Google using the URL specified in the template string.
Adding headers makes the request look more genuine. WebClient downloads the markup, which is stored in result.
An HtmlAgilityPack.HtmlDocument is created as the document object, and the markup stored in result is loaded into it.
A LINQ query obtains the img elements; taking the first one in that list, we download its data and store it in a byte array.
From that data a memory stream is created, and that stream is loaded into the picture box's Image property. (The stream is intentionally not wrapped in a using block, because it must stay open for the lifetime of the Image.)
I want to work on a scraper program that will search for a keyword on Google. I am having trouble starting my scraper program.
My problem is:
Suppose a Windows application (C#) has two textboxes and a button control. The first textbox contains "www.google.com" and the second textbox contains the keyword, for example:
textbox1: www.google.com
textbox2: "cricket"
I want code to add to the button click event that will search for "cricket" on Google. If anyone has a programming idea in C#, please help me.
Best regards
I have googled my problem and found a solution to the above problem...
We can use the Google API for this purpose. When we add a reference to the Google API, we add the following namespace to our program:
using Google.API.Search;
Write the following code in the button click event:
var client = new GwebSearchClient("http://www.google.com");
var results = client.Search("google api for .NET", 100);
foreach (var webResult in results)
{
    //Console.WriteLine("{0}, {1}, {2}", webResult.Title, webResult.Url, webResult.Content);
    listBox1.Items.Add(webResult.ToString());
}
Test my solution and give comments. Thanks, everybody.
I agree with Paqogomez that you don't appear to have put much work into this but I also understand that it can be hard to get started. Here is some sample code that should get you on the right path.
// Requires: using System.Net; and using System.Collections.Specialized;
private void button1_Click(object sender, EventArgs e)
{
    string uriString = "http://www.google.com/search";
    string keywordString = "Test Keyword";
    WebClient webClient = new WebClient();
    NameValueCollection nameValueCollection = new NameValueCollection();
    nameValueCollection.Add("q", keywordString);
    webClient.QueryString.Add(nameValueCollection);
    textBox1.Text = webClient.DownloadString(uriString);
}
This code will search for "Test Keyword" on Google and return the results as a string.
The problem with what you are asking is that Google is going to return the result as HTML, which you will need to parse. I really think you need to do some research on the Google API and what is needed to programmatically request data from Google. Start your search at Google Developers.
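If you do go the official route, the Custom Search JSON API returns results as JSON rather than HTML. A minimal sketch (the API key and search engine ID are placeholders you would obtain from the Google Developers console, and a JSON library would still be needed to parse the response):
using System;
using System.Net;

string apiKey = "YOUR_API_KEY";          // placeholder
string searchEngineId = "YOUR_CX_ID";    // placeholder
string query = Uri.EscapeDataString("cricket");

using (WebClient client = new WebClient())
{
    // The Custom Search JSON API returns the results as a JSON document.
    string json = client.DownloadString(
        "https://www.googleapis.com/customsearch/v1?key=" + apiKey +
        "&cx=" + searchEngineId + "&q=" + query);
    Console.WriteLine(json);
}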
Hope this helps get you started on the right path.
You can use the WebClient class and its DownloadString method for searches. Use a regex to match URLs in the result string.
For example:
WebClient Web = new WebClient();
string Source = Web.DownloadString("https://www.google.com/search?client=" + textbox2.Text);
Regex regex = new Regex(@"^http(s)?://([\w-]+.)+[\w-]+(/[\w%&=])?$");
MatchCollection Collection = regex.Matches(Source);
List<string> Urls = new List<string>();
foreach (Match match in Collection)
{
    Urls.Add(match.ToString());
}
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have they want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. I need to (using ASP.NET and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display that data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks.
This small example uses HtmlAgilityPack and XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
Make sure you use some error handling, because web scraping can break whenever the site changes the HTML structure of the page.
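For example, SelectNodes returns null when an XPath no longer matches anything, so a defensive version of the metascore lookup (with a shortened, hypothetical XPath) would look something like this:
var metascoreNodes = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]//a/span[1]");
string metascore = (metascoreNodes != null && metascoreNodes.Count > 0)
    ? metascoreNodes[0].InnerText
    : "N/A"; // fall back instead of throwing when the markup changes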
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System.Net;
using System.IO;
using System.Text;
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();
    reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
    result = reader.ReadToEnd();
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: <meta name="og:title" content="In a World..." />
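As a rough sketch of pulling one of those values out of the downloaded string (this assumes the attribute layout shown above and needs a using System.Text.RegularExpressions; directive):
// Extract the content attribute of the og:title meta tag from the page source.
Match titleMatch = Regex.Match(result, "<meta name=\"og:title\" content=\"([^\"]*)\"");
string title = titleMatch.Success ? titleMatch.Groups[1].Value : null;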
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it is familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser - it's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
    [Selector("#BirdthDate")]
    [Converter(typeof(DateTimeConverter))]
    public DateTime BirdthDate { get; set; }
}

// ...

PersonModel person = WebContentParser.Parse<PersonModel>(html);
Nuget link
There are a lot of threads about this, but none of them were clear and none of the ones I tried actually worked right.
What is the code to get the contents of the entire web browser control (even that which is off screen)?
It looks like they did have:
webBrowser1.DrawToBitmap(); // but it's unsupported and doesn't work
3rd party API - not wasting my time
.DrawToBitmap and non-answer links
100 wrong answers
just takes a screenshot
Try to make sure you are calling the method in the DocumentCompleted event.
webBrowser1.Width = webBrowser1.Document.Body.ScrollRectangle.Width;
webBrowser1.Height = webBrowser1.Document.Body.ScrollRectangle.Height;
Bitmap bitmap = new Bitmap(webBrowser1.Width, webBrowser1.Height);
webBrowser1.DrawToBitmap(bitmap, new Rectangle(0, 0, webBrowser1.Width, webBrowser1.Height));
I was working on a similar function in my project last week and read a few posts on this topic, including your links. I'd like to share my experience:
The key part of this function is System.Windows.Forms.WebBrowser.DrawToBitmap method.
but it's unsupported and doesn't work
It is supported and does work, but not always well. In some circumstances you will get a blank screenshot (in my experience, the more complex the HTML it loads, the more likely it is to fail; in my project only very simple, well-formatted HTML is loaded into the WebBrowser control, so I never get blank images).
Anyway, I have no 100% perfect solution either. Here is part of my core code; I hope it helps (it works on ASP.NET MVC 3).
using (var browser = new System.Windows.Forms.WebBrowser())
{
    browser.DocumentCompleted += delegate
    {
        using (var pic = new Bitmap(browser.Width, browser.Height))
        {
            browser.DrawToBitmap(pic, new Rectangle(0, 0, pic.Width, pic.Height));
            pic.Save(imagePath);
        }
    };
    browser.Navigate(Server.MapPath("~") + htmlPath); //a file or a url
    browser.ScrollBarsEnabled = false;
    while (browser.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
    {
        System.Windows.Forms.Application.DoEvents();
    }
}
I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but it only returns the initial page source. After the source downloads, a JavaScript runs and collects the data I need in a different format, so my method ends up looking for something that has been completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and watch how the email addresses are displayed using each option. I think what's happening is that the first shows you the initial load of the page, while the second shows you the final page HTML. The WebClient only lets me do option #1.
Here is the code, which will only return option #1. Oh, and I need to do this from a console application. Thank you!
// Requires: using System.Net; and using System.IO;
private static string GetReader(string site)
{
    WebClient client = new WebClient();
    Stream data;
    StreamReader reader;
    try
    {
        data = client.OpenRead(site);
        reader = new StreamReader(data);
    }
    catch
    {
        return "";
    }
    return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using the Selenium WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is an easy thing to learn, it helps with testing, and it solves this problem!
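For anyone looking for a starting point, here is a minimal sketch of that approach (it assumes the Selenium.WebDriver and ChromeDriver NuGet packages are installed; any other driver would work the same way):
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class FinalSourceDumper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.augsburg.edu/history/fac_listing.html");
            // PageSource reflects the DOM after the page's scripts have run,
            // not just the HTML that was originally downloaded.
            Console.WriteLine(driver.PageSource);
        }
    }
}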