I want to write a scraper program that will search for a keyword on Google, but I'm having trouble getting started.
My problem is this:
Suppose a Windows application (C#) has two textboxes and a button control. The first textbox contains "www.google.com" and the second textbox contains a keyword, for example:
textbox1: www.google.com
textbox2: "cricket"
I want code to add to the button click event that will search for "cricket" on Google. If anyone has a programming idea in C#, please help me.
Best regards
I have googled my problem and found a solution to it.
We can use the Google API for this purpose. After adding a reference to the Google API, add the following namespace to the program:
using Google.API.Search;
Write the following code in the button click event:
var client = new GwebSearchClient("http://www.google.com");
var results = client.Search("google api for .NET", 100);
foreach (var webResult in results)
{
    //Console.WriteLine("{0}, {1}, {2}", webResult.Title, webResult.Url, webResult.Content);
    listBox1.Items.Add(webResult.ToString());
}
Test my solution and give comments. Thanks, everybody.
I agree with Paqogomez that you don't appear to have put much work into this but I also understand that it can be hard to get started. Here is some sample code that should get you on the right path.
// Requires: using System.Net; and using System.Collections.Specialized;
private void button1_Click(object sender, EventArgs e)
{
    string uriString = "http://www.google.com/search";
    string keywordString = "Test Keyword";

    WebClient webClient = new WebClient();
    NameValueCollection nameValueCollection = new NameValueCollection();
    nameValueCollection.Add("q", keywordString);

    webClient.QueryString.Add(nameValueCollection);
    textBox1.Text = webClient.DownloadString(uriString);
}
This code will search for "Test Keyword" on Google and return the results as a string.
The problem with what you are asking is that Google is going to return your result as HTML, which you will need to parse. I really think you need to do some research on the Google API and what is needed to programmatically request data from Google. Start your search here: Google Developers.
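If you do go the official route, the Custom Search JSON API is queried with a plain HTTP GET. Here is a rough sketch, not a definitive implementation; YOUR_API_KEY and YOUR_SEARCH_ENGINE_ID are placeholders you would get from the Google developer console:

using System;
using System.Net;

// Placeholders - you need your own API key and custom search engine ID.
string apiKey = "YOUR_API_KEY";
string cx = "YOUR_SEARCH_ENGINE_ID";
string query = Uri.EscapeDataString("cricket");

string requestUrl = string.Format(
    "https://www.googleapis.com/customsearch/v1?key={0}&cx={1}&q={2}",
    apiKey, cx, query);

// The response is JSON; parse it with a JSON library such as Json.NET.
string json = new WebClient().DownloadString(requestUrl);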
Hope this helps get you started on the right path.
You can use the WebClient class and its DownloadString method for searches, then use a regex to match URLs in the result string. For example:
For example:
// Requires: using System.Net;, using System.Text.RegularExpressions; and using System.Collections.Generic;
WebClient Web = new WebClient();
string Source = Web.DownloadString("https://www.google.com/search?q=" + textbox2.Text);

// Match absolute http/https URLs inside the downloaded HTML.
Regex regex = new Regex(@"http(s)?://([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?");
MatchCollection Collection = regex.Matches(Source);

List<string> Urls = new List<string>();
foreach (Match match in Collection)
{
    Urls.Add(match.ToString());
}
I am trying to parse a Google Play Store HTML page in C# (.NET Core). Unfortunately, Google does not provide APIs to get mobile application info (such as version, last update, ...), while Apple does. This is why I am trying to parse the HTML page and then get the info needed.
However, it seems they published a new version recently, where a user has to press on an arrow button to be able to see the info of the app displayed in a popup.
In order to understand more, consider the example of WhatsApp application: https://play.google.com/store/apps/details?id=com.whatsapp&hl=en
In order to get the info for this app (like release date, version, ...), the user now has to press on the arrow near "About this app".
Previously, the below code was working perfectly:
var id = "com.whatsapp";
var language = "en";
var url = string.Format("https://play.google.com/store/apps/details?id={0}&hl={1}", id, language);
string result;
WebClient client = new WebClient();
client.Encoding = System.Text.UTF8Encoding.UTF8;
result = client.DownloadString(url);
MatchCollection matches = Regex.Matches(result,
    "<div class=\"hAyfc\">.*?<span class=\"htlgb\"><div class=\"IQ1z0d\"><span class=\"htlgb\">" +
    "(?<content>.*?)</span></div></span></div>");
objAndroidDetails.updated = matches[0].Groups["content"].Value;
objAndroidDetails.version = matches[3].Groups["content"].Value;
...
But now, it's not the case anymore for two reasons:
The regular expression is not valid anymore
The client.DownloadString(url) call downloads only the HTML from before the button is triggered to display the info, so I will not be able to extract it because it's not available.
So, can anybody help me solve issue #2? I need to somehow trigger the button in order to be able to match the HTML code needed and extract it.
Thanks
I am in the same situation as the guy who asked this question. I need to get some data from a website saved as a string.
My problem here is that the website I need to save data from requires the user to be logged in to view the data.
So my plan was to make the user go to the website using the WebBrowser control, log in, and, once the user is on the right page, click a button which will automatically save the data.
I want to use a method similar to the one used in the top answer of the other question that I linked to at the start.
string data = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
I tried doing things like this:
string data = webBrowser1.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
But you can't do "webBrowser1.DocumentNode.SelectNodes"
I also saw that the answer on the other question says he uses HtmlAgilityPack, but I tried to download it and I have no idea what to do with it.
I'm not the best with C#, so please don't post overly complicated answers, or at least try to make them understandable.
Thanks in advance :)
Here is an example of HtmlAgilityPack usage:
public string GetData(string htmlContent)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(htmlContent);

    if (htmlDoc.DocumentNode != null)
    {
        string data = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
        if (!string.IsNullOrEmpty(data))
            return data;
    }

    return null;
}
Edit: If you want to emulate some actions in the browser, I would suggest you use Selenium instead of the regular WebBrowser control. Here is the link where you can download it: http://www.seleniumhq.org/ (or use NuGet to get it). This is a good question on how to use it: How do I use Selenium in C#?
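If you go the Selenium route, a minimal sketch looks roughly like this (assuming the Selenium.WebDriver and ChromeDriver NuGet packages are installed; the XPath is just the one from the question and will need adjusting for your page, and the URL is a hypothetical placeholder):

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Start a real browser session so the user can log in manually first.
using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://example.com/page-with-data"); // hypothetical URL
    // Wait here (or use WebDriverWait) until the user has logged in and the right page is loaded.
    IWebElement node = driver.FindElement(
        By.XPath("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]"));
    string data = node.Text; // the rendered text of the element, after any JavaScript has run
}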
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have they want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. I need to (using ASP.NET and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks
This small example uses HtmlAgilityPack and XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);

    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
Make sure you use some error-handling techniques, because web scraping can fail if the site changes the HTML structure of the page.
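For example, one defensive way to wrap the lookups above could look like this (just a sketch; the XPath is the same one used in the code block):

string metascore = null;
try
{
    var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]");
    // SelectNodes returns null when nothing matches, e.g. after a site redesign.
    if (nodes != null && nodes.Count > 0)
        metascore = nodes[0].InnerText;
}
catch (Exception ex)
{
    // Log and fall back gracefully instead of letting the page crash.
    System.Diagnostics.Trace.WriteLine("Scraping failed: " + ex.Message);
}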
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System.Net;
using System.IO;
using System.Text;
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();
    reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
    result = reader.ReadToEnd();
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: <meta name="og:title" content="In a World..." />
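As a rough sketch (assuming the page keeps that exact name/content attribute order), a regular expression can pull a given tag's content out of the downloaded string; GetMetaContent is just a hypothetical helper name:

using System.Text.RegularExpressions;

// Extracts the content attribute of a named meta tag, e.g. "og:title".
// Returns null if the tag is not present.
static string GetMetaContent(string html, string name)
{
    Match m = Regex.Match(html,
        "<meta\\s+name=\"" + Regex.Escape(name) + "\"\\s+content=\"(?<content>[^\"]*)\"",
        RegexOptions.IgnoreCase);
    return m.Success ? m.Groups["content"].Value : null;
}

// Usage, where 'result' is the string downloaded above:
// string title = GetMetaContent(result, "og:title");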
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it will be familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API is here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser - it's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
    [Selector("#BirdthDate")]
    [Converter(typeof(DateTimeConverter))]
    public DateTime BirdthDate { get; set; }
}

// ...

PersonModel person = WebContentParser.Parse<PersonModel>(html);
Nuget link
I'm trying to get the FINAL source of a webpage. I am using the WebClient OpenRead method, but this method only returns the initial page source. After the source downloads, JavaScript runs and collects the data that I need in a different format, so my method ends up looking for something that has completely changed.
What I am talking about is exactly like the difference between:
right-click on a webpage -> select view source
access the developer tools
Look at this site to see what I am talking about: http://www.augsburg.edu/history/fac_listing.html and notice how the email addresses are displayed using each option. I think what's happening is that the first shows you the initial load of the page, while the second shows you the final page HTML. The WebClient only lets me do option #1.
Here is the code, which will only return option #1. Also, I need to do this from a console application. Thank you!
// Requires: using System.IO; and using System.Net;
private static string GetReader(string site)
{
    WebClient client = new WebClient();
    StreamReader reader = null;
    try
    {
        Stream data = client.OpenRead(site);
        reader = new StreamReader(data);
    }
    catch
    {
        return "";
    }
    return reader.ReadToEnd();
}
I've found a solution to my problem.
I ended up using Selenium-WebDriver PageSource property. It worked beautifully!
Learn about Selenium and WebDriver. It is an easy thing to learn, and it helps with testing as well as with tasks like this!
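For anyone looking for the concrete call, a minimal sketch (assuming the Selenium.WebDriver and ChromeDriver NuGet packages are installed) could look like this:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://www.augsburg.edu/history/fac_listing.html");
    // PageSource returns the DOM after the page's scripts have run,
    // unlike WebClient, which only sees the raw initial HTML.
    string finalHtml = driver.PageSource;
}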
It seems that I'm encountering quite a few problems in a simple attempt to parse some HTML. As practice, I'm writing a multi-threaded web crawler that starts with a list of sites to crawl. This gets handed down through a few classes, which should eventually return the content of the sites back to my system. This seems rather straightforward, but I've had no luck with either of the following tasks:
A. Converting the content of a website (in string format, from an HttpWebRequest stream) to an HtmlDocument (you cannot create a new instance of an HtmlDocument? Doesn't make much sense...) by using the HtmlDocument.Write() method.
or
B. Collecting an HtmlDocument via a WebBrowser instance.
Here is my code as it exists, any advice would be great...
public void Start()
{
    if (this.RunningThread == null)
    {
        Console.WriteLine("Executing SiteCrawler for " + SiteRoot.DnsSafeHost);
        this.RunningThread = new Thread(this.Start);
        this.RunningThread.SetApartmentState(ApartmentState.STA);
        this.RunningThread.Start();
    }
    else
    {
        try
        {
            WebBrowser BrowserEmulator = new WebBrowser();
            BrowserEmulator.Navigate(this.SiteRoot);
            HtmlElementCollection LinkCollection = BrowserEmulator.Document.GetElementsByTagName("a");

            List<PageCrawler> PageCrawlerList = new List<PageCrawler>();
            foreach (HtmlElement Link in LinkCollection)
            {
                PageCrawlerList.Add(new PageCrawler(Link.GetAttribute("href"), true));
                continue;
            }

            return;
        }
        catch (Exception e)
        {
            throw new Exception("Exception encountered in SiteCrawler: " + e.Message);
        }
    }
}
This code seems to do nothing when it passes over the 'Navigate' method. I've attempted allowing it to open in a new window, which pops a new instance of IE and proceeds to navigate to the specified address, but not before my program steps over the Navigate method. I've tried waiting for the browser to be 'not busy', but it never seems to pick up the busy attribute anyway. I've tried creating a new document via Browser.Document.OpenNew() so that I might populate it with data from a WebRequest stream; however, as I'm sure you can guess, I get back a null reference exception when I try to reach through the 'Document' portion of that statement. I've done some research and this appears to be the only way to create a new HtmlDocument.
As you can see, this method is intended to kick off a 'PageCrawler' for every link in a specified page. I am sure that I could parse through the HTML character by character to find all of the links, after using an HttpWebRequest and collecting the data from the stream, but this is far more work than should be necessary to complete this.
If anyone has any advice it would be greatly appreciated. Thank you.
If this is a console application, then it will not work since the console application doesn't have a message pump (which is required for the WebBrowser to process messages).
If you run this in a Windows Forms application, then you should handle the DocumentCompleted event:
WebBrowser browserEmulator = new WebBrowser();
browserEmulator.DocumentCompleted += OnDocCompleted;
browserEmulator.Navigate(this.SiteRoot);
Then implement the method that handles the event:
private void OnDocCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = sender as WebBrowser;
    if (wb.Document != null)
    {
        List<string> links = new List<string>();
        foreach (HtmlElement element in wb.Document.GetElementsByTagName("a"))
        {
            links.Add(element.GetAttribute("href"));
        }
        foreach (string link in links)
        {
            Console.WriteLine(link);
        }
    }
}
If you want to run this in a console application, then you need to use a different method for downloading pages. I would recommend that you use WebRequest/WebResponse and then use the HtmlAgilityPack to parse the HTML. The HtmlAgilityPack will generate an HtmlDocument for you, and you can get the links from there.
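A rough console-friendly sketch of that approach (assuming the HtmlAgilityPack NuGet package is installed; the start URL is just a placeholder):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Download the page with WebRequest/WebResponse.
string html;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/"); // placeholder start URL
using (WebResponse response = request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd();
}

// Parse it with HtmlAgilityPack and collect every link.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (HtmlNode anchor in anchors)
    {
        Console.WriteLine(anchor.GetAttributeValue("href", ""));
    }
}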
Additionally, if you're interested in learning more about building scalable web crawlers, then check out the following links:
How to crawl billions of pages?
Designing a web crawler
Good luck!