Select a random link from generated search results in C#

I am making a WPF application and I need it to select a link at random from the generated search results. I have no idea how to go about doing that; it is just an intellectual exercise I was assigned. Please help, I am almost done. Here is the code so far. I am a super beginner at WPF.
namespace Search
{
/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
}
private void Btn_Click(object sender, RoutedEventArgs e)
{
using (var browser = new IE("http://www.google.com"))
{
browser.TextField(Find.ByName("q")).TypeText(_textBox.Text);
browser.Button(Find.ByName("btnG")).Click();
browser.WaitForComplete(5000);
System.Windows.Forms.SendKeys.SendWait("{Enter}"); // presses search on the second screen
browser.Button(Find.ById("gbqfb")/*.ByName("btnG")*/).Click(); // doesn't work
}
}
}
}

Here's some indicative code...
// Needs using System, System.Linq, System.Net and the HtmlAgilityPack package.
private void DownloadRandomLink(string searchTerm)
{
    // Build the search URL as a query string (a "#q=" fragment is never sent to the server).
    string fullUrl = "http://www.google.com/search?q=" + Uri.EscapeDataString(searchTerm);
    using (var wc = new WebClient())
    {
        wc.DownloadFile(fullUrl, "file.htm");
    }
    Random rand = new Random();
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load("file.htm");
    var linksOnPage = (from lnks in doc.DocumentNode.Descendants()
                       where lnks.Name == "a" &&
                             lnks.Attributes["href"] != null &&
                             lnks.InnerText.Trim().Length > 0
                       select new
                       {
                           Url = lnks.Attributes["href"].Value,
                           Text = lnks.InnerText
                       }).ToList();
    if (linksOnPage.Count > 0)
    {
        int randomChoice = rand.Next(linksOnPage.Count); // upper bound is exclusive, so every link is eligible
        var link = linksOnPage[randomChoice];
        // do something with link...
    }
}
This code takes a search term and builds a full Google url. It then downloads the query into a local file, and opens the file with HTML Agility Pack.
Then the code collects all the links on the page and picks one of them at random.
As others have mentioned, you will need to get Google's permission to run the code against their servers. Not doing so places you in breach of their terms and may have awkward consequences.
Also, this code is indicative; it's not meant to be exemplary, or even buildable. It's a rough idea of the steps needed to get what you are after.
Your earlier design attempted to interact with controls on Google's index page, and that kind of approach is far too brittle. You can hardly test it, for starters.
The HTML Agility Pack is here http://htmlagilitypack.codeplex.com/wikipage?title=Examples
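For reference, here is a minimal sketch of reading the page straight into the Agility Pack via its HtmlWeb class, which skips the temporary file; it assumes the HtmlAgilityPack NuGet package plus using System, System.Linq and HtmlAgilityPack, and is indicative in the same spirit as the code above.
private static readonly Random Rand = new Random();
private static string PickRandomLink(string url)
{
    var doc = new HtmlWeb().Load(url);                        // download and parse in a single call
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]"); // every anchor that has an href
    if (anchors == null) return null;
    var hrefs = anchors.Select(a => a.GetAttributeValue("href", ""))
                       .Where(h => !string.IsNullOrWhiteSpace(h))
                       .ToList();
    if (hrefs.Count == 0) return null;
    return hrefs[Rand.Next(hrefs.Count)];                     // uniform random pick
}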

Scraping html list data from a dynamic server

Hello guys!
Sorry for the dumb question, this is my last resort. I swear I tried countless other Stack Overflow questions, different frameworks, etc., but those didn't seem to help.
I have the following problem:
A website displays a list of data (there are a TON of div, li, span etc. tags in front of it; it's a big HTML document).
I'm writing a tool that fetches data from a specific list buried inside a ton of other div tags, downloads it and outputs an Excel file.
The website I'm trying to access is dynamic: you open it, it loads for a moment, and then the list appears (probably some JS and such).
When I try to download the website via a WebRequest in C#, the HTML I get is almost empty, with a ton of whitespace, lots of non-HTML stuff and some garbage data as well.
Now: I'm fairly comfortable with C#, HtmlAgilityPack and countless other libraries, but not so much with web-related stuff. I tried CefSharp, Chromium etc., but unfortunately couldn't get them to work properly.
I want to have HTML in my program to work with that looks exactly like the HTML you see when
you open the dev console in Chrome while visiting the website mentioned above.
The HTML parser works flawlessly there.
This is how I imagine the code could look, simplified.
Extreme C# pseudocode:
WebBrowserEngine web = new WebBrowserEngine()
web.LoadURLuntilFinished(url); // with all the JS executed and stuff
String html = web.getHTML();
web.close();
My goal would be for the string html in the pseudocode to look exactly like the one in the Chrome dev tab.
Maybe there is a solution posted somewhere else, but I swear I couldn't find it; I've been looking for days.
Any help is greatly appreciated.
@SpencerBench is spot on in saying
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically.
To answer the question for your specific use case, we need to understand the behaviour of the page you want to scrape data from, or as I asked in the comments, how do you know the page is "finished"?
However, it's possible to give a fairly generic answer to the question which should act as a starting point for you.
This answer uses Selenium, a package which is commonly used for automating testing of web UIs, but as they say on their home page, that's not the only thing it can be used for.
Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well.
The web site I'm scraping
So first we need a web site. I've created one using ASP.NET Core MVC with .NET Core 3.1. The web site's technology stack isn't important, though; it's the behaviour of the page you want to scrape that matters. This site has 2 pages, unimaginatively called Page1 and Page2.
Page controllers
There's nothing special in these controllers:
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
public class Page1Controller : Controller
{
public IActionResult Index()
{
return View("Page1");
}
}
}
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
public class Page2Controller : Controller
{
public IActionResult Index()
{
return View("Page2");
}
}
}
API controller
There's also an API controller (i.e. it returns data rather than a view) which the views can call asynchronously to get some data to display. This one just creates an array of the requested number of random strings.
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
using System;
using System.Collections.Generic;
using System.Text;
[Route("api/[controller]")]
[ApiController]
public class DataController : ControllerBase
{
[HttpGet("Create")]
public IActionResult Create(int numberOfElements)
{
var response = new List<string>();
for (var i = 0; i < numberOfElements; i++)
{
response.Add(RandomString(10));
}
return Ok(response);
}
private string RandomString(int length)
{
var sb = new StringBuilder();
var random = new Random();
for (var i = 0; i < length; i++)
{
var characterCode = random.Next(65, 91); // A-Z (upper bound is exclusive)
sb.Append((char)characterCode);
}
return sb.ToString();
}
}
}
Views
Page1's view looks like this:
@{
ViewData["Title"] = "Page 1";
}
<div class="text-center">
<div id="list" />
<script src="~/lib/jquery/dist/jquery.min.js"></script>
<script>
var apiUrl = 'https://localhost:44394/api/Data/Create';
$(document).ready(function () {
$('#list').append('<li id="loading">Loading...</li>');
$.ajax({
url: apiUrl + '?numberOfElements=20000',
datatype: 'json',
success: function (data) {
$('#loading').remove();
var insert = ''
for (var item of data) {
insert += '<li>' + item + '</li>';
}
insert = '<ul id="results">' + insert + '</ul>';
$('#list').html(insert);
},
error: function (xht, status) {
alert('Error: ' + status);
}
});
});
</script>
</div>
So when the page first loads, it just contains an empty div called list; however, loading the page triggers the function passed to jQuery's $(document).ready function, which makes an asynchronous call to the API controller, requesting an array of 20,000 elements. While the call is in progress, "Loading..." is displayed on the screen, and when the call returns, this is replaced by an unordered list containing the received data. This is written in a way intended to be friendly to developers of automated UI tests, or of screen scrapers, because we can tell whether all the data has loaded by testing whether or not the page contains an element with the ID results.
Page2's view looks like this:
@{
ViewData["Title"] = "Page 2";
}
<div class="text-center">
<div id="list">
<ul id="results" />
</div>
<script src="~/lib/jquery/dist/jquery.min.js"></script>
<script>
var apiUrl = 'https://localhost:44394/api/Data/Create';
var requestCount = 0;
var maxRequests = 20;
$(document).ready(function () {
getData();
});
function getDataIfAtBottomOfPage() {
console.log("scroll - " + requestCount + " requests");
if (requestCount < maxRequests) {
console.log("scrollTop " + document.documentElement.scrollTop + " scrollHeight " + document.documentElement.scrollHeight);
if (document.documentElement.scrollTop > (document.documentElement.scrollHeight - window.innerHeight - 100)) {
getData();
}
}
}
function getData() {
window.onscroll = undefined;
requestCount++;
$('#results').append('<li id="loading">Loading...</li>');
$.ajax({
url: apiUrl + '?numberOfElements=50',
datatype: 'json',
success: function (data) {
var insert = ''
for (var item of data) {
insert += '<li>' + item + '</li>';
}
$('#loading').remove();
$('#results').append(insert);
if (requestCount < maxRequests) {
window.setTimeout(function () { window.onscroll = getDataIfAtBottomOfPage }, 1000);
} else {
$('#results').append('<li>That\'s all folks</li>');
}
},
error: function (xht, status) {
alert('Error: ' + status);
}
});
}
</script>
</div>
This gives a nicer user experience because it requests data from the API controller in multiple smaller chunks, so the first chunk of data appears fairly quickly, and once the user has scrolled down to somewhere near the bottom of the page, the next chunk of data is requested, until 20 chunks have been requested and displayed, at which point the text "That's all folks" is added to the end of the unordered list. However this is more difficult to interact with programmatically because you need to scroll the page down to make the new data appear.
(Yes, this implementation is a bit buggy - if the user gets to the bottom of the page too quickly then requesting the next chunk of data doesn't happen until they scroll up a bit. But the question isn't about how to implement this behaviour in a web page, but about how to scrape the displayed data, so please forgive my bugs.)
The scraper
I've implemented the scraper as a xUnit unit test project, just because I'm not doing anything with the data I've scraped from the web site other than Asserting that it is of the correct length, and therefore proving that I haven't prematurely assumed that the web page I'm scraping from is "finished". You can put most of this code (other than the Asserts) into any type of project.
Having created your scraper project, you need to add the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages.
Page Object Model
I'm using the Page Object Model pattern to provide a layer of abstraction between functional interaction with the page and the implementation detail of how to code that interaction. Each of the pages in the web site has a corresponding page model class for interacting with that page.
First, a base class with some code which is common to more than one page model class.
namespace StackOverflow68925623Scraper
{
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
public class PageModel
{
protected PageModel(IWebDriver driver)
{
this.Driver = driver;
}
protected IWebDriver Driver { get; }
public void ScrollToTop()
{
var js = (IJavaScriptExecutor)this.Driver;
js.ExecuteScript("window.scrollTo(0, 0)");
}
public void ScrollToBottom()
{
var js = (IJavaScriptExecutor)this.Driver;
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)");
}
protected IWebElement GetById(string id)
{
try
{
return this.Driver.FindElement(By.Id(id));
}
catch (NoSuchElementException)
{
return null;
}
}
protected IWebElement AwaitGetById(string id)
{
var wait = new WebDriverWait(Driver, TimeSpan.FromSeconds(10));
return wait.Until(e => e.FindElement(By.Id(id)));
}
}
}
This base class gives us 4 convenience methods:
Scroll to the top of the page
Scroll to the bottom of the page
Get the element with the supplied ID, or return null if it doesn't exist
Get the element with the supplied ID, or wait for up to 10 seconds for it to appear if it doesn't exist yet
And each page in the web site has its own model class, derived from that base class.
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium;
public class Page1Model : PageModel
{
public Page1Model(IWebDriver driver) : base(driver)
{
}
public IWebElement AwaitResults => this.AwaitGetById("results");
public void Navigate()
{
this.Driver.Navigate().GoToUrl("https://localhost:44394/Page1");
}
}
}
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium;
public class Page2Model : PageModel
{
public Page2Model(IWebDriver driver) : base(driver)
{
}
public IWebElement Results => this.GetById("results");
public void Navigate()
{
this.Driver.Navigate().GoToUrl("https://localhost:44394/Page2");
}
}
}
And the Scraper class:
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium.Chrome;
using System;
using System.Threading;
using Xunit;
public class Scraper
{
[Fact]
public void TestPage1()
{
// Arrange
var driver = new ChromeDriver();
var page = new Page1Model(driver);
page.Navigate();
try
{
// Act
var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
// Assert
Assert.Equal(20000, actualResults.Length);
}
finally
{
// Ensure the browser window closes even if things go pear-shaped
driver.Quit();
}
}
[Fact]
public void TestPage2()
{
// Arrange
var driver = new ChromeDriver();
var page = new Page2Model(driver);
page.Navigate();
try
{
// Act
while (!page.Results.Text.Contains("That's all folks"))
{
Thread.Sleep(1000);
page.ScrollToBottom();
page.ScrollToTop();
}
var actualResults = page.Results.Text.Split(Environment.NewLine);
// Assert - we expect 1001 because of the extra "that's all folks"
Assert.Equal(1001, actualResults.Length);
}
finally
{
// Ensure the browser window closes even if things go pear-shaped
driver.Quit();
}
}
}
}
So, what's happening here?
// Arrange
var driver = new ChromeDriver();
var page = new Page1Model(driver);
page.Navigate();
ChromeDriver is in the Selenium.WebDriver.ChromeDriver package and implements the IWebDriver interface from the Selenium.WebDriver package with the code to interact with the Chrome browser. Other packages are available containing implementations for all popular browsers. Instantiating the driver object opens a browser window, and calling its Navigate method directs the browser to the page we want to test/scrape.
// Act
var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
Because on Page1, the results element doesn't exist until all the data has been displayed, and no user interaction is required in order for it to be displayed, we use the page model's AwaitResults property to just wait for that element to appear and return it once it has appeared.
AwaitResults returns an IWebElement instance representing the element, which in turn has various methods and properties we can use to interact with the element. In this case we use its Text property, which returns the element's contents as a string, without any markup. Because the data is displayed as an unordered list, each element in the list is delimited by a line break, so we can use String's Split method to convert it to a string array.
Page2 needs a different approach - we can't use the presence of the results element to determine whether the data has all been displayed, because that element is on the page right from the start, instead we need to check for the string "That's all folks" which is written right at the end of the last chunk of data. Also the data isn't loaded all in one go, and we need to keep scrolling down in order to trigger the loading of the next chunk of data.
// Act
while (!page.Results.Text.Contains("That's all folks"))
{
Thread.Sleep(1000);
page.ScrollToBottom();
page.ScrollToTop();
}
var actualResults = page.Results.Text.Split(Environment.NewLine);
Because of the bug in the UI that I mentioned earlier, if we get to the bottom of the page too quickly, the fetch of the next chunk of data isn't triggered, and attempting to scroll down when already at the bottom of the page doesn't raise another scroll event. That's why I'm scrolling to the bottom of the page and then back to the top - that way I can guarantee that a scroll event is raised. You never know, the web site you're trying to scrape data from may itself be buggy.
Once the "That's all folks" text has appeared, we can go ahead and get the results element's Text property and convert it to a string array as before.
// Assert - we expect 1001 because of the extra "that's all folks"
Assert.Equal(1001, actualResults.Length);
This is the bit that won't be in your code. Because I'm scraping a web site which is under my control, I know exactly how much data it should be displaying so I can check that I've got all the data, and therefore that my scraping code is working correctly.
Further reading
Absolute beginner's introduction to Selenium: https://www.guru99.com/selenium-csharp-tutorial.html
(A curiosity in that article is the way that it starts by creating a console application project and later changes its output type to class library and manually adds the unit test packages, when the project could have been created using one of Visual Studio's unit test project templates. It gets to the right place in the end, albeit via a rather odd route.)
Selenium documentation: https://www.selenium.dev/documentation/
Happy scraping!
If you need to fully execute the web page, then a complete browser like CefSharp is your only option.
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically. I know that CefSharp can simulate user actions like clicking, scrolling, etc.
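As a rough illustration only (not a drop-in solution), the sketch below uses the CefSharp.OffScreen NuGet package to load a page, wait until the main frame reports that loading has finished, and then grab the rendered HTML. The extra delay for late-running scripts is a guess you would tune per site, and Cef.Initialize/Cef.Shutdown should really be called once per process.
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public static class RenderedHtmlFetcher
{
    public static async Task<string> GetRenderedHtmlAsync(string url)
    {
        Cef.Initialize(new CefSettings());                   // one-time initialisation per process
        using (var browser = new ChromiumWebBrowser(url))
        {
            var loaded = new TaskCompletionSource<bool>();
            browser.LoadingStateChanged += (s, e) =>
            {
                if (!e.IsLoading) loaded.TrySetResult(true); // main frame finished loading
            };
            await loaded.Task;
            await Task.Delay(2000);                          // crude allowance for late-running scripts (assumption)
            return await browser.GetSourceAsync();           // HTML after JavaScript has run
        }
    }
}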

Extracting Twitter PIN info, WP7

I've been following this great tutorial:
http://buildmobile.com/twitter-in-a-windows-phone-7-app/#fbid=o0eLp-OipGa
But it seems that the PIN extraction method used in it doesn't work for me or is out of date. I'm not an expert on HTML scraping and was wondering if someone could help me find a solution for extracting the PIN. The method used by the tutorial is:
private void BrowserNavigated(object sender, NavigationEventArgs e){
if (AuthenticationBrowser.Visibility == Visibility.Collapsed) {
AuthenticationBrowser.Visibility = Visibility.Visible;
}
if (e.Uri.AbsoluteUri.ToLower().Replace("https://", "http://") == AuthorizeUrl) {
var htmlString = AuthenticationBrowser.SaveToString();
var pinFinder = new Regex(@"<DIV id=oauth_pin>(?<pin>[A-Za-z0-9_]+)</DIV>", RegexOptions.IgnoreCase);
var match = pinFinder.Match(htmlString);
if (match.Length > 0) {
var group = match.Groups["pin"];
if (group.Length > 0) {
pin = group.Captures[0].Value;
if (!string.IsNullOrEmpty(pin)) {
RetrieveAccessToken();
}
}
}
if (string.IsNullOrEmpty(pin)){
Dispatcher.BeginInvoke(() => MessageBox.Show("Authorization denied by user"));
}
// Make sure pin is reset to null
pin = null;
AuthenticationBrowser.Visibility = Visibility.Collapsed;
}
}
When running through that code, "match" never finds anything and the PIN is never extracted. Everything else in the tutorial works, but I have no idea how to adapt this code to extract the PIN, given the new structure of the page.
I really appreciate the time,
Mike
I have found that Twitter has 2 different PIN pages, and I think they determine which page to redirect you to depending on your browser.
Something as simple as string parsing will work for you. The first PIN page I came across has the PIN code wrapped in a <code> tag, so simply look for <code> and parse it out:
if (innerHtml.Contains("<code>"))
{
pin = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
}
The other page I came across (which looks like the one in the tutorial you are using) is wrapped using an id="oauth_pin" if I recall correctly. So, just parse that as well:
else if(innerHtml.Contains("oauth_pin"))
{
pin = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
}
innerHtml is a string that contains the body of the page; in your code, that seems to be the htmlString from AuthenticationBrowser.SaveToString().
I use both of these in my C# program and they work great, full snippet:
private void WebBrowser1DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var innerHtml = webBrowser1.Document.Body.InnerHtml.ToLower();
var code = string.Empty;
if (innerHtml.Contains("<code>"))
{
code = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
}
else if(innerHtml.Contains("oauth_pin"))
{
code = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
}
textBox1.Text = code;
}
Let me know if you have any question and I hope this helps!!
I needed to change the code suggested by Toma A to this one:
var innerHtml = webBrowser1.SaveToString();
var code = string.Empty;
if (innerHtml.Contains("<code>"))
{
code = innerHtml.Substring(innerHtml.IndexOf("<code>") + 6, 7);
}
else if (innerHtml.Contains("oauth_pin"))
{
code = innerHtml.Substring(innerHtml.IndexOf("oauth_pin") + 10, 7);
}
because this one doesn't work on Windows Phone:
var innerHtml = webBrowser1.Document.Body.InnerHtml.ToLower();

Simple web crawler in C#

I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can also collect the URLs on that page; however, I have no idea how to do that. I would also like to include threads to make it faster.
Here is my code
namespace Crawler
{
public partial class Form1 : Form
{
String Rstring;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
WebRequest myWebRequest;
WebResponse myWebResponse;
String URL = textBox1.Text;
myWebRequest = WebRequest.Create(URL);
myWebResponse = myWebRequest.GetResponse();//Returns a response from an Internet resource
Stream streamResponse = myWebResponse.GetResponseStream();//return the data stream from the internet
//and save it in the stream
StreamReader sreader = new StreamReader(streamResponse);//reads the data stream
Rstring = sreader.ReadToEnd();//reads it to the end
String Links = GetContent(Rstring);//gets the links only
textBox2.Text = Rstring;
textBox3.Text = Links;
streamResponse.Close();
sreader.Close();
myWebResponse.Close();
}
private String GetContent(String Rstring)
{
String sString="";
HTMLDocument d = new HTMLDocument();
IHTMLDocument2 doc = (IHTMLDocument2)d;
doc.write(Rstring);
IHTMLElementCollection L = doc.links;
foreach (IHTMLElement links in L)
{
sString += links.getAttribute("href", 0);
sString += "/n";
}
return sString;
}
I fixed your GetContent method as follows to get new links from a crawled page:
public ISet<string> GetNewLinks(string content)
{
Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
ISet<string> newLinks = new HashSet<string>();
foreach (var match in regexLink.Matches(content))
{
if (!newLinks.Contains(match.ToString()))
newLinks.Add(match.ToString());
}
return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my mistype).
I have created something similar using Reactive Extensions.
https://github.com/Misterhex/WebCrawler
I hope it can help you.
Crawler crawler = new Crawler();
IObservable observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
onCompleted: () => Console.WriteLine("Crawling completed"));
The following includes an answer/recommendation.
I believe you should use a DataGridView instead of a TextBox, because it is much easier to see the links (URLs) found when you look at them in the GUI.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
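Note that DataSource expects a collection rather than a single string, so in practice it could look something like this sketch, where links is assumed to be a List<string> your crawler collected (needs using System.Linq):
// Wrapping each URL in an anonymous type gives the grid a readable "Url" column.
dataGridView.DataSource = links.Select(u => new { Url = u }).ToList();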
Now, a question of my own: you haven't included your using directives, so I can't tell which namespaces the code relies on. It would be appreciated if you could add them, as I can't figure it out.
From a design standpoint, I've written a few web crawlers. Basically you want to implement a Depth First Search using a Stack data structure. You can use Breadth First Search also, but you'll likely run into memory issues. Good luck.
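To make that concrete, here is a minimal sketch of an explicit-stack, depth-first crawl. It assumes it sits in the same class as the GetNewLinks helper from the answer above and needs using System.Collections.Generic and using System.Net; the maxPages cap and the WebClient usage are illustrative choices. Swapping the Stack for a Queue gives breadth-first order instead.
public ISet<string> Crawl(string startUrl, int maxPages)
{
    var visited = new HashSet<string>();   // pages already downloaded
    var pending = new Stack<string>();     // explicit DFS stack instead of recursion
    pending.Push(startUrl);
    using (var client = new WebClient())
    {
        while (pending.Count > 0 && visited.Count < maxPages)
        {
            string url = pending.Pop();
            if (!visited.Add(url))
                continue;                  // skip URLs we have already seen
            string content;
            try { content = client.DownloadString(url); }
            catch (WebException) { continue; }   // ignore pages that fail to download
            foreach (string link in GetNewLinks(content))
            {
                if (link.StartsWith("http") && !visited.Contains(link))
                    pending.Push(link);    // the most recently found links get crawled first
            }
        }
    }
    return visited;
}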

How to feed WebBrowser control and manipulate the HTML document?

Good day
I have a question about displaying HTML documents in a Windows Forms application. The app that I'm working on should display information from the
database in HTML format. I will try to describe the actions I have taken (and which failed):
1) I tried to load a "virtual" HTML page that exists only in memory and dynamically change its parameters (webbMain is a WebBrowser control):
public static string CreateBookHtml()
{
StringBuilder sb = new StringBuilder();
//Declaration
sb.AppendLine(@"<?xml version=""1.0"" encoding=""utf-8""?>");
sb.AppendLine(@"<?xml-stylesheet type=""text/css"" href=""style.css""?>");
sb.AppendLine(@"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.1//EN""
""http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"">");
sb.AppendLine(@"<html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"">");
//Head
sb.AppendLine(@"<head>");
sb.AppendLine(@"<title>Exemplary document</title>");
sb.AppendLine(@"<meta http-equiv=""Content-Type"" content=""application/xhtml+xml;
charset=utf-8""/>");
sb.AppendLine(@"</head>");
//Body
sb.AppendLine(@"<body>");
sb.AppendLine(@"<p id=""paragraph"">Example.</p>");
sb.AppendLine(@"</body>");
sb.AppendLine(@"</html>");
return sb.ToString();
}
void LoadBrowser()
{
this.webbMain.Navigate("about:blank");
this.webbMain.DocumentText = CreateBookHtml();
HtmlDocument doc = this.webbMain.Document;
}
This failed, because doc.Body is null, and doc.GetElementById("paragraph") returns null too, so I cannot change the paragraph's InnerText property.
Furthermore, this.webbMain.DocumentText is "\0"...
2) I tried to create an HTML file in a specified folder, load it into the WebBrowser and then change its parameters. The HTML is the same as created by the
CreateBookHtml() method:
private void LoadBrowser()
{
this.webbMain.Navigate("HTML\\BookPage.html"));
HtmlDocument doc = this.webbMain.Document;
}
This time this.webbMain.DocumentText contains the HTML data read from the file, but doc.Body returns null again, and I still cannot get an element using the
GetElementById() method. Of course, once I have the text I could try a regex to pull out specific fields, or maybe do other tricks to achieve the goal, but I wonder: is there a simple way to manipulate the HTML? For me, the ideal way would be to create the HTML text in memory, load it into the WebBrowser control, and then dynamically change its parameters using IDs. Is that possible? Thanks for the answers in advance, best regards,
Paweł
I worked with the WebBrowser control some time ago and, like you, wanted to load HTML from memory, but I had the same problem of the body being null. After some investigation, I noticed that the Navigate and NavigateToString methods work asynchronously, so the control needs a little time to load the document; the document is not available right after the call to Navigate. So I did something like this (wbChat is the WebBrowser control):
wbChat.NavigateToString("<html><body><div>first line</div></body><html>");
DoEvents();
where DoEvents() is implemented as:
[SecurityPermissionAttribute(SecurityAction.Demand, Flags = SecurityPermissionFlag.UnmanagedCode)]
public void DoEvents()
{
DispatcherFrame frame = new DispatcherFrame();
Dispatcher.CurrentDispatcher.BeginInvoke(DispatcherPriority.Background,
new DispatcherOperationCallback(ExitFrame), frame);
Dispatcher.PushFrame(frame);
}
and it worked for me; after the DoEvents call, I could obtain a non-null body:
mshtml.IHTMLDocument2 doc2 = (mshtml.IHTMLDocument2)wbChat.Document;
mshtml.HTMLDivElement div = (mshtml.HTMLDivElement)doc2.createElement("div");
div.innerHTML = "some text";
mshtml.HTMLBodyClass body = (mshtml.HTMLBodyClass)doc2.body;
if (body != null)
{
body.appendChild((mshtml.IHTMLDOMNode)div);
body.scrollTop = body.scrollHeight;
}
else
Console.WriteLine("body is still null");
I don't know if this is the right way of doing this, but it fixed the problem for me, maybe it helps you too.
Later Edit:
public object ExitFrame(object f)
{
((DispatcherFrame)f).Continue = false;
return null;
}
The DoEvents method is necessary on WPF. For System.Windows.Forms one can use Application.DoEvents().
Another way to do the same thing is:
webBrowser1.DocumentText = "<html><body>blabla<hr/>yadayada</body></html>";
This works without any extra initialization.
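Building on that, here is a minimal WinForms sketch that waits for the DocumentCompleted event before touching the DOM; webbMain and CreateBookHtml are taken from the question, and the rest of the form is assumed to come from the designer.
using System;
using System.Windows.Forms;

public partial class BookForm : Form
{
    public BookForm()
    {
        InitializeComponent();
        webbMain.DocumentCompleted += OnDocumentCompleted;  // fires once the DOM is ready
        webbMain.DocumentText = CreateBookHtml();           // the HTML builder from the question
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Body and GetElementById no longer return null at this point.
        HtmlElement paragraph = webbMain.Document.GetElementById("paragraph");
        if (paragraph != null)
            paragraph.InnerText = "Text injected after the document finished loading.";
    }
}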

Looking for OO gurus, need some help in the design of my programming logic. Nothing fancy, just new to it

I'll post my entire class and maybe someone with MUCH more experience can help me design something better. I'm really new to doing things Asynchronously, so I'm really lost here. Hopefully my design isn't TOO bad. :P
IMDB Class:
public class IMDB
{
WebClient WebClientX = new WebClient();
byte[] Buffer = null;
public string[] SearchForMovie(string SearchParameter)
{
//Format the search parameter so it forms a valid IMDB *SEARCH* url.
//From within the search website we're going to pull the actual movie
//link.
string sitesearchURL = FindURL(SearchParameter);
//Have a method download asynchronously the ENTIRE source code of the
//IMDB *search* website, and save it to the byte[] "Buffer".
WebClientX.DownloadDataCompleted += new DownloadDataCompletedEventHandler(WebClientX_DownloadDataCompleted);
WebClientX.DownloadDataAsync(new Uri(sitesearchURL));
//Convert the byte[] to a string so we can easily find the *ACTUAL*
//movie URL.
string sitesearchSource = Encoding.ASCII.GetString(Buffer);
//Pass the IMDB source code to method FindInformation() to FIND the movie
//URL.
string MovieURL = FindMovieURL(sitesearchSource);
//Download the source code from the recently found movie URL.
WebClientX.DownloadDataAsync(new Uri(MovieURL));
//Convert the source code to readable string for scraping of information.
string sitemovieSource = Encoding.ASCII.GetString(Buffer);
string[] MovieInformation = ScrapeInformation(sitemovieSource);
return MovieInformation;
}
void WebClientX_DownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
{
Buffer = e.Result;
throw new NotImplementedException();
}
/// <summary>
/// Formats a valid IMDB url for ease of use according to a search parameter.
/// </summary>
/// <param name="sitesearchSource"></param>
/// <returns></returns>
private string FindMovieURL(string sitesearchSource)
{
int Start = sitesearchSource.IndexOf("<link rel=\"canonical\" href=\"");
string IMDBCode = sitesearchSource.Substring((Start + 28), 35);
return IMDBCode;
}
private string[] ScrapeInformation(string Source)
{
string[] Information = new string[5];
Information[0] = FindTitle(Source);
Information[1] = FindDirector(Source);
Information[2] = FindYear(Source);
Information[3] = FindPlot(Source);
Information[4] = FindPoster(Source);
return Information;
}
/*************************************************************************/
private string FindURL(string Search)
{
string[] SearchArray = Search.Split(' ');
string FormattedQuery = "";
foreach (string X in SearchArray)
{
FormattedQuery += X + "+";
}
FormattedQuery.Remove((FormattedQuery.Length - 1), 1);
string TheFormattedQuery = "http://www.imdb.com/find?s=all&q=" + FormattedQuery + "&x=0&y=0";
return TheFormattedQuery;
}
private string FindTitle(string Source)
{
//<title>Couples Retreat (2009)</title>
int Start = Source.IndexOf("<title>");
string Bookmark = Source.Substring((Start + 7), 400);
int End = Bookmark.IndexOf("</title>");
string Title = Bookmark.Substring(0, End - 7);
return Title;
}
private string FindDirector(string Source)
{
int Start = Source.IndexOf("<h5>Director:</h5>");
string Bookmark = Source.Substring((Start + 18), 250);
Start = Bookmark.IndexOf(">");
Bookmark = Bookmark.Substring(Start + 1, 100);
int End = Bookmark.IndexOf("</a>");
string Director = Bookmark.Substring(0, End - 1);
return Director;
}
private string FindYear(string Source)
{
int Start = Source.IndexOf("<h5>Release Date:</h5>");
string Bookmark = Source.Substring((Start + 22), 40);
int End = Bookmark.IndexOf("<a class=");
string ReleaseYear = Bookmark.Substring(0, End - 1);
return ReleaseYear;
}
private string FindPlot(string Source)
{
int Start = Source.IndexOf("<h5>Plot:</h5>");
string Bookmark = Source.Substring((Start + 14), 700);
int End = Bookmark.IndexOf("<a class");
string Plot = Bookmark.Substring(0, End - 1);
return Plot;
}
private string FindPoster(string Source)
{
int Start = Source.IndexOf("<a name=\"poster\" href=");
string Bookmark = Source.Substring((Start + 22), 700);
Start = Bookmark.IndexOf("src=\"");
string PosterURL = Bookmark.Substring((Start + 5), 103);
return PosterURL;
}
}
Form1.cs Class (My windows Form):
public partial class MainSearchForm : Form
{
public MainSearchForm()
{
InitializeComponent();
}
SearchFunctions.IMDB IMDBClass = new QuickFlick.SearchFunctions.IMDB();
int YPosition = 5;
private void btnSearch_Click(object sender, EventArgs e)
{
string[] MovieInformation = IMDBClass.SearchForMovie(txtSearch.Text);
LoadMovieInformation(MovieInformation);
}
public void LoadMovieInformation(string[] FoundInfo)
{
MovieItem TheMovieItem = new MovieItem();
TheMovieItem.SetTitle(FoundInfo[0]);
TheMovieItem.SetDirector(FoundInfo[1]);
TheMovieItem.SetRelease(FoundInfo[2]);
TheMovieItem.SetPlot(FoundInfo[3]);
TheMovieItem.SetPoster(FoundInfo[4]);
TheMovieItem.Location = new Point(5, YPosition);
YPosition += 196;
panel1.Controls.Add(TheMovieItem);
}
}
Now, the gist of what my program is trying to accomplish.
A user will write down the name of a movie, and I'll pull up the information about it. Nothing else! :P It's mostly intended for me to learn async functions etc., but I'm scared I might be approaching this in completely the wrong way.
Once again, I'm not looking for much code, just the design of the program: methods, order of methods, unnecessary methods, etc. :D
Thanks a bunch SO, as always, you rock!
I'm not going to comment on the entire design because, like @Vinko, I think you ought to focus your question a little more in that respect. I will say though that you are fundamentally misunderstanding how the asynchronous methods work. They will return immediately to the calling thread without completing first. That's the whole point of asynchronous calls - they don't wait until they have finished before they return.
For this to work, you must either use synchronous calls or wait on some event after the call and signal that event in the asynchronous callback handler so that your code doesn't attempt to access the returned data until after the call has completed. The benefit in this approach is that your thread is free to perform other tasks while waiting on the call to complete, but you must absorb the complexity of managing the waiting process (it's not much).
See the sample code for the DownloadDataCompleted event at MSDN. You'll need to do a wait for each of the web client downloads that you do. Note that, in your handler, you need to invoke the Set() method on the AutoResetEvent object shown in the example.
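A minimal sketch of that wait-and-signal pattern follows (the class and field names are illustrative). Note that it should be called from a background thread: WebClient raises the completed event back on the starting thread's SynchronizationContext, so blocking the UI thread here would deadlock.
using System;
using System.Net;
using System.Threading;

// Blocks the calling thread until the asynchronous download signals completion.
public class BlockingDownloader
{
    private readonly AutoResetEvent downloadDone = new AutoResetEvent(false);
    private byte[] buffer;

    public byte[] DownloadData(string url)
    {
        using (var client = new WebClient())
        {
            client.DownloadDataCompleted += (s, e) =>
            {
                buffer = e.Result;      // stash the payload
                downloadDone.Set();     // wake the waiting thread
            };
            client.DownloadDataAsync(new Uri(url));
            downloadDone.WaitOne();     // wait here until Set() is called
            return buffer;
        }
    }
}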
You need to have models.
For example, the FindTitle() and FindDirector() methods should be in a separate class, perhaps called 'Movie' or 'IMDBMovie'. This class should have attributes such as 'title' and 'director' with getters/setters.
This article could prove helpful to you: MVC
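For illustration, a minimal sketch of such a model class, with property names taken from the fields the question already scrapes:
// Plain data holder; the parsing and scraping logic live elsewhere.
public class Movie
{
    public string Title { get; set; }
    public string Director { get; set; }
    public string ReleaseYear { get; set; }
    public string Plot { get; set; }
    public string PosterUrl { get; set; }
}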
I would agree with clownbaby on the need for models, but would disagree that FindTitle, FindDirector etc. need to be in separate classes.
I think that you should structure your code as follows:
Have a class that takes in the search parameters in a generic way and then calls a search through some kind of factory class.
Have the factory class internally create a specialized version of the search class based on either a specification by the user, or a specification in the configuration settings of the application.
Have the search class format the search query and then call another class which simply makes the HTTP request and gets back the HTTP response contents to the search class.
Have the search class then parse the HTTP response to generate the results to be fed back to the class that called the search in Item 1.
The advantages of this sort of approach are that you can then isolate the code that is specialized for processing website-specific parsing routines into its own class and isolate it from the UI, as well as utilize a single class to make the HTTP request, regardless of how many websites you want to scrape.
Put another way, the View layer should not care how exactly the data is retrieved and parsed and should have the flexibility to call on various sources of information.
Likewise, the Data layer should not care how the data should be processed and should only be concerned with where to get the data from.
The Domain layer is the one that will sit in the middle, take raw data from the Data layer, parse it and make sense of it and then format it into a shape that the View can understand.
This way, if say you want to also target Yahoo! Movies, Wikipedia or any of the other sources of movie data, you can do that. All you would have to do is add another class in the Domain layer and inform the View that a new source is available in the Domain layer.
Good design is about limiting the impact of changes so that additional features can be built into the application.
Good design is also about being able to test individual components. A design such as the one I have described will allow you to test the various classes that comprise individual layers and ensure that they work as advertised.
Something I find very useful is that if a class cannot be tested very easily using NUnit and Rhino.Mocks, it's probably not very well-designed.
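As a rough sketch of that layering (all type names here are hypothetical, the Movie class is the simple model sketched in the earlier answer, and using System is assumed):
// View layer asks the factory for a provider; it never knows which site is behind it.
public interface IMovieSearchProvider
{
    Movie Search(string query);
}

// Data layer: only knows how to fetch raw content.
public interface IHttpFetcher
{
    string GetContent(string url);
}

// Domain layer: site-specific URL building and parsing live here.
public class ImdbSearchProvider : IMovieSearchProvider
{
    private readonly IHttpFetcher fetcher;

    public ImdbSearchProvider(IHttpFetcher fetcher)
    {
        this.fetcher = fetcher;
    }

    public Movie Search(string query)
    {
        string url = "http://www.imdb.com/find?s=all&q=" + Uri.EscapeDataString(query);
        string html = fetcher.GetContent(url);
        return ParseMovie(html);
    }

    private Movie ParseMovie(string html)
    {
        // Scraping details omitted; this is where FindTitle, FindDirector etc. would be called.
        return new Movie();
    }
}

public static class SearchProviderFactory
{
    public static IMovieSearchProvider Create(string source, IHttpFetcher fetcher)
    {
        switch (source)
        {
            case "imdb": return new ImdbSearchProvider(fetcher);
            default: throw new NotSupportedException("Unknown source: " + source);
        }
    }
}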
Better to split the model object from the data retrieval object.
User sees View, uses btnSearch_Click
Search starts on other thread, btnSearch_Click action finishes
Search finishes, data may be available, new object creation and addition to view.
View has new data event and redisplays content.
Some adjusted code to do the async calls.
public void SearchForMovie(string SearchParameter) // returns nothing now; results arrive via the completed handlers below
{
//Format the search parameter so it forms a valid IMDB *SEARCH* url.
//From within the search website we're going to pull the actual movie
//link.
string sitesearchURL = FindURL(SearchParameter);
//Have a method download asynchronously the ENTIRE source code of the
//IMDB *search* website, and save it to the byte[] "Buffer".
WebClientX.DownloadDataCompleted += new DownloadDataCompletedEventHandler(WebClientX_DownloadSearchCompleted);
WebClientX.DownloadDataAsync(new Uri(sitesearchURL));
}
void WebClientX_DownloadSearchCompleted(object sender, DownloadDataCompletedEventArgs e)
{
Buffer = e.Result;
//Convert the byte[] to a string so we can easily find the *ACTUAL*
//movie URL.
string sitesearchSource = Encoding.ASCII.GetString(Buffer);
//Pass the IMDB source code to method FindInformation() to FIND the movie
//URL.
string MovieURL = FindMovieURL(sitesearchSource);
//Download the source code from the recently found movie URL.
WebClientX.DownloadDataCompleted -= WebClientX_DownloadSearchCompleted; // detach the search handler so it doesn't fire again
WebClientX.DownloadDataCompleted += new DownloadDataCompletedEventHandler(WebClientX_DownloadMovieCompleted);
WebClientX.DownloadDataAsync(new Uri(MovieURL));
}
void WebClientX_DownloadMovieCompleted(object sender, DownloadDataCompletedEventArgs e)
{
Buffer = e.Result;
//Convert the source code to readable string for scraping of information.
string sitemovieSource = Encoding.ASCII.GetString(Buffer);
// would create a movie object here rather than have the scrape function on this class
string[] MovieInformation = ScrapeInformation(sitemovieSource);
Model.LoadMovieInformation(MovieInformation);
}
