Simple web crawler in C#

Simple web crawler in C# - c#

I have created a simple web crawler but I want to add the recursion function so that every page that is opened I can get the URLs in this page, but I have no idea how I can do that and I want also to include threads to make it faster.
Here is my code
namespace Crawler
{
public partial class Form1 : Form
{
String Rstring;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
WebRequest myWebRequest;
WebResponse myWebResponse;
String URL = textBox1.Text;
myWebRequest = WebRequest.Create(URL);
myWebResponse = myWebRequest.GetResponse();//Returns a response from an Internet resource
Stream streamResponse = myWebResponse.GetResponseStream();//return the data stream from the internet
//and save it in the stream
StreamReader sreader = new StreamReader(streamResponse);//reads the data stream
Rstring = sreader.ReadToEnd();//reads it to the end
String Links = GetContent(Rstring);//gets the links only
textBox2.Text = Rstring;
textBox3.Text = Links;
streamResponse.Close();
sreader.Close();
myWebResponse.Close();
}
private String GetContent(String Rstring)
{
String sString="";
HTMLDocument d = new HTMLDocument();
IHTMLDocument2 doc = (IHTMLDocument2)d;
doc.write(Rstring);
IHTMLElementCollection L = doc.links;
foreach (IHTMLElement links in L)
{
sString += links.getAttribute("href", 0);
sString += "/n";
}
return sString;
}

I fixed your GetContent method as follow to get new links from crawled page:
public ISet<string> GetNewLinks(string content)
{
Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
ISet<string> newLinks = new HashSet<string>();
foreach (var match in regexLink.Matches(content))
{
if (!newLinks.Contains(match.ToString()))
newLinks.Add(match.ToString());
}
return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks #shashlearner for pointing this out (my mistype).

i have created something similar using Reactive Extension.
https://github.com/Misterhex/WebCrawler
i hope it can help you.
Crawler crawler = new Crawler();
IObservable observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
onCompleted: () => Console.WriteLine("Crawling completed"));

The following includes an answer/recommendation.
I believe you should use a dataGridView instead of a textBox as when you look at it in GUI it is easier to see the links (URLs) found.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
Now for the question, you haven't included:
using System. "'s"
which ones were used, as it would be appreciated if I could get them as can't figure it out.

From a design standpoint, I've written a few webcrawlers. Basically you want to implement a Depth First Search using a Stack data structure. You can use Breadth First Search also, but you'll likely come into stack memory issues. Good luck.

Related

Awesomium WebControl load string

Is there a way to load a html as a string in webControl?
Something like:
webControl.Load("<!DOCTYPE html><html>...");
Like used in the normal wpf webControl:
webControl.NavigateToString("<!DOCTYPE html><html>...");

Actually now I found the answer in the tutorials for C++ (not on .net wpf) in Awesomium site.
Here is my solution:
var uri = new Uri("data:text/html,<!DOCTYPE html><html>...", UriKind.Absolute);
webControl.Source = uri;

I know it is an old question but here is how I menaged to do it :
var page = new WebControl
{
ViewType = WebViewType.Window,
};
page.NativeViewInitialized += (s, e) =>
{
page.LoadHTML("<html>SOME TEXT</html>");
};

Instead of using a URL in the source just put your HTML in there
loaded from Awesomium tutorials

Here is my solution:
Load html string to a file and then load page using webControl.Source property.
public static string WriteHtmlToTempFile(string html)
{
var fileName = GetTempFileName("html");
System.IO.File.WriteAllText(fileName, html);
return fileName;
}
var strHtml = "<HTML> Hello World</HTML>";
var file = Common.WriteHtmlToTempFile(strHtml);
var wUri = new Uri(string.Format(#"file://{0}", file ));
webControl2.Source = wUri;

Creating XML from MarkUp HTML

I've got a few web pages that have static data in HTML mark-up tables. By this, I mean, manually maintained text:
<table border="1" >
<tr><th>Number</th><th>Date</th><th>BW</th><th>WW</th><th>%</th><th>Type</th><th>CED</th><th>BW</th><th>WW</th><th>YW</th><th>Mlk</th><th>Me</th></tr>
<tr><td>313</td><td>9/16/2013</td><td>74</td><td>512</td><td>100</td><td>861U</td><td>3</td><td>-1.1</td><td>54</td><td>85</td><td>16</td><td></td></tr>
<tr><td>315</td><td>10/6/2013</td><td>-</td><td>-</td><td>-</td><td>W179</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>316</td><td>10/102013</td><td>72</td><td>595</td><td>94.2</td><td>W179</td><td>7</td><td>-2.3</td><td>53</td><td>80</td><td>21</td><td>-3</td></tr>
<tr><td>350</td><td>10/11/2013</td><td>71</td><td>703</td><td>100</td><td>W179</td><td>7</td><td>-2.3</td><td>46</td><td>72</td><td>20</td><td>-5</td></tr>
<tr><td>392</td><td>3/8/2013</td><td>61</td><td>651</td><td>100</td><td>RANGER</td><td>7</td><td>-2.3</td><td>52</td><td>82</td><td>20</td><td>-2</td></tr>
<tr><td>303</td><td>7/3/2013</td><td>63</td><td>-</td><td>97.1</td><td>W179</td><td>8</td><td>-3.2</td><td>N/A</td><td>82</td><td>21</td><td>-8</td></tr>
<tr><td>304</td><td>7/8/2013</td><td>62</td><td>-</td><td>97.1</td><td>W179</td><td>7</td><td>-3.9</td><td>N/A</td><td>69</td><td>20</td><td>-4</td></tr>
<tr><td>397</td><td>3/18/2013</td><td>78</td><td>621</td><td>100</td><td>STATEMENT</td><td>6</td><td>-2.7</td><td>55</td><td>84</td><td>19</td><td>5</td></tr>
<tr><td>395</td><td>3/17/2013</td><td>63</td><td>716</td><td>94.2</td><td>STATEMENT</td><td>5</td><td>-2.7</td><td>54</td><td>85</td><td>19</td><td>5</td></tr>
<tr><td>390</td><td>3/6/2013</td><td>66</td><td>583</td><td>94.2</td><td>ENVY</td><td>2</td><td>-0.6</td><td>55</td><td>80</td><td>23</td><td>2</td></tr>
<tr><td>388</td><td>3/4/2013</td><td>53</td><td>621</td><td>100</td><td>STATEMENT</td><td>10</td><td>-5.1</td><td>49</td><td>82</td><td>20</td><td>2</td></tr>
<tr><td>300</td><td>3/22/2013</td><td>61</td><td>633</td><td>100</td><td>RANGER</td><td>8</td><td>-2.8</td><td>49</td><td>81</td><td>19</td><td>-2</td></tr>
<tr><td>379</td><td>2/1/2013</td><td>55</td><td>518</td><td>100</td><td>STATEMENT</td><td>8</td><td>-4.1</td><td>61</td><td>98</td><td>18</td><td>1</td></tr>
<tr><td>398</td><td>3/20/2013</td><td>62</td><td>664</td><td>100</td><td>RANGER</td><td>6</td><td>-2.3</td><td>53</td><td>83</td><td>20</td><td>0</td></tr>
<tr><td>384</td><td>2/10/2013</td><td>61</td><td>650</td><td>100</td><td>ENVY</td><td>3</td><td>-1</td><td>50</td><td>70</td><td>19</td><td>4</td></tr>
<tr><td>369</td><td>1/30/2013</td><td>76</td><td>651</td><td>100</td><td>STATEMENT</td><td>5</td><td>-2.4</td><td>60</td><td>99</td><td>20</td><td>8</td></tr>
<tr><td>373</td><td>1/21/2013</td><td>71</td><td>433</td><td>100</td><td>STATEMENT</td><td>4</td><td>-1.6</td><td>55</td><td>89</td><td>17</td><td>3</td></tr>
<tr><td>393</td><td>3/10/2013</td><td>63</td><td>717</td><td>100</td><td>STATEMENT</td><td>3</td><td>-4.6</td><td>51</td><td>91</td><td>20</td><td>5</td></tr>
<tr><td>389</td><td>3/8/2013</td><td>72</td><td>723</td><td>88.3</td><td>ENVY</td><td>4</td><td>-0.6</td><td>54</td><td>76</td><td>24</td><td>2</td></tr>
<tr><td>364</td><td>10/1/2012</td><td>60</td><td>574</td><td>100</td><td>RANGER</td><td>1</td><td>0.4</td><td>56</td><td>84</td><td>21</td><td>2</td></tr>
</table>
Currently, I am contemplating using a WebClient.DownloadString to pull all of the text in, and try to create an XML file out of it by parsing each row <tr>.
That sounds tedious, and I would rather not reinvent the wheel. Besides, a few good solutions would give me something to look at for ideas on how to best approach writing my version.
Has anyone come across some code that can do this?
I've started, to give you an idea of what I'm working on:
private const string XML_DATA = "App_Data/page_data.xml";
private const string TABLE_START = "<table>";
private const string TABLE_STOP = "</table>";
private string[] TABLE_ROW = { "<tr>", "</tr>" };
private string[] TABLE_HEAD = { "<th>", "</th>" };
private string[] TABLE_DET = { "<td>", "</td>" };
private void load_data() {
if (!File.Exists(XML_DATA)) {
string HtmlText;
using (var client = new WebClient()) {
HtmlText = client.DownloadString(Server.MapPath("/Sales.aspx"));
}
if (!String.IsNullOrEmpty(HtmlText)) {
var lcTxt = HtmlText.ToLower();
int len0 = TABLE_START.Length;
int tStart = lcTxt.IndexOf(TABLE_START) + len0;
int tStop = lcTxt.IndexOf(TABLE_STOP);
if ((len0 < tStart) && (tStart < tStop)) {
var tableString = HtmlText.Substring(tStart, tStop - tStart);
var tableRows = tableString.Split(TABLE_ROW, StringSplitOptions.RemoveEmptyEntries);
foreach (var row in tableRows) {
if (-1 < row.IndexOf(TABLE_HEAD[0])) {
//
} else {
//
}
}
}
}
}
}
Of course, you can see that is already going to fail, because the Markup using <table border="1">.
Yes, easy to fix, but I'd rather have a working guide that has already been through a lot of debugging steps.
UPDATE: I tried using XmlDocument's LoadXml method, but it can't seem to read basic HTML:

You definitely shouldn't be trying to parse that manually. Other people have already solved that problem.
If your markup is valid XML (and from what you've shown us, it looks like it is), then you can just parse it as XML:
XmlDocument doc = new XmlDocument();
doc.LoadXml(HtmlString);
doc.Save("myfile.xml");
But for that matter, if it's already valid XML markup, and all you need to do is save it as a file, then you don't need to parse it. Just save it:
File.WriteAllText("myfile.xml", HtmlString);

wp7 Help Parse Xml

Hi I am trying to parse some xml from a weird xml document developed by icalander. I have been having a lot of trouble just parsing the data, but thanks to the help of people from stackoverflow I have been able to parse the data. Now I need some help parsing between the nodes. Here is a link to the xml file I am parsing from (http://datastore.unm.edu/events/events.xml)
I am using the pivotapp model from Visual Studio 2010 to create this app. In the MainViewModel.cs section I am modifying the following code in hopes that the tag will print out in place of "LineOne" (code listed below). For example, from the xml file linked above, I would like LineOne = Lobo's Got Talent.
I need help figuring out the best method to achieve this, I will need LineTwo to contain the date and time, and LineThree to contain the description.
Thank you for your time and help, it has been greatly appreciated!
public void LoadData()
{
var webClient = new WebClient();
webClient.OpenReadAsync(new Uri("http://datastore.unm.edu/events/events.xml"));
webClient.OpenReadCompleted += new OpenReadCompletedEventHandler(webClient_OpenReadCompleted);
}
public void webClient_OpenReadCompleted(object sender,
OpenReadCompletedEventArgs e)
{
XDocument unmXdoc = XDocument.Load(e.Result, LoadOptions.None);
this.Items.Add(new ItemViewModel() { LineOne = unmXdoc.ToString(),
LineTwo = "", LineThree = "" });
}
Thank you for looking and helping!

The xml is fine, I think you are running into a namespace issue here, you have two options, strip the namespace of the xml file if you are sure you do not need it. The preferred option is to work with the namespace and specify it for the fully qualified element names. see Here
private readonly XNamespace dataNamspace = "urn:ietf:params:xml:ns:icalendar-2.0";
public void webClient_OpenReadCompleted(object sender,
OpenReadCompletedEventArgs e)
{
XDocument unmXdoc = XDocument.Load(e.Result, LoadOptions.None);
this.Items = from p in unmXdoc.Descendants(dataNamspace + "vevent").Elements(dataNamspace + "properties")
select new ItemViewModel
{
LineOne = this.GetElementValue(p, "summary"),
LineTwo = this.GetElementValue(p, "description"),
LineThree = this.GetElementValue(p, "categories"),
};
lstData.ItemsSource = this.Items;
}
private string GetElementValue(XElement element, string fieldName)
{
var childElement = element.Element(dataNamspace + fieldName);
return childElement != null ? childElement.Value : String.Empty;
}

Issue in Parsing Json image in C#

net C#. I am trying to parse Json from a webservice. I have done it with text but having a problem with parsing image. Here is the Url from where I m getting Json
http://collectionking.com/rest/view/items_in_collection.json?args=122
And this is My code to Parse it
using (var wc = new WebClient()) {
JavaScriptSerializer js = new JavaScriptSerializer();
var result = js.Deserialize<ck[]>(wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json args=122"));
foreach (var i in result) {
lblTitle.Text = i.node_title;
imgCk.ImageUrl = i.["main image"];
lblNid.Text = i.nid;
Any help would be great.
Thanks in advance.
PS: It returns the Title and Nid but not the Image.
My class is as follows:
public class ck
{
public string node_title;
public string main_image;
public string nid; }

Your problem is that you are setting ImageUrl to something like this <img typeof="foaf:Image" src="http://... and not an actual url. You will need to further parse main image and extract the url to show it correctly.
Edit
This was a though nut to crack because of the whitespace. The only solution I could find was to remove the whitespace before parsing the string. It's not a very nice solution but I couldn't find any other way using the built in classes. You might be able to solve it properly using JSON.Net or some other library though.
I also added a regular expression to extract the url for you, though there is no error checking what so ever here so you'll need to add that yourself.
using (var wc = new WebClient()) {
JavaScriptSerializer js = new JavaScriptSerializer();
var result = js.Deserialize<ck[]>(wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json?args=122").Replace("\"main image\":", "\"main_image\":")); // Replace the name "main image" with "main_image" to deserialize it properly, also fixed missing ? in url
foreach (var i in result) {
lblTitle.Text = i.node_title;
string realImageUrl = Regex.Match(i.main_image, #"src=""(.*?)""").Groups[1].Value; // Extract the value of the src-attribute to get the actual url, will throw an exception if there isn't a src-attribute
imgCk.ImageUrl = realImageUrl;
lblNid.Text = i.nid;
}
}

Try This
private static string ExtractImageFromTag(string tag)
{
int start = tag.IndexOf("src=\""),
end = tag.IndexOf("\"", start + 6);
return tag.Substring(start + 5, end - start - 5);
}
private static string ExtractTitleFromTag(string tag)
{
int start = tag.IndexOf(">"),
end = tag.IndexOf("<", start + 1);
return tag.Substring(start + 1, end - start - 1);
}
It may help

Replace tokens in an aspx page on load

I have an aspx page that contains regular html, some uicomponents, and multiple tokens of the form {tokenname} .
When the page loads, I want to parse the page content and replace these tokens with the correct content. The idea is that there will be multiple template pages using the same codebehind.
I've no trouble parsing the string data itself, (see named string formatting, replace tokens in template) my trouble lies in when to read, and how to write the data back to the page...
What's the best way for me to rewrite the page content? I've been using a streamreader, and the replacing the page with Response.Write, but this is no good - a page containing other .net components does not render correctly.
Any suggestions would be greatly appreciated!

Take a look at System.Web.UI.Adapters.PageAdapter method TransformText - generally it is used for multi device support, but you can postprocess your page with this.

I'm not sure if I'm answering your question, but...
If you can change your notation from
{tokenname}
to something like
<%$ ZeusExpression:tokenname %>
you could consider creating your System.Web.Compilation.ExpressionBuilder.
After reading your comment...
There are other ways of getting access to the current page using ExpressionBuilder: just... create an expression. ;-)
Changing just a bit the sample from MSDN and supposing the code of your pages contain a method like this
public object GetData(string token);
you could implement something like this
public override CodeExpression GetCodeExpression(BoundPropertyEntry entry, object parsedData, ExpressionBuilderContext context)
{
Type type1 = entry.DeclaringType;
PropertyDescriptor descriptor1 = TypeDescriptor.GetProperties(type1)[entry.PropertyInfo.Name];
CodeExpression[] expressionArray1 = new CodeExpression[1];
expressionArray1[0] = new CodePrimitiveExpression(entry.Expression.Trim());
return new CodeCastExpression(
descriptor1.PropertyType,
new CodeMethodInvokeExpression(
new CodeThisReferenceExpression(),
"GetData",
expressionArray1));
}
This replaces your placeholder with a call like this
(string)this.GetData("tokenname");
Of course you can elaborate much more on this, perhaps using a "utility method" to simplify and "protect" access to data (access to properties, no special method involved, error handling, etc.).
Something that replaces instead with (e.g.)
(string)Utilities.GetData(this, "tokenname");
Hope this helps.

Many thanks to those that contributed to this question, however I ended up using a different solution -
Overriding the render function as per this page, except I parsed the page content for multiple different tags using regular expressions.
protected override void Render(HtmlTextWriter writer)
{
if (!Page.IsPostBack)
{
using (System.IO.MemoryStream stream = new System.IO.MemoryStream())
{
using (System.IO.StreamWriter streamWriter = new System.IO.StreamWriter(stream))
{
HtmlTextWriter htmlWriter = new HtmlTextWriter(streamWriter);
base.Render(htmlWriter);
htmlWriter.Flush();
stream.Position = 0;
using (System.IO.StreamReader oReader = new System.IO.StreamReader(stream))
{
string pageContent = oReader.ReadToEnd();
pageContent = ParseTagsFromPage(pageContent);
writer.Write(pageContent);
oReader.Close();
}
}
}
}
else
{
base.Render(writer);
}
}
Here's the regex tag parser
private string ParseTagsFromPage(string pageContent)
{
string regexPattern = "{zeus:(.*?)}"; //matches {zeus:anytagname}
string tagName = "";
string fieldName = "";
string replacement = "";
MatchCollection tagMatches = Regex.Matches(pageContent, regexPattern);
foreach (Match match in tagMatches)
{
tagName = match.ToString();
fieldName = tagName.Replace("{zeus:", "").Replace("}", "");
//get data based on my found field name, using some other function call
replacement = GetFieldValue(fieldName);
pageContent = pageContent.Replace(tagName, replacement);
}
return pageContent;
}
Seems to work quite well, as within the GetFieldValue function you can use your field name in any way you wish.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Simple web crawler in C# - c#

From a design standpoint, I've written a few webcrawlers. Basically you want to implement a Depth First Search using a Stack data structure. You can use Breadth First Search also, but you'll likely come into stack memory issues. Good luck.

Related

Awesomium WebControl load string

Creating XML from MarkUp HTML

wp7 Help Parse Xml

Issue in Parsing Json image in C#

Replace tokens in an aspx page on load

Categories

Resources