Reading links in header using WebKit.NET - c#

I am trying to figure out how to read the link tags in a page's header using C# and WebKit.NET. I want to get the edit link from browser 1 and open it in browser 2. My problem is that I can't figure out how to get at the attributes, or even the link tags for that matter. Below is what I have now.
using System.Xml.Linq;
...
string source = webKitBrowser1.DocumentText.ToString();
XDocument doc = new XDocument(XDocument.Parse(source));
webKitBrowser2.Navigate(doc.Element("link").Attribute("href").Value.ToString());
This would work except that XML is stricter than HTML: right off the bat, the parser complains that it was expecting "doctype" to be uppercase.

I finally figured it out, so I will post it for anyone who has the same question.
string site = webKitBrowser1.Url.Scheme + "://" + webKitBrowser1.Url.Authority;
WebKit.DOM.Document doc = webKitBrowser1.Document;
WebKit.DOM.NodeList links = doc.GetElementsByTagName("link");
WebKit.DOM.Element link;
string editlink = "none";
foreach (var item in links)
{
link = (WebKit.DOM.Element)item;
if (link.Attributes["rel"].NodeValue == "edit") { editlink = link.Attributes["href"].NodeValue; }
}
if (editlink != "none") { webKitBrowser2.Navigate(site + editlink); }
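One detail in the snippet above is worth a second look: site + editlink only works when the href is root-relative. As a minimal sketch (the LinkResolver class name is hypothetical, not part of WebKit.NET), the standard System.Uri constructor that takes a base URI handles both relative and absolute hrefs:

```csharp
using System;

static class LinkResolver
{
    // Resolves an href taken from a page against that page's URL.
    // Relative hrefs are combined with the base URL; absolute hrefs
    // are returned unchanged.
    public static Uri Resolve(Uri pageUrl, string href)
    {
        return new Uri(pageUrl, href);
    }
}
```

Assuming webKitBrowser1.Url is a System.Uri, as the code above suggests, the last line could then become webKitBrowser2.Navigate(LinkResolver.Resolve(webKitBrowser1.Url, editlink).AbsoluteUri);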

Related

Iterate through web pages and download PDFs

I have a code for crawling through all PDF files on web page and download them to folder. However now it started to drop an error:
System.NullReferenceException HResult=0x80004003 Message=Object
reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in
C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
Pointing to ProductListPage in foreach (HtmlNode src in ProductListPage)
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong tho...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in section (related products). They are: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;
namespace NW_Crawler
{
class Program
{
static void Main(string[] args)
{
{
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
Console.WriteLine("Here are the links:" + ProductListPage);
foreach (HtmlNode src in ProductListPage)
{
htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
// Thread.Sleep(5000); // wait some time
HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
if (LinkTester != null)
{
foreach (var dllink in LinkTester)
{
string LinkURL = dllink.Attributes["href"].Value;
Console.WriteLine(LinkURL);
string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
var DLClient = new WebClient();
// Thread.Sleep(5000); // wait some time
DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}
}
}
}
}
}
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is missing, the latter throws "Object reference not set to an instance of an object".
Check that ProductListPage is not null before iterating over it.
ExtractFilename included the leading / in the name. Add 1 to the index passed to Substring to skip that last /.
Move on to the next iteration if the href is empty in either of the loops.
Changed the product list query from //a[@class='ap-area-link']//a to //a[@class='ap-area-link']. You were searching for an <a> inside the <a> tag, which matches nothing, so SelectNodes returns null. If you still want to query it that way, the first if statement checking ProductListPage != null will take care of the error.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
    foreach (HtmlNode src in ProductListPage)
    {
        string href = src.GetAttributeValue("href", string.Empty);
        if (string.IsNullOrEmpty(href))
            continue;
        htmlDoc = new HtmlWeb().Load(href);
        HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
        if (LinkTester != null)
            foreach (var dllink in LinkTester)
            {
                string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                if (string.IsNullOrEmpty(LinkURL))
                    continue;
                string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
            }
    }
The XPath that you used seems to be incorrect. I loaded the web page in a browser, searched for that XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
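As an alternative to the Substring arithmetic for the file name, here is a small sketch that uses only System.Uri and System.IO.Path. The DownloadPaths class and LocalPathFor method are hypothetical names chosen for illustration:

```csharp
using System;
using System.IO;

static class DownloadPaths
{
    // Builds a local path inside targetFolder for a download URL.
    // Uri.LocalPath leaves out any query string, and Path.GetFileName
    // keeps only the last path segment, without the leading slash.
    public static string LocalPathFor(string url, string targetFolder)
    {
        string fileName = Path.GetFileName(new Uri(url).LocalPath);
        return Path.Combine(targetFolder, fileName);
    }
}
```

The download call could then pass DownloadPaths.LocalPathFor(LinkURL, @"C:\temp") as its second argument. Path.Combine also removes the need to remember the trailing backslash on the folder.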

How to get URL from the XPATH?

I've tried to check other answers on this site, but none of them worked for me. I have following HTML code:
<h3 class="x-large lheight20 margintop5">
<strong>some textstring</strong>
</h3>
I am trying to get # from this document with following code:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a/@href").InnerText;
I've also tried to do that without @href. Also tried with a[contains(@href, 'searchString')]. But all of these lines gave me just the name of the link - some textstring
Attributes don't have InnerText. You have to select the element and read the value from its Attributes collection instead.
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a")
.Attributes["href"].Value;
Why not just use the XDocument class?
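// Requires: using System.Linq; using System.Xml.Linq;
// Assumes the file is well-formed XML with no default namespace.
private string GetUrl(string filename)
{
    var doc = XDocument.Load(filename);
    foreach (var h3Element in doc.Descendants("h3").Where(e => e.Attribute("class") != null))
    {
        var classAtt = h3Element.Attribute("class");
        if (classAtt.Value == "x-large lheight20 margintop5")
        {
            // Returns null when the <a> element or its href is missing.
            return h3Element.Element("a")?.Attribute("href")?.Value;
        }
    }
    return null;
}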
The code is not tested so use with caution.
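To make the difference between an element's text and its attribute value concrete, here is a small self-contained sketch using LINQ to XML. It only applies when the markup is well-formed XML, which real-world HTML often is not. The HrefReader class and FirstHref method are hypothetical names, and the sketch assumes the h3 contains an a element, as the XPath in the question implies:

```csharp
using System.Linq;
using System.Xml.Linq;

static class HrefReader
{
    // Returns the href of the first <a> directly inside an <h3> whose
    // class attribute equals cssClass, or null when there is no match.
    public static string FirstHref(string xml, string cssClass)
    {
        return XDocument.Parse(xml)
            .Descendants("h3")
            .Where(h3 => (string)h3.Attribute("class") == cssClass)
            // The cast to string yields null instead of throwing
            // when the element or the attribute is missing.
            .Select(h3 => (string)h3.Element("a")?.Attribute("href"))
            .FirstOrDefault(href => href != null);
    }
}
```

The element's Value property would give the link text ("some textstring"), while the attribute lookup gives the URL.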

How To Write To A OneNote 2013 Page Using C# and The OneNote Interop

I have seen many articles about this but all of them are either incomplete or do not answer my question. Using C# and the OneNote Interop, I would like to simply write text to an existing OneNote 2013 Page. Currently I have a OneNote Notebook, with a Section titled "Sample_Section" and a Page called "MyPage".
I need to be able to use C# code to write text to this Page, but I cannot figure out how or find any resources to do so. I have looked at all of the code examples on the web and none answer this simple question or are able to do this. Also many of the code examples are outdated and break when attempting to run them.
I used the Microsoft code sample that shows how to change the name of a Section but I cannot find any code to write text to a Page. There is no simple way to do this that I can see. I have taken a lot of time to research this and view the different examples online but none are able to help.
I have already viewed the MSDN articles on the OneNote Interop as well. I vaguely understand how the OneNote Interop works through XML but any extra help understanding that would also be appreciated. Most importantly I would really appreciate a code example that demonstrates how to write text to a OneNote 2013 Notebook Page.
I have tried using this Stack Overflow answer:
Creating new One Note 2010 page from C#
However, there are 2 things about this solution that do not answer my question:
1) The marked solution shows how to create a new page, not how to write text to it or how to populate the page with any information.
2) When I try to run the code that is marked as the solution, I get an error at the following line:
var node = doc.Descendants(ns + nodeName).Where(n => n.Attribute("name").Value == objectName).FirstOrDefault();
return node.Attribute("ID").Value;
The reason is that the value of "node" is null. Any help would be greatly appreciated.
I asked the same question on MSDN forums and was given this great answer. Below is a nice, clean example of how to write to OneNote using C# and the OneNote interop. I hope that this can help people in the future.
static Application onenoteApp = new Application();
static XNamespace ns = null;
static void Main(string[] args)
{
GetNamespace();
string notebookId = GetObjectId(null, OneNote.HierarchyScope.hsNotebooks, "MyNotebook");
string sectionId = GetObjectId(notebookId, OneNote.HierarchyScope.hsSections, "Sample_Section");
string firstPageId = GetObjectId(sectionId, OneNote.HierarchyScope.hsPages, "MyPage");
GetPageContent(firstPageId);
Console.Read();
}
static void GetNamespace()
{
string xml;
onenoteApp.GetHierarchy(null, OneNote.HierarchyScope.hsNotebooks, out xml);
var doc = XDocument.Parse(xml);
ns = doc.Root.Name.Namespace;
}
static string GetObjectId(string parentId, OneNote.HierarchyScope scope, string objectName)
{
string xml;
onenoteApp.GetHierarchy(parentId, scope, out xml);
var doc = XDocument.Parse(xml);
var nodeName = "";
switch (scope)
{
case (OneNote.HierarchyScope.hsNotebooks): nodeName = "Notebook"; break;
case (OneNote.HierarchyScope.hsPages): nodeName = "Page"; break;
case (OneNote.HierarchyScope.hsSections): nodeName = "Section"; break;
default:
return null;
}
var node = doc.Descendants(ns + nodeName).Where(n => n.Attribute("name").Value == objectName).FirstOrDefault();
return node.Attribute("ID").Value;
}
static string GetPageContent(string pageId)
{
string xml;
onenoteApp.GetPageContent(pageId, out xml, OneNote.PageInfo.piAll);
var doc = XDocument.Parse(xml);
var outLine = doc.Descendants(ns + "Outline").First();
var content = outLine.Descendants(ns + "T").First();
string contentVal = content.Value;
content.Value = "modified";
onenoteApp.UpdatePageContent(doc.ToString());
return null;
}
This is just what I've gleaned from reading examples on the web (of course, you've already read all of those) and peeking into the way OneNote stores its data in XML using ONOMspy (http://blogs.msdn.com/b/johnguin/archive/2011/07/28/onenote-spy-omspy-for-onenote-2010.aspx).
If you want to work with OneNote content, you'll need a basic understanding of XML. Writing text to a OneNote page involves creating an outline element, whose content will be contained in OEChildren elements. Within an OEChildren element, you can have many other child elements representing outline content. These can be of type OE or HTMLBlock, if I'm reading the schema correctly. Personally, I've only ever used OE, and in this case, you'll have an OE element containing a T (text) element. The following code will create an outline XElement and add text to it:
// Get info from OneNote
string xml;
onApp.GetHierarchy(null, OneNote.HierarchyScope.hsSections, out xml);
XDocument doc = XDocument.Parse(xml);
XNamespace ns = doc.Root.Name.Namespace;
// Assuming you have a notebook called "Test"
XElement notebook = doc.Root.Elements(ns + "Notebook").Where(x => x.Attribute("name").Value == "Test").FirstOrDefault();
if (notebook == null)
{
Console.WriteLine("Did not find notebook titled 'Test'. Aborting.");
return;
}
// If there is a section, just use the first one we encounter
XElement section;
if (notebook.Elements(ns + "Section").Any())
{
section = notebook.Elements(ns + "Section").FirstOrDefault();
}
else
{
Console.WriteLine("No sections found. Aborting");
return;
}
// Create a page
string newPageID;
onApp.CreateNewPage(section.Attribute("ID").Value, out newPageID);
// Create the page element using the ID of the new page OneNote just created
XElement newPage = new XElement(ns + "Page");
newPage.SetAttributeValue("ID", newPageID);
// Add a title just for grins
newPage.Add(new XElement(ns + "Title",
new XElement(ns + "OE",
new XElement(ns + "T",
new XCData("Test Page")))));
// Add an outline and text content
newPage.Add(new XElement(ns + "Outline",
new XElement(ns + "OEChildren",
new XElement(ns + "OE",
new XElement(ns + "T",
new XCData("Here is some new sample text."))))));
// Now update the page content
onApp.UpdatePageContent(newPage.ToString());
Here's what the actual XML you're sending to OneNote looks like:
<Page ID="{20A13151-AD1C-4944-A3D3-772025BB8084}{1}{A1954187212743991351891701718491104445838501}" xmlns="http://schemas.microsoft.com/office/onenote/2013/onenote">
<Title>
<OE>
<T><![CDATA[Test Page]]></T>
</OE>
</Title>
<Outline>
<OEChildren>
<OE>
<T><![CDATA[Here is some new sample text.]]></T>
</OE>
</OEChildren>
</Outline>
</Page>
Hope that helps get you started!
If you're using C#, check out the newer OneNote REST API at http://dev.onenote.com. It already supports creating a new page and has a beta API to patch and add content to an existing page.

Simple web crawler in C#

I have created a simple web crawler, but I want to make it recursive so that for every page it opens, it also collects the URLs on that page. I have no idea how to do that. I would also like to use threads to make it faster.
Here is my code
namespace Crawler
{
public partial class Form1 : Form
{
String Rstring;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
WebRequest myWebRequest;
WebResponse myWebResponse;
String URL = textBox1.Text;
myWebRequest = WebRequest.Create(URL);
myWebResponse = myWebRequest.GetResponse();//Returns a response from an Internet resource
Stream streamResponse = myWebResponse.GetResponseStream();//return the data stream from the internet
//and save it in the stream
StreamReader sreader = new StreamReader(streamResponse);//reads the data stream
Rstring = sreader.ReadToEnd();//reads it to the end
String Links = GetContent(Rstring);//gets the links only
textBox2.Text = Rstring;
textBox3.Text = Links;
streamResponse.Close();
sreader.Close();
myWebResponse.Close();
}
private String GetContent(String Rstring)
{
String sString="";
HTMLDocument d = new HTMLDocument();
IHTMLDocument2 doc = (IHTMLDocument2)d;
doc.write(Rstring);
IHTMLElementCollection L = doc.links;
foreach (IHTMLElement links in L)
{
sString += links.getAttribute("href", 0);
sString += "\n";
}
return sString;
}
I rewrote your GetContent method as follows to get the new links from a crawled page:
public ISet<string> GetNewLinks(string content)
{
Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
ISet<string> newLinks = new HashSet<string>();
foreach (var match in regexLink.Matches(content))
{
if (!newLinks.Contains(match.ToString()))
newLinks.Add(match.ToString());
}
return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my mistype).
I have created something similar using Reactive Extensions.
https://github.com/Misterhex/WebCrawler
I hope it can help you.
Crawler crawler = new Crawler();
var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
onCompleted: () => Console.WriteLine("Crawling completed"));
The following is a recommendation, followed by a question of my own.
I believe you should use a dataGridView instead of a textBox, because the links (URLs) found are easier to read in the GUI that way.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
Now for the question: you haven't included the using directives (the "using System..." lines) at the top of your file.
Which ones did you use? I would appreciate it if you could post them, as I can't figure them out.
From a design standpoint, I've written a few web crawlers. Basically you want to implement a depth-first search using a Stack data structure. You can also use breadth-first search, but you'll likely run into memory issues as the list of pending pages grows. Good luck.
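The stack-based approach can be sketched as follows. This is an illustration under assumptions, not a full crawler: the CrawlOrder class, the DepthFirst method, and the getLinks delegate are hypothetical names, and getLinks stands in for "download the page and extract its links" (for example with a method like GetNewLinks above).

```csharp
using System;
using System.Collections.Generic;

static class CrawlOrder
{
    // Depth-first traversal starting at startUrl, without recursion.
    // getLinks is a stand-in for fetching a page and returning its links.
    // maxPages caps the crawl so it cannot run forever.
    public static List<string> DepthFirst(
        string startUrl,
        Func<string, IEnumerable<string>> getLinks,
        int maxPages)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        var pending = new Stack<string>();
        pending.Push(startUrl);

        while (pending.Count > 0 && order.Count < maxPages)
        {
            string url = pending.Pop();
            if (!visited.Add(url))
                continue; // already crawled; skipping avoids endless loops

            order.Add(url);
            foreach (string link in getLinks(url))
            {
                if (!visited.Contains(link))
                    pending.Push(link);
            }
        }
        return order;
    }
}
```

Swapping the Stack for a Queue (Enqueue and Dequeue) turns the same loop into a breadth-first search. The visited set is what keeps pages that link to each other from being fetched repeatedly.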

Parsing XML String in C#

I have looked over other posts here on the same subject and searched Google, but I am extremely new to C# and .NET and at a loss. I am trying to parse this XML...
<whmcsapi version="4.1.2">
<action>getstaffonline</action>
<result>success</result>
<totalresults>1</totalresults>
<staffonline>
<staff>
<adminusername>Admin</adminusername>
<logintime>2010-03-03 18:29:12</logintime>
<ipaddress>127.0.0.1</ipaddress>
<lastvisit>2010-03-03 18:30:43</lastvisit>
</staff>
</staffonline>
</whmcsapi>
using this code..
XDocument doc = XDocument.Parse(strResponse);
var StaffMembers = doc.Descendants("staff").Select(staff => new
{
Name = staff.Element("adminusername").Value,
LoginTime = staff.Element("logintime").Value,
IPAddress = staff.Element("ipaddress").Value,
LastVisit = staff.Element("lastvisit").Value,
}).ToList();
label1.Text = doc.Element("totalresults").Value;
foreach (var staff in StaffMembers)
{
listBox1.Items.Add(staff.Name);
}
I have printed out the contents of strResponse and the XML is definitely there. However, when I click this button, nothing is added to listBox1 or label1, so something is wrong.
Add Root here to start navigating from the root element (whmcsapi):
string label1_Text = doc.Root.Element("totalresults").Value;
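A short sketch of why Root matters here: XDocument.Element only looks at the document's direct children, and the only child element of a document is its root. The WhmcsResponse class and TotalResults method are hypothetical names for illustration.

```csharp
using System.Xml.Linq;

static class WhmcsResponse
{
    // Reads <totalresults> from directly under the root element.
    public static string TotalResults(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        // doc.Element("totalresults") would return null here, because
        // the document's only child element is the root, <whmcsapi>.
        // The cast to string returns null if the element is missing.
        return (string)doc.Root.Element("totalresults");
    }
}
```

In the question's code, doc.Element("totalresults") returns null, so reading .Value from it throws a NullReferenceException.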
