I want to get book information such as author name, page count, publish year, etc.
from Amazon using HtmlAgilityPack, but it seems Amazon's web pages have some quirks and I can't access the appropriate fields.
Here is what I've done:
I use Firefox with Firebug + FirePath to retrieve the desired XPath, and then in my code I call HtmlAgilityPack and instruct it to get the information using the XPath I got from Firebug,
but no luck; so far I can't access the "Product Details" part of the amazon.com page.
And this is my XPath (which only works with HtmlAgilityPack):
HtmlAgilityPack.HtmlNodeCollection cnt = doc.DocumentNode.SelectNodes("//*[@class='content']");
int i = 1;
foreach (HtmlAgilityPack.HtmlNode content in cnt)
{
    if (i != 3)
    {
        i++;
        continue;
    }
    if (i == 3) // i==3 means I've reached the product details but I can't go any further :(
    {
        s = content.SelectSingleNode("").OuterHtml;
        // break;
    }
}
How can I access the Product Details section with an XPath that HtmlAgilityPack understands?
And why is the XPath syntax from Firebug + FirePath different from what HtmlAgilityPack expects?
As @Mystere said, I suggest using the API. But if you are doing this for test purposes, or just because you want to use web scraping to obtain the info (I'm not sure whether Amazon allows it or not; you should check that before doing this), here is the thing:
Why are you doing this?
s = content.SelectSingleNode("").OuterHtml;
The following is what you are looking for in case you want to get the HTML source of that part of the page.
s = content.OuterHtml;
When you are scraping, I suggest trying to identify the part you need to scrape and looking at the particularities of that block of content.
If you use:
var node = doc.DocumentNode.SelectSingleNode("//td[@class='bucket']/div[@class='content']");
that will give you the Product Details block you are looking for.
If you want to get some fields like Paperback, Publisher, ... you can do:
string paperback = node.SelectSingleNode("./ul/li[1]/text()").InnerText;
string publisher = node.SelectSingleNode("./ul/li[2]/text()").InnerText;
string language = node.SelectSingleNode("./ul/li[3]/text()").InnerText;
...
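Putting the pieces above together, here is a minimal sketch (the URL is a placeholder product page, and the li positions assume the Product Details markup discussed above):

var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.amazon.com/dp/SOME-ASIN"); // placeholder URL

// The Product Details block, as located by the XPath above.
var details = doc.DocumentNode.SelectSingleNode("//td[@class='bucket']/div[@class='content']");
if (details != null)
{
    string paperback = details.SelectSingleNode("./ul/li[1]/text()").InnerText.Trim();
    string publisher = details.SelectSingleNode("./ul/li[2]/text()").InnerText.Trim();
    Console.WriteLine(paperback + " | " + publisher);
}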
If you want to be sure that the XPath you are using will be correct for HtmlAgilityPack, open the page in Internet Explorer 8 (or 9) and use the Developer Tools (F12) to get the XPath. The thing is that each browser renders the HTML in its own way. For example, in Firefox you will always see <tbody> tags right after a <table>, whereas HtmlAgilityPack may not see them at all, and that simple detail of adding /tbody/ to your XPath can make your program fail.
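To illustrate that <tbody> mismatch, compare two hypothetical XPaths for the same table (the table id is made up; doc is the document loaded in the sketch above):

// XPath as copied from Firefox/Firebug, which inserts <tbody> while rendering:
string fromFirebug = "//table[@id='productDetails']/tbody/tr[1]/td";
// XPath matching the raw HTML that HtmlAgilityPack actually parses:
string forAgilityPack = "//table[@id='productDetails']/tr[1]/td";
var cell = doc.DocumentNode.SelectSingleNode(forAgilityPack);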
Why don't you just use Amazon's web service API, which is designed to do this?
I am working on a crawler that uses PuppeteerSharp to get web pages and analyze them.
My problem is that I cannot catch/find links that are inside a div with a specific class, and I have not been able to find any documentation on how to do this.
I have only been able to find documentation on how to capture/find all links on the page, with this syntax:
var jsSelectAllAnchors = @"Array.from(document.querySelectorAll('a')).map(a => a.href);";
var urls = await page.EvaluateExpressionAsync<string[]>(jsSelectAllAnchors);
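For what it's worth, the same kind of expression can be scoped with a CSS selector; here is a hedged variant where "product-list" is only a placeholder for the real class name:

// Same approach, but restricted to anchors inside a div with a specific class.
var jsSelectScopedAnchors = @"Array.from(document.querySelectorAll('div.product-list a')).map(a => a.href);";
var scopedUrls = await page.EvaluateExpressionAsync<string[]>(jsSelectScopedAnchors);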
Is there anyone who can help?
Thanks in advance!
I'm sorry if this question has already been answered, but I have literally spent more than two weeks searching the Internet for a solution to my issue.
Now, I definitely do not perform the best Google searches, and it might seem that my question has several effective answers on the Internet, but I really tried every single solution that I found, without any positive results.
What I'm trying to do is simple, and I did it successfully on many websites:
Navigate to a website using WebBrowser (1).
Wait for everything to load properly (DocumentCompleted event).
Download the page using the DocumentText property (1).
(1): I also use WebClient from time to time.
And there it is: I get the HTML page, and I can exploit it any way I like. The issue is with a particular website whose full content I cannot obtain in spite of trying all the different solutions that I found. I suspected that this page might need to load several scripts before the full content appears. Yet I read that WebBrowser runs all the necessary scripts before triggering the DocumentCompleted event, so apparently that's not the issue. The page I'm asking about is: http://www.coolmod.com/tarjetas-graficas-nvidia-pci-express
I tried, after the WebBrowser loads the entire page, looking up random elements with GetElementById and checking whether I get a null result. It appears that when I try to get an element that does not belong to the product list, I'm successful, but whenever I try to get an element that belongs to the list itself, I always get null. That means the list itself does not load, and I really don't know why. By the way, I do not prevent WebBrowser.Navigate() from delivering multiple responses; I allow it to give as many callbacks as possible, and still the product list does not load, even when I pass the cookies. I even tried copying all the content of the document and pasting it through the clipboard. Here is a simple example of what I try to do:
private void catalogueDownload()
{
    System.Windows.Forms.WebBrowser wb = new System.Windows.Forms.WebBrowser();
    wb.ScriptErrorsSuppressed = true;
    wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(Catalogue_DocumentCompleted);
    wb.Navigate("http://www.coolmod.com/tarjetas-graficas-nvidia-pci-express");
}

public void Catalogue_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = sender as System.Windows.Forms.WebBrowser;
    string output = wb.DocumentText;
    File.WriteAllText("testing.html", output);
}
Thanks for giving up your time to read all this.
System.Windows.Forms.WebBrowser is a bit outdated. If I were you, I would consider using an external library for this; Selenium would be my first choice, given that it has all the necessary integrations with the .NET Framework (and a lot of other languages).
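For instance, a minimal Selenium sketch for the page from the question might look like this (it assumes the Selenium.WebDriver package and a matching ChromeDriver; the ".product-name" selector is a guess you would replace after inspecting the page):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class CatalogueScraper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.coolmod.com/tarjetas-graficas-nvidia-pci-express");

            // Wait until the script-generated product list has rendered.
            // ".product-name" is a placeholder selector; use the one the real list uses.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => d.FindElements(By.CssSelector(".product-name")).Count > 0);

            // PageSource now contains the JavaScript-rendered HTML.
            System.IO.File.WriteAllText("testing.html", driver.PageSource);
        }
    }
}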
I am creating an application that interfaces with Google's Maps API v3. My current approach is using a WebBrowser control with WebBrowser.Navigate("Map.html"). This is working correctly at the moment; however, I am also aware of WebBrowser.InvokeScript(). I have seen it used to execute a JavaScript function, but I would like to have something like the following structure:
APICalls.js - Contains different functions that can be called, or even separated out into a file for each function if necessary.
MapInterface.cs
WebBrowser.InvokeScript("APICalls.js", args) - Or control the javascript variables directly.
I have seen the InvokeScript method used, but none of the examples gave any detail about where the function came from, so I'm not sure whether it was being called from an HTML file or a JS file. Is it possible to have a structure like this, or a similarly organized one, rather than creating an HTML file with JavaScript in each one and using Navigate()?
Additionally, are there any easier ways to use Google Maps with WPF? I checked around, but all of the resources I found were at least 2-3 years old, which I believe predates the newest version of the Maps API.
I can't suggest a better way of using Google Maps API with WPF (although I'm sure it exists), but I can try to answer the rest of the question.
First, make sure to enable FEATURE_BROWSER_EMULATION for your WebBrowser app, so the Google Maps API recognizes it as a modern, HTML5-capable browser.
Then, navigate to your "Map.html" page and let it finish loading. Here's how it can be done using async/await (the code is for the WinForms version of WebBrowser control, but the concept remains the same).
You can have your APICalls.js as a separate local file, but you'd need to create and populate a <script> element for it from C#. You do it once for the session.
Example:
var scriptText = File.ReadAllText("APICalls.js");
// Use the underlying COM document so the DOM calls below bind dynamically.
dynamic htmlDocument = webBrowser.Document.DomDocument;
var script = htmlDocument.createElement("script");
script.type = "text/javascript";
script.appendChild(htmlDocument.createTextNode(scriptText));
htmlDocument.body.appendChild(script);
Then you can call functions from this script in a few different ways.
For example, your JavaScript entry point function in APICalls.js may look like this:
(function() {
    window.callMeFromCsharp = function(arg1, arg2) {
        window.alert(arg1 + ", " + arg2);
    }
})();
Which you could call from C# like this:
webBrowser.InvokeScript("callMeFromCsharp", "Hello", "World!");
[UPDATE] If you're looking for a bit more modular or object-oriented approach, you can utilize the dynamic feature of C#. Example:
JavaScript:
(function() {
    window.apiObject = function() {
        return {
            property: "I'm a property",
            Method1: function(arg) { alert("I'm method 1, " + arg); },
            Method2: function() { return "I'm method 2"; }
        };
    }
})();
C#:
dynamic apiObject = webBrowser.InvokeScript("apiObject");
string property = apiObject.property;
MessageBox.Show(property);
apiObject.Method1("Hello!");
MessageBox.Show(apiObject.Method2());
We would like to automate certain tasks on a website, like having a user log in, perform some functionality, read their account history, etc.
We have tried emulating this with plain POSTs/GETs, but the problem is that, for 'login' for example, the website uses JavaScript code to execute an AJAX call and also generates some random tokens.
Is it possible to literally emulate a web-browser? For example:
Visit 'www.[test-website].com'
Fill in these DOM items
DOM item 'username' fill in with 'testuser'
DOM item 'password' fill in with 'testpass'
Click button DOM item 'btnSubmit'
Visit account history
Read HTML (So we can parse information about each distinct history item)
...
The above could be translated into, say, the following sample code:
var browser = new Browser();
var pageHomepage = browser.Load("www.test-domain.com");
pageHomepage.DOM.GetField("username").SetValue("testUser");
pageHomepage.DOM.GetField("password").SetValue("testPass");
pageHomepage.DOM.GetField("btnSubmit").Click();
var pageAccountHistory = browser.Load("www.test-domain.com/account-history/");
var html = pageAccountHistory.GetHtml();
var historyItems = parseHistoryItems(html);
You could use, for example, Selenium in C#. There is a good tutorial: Data Driven Testing Using Selenium (WebDriver) in C#.
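A rough Selenium sketch of the flow described in the question (the URL and the element ids "username", "password" and "btnSubmit" are the question's own placeholders):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class LoginAndReadHistory
{
    static void Main()
    {
        using (IWebDriver browser = new ChromeDriver())
        {
            // Log in through the real form, so the site's JavaScript and tokens run as usual.
            browser.Navigate().GoToUrl("http://www.test-domain.com");
            browser.FindElement(By.Id("username")).SendKeys("testUser");
            browser.FindElement(By.Id("password")).SendKeys("testPass");
            browser.FindElement(By.Id("btnSubmit")).Click();

            // Then visit the account history page and grab the rendered HTML for parsing.
            browser.Navigate().GoToUrl("http://www.test-domain.com/account-history/");
            string html = browser.PageSource;
            Console.WriteLine(html.Length);
        }
    }
}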
I would suggest instantiating a WebBrowser control in code and doing all your work with that instance, but never showing it on any form. I've done this several times and it works pretty well. The only flaw is that it relies on Internet Explorer ;-)
Try JMeter; it is a nice tool for automating web requests, and it is also quite popular for performance testing of websites.
Or just try System.Windows.Forms.WebBrowser, for example:
this.webBrowser1.Navigate("http://games.powernet.com.ru/login");

// Crude wait: pump Windows messages until the document has finished loading.
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
    System.Windows.Forms.Application.DoEvents();

// Fill in the login form fields by their element ids.
HtmlDocument doc = webBrowser1.Document;
HtmlElement elem1 = doc.GetElementById("login");
elem1.Focus();
elem1.InnerText = "login";
HtmlElement elem2 = doc.GetElementById("pass");
elem2.Focus();
elem2.InnerText = "pass";
I've been given the task of content migration from another CMS system to SharePoint 2010.
The data in the old system is fairly easy to capture and the page hierarchy is simple so I'm not worried about that.
However, I am completely flummoxed about how to even create a page in code. I'm using the Microsoft.SharePoint.Client namespace, as I do not have SharePoint installed on my system and want to code this up as a console application, so I'm using ClientContext. (On the other hand, I am willing to consider other solutions if necessary.)
My end-game: To get a page uploaded into some folder hierarchy which uses a master page, has the page title in a header web part, and a big ol' content-editable web part in the body so any user can come along and edit the content.
Things I've tried so far:
Using FileCollection.Add() to add an .aspx file to the folder "Site Pages". This renders the HTML in the browser but doesn't enable any features for the user to edit the page.
Using ListItemCollection.Add() to add a page to the site, but I didn't know what fields I needed. Also, I remember it came up with a runtime error saying I should use FileCollection.Add().
Uploading to 'Site Pages' instead of 'Pages'.
So many others... ow my head :(
The only plausible thing I can see on the net is to use the PublishingPage type along with PublishingWeb. However, PublishingWeb can only be constructed from an SPWeb object, which requires me to actually host the SharePoint application on my workstation.
If anyone can lend a hand that would be greatly appreciated :)
Here is a method I use to create pages. It seems a more supported way of creating pages than Mr. Aquino's. Though this is for MOSS 2007, I'm sure the equivalent exists in 2010. Also, I'd recommend creating console apps using the full object model. You'll have to run them on the server itself, but that doesn't seem like much of a problem for a migration. This way you won't be limited in any way.
public static void CreatePage(string url, string pageName, string title, string layoutName, Dictionary<string, string> fieldDataCollection)
{
    var relUrl = new Uri(url);
    using (SPSite site = new SPSite(url))
    using (SPWeb web = site.AllWebs[relUrl.AbsolutePath])
    {
        if (!PublishingWeb.IsPublishingWeb(web))
            throw new ArgumentException("The specified web is not a publishing web.");
        PublishingWeb pubweb = PublishingWeb.GetPublishingWeb(web);
        PageLayout layout = null;
        string availableLayouts = string.Empty;
        foreach (PageLayout lo in pubweb.GetAvailablePageLayouts())
        {
            availableLayouts += "\t" + lo.Name + "\r\n";
            if (lo.Name.ToLowerInvariant() == layoutName.ToLowerInvariant())
            { layout = lo; break; }
        }
        if (layout == null)
            throw new ArgumentException("The layout specified could not be found. Available layouts are:\r\n" + availableLayouts);
        if (!pageName.ToLowerInvariant().EndsWith(".aspx")) pageName += ".aspx";
        PublishingPage page = pubweb.GetPublishingPages().Add(pageName, layout);
        page.Title = title;
        SPListItem item = page.ListItem;
        foreach (string fieldName in fieldDataCollection.Keys)
        {
            string fieldData = fieldDataCollection[fieldName];
            try
            {
                SPField field = item.Fields.GetFieldByInternalName(fieldName);
                if (field.ReadOnlyField)
                {
                    Console.WriteLine("Field '{0}' is read only and will not be updated.", field.InternalName);
                    continue;
                }
                if (field.Type == SPFieldType.Computed)
                {
                    Console.WriteLine("Field '{0}' is a computed column and will not be updated.", field.InternalName);
                    continue;
                }
                if (field.Type == SPFieldType.URL)
                {
                    item[field.Id] = new SPFieldUrlValue(fieldData);
                }
                else if (field.Type == SPFieldType.User)
                {
                    // AddListItem.SetUserField(web, item, field, fieldData);
                }
                else
                {
                    item[field.Id] = fieldData;
                }
            }
            catch (ArgumentException)
            {
                Console.WriteLine("WARNING: Could not set field {0} for item {1}.", fieldName, item.ID);
            }
        }
        page.Update();
    }
}
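A quick usage sketch for the helper above (the site URL, page layout name and field names are placeholders; it assumes the same usings as the method itself):

// Hypothetical call to CreatePage; all values below are placeholders.
var fields = new Dictionary<string, string>
{
    { "PublishingPageContent", "<p>Body content migrated from the old CMS</p>" },
    { "Comments", "Imported by the migration console app" }
};
CreatePage("http://sharepoint/sites/publishing", "migrated-page", "Migrated Page", "ArticleLeft.aspx", fields);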
I don't see a way of creating a publishing page without the actual publishing methods.
When you create a new article page, it only stores a few XML parameters inside the page; the layout itself lives in the /_catalogs/masterpage/article-XXXX.aspx file.
You can try downloading a native file created in the Pages document library, understanding its structure, filling the XML with your data and then uploading it back to the Pages document library using the FileCollection -- that's my only guess.
Edit: sample Article Page
<%@ Page Inherits="Microsoft.SharePoint.Publishing.TemplateRedirectionPage,Microsoft.SharePoint.Publishing,Version=12.0.0.0,Culture=neutral,PublicKeyToken=71e9bce111e9429c" %>
<%@ Reference VirtualPath="~TemplatePageUrl" %>
<%@ Reference VirtualPath="~masterurl/custom.master" %>
<html xmlns:mso="urn:schemas-microsoft-com:office:office" xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"><head>
<!--[if gte mso 9]><xml>
<mso:CustomDocumentProperties>
<mso:PublishingContact msdt:dt="string">1073741823</mso:PublishingContact>
<mso:display_urn_x003a_schemas-microsoft-com_x003a_office_x003a_office_x0023_PublishingContact msdt:dt="string">System Account</mso:display_urn_x003a_schemas-microsoft-com_x003a_office_x003a_office_x0023_PublishingContact>
<mso:PublishingContactPicture msdt:dt="string"></mso:PublishingContactPicture>
<mso:PublishingContactName msdt:dt="string"></mso:PublishingContactName>
<mso:ContentTypeId msdt:dt="string">0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390078FB5FE740F6714B9595501175ECD8F000727044016EAB3B45B9E104498E366C85</mso:ContentTypeId>
<mso:Comments msdt:dt="string"></mso:Comments>
<mso:PublishingContactEmail msdt:dt="string"></mso:PublishingContactEmail>
<mso:PublishingPageLayout msdt:dt="string">http://dmserver008/_catalogs/masterpage/ArticlePage.aspx, EstudoAndre</mso:PublishingPageLayout>
</mso:CustomDocumentProperties>
</xml><![endif]--><title>New Article</title></head>
To grab one, hit the Pages library => Content Menu => Send To => Download a Copy
Uploading a page file should work, as long as you get the settings right on the item as well as the document itself. After you upload the file you can set the content type and properties appropriately. If you create a page manually first, you should be able to get an object that has all the right settings.
However, I would strongly recommend getting set up to develop a console app that will run on the SharePoint server rather than relying on the web services. The server-side APIs (including PublishingPage) tend to be a lot easier to work with.
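If you do stay with the client object model, a hedged sketch of that upload-then-fix-up approach could look like the following (site URL, list name, file names and field values are placeholders; copy the real page layout value from a page you created manually):

using Microsoft.SharePoint.Client;
using SPFile = Microsoft.SharePoint.Client.File;

class PageUploader
{
    static void Main()
    {
        // Placeholders throughout: site URL, list name, file paths and field values.
        var ctx = new ClientContext("http://server/sites/publishing");
        List pages = ctx.Web.Lists.GetByTitle("Pages");
        ctx.Load(pages.RootFolder);
        ctx.ExecuteQuery();

        // Upload the .aspx file into the Pages library.
        var fileInfo = new FileCreationInformation
        {
            Url = "MigratedPage.aspx",
            Content = System.IO.File.ReadAllBytes(@"C:\migration\MigratedPage.aspx"),
            Overwrite = true
        };
        SPFile uploaded = pages.RootFolder.Files.Add(fileInfo);

        // Then fix up the list item; copy the real layout/content type values
        // from a page you created manually through the UI.
        ListItem item = uploaded.ListItemAllFields;
        item["Title"] = "Migrated page";
        // item["PublishingPageLayout"] = "<layout url>, <layout title>";
        item.Update();
        ctx.ExecuteQuery();
    }
}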