How to extract data from website using AngleSharp & LINQ?


I'm trying to extract the prices from the below mentioned website. I'm using AngleSharp for the extraction. In the website, the prices are listed below (as an example):
<span class="c-price">650.00 </span>
I'm using the following code for the extraction.
using AngleSharp.Parser.Html;
using System.Net;
using System.Net.Http;
//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();
//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();
//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);
//Do something with LINQ
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());
However, I'm not getting any items, but they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use? And what code should I use?

I am late to the party, but I will try to bring some sanity here.
Querying static webpages
For this we require the following set of tools / functionality:
HTTP requester (to obtain resources, e.g., HTML documents, via HTTP), potentially with an SSL/TLS layer on top (either accepting all certificates or working against the certificate store / known CAs)
HTML parser
A queryable object model representation of the parsed HTML document
Maybe additionally some cookie state and the ability to follow links / post forms
AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).
We would start by creating an AngleSharp configuration that defines which capabilities are available for the browsing engine. This engine is exposed in form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (either from a local source, a constructed source, or a remote source).
var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");
Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we look for.
AngleSharp embraces LINQ (or IEnumerable in general). However, it makes sense to give full power to the queries if possible.
So instead of
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
We write
var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");
This is also much more robust. ClassList is a complex object giving access to a list of classes, so you probably meant either ClassList.Contains or ClassName.Equals (the latter working on the string representation). Note: The two versions are not equivalent, because the former looks for a class within the list of classes, while the latter looks for a match of the whole class serialization (thus imposing an extra condition on the match: it needs to be the only class).
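The difference between the three predicates can be checked on a small in-memory document. The following is a minimal sketch: the markup (including the second span with the extra "sale" class) is made up for illustration, and it assumes a recent AngleSharp version in which OpenAsync accepts a virtual response.

```csharp
using System;
using System.Linq;
using AngleSharp;

// Hypothetical markup: one span with only "c-price", one with an extra class.
const string html = @"<span class=""c-price"">650.00 </span>
<span class=""c-price sale"">499.00 </span>";

// No loader is needed because the document is built from a string.
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

// CSS selector: matches every span that has the class, whatever else it has.
int bySelector = document.QuerySelectorAll("span.c-price").Length;

// ClassList.Contains: same membership test as the selector.
int byClassList = document.All
    .Count(m => m.LocalName == "span" && m.ClassList.Contains("c-price"));

// ClassName equality: only matches when "c-price" is the whole attribute value.
int byClassName = document.All
    .Count(m => m.LocalName == "span" && m.ClassName == "c-price");

Console.WriteLine($"selector={bySelector}, ClassList={byClassList}, ClassName={byClassName}");
```

On this sample the selector and ClassList.Contains agree, while the ClassName comparison skips the span that carries a second class.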
Dealing with dynamic pages
This is far more complicated. The basics are the same as previously, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need
A JavaScript engine
A valid CSSOM
A fake (or even fully computed) rendering tree
A lot more DOM interfaces that can be found in real browsers (e.g., navigator, full history, web workers, ...) - the list is limitless here
While there is a project that delivers an experimental (and limited) C#-only JS engine to AngleSharp, the latter two requirements cannot be completely fulfilled right now. Furthermore, the CSSOM may not be complete enough for some web applications. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).
Long story short:
var config = Configuration.Default
.WithDefaultLoader()
.WithCss()
.WithJavaScript(); // maybe even more
var context = BrowsingContext.New(config);
The Task behind the await when opening a new document is equivalent to a load event in the DOM. Thus it will not fire when the document was downloaded and parsed, but only once all scripts have been loaded (and potentially run) incl. resources that needed to be downloaded.
Hope this helps a bit!

Related

Get HTML Code from a website after it completed loading

I am trying to get the HTML Code from a specific website async with the following code:
var response = await httpClient.GetStringAsync("url");
But the problem is that the website usually takes another second to load its other parts, which I need. So the question is whether I can load the site first and read the content after a certain amount of time.
Sorry if this question already got answered, but I didn't really know what to search for.
Thanks,
Twenty
Edit #1
If you want to try it yourself the URL is http://iloveradio.de/iloveradio/, I need the Title and the Artist which do not immediately load.

You are heading in the wrong direction. The referenced site has a playlist API which returns JSON. You can get the information from:
http://iloveradio.de/typo3conf/ext/ep_channel/Scripts/playlist.php
Edit: The Chrome Inspector was used to find the playlist link.

You could use Puppeteer-Sharp:
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = false }))
using (var page = await browser.NewPageAsync())
{
await page.SetViewportAsync(new ViewPortOptions() { Width = 1280, Height = 600 });
await page.GoToAsync("http://iloveradio.de/iloveradio/");
await page.WaitForSelectorAsync("#artisttitle DIV");
var artist = await page.EvaluateExpressionAsync<string>("$('#artisttitle DIV')[0].innerText");
Console.WriteLine(artist);
Console.ReadLine();
}

If there are things that load afterwards, it means that they are generated by JavaScript code after page load (an AJAX request, for example), so no matter how long you wait, the response won't have the content you want (because it is not in the source code when it loads).
Easy way to do it:
Use a WebBrowser control and, when the DocumentCompleted event fires, wait until the element you want appears.
The right way:
Find the JavaScript yourself and trigger it yourself (easy to say, hard to do).

The thing to understand here is that when you read the response from the URL, all you will ever get is the raw response, in this case the HTML source code the server replied with.
Unlike what you might see in your browser's DOM Inspector developer tools, you will only get the original HTML source from the page (what you might see in the "Page Source" developer tool) which will not include any dynamically created content (JavaScript) or loaded content (like iframes).
So you aren't getting what you see in the DOM Inspector. You are getting what you see in the Page Source (View > Developer > View Source in Chrome).
You can't wait for that other content to load because it will never load since that HTML content isn't being parsed or rendered like a browser would.
You have several options available though:
See if the website has an API you can use
Determine where that content you want is actually loaded from, and make another/different HTTP request to that content (the Network Panel is helpful here)
Use a headless browser to programmatically load the page and dynamically read the page contents (this will add a lot of overhead, and should probably be avoided if possible)

I have checked out the website; the data is loaded by JavaScript. With httpClient.GetStringAsync("url") you can only get the original HTML.
As far as I know, there is no way to get elements that are created or manipulated by the browser this way.

Simplified LDAP/AD Server on C#

I've searched without much success for the simplest (yet working) example of an LDAP/AD server written in C#. Many libraries exist to connect to LDAP servers, but not an LDAP server itself (in C#).
I did, however, find some information about it, and even a post requesting a simple LDAP server that was answered with "LDAP isn't simple". I have read much of RFC 4511 and the sample code of the Flexinet LDAP Server on GitHub, but unfortunately I don't yet have the knowledge to complete its code.
My goal is not to make a fully functional LDAP server, but one that can at least do the following:
Serve as a login pool for software that allows its users to be registered and log on against an AD/LDAP server (just checking login and password for authentication).
Allow software like Outlook and Thunderbird to get a list of users (without passwords) with first and last name, e-mail address, phone number and department, as a contact list.
No delete, add (or create), move, or other functions are required, since the main software that I aim to integrate it with will do all the user and group management.
UPDATE
I'm trying to implement the Flexinet sample and adjust it to these functionalities. As a concrete question: how should I change this function to prevent it from throwing an exception when it is called from an LDAP client? It always breaks on the "var filter = searchRequest.ChildAttributes[6];" line:
private void HandleSearchRequest(NetworkStream stream, LdapPacket requestPacket)
{
var searchRequest = requestPacket.ChildAttributes.SingleOrDefault(o => o.LdapOperation == LdapOperation.SearchRequest);
var filter = searchRequest.ChildAttributes[6];
if ((LdapFilterChoice)filter.ContextType == LdapFilterChoice.equalityMatch && filter.ChildAttributes[0].GetValue<String>() == "sAMAccountName" && filter.ChildAttributes[1].GetValue<String>() == "testuser") // equalityMatch
{
var responseEntryPacket = new LdapPacket(requestPacket.MessageId);
var searchResultEntry = new LdapAttribute(LdapOperation.SearchResultEntry);
searchResultEntry.ChildAttributes.Add(new LdapAttribute(UniversalDataType.OctetString, "cn=testuser,cn=Users,dc=dev,dc=company,dc=com"));
searchResultEntry.ChildAttributes.Add(new LdapAttribute(UniversalDataType.Sequence));
responseEntryPacket.ChildAttributes.Add(searchResultEntry);
var responsEntryBytes = responseEntryPacket.GetBytes();
stream.Write(responsEntryBytes, 0, responsEntryBytes.Length);
}
var responseDonePacket = new LdapPacket(requestPacket.MessageId);
responseDonePacket.ChildAttributes.Add(new LdapResultAttribute(LdapOperation.SearchResultDone, LdapResult.success));
var responseDoneBytes = responseDonePacket.GetBytes();
stream.Write(responseDoneBytes, 0, responseDoneBytes.Length);
}
The code is on the github link.

Finally, I made a fork of the Flexinet LDAP Server at #Sammuel-Miranda/LdapServerLib, and with the author's support and some changes and adaptations I completed this implementation. It responds to the bind and search calls and works perfectly for Outlook and Thunderbird to use as a shared address book.
I did not, however, implement any ADD/MODIFY/DELETE requests (it would not be hard to do) since I don't need them.

I found the explanation of how the search works in RFC 4511, and I "kind of" understand it, though not very well. I also see that the method implemented in the Flexinet LDAP Server on GitHub only answers bind and search requests for one single user (since it is only an example implementation).
The client makes different calls to verify capabilities, structure and other information before making the search request itself. So I'll implement them all, one by one.
Still, if any other library (in C#) exists and anyone knows about it, that would be better than writing a whole new server. If my implementation works, I'll fork it on GitHub and share it.
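Because clients send several differently shaped requests before the actual search, a handler that indexes ChildAttributes[6] directly can fail whenever the decoded request is missing or has fewer children than expected. One way to guard against that is sketched below. This is only an illustration under stated assumptions: TryGetChild is a hypothetical helper, and a plain list of strings stands in for the library's decoded attributes. The field order is the one RFC 4511 defines for SearchRequest.

```csharp
using System;
using System.Collections.Generic;

// Field order of SearchRequest as defined in RFC 4511, section 4.5.1.
string[] searchRequestFields =
{
    "baseObject", "scope", "derefAliases", "sizeLimit",
    "timeLimit", "typesOnly", "filter", "attributes"
};

// Hypothetical helper: reports failure instead of throwing when the list is
// null or shorter than the requested index.
static bool TryGetChild<T>(IReadOnlyList<T> children, int index, out T child)
{
    if (children is null || index < 0 || index >= children.Count)
    {
        child = default;
        return false;
    }
    child = children[index];
    return true;
}

// The filter is the seventh field, which is why the sample reads index 6.
int filterIndex = Array.IndexOf(searchRequestFields, "filter");

// Stand-in for a decoded request that only carries three children.
var truncated = new List<string> { "dc=dev,dc=company,dc=com", "2", "0" };

if (!TryGetChild(truncated, filterIndex, out var filter))
{
    Console.WriteLine("No filter present; reply with SearchResultDone only.");
}
```

In the real handler the same check would wrap the access to the search request's children, so that an unexpected request still receives a SearchResultDone instead of crashing the connection.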

Parse page using AngleSharp

I want to parse a website using C# with AngleSharp. It's easy to do with static pages, but there is a problem: I can't parse information that is available only to authorized users. What should I do to log in to the website programmatically and parse all the information available to me?

Depending on the used authorization scheme this may either be super simple or ultra hard / impossible.
So let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and performing "simple" actions (e.g., clicking a button, submitting a form)
Running trivial JavaScript files
Here trivial means: scripts that do not need any capabilities beyond what AngleSharp offers (e.g., rendering tree information or advanced CSSOM access) and that can be handled by an ES5-compliant parser (i.e., they do not use ES6 or special non-standard features).
Now since I do not know what is the authorization scheme or exact problem that you are hitting (some code / MWE would be helpful!) I'll just go for a simple click example.
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader().WithCookies());
var loginPage = await context.OpenAsync("http://yourpage.com");
var loginForm = loginPage.QuerySelector<IHtmlFormElement>("form");
var profilePage = await loginForm.SubmitAsync(new { userName = "myUser", password = "password" });
// get something on profilePage
Note that in this example the form field names for the login form are userName and password - they may be different for your login page. Also note that your page may contain multiple forms and the selector could be more sophisticated than a simple form.
HTH!

cXML PunchOutSetupRequest and PunchOutSetupResponse examples in C#

I'm trying to implement punchout catalogs on our eComm site. Honestly, the documentation for cXML is a mess and all the code examples are in javascript and/or VB.Net (I use C# and would rather not have to try and translate). Does anyone out there have examples or samples of how to receive the PunchOutSetupRequest XML and then send out the PunchOutSetupResponse XML using C#? I've been unable to find anything on the interwebs (I've been looking for two days now)...
I'm hoping I can just do this inside an ActionResult (vs. a 'launch page' as suggested).
I'm a complete noob at punchouts and could really use some help here. The bosses are being pretty pushy, so any assistance would be greatly appreciated. Suggestions as to how to make this work would also be much appreciated.
I apologize to all for the vagueness of the question (request).

This isn't trivial, but this should get you started.
You'll need 3 generic handlers (.ashx): Setup, Start, and Order....
Setup and Order will receive HTTP Post with content-type of "text/xml". Look at HttpRequest.InputStream if needed to get the XML into a string. From there, look at LINQ-to-XML to dig out the data you want. Your HTTP Response to both of these will also be content-type "text/xml" and UTF8 encoded, returning the CXML as documented...use LINQ-to-XML to produce that.
The Setup handler will need to validate credentials and return a URL with a unique QueryString token pointing to the Start handler. Do not expect session persistence between Setup and Start, because they're not from the same caller. This handler will need to create an application object for the token and associated data you extracted from the cXML.
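The Setup flow described above can be sketched with LINQ-to-XML as follows. This is a minimal illustration, not a complete handler: the request is a trimmed, made-up sample (a real one carries a DOCTYPE and more header fields), and the host names, the Start.ashx URL and the expected secret are placeholders.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Trimmed, hypothetical PunchOutSetupRequest as it might arrive in the POST body.
const string requestXml = @"<cXML payloadID=""1@buyer.example"" timestamp=""2024-01-01T00:00:00Z"">
  <Header>
    <Sender>
      <Credential domain=""NetworkId"">
        <Identity>buyer</Identity>
        <SharedSecret>secret</SharedSecret>
      </Credential>
    </Sender>
  </Header>
  <Request>
    <PunchOutSetupRequest operation=""create"">
      <BuyerCookie>abc123</BuyerCookie>
      <BrowserFormPost><URL>https://buyer.example/return</URL></BrowserFormPost>
    </PunchOutSetupRequest>
  </Request>
</cXML>";

XDocument request = XDocument.Parse(requestXml);

// The explicit string conversion yields null when an element is missing.
var sharedSecret = (string)request.Descendants("SharedSecret").FirstOrDefault();
var buyerCookie = (string)request.Descendants("BuyerCookie").FirstOrDefault();
var postUrl = (string)request.Descendants("BrowserFormPost").Elements("URL").FirstOrDefault();

// Placeholder credential check; a real handler would look the sender up.
bool authorized = sharedSecret == "secret";

// Unique token that the Start handler later matches against stored data.
string token = Guid.NewGuid().ToString("N");
string startUrl = "https://shop.example/Start.ashx?token=" + token;

var response = new XDocument(
    new XElement("cXML",
        new XAttribute("payloadID", Guid.NewGuid() + "@shop.example"),
        new XAttribute("timestamp", DateTime.UtcNow.ToString("o")),
        new XElement("Response",
            new XElement("Status",
                new XAttribute("code", authorized ? "200" : "401"),
                new XAttribute("text", authorized ? "OK" : "Unauthorized")),
            authorized
                ? new XElement("PunchOutSetupResponse",
                    new XElement("StartPage", new XElement("URL", startUrl)))
                : null)));

Console.WriteLine(response);
```

The values of buyerCookie and postUrl are what the handler would store under the token, so that the Start handler and the later checkout form can find them again.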
The Start handler will be called as a simple GET, and will need to match the token in the QueryString to the appropriate application object, copy that data to the session, and then do a response.redirect to whatever page in your site you want the buyer to land on.
Once they populate their cart with some things, and are ready to check out, you'll take them to a page that has an embedded form (not to be confused with an ASP.Net form that posts back to your server) and a submit button (again, not an ASP.Net button). From your Setup handler, you captured a URL to point this form's Post, and within the form you'll have a hidden input tag with the UTF8 encoded CXML Punchout Order injected as the value produced with LINQ-to-XML. Highly recommend Base64 encoding that value to avoid ASP.Net messing with the tags it contains during rendering, and naming the hidden input "cxml-base64" per the documentation. The result is the form is client-side POSTed to your customer's server instead of yours, and their server will extract the CXML Punchout Order and that ends your visitor's session.
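The Base64 step for the hidden input could look roughly like this. It is a sketch under assumptions: the order document is a trimmed stand-in, and returnUrl represents the URL captured earlier by the Setup handler.

```csharp
using System;
using System.Net;
using System.Text;

// Trimmed stand-in for the punchout order produced with LINQ-to-XML.
const string orderXml = "<cXML><Message><PunchOutOrderMessage /></Message></cXML>";

// Placeholder for the form post URL captured by the Setup handler.
string returnUrl = "https://buyer.example/return";

// Base64 keeps the markup intact while it sits inside the HTML attribute.
string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(orderXml));

// Plain HTML form that posts straight to the buyer's server.
string form = $@"<form method=""post"" action=""{WebUtility.HtmlEncode(returnUrl)}"">
  <input type=""hidden"" name=""cxml-base64"" value=""{encoded}"" />
  <input type=""submit"" value=""Return to procurement"" />
</form>";

// Round trip: what the receiving side recovers from the field.
string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));

Console.WriteLine(form);
```

Decoding the field on the receiving side gives back the original UTF8 document byte for byte, which is the point of encoding it before rendering.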
The Order handler will receive a CXML OrderRequest and just like Setup, you'll dump that to a string and then use LINQ-to-XML to parse it and act upon it. Again you'll get credentials to verify, possibly a credit card to process, and the order items, ship-to, etc. Note that the OrderRequest may not contain all the items that were in the Punchout Order, because the system on your customer's side may remove items or even change item quantities before submitting the final OrderRequest to you. The OrderRequest could come back to you after the Punchout Order is posted to them in a matter of minutes, days, weeks, or never...don't bother storing the cart data in hopes of matching it to the order later.
Last note...the buyer may be experiencing your site in an iframe embedded in their web-based procurement UI, so design accordingly.
If you need more info, reply to this and I'll get back.
Update...Additional considerations:
Discuss with the buyer how they want fault handling to flow, particularly with orders, because you have a choice. 1) exhaustively evaluate everything in the CXML you receive and return response codes other than 200 if anything is wrong, or 2) always return a 200 Success and deal with any issues out of band or by generating a ConfirmationRequest that rejects the order. My experience is that a mix of the two works best. Certainly you should throw a non-200 if the credentials fail, but you may not want (or be able) to run a credit card or validate stock availability inline. Your buyer's system may not be able to cope with dozens of possible faults, and/or may not show your fault messages to the user for them to make corrections. I've seen systems that will flat-out discard any non-200 response code and just blindly retry the submission repeatedly on an interval for hours or days until it gives up on a sanity check, while others will handle response codes within certain ranges differently than others, for example a 4xx invokes a retry, while a 5xx is treated as fatal. Remember that Setup and Order are not coming directly from the user...their procurement system is generating those internally.
Update...answering the comment about how to test things...
You'd use the same method as you will for generating outbound ConfirmationRequest, ShipNoticeRequest, and InvoiceDetailRequest, all of which generally are produced on your side after receiving an OrderRequest from your customer's procurement system.
Start with Linq-To-XML for an example of crafting your outgoing cXML (Creating XML Trees section). Combine that example with this bit of code:
StringBuilder output = new StringBuilder();
XmlWriterSettings objXmlWriterSettings = new XmlWriterSettings();
objXmlWriterSettings.Indent = true;
objXmlWriterSettings.NewLineChars = Environment.NewLine;
objXmlWriterSettings.NewLineHandling = NewLineHandling.Replace;
objXmlWriterSettings.NewLineOnAttributes = false;
objXmlWriterSettings.Encoding = new UTF8Encoding();
using (XmlWriter objXmlWriter = XmlWriter.Create(output, objXmlWriterSettings)) {
XElement root = new XElement("Root",
new XElement("Child", "child content")
);
root.Save(objXmlWriter);
}
Console.WriteLine(output.ToString());
So at this point the StringBuilder (output) has your whole cXML, and you need to POST it someplace. Your Web Application project, started with F5 and a default.aspx page will be listening on localhost and some port (you'll see that in the URL it opens). Separately, perhaps using VS Express for Desktop, you have the above code in a console app that you can run to do the Post using something like this:
// Requires: using System.IO; using System.Net; using System.Text; using System.Xml;
HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create("http://localhost:12345/handler.ashx");
// Encode once so that ContentLength is the byte count actually sent.
byte[] objPayload = Encoding.UTF8.GetBytes(output.ToString());
objRequest.Method = "POST";
objRequest.UserAgent = "Some User Agent";
objRequest.ContentLength = objPayload.Length;
objRequest.ContentType = "text/xml";
using (Stream objRequestStream = objRequest.GetRequestStream()) {
objRequestStream.Write(objPayload, 0, objPayload.Length);
}
WebResponse objWebResponse = objRequest.GetResponse();
XmlReaderSettings objXmlReaderSettings = new XmlReaderSettings();
objXmlReaderSettings.DtdProcessing = DtdProcessing.Ignore;
XmlReader objXmlReader = XmlReader.Create(objWebResponse.GetResponseStream(), objXmlReaderSettings);
// Copy the response XML into a memory stream, re-encoded with the writer settings above.
MemoryStream objMemoryStream2 = new MemoryStream();
XmlWriter objXmlWriter2 = XmlWriter.Create(objMemoryStream2, objXmlWriterSettings);
objXmlWriter2.WriteNode(objXmlReader, true);
objXmlWriter2.Flush();
objXmlWriter2.Close();
objWebResponse.Close();
// Reset current position to the beginning so we can read all below.
objMemoryStream2.Position = 0;
StreamReader objStreamReader = new StreamReader(objMemoryStream2, Encoding.UTF8);
Console.WriteLine(objStreamReader.ReadToEnd());
objStreamReader.Close();
Since your handler should be producing cXML you'll see that spat out in the console. If it pukes, you'll get a big blob of debug mess in the console, which of course will help you fix whatever is broken.
Pardon the verbosity in the variable names; it is there to try to make things clear.

How can I use Google Search in my C# project?

I am building a mini question answering system in C#. I need to retrieve documents via Google Search.
What is the name of the Google tool that I can use in my project?
Thanks

One possibility is to set up a custom Google search engine. Then you also need to create a developer key, which I believe is done under the console.
After setting that up, you can make a REST-style call with code such as the following, which retrieves the results as JSON:
WebClient w = new WebClient();
string result;
string uri;
string googleAPIKey = "your developer key";
string googleEngineID = "your search engine id";
string template = "https://www.googleapis.com/customsearch/v1?key={0}&cx={1}&q={2}&start={3}&alt=json";
int startIndex = 1;
int gathered = 0;
uri = String.Format(template, googleAPIKey, googleEngineID, "yoursearchstring", startIndex);
result = w.DownloadString(uri);
For extracting the information from the JSON results, you can use something like Json.NET. It makes it extremely easy to read the information:
JObject o = JObject.Parse(result);
Then you can directly access the desired information with a single line of code.
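As an illustration of reading the result, the sketch below pulls the title and link of each hit out of a response body. It uses System.Text.Json from the base class library instead of Json.NET, purely to stay dependency-free. The sample JSON is a trimmed, made-up response, and ReadItems is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Trimmed, hypothetical response body; real responses carry many more fields.
const string result = @"{""items"":[
  {""title"":""First hit"",""link"":""https://example.com/1""},
  {""title"":""Second hit"",""link"":""https://example.com/2""}]}";

// Hypothetical helper: collects (title, link) pairs from the "items" array.
static List<(string Title, string Link)> ReadItems(string json)
{
    var hits = new List<(string Title, string Link)>();
    using JsonDocument doc = JsonDocument.Parse(json);

    // Treat a missing "items" property as "no results" instead of failing.
    if (doc.RootElement.TryGetProperty("items", out JsonElement items))
    {
        foreach (JsonElement item in items.EnumerateArray())
        {
            hits.Add((item.GetProperty("title").GetString() ?? "",
                      item.GetProperty("link").GetString() ?? ""));
        }
    }
    return hits;
}

foreach (var (title, link) in ReadItems(result))
{
    Console.WriteLine($"{title} -> {link}");
}
```

With Json.NET the same values should be reachable through the JObject indexers, for example o["items"][0]["title"].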
One important piece of information is that the search API free usage is extremely limited (100 requests per day). So for a real-world application it would probably be necessary to pay for the search. But depending on how you use it, maybe 100 requests per day is sufficient. I wrote a little mashup using the Google search API to search for Stackoverflow site information of interest and then used the StackExchange API to retrieve the information. For that personal use, it works very well.

I've never used it before (and it's alpha), but take a look at Google APIs for .NET Framework library.
