Parse page using AngleSharp

I want to parse a website using C# with AngleSharp. It's easy to do with static pages, but there is a problem: I can't parse info available only to authorized users. What should I do to log in to the website programmatically and parse all the info available to me?

Depending on the used authorization scheme this may either be super simple or ultra hard / impossible.
So let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and perform "simple" actions (e.g., clicking a button, submitting a form)
Running trivial JavaScript files
Here, trivial means scripts that need no capabilities beyond what AngleSharp offers (e.g., no rendering tree information or advanced CSSOM access) and that an ES5-compliant parser can handle (i.e., no ES6 or special non-standard features).
Now since I do not know what is the authorization scheme or exact problem that you are hitting (some code / MWE would be helpful!) I'll just go for a simple click example.
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader().WithCookies());
var loginPage = await context.OpenAsync("http://yourpage.com");
var loginForm = loginPage.QuerySelector<IHtmlFormElement>("form");
var profilePage = await loginForm.SubmitAsync(new { userName = "myUser", password = "password" });
// get something on profilePage
Note that in this example the form field names for the login form are userName and password - they may be different for your login page. Also note that your page may contain multiple forms and the selector could be more sophisticated than a simple form.
HTH!

Related

unable to get complete url after # using query string in asp.net c# [duplicate]

I know that on the client side (JavaScript) you can use window.location.hash, but I could not find any way to access it from the server side. I'm using ASP.NET.

We had a situation where we needed to persist the URL hash across ASP.Net post backs. As the browser does not send the hash to the server by default, the only way to do it is to use some Javascript:
When the form submits, grab the hash (window.location.hash) and store it in a server-side hidden input field. Put this in a DIV with an id of "urlhash" so we can find it easily later.
On the server you can use this value if you need to do something with it. You can even change it if you need to.
On page load on the client, check the value of this hidden field. You will want to find it by the DIV it is contained in, as the auto-generated ID won't be known. Yes, you could do some trickery here with .ClientID, but we found it simpler to just use the wrapper DIV, as it allows all this JavaScript to live in an external file and be used in a generic fashion.
If the hidden input field has a valid value, set that as the URL hash (window.location.hash again) and/or perform other actions.
We used jQuery to simplify the selecting of the field, etc ... all in all it ends up being a few jQuery calls, one to save the value, and another to restore it.
Before submit:
$("form").submit(function() {
    $("input", "#urlhash").val(window.location.hash);
});
On page load:
var hashVal = $("input", "#urlhash").val();
if (IsHashValid(hashVal)) {
    window.location.hash = hashVal;
}
IsHashValid() can check for "undefined" or other things you don't want to handle.
Also, make sure you use $(document).ready() appropriately, of course.

[RFC 2396][1] section 4.1:
When a URI reference is used to perform a retrieval action on the
identified resource, the optional fragment identifier, separated from
the URI by a crosshatch ("#") character, consists of additional
reference information to be interpreted by the user agent after the
retrieval action has been successfully completed. As such, it is not
part of a URI, but is often used in conjunction with a URI.
(emphasis added)
[1]: https://www.rfc-editor.org/rfc/rfc2396#section-4

That's because the browser doesn't transmit that part to the server, sorry.

Probably the only choice is to read it on the client side and transfer it manually to the server (GET/POST/AJAX).
Regards
Artur
You may see also how to play with back button and browser history
at Malcan

Just to rule out the possibility you aren't actually trying to see the fragment on a GET/POST and actually want to know how to access that part of a URI object you have within your server-side code, it is under Uri.Fragment (MSDN docs).
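For that server-side case, here is a minimal sketch of reading the fragment from a URL string your code already holds. The names FragmentDemo and GetFragment are illustrative only. It does not recover a fragment from an incoming request, since the browser never sends that part.

```csharp
using System;

public static class FragmentDemo
{
    // Returns the fragment of an absolute URL without the leading '#',
    // or an empty string when the URL has no fragment.
    // new Uri(...) throws UriFormatException if the string is not a valid absolute URL.
    public static string GetFragment(string absoluteUrl)
    {
        var uri = new Uri(absoluteUrl);
        return uri.Fragment.TrimStart('#'); // Uri.Fragment includes the '#'
    }
}
```

Calling FragmentDemo.GetFragment("http://example.com/yourDirectory#video01") would give you the part after the '#'.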

Possible solution for GET requests:
New Link format: http://example.com/yourDirectory?hash=video01
Call this function toward top of controller or http://example.com/yourDirectory/index.php:
function redirect()
{
    if (!empty($_GET['hash'])) {
        // Sanitize and validate $_GET['hash']:
        // return the string if valid, empty or false if invalid.
        $validHash = sanitizeAndValidateHashFunction($_GET['hash']);
        if (!empty($validHash)) {
            $url = './#' . $validHash;
        } else {
            $url = '/your404page.php';
        }
        header("Location: $url");
        exit; // stop further output so the redirect takes effect
    }
}

How to extract data from website using AngleSharp & LINQ?

I'm trying to extract the prices from the below mentioned website. I'm using AngleSharp for the extraction. In the website, the prices are listed below (as an example):
<span class="c-price">650.00 </span>
I'm using the following code for the extraction.
using AngleSharp.Parser.Html;
using System.Net;
using System.Net.Http;
//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();
//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();
//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);
//Do something with LINQ
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());
However, I'm not getting any items, but they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use? And what code should I use?

I am late to the party, but I'll try to bring some sanity here.
Querying static webpages
For this we require the following set of tools / functionality:
HTTP requester (to obtain resources, e.g., HTML documents, via HTTP), potentially with an SSL/TLS layer on top (either accepting all certificates or working against the certificate store / known CAs)
HTML parser
A queryable object model representation of the parsed HTML document
Maybe additionally some cookie state and the ability to follow links / post forms
AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).
We would start by creating an AngleSharp configuration that defines which capabilities are available for the browsing engine. This engine is exposed in form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (either from a local source, a constructed source, or a remote source).
var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");
Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we look for.
AngleSharp embraces LINQ (or IEnumerable in general), however, it makes sense to give full power to the queries if possible.
So instead of
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
We write
var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");
This is also much more robust. The ClassList is a complex object giving access to a list of classes, so you either meant ClassList.Contains or ClassName.Equals (the latter working on the string representation). Note: the two versions are not equivalent, because the former looks for a class within the list of classes, while the latter looks for a match of the whole class serialization (thus imposing an extra boundary condition on the match: it needs to be the only class).
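To make the difference concrete, here is a small sketch that does not use AngleSharp at all. It only mimics the two checks on a raw class attribute string, so ClassMatchDemo, HasClass, and IsExactly are hypothetical names, not library API.

```csharp
using System;
using System.Linq;

public static class ClassMatchDemo
{
    private static readonly char[] Whitespace = { ' ', '\t', '\n', '\r', '\f' };

    // Mimics a ClassList.Contains-style check: is 'name' one of the
    // whitespace-separated class tokens?
    public static bool HasClass(string classAttribute, string name) =>
        classAttribute
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Contains(name);

    // Mimics a ClassName.Equals-style check: the whole attribute value
    // must be exactly 'name', so it fails as soon as a second class is present.
    public static bool IsExactly(string classAttribute, string name) =>
        string.Equals(classAttribute, name, StringComparison.Ordinal);
}
```

For an element with class="c-price highlight", HasClass finds c-price while IsExactly does not. The selector span.c-price behaves like the token-based check.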
Dealing with dynamic pages
This is far more complicated. The basics are the same as previously, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need
A JavaScript engine
A valid CSSOM
A fake (or even fully computed) rendering tree
A lot more DOM interfaces that can be found in real browsers (e.g., navigator, full history, web workers, ...) - the list is limitless here
While there is a project that delivers an experimental (and limited) C# only JS engine to AngleSharp, the latter two requirements cannot be fully fulfilled right now. Furthermore, the CSSOM may also be not complete enough for one or the other web application. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).
Long story short.
var config = Configuration.Default
.WithDefaultLoader()
.WithCss()
.WithJavaScript(); // maybe even more
var context = BrowsingContext.New(config);
The Task behind the await when opening a new document is equivalent to a load event in the DOM. Thus it will not fire when the document was downloaded and parsed, but only once all scripts have been loaded (and potentially run) incl. resources that needed to be downloaded.
Hope this helps a bit!

Automated system for opening other website links in new window/tab

I have a website, in which when user clicks a link it opens up in the same window if it is of my website's page, else in new window if domain is different. But I am doing this manually like this:
Open Link
checkdomain() checks the domain name of the link and returns true if it's of my website else false. I used the code from [ HERE ] for this purpose.
My question is: Is there any efficient and client side way available for checking link domains and open up them in new windows/tab if of another website(domain)? Like a JavaScript solution will be better, but then again JavaScript can be disabled by user. So, is there any other solution? Even JS solution will be great. Ignoring the disabling by user.

Somewhere on the page, or in an external JS file:
function externalLinks() {
if (!document.getElementsByTagName) return;
var anchors = document.getElementsByTagName("a");
for (var i = 0; i < anchors.length; i++) {
var anchor = anchors[i];
if (anchor.getAttribute("href")
&& anchor.getAttribute("rel")
&& anchor.getAttribute("rel").indexOf("external") >= 0)
anchor.target = "_blank";
}
}
window.onload = function() {
externalLinks();
};
Then, any external links just need to have rel="external" in the markup. For example:
<a href="..." rel="external">Click here!</a>
The main advantage of this approach is that you're not going to cause any validation errors, even with an XHTML Strict doctype. Users are also able to easily prevent links from opening in new windows by simply disabling JS.
If you need the decision of external/internal to be made automatically (and client-side), you can alter the logic of externalLinks to base the decision on the href attribute rather than the rel attribute. Of course, if you've already got the external/internal logic functioning in your codebehind, I would recommend using that information to render the anchor with the appropriate semantics (with rel), rather than re-writing almost identical code in your client-side JS.

Try comparing the host part of your link's URL (www.wrangle.in) with the following in your function logic.
string currentURL = HttpContext.Current.Request.Url.Host;
I do not recommend comparing the scheme (i.e., http or https); you can strip it off using the Substring function.
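A sketch of that server-side host comparison, assuming you can pass in the current host (for example, the value of HttpContext.Current.Request.Url.Host). LinkTargetHelper and its methods are hypothetical names. Relative links and non-HTTP links such as mailto: are treated as internal here, and protocol-relative URLs are not handled.

```csharp
using System;

public static class LinkTargetHelper
{
    // True when href is an absolute http/https URL whose host differs from siteHost.
    public static bool IsExternal(string href, string siteHost)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
            return false; // relative link: same site

        // Only the host is compared, never the scheme. Anything that is not
        // http/https (mailto:, file paths, ...) is left alone.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;

        return !string.Equals(uri.Host, siteHost, StringComparison.OrdinalIgnoreCase);
    }

    // Value for the anchor's target attribute.
    public static string TargetFor(string href, string siteHost) =>
        IsExternal(href, siteHost) ? "_blank" : "_self";
}
```

You could call TargetFor while rendering each anchor, so the decision is made once on the server instead of being repeated in client-side script.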
For the client side:
var homeURL = document.location.hostname;
$('a[href]').each(function() {
    // this.hostname is the host of the anchor's resolved href
    if (this.hostname === homeURL) {
        $(this).attr('target', '_self');
    } else {
        $(this).attr('target', '_blank');
    }
});
This link may help you to understand Url Parts.

How to get data from a client on load?

I'm aware that data can be passed in through the URL, like "example.com/thing?id=1234", or it can be passed in through a form and a "submit" button, but neither of these methods will work for me.
I need to get a fairly large xml string/file. I need to parse it and get the data from it before I can even display my page.
How can I get this on page load? Does the client have to send a http request? Or submit the xml as a string to a hidden form?
Edit with background info:
I am creating a widget that will appear in my customer's application, embedded using C# WebBrowser control, but will be hosted on my server. The web app needs to pass some data (including a token for client validation) to my widget via xml, and this needs to be loaded in first thing when my widget starts up.

ASP.NET MVC 4 works great with jQuery and Ajax posts. I have accomplished this goal many times by taking advantage of this.
jQuery:
$(document).ready(function() {
    $.ajax({
        type: "POST",
        url: "/{controller}/{action}/",
        data: { clientToken: '{token}', foo: 'bar' },
        success: function (data, text) {
            // Append your parsed XML data to the page.
            // Note: 'data' will contain your returned result.
        }
    });
});
MVC Controller:
[HttpPost]
public JsonResult jqGetXML(string clientToken, string foo)
{
    JsonResult jqResult = new JsonResult();
    // Get your XML data and do your work
    jqResult.Data = null; // replace null with whatever you want to return
    return jqResult;
}
Note: This example returns Json data (easier to work with IMO), not XML. It also assumes that the XML data is not coming from the client but is stored server-side.
EDIT: Here is a link to jQuery's Ajax documentation,
http://api.jquery.com/jQuery.ajax/

Assuming you're using ASP.NET, since you say it's generated by another page, just stick the XML in the Session state.

Another approach, not sure if it helps in your situation.
If you share the second level domain name on your two sites (i.e. .....sitename.com ) then another potential way to share data is you could have them assert a cookie at this 2nd level with the token and xml data in it. You'll then be provided with this cookie.
I've only done this to share authentication details, you need to share machine keys at a minimum to support this (assuming .Net here...).

You won't be able to automatically upload a file from the client to the server - at least not via a browser using html/js/httprequests. The browser simply will not allow this.
Imagine the security implications if browsers allowed you to silently upload a file from the client's local machine without their knowledge.

Sample solution:
A background process imports the XML file and parses it. The background process knows it is for customer YYY and updates their information so it knows the XML file has been processed.
A visitor goes to the customer's web application where the widget is embedded. In the markup of the widget the customer token has been added. This could be in JavaScript, Flash, an iFrame, etc.
When the widget loads, it makes a request to your app, which then checks whether the file was parsed for the provided customer (YYY). If it has been, show the page/widget.

If the XML is being served via HTTP, you can use LINQ to XML to parse the data.
Ex.
using System;
using System.Text;
using System.Xml.Linq;

public partial class Sample : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "http://news.yahoo.com/rss/";
        // Load the feed and select its <channel> element(s)
        var el = XElement.Load(url).Elements("channel");
        StringBuilder output = new StringBuilder();
        foreach (var c in el.Elements())
        {
            switch (c.Name.LocalName.ToLower())
            {
                case "title":
                    output.Append(c.Value);
                    output.Append("<br />");
                    break;
            }
        }
        this.Label1.Text = output.ToString();
    }
}

It is not exactly clear what the application is, what options you have, and how much control you have over the web server.
If you own the web server/application, your options are much wider. You can first send the file to the web server with an HTTP POST or PUT, including a random token, and then use the same token in the query string of the GET request.
Otherwise, use one of the other options, which also apply to third-party-owned websites:
If you are trying to consume some auth API, learn more about it. Since you are hosting a web browser control, you have plenty of options to script it, including loading whatever form is needed, setting a textarea or hidden field to your XML, and then simulating a submit button click. You can then respond to any redirects and HTML responses.
You can also inject JavaScript into the page that sends the XML to the server with an Ajax request.
The choice depends heavily on the interaction model.
If you need better advice, it would be most helpful if you provided a sample/simplified URL or URL pattern, the form content, and the sequence of events expected from you from a code/API/SDK perspective. They are usually quite friendly.
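For the first option (upload with a random token, then fetch with the same token), a minimal sketch of the token part could look like this. UploadToken, NewToken, and BuildWidgetUrl are hypothetical names, the query parameter name token is an assumption, and the actual POST/PUT of the XML is left out.

```csharp
using System;
using System.Security.Cryptography;

public static class UploadToken
{
    // Creates a random, URL-safe token (lowercase hex) from a cryptographic RNG.
    public static string NewToken(int byteCount = 16)
    {
        var bytes = new byte[byteCount];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }

    // Builds the URL the WebBrowser control would navigate to after the upload.
    public static string BuildWidgetUrl(string baseUrl, string token) =>
        baseUrl + "?token=" + Uri.EscapeDataString(token);
}
```

The host application would send the XML together with the token first, then point the browser control at the URL returned by BuildWidgetUrl, so the server can look up the already uploaded data.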

There are a limited number of ways to pass data between pages. Personally, for this I would keep it in the session while generating the page and clear it once it has been retrieved by the page that needs it.
If it is generated server side then there is no reason to retrieve it from client side.
http://msdn.microsoft.com/en-us/library/6c3yckfw(v=vs.100).aspx

Create a webservice that your C# app can POST the XML to and get back HTML in response. Load this HTML string into the WebBrowser control rather than pointing the control to a URL.

Facebook "fan gate" with C#/asp.net

The problem is that 1) Facebook seems so fluid with how it allows developers to interact with it (FBML, iFrame, different versions of SDKs) and 2) Everything I find is either PHP or Javascript and I have NO experience with those. What I am trying to do seems sooo simple, and I can't believe there isn't an easy way to do this.
What I have:
I used Visual Studio 2010 to create a simple web application (asp.net/C#) that asks the user for some info (first name, last name, email, etc.). I have a button on there called "Submit" that, when clicked, saves the entered data into a database. I have this hosted on GoDaddy (I know, I know...heh) and it works just fine. No problem here.
I created a "Facebook App" that uses the iFrame thingy so that basically I have a new tab on Facebook that displays my web app mentioned above. This works fine too. The tab is there, the web app is there, and users can enter the data and it is saved to the database. No problem here.
What I WANT:
I want the web app (the thing displayed by the facebook app) to only show the data entry part if the user currently "likes" the facebook entity. I DO NOT want to have to ask permission. I just want to know if they are a fan of the company's facebook "page" that has this app. So I need two things here, shown in my pseudo code below:
Part 1 (check if user is already a fan):
If (user is fan)
{
Show data entry area (unhide it)
}
else
{
Show "Click the like button to see more options"
}
Part 2 (listen for "like" event)
WhenLikeButtonPressed()
{
Show data entry area (unhide it)
}
I've seen stuff about "visible to connection", C# sdk, edge.create, etc. but I just can't make heads or tails of it. I don't mind putting in Javascript or PHP if someone could please give me exact, "Fan Gate for Dummies" steps. Please, I'm going crazy over here :-(

The key is the signed_request that Facebook posts to your app when the user accesses the page. It contains the data on whether or not the user likes the page. You shouldn't need to worry about catching edge events on an actual FB page tab, as it gets reloaded when the user likes/unlikes the page.
You'll need to decode the signed request with your app secret to get the like info. There are examples provided for PHP but I'm sure with a little google help you can find decode info for the signed_request for asp.net/c#.
Here's the php decode for reference:
function parse_signed_request($signed_request, $secret) {
    list($encoded_sig, $payload) = explode('.', $signed_request, 2);
    // decode the data
    $sig = base64_url_decode($encoded_sig);
    $data = json_decode(base64_url_decode($payload), true);
    if (strtoupper($data['algorithm']) !== 'HMAC-SHA256') {
        error_log('Unknown algorithm. Expected HMAC-SHA256');
        return null;
    }
    // check sig
    $expected_sig = hash_hmac('sha256', $payload, $secret, $raw = true);
    if ($sig !== $expected_sig) {
        error_log('Bad Signed JSON signature!');
        return null;
    }
    return $data;
}
function base64_url_decode($input) {
    return base64_decode(strtr($input, '-_', '+/'));
}
and the link: https://developers.facebook.com/docs/authentication/signed_request/. The like info will be contained in the page variable.
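Since the question asks for C#, here is a rough port of the PHP above, offered as a sketch rather than an official Facebook sample. SignedRequest and TryGetPayloadJson are hypothetical names. It only verifies the HMAC-SHA256 signature and returns the decoded JSON text; checking the algorithm field and reading the page entry are left to whatever JSON library you use.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SignedRequest
{
    // Returns the decoded JSON payload, or null if the input is malformed
    // or the signature does not match the app secret.
    public static string TryGetPayloadJson(string signedRequest, string appSecret)
    {
        if (string.IsNullOrEmpty(signedRequest)) return null;
        string[] parts = signedRequest.Split(new[] { '.' }, 2);
        if (parts.Length != 2) return null;

        try
        {
            byte[] signature = Base64UrlDecode(parts[0]);
            byte[] expected;
            // The signature covers the still-encoded payload, as in the PHP version.
            using (var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(appSecret)))
            {
                expected = hmac.ComputeHash(Encoding.UTF8.GetBytes(parts[1]));
            }
            if (!FixedTimeEquals(signature, expected)) return null;
            return Encoding.UTF8.GetString(Base64UrlDecode(parts[1]));
        }
        catch (FormatException)
        {
            return null; // not valid base64url
        }
    }

    // base64url uses '-' and '_' and drops the '=' padding.
    private static byte[] Base64UrlDecode(string input)
    {
        string s = input.Replace('-', '+').Replace('_', '/');
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
        }
        return Convert.FromBase64String(s);
    }

    // Compares without exiting early on the first differing byte.
    private static bool FixedTimeEquals(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        int diff = 0;
        for (int i = 0; i < a.Length; i++) diff |= a[i] ^ b[i];
        return diff == 0;
    }
}
```

In an ASP.NET page you would pass the posted signed_request form value and your app secret, then parse the returned JSON to decide whether to show the data entry area.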
