Web Scraping JavaScript Sites with HtmlAgilityPack - c#

Let me start out by saying that I am no pro at web scraping. I can do the basics on most platforms, but that's about it.
I am trying to create the foundation for a web application that helps users reinforce their language learning by generating additional data and metrics, as well as new tools for self-testing. The Duolingo website does not offer any sort of API, so my next thought for now is just to scrape https://www.duome.eu/. I wrote a quick little scraper but didn't realize that the site relies on JavaScript. In the following example, I want to collect all of the words from the Words tab that contain anchors:
using System;
using HtmlAgilityPack;
using System.Net.Http;
using System.Text.RegularExpressions;
namespace DuolingoUpdate
{
class Program
{
static void Main(string[] args)
{
string userName = "Podus";
UpdateDuolingoUser(userName);
Console.ReadLine();
}
private static async void UpdateDuolingoUser(string userName)
{
string url = "https://www.duome.eu/" + userName + "/progress/";
// Create the http client connection
HttpClient httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Store the html client data in an object
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//var words = htmlDocument.DocumentNode.Descendants("div")
// .Where(node => node.GetAttributeValue("id", "")
// .Equals("words")).ToList();
//var wordList = words[0].Descendants("a")
// .Where(node => node.GetAttributeValue("class", "")
// .Contains("wA")).ToList();
Console.WriteLine(html);
}
}
}
The html object of the above code contains:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="google" value="notranslate">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Duolingo · Podus # duome.eu</title>
<link rel="stylesheet" href="/style.css?1548418871" />
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<script src="//code.jquery.com/jquery-3.3.1.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
if("".length==0){
var visitortime = new Date();
var visitortimezone = "GMT " + -visitortime.getTimezoneOffset()/60;
//localStorage.tz = visitortimezone;
//timezone = Date.parse(localStorage.tz);
//timezone = localStorage.tz;
//console.log(timezone);
$.ajax({
type: "GET",
url: "/tz.php",
data: 'time='+ visitortimezone,
success: function(){
location.reload();
}
});
}
});
</script>
</head>
<body>
<noscript>Click here to adjsut XP charts to your local timezone. </noscript>
<!-- Yandex.Metrika counter --> <script type="text/javascript" > (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter47765476 = new Ya.Metrika({ id:47765476, clickmap:true, trackLinks:true, accurateTrackBounce:true }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/47765476" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->
</body>
</html>
But if you go to the actual URL https://www.duome.eu/Podus/progress/, the page contains a ton of script. So upon inspection, the first problem is that I am not getting the HTML that I see in the browser. The second problem is that if you view source, it's nothing like what is shown in the inspector, and I don't see anything in the source that would let me isolate the data from div id="words".
Given my lackluster knowledge of JavaScript-built web pages, how do I do this, or is it even possible?

You can access Duolingo profile data in JSON format via https://www.duolingo.com/users/<username>
e.g. https://www.duolingo.com/users/Podus
This should be much easier than trying to scrape the duome profile page manually.
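As a rough sketch of that approach, the snippet below downloads the profile and lists the top-level property names so you can see what is available. The class and method names are made up for illustration, it uses System.Text.Json rather than any particular Duolingo client, and it assumes nothing about the shape of the JSON. Whether the endpoint responds without a login is something you would have to check yourself.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper; the names here are illustrative only.
public static class DuolingoProfile
{
    // Builds the profile URL from the answer above, escaping the user name.
    public static string BuildUrl(string userName) =>
        "https://www.duolingo.com/users/" + Uri.EscapeDataString(userName);

    // Returns the names of the top-level properties of a JSON object.
    public static string[] TopLevelKeys(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        return doc.RootElement.EnumerateObject().Select(p => p.Name).ToArray();
    }

    // Downloads the profile and returns its top-level property names.
    public static async Task<string[]> FetchKeysAsync(HttpClient client, string userName)
    {
        string json = await client.GetStringAsync(BuildUrl(userName));
        return TopLevelKeys(json);
    }
}
```

Call FetchKeysAsync from an async Task Main and await it, so the program does not exit before the download finishes.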

Related

How do we authenticate Quickbooks Online in a Console Application?

I'm trying to call the QuickBooks Online API from a console application, where I need to get the bearer token first. Below is the code snippet where I'm trying to get the authorization code for the subsequent access token calls. I'm getting an HTML response instead of a JSON object with the auth code.
Also, which grant types does QBO support?
HttpClientHandler httpClientHandler = new HttpClientHandler();
HttpClient httpClient = new HttpClient(httpClientHandler,false);
httpClient.BaseAddress = new Uri("https://appcenter.intuit.com/connect/oauth2");
List<KeyValuePair<string, string>> param = new List<KeyValuePair<string, string>>();
param.Add(new KeyValuePair<string, string>("response_type","code"));
param.Add(new KeyValuePair<string, string>("client_id", "AB5********26"));
param.Add(new KeyValuePair<string, string>("scope", "com.intuit.quickbooks.accounting"));
param.Add(new KeyValuePair<string, string>("redirect_uri", "https://developer.intuit.com/v2/OAuth2Playground/RedirectUrl"));
var resp = httpClient.PostAsync("", new FormUrlEncodedContent(param)).GetAwaiter().GetResult();
var result = resp.Content.ReadAsStringAsync().GetAwaiter().GetResult();
HTML response for the request I sent:
<!DOCTYPE html>
<html class="dj_mac is-not-mobile" data-shell-type="node">
<!-- node-shell -->
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="google-site-verIfication" content="hiEXDzwqUxxMY5KZkAkeHBn6J0gy2Ne1gJdm77RkGbk">
<meta name="msapplication-TileColor" content="#0098cd">
<meta charset="utf-8" />
<link rel="preload"
href="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/bower_components/document-register-element/build/document-register-element.js"
as="script">
<link rel="preload" href="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/dojo/dojo.js" as="script">
<link rel="preload" href="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/shell/boot.js" as="script">
<link rel="preload" href="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/dojo/resources/blank.gif"
as="image">
<link rel="preload" href="https://plugin.intuitcdn.net/react-dom/16.9.0/umd/react-dom.production.min.js"
as="script">
<link rel="preload" href="https://plugin.intuitcdn.net/react/16.9.0/umd/react.production.min.js" as="script">
<script>
(function() {var e = document.createEvent("Event");e.initEvent("load", true, false);window.dispatchEvent(e);})();
</script>
<link rel="preload" href="https://plugin.intuitcdn.net/ua-parser-js/0.7.20/dist/ua-parser.min.js" as="script">
<meta id="viewPortMetaTag" name="viewport"
content="width=device-width, height=device-height, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=0">
<meta name="application-name" content="QuickBooks App Store">
<meta name="apple-mobile-web-app-title" content="QuickBooks App Store">
<title>QuickBooks App Store</title>
<script type="text/javascript" src="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/dojo/dojo.js"></script>
</head>
<script type="text/javascript"
src="https://plugin.intuitcdn.net/sbg-web-shell-ui/11.24.0/bower_components/document-register-element/build/document-register-element.js">
</script>
</head>
<body class="en-us">
<script type="text/javascript" nonce="bTjY6kTz1OvXs0b/7WA0RA==">
try {
require({cache: {}});
require(["shell/base/loader", "shell/applications/default/config"], function(loader, config) {
var runtime = {"isWebpack":false,"embedded":false,"ecosystem":false,"accessDeniedPages":[48],"hiddenPages":[]};
loader.load({
useLayers: true,
storageSecondaryKey: "",
storagePrimaryKey: "",
platformPlugin: appContext.pluginsInfo && appContext.pluginsInfo.plugins ? appContext.pluginsInfo.plugins["qbo-ui-platform"] : null,
layers: config.getLayers(runtime)
}, config.getAppHandler(runtime));
});
} catch (error){
console.error(error);
require({cache: {}});
require(["shell/base/loader", "shell/applications/default/config"], function(loader, config) {
var runtime = {"isWebpack":false,"embedded":false,"ecosystem":false,"accessDeniedPages":[48],"hiddenPages":[]};
loader.load({
useLayers: true,
storageSecondaryKey: "",
storagePrimaryKey: "",
platformPlugin: appContext.pluginsInfo && appContext.pluginsInfo.plugins ? appContext.pluginsInfo.plugins["qbo-ui-platform"] : null,
layers: config.getLayers(runtime)
}, config.getAppHandler(runtime));
});
};
</script>
</body>
</html>
using Intuit.Ipp.Core;
using Intuit.Ipp.OAuth2PlatformClient;
using Intuit.Ipp.Security;
using System;
using System.Collections.Generic;
using System.Text;
namespace QuickBooksToken
{
public class GetAccessTokenQbo
{
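// Placeholders: replace these with the values for your own app.
private const string clientId = "your client Id";
private const string clientSecret = "your client secret";
private const string environment = "sandbox"; // "sandbox" or "production"
public static string GetAccessToken()
{
System.Net.ServicePointManager.SecurityProtocol = System.Net.SecurityProtocolType.Tls12;
var oauth2Client = new OAuth2Client(clientId, clientSecret, "https://developer.intuit.com/v2/OAuth2Playground/RedirectUrl", environment);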
var previousRefreshToken = ReadRefreshTokenFromWhereItIsStored();
var tokenResp = oauth2Client.RefreshTokenAsync(previousRefreshToken);
tokenResp.Wait();
var data = tokenResp.Result;
if (!String.IsNullOrEmpty(data.Error) || String.IsNullOrEmpty(data.RefreshToken) ||
String.IsNullOrEmpty(data.AccessToken))
{
throw new Exception("Refresh token failed - " + data.Error);
}
// If we've got a new refresh_token store it in the file
if (previousRefreshToken != data.RefreshToken)
{
Console.WriteLine("Writing new refresh token : " + data.RefreshToken);
WriteNewRefreshTokenToWhereItIsStored(data.RefreshToken);
return data.AccessToken;
}
return data.AccessToken;
}
private static string ReadRefreshTokenFromWhereItIsStored()
{
return "Refresh token"; //hard code your refresh token
}
private static string WriteNewRefreshTokenToWhereItIsStored(string refreshToken)
{
return refreshToken;
}
public static ServiceContext GetServiceContext()
{
string accessToken = GetAccessToken();// Code from above
var oauthValidator = new OAuth2RequestValidator(accessToken);
string realmId = "your realm Id"; // placeholder: the realm (company) Id of your QBO company
ServiceContext qboContext = new ServiceContext(realmId, IntuitServicesType.QBO, oauthValidator);
return qboContext;
}
}
}
Install the NuGet package IppDotNetSdkForQuickBooksApiV3 (version 14.4.4).
In my case the environment is sandbox. You can use the playground (https://developer.intuit.com/app/developer/playground) to get values for the client secret, client Id, realm Id, and a new refresh token to hard-code in your C# application.

Extract JSON from string in .NET

The input string is a mix of some text and valid JSON:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<TITLE>Title</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<META HTTP-EQUIV="Content-language" CONTENT="en">
<META HTTP-EQUIV="keywords" CONTENT="search words">
<META HTTP-EQUIV="Expires" CONTENT="0">
<script SRC="include/datepicker.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>
<script SRC="include/jsfunctions.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>
<link REL="stylesheet" TYPE="text/css" HREF="css/datepicker.css">
<script language="javascript" type="text/javascript">
function limitText(limitField, limitCount, limitNum) {
if (limitField.value.length > limitNum) {
limitField.value = limitField.value.substring(0, limitNum);
} else {
limitCount.value = limitNum - limitField.value.length;
}
}
</script>
{"List":[{"ID":"175114","Number":"28992"}]}
The task is to deserialize the JSON part of it into some object. The string can begin with some text, but it always contains valid JSON. I've tried to use a JSON validation regex, but .NET had a problem parsing that pattern.
So in the end I want to get only:
{
"List": [{
"ID": "175114",
"Number": "28992"
}]
}
Clarification 1:
There is only a single JSON object in the whole messy string, but the text can contain other {} pairs (it's actually HTML and can contain JavaScript, such as <script> function(){..... )
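You can use this method:
// Requires Json.NET: using Newtonsoft.Json;
public object ExtractJsonObject(string mixedString)
{
// Try every '{' as a possible start, from left to right.
for (var i = mixedString.IndexOf('{'); i > -1; i = mixedString.IndexOf('{', i + 1))
{
// Try every '}' after that start as a possible end, from right to left.
// Stopping at j > i keeps the substring length positive.
for (var j = mixedString.LastIndexOf('}'); j > i; j = mixedString.LastIndexOf('}', j - 1))
{
var jsonProbe = mixedString.Substring(i, j - i + 1);
try
{
return JsonConvert.DeserializeObject(jsonProbe);
}
catch
{
// Not valid JSON; try the next candidate.
}
}
}
return null;
}
The key idea is to search all { and } pairs and probe whether the text between them is valid JSON. The first valid JSON occurrence is converted to an object and returned.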
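If you would rather not depend on Json.NET, the same probing idea can be sketched with System.Text.Json. This is a swapped-in variant, not the method above: the class and method names are hypothetical, and it returns the matching JSON text instead of a deserialized object.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper illustrating the brace-probing idea with System.Text.Json.
public static class JsonExtractor
{
    // Returns the first substring that parses as JSON, or null if none does.
    public static string ExtractJsonText(string mixed)
    {
        for (int i = mixed.IndexOf('{'); i > -1; i = mixed.IndexOf('{', i + 1))
        {
            for (int j = mixed.LastIndexOf('}'); j > i; j = mixed.LastIndexOf('}', j - 1))
            {
                string probe = mixed.Substring(i, j - i + 1);
                try
                {
                    // Parse only validates here; the document is disposed right away.
                    using JsonDocument doc = JsonDocument.Parse(probe);
                    return probe;
                }
                catch (JsonException)
                {
                    // Not valid JSON; try a shorter candidate.
                }
            }
        }
        return null;
    }
}
```

Like the method above, it returns the first candidate that parses, so an empty {} inside a script that appears before the real payload would be returned first. The work is also quadratic in the number of braces, which is fine for a single page but worth keeping in mind for large inputs.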
Use a regex to find candidate JSON structures:
\{(.|\s)*\}
Then iterate over the matches until you find one that deserializes without throwing an exception:
JsonConvert.DeserializeObject(match.Value);
If you know the format of the JSON structure, use JsonSchema.

Google map not properly load in web browser control

I have an HTML file, Map.html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title></title>
<script type="text/javascript" src="http://maps.google.com.mx/maps/api/js?sensor=true"></script>
<script type="text/javascript">
var geocoder;
var map;
function initialize() {
geocoder = new google.maps.Geocoder();
var myOptions = {
zoom: 8,
mapTypeId: google.maps.MapTypeId.ROADMAP
}
var address = "Ahmedabad, India" //change the address in order to search the google maps
geocoder.geocode({ 'address': address }, function (results, status) {
if (status == google.maps.GeocoderStatus.OK) {
map.setCenter(results[0].geometry.location);
var marker = new google.maps.Marker({
map: map,
position: results[0].geometry.location
});
var infoWindow = new google.maps.InfoWindow({
content: 'Hello'
});
google.maps.event.addListener(marker, "click", function (e) {
infoWindow.open(map, marker);
});
} else {
alert("Geocode was not successful for the following reason: " + status);
}
});
map = new google.maps.Map(document.getElementById("map_canvas"), myOptions);
}
</script>
</head>
<body onload="initialize()">
<div id="map_canvas" style="width:100%; height:100%"></div>
</body>
</html>
Then I load this HTML file into the WebBrowser control's DocumentText using the code below:
using (StreamReader reader = new StreamReader(System.Windows.Forms.Application.StartupPath + "\\Map.html"))
{
_mapHTML = reader.ReadToEnd();
}
webBrowser1.DocumentText = _mapHTML;
But the map does not load properly.
The zoom in/out controls, the Map/Satellite option, and the marker all show up, but a white layer is displayed over the map.
SOLVED -- for me at least.
For some reason, the WebBrowser control is defaulting to a bad (experimental?) version of the Maps API. I changed my initialization script to request the current 3.3 version explicitly, and everything is back to normal.
<script type="text/javascript" src="https://maps.googleapis.com/maps/api/js?v=3.3"></script>

Plupload, C#, FTP and Chunking

I'm using Plupload to manage large file uploads. I have written a C# handler that uploads using FtpWebRequest, and it appears to work OK. The trouble is that I would like to use chunking (http://www.plupload.com/docs/Chunking). When I enable chunking, I get multiple files uploaded to my FTP server called blobxxx, blobyyyy. The question is, how do I join them all together at the end? Here is my code. Thanks for any help:
Plupload.aspx
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="plupload.aspx.cs" Inherits="plupload" %>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<title>Plupload - Custom example</title>
<!-- production -->
<link href="plupload/js/jquery.plupload.queue/css/jquery.plupload.queue.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.js" charset="UTF-8"></script>
<script type="text/javascript" src="plupload/js/plupload.full.min.js"></script>
<script src="plupload/js/jquery.plupload.queue/jquery.plupload.queue.min.js" type="text/javascript"></script>
<!-- debug
<script type="text/javascript" src="../js/moxie.js"></script>
<script type="text/javascript" src="../js/plupload.dev.js"></script>
-->
</head>
<body style="font: 13px Verdana; background: #eee; color: #333">
<h1>Custom example</h1>
<p>Shows you how to use the core plupload API.</p>
<div id="filelist">Your browser doesn't have Flash, Silverlight or HTML5 support.</div>
<br />
<div id="uploader">
<p>Your browser doesn't have Flash, Silverlight or HTML5 support.</p>
</div>
<script type="text/javascript">
// Initialize the widget when the DOM is ready
$(function() {
// Setup html5 version
$("#uploader").pluploadQueue({
// General settings
runtimes : 'html5,flash,silverlight,html4',
url : "FileUpload.ashx",
chunk_size: '1mb',
max_retries: 3,
rename : true,
dragdrop: true,
// Resize images on clientside if we can
resize: {
width : 200,
height : 200,
quality : 90,
crop: true // crop to exact dimensions
},
// Flash settings
flash_swf_url : 'plupload/js/Moxie.swf',
// Silverlight settings
silverlight_xap_url : 'plupload/js/Moxie.xap'
});
$("#uploader").bind("Error", function (upload, error) {
alert(error.message);
});
// only allow 5 files to be uploaded at once
$("#uploader").bind("FilesAdded", function (up, filesToBeAdded) {
if (up.files.length > 5) {
up.files.splice(4, up.files.length - 5);
showStatus("Only 5 files max are allowed per upload. Extra files removed.", 3000, true);
return false;
}
return true;
});
});
</script>
</body>
</html>
FileUpload.ashx
<%@ WebHandler Language="C#" Class="FileUpload" %>
using System;
using System.Web;
using System.Net;
using System.IO;
using System.Text;
public class FileUpload : IHttpHandler {
public void ProcessRequest(HttpContext context)
{
if (context.Request.Files.Count > 0)
{
for (int a = 0; a <= context.Request.Files.Count - 1; a++)
{
FtpWebRequest clsRequest = (System.Net.FtpWebRequest)System.Net.WebRequest.Create(new Uri("ftp://xxxx/yyyy/" + context.Request.Files[a].FileName));
clsRequest.Credentials = new NetworkCredential("xxx", "yyy");
clsRequest.Method = System.Net.WebRequestMethods.Ftp.UploadFile;
Stream streamReader = context.Request.Files[a].InputStream;
byte[] bFile = new byte[streamReader.Length];
streamReader.Read(bFile, 0, (int)streamReader.Length);
streamReader.Close();
streamReader.Dispose();
Stream clsStream = clsRequest.GetRequestStream();
//clsStream.Write(bFile, 0, bFile.Length);
// Write the local stream to the FTP stream
// 2 KB (2048 bytes) at a time
int offset = 0;
int chunk = (bFile.Length > 2048) ? 2048 : bFile.Length;
while (offset < bFile.Length)
{
clsStream.Write(bFile, offset, chunk);
offset += chunk;
chunk = (bFile.Length - offset < chunk) ? (bFile.Length - offset) : chunk;
}
clsStream.Close();
clsStream.Dispose();
FtpWebResponse response = (FtpWebResponse)clsRequest.GetResponse();
response.Close();
}
}
}
}
UPDATE
I've changed:
FtpWebRequest clsRequest = (System.Net.FtpWebRequest)System.Net.WebRequest.Create(new Uri("ftp://xxxx/yyyy/" + context.Request.Files[a].FileName));
to:
FtpWebRequest clsRequest = (System.Net.FtpWebRequest)System.Net.WebRequest.Create(new Uri("ftp://xxxx/yyyy/" + "staticFileName.zip"));
and
clsRequest.Method = System.Net.WebRequestMethods.Ftp.UploadFile;
to
clsRequest.Method = System.Net.WebRequestMethods.Ftp.AppendFile;
It appears to upload OK. I just need to test the chunking...
Your link has a section titled "Server-side Handling", which explains that you need to reassemble the file yourself. Here is a link to information on appending to binary files: How append data to a binary file?
The Plupload documentation mentions that they do the appending on the ChunkUploaded event and piece the file together as it arrives.
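To make the reassembly step concrete, here is a minimal sketch of appending chunks to a single file as they arrive. It writes to a local file rather than to FTP, and the class and method names are made up for illustration. It assumes chunks arrive in order and that you read the chunk index from the request yourself; as far as I understand the Plupload docs, each chunk request carries "chunk", "chunks" and "name" fields, but verify that against your version.

```csharp
using System;
using System.IO;

// Hypothetical helper; in a handler you would call it once per chunk request.
public static class ChunkAssembler
{
    // Writes one chunk to the target file: chunk 0 starts a fresh file,
    // every later chunk is appended to what is already there.
    public static void AppendChunk(string targetPath, int chunkIndex, Stream chunkData)
    {
        FileMode mode = chunkIndex == 0 ? FileMode.Create : FileMode.Append;
        using (var output = new FileStream(targetPath, mode, FileAccess.Write))
        {
            chunkData.CopyTo(output);
        }
    }
}
```

In the handler this would be called with something like context.Request.Files[0].InputStream as the chunk data. For the FTP case, the AppendFile method shown in the update above plays the same role as FileMode.Append does here.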

What should be the correct response from web service to display the Jquery token input results?

I am using the jQuery Token Input plugin. I am trying to fetch the data from the database instead of using local data. My web service returns the JSON result wrapped in XML:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://tempuri.org/">[{"id":"24560","name":"emPOWERed-Admin"},{"id":"24561","name":"emPOWERed-HYD-Visitors"}]</string>
I have checked in the site http://loopj.com/jquery-tokeninput/ which says that the script should output JSON search results in the following format:
[
{"id":"856","name":"House"},
{"id":"1035","name":"Desperate Housewives"}
]
Both seem to be the same, but I am still not getting the items displayed on my page.
Here is my code as well.
My Js code: DisplayTokenInput.js
$(document).ready(function() {
$("#textboxid").tokenInput('PrivateSpace.asmx/GetDl_info', {
hintText: "Type in DL Name", theme: "facebook",
preventDuplicates: true,
searchDelay: 200
});
});
My web-service code:
[WebMethod]
[ScriptMethod(UseHttpGet = true, ResponseFormat = ResponseFormat.Json)]
public string GetDl_info(string q)
{
string dl_input = string.Empty;
DataSet ds;
PSData ObjDl = new PSData();
ds = ObjDl.GetDistributionList(q);
List<DistributionList> DLObj = new List<DistributionList>();
foreach (DataRow datarow in ds.Tables[0].Rows)
{
DistributionList dl_list = new DistributionList();
dl_list.id = Convert.ToString(datarow["id"]);
dl_list.name = Convert.ToString(datarow["name"]);
DLObj.Add(dl_list);
}
dl_input = JsonConvert.SerializeObject(DLObj);
return dl_input;
}
}
public class DistributionList
{
public string id { get; set; }
public string name { get; set; }
}
Here is the head portion of the aspx page, to show the library files I have included:
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title>Untitled Page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="../Styles/jquery-ui-1.8.20.custom.css" rel="stylesheet" type="text/css" />
<link href="../Styles/token-input.css" rel="stylesheet" type="text/css" />
<link href="../Styles/token-input-facebook.css" rel="stylesheet" type="text/css" />
<script src="Scripts/Lib/jquery-1.7.2.min.js" type="text/javascript"></script>
<script src="../Scripts/jquery.tokeninput.js" type="text/javascript"></script>--%>
<script src="DisplayTokenInput.js" type="text/javascript"></script>
<head>
You need to make sure that your request is a POST request, not a GET request. See this answer to find out more about why: How to let an ASMX file output JSON
I would assume that the code for the plugin isn't setting the content type for ajax requests to JSON, so you could do it yourself before the service call with $.ajaxSetup, i.e.:
$.ajaxSetup({
contentType: "application/json; charset=utf-8"
});
UPDATE: Apparently asmx services sometimes have issues with the 'charset=utf-8' portion, so if that doesn't work you could try just 'application/json'.
UPDATE 2:
I don't think the contentType is causing the issue. Use the following to force a POST for ajax requests and see if this fixes it:
$.ajaxSetup({
type: "POST", contentType: "application/json; charset=utf-8"
});
UPDATE 3:
There is a default setting inside the plugin you're using that can change the requests from GET to POST. See it in its GitHub repo: jquery.tokeninput.js
In your copy of the js file in the project, change the lines:
var DEFAULT_SETTINGS = {
// Search settings
method: "GET",
to
var DEFAULT_SETTINGS = {
// Search settings
method: "POST",
I also assume that the plugin constructs the request in a way that ignores the global jQuery ajax settings anyway, so you shouldn't need to include my earlier snippets anymore.
