The input string is a mix of some text and valid JSON:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<TITLE>Title</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<META HTTP-EQUIV="Content-language" CONTENT="en">
<META HTTP-EQUIV="keywords" CONTENT="search words">
<META HTTP-EQUIV="Expires" CONTENT="0">
<script SRC="include/datepicker.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>
<script SRC="include/jsfunctions.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>
<link REL="stylesheet" TYPE="text/css" HREF="css/datepicker.css">
<script language="javascript" type="text/javascript">
function limitText(limitField, limitCount, limitNum) {
if (limitField.value.length > limitNum) {
limitField.value = limitField.value.substring(0, limitNum);
} else {
limitCount.value = limitNum - limitField.value.length;
}
}
</script>
{"List":[{"ID":"175114","Number":"28992"}]}
The task is to deserialize the JSON part into an object. The string can begin with arbitrary text, but it is guaranteed to contain valid JSON. I tried a JSON validation regex, but .NET failed to parse that pattern.
So in the end I want to get only:
{
"List": [{
"ID": "175114",
"Number": "28992"
}]
}
Clarification 1:
There is only a single JSON object in the whole messy string, but the surrounding text can contain braces too (it is actually HTML and can contain JavaScript such as <script> function(){..... )
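You can use this method (it relies on Json.NET, i.e. using Newtonsoft.Json):
public object ExtractJsonObject(string mixedString)
{
    for (var i = mixedString.IndexOf('{'); i > -1; i = mixedString.IndexOf('{', i + 1))
    {
        // Only consider closing braces to the right of the opening brace;
        // otherwise Substring would be called with a negative length and throw.
        for (var j = mixedString.LastIndexOf('}'); j > i; j = mixedString.LastIndexOf('}', j - 1))
        {
            var jsonProbe = mixedString.Substring(i, j - i + 1);
            try
            {
                return JsonConvert.DeserializeObject(jsonProbe);
            }
            catch
            {
                // Not valid JSON; try the next, shorter candidate.
            }
        }
    }
    return null;
}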
The key idea is to take every { and } pair and probe whether the text between them is valid JSON. The first candidate that deserializes successfully is converted to an object and returned.
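The same probing idea can be sketched in JavaScript for illustration. This is only a minimal sketch, not the answer's C# method: it uses the built-in JSON.parse as the validity check, and the function name extractJsonObject and the sample string are made up for the example.

```javascript
// Sketch: return the first substring between a '{' and a '}' that parses as JSON.
function extractJsonObject(mixed) {
  for (let i = mixed.indexOf("{"); i !== -1; i = mixed.indexOf("{", i + 1)) {
    // Walk closing braces from the right, but never past the opening brace.
    for (let j = mixed.lastIndexOf("}"); j > i; j = mixed.lastIndexOf("}", j - 1)) {
      const probe = mixed.slice(i, j + 1);
      try {
        return JSON.parse(probe);
      } catch (err) {
        // Not valid JSON; try a shorter candidate.
      }
    }
  }
  return null;
}

// Hypothetical input: script text containing braces, followed by the JSON payload.
const sample =
  "<script>function f(a) { if (a) { return 1; } }</script>\n" +
  '{"List":[{"ID":"175114","Number":"28992"}]}';
const parsed = extractJsonObject(sample);
console.log(parsed.List[0].ID);
```

Note that an empty {} anywhere before the payload is itself valid JSON and would be returned first, so a caller may want to check the result for an expected property. The nested loops are also quadratic in the number of braces, which is fine for a single page but worth keeping in mind for large inputs.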
Use regex to find all possible JSON structures:
\{(.|\s)*\}
Regex example
Then iterate over these matches until you find one that deserializes without throwing an exception:
JsonConvert.DeserializeObject(match.Value);
If you know the format of the JSON structure, use JsonSchema.
Let me start out by saying that I am no pro at web scraping. I can do the basics on most platforms, but that's about it.
I am trying to create the foundation for a web application that helps users reinforce their language learning by generating additional data and metrics, as well as creating new tools for self-testing. The Duolingo website does not offer any sort of API, so my next thought is to scrape https://www.duome.eu/. I wrote a quick little scraper but didn't realize that the site relies on JavaScript. In the following example, I want to collect all of the words from the Words tab that contain anchors:
using System;
using HtmlAgilityPack;
using System.Net.Http;
using System.Text.RegularExpressions;
namespace DuolingoUpdate
{
class Program
{
static void Main(string[] args)
{
string userName = "Podus";
UpdateDuolingoUser(userName);
Console.ReadLine();
}
private static async void UpdateDuolingoUser(string userName)
{
string url = "https://www.duome.eu/" + userName + "/progress/";
// Create the http client connection
HttpClient httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Store the html client data in an object
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//var words = htmlDocument.DocumentNode.Descendants("div")
// .Where(node => node.GetAttributeValue("id", "")
// .Equals("words")).ToList();
//var wordList = words[0].Descendants("a")
// .Where(node => node.GetAttributeValue("class", "")
// .Contains("wA")).ToList();
Console.WriteLine(html);
}
}
}
The html object of the above code contains:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="google" value="notranslate">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Duolingo · Podus # duome.eu</title>
<link rel="stylesheet" href="/style.css?1548418871" />
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<script src="//code.jquery.com/jquery-3.3.1.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
if("".length==0){
var visitortime = new Date();
var visitortimezone = "GMT " + -visitortime.getTimezoneOffset()/60;
//localStorage.tz = visitortimezone;
//timezone = Date.parse(localStorage.tz);
//timezone = localStorage.tz;
//console.log(timezone);
$.ajax({
type: "GET",
url: "/tz.php",
data: 'time='+ visitortimezone,
success: function(){
location.reload();
}
});
}
});
</script>
</head>
<body>
<noscript>Click here to adjsut XP charts to your local timezone. </noscript>
<!-- Yandex.Metrika counter --> <script type="text/javascript" > (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter47765476 = new Ya.Metrika({ id:47765476, clickmap:true, trackLinks:true, accurateTrackBounce:true }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/47765476" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->
</body>
</html>
But if you go to the actual URL https://www.duome.eu/Podus/progress/, the site contains a ton of script. So upon inspection the first problem is that I am not getting the HTML that I see in the browser. The second problem is that if you view source, it's nothing like what is in the inspector, and I don't see anything in the source that would lead me to isolate the data from div id="words".
Given my lackluster knowledge of JavaScript-built web pages, how do I do this, or is it even possible?
You can access Duolingo profile data in JSON format via https://www.duolingo.com/users/<username>
eg. https://www.duolingo.com/users/Podus
This should be much easier than trying to scrape the duome profile page manually.
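As a rough illustration of that approach, here is a small JavaScript sketch that builds the profile URL and requests it. The helper names profileUrl and loadProfile are hypothetical, the shape of the returned JSON is not assumed, and whether the endpoint still answers without authentication has not been verified here.

```javascript
// Build the profile URL mentioned above; the username is URL-encoded defensively.
function profileUrl(username) {
  return "https://www.duolingo.com/users/" + encodeURIComponent(username);
}

// fetchImpl is injectable so the function can be exercised without network access.
async function loadProfile(username, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(profileUrl(username));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  // The payload is returned as parsed JSON; no particular fields are assumed.
  return response.json();
}

console.log(profileUrl("Podus"));
```

The same two steps (request the URL, parse the body as JSON) carry over directly to HttpClient and a JSON library in C#.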
I am trying this regex, but I am not getting the desired result in code:
<script[\s\w="'/]*src\s*=\s*['"]([\w/\.\d\s-]*)["']>|<link[/\s\w="\d]*href=['"]([\.\d\w\\/-]*)['"][\s\w="'/]*>
Here is my pattern:
string pattern = @"<script\s[\d\s\w='";
pattern += "\"/]*";
pattern += @"src\s*=\s*['" + "\"]";
pattern += @"([\w/\.\d\s-]*)['" + "\"]>";
pattern += "|";
pattern += @"<link[/\s\w=\d" + "\"]*";
pattern += "href['\"](" + @"[\.\d\w/"+ Regex.Escape("\\") + "-]*)";
pattern += "['\"]" + @"[\s\w='/" + "\"]*>";
In case you can spot the fault, that is why it is not working in C#,
while the tests all pass at the link below:
http://regexr.com/3admv
Just to be sure, here is the code:
string url = "http://www.uok.edu.pk";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
string html = reader.ReadToEnd();
Regex regex = new Regex(GetDirectoryListingRegexForUrl(url));
MatchCollection matches = regex.Matches(html);
if (matches.Count > 0)
{
foreach (Match match in matches)
{
if (match.Success)
{
Console.WriteLine("***************");
Console.WriteLine(match.ToString());
}
}
}
}
Console.ReadLine();
}
If you can help, please give me a string pattern that correctly parses the HTML in the link.
I can't seem to get the link href with this regex.
Thanks for any help :)
You can escape quote characters inside a verbatim string by doubling them:
string pattern = @"<script[\s\w=""'/]*src\s*=\s*['""]([\w/\.\d\s-]*)[""']>|<link[/\s\w=""\d]*href=['""]([\.\d\w\\/-]*)['""][\s\w=""'/]*>";
// I put the text from your example in this file
string txt = File.ReadAllText("texttoparse.txt");
var matches = Regex.Matches(txt, pattern);
foreach (Match match in matches)
{
if (match.Success)
{
Console.WriteLine("***************");
Console.WriteLine(match.ToString());
}
}
output (same as in your RegExr test):
***************
<link rel="import" href="component.html" >
***************
<link rel="stylesheet" href="css/style.css">
***************
<script src="js/script.js">
***************
<link rel="import" href="component.html">
***************
<link href="css/style-original.css" rel="stylesheet" type="text/css">
***************
<link href="css/style-original.css" rel="stylesheet" type="text/css" />
***************
<script type="text/javascript" src="/js/jquery.js">
***************
<script type="text/javascript" src="/js/cufon-yui.js">
***************
<script type="text/javascript" src="/js/arial.js">
***************
<script type="text/javascript" src="/js/chilli.js">
***************
<script type="text/javascript" src="/js/cycle.js">
***************
<script type="text/javascript" src="/js/functions.js">
***************
<script type="text/javascript" src="/js/fancybox.js">
It seems like you were trying to just extract "href" and "src" attribute values from HTML tags. You can use regex for that:
<(?:script|link)[^<]*?\s(?:src|href)=(?<quot>['"])(?<result>(?>(?!\k<quot>).)+)\k<quot>
Since we never know if single or double quotation marks are used in the HTML code, we can capture the first one ((?<quot>['"])), and then everything that is not equal to it ((?<result>(?>(?!\k<quot>).)+)\k<quot>).
You can split this into separate alternatives as well; named capture groups are great in C#, and .NET allows the same group name to appear in both alternatives:
<script[^<]*?\ssrc=(?<quot>['"])(?<result>(?>(?!\k<quot>).)+)\k<quot>|<link[^<]*?\shref=(?<quot>['"])(?<result>(?>(?!\k<quot>).)+)\k<quot>
${result} will hold your data.
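For comparison, the first pattern can be tried in JavaScript as well. This is a sketch under one stated change: JavaScript regular expressions have no atomic groups, so (?>...) is replaced here by a plain non-capturing group (?:...). The function name extractResourceUrls and the sample markup are made up for the example.

```javascript
// Named groups: "quot" remembers the opening quote, "result" holds the attribute value.
const pattern =
  /<(?:script|link)[^<]*?\s(?:src|href)=(?<quot>['"])(?<result>(?:(?!\k<quot>).)+)\k<quot>/g;

// Collect the "result" group of every match in the given HTML text.
function extractResourceUrls(html) {
  return Array.from(html.matchAll(pattern), (m) => m.groups.result);
}

// Hypothetical sample mixing double and single quotes.
const html =
  '<link rel="stylesheet" href="css/style.css">\n' +
  "<script type='text/javascript' src='/js/jquery.js'></script>";
const urls = extractResourceUrls(html);
console.log(urls);
```

Because the closing quote is matched with a backreference to the opening one, a value such as href="it's.css" is still captured whole.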
I am using the jQuery Tokeninput plugin. I am trying to fetch the data from the database instead of using local data. My web service returns the JSON result wrapped in XML:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://tempuri.org/">[{"id":"24560","name":"emPOWERed-Admin"},{"id":"24561","name":"emPOWERed-HYD-Visitors"}]</string>
I have checked in the site http://loopj.com/jquery-tokeninput/ which says that the script should output JSON search results in the following format:
[
{"id":"856","name":"House"},
{"id":"1035","name":"Desperate Housewives"}
]
Both seem to be the same, but I am still not getting the items displayed in my page.
I am posting my code as well.
My JS code: DisplayTokenInput.js
$(document).ready(function() {
$("#textboxid").tokenInput('PrivateSpace.asmx/GetDl_info', {
hintText: "Type in DL Name", theme: "facebook",
preventDuplicates: true,
searchDelay: 200
});
});
My web-service code:
[WebMethod]
[ScriptMethod(UseHttpGet = true, ResponseFormat = ResponseFormat.Json)]
public string GetDl_info(string q)
{
string dl_input = string.Empty;
DataSet ds;
PSData ObjDl = new PSData();
ds = ObjDl.GetDistributionList(q);
List<DistributionList> DLObj = new List<DistributionList>();
foreach (DataRow datarow in ds.Tables[0].Rows)
{
DistributionList dl_list = new DistributionList();
dl_list.id = Convert.ToString(datarow["id"]);
dl_list.name = Convert.ToString(datarow["name"]);
DLObj.Add(dl_list);
}
dl_input = JsonConvert.SerializeObject(DLObj);
return dl_input;
}
}
public class DistributionList
{
public string id { get; set; }
public string name { get; set; }
}
I am posting the head portion of aspx code to show the library files i have included:
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title>Untitled Page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link href="../Styles/jquery-ui-1.8.20.custom.css" rel="stylesheet" type="text/css" />
<link href="../Styles/token-input.css" rel="stylesheet" type="text/css" />
<link href="../Styles/token-input-facebook.css" rel="stylesheet" type="text/css" />
<script src="Scripts/Lib/jquery-1.7.2.min.js" type="text/javascript"></script>
<script src="../Scripts/jquery.tokeninput.js" type="text/javascript"></script>--%>
<script src="DisplayTokenInput.js" type="text/javascript"></script>
</head>
You need to make sure that your request is a POST request, not a GET request. See this answer to find out more about why: How to let an ASMX file output JSON
I would assume that the code for the plugin isn't setting the content-type for ajax requests to JSON, so you could do it yourself before the service call with $.ajaxSetup ie:
$.ajaxSetup({
contentType: "application/json; charset=utf-8"
});
UPDATE: Apparently asmx services sometimes have issues with the 'charset=utf-8' portion, so if that doesn't work you could try just 'application/json'
UPDATE 2:
I don't think it's the contentType causing the issue, use the following to force a POST for ajax requests and see if this fixes it:
$.ajaxSetup({
type: "POST", contentType: "application/json; charset=utf-8"
});
UPDATE 3:
There is a default setting inside the plugin you're using that can change the requests from GET to POST. See here on its GitHub repo: jquery.tokeninput.js
In your copy of the JS file in the project, change the lines:
var DEFAULT_SETTINGS = {
// Search settings
method: "GET",
to
var DEFAULT_SETTINGS = {
// Search settings
method: "POST",
I also assume that the plugin constructs the query in a way that ignores the global jquery ajax settings anyway, so you shouldn't need to include my earlier snippets anymore.
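Instead of editing the library file, it may be enough to pass the setting as an option. The sketch below assumes that your copy of the plugin merges the options you pass over DEFAULT_SETTINGS, which is how its documented method option is meant to work; the selector and service URL are taken from the question.

```javascript
// Options object; "method" overrides the plugin default of "GET".
const tokenInputOptions = {
  method: "POST",
  hintText: "Type in DL Name",
  theme: "facebook",
  preventDuplicates: true,
  searchDelay: 200
};

// Only wire up the plugin when jQuery and tokenInput are actually loaded.
if (typeof jQuery !== "undefined" && jQuery.fn && jQuery.fn.tokenInput) {
  jQuery(function ($) {
    $("#textboxid").tokenInput("PrivateSpace.asmx/GetDl_info", tokenInputOptions);
  });
}
```

Keeping the override in your own script means a later upgrade of jquery.tokeninput.js does not silently revert it.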
I'm using the following code for an autocomplete feature,
but I need to fetch the values from a database using SQL Server 2008, C# and ASP.NET.
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>jQuery UI Autocomplete - Default functionality</title>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.9.1/themes/base/jquery-ui.css" />
<script src="http://code.jquery.com/jquery-1.8.2.js"></script>
<script src="http://code.jquery.com/ui/1.9.1/jquery-ui.js"></script>
<link rel="stylesheet" href="/resources/demos/style.css" />
<script>
$(function() {
var availableTags = [
"ActionScript",
"AppleScript",
"Asp",
"BASIC",
"C",
"C++",
"Clojure",
"COBOL",
"ColdFusion",
"Erlang",
"Fortran",
"Groovy",
"Haskell",
"Java",
"JavaScript",
"Lisp",
"Perl",
"PHP",
"Python",
"Ruby",
"Scala",
"Scheme"
];
$( "#tags" ).autocomplete({
source: availableTags
});
});
</script>
</head>
<body>
<div class="ui-widget">
<label for="tags">Tags: </label>
<input id="tags" />
</div>
</body>
</html>
How can I fetch that array's values from my database (using EF4 and ASP.NET)?
The first step is to create a C# ASP.Net page which produces a JSON result that the autocomplete plugin can parse. According to the documentation you can use the two following formats:
Array: An array can be used for local data. There are two supported formats:
An array of strings: [ "Choice1", "Choice2" ]
An array of objects with label and value properties: [ { label: "Choice1", value: "value1" }, ... ]
http://api.jqueryui.com/autocomplete/#option-source
Alternatively you can use a function to parse out whatever format you need but it sounds like the simplest solution will fulfill your needs.
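To illustrate the function form, here is a small sketch that reshapes server rows into the label/value objects quoted above. The row shape { ID, Name }, the helper name toAutocompleteItems and the URL /api/books are all hypothetical and only serve the example.

```javascript
// Convert rows of a hypothetical shape { ID, Name } into autocomplete items.
function toAutocompleteItems(rows) {
  return rows.map((row) => ({ label: row.Name, value: String(row.ID) }));
}

const items = toAutocompleteItems([
  { ID: 1, Name: "Watership Down" },
  { ID: 2, Name: "Animal Farm" }
]);
console.log(JSON.stringify(items));

// Only wire up the widget when jQuery UI is actually loaded.
if (typeof jQuery !== "undefined" && jQuery.fn && jQuery.fn.autocomplete) {
  jQuery("#tags").autocomplete({
    source: function (request, response) {
      // "/api/books" is a placeholder endpoint returning an array of rows.
      jQuery.getJSON("/api/books", { term: request.term }, function (rows) {
        response(toAutocompleteItems(rows));
      });
    }
  });
}
```

If the server can emit label and value directly, the mapping step is unnecessary and a plain URL string as the source is simpler.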
I'm going to assume you're using ASP.NET Web Forms, which isn't really tuned for this kind of thing, but you can still make it work with some tweaking. Let's create a page in your web application root called SearchResults.aspx.
The first thing to do is to clear out everything from your ASPX file except the line:
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SearchResults.aspx.cs" Inherits="ASP.Net_Forms.SearchResults" %>
Then you're free to change the code behind to output whatever format you like. In this case we'll be using JSON in a structure that Autocomplete can understand natively. We'll also need to set the response type.
public partial class SearchResults : System.Web.UI.Page
{
private class SomeSearchableClass
{
public int ID { get; set; }
public string Name { get; set; }
}
protected void Page_Load(object sender, EventArgs e)
{
// The autocomplete plugin defaults to using the querystring
// parameter "term". This can be confirmed by stepping through
// the following line of code and viewing the raw querystring.
List<SomeSearchableClass> Results = SomeSearchSource(Request.QueryString["term"]);
Response.ContentType = "application/json;charset=UTF-8";
// Now we need to project our results in a structure that
// the plugin can understand.
var output = (from r in Results
select new { label = r.Name, value = r.ID }).ToList();
// Then we need to convert it to a JSON string
JavaScriptSerializer Serializer = new JavaScriptSerializer();
string JSON = Serializer.Serialize(output);
// And finally write the result to the client.
Response.Write(JSON);
}
List<SomeSearchableClass> SomeSearchSource(string searchParameter)
{
// This is where you'd put your EF code to gather your search
// results. I'm just hard coding these examples as a demonstration.
List<SomeSearchableClass> ret = new List<SomeSearchableClass>();
ret.Add(new SomeSearchableClass() { ID = 1, Name = "Watership Down" });
ret.Add(new SomeSearchableClass() { ID = 2, Name = "Animal Farm" });
ret.Add(new SomeSearchableClass() { ID = 3, Name = "The Plague Dogs" });
// Treat a missing "term" as an empty search so Contains does not throw on null.
ret = ret.Where(x => x.Name.Contains(searchParameter ?? string.Empty)).ToList();
return ret;
}
}
And finally just change your jQuery to use the correct source:
$( "#tags" ).autocomplete({ source: "/SearchResults.aspx" });
See the sample below, adapted from the jQuery UI Autocomplete example.
All you need to do is call a page or handler that prepares the JSON data.
$( "#city" ).autocomplete({
source: function( request, response ) {
$.ajax({
url: "yourpage.aspx",
dataType: "jsonp",
data: {
},
success: function( data ) {
response( $.map( data.geonames, function( item ) {
return {
label: item.name + (item.adminName1 ? ", " + item.adminName1 : "") + ", " + item.countryName,
value: item.name
}
}));
}
});
},
minLength: 2,
select: function( event, ui ) {
log( ui.item ?
"Selected: " + ui.item.label :
"Nothing selected, input was " + this.value);
},
open: function() {
$( this ).removeClass( "ui-corner-all" ).addClass( "ui-corner-top" );
},
close: function() {
$( this ).removeClass( "ui-corner-top" ).addClass( "ui-corner-all" );
}
});
I have a problem with the events in my fullCalendar object not showing when using ajax to fetch the data from my JSON feed. I believe the JSON format is proper though since the output from JSON.aspx is:
[{"id":1,"title":"TESTTITLE","info":"INFOINFOINFO","start":"2012-08-20T12:00:00","end":"2012-08-20T12:00:00","user":1}]
I used Firebug and it seems like the JSON feed is not getting fetched properly.
When I add the JSON above directly to the events option, it displays properly.
(Edit) The JSON response is now working, although the events are still not displayed in fullcalendar.
JSON.aspx
public partial class JSON : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
// Get events from db and add to list.
DataClassesDataContext db = new DataClassesDataContext();
List<calevent> eventList = db.calevents.ToList();
// Select events and return datetime as sortable XML Schema style.
var events = from ev in eventList
select new
{
id = ev.event_id,
title = ev.title,
info = ev.description,
start = ev.event_start.ToString("s"),
end = ev.event_end.ToString("s"),
user = ev.user_id
};
// Serialize to JSON string.
JavaScriptSerializer jss = new JavaScriptSerializer();
String json = jss.Serialize(events);
Response.Write(json);
Response.End();
}
}
And my Site.master
<link href="~/Styles/Site.css" rel="stylesheet" type="text/css" />
<link href='fullcalendar/fullcalendar.css' rel='stylesheet' type='text/css' />
<script src='jquery/jquery-1.7.1.min.js' type='text/javascript'></script>
<script src='fullcalendar/fullcalendar.js' type='text/javascript' ></script>
<script type="text/javascript">
$(document).ready(function () {
$('#fullcal').fullCalendar({
eventClick: function() {
alert('a day has been clicked!');
},
events: 'JSON.aspx'
})
});
</script>
I've been scanning related questions for days but none of them seems to fix mine...
Why are your calls so complicated? Try this for now:
$('#fullcal').fullCalendar({
events: 'JSON.aspx',
eventClick: function (calEvent, jsEvent, view) {
alert('a day has been clicked!');
}
});
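If the events still do not show up, a quick sanity check of the feed can help narrow things down. The sketch below is only a debugging aid with a made-up helper name, findEventProblems. It assumes the documented minimum for a fullCalendar event object (a title and a start), and uses Date.parse as a rough stand-in for the plugin's own date parsing.

```javascript
// Return a list of human-readable problems found in a parsed event feed.
function findEventProblems(events) {
  const problems = [];
  if (!Array.isArray(events)) {
    return ["feed is not a JSON array"];
  }
  events.forEach((ev, index) => {
    if (typeof ev.title !== "string" || ev.title.length === 0) {
      problems.push("event " + index + ": missing title");
    }
    if (ev.start === undefined || Number.isNaN(Date.parse(ev.start))) {
      problems.push("event " + index + ": missing or unparsable start");
    }
  });
  return problems;
}

// The feed shown in the question.
const feed = JSON.parse(
  '[{"id":1,"title":"TESTTITLE","info":"INFOINFOINFO",' +
    '"start":"2012-08-20T12:00:00","end":"2012-08-20T12:00:00","user":1}]'
);
console.log(findEventProblems(feed));
```

An empty list only says the feed has the expected shape. It is still worth checking in the browser's network panel that the response contains nothing but the JSON, and that the calendar is showing the month the events fall in.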