SQL for the web - C#

Does anyone have experience with a query language for the web?
I am looking for a project, commercial or not, that does a good job of making a webpage queryable, and that can even follow links on it to aggregate information from a bunch of pages.
I would prefer a SQL- or LINQ-like syntax. I could of course download a webpage and start doing some XPath on it, but I'm looking for a solution with a nice abstraction.
I found WebSQL:
http://www.cs.utoronto.ca/~websql/
It looks good, but I'm not into Java. A sample query:
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
Are there others out there?
Is there a library that can be used in a .NET language?

See hpricot (a Ruby library).
# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc
It supports querying with CSS or XPath selectors.

Beautiful Soup and hpricot are the canonical versions, for Python and Ruby respectively.
For C#, I have used and appreciated HTML Agility Pack. It does an excellent job of turning messy, invalid HTML into queryable goodness.
There is also this C# HTML parser, which looks good, but I've not tried it.
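As a taste of what that looks like in practice, here is a rough LINQ equivalent of the WebSQL query from the question, sketched with Html Agility Pack (the URL is the question's placeholder):

// Sketch: the WebSQL anchor query from the question, expressed as LINQ
// over Html Agility Pack nodes. HtmlWeb.Load fetches and parses the page.
using System;
using System.Linq;
using HtmlAgilityPack;

class AnchorQuery
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://www.SomeDoc.html");

        var labels =
            from a in doc.DocumentNode.Descendants("a")
            let href = a.GetAttributeValue("href", "")
            where href.Contains(".ps.Z")
            select a.InnerText;

        foreach (var label in labels)
            Console.WriteLine(label);
    }
}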

You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problem (getting data out of a site, from the cloud). It's a W3C standard, but unfortunately Microsoft does not appear to support it yet.

I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.

Related

How to navigate through a website and "mine information"

I want to "simulate" navigation through a website and parse the responses.
I just want to make sure I am doing something reasonable before I start. I see two options for doing so:
Using the WebBrowser class.
Using the HttpWebRequest class.
My initial thought was to use HttpWebRequest and just parse the response.
What do you guys think?
I also wanted to ask: I use C# because it's my strongest language, but what languages are commonly used for this kind of website mining?
If you start doing it manually, you will probably end up hard-coding lots of special cases. Try Html Agility Pack or something else that supports XPath expressions.
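A minimal sketch of that combination, assuming you go the HttpWebRequest route; the URL and XPath are placeholders:

// Download the raw HTML with HttpWebRequest, then query it with XPath
// via Html Agility Pack instead of hand-rolled string handling.
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Fetcher
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());

            // SelectNodes returns null when nothing matches.
            var headings = doc.DocumentNode.SelectNodes("//h2");
            if (headings != null)
                foreach (var h in headings)
                    Console.WriteLine(h.InnerText.Trim());
        }
    }
}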
There are a lot of mining and ETL tools out there for serious data-mining needs.
For "user simulation" I would suggest using Selenum web driver or PhantomJS, which is much faster but has some limitations in browser emulation, while Selenium provides almost 100% browser features support.
If you're going to mine data from a website, there is something you must do first in order to be polite to the sites you are mining: obey the rules in each site's robots.txt, which is almost always located at www.example.com/robots.txt.
Then use Html Agility Pack to traverse the website.
Or convert the HTML document to XHTML using html2xhtml, then use an XML parser to traverse the site.
Remember to (a rough sketch follows below):
Check for duplicate pages. (The general idea is to hash the HTML doc at each URL; look up (super)shingles.)
Respect the robots.txt.
Get the absolute URL from each page.
Filter duplicate URLs from your queue.
Keep track of the URLs you have visited (e.g. with a timestamp).
Parse your HTML doc and keep your queue updated.
Keywords: robots.txt, absolute URL, HTML parser, URL normalization, mercator scheme.
Have fun.
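A naive sketch tying that checklist together (duplicate filtering, absolute-URL resolution, a visited map with timestamps), with Html Agility Pack doing the parsing. Robots.txt handling and rate limiting, which any real crawler needs, are reduced to a comment here:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class CrawlSketch
{
    static void Main()
    {
        var frontier = new Queue<Uri>();
        var visited = new Dictionary<Uri, DateTime>();
        frontier.Enqueue(new Uri("http://example.com/"));

        while (frontier.Count > 0)
        {
            Uri url = frontier.Dequeue();
            if (visited.ContainsKey(url)) continue;   // filter duplicate URLs
            visited[url] = DateTime.UtcNow;           // remember when we visited

            // NOTE: a real crawler must check robots.txt before this fetch.
            var doc = new HtmlWeb().Load(url.AbsoluteUri);
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null) continue;

            foreach (var a in anchors)
            {
                // Resolve relative hrefs to absolute URLs against this page.
                if (Uri.TryCreate(url, a.GetAttributeValue("href", ""), out Uri next)
                    && !visited.ContainsKey(next))
                    frontier.Enqueue(next);
            }
        }
    }
}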

Locating heading text using CSS

I have some C# code which verifies the heading text on a web page, currently located via XPath as follows.
Assert.AreEqual("Permissions", driver.FindElement(By.XPath(".//*[@id='navigation']/li[6]/h3")).Text);
This, as I understand it, checks that the text found at the end of the XPath matches the word "Permissions".
The above currently works, but I would rather use CSS locators; I hear it's best not to use XPath if possible.
I'm new to website testing, so I am not yet familiar with all this; any help will be much appreciated.
Let me know if there is more you require than what is provided above or if you have any alternate suggestions to the method already used.
I may not fully understand what you are trying to do, but since this has no answer after 7 hours I thought I might at least mention that if you are trying to get the InnerHtml of the heading, you can use Html Agility Pack (http://htmlagilitypack.codeplex.com/); it's very easy to use. It might not be what you are looking for, though.
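On the CSS question itself: assuming the markup really matches the XPath above, the equivalent CSS locator would look roughly like this (the nth-child index is a guess at the same list structure):

// CSS equivalent of the XPath locator, via Selenium's By.CssSelector.
Assert.AreEqual("Permissions",
    driver.FindElement(By.CssSelector("#navigation li:nth-child(6) h3")).Text);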

C# data scraping from websites

Hi, I am pretty new to the C# sphere; I've been in PHP and JavaScript since the beginning of this year. I want to scrape posts and comments from a blog. The site is http://www.somewhereinblog.net
What I want to do is
1. I want to log in programmatically
2. Then download the HTML
3. Then use regular expressions, XPath, or whatever comes in handy to separate the contents of posts and comments
I've been searching all over and have understood very little, though I am quite sure I need to use HtmlAgilityPack. I don't know how to add a library to a C# console or Forms application. Can someone give me some help? I badly need this, and I've only been into C# for a week, so I would be grateful for some detailed information. Waiting eagerly.
Thanks in advance, brothers.
Using WebClient you can log in and download the pages.
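A minimal sketch of that approach. WebClient does not keep cookies across requests on its own, so a common trick is to subclass it with a shared CookieContainer; the login URL and form field names below are hypothetical:

using System;
using System.Collections.Specialized;
using System.Net;

// WebClient subclass that carries cookies across requests, so the login
// POST and the subsequent page download share the same session.
class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
            http.CookieContainer = Cookies;   // attach the shared cookie jar
        return request;
    }
}

class LoginSketch
{
    static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // Post the login form (field names depend on the actual site).
            var form = new NameValueCollection
            {
                { "username", "me" },
                { "password", "secret" }
            };
            client.UploadValues("http://www.somewhereinblog.net/login", form);

            // The same client now sends the session cookie, so this page
            // is served as for a logged-in user.
            string html = client.DownloadString("http://www.somewhereinblog.net");
            Console.WriteLine(html.Length);
        }
    }
}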
Instead of Html Agility Pack I like CsQuery, because it lets you use jQuery syntax inside a string in C# code: you can download the HTML to a string, then search it and manipulate it just as you would with jQuery on an HTML page.
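For example, a small CsQuery sketch (the CSS class is hypothetical):

// CQ.CreateFromUrl downloads and parses the page; indexing with a
// jQuery-style selector returns the matching elements.
using System;
using CsQuery;

class CsQueryExample
{
    static void Main()
    {
        CQ doc = CQ.CreateFromUrl("http://www.somewhereinblog.net");

        // Hypothetical selector: every element with class "post-title".
        foreach (var node in doc[".post-title"])
            Console.WriteLine(node.InnerText);
    }
}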

BeautifulSoup and ASP.NET/C#

Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)?
Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#?
The intent is to use the library to extract readable text from any random URL.
Thanks
Html Agility Pack is a similar project, but for C# and .NET
EDIT:
To extract all readable text:
document.DocumentNode.InnerText
Note that this will also return the text content of <script> and <style> tags.
To fix that, you can remove those tags first, like this:
// Requires: using System.Linq; (for ToArray)
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();
(Credit: SLaks)
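Putting those pieces together, a minimal end-to-end sketch (the URL is a placeholder):

using System;
using System.Linq;
using HtmlAgilityPack;

class ReadableText
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");

        // Drop <script> and <style> subtrees before reading the text.
        foreach (var node in doc.DocumentNode
                                .Descendants()
                                .Where(n => n.Name == "script" || n.Name == "style")
                                .ToArray())
            node.Remove();

        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}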
You could try this, although it currently has a few bugs:
http://nsoup.codeplex.com/
I know this is quite old, but I decided to post this for future reference.
I came across this searching for a similar solution.
I found a library built on top of Html Agility Pack called ScrapySharp.
I've used it in much the same way I would use BeautifulSoup.
https://bitbucket.org/rflechner/scrapysharp/wiki/Home (EDIT: broken link, project moved to https://github.com/rflechner/ScrapySharp)
EDIT: https://www.nuget.org/packages/ScrapySharp/ has the package
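For flavour, a small sketch of how ScrapySharp layers CSS selectors over Html Agility Pack (the selector is a made-up example):

using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;   // adds the CssSelect extension method

class ScrapySharpExample
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");

        // CssSelect works much like BeautifulSoup's select().
        foreach (HtmlNode link in doc.DocumentNode.CssSelect("a.external"))
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}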

Html Parser & Object Model for .net/C#

I'm looking to parse HTML using .NET for the purposes of testing or asserting its content.
i.e.
HtmlDocument doc = GetDocument("some html");
List<Form> forms = doc.Forms();
Link link = doc.GetLinkByText("New Customer");
The idea is to allow people to write tests in C# similar to how they do in webrat (Ruby).
i.e.
visits('/')
fills_in "Name", "mick"
clicks "save"
I've seen Html Agility Pack, SgmlReader, etc., but has anyone created an object model for this, i.e. a set of classes representing the HTML elements, such as form, button, etc.?
Cheers.
Here is a good library for HTML parsing. Objects like HtmlButton and HtmlInput are not created for you, but it is a good starting point, and you can create them yourself if you don't want to use the HTML DOM.
The closest thing to an HTML DOM in .NET, as far as I can tell, is the HTML DOM itself: you can use the Windows Forms WebBrowser control, load it with your HTML, and then access the DOM from the outside.
BTW, this is .NET: any code that works for VB.NET would work for C#.
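A sketch of that approach; the WebBrowser control must run on an STA thread, so this is wrapped in a minimal Form:

using System;
using System.Windows.Forms;

class DomExample : Form
{
    public DomExample()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        Controls.Add(browser);

        browser.DocumentCompleted += (s, e) =>
        {
            // Walk the parsed DOM: every <form> element on the page.
            foreach (HtmlElement form in browser.Document.GetElementsByTagName("form"))
                Console.WriteLine(form.GetAttribute("id"));
        };

        browser.DocumentText = "<html><body><form id='f1'></form></body></html>";
    }

    [STAThread]
    static void Main() => Application.Run(new DomExample());
}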
You have two major options:
Use a browser engine (e.g. Internet Explorer) that will parse the HTML for you and then give you access to the generated DOM. This option requires some interop with the browser engine (in the case of IE, it's simple COM).
Use a lightweight parser like HtmlAgilityPack.
It sounds to me like you are trying to do HTML unit tests. Have you looked into Selenium? It has a C# library, so you can write your HTML unit tests in C#, assert that elements exist and have the correct values, and even click on links. It also works with JavaScript/AJAX sites.
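For instance, a webrat-style test might look like this in C# with Selenium and NUnit (the URL and link text are placeholders):

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class CustomerPageTests
{
    [Test]
    public void NewCustomerLinkIsPresent()
    {
        using (IWebDriver driver = new FirefoxDriver())
        {
            driver.Navigate().GoToUrl("http://localhost/");          // visits('/')
            var link = driver.FindElement(By.LinkText("New Customer"));
            Assert.AreEqual("New Customer", link.Text);
            link.Click();                                            // clicks
        }
    }
}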
The best parser for HTML is the HTQL COM component. You can use HTQL queries to retrieve HTML content.
