HTML Parser & Object Model for .NET/C#

I'm looking to parse HTML using .NET for the purpose of testing or asserting its content.
i.e.
HtmlDocument doc = GetDocument("some html");
List<Form> forms = doc.Forms();
Link link = doc.GetLinkByText("New Customer");
The idea is to allow people to write tests in C# similar to how they do in webrat (Ruby).
i.e.
visits('/')
fills_in "Name", "mick"
clicks "save"
I've seen the HTML Agility Pack, SgmlReader, etc., but has anyone created an object model for this, i.e. a set of classes representing the HTML elements, such as form, button, etc.?
Cheers.

Here is a good library for HTML parsing. Objects like HtmlButton and HtmlInput are not created for you, but it is a good point to start from and to create them yourself if you don't want to use the HTML DOM.

The closest thing to an HTML DOM in .NET, as far as I can tell, is the HTML DOM.
You can use the Windows Forms WebBrowser control, load it with your HTML, then access the DOM from the outside.
BTW, this is .NET. Any code that works for VB.NET would work for C#.
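A minimal sketch of that approach (assuming a Windows Forms reference and an STA thread, since the control wraps the IE engine):

using System;
using System.Windows.Forms;

class Program
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentText = "<html><body><a href='/new'>New Customer</a></body></html>";

        // DocumentText loads asynchronously; pump messages until it is ready.
        while (browser.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        // Walk the real HTML DOM that the IE engine built.
        foreach (HtmlElement link in browser.Document.GetElementsByTagName("a"))
        {
            if (link.InnerText == "New Customer")
                Console.WriteLine(link.GetAttribute("href"));
        }
    }
}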

You have two major options:
Use a browser engine (e.g. Internet Explorer) that will parse the HTML for you and then give you access to the generated DOM. This option requires some interop with the browser engine (in the case of IE, it's simple COM).
Use a lightweight parser like HtmlAgilityPack, as in the sketch below.
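A minimal HtmlAgilityPack sketch of the second option (the questioner's object model of forms, buttons, etc. would be a thin layer over nodes like these):

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<form id='f1'><input name='Name'/><button>save</button></form>");

        // SelectNodes takes an XPath expression and returns null when nothing matches.
        var forms = doc.DocumentNode.SelectNodes("//form");
        if (forms != null)
            foreach (var form in forms)
                Console.WriteLine(form.GetAttributeValue("id", ""));
    }
}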

It sounds to me like you are trying to do HTML unit tests. Have you looked into Selenium? It even has a C# library, so you can write your HTML unit tests in C#, assert that elements exist and have the correct values, and even click on links. It also works with JavaScript / AJAX sites.
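A hedged sketch of what a webrat-style test could look like with Selenium's C# bindings (the URL and element names here are hypothetical):

using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class CustomerTests
{
    static void Main()
    {
        IWebDriver driver = new FirefoxDriver();
        try
        {
            // visits / fills_in / clicks, webrat-style.
            driver.Navigate().GoToUrl("http://localhost/");                 // hypothetical URL
            driver.FindElement(By.LinkText("New Customer")).Click();
            driver.FindElement(By.Name("Name")).SendKeys("mick");           // hypothetical field name
            driver.FindElement(By.XPath("//input[@value='save']")).Click(); // hypothetical button
        }
        finally
        {
            driver.Quit();
        }
    }
}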

The best parser for HTML is the HTQL COM component. You can use HTQL queries to retrieve HTML content.

Related

How to convert html to text without removing html tags

I am trying to convert text with HTML tags (p, ol, b) into normal text (like the result of running the snippet). I have tried the code below in .NET Core, but the result is plain HTML without formatting (e.g. a <p> tag should show a paragraph, <ol> should convert to numbers, <b> should make the text bold, etc.).
var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);
var innertext = doc.DocumentNode.InnerText;
I also tried with the HTML Agility Pack, but no luck.
Html.Raw(sampleHtml) works with MVC Razor but not with .NET Core.
<p>Angular is a platform for building mobile and desktop web applications. It has a big community of millions of developers who choose Angular to build compelling user interfaces.:</p><ol><li>Angular is a JavaScript open-source front-end web application framework..</li><li>Angular solves many of the challenges faced when developing single page, cross platform, performant applications.</li></ol><p><b>Angular</b></p><p><b>What's new</b></p><p><b>Angular is a complete rewrite of AngularJS.</b></p><p>Angular does not have a "scope" concept or controllers, instead, it uses a component hierarchy as its main architecture.</p><p><b>Warning</b></p><p>Static Typing (<b>support</b>) for the purpose of study.
Kindly comment with your ideas and ways to achieve this. Thanks.
Thanks for all your responses. I was able to do it in the Angular template instead of getting the converted HTML from C#, using <p [innerHTML]="sampleHtml"></p>. innerHTML does not work with 'textarea', which is what I was trying first, so I used a paragraph; a div can also be used.

ASP.NET Core: Load the actually rendered text from a URL

I'm looking for a simple way to get a string from a URL that contains all the text actually displayed to the user.
I.e. anything loaded with a delay (using JavaScript) should be included. Also, the result should ideally be free of HTML tags etc.
A straightforward approach with WebClient.DownloadString() and a subsequent HTML regex is pretty much pointless, because most content in modern web apps is not contained in the initial HTML document.
Most probably you can use Selenium WebDriver to fully load the page and then dump the full DOM.
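A minimal sketch of that idea (the URL is a placeholder; .Text on the body element returns the rendered text without tags):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // no visible browser window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("https://example.com"); // placeholder URL
            // For content that arrives late, add an explicit wait before reading.
            string visibleText = driver.FindElement(By.TagName("body")).Text;
            Console.WriteLine(visibleText);
        }
    }
}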

Generate Screenshots from URL using ASP.NET and C#

I want to generate screenshots of a website using its URL.
This I want to create using ASP.NET and C#, and I don't want to use any of the available tools and APIs (Url2Png, Wesnappr, Awesomium, etc.).
Which classes of ASP.NET and C# should I explore for this? How should I go about it?
Please can someone guide me on this.
Looks like a fun project to do by hand...
Read the W3C site for the HTML and CSS specifications (4+5 for HTML and 1+2+3 for CSS)
Implement your own HTML engine
Read the ECMA specification to learn the inner workings of JavaScript. Also don't forget to check the specific implementations of the most popular browser(s) for you.
Implement your own JavaScript engine
Tie the HTML and JavaScript engines together
Now that you have a way to safely render HTML on the server, it is an easy task: get your engines to render the page into a bitmap (you may also need to implement a custom graphics library) and you are done.
More seriously: use existing tools (make sure they are OK to be used on a server; i.e. I would not do it with the IE engine). Or, if you want to learn some particular part of the stack, scope the rest down (e.g. just render the title of the page to a bitmap using System.Drawing, as sketched below) to see how the components work together.
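A rough sketch of that scoped-down version, assuming the HTML Agility Pack is available for fetching the title (the URL and output file name are illustrative):

using System.Drawing;
using System.Drawing.Imaging;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Fetch the page and pull out just the <title> text.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com"); // placeholder URL
        string title = doc.DocumentNode.SelectSingleNode("//title").InnerText;

        // "Render" that one piece of the page into a bitmap.
        using (var bitmap = new Bitmap(800, 100))
        using (var graphics = Graphics.FromImage(bitmap))
        using (var font = new Font("Arial", 16))
        {
            graphics.Clear(Color.White);
            graphics.DrawString(title, font, Brushes.Black, 10, 10);
            bitmap.Save("title.png", ImageFormat.Png);
        }
    }
}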

Grab details from web page

I need to write C# code for grabbing the contents of a web page. The steps look like the following:
Browse to the login page
I have a user name and a password; provide them programmatically and log in
Then you are on the detail page
You have to get some information there (product ID, description, etc.)
Then you need to click (by code) on Detail View
Then you can get the price for that product from there.
Now it is done, so we can write a detail line into a text file like this...
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK; I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
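A rough sketch of that WatiN flow against the steps above, with the URL and all element names as placeholder assumptions:

using System.IO;
using WatiN.Core;

class Scraper
{
    [System.STAThread]
    static void Main()
    {
        using (var browser = new IE("http://example.com/login")) // placeholder URL
        {
            // Log in; the field names here are hypothetical.
            browser.TextField(Find.ByName("username")).TypeText("myUser");
            browser.TextField(Find.ByName("password")).TypeText("myPass");
            browser.Button(Find.ByValue("Login")).Click();

            // Read the product details, then open the detail view for the price.
            string name = browser.Span(Find.ById("productName")).Text;
            string productId = browser.Span(Find.ById("productId")).Text;
            browser.Link(Find.ByText("Detail View")).Click();
            string price = browser.Span(Find.ById("price")).Text;

            // e.g. ABC Printer::225519::285.00
            File.AppendAllText("products.txt", name + "::" + productId + "::" + price + "\r\n");
        }
    }
}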
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with that library: the site I want to get data from has a captcha on the login page.
I can enter that value if it can show the image and wait for my input.
Can we achieve that with this library? If so, I would like to see a sample.
You should be able to achieve this by using two classes in C#: HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text wherein patterns exist.
How to: Send Data Using the WebRequest Class
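A minimal sketch of that pairing (the URL and the pattern are placeholders; as noted, regular expressions work best when the page has a predictable structure):

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/products"); // placeholder URL
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();

            // Hypothetical pattern: pull prices out of spans with a 'price' class.
            foreach (Match m in Regex.Matches(html, "<span class=\"price\">([^<]+)</span>"))
                Console.WriteLine(m.Groups[1].Value);
        }
    }
}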

SQL for the web

Does anyone have experience with a query language for the web?
I am looking for a project, commercial or not, that does a good job of making a webpage queryable and that even follows links on it to aggregate information from a bunch of pages.
I would prefer a SQL- or LINQ-like syntax. I could of course download a webpage and start doing some XPath on it, but I'm looking for a solution that has a nice abstraction.
I found WebSQL:
http://www.cs.utoronto.ca/~websql/
which looks good, but I'm not into Java.
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
Are there others out there?
Is there a library that can be used in a .NET language?
See hpricot (a Ruby library).
# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc
It supports querying with CSS or XPath selectors.
Beautiful Soup and hpricot are the canonical versions, for Python and Ruby respectively.
For C#, I have used and appreciated the HTML Agility Pack. It does an excellent job of turning messy, invalid HTML into queryable goodness.
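As a point of comparison, the WebSQL query above translates to something like the following with the Agility Pack (XPath instead of SQL, but a similar level of abstraction):

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.SomeDoc.html"); // URL from the WebSQL example

        // Anchors whose href contains ".ps.Z", as in the WebSQL query.
        var anchors = doc.DocumentNode.SelectNodes("//a[contains(@href, '.ps.Z')]");
        if (anchors != null)
            foreach (var a in anchors)
                Console.WriteLine(a.InnerText);
    }
}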
There is also this C# HTML parser, which looks good, but I've not tried it.
You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problem (getting data out of a site, from the cloud). It's a W3C standard, but Microsoft apparently does not support it yet, unfortunately.
I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.
