BeautifulSoup and ASP.NET/C# - c#

Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)?
Is there a BeautifulSoup alternative or port that works nicely with ASP.NET/C#?
I plan to use the library to extract readable text from arbitrary URLs.
Thanks

Html Agility Pack is a similar project, but for C# and .NET.
EDIT:
To extract all readable text:
doc.DocumentNode.InnerText
Note that this also returns the text content of <script> and <style> tags.
To fix that, remove those tags before reading InnerText, like this:
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();
(Credit: SLaks)
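Putting the pieces above together, here is a minimal sketch of how it could look end to end, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (ReadableText, FromHtml, FromUrl) are made up for illustration; only HtmlDocument, HtmlWeb, Descendants, Remove, InnerText and HtmlEntity.DeEntitize come from the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack itself.
public static class ReadableText
{
    // Returns the text of an HTML string, skipping <script> and <style> content.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() takes a snapshot so nodes can be removed while iterating.
        foreach (var node in doc.DocumentNode.Descendants()
                     .Where(n => n.Name == "script" || n.Name == "style")
                     .ToArray())
        {
            node.Remove();
        }

        // InnerText leaves entities such as &amp; encoded, so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }

    // Downloads a page with HtmlWeb and extracts its text the same way.
    public static string FromUrl(string url)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        return FromHtml(doc.DocumentNode.OuterHtml);
    }
}
```

Calling ReadableText.FromUrl with the address of a page should give you its text in one string. Whitespace is not normalized here, so you will probably want to collapse runs of blank lines yourself.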

You could try this although it currently has a few bugs:
http://nsoup.codeplex.com/

I know this is quite old, but I decided to post this for future reference.
I came across this searching for a similar solution.
I found a library built on top of Html Agility Pack called ScrapySharp.
I've used it in much the same way as I would use BeautifulSoup.
https://bitbucket.org/rflechner/scrapysharp/wiki/Home (EDIT: broken link, project moved to https://github.com/rflechner/ScrapySharp)
EDIT: https://www.nuget.org/packages/ScrapySharp/ has the package

Related

How to convert html to text without removing html tags

I am trying to convert text with HTML tags (p, ol, b) into normal text (like the result of "Run code snippet"). I have tried the code below in .NET Core, but the result comes out without any formatting (e.g. a p tag should show a paragraph, ol should become a numbered list, b should make the text bold, etc.).
var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);
var innertext = doc.DocumentNode.InnerText;
I also tried Html Agility Pack, but no luck.
Html.Raw(sampleHtml) works with MVC Razor but not with .NET Core.
<p>Angular is a platform for building mobile and desktop web applications. It has a big community of millions of developers who choose Angular to build compelling user interfaces.:</p><ol><li>Angular is a JavaScript open-source front-end web application framework..</li><li>Angular solves many of the challenges faced when developing single page, cross platform, performant applications.</li></ol><p><b>Angular</b/></p><p><b>What's new</b/></p><p><b>Angular is a complete rewrite of AngularJS.</b/></p><p>Angular does not have a "scope" concept or controllers, instead, it uses a component hierarchy as its main architecture.</p><p><b>Warning</b/></p><p>Static Typing (<b>support</b>) for the purpose of study.
Kindly comment your ideas and ways to achieve this. Thanks
Thanks for all your responses. I was able to do it in the Angular template, instead of getting the converted HTML from C#, using <p [innerHTML]="sampleHtml"></p>. innerHTML does not work with a textarea, which is what I was originally trying to use, so I used a paragraph; a div can also be used.

C# data scraping from websites

Hi, I am pretty new to C#. I have been working in PHP and JavaScript since the beginning of this year. I want to scrape posts and comments from a blog. The site is http://www.somewhereinblog.net
What I want to do is
1. I want to log in using a software
2. Then download the html
3. Then use regular expressions, XPath, or whatever comes in handy to separate the contents of posts and comments
I have been searching all over but understood very little, though I am quite sure I need to use 'htmlagilitypack'. I don't know how to add a library to a C# console or forms application. Can someone give me some help? I badly need this. I have only been using C# for a week, so I would be grateful for some detailed information. Waiting eagerly.
Thanks in advance brothers.
Using WebClient you can log in and download the HTML.
Instead of html-agility-pack I prefer CsQuery, because it lets you use jQuery syntax from C# code: you download the HTML into a string, then search and manipulate it the way you would with jQuery on an HTML page.
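For the login step, a rough sketch of one common approach: WebClient does not keep cookies between requests by default, so a small subclass that attaches a CookieContainer is needed for the session to survive. CookieAwareWebClient and BlogDownloader are names invented here, and the form field names "username" and "password" are pure assumptions; check the actual login form of the site for the real ones.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;

// Hypothetical subclass: shares one CookieContainer across all requests
// so the session cookie set at login is sent with later downloads.
public class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
        {
            http.CookieContainer = Cookies;
        }
        return request;
    }
}

public static class BlogDownloader
{
    // loginUrl and pageUrl are placeholders supplied by the caller.
    public static string LoginAndDownload(
        string loginUrl, string pageUrl, string user, string password)
    {
        using (var client = new CookieAwareWebClient())
        {
            var form = new NameValueCollection
            {
                { "username", user },     // assumed field name
                { "password", password }  // assumed field name
            };

            // POST the login form; the response cookies land in client.Cookies.
            client.UploadValues(loginUrl, "POST", form);

            // The same client (and cookies) then fetches the protected page.
            return client.DownloadString(pageUrl);
        }
    }
}
```

The returned string is the raw HTML, which you can then hand to whichever parser you settle on. Sites that use hidden form tokens or JavaScript-driven logins will need more than this.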

diff to web display?

I want to create a diff between two pieces of text (code, but I may also want to diff forum posts) and highlight the differences in a web page. What library might I use to do that? I don't mind if the highlighting is done in JavaScript. I am using ASP.NET with C#.
You can use the DiffPlex library.
It is the same diff'ing library that is used by CodePlex.com to show source code diffs. You can see an example of its output here.
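As a rough idea of how that could be wired up for a web page, assuming the DiffPlex NuGet package: the sketch below builds a line-by-line inline diff and wraps added and removed lines in <ins> and <del>. DiffHtml.Render and the "diff" CSS class are made up for this example; InlineDiffBuilder, Differ, DiffPaneModel, DiffPiece and ChangeType are DiffPlex types.

```csharp
using System.Net;
using System.Text;
using DiffPlex;
using DiffPlex.DiffBuilder;
using DiffPlex.DiffBuilder.Model;

// Hypothetical helper that turns a DiffPlex inline diff into simple HTML.
public static class DiffHtml
{
    public static string Render(string oldText, string newText)
    {
        var builder = new InlineDiffBuilder(new Differ());
        DiffPaneModel diff = builder.BuildDiffModel(oldText, newText);

        var html = new StringBuilder("<pre class=\"diff\">");
        foreach (DiffPiece line in diff.Lines)
        {
            // Encode the text so code containing < or & displays correctly.
            string text = WebUtility.HtmlEncode(line.Text);
            switch (line.Type)
            {
                case ChangeType.Inserted:
                    html.Append("<ins>").Append(text).Append("</ins>\n");
                    break;
                case ChangeType.Deleted:
                    html.Append("<del>").Append(text).Append("</del>\n");
                    break;
                default:
                    html.Append(text).Append('\n');
                    break;
            }
        }
        return html.Append("</pre>").ToString();
    }
}
```

You would then style ins and del with CSS (green and red backgrounds, for instance). This works at line granularity only; highlighting changes inside a line would take extra work.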
I've used jsdifflib.

C#, turn html to valid html e-mail

I want to turn an html page that can easily be edited on the net to a valid html e-mail (inline styles, absolute links etc).
I have found the premailer project, which changes your HTML to work well in as many e-mail clients as possible. I want to know whether a .NET equivalent exists, or whether it would be possible to run this project in IronRuby, for example.
It's Ruby code, so I would expect that it would run in IronRuby. Did you try it and run into problems?

SQL for the web

Does anyone have experience with a query language for the web?
I am looking for a project, commercial or not, that does a good job of making a web page queryable and that can even follow links on it to aggregate information from a bunch of pages.
I would prefer a SQL- or LINQ-like syntax. I could of course download a web page and start running XPath on it, but I'm looking for a solution that has a nice abstraction.
I found websql
http://www.cs.utoronto.ca/~websql/
Which looks good but I'm not into Java
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
Are there others out there?
Is there a library that can be used in a .NET language?
See hpricot (a Ruby library).
require 'hpricot'
require 'open-uri'

# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc
It supports querying with CSS or XPath selectors.
Beautiful Soup and hpricot are the canonical versions, for Python and Ruby respectively.
For C#, I have used and appreciated HTML Agility Pack. It does an excellent job of turning messy, invalid HTML into queryable goodness.
There is also this C# html parser which looks good but I've not tried it.
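To give a feel for the LINQ-like syntax the question asks about, here is a small sketch that mirrors the WebSQL query from the question using LINQ over Html Agility Pack nodes. It assumes the HtmlAgilityPack NuGet package; LinkQuery and LabelsWhereHrefContains are illustrative names, not library API. It only queries one already-downloaded page and does not follow links.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper showing a query-style lookup over anchors.
public static class LinkQuery
{
    // Roughly: SELECT a.label FROM Anchor a WHERE a.href CONTAINS fragment
    public static List<string> LabelsWhereHrefContains(string html, string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return (from a in doc.DocumentNode.Descendants("a")
                let href = a.GetAttributeValue("href", "")
                where href.Contains(fragment)
                select HtmlEntity.DeEntitize(a.InnerText).Trim())
               .ToList();
    }
}
```

Passing ".ps.Z" as the fragment gives the equivalent of the WebSQL example for a single document.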
You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problems (i.e. getting data out of a site -- from the cloud). It's a W3C standard, but Microsoft, apparently, does not support it yet, unfortunately.
I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.
