Extract content from an HTML document - C#

I was wondering how I can do something similar to what Facebook does when a link is posted, or to link-shortening services that can fetch the title of a page and its content.
Example: (screenshot of a link preview omitted)
My idea is to get only the plain text from a web page. For example, if the URL is a newspaper article, how can I get only the article's text, as shown in the image? So far I have been trying HtmlAgilityPack, but I can never get the text clean.
Note: this app is for Windows Phone 7.

You're on the right track with HtmlAgilityPack.
If you want all the text of the website, go for the InnerText property. But I suggest you go with the meta description tag instead (if available).
EDIT - Go for the meta description. I believe that's what Facebook is doing:
(images omitted: a sample Facebook link preview and the corresponding page source)
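For example, a minimal sketch with HtmlAgilityPack (assuming you've already downloaded the HTML, e.g. with WebClient.DownloadStringAsync on WP7) that prefers the meta description and falls back to the <title>; Facebook also reads Open Graph tags such as og:description when a page provides them:

    // A minimal sketch: prefer the meta description, fall back to the page title.
    // Assumes the HTML string was already downloaded and HtmlAgilityPack is referenced.
    using HtmlAgilityPack;

    public static class PagePreview
    {
        public static string GetSummary(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // <meta name="description" content="..."> if the site provides one.
            var meta = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");
            if (meta != null)
                return meta.GetAttributeValue("content", string.Empty);

            // Otherwise fall back to the <title> text.
            var title = doc.DocumentNode.SelectSingleNode("//title");
            return title != null ? HtmlEntity.DeEntitize(title.InnerText).Trim() : string.Empty;
        }
    }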

Related

How to get hidden data in an HTML file

I am trying to get the comments on an Instagram post with C#. The thing is, there is a 'Load more comments' button which, as the name suggests, does its job: when I look at the HTML in Firefox, new <li> elements suddenly appear out of nowhere. Is this data coming from a database, or is it embedded in the HTML file? Is there a way to reach that data? I tried SgmlReader, but I couldn't manage to get all of the data I'm looking for.

Kentico 7: How to drill into the CMS and get images onto an HTML page

Hi, I'm fairly new to Kentico. I need to access the CMS and get images onto my HTML page. How do I do that? I have seen methods like getURL, but I don't know in which context to use them.
If the places where you enter content on the page are rich text (meaning you type in them and a toolbar appears that lets you format text and do other things), then there is an icon that looks like a film strip; click it and you should be able to select an image from the media library.
or
If there is nowhere to do that, then click on the Design view and add an editable image web part to the page template.
Can you elaborate on what you mean by an HTML page?
Are you using the portal engine to design your page? If so, you can add the image using a Static HTML web part and just add the URL to the file. Getting the URL depends on where the file is stored: is it an attachment, in a media library, or in your App_Themes folder?
-UPDATE-
Since you are storing your files in the Media Library, there are a couple ways to display the image on the page.
Static HTML - You can go to the page, open the Design tab, and add a Static HTML web part; in that web part, add the HTML for the image, something like <img src="/MySite/media/MyLibrary/MyImage.jpg" alt="" /> (an illustrative path, not your actual one). To get the URL, you can go to the media library, select the image you want, and the panel at the bottom should show the URL for the image.
Editable Image - Add the Editable Image web part to the page; you can then set the default image in the web part properties, and using those dialogs you should be able to select the file from the media library. Additionally, once the web part is on the page, you should be able to switch to the Page tab and select the image you want there.

Showing HTML in a Windows Phone app

I am creating an application for reading news, and I get the content from a server in the form of HTML code. I use the standard WebView control to show it, but it does not work very well, because you cannot change this control. I also found HTMLTextBox, but that control does not display YouTube videos. My question is: which control is best to use for this?
You have 3 solutions to display HTML:
Parse the HTML and display it with your own XAML. You could use the Html Agility Pack to parse the HTML you want (see the sketch after this list).
Let someone else parse the HTML for you and display it with a custom control like the HTMLTextBox you mentioned.
Let the browser parse the HTML for you and display it with the classic WebView. Note that you can inject some CSS or JS to enhance, improve, or manage the HTML.
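For option 1, here is a rough sketch of the idea, assuming HtmlAgilityPack is referenced; the class and method names are mine, and a real renderer would need to handle links, images, and embedded video as well:

    // A rough sketch of option 1: map each <p> element to a wrapped TextBlock.
    // Not a complete renderer; panel and class names are illustrative.
    using System.Windows;
    using System.Windows.Controls;
    using HtmlAgilityPack;

    public static class HtmlToXaml
    {
        public static void Render(string html, StackPanel panel)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var paragraphs = doc.DocumentNode.SelectNodes("//p");
            if (paragraphs == null) return; // no <p> elements found

            foreach (var p in paragraphs)
            {
                panel.Children.Add(new TextBlock
                {
                    Text = HtmlEntity.DeEntitize(p.InnerText),
                    TextWrapping = TextWrapping.Wrap
                });
            }
        }
    }

That last point (embedded video) is exactly where option 1 gets expensive, which is why option 3 with a bit of injected CSS/JS is often the pragmatic choice.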

How to read a full-text RSS feed

Some sites can produce a full-text RSS feed even when the original RSS address doesn't contain the full text, like this site.
How can I do that?
I don't know much about C#, but I can still give a general answer on how to solve your problem. RSS feeds (almost) always link to the article, hosted on the newspaper's or blog's website, where the whole article is available. So the "RSS filler" takes the article content from the website and basically puts it back into the feed, replacing the available (short) intro.
To achieve this you need to:
parse/generate RSS/Atom feeds (I'm sure there are plenty of C# libs to do that)
find the actual article in the HTML page linked from the original RSS feed. Indeed, the linked page contains a lot of things you don't want to put in the "full" RSS feed (such as the website header, nav bar, ads, comments, Facebook like button, and so on). The easiest way to do this is to use readability (a quick Google search will turn up libraries for that).
If you combine both of these, you can achieve your goal; a rough sketch of the combination follows.
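A minimal sketch, assuming System.ServiceModel.Syndication and HtmlAgilityPack are available; the feed URL and the //article XPath are placeholders, since real pages need per-site rules or a readability-style heuristic:

    // Rebuild a feed with full article text: parse the feed, fetch each linked
    // page, extract the article body, and write the feed back out as RSS 2.0.
    using System;
    using System.Linq;
    using System.Net;
    using System.ServiceModel.Syndication;
    using System.Xml;
    using HtmlAgilityPack;

    class FullTextFeed
    {
        static void Main()
        {
            var feedUrl = "https://example.com/feed.xml"; // hypothetical feed

            using (var reader = XmlReader.Create(feedUrl))
            {
                var feed = SyndicationFeed.Load(reader);

                foreach (var item in feed.Items)
                {
                    var articleUrl = item.Links.First().Uri;

                    // Fetch the linked page and pull out the article body.
                    string html;
                    using (var client = new WebClient())
                        html = client.DownloadString(articleUrl);

                    var doc = new HtmlDocument();
                    doc.LoadHtml(html);

                    var body = doc.DocumentNode.SelectSingleNode("//article");
                    if (body != null)
                    {
                        // Replace the short intro with the full article markup.
                        item.Summary = new TextSyndicationContent(
                            body.InnerHtml, TextSyndicationContentKind.Html);
                    }
                }

                // Write the enriched feed back out as RSS 2.0.
                using (var writer = XmlWriter.Create(Console.Out))
                    feed.SaveAsRss20(writer);
            }
        }
    }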
You can find one implementation of this kind of tool at http://fivefilters.org, and their source code (for older versions) is at http://code.fivefilters.org/full-text-rss/. It's in PHP, but it can give you a rough idea of how to proceed.
You can get a complete script that expands a partial RSS feed from the Full Post RSS Feed website.
The steps involved:
- Get the post URL from the RSS feed.
- Fetch the full content of the post URL; the script uses curl to get the content.
- Parse the content; the script uses templates for that. They keep updating the templates for the most popular websites and WordPress themes. Based on the matching template, the HTML content is parsed into HTML DOM objects, and the article content is then located through those DOM objects (a C# sketch of this template idea follows below).
- Finally, generate the RSS feed again with the full content.
You can check the script, which is written in PHP, to get some ideas; later you can port the logic to any language.
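To give an idea of step 3 in C# rather than PHP, a toy version of the template idea might map each host to the XPath that locates its article body (the selectors below are made up):

    // A toy "template" lookup: per-site XPath rules with a generic fallback.
    using System.Collections.Generic;
    using HtmlAgilityPack;

    static class ContentTemplates
    {
        static readonly Dictionary<string, string> Templates = new Dictionary<string, string>
        {
            { "example.com",      "//div[@class='article-body']" },          // hypothetical
            { "mywordpress.blog", "//div[contains(@class,'entry-content')]" } // hypothetical
        };

        public static string ExtractBody(string host, string html)
        {
            string xpath;
            if (!Templates.TryGetValue(host, out xpath))
                xpath = "//article"; // fallback when no template matches

            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var node = doc.DocumentNode.SelectSingleNode(xpath);
            return node != null ? node.InnerHtml : null;
        }
    }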

Get a snapshot of a posted HTML page?

I'm using ExpertPDF to convert a couple of web pages to PDF, and there's one that I'm having difficulties with. This page only renders content when info is POSTed to it, and the content is text and a PNG graph (the graph is the most important piece).
I tried creating a page form with an auto-submit on the body onload='' event. If I go to this page, it auto-posts to the third-party page and I get the page as I expect. But it appears ExpertPDF won't take a 'snapshot' if the page is redirected.
I tried using HttpRequest/Response and WebClient, but have only been able to retrieve the HTML, which doesn't include the PNG graph.
Any idea how I can create a MemoryStream that includes the HTML AND the PNG graph, or POST to the page but then somehow point ExpertPDF at that URL to take a snapshot of the posted results?
Help is greatly appreciated - I've spent too much time on this one.
Thanks!
In HTML/HTTP, the web page (the HTML) is a separate resource from any images it includes. So you would need to parse the HTML, find the URL that points to your graph, and then make a second request to that URL to get the image. (This is unless the page spits the image out inline as a data URI, which is pretty rare, and if that were the case you probably wouldn't be asking.)
From a quick look at ExpertPDF's FAQ page, there's a question that deals specifically with your problem. I would recommend you take a look at that.
** UPDATE **
Take a look at the second FAQ question:
Q: When I convert an HTML string to PDF, the external CSS files and images are not applied in the rendered PDF document.
You can take the original (single) response from your WebClient, convert it into a string, and pass that string to ExpertPDF as described in the answer to that question. A sketch follows.
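Putting that together, a rough sketch: POST the form fields yourself with WebClient, then hand the returned HTML string to ExpertPDF along with a base URL so it can resolve the relative link to the PNG graph. The form fields and URLs below are hypothetical, and the GetPdfBytesFromHtmlString call should be checked against your ExpertPDF version's API:

    // POST to the third-party page, then convert the returned HTML string to PDF.
    using System.Collections.Specialized;
    using System.IO;
    using System.Net;
    using System.Text;

    class PostAndConvert
    {
        static void Main()
        {
            // The form fields the third-party page expects (names are hypothetical).
            var fields = new NameValueCollection
            {
                { "reportId", "42" },
                { "range", "30d" }
            };

            byte[] response;
            using (var client = new WebClient())
                response = client.UploadValues("https://example.com/report", fields);

            string html = Encoding.UTF8.GetString(response);

            // Pass the HTML string to ExpertPDF; the base URL lets it resolve the
            // relative <img> link to the PNG graph so it appears in the PDF.
            // (Method name per ExpertPDF docs; verify against your version.)
            var converter = new ExpertPdf.HtmlToPdf.PdfConverter();
            byte[] pdf = converter.GetPdfBytesFromHtmlString(html, "https://example.com/");
            File.WriteAllBytes("report.pdf", pdf);
        }
    }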
