How to parse HTML to Text with styles - c#

I'm building an app using Xamarin Android, and I want to convert HTML to normal text with formatting, for example:
HTML Code
<p><strong>Lorem ipsum</strong> is placeholder text <strong><em><span style="color:#ff0000">commonly</span></em></strong> used in the graphic, print, and publishing industries for previewing layouts and visual mockups.
</p>
<p> </p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
Text
Lorem ipsum is placeholder text commonly used in the graphic, print, and publishing industries for previewing layouts and visual mockups.
Item 1
Item 2
Item 3
I get this content from the database, and I want to convert the post content to text with formatting.

The TextView currently supports the following HTML tags as listed in this blog post:
<a href="...">
<b>
<big>
<blockquote>
<br>
<cite>
<dfn>
<div align="...">
<em>
<font size="..." color="..." face="...">
<h1>
<h2>
<h3>
<h4>
<h5>
<h6>
<i>
<img src="...">
<p>
<small>
<strike>
<strong>
<sub>
<sup>
<tt>
<u>
If you just want to display it in a TextView then simply do something like this:
TextView txtView;
txtView.TextFormatted = Html.FromHtml(HTMLFromDataSource);
If you want to use a different control then there are other ways to achieve this, but the TextView supports HTML to a degree anyway, so if you can use that, I would.
However, it is worth noting that UL and LI don't appear to be supported currently. So you would have to use something like the Html.TagHandler to tell it what to do. Here is a Java implementation:
public class UlTagHandler implements Html.TagHandler {
    @Override
    public void handleTag(boolean opening, String tag, Editable output,
                          XMLReader xmlReader) {
        if (tag.equals("ul") && !opening) output.append("\n");
        if (tag.equals("li") && opening) output.append("\n\t•");
    }
}
textView.setText(Html.fromHtml(myHtmlText, null, new UlTagHandler()));
You should be able to convert that to C# for Xamarin.
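A possible C# port of that handler is sketched below. It assumes the Xamarin.Android bindings expose the interface as Html.ITagHandler and that the class must derive from Java.Lang.Object, which is the usual pattern for implementing Java interfaces in Xamarin. Treat the exact signatures as something to check against your Xamarin.Android version rather than as a definitive implementation.

```csharp
using Android.Text;
using Org.Xml.Sax;

// Sketch of a C# port of the Java UlTagHandler above (assumes Xamarin.Android bindings).
public class UlTagHandler : Java.Lang.Object, Html.ITagHandler
{
    public void HandleTag(bool opening, string tag, IEditable output, IXMLReader xmlReader)
    {
        // Add a line break after the whole list.
        if (tag == "ul" && !opening) output.Append("\n");
        // Start each list item on its own line with a bullet.
        if (tag == "li" && opening) output.Append("\n\t•");
    }
}

// Usage, inside an Activity where txtView has already been looked up:
// txtView.TextFormatted = Html.FromHtml(HTMLFromDataSource, null, new UlTagHandler());
```

The handler is only called for tags the built-in parser does not recognise, so the supported tags listed above keep their default rendering.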

Related

How to find and replace strings containing invalid HTML tags to valid tags

I have a string containing a list of html tags with invalid tag formatting.
For example, I have a string such as that below:
<p>
<strong>Scale:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
 <li>2 to 4 nodes</li>
</ul>
</p>
<p>
<strong>Single Node Data:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
 <li>CPU: 6-26 cores (Intel)</li>
 <li>RAM: 128GB to 2TB</li>
 <li>Raw storage: 240GB to 16TB</li>
 <li>Storage type: SSD + HDD</li>
 <li>Network speed: Up to 25Gb</li>
</ul>
</p><img src="xxxxx"/>
I need to replace self-closing tags (those ending with />) with an explicit closing tag, so that <img src="xxxxx"/> becomes <img src="xxxxx"></img>.
How would I achieve this using C#?
For what you are asking, you can go with either of the following options.
Option 1
You can use a third-party library that parses your HTML into tags (it actually renders it as XML) and separates each tag (and its content) into a string array or list.
Then you loop over the list and check whether each closing tag is correct; if not, replace it with the proper one.
Here is the library
Option 2
You can create your own HTML parser, which gives you more control over the parsing logic. I found this example of a C# HTML parser on CodeProject that you can check out.
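For simple input like the string in the question, a regular expression may be enough. The following is a minimal sketch, not a general HTML parser: SelfClosingTagFixer and Expand are hypothetical names, and the pattern assumes attribute values never contain '<' or '>'.

```csharp
using System.Text.RegularExpressions;

public static class SelfClosingTagFixer
{
    // Group 1: tag name. Group 2: attributes (if any).
    // Assumption: attribute values contain no '<' or '>' characters.
    private static readonly Regex SelfClosing =
        new Regex(@"<(\w+)([^<>]*?)\s*/>", RegexOptions.Compiled);

    // Rewrites <name attrs/> as <name attrs></name>.
    public static string Expand(string html)
    {
        return SelfClosing.Replace(html, "<$1$2></$1>");
    }
}
```

Calling SelfClosingTagFixer.Expand("<img src=\"xxxxx\"/>") returns <img src="xxxxx"></img>. For markup that breaks the assumption above, a real parser as in Option 1 is the safer choice.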

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
    //not sure how to dig further in to get the href values from each of the <a> tags
}
and the sites code looks along the lines of this
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are in the <a> tags nested inside the <div class="acTrigger"> tags. It would be simple if the <a> tags had their own classes, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump, then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract all the URLs from a page, not specific ones. I just need a link to an example or documentation; any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is your <a> tag instead of the div.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
    // Get the value of the HREF attribute.
    string hrefValue = node.GetAttributeValue("href", string.Empty);
    // Then store hrefValue for later.
}
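To make the "store for later" step concrete, here is a small sketch that collects the hrefs into a list. LinkExtractor and GetTriggerLinks are hypothetical names; the sketch loads the markup from a string with LoadHtml instead of HtmlWeb so it can be tried without a network call. It also guards against SelectNodes returning null, which HtmlAgilityPack does when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href of every <a> directly inside <div class="acTrigger">.
    public static List<string> GetTriggerLinks(string pageHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        var links = new List<string>();
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[@class='acTrigger']/a");
        if (anchors == null) return links; // no matching nodes on this page

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);
            if (href.Length > 0) links.Add(href);
        }
        return links;
    }
}
```

The hrefs in the sample markup are relative, so they would still need to be combined with the site's base URL before each page is requested.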

Grouping Results in XPath

Introduction:
Suppose we have HTML like this:
<div class="search-result">
<h2>TV-Series</h2>
<ul>
<li>
<div class="title">
Prison Break : Sequel - First Season
</div>
<span class="subtle count">10 subtitles</span>
</li>
<li>
<div class="title">
Prison Break - Fourth Season
</div>
<span class="subtle count">1232 subtitles</span>
</li>
</ul>
<h2>Popular</h2>
<ul>
<li>
<div class="title">
Prison Break - Fourth Season (2008)
</div>
<div class="subtle count">
1232 subtitles
</div>
</li>
<li>
<div class="title">
Prison Break - Third Season (2007)
</div>
<div class="subtle count">
644 subtitles
</div>
</li>
</ul>
</div>
You can see the original site here: SubScene
I'm writing a C# desktop application that gets information from this site.
Before I learned HTML Agility Pack, I used regular expressions.
With this pattern: <h2>[\s\S]+?</ul> I separated the series groups (such as TV-Series, Popular, ...).
Then with this regular expression pattern: <li>[\s\S]+?(.+)[\s\S]+?class="subtle count"[\s\S]+?(\d*)[\s\S]+?</li> I got categorized information from the site.
Using MatchCollection and groups (defined with parentheses), my regex method returned a two-dimensional list for each series group, where each row is a movie and the columns are: movie name, number of subtitles, and subtitle download link.
That two-dimensional list worked like a small database.
Now I have learned HTML Agility Pack.
Question:
1- How can I create such a list in HTML Agility Pack with XPath?
2- Which XPath can I use to create groups like the regex groups shown above?
Thank you so much.
The comment by Martin Honnen is correct, there isn't really much functionality to provide 'grouping' via XPath. However it is possible to use a loop and run a set of XPaths on sets of elements to extract the data you want.
First, you extract each of the title elements, then extract each of the list items that follow each title, and run one final XPath to pull out the values you want from each one.
Note: This code is written using XPaths against an XDocument instead of with HTML Agility Pack, but the XPath should be the same regardless.
var titleNodes = d.XPathSelectElements("/div[@class='search-result']/h2");
foreach (var titleNode in titleNodes)
{
    // Dump() is a LINQPad helper that prints the value and returns it.
    string title = titleNode.Value.Dump();
    var listItems = titleNode.XPathSelectElements("following-sibling::ul[1]/li");
    foreach (var listItem in listItems)
    {
        var itemData = listItem.XPathEvaluate("div[@class='title']/a/text() | *[@class='subtle count']/text()");
    }
}
Note the use of the XPath | operator in the last expression to select the values of multiple different children in a single XPath call. The values are kind of 'grouped' like you wanted.
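The same loop can be sketched with HTML Agility Pack itself, producing one list of rows per heading, similar to the two-dimensional list described in the question. SearchResultGrouper and Group are hypothetical names, and the sketch assumes the markup shown in the question, where the title text sits directly inside div.title. Each row holds only the title and the subtitle count, since the sample markup contains no download link.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SearchResultGrouper
{
    // Maps each <h2> heading to rows of { title, subtitle count }.
    public static Dictionary<string, List<string[]>> Group(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var groups = new Dictionary<string, List<string[]>>();
        var headings = doc.DocumentNode.SelectNodes("//div[@class='search-result']/h2");
        if (headings == null) return groups; // SelectNodes returns null on no match

        foreach (HtmlNode h2 in headings)
        {
            var rows = new List<string[]>();
            // The first <ul> after this heading holds its items.
            var items = h2.SelectNodes("following-sibling::ul[1]/li");
            if (items != null)
            {
                foreach (HtmlNode li in items)
                {
                    HtmlNode title = li.SelectSingleNode("div[@class='title']");
                    HtmlNode count = li.SelectSingleNode("*[@class='subtle count']");
                    rows.Add(new[]
                    {
                        title?.InnerText.Trim() ?? string.Empty,
                        count?.InnerText.Trim() ?? string.Empty
                    });
                }
            }
            groups[h2.InnerText.Trim()] = rows;
        }
        return groups;
    }
}
```

Running two separate relative XPaths per list item, rather than one union expression, keeps the title and the count in known columns.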

dynamically add html tag in asp.net

I need to add HTML tags dynamically in ASP.NET code-behind. Here is the scenario:
List<Product> products = ProductManager.GetProductList(); // read data from database
Product is a class containing the information I need to show on the website, such as product name, image, and so on.
To add the HTML, I tried this to show 5 products per page:
String finalHtmlCodeContainer = "";
for (int i = 0; i < 5; i++) // 5 iterations, one per product
{
    String myBox = "<div class=\"box\"> </div>";
    finalHtmlCodeContainer += myBox;
}
boxWrapper.InnerHtml = finalHtmlCodeContainer;
boxWrapper is a div that contains the 5 product entries. Up to now everything is OK, but the problem appears when, instead of "<div class=\"box\"> </div>", myBox contains a long block of HTML. The real myBox should include this:
<div class="boxWrapper">
<div class="box">
<div class="rightHeader">rightHeader</div>
<div class="leftHeader">leftHeader</div>
<div class="article">
<img src="../Image/cinemaHeader.jpg" />
<p>
some text here <!--product.content-->
</p>
</div><!--end of article-->
<div class="rightFooter"><span>rightFooter</span>
<div class="info">
<ul>
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>
</div>
</div><!--end of rightFooter-->
<div class="leftFooter"><span>leftFooter</span>
</div><!--end of leftFooter-->
</div><!--end of box-->
</div><!--end of boxWrapper-->
As you can see, if I add the product attributes to the snippet above, it becomes messier, harder to debug, less scalable, and so on. What is your solution?
It depends on the underlying product. If you are using web forms, I'd suggest using a data bound control that supports templating like the ListView or Repeater. It gives you full control over the UI, and also supports reloading the UI in ViewState. In your scenario, you'd have to build the UI on every postback. The ListView or Repeater track it in ViewState.
If you are using MVC, a helper method can make this easier: put the item template inside the helper, then render it for each item in a foreach loop.
If you are talking doing it in JavaScript, then there are templating components for doing the same thing.
I suggest adding a Repeater inside your <p> tag:
<p>
<asp:Repeater runat="server" ID="rpDetail">
<ItemTemplate>
<div class="box">
<%# Eval("YourDataField") %>
</div>
</ItemTemplate>
</asp:Repeater>
</p>
Then set your products list as the Repeater's data source in the code-behind:
rpDetail.DataSource = products ;
rpDetail.DataBind();
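In a Web Forms page, that binding would typically live in Page_Load. The sketch below assumes a hypothetical page class named ProductsPage whose markup declares the rpDetail Repeater shown above, and reuses Product and ProductManager from the question. Binding only on the first request relies on the Repeater restoring its items from ViewState on postbacks, as the earlier answer describes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web.UI;

// Hypothetical code-behind class; rpDetail is declared in the page markup.
public partial class ProductsPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // On postbacks the Repeater rebuilds its items from ViewState.
        if (IsPostBack) return;

        List<Product> products = ProductManager.GetProductList();
        rpDetail.DataSource = products.Take(5).ToList(); // 5 products per page
        rpDetail.DataBind();
    }
}
```

Each Eval("...") expression in the ItemTemplate then names a public property of Product.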

Parsing html for windows 8 metro style application using C#, XAML

My application should parse the html and load the contents into the list box. I am able to get the html via webclient but got stuck parsing it.
I have heard of HtmlAgilityPack and Fizzler but couldn't find any tutorials or examples of their usage.
I want some help in grabbing "first_content" and "second_content" into a list box from the html document shown below.
<html>
<body>
<div>
<section>
<article>
<header>
<hgroup>
<h1>
first_content
</h1>
</hgroup>
</header>
<ul>
<li>
second_content
</li>
</ul>
</article>
</section>
</div>
</body>
</html>
HtmlAgilityPack is the way to go. I've been using it in WCF, Windows Phone, and now WinRT with total success. For a tutorial, check this blog post.
You can use XPath. For example:
var html = "<html><body><div><section><article><header><hgroup><h1>first_content</h1></hgroup></header><ul><li>second_content</li></ul></article> </section></div></body></html>";
var doc = new XmlDocument();
doc.LoadXml(html);
var txt1 = doc.SelectSingleNode("/html/body/div/section/article/header/hgroup/h1").InnerText;
var txt2 = doc.SelectSingleNode("/html/body/div/section/article/ul/li").InnerText;
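XmlDocument.LoadXml only works while the markup is well-formed XML and throws on typical real-world HTML. As a hedged alternative, the sketch below reads the same two values with HtmlAgilityPack's LINQ-style Descendants method. ContentReader and ReadContents are hypothetical names, and the sketch simply takes the first h1 and the first li in the document.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ContentReader
{
    // Returns { text of first <h1>, text of first <li> }, or empty strings if missing.
    public static string[] ReadContents(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        HtmlNode h1 = doc.DocumentNode.Descendants("h1").FirstOrDefault();
        HtmlNode li = doc.DocumentNode.Descendants("li").FirstOrDefault();

        return new[]
        {
            h1?.InnerText.Trim() ?? string.Empty,
            li?.InnerText.Trim() ?? string.Empty
        };
    }
}
```

The returned strings can then be added to the list box's item collection.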
