Parsing HTML for a Windows 8 Metro-style application using C#, XAML - c#

My application should parse the HTML and load the contents into a list box. I am able to get the HTML via WebClient but got stuck parsing it.
I heard of HtmlAgilityPack and Fizzler but couldn't find any tutorials or examples of their usage.
I want some help getting "first_content" and "second_content" into a list box from the HTML document shown below.
<html>
<body>
<div>
<section>
<article>
<header>
<hgroup>
<h1>
first_content
</h1>
</hgroup>
</header>
<ul>
<li>
second_content
</li>
</ul>
</article>
</section>
</div>
</body>
</html>

HtmlAgilityPack is the way to go. I've been using it in WCF, Windows Phone and now WinRT with total success. For a tutorial, check this blog post.

You can use XPath. For example ...
var html = "<html><body><div><section><article><header><hgroup><h1>first_content</h1></hgroup></header><ul><li>second_content</li></ul></article> </section></div></body></html>";
var doc = new XmlDocument();
doc.LoadXml(html);
var txt1 = doc.SelectSingleNode("/html/body/div/section/article/header/hgroup/h1").InnerText;
var txt2 = doc.SelectSingleNode("/html/body/div/section/article/ul/li").InnerText;
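Note that XmlDocument.LoadXml only accepts well-formed XML, so it throws on much real-world HTML. Below is a minimal sketch of the same extraction with HtmlAgilityPack, assuming the HtmlAgilityPack NuGet package is referenced. It uses the LINQ-style Descendants call instead of XPath, so it should also work in builds of the package that lack XPath support. The names ContentExtractor and ExtractItems are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class ContentExtractor // hypothetical helper name
{
    // Returns the trimmed text of every <h1> and <li> element, in document order.
    public static List<string> ExtractItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed XML

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "h1" || n.Name == "li")
            .Select(n => n.InnerText.Trim())
            .ToList();
    }
}
```

The resulting list can then be bound to the list box, for example listBox.ItemsSource = ContentExtractor.ExtractItems(html); where listBox stands for whatever control the page declares.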

Related

How do I download the HTML code of the url with the images NOT being hidden

I am trying to do some web scraping, but when I download the HTML of the URL the images are hidden: the markup contains "user-ad-row__image image image--is-hidden" (with an empty src) instead of the "user-ad-row__image image image--is-visible" that I see in my browser. I tried WebClient as well as HttpClient to see whether that changed anything. I am using HtmlAgilityPack.
var url = "https://www.gumtree.com.au/s-motorcycles-scooters/wa/drz+400/k0c18322l3008845";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlAgilityPack.HtmlDocument();
WebClient client = new WebClient();
htmlDocument.LoadHtml(client.DownloadString(url));
<div class="user-ad-row__main-image-wrapper user-ad-row__main-image-wrapper--has-image"><img class="user-ad-row__image image image--is-hidden" src="" alt="Suzuki Drz-400E"></div>
</div>
<div class="user-ad-row__details">
<div class="user-ad-row__info">
<p class="user-ad-row__title">Suzuki Drz-400E</p>
<div class="user-ad-price user-ad-price--row"><span class="user-ad-price__price">$4,250</span>
<!-- -->
<!-- --><span class="user-ad-price__price-negotiable user-ad-price__price-negotiable--with-price">Negotiable</span>
<!-- -->
<!-- -->
<!-- -->
</div>
<ul class="user-ad-attributes">
<li class="user-ad-attributes__attribute">Learner Approved</li>
<li class="user-ad-attributes__attribute">6000 km</li>
</ul>
<p id="user-ad-desc-MAIN-1228533281" class="user-ad-row__description user-ad-row__description--regular">For sale 2008 Drz-400E excellent condition, well looked after starts first time evertime serviced about a month ago. Just paid 3 months rego. Call or text </p>
</div>
<div class="user-ad-row__extra-info">
<div class="user-ad-row__location"><span class="user-ad-row__location-area">Perth City Area</span>Perth<span class="user-ad-row__distance"> </span></div>
<p class="user-ad-row__age">15/09/2019</p>
</div>
</div>
<button id="" type="button" class="user-ad-row__watchlist-heart-wrapper watchlist-heart Button__buttonBase--3YR6h Button__button--2NsdC Button__buttonBasic--3CSBx" role=""><span class="" aria-hidden="true"><span class="icon-heart heart"></span></span>
</button>
The website you linked loads its images with JavaScript. HtmlAgilityPack only parses the HTML it is given; it does not execute JavaScript, so the downloaded markup still contains the hidden placeholders.
Some solutions would be:
WebBrowser Class
It's kind of tricky to combine with HtmlAgilityPack, but it provides decent performance.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible.
Javascript.Net
It allows you to run scripts using Chrome's V8 JavaScript engine. Near the bottom of the page there will be something like <script src="/latest/resources/react/app.full.something.js"></script>
If you can figure out how that script loads the images, you should be able to get all of them.
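As a rough sketch of the Selenium option, the browser can render the page and hand the resulting markup to HtmlAgilityPack. This assumes the Selenium.WebDriver and HtmlAgilityPack NuGet packages plus a ChromeDriver binary matching the installed Chrome; RenderedPageLoader and LoadRendered are hypothetical names. Whether the images are already present when PageSource is read depends on how the site loads them, so a wait or a scroll may still be needed.

```csharp
using HtmlAgilityPack;        // assumption: HtmlAgilityPack NuGet package
using OpenQA.Selenium.Chrome; // assumption: Selenium.WebDriver package and a matching ChromeDriver

public static class RenderedPageLoader // hypothetical helper name
{
    // Opens the URL in headless Chrome and parses the rendered markup.
    public static HtmlDocument LoadRendered(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // PageSource reflects the DOM as it stands after the scripts run so far;
            // lazily loaded images may require waiting or scrolling first.
            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```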

How to parse HTML to Text with styles

I'm building an app using Xamarin.Android, and I want to convert HTML to normal text with formatting, for example:
HTML Code
<p><strong>Lorem ipsum</strong> is placeholder text <strong><em><span style="color:#ff0000">commonly</span></em></strong> used in the graphic, print, and publishing industries for previewing layouts and visual mockups.
</p>
<p> </p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
Text
Lorem ipsum is placeholder text commonly used in the graphic, print, and publishing industries for previewing layouts and visual mockups.
Item 1
Item 2
Item 3
I get this content from the database and I want to convert post content to Text with formatting.
The TextView currently supports the following HTML tags as listed in this blog post:
<a href="...">
<b>
<big>
<blockquote>
<br>
<cite>
<dfn>
<div align="...">
<em>
<font size="..." color="..." face="...">
<h1>
<h2>
<h3>
<h4>
<h5>
<h6>
<i>
<img src="...">
<p>
<small>
<strike>
<strong>
<sub>
<sup>
<tt>
<u>
If you just want to display it in a TextView then simply do something like this:
TextView txtView;
txtView.TextFormatted = Html.FromHtml(HTMLFromDataSource);
If you want to use a different control then there are other ways to achieve this, but the TextView supports HTML to a degree anyway so if you can use that, I would.
However, it is worth noting that <ul> and <li> do not appear to be supported currently, so you would have to use something like Html.TagHandler to tell it what to do. Here is a Java implementation:
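import android.text.Editable;
import android.text.Html;
import org.xml.sax.XMLReader;
public class UlTagHandler implements Html.TagHandler {
    @Override
    public void handleTag(boolean opening, String tag, Editable output,
                          XMLReader xmlReader) {
        // Add a line break once the list is closed.
        if (tag.equals("ul") && !opening) output.append("\n");
        // Start each list item on a new line with a bullet.
        if (tag.equals("li") && opening) output.append("\n\t•");
    }
}
// Usage: pass the handler as the third argument.
textView.setText(Html.fromHtml(myHtmlText, null, new UlTagHandler()));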
You should be able to convert that to C# for Xamarin.
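A possible C# port for Xamarin.Android is sketched below. It assumes the usual Xamarin bindings, where the Java interface Html.TagHandler is exposed as Html.ITagHandler and XMLReader as Org.Xml.Sax.IXMLReader; treat it as a starting point rather than tested code.

```csharp
using Android.Text;
using Org.Xml.Sax;

// C# port of the Java UlTagHandler. In Xamarin.Android, a class that
// implements a Java interface has to derive from Java.Lang.Object.
public class UlTagHandler : Java.Lang.Object, Html.ITagHandler
{
    public void HandleTag(bool opening, string tag, IEditable output, IXMLReader xmlReader)
    {
        // Add a line break once the list is closed.
        if (tag == "ul" && !opening) output.Append("\n");
        // Start each list item on a new line with a bullet.
        if (tag == "li" && opening) output.Append("\n\t•");
    }
}
```

It would be used like the Java version: txtView.TextFormatted = Html.FromHtml(HTMLFromDataSource, null, new UlTagHandler());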

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
//not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's code looks something like this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are contained in those <a> tags, which are nested inside the <div class="acTrigger"> tags. It would be simple if the <a> tags had classes of their own, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump; then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract all the URLs from a page, not specific ones. I just need a link to an example or documentation. Any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is the <a> tag instead of the div.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
// Get the value of the HREF attribute.
string hrefValue = node.GetAttributeValue( "href", string.Empty );
// Then store hrefValue for later.
}
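To keep the links for the later page visits, they can be collected into a list and resolved against the site's base address, since the hrefs in the sample are relative. The following is only a sketch: LinkCollector, CollectLinks and the base URI passed in are illustrative, and the HtmlAgilityPack NuGet package is assumed. SelectNodes returns null when nothing matches, so the result is checked before looping.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class LinkCollector // hypothetical helper name
{
    // Returns the absolute URL of every <a href> directly inside a div.acTrigger.
    public static List<Uri> CollectLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[@class='acTrigger']/a[@href]");
        if (nodes == null) return links; // no matching anchors on this page

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", string.Empty);
            links.Add(new Uri(baseUri, href)); // resolve relative hrefs
        }
        return links;
    }
}
```

Each stored Uri can then be loaded in turn, for example with web.Load(link.AbsoluteUri).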

Creating a textbox with a new syntax

In my Asp.NET page I have one html editor.
When the user writes the part below and clicks the save button, the text is saved in the database and gets an id number (e.g. Id=12). I then retrieve it on the user-interface side of the website.
<html>
<head></head>
<body>
...
..
.
</body>
</html>
I can get the saved text like below sql statement
SELECT Text FROM StackOverFlow WHERE Id = 12
And then I can show the value in web page.
With this in mind, I want to use the editor to create an ASP.NET textbox.
That is, I want to define a new syntax that lets the user type a simple token in the editor to create an ASP.NET textbox.
Let's assume that syntax is below:
{{inputbox}}
<html>
<head></head>
<body>
<li>
{{inputbox}}
</li>
</body>
</html>
How can I create an ASP.NET textbox using a new syntax like {{inputbox}}?
Can you give me any advice?
I'd try looking at how the Razor view engine works. Or any ASP.net view engine.
I used some replace operations on the HTML.
<html>
<head></head>
<body>
<li>
{{inputbox}}
</li>
</body>
</html>
I find each {{ ...... }} token, replace it, and dynamically create what I want.
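You can try doing it on the client side with jQuery by replacing {{inputbox}} with a text box. Note that replace returns a new string, so its result has to be assigned, and that the browser cannot process <asp:TextBox> server markup, so a plain HTML input is inserted here instead.
var htmlStr = $(this).html();
htmlStr = htmlStr.replace('{{inputbox}}', '<input type="text" name="DynamicName" />');
$(this).html(htmlStr);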
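The replace approach can also be done on the server before the saved text is written to the page. The sketch below swaps every {{inputbox}} token for a plain HTML input with a generated name; PlaceholderRenderer, Render and the inputboxN naming scheme are assumptions for illustration, not part of ASP.NET.

```csharp
using System.Text.RegularExpressions;

public static class PlaceholderRenderer // hypothetical helper name
{
    // Replaces each {{inputbox}} token with an <input> that has a unique name.
    public static string Render(string template)
    {
        int counter = 0;
        return Regex.Replace(template, @"\{\{inputbox\}\}", match =>
        {
            counter++; // inputbox1, inputbox2, ... so posted values can be told apart
            return "<input type=\"text\" name=\"inputbox" + counter + "\" />";
        });
    }
}
```

Calling PlaceholderRenderer.Render on the text loaded from the database yields markup that can be emitted as-is, and the posted values can be read back by their generated names.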

Getting value from string using specific conditions

I have HTML data in a string, from which I need to get only the paragraph values. Below is a sample of the HTML.
<html>
<head>
<title>
<script>
<div>
Some contents
</div>
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
<div>
Other html elements
</div>
How can I get the data from the paragraphs using string manipulation?
Desired Output
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
Give the div an ID, e.g.
<div id="test">
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
then use //div[@id='test']/p.
The solution broken down:
//div - All div elements
[@id='test'] - With an id attribute whose value is test
/p - The p elements that are direct children of that div
I have used Html Agility Pack for something like this; you can then use LINQ to get what you want.
XPath is the obvious answer (if the HTML is decent, has a root element, etc.); failing that, use a third-party component such as Chilkat.
If you use Html Agility Pack as mentioned in the other posts, you can get all paragraph elements in the html by using:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html string"); // LoadHtml parses a string; Load expects a file path or stream
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p");
Since you are using .net Framework 2.0, you would want an older version of Agility Pack, which can be found here: HTML Agility Pack
If you want just the text inside the paragraph, you can use
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p/text()");
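Putting the pieces together, a small helper might look like the sketch below. It assumes the HtmlAgilityPack package and an id on the target div as suggested above; ParagraphReader and ReadParagraphs are made-up names. It sticks to explicit types and a foreach loop rather than LINQ, since the question mentions .NET Framework 2.0.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: an HtmlAgilityPack build matching the target framework

public static class ParagraphReader // hypothetical helper name
{
    // Returns the trimmed text of each <p> directly inside the div with the given id.
    public static List<string> ReadParagraphs(string html, string divId)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> result = new List<string>();
        HtmlNodeCollection pNodes =
            doc.DocumentNode.SelectNodes("//div[@id='" + divId + "']/p");
        if (pNodes == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode p in pNodes)
        {
            result.Add(p.InnerText.Trim());
        }
        return result;
    }
}
```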
