I have this from an HTML page's source:
<h5 class="icn-venue">Tavernita</h5>
There are, say, 10 values like this between these tags in the page source.
I want to extract the value between the "h5" tags. class="icn-venue" stays the same for all values.
I tried splitting the tag and then storing the pieces, but the code doesn't seem to work.
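You can do it like this using HtmlAgilityPack:
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// XPath addresses attributes with @, so the predicate is [@class='icn-venue']
List<string> lst = doc.DocumentNode.SelectNodes("//h5[@class='icn-venue']")
    .Select(x => x.InnerHtml)
    .ToList();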
HTML Agility Pack is a great tool for manipulating and working with HTML: http://htmlagilitypack.codeplex.com/
It could at least make grabbing the values you need and doing the replaces a little easier.
This question contains links on using the HTML Agility Pack: "How to use HTML Agility pack"
Related
I have the HTML code of a webpage in a text file. I'd like my program to return the value that is in a tag. E.g. I want to get "Julius" out of
<span class="hidden first">Julius</span>
Do I need a regular expression for this? If not, which string function can do it?
You should be using an HTML parser like HtmlAgilityPack. Regex is not a good choice for parsing HTML files, as HTML is neither strict nor regular in its format.
You can use the code below to retrieve it using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects every span whose class attribute is exactly "hidden first"
var itemList = doc.DocumentNode.SelectNodes("//span[@class='hidden first']")
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the text of every span whose class is "hidden first"
I would use the Html Agility Pack to parse the HTML in C#.
I'd strongly recommend you look into something like the HTML Agility Pack
I asked the same question a few days ago and ended up using HTML Agility Pack, but here are the regular expressions that you want:
this one will ignore the attributes
<span[^>]*>(.*?)</span>
this one will consider the attributes
<span class="hidden first"[^>]*>(.*?)</span>
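As a rough illustration of how the second pattern could be applied from C#, here is a minimal sketch that uses only System.Text.RegularExpressions. The class and method names (SpanRegexDemo, ExtractHiddenFirst) are hypothetical, and the sketch assumes the class attribute is written exactly as shown and that the spans are not nested; real-world HTML often breaks those assumptions, which is why the other answers recommend a parser.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    // Returns the inner content of every <span class="hidden first"> element.
    // Assumption: the attribute text matches exactly and spans are not nested.
    public static string[] ExtractHiddenFirst(string html)
    {
        var pattern = new Regex(
            "<span class=\"hidden first\"[^>]*>(.*?)</span>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return pattern.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Groups[1].Value) // group 1 is the captured content
                      .ToArray();
    }

    public static void Main()
    {
        string html = "<p><span class=\"hidden first\">Julius</span> <span>other</span></p>";
        foreach (string value in ExtractHiddenFirst(html))
        {
            Console.WriteLine(value); // prints "Julius"
        }
    }
}
```

The lazy quantifier (.*?) stops at the first closing </span>, so a span nested inside the target would cut the match short.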
I'm trying to figure out how to grab DOM elements from a webpage. Here is the function I'm using:
private void processHTML(String htmlContent)
{
IHTMLDocument2 htmlDocument = (IHTMLDocument2)new mshtml.HTMLDocument();
htmlDocument.write(htmlContent);
IHTMLElementCollection allElements = htmlDocument.all;
webBrowser1.DocumentText = allElements.item("storytext").innerHTML;
textBox2.Text = allElements.item("chap_select").length.ToString();
}
If I set a breakpoint at either of the last two lines and then check the allElements collection, I'm able to find the SELECT element. It correctly shows the ID as being chap_select and the length property shows 13 for the particular document that is being passed. For some reason the length that is being put into the textBox2 field is 2, however.
Any suggestions on what I'm doing wrong here? I've spent several hours trying to figure this out, but have not been able to find any code samples of somebody trying to grab this property of a SELECT.
Instead of using IHTMLDocument2 and mshtml.HTMLDocument I suggest using the much easier to work with HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Something like (untested):
var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// XPath addresses attributes with @; SelectNodes returns null when nothing matches
textBox2.Text = doc.DocumentNode
    .SelectNodes("//select[@id='chap_select']/option").Count.ToString();
I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.
Use the HTML Agility Pack library.
It's a very good library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local, web)");
// Selects every text node under <body>; returns an HtmlNodeCollection
var result = doc.DocumentNode.SelectNodes("//body//text()");
foreach (var node in result)
{
    string text = node.InnerText; // the text you want
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)
You can strip tags using regular expressions such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
// A verbatim string is used because "\<" is not a valid C# escape sequence.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex(@"<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle escape sequences like &amp;.
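To make the point about escape sequences concrete, here is a small sketch that strips tags with the same kind of pattern and then decodes entities with System.Net.WebUtility.HtmlDecode. The names TagStripper and ToPlainText are made up for this example, and the caveats above about scripts and angle brackets inside attributes still apply.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes anything that looks like a tag, then decodes entities such as &amp;.
    // This is a naive sketch: it does not understand scripts, comments or attributes.
    public static string ToPlainText(string html)
    {
        string withoutTags = Regex.Replace(html, "<.+?>", string.Empty, RegexOptions.Singleline);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<b>cake</b> &amp; tea")); // prints "cake & tea"
    }
}
```

Decoding happens after the tags are removed, so an encoded "&lt;b&gt;" in the text is not mistaken for markup.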
What syntax should be used with HTML Agility Pack to extract all <?php tags from a PHP file?
HtmlNodeCollection tags = htmlDoc.DocumentNode.SelectNodes("//??php");
Throws an exception (invalid token).
Tried escaping ? with ?? and \?
Thanks
HTML Agility Pack does choke on nodes with ? in the name. The simplest option is probably to go through the HTML string before you load it into a document object and replace instances of <? with <php, and so on. That doesn't handle nasty cases like a string literal on the page containing "<?", but really, how often does that happen?
Can Html Agility Pack be used to parse an html string fragment?
Such As:
var fragment = "<b>Some code </b>";
Then extract all <b> tags? All the examples I've seen so far load full HTML documents.
If it's html then yes.
string str = "<b>Some code</b>";
// not sure if needed
string html = string.Format("<html><head></head><body>{0}</body></html>", str);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// see an XPath tutorial for how to select elements
// select the first <b> element anywhere in the document
HtmlNode bNode = doc.DocumentNode.SelectSingleNode("//b");
string boldText = bNode.InnerText;
I don't think this is really the best use of HtmlAgilityPack.
Normally I see people trying to parse large amounts of HTML using regular expressions, and I point them towards HtmlAgilityPack, but in this case I think it would be better to use a regex.
Roy Osherove has a blog post describing how you can strip out all the html from a snippet:
http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
Even if you did get the correct XPath with Mika Kolari's sample, it would only work for a snippet with a <b> tag in it and would break if the markup changed.
This answer came up when I searched for the same thing. I don't know if the features have changed since it was answered, but the approach below should be better.
$string = '<b>Some code </b>'
[HtmlAgilityPack.HtmlNode]::CreateNode($string)