Can I use Html Agility Pack To Parse HTML Fragment? - c#

Can Html Agility Pack be used to parse an HTML string fragment?
Such as:
var fragment = "<b>Some code </b>";
and then extract all <b> tags? All the examples I have seen so far load complete HTML documents.

If it's html then yes.
string str = "<b>Some code</b>";
// not sure if needed
string html = string.Format("<html><head></head><body>{0}</body></html>", str);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// see an XPath tutorial for how to select elements
// select the first <b> element anywhere in the document
HtmlNode bNode = doc.DocumentNode.SelectSingleNode("//b");
string boldText = bNode.InnerText;

I don't think this is really the best use of HtmlAgilityPack.
Normally I see people trying to parse large amounts of HTML using regular expressions, and I point them towards HtmlAgilityPack, but in this case I think it would be better to use a regex.
Roy Osherove has a blog post describing how you can strip out all the html from a snippet:
http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
Even if you did get the correct xpath with Mika Kolari's sample this would only work for a snippet with a <b> tag in it and would break if the code changed.

This answer came up when I searched for the same thing. I don't know if the features have changed since it was answered, but the approach below (PowerShell) should be better.
$string = '<b>Some code </b>'
[HtmlAgilityPack.HtmlNode]::CreateNode($string)
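
For C# callers, the same idea might look like the sketch below. It assumes the HtmlAgilityPack package is referenced; the class name FragmentDemo is just for illustration. It also shows that LoadHtml accepts a bare fragment, so wrapping the string in html and body tags should not be required.

```csharp
using System;
using HtmlAgilityPack;

class FragmentDemo
{
    static void Main()
    {
        string fragment = "<b>Some code </b>";

        // Option 1: build a single node straight from the fragment.
        HtmlNode node = HtmlNode.CreateNode(fragment);
        Console.WriteLine(node.Name);      // element name of the created node
        Console.WriteLine(node.InnerText); // text inside the element

        // Option 2: load the fragment as a document and query it with XPath.
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);
        HtmlNodeCollection bolds = doc.DocumentNode.SelectNodes("//b");
        if (bolds != null) // SelectNodes may return null when nothing matches
        {
            foreach (HtmlNode b in bolds)
                Console.WriteLine(b.OuterHtml);
        }
    }
}
```

Option 2 is probably the better fit when the fragment can contain several top-level elements, since CreateNode only hands back one node.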

Related

Better alternative to getting the 'inner html' of a div?

I have a string:
<div class="className1234"><p>Some html</p></div>
From this string, I would like to get <p>Some html</p>, i.e. I would like to remove the surrounding div tags based on the fact that its class contains 'className'.
What I've Tried
What I've tried works, but it's kludgy, and I know there'll be a better alternative like regex or something. What I currently do is chain a series of Substring(), IndexOf() and Replace() calls to strip out the divs.
EDIT: I've used the phrase 'innerhtml' because I'd like to think there's a library out there somewhere that would allow me to manipulate a string with regard to the tags within it.
PLEASE NOTE: There's no JQuery involved in this. It's all server-side C#.
(See tags)
I would suggest Html Agility Pack. It's designed to allow operations on HTML documents, much like the built-in support for XML in the framework.
It might be overkill, but it will get the work done easily, and you won't have to worry about bad HTML.
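
A minimal sketch of what that could look like for the string in the question, assuming the HtmlAgilityPack package is referenced (the class name InnerHtmlDemo and the fallback to the original string are illustrative choices, not part of the answer above):

```csharp
using System;
using HtmlAgilityPack;

class InnerHtmlDemo
{
    static void Main()
    {
        string input = "<div class=\"className1234\"><p>Some html</p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // Find the first div whose class attribute contains "className".
        HtmlNode div = doc.DocumentNode
            .SelectSingleNode("//div[contains(@class, 'className')]");

        // Fall back to the untouched input when no such div exists.
        string result = div != null ? div.InnerHtml : input;
        Console.WriteLine(result);
    }
}
```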
How about:
XmlDocument doc = new XmlDocument();
doc.LoadXml(divStr);
// classAtr will be null if the root is not a div with a class with the value className1234
XmlNode classAtr = doc.SelectSingleNode("/div/@class[contains(., 'className1234')]");
string result = classAtr != null ? doc.DocumentElement.InnerXml : divStr;
Whenever you need to manipulate HTML, you should use a dedicated HTML parser/DOM library. One library I've found recommended here on StackOverflow for .Net is HTMLAgilityPack.
As others said, HtmlAgilityPack is the best option for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selects. This SelectNodes call will get the matching div elements:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile);
// SelectNodes may return null when nothing matches
var myNodes = doc.DocumentNode.SelectNodes("//div[@class='className1234']");
if (myNodes != null)
{
    foreach (HtmlNode node in myNodes)
    {
        // your code
    }
}

Get the value of an HTML element

I have the HTML code of a webpage in a text file. I'd like my program to return the value that is in a tag. E.g. I want to get "Julius" out of
<span class="hidden first">Julius</span>
Do I need regular expression for this? Otherwise what is a string function that can do it?
You should be using an HTML parser like HtmlAgilityPack. Regex is not a good choice for parsing HTML files, as HTML is neither strict nor regular in its format.
You can use the code below to retrieve it using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var itemList = doc.DocumentNode.SelectNodes("//span[@class='hidden first']") // this XPath selects all span tags whose class is "hidden first"
.Select(p => p.InnerText)
.ToList();
// itemList now contains the content of all span tags whose class is "hidden first"
I would use the Html Agility Pack to parse the HTML in C#.
I'd strongly recommend you look into something like the HTML Agility Pack
I asked the same question a few days ago and ended up using HTML Agility Pack, but here are the regular expressions that you want.
This one will ignore the attributes:
<span[^>]*>(.*?)</span>
This one will consider the attributes:
<span class="hidden first"[^>]*>(.*?)</span>
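
As a rough sketch of how the second pattern could be applied in C# (the class name SpanRegexDemo and the inline sample string are only for illustration; a real page would be read from the text file first):

```csharp
using System;
using System.Text.RegularExpressions;

class SpanRegexDemo
{
    static void Main()
    {
        string html = "<span class=\"hidden first\">Julius</span>";

        // Group 1 lazily captures everything between the opening and closing tag.
        Match m = Regex.Match(html, "<span class=\"hidden first\"[^>]*>(.*?)</span>");

        if (m.Success)
            Console.WriteLine(m.Groups[1].Value); // prints "Julius"
    }
}
```

This only holds up while the markup stays exactly in this shape; nested spans or reordered attributes would need a parser.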

Getting HTML SELECT length property in C#

I'm trying to figure out how to grab DOM elements from a webpage. Here is the function I'm using:
private void processHTML(String htmlContent)
{
IHTMLDocument2 htmlDocument = (IHTMLDocument2)new mshtml.HTMLDocument();
htmlDocument.write(htmlContent);
IHTMLElementCollection allElements = htmlDocument.all;
webBrowser1.DocumentText = allElements.item("storytext").innerHTML;
textBox2.Text = allElements.item("chap_select").length.ToString();
}
If I set a breakpoint at either of the last two lines and then check the allElements collection, I'm able to find the SELECT element. It correctly shows the ID as being chap_select and the length property shows 13 for the particular document that is being passed. For some reason the length that is being put into the textBox2 field is 2, however.
Any suggestions on what I'm doing wrong here? I've spent several hours trying to figure this out, but have not been able to find any code samples of somebody trying to grab this property of a SELECT.
Instead of using IHTMLDocument2 and mshtml.HTMLDocument I suggest using the much easier to work with HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Something like (untested):
var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
textBox2.Text = doc.DocumentNode
.SelectNodes("//select[@id='chap_select']/option").Count().ToString();

Does .NET framework offer methods to parse an HTML string?

Knowing that I can't use HTMLAgilityPack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in the following ways:
find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements
Are there methods available in .net to do so?
HtmlDocument
GetElementById
HtmlElement
You can create a dummy html document.
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();
Output:
2
file:///c:
about:myUrl
Editing elements:
HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
"src=\"c:\"",
"src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Output:
file:///d:
Assuming you're dealing with well formed HTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
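
A small sketch of that approach using only System.Xml, assuming the markup is well-formed enough to load as XML (the class name XmlEditDemo, the sample markup and the attribute values are made up for illustration):

```csharp
using System;
using System.Xml;

class XmlEditDemo
{
    static void Main()
    {
        // Must be well-formed XML: every tag closed, every attribute quoted.
        string html = "<html><body><img id=\"myImage\" src=\"c:\" /></body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(html);

        // Find an element by id with XPath.
        var img = doc.SelectSingleNode("//*[@id='myImage']") as XmlElement;
        if (img != null)
        {
            img.SetAttribute("src", "d:");      // modify an existing attribute
            img.SetAttribute("alt", "example"); // create a new attribute
        }

        Console.WriteLine(doc.OuterXml);
    }
}
```

LoadXml throws an XmlException on typical real-world HTML (unclosed tags, unquoted attributes), so this is only an option when you control the markup.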
Aside from the HTML Agility Pack, and porting HtmlUnit over to C#, what sounds like solid solutions are:
Most obviously - use regex. (System.Text.RegularExpressions)
Using an XML Parser. (because HTML is a system of tags treat it like an XML document?)
Linq?
One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here
Also, here is a post about Linq vs Regex.
You can look at how HTML Agility Pack works, however, it is .Net. You can reflect the assembly and see that it is using the MFC and could be reproduced if you so wanted, but you'd be doing nothing more than moving the assembly, not making it any more .Net.

Inbuilt Regex class or parser: how to extract text between the tags from an HTML file?

I have an HTML file containing a table and other information, which I process in my C#.NET application.
I want to parse the table contents for only some columns. Should I use an HTML parser or the Replace method of Regex in .NET?
And if I use a parser, how do I use it? Will the parser extract the information that is between the tags? If yes, how? If possible, show an example, because I am new to parsers.
If I use the Replace method of the Regex class, how do I pass the file name from which I want to extract the information?
Edit: I want to extract information from the table in the HTML file. How can I use the Html Agility Pack parser for that? What kind of code should I write to use it?
You just asked an almost identical question and deleted it. Here was the answer I gave before:
Try the HTML Agility Pack.
Here's an example:
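HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// select every <a> element that has an href attribute
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att); // FixLink is your own helper that returns the corrected URL
}
doc.Save("file.htm");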
Regarding your extra question regarding regex: do not use Regex to parse HTML. It is not a robust solution. The above library can do a much better job.
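
Since the question is about table columns specifically, here is a hedged sketch of pulling selected cells out of a table with the same library. It assumes the HtmlAgilityPack package is referenced; the class name TableDemo, the inline sample table and the choice of the first and third columns are invented for illustration, and for a file on disk doc.Load("file.htm") would replace the LoadHtml call.

```csharp
using System;
using HtmlAgilityPack;

class TableDemo
{
    static void Main()
    {
        string html =
            "<table>" +
            "<tr><th>Name</th><th>Age</th><th>City</th></tr>" +
            "<tr><td>Ann</td><td>31</td><td>Oslo</td></tr>" +
            "<tr><td>Bob</td><td>45</td><td>Rome</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Every table row in the document; may be null when there are none.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            // Only data cells; the header row has th cells and is skipped.
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue;

            // Keep just the first and third columns.
            Console.WriteLine(cells[0].InnerText.Trim() + " | " + cells[2].InnerText.Trim());
        }
    }
}
```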
HtmlAgilityPack....
Next time, search for an answer first. This is a duplicate for sure.
Little tutorial.
