c# parse html using XPathDocument


I'm trying to parse an HTML page with XPathDocument, but it throws an error because the HTML is not valid XML.
Is there a way to do this?

You should use HtmlAgilityPack. It's still the best!

Use something like Html Agility Pack, which loads your HTML into a DOM object that you can traverse with, for example, XPath queries.
Unless your HTML is in fact XHTML, it is usually not well-formed XML: opening tags often have no matching closing tags.
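To illustrate the point about well-formedness, here is a minimal sketch using only the .NET base class library. The class and method names (XPathDocumentDemo, TryLoad) and the sample markup are made up for this example. It shows XPathDocument rejecting ordinary HTML with an unclosed <br> and accepting the same content written as XHTML.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class XPathDocumentDemo
{
    // Returns true when XPathDocument can load the markup,
    // false when the XML parser rejects it as not well-formed.
    public static bool TryLoad(string markup, out XPathDocument document)
    {
        try
        {
            document = new XPathDocument(new StringReader(markup));
            return true;
        }
        catch (XmlException)
        {
            document = null;
            return false;
        }
    }

    public static void Main()
    {
        // Typical HTML: <br> is never closed, so this is not well-formed XML.
        string html = "<html><body><p>one<br>two</p></body></html>";
        // The same content as XHTML: every element is closed.
        string xhtml = "<html><body><p>one<br/>two</p></body></html>";

        Console.WriteLine(TryLoad(html, out _)); // False

        if (TryLoad(xhtml, out XPathDocument doc))
        {
            XPathNavigator p = doc.CreateNavigator().SelectSingleNode("/html/body/p");
            Console.WriteLine(p.Value); // onetwo
        }
    }
}
```

For real-world pages that are not XHTML, a tolerant parser such as Html Agility Pack is the more practical route, as the answers above suggest.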

Related

Html.Raw in controller

I get content from an editor, so the content includes HTML tags, like this: "dddd"
I need to remove the HTML tags because I write this content to a PDF (generated in a C# controller action) using itextsharp.dll, and iTextSharp outputs the tags as literal text instead of rendering them.
There is no Html.Raw or HtmlHelper.Raw function available in a C# controller action.
What should I do? I tried to remove the HTML tags with a regex, but the content is complex and dynamic, so it contains a great many different tags.

One approach would be to use an HTML parser like the HTML Agility Pack. I've used it successfully for problems like the one you describe (but am otherwise unaffiliated with its development). From the site:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You'll find lots of examples online to tailor to your needs.
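As one such example, here is a minimal sketch of stripping tags with Html Agility Pack before handing the text to a PDF writer. It assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (HtmlText, StripTags) are made up for illustration. Note that InnerText returns all text nodes, so the contents of script or style elements, if present, would need separate handling.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlText
{
    // Returns the text content of an HTML fragment with all tags removed.
    public static string StripTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes of the document;
        // DeEntitize converts entities such as &amp; back to plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

The resulting plain string can then be written to the PDF like any other text.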

You can get the effect of Html.Raw and Json.Encode in a controller like this.
Example:
If I use this in the view:
var attrilist = @Html.Raw(Json.Encode(attriFeildlist));
then I can use this code in the controller as an alternative:
var jsonencode = System.Web.Helpers.Json.Encode(attriFeildlist);
var htmlencode = WebUtility.HtmlEncode(jsonencode);

Get the value of an HTML element

I have the HTML code of a webpage in a text file. I'd like my program to return the value that is in a tag. E.g. I want to get "Julius" out of
<span class="hidden first">Julius</span>
Do I need a regular expression for this? If not, which string function can do it?

You should be using an HTML parser like HtmlAgilityPack. Regex is not a good choice for parsing HTML files, because HTML is neither strict nor regular in its format.
You can use the code below to retrieve the value with HtmlAgilityPack:
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects every span whose class attribute is exactly "hidden first".
// Note: SelectNodes returns null when nothing matches.
var itemList = doc.DocumentNode.SelectNodes("//span[@class='hidden first']")
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the text of every matching span

I would use the Html Agility Pack to parse the HTML in C#.

I'd strongly recommend you look into something like the HTML Agility Pack

I asked the same question a few days ago and ended up using HTML Agility Pack, but here are the regular expressions you want.
This one ignores the attributes:
<span[^>]*>(.*?)</span>
This one takes the attributes into account:
<span class="hidden first"[^>]*>(.*?)</span>
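For completeness, here is a minimal sketch of applying the second pattern from C# with System.Text.RegularExpressions. The class and method names (SpanRegexDemo, FirstSpanText) are made up for illustration, and the approach only holds up while the markup looks exactly like the sample in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    // Returns the text of the first <span class="hidden first"> element,
    // or null when the pattern does not match.
    public static string FirstSpanText(string html)
    {
        Match m = Regex.Match(
            html,
            "<span class=\"hidden first\"[^>]*>(.*?)</span>",
            RegexOptions.Singleline); // let "." span line breaks inside the element

        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<span class=\"hidden first\">Julius</span>";
        Console.WriteLine(FirstSpanText(html)); // Julius
    }
}
```

The pattern breaks as soon as the attribute order, quoting, or nesting changes, which is why the other answers recommend a parser.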

HTML Agility to extract PHP tags

What syntax should be used with HTML Agility Pack to extract all PHP tags from a PHP file?
HtmlNodeCollection tags = htmlDoc.DocumentNode.SelectNodes("//??php");
This throws an exception (invalid token).
I tried escaping ? as ?? and as \?, without success.
Thanks

HTML Agility Pack does choke on nodes with ? in the name. The simplest option is probably to go through the HTML string before you load it into a document object and replace instances of <? with <php, and so on. That doesn't handle nasty cases such as a string literal on the page that contains "<?", but really, how often does that happen?
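The pre-processing step described above could be sketched roughly as follows. The placeholder element name php, the exact replacement strings, and the names PhpTagRewriter and Rewrite are assumptions made for this example and do not come from Html Agility Pack. The sketch also ignores short tags such as <?= and PHP code that itself contains < or >.

```csharp
using System;

public static class PhpTagRewriter
{
    // Rewrites "<?php ... ?>" blocks into "<php> ... </php>" so that the
    // resulting string contains no element names with '?' in them.
    public static string Rewrite(string source)
    {
        return source
            .Replace("<?php", "<php>")  // opening tag becomes a placeholder element
            .Replace("?>", "</php>");   // closing tag becomes its end tag
    }

    public static void Main()
    {
        string page = "<p><?php echo 1; ?></p>";
        Console.WriteLine(Rewrite(page)); // <p><php> echo 1; </php></p>
    }
}
```

The rewritten string can then be passed to HtmlDocument.LoadHtml and queried with SelectNodes("//php").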

How to extract XML from the WebBrowser control?

I want the same as WebBrowser.Document.Body.InnerHtml, but as an XML representation.

Are you using WebBrowser to browse an XML document and want to get to that XML in code, or are you trying to browse to an HTML page and represent HTML as XML?
If the former you can likely just get the raw text from the WebBrowser (maybe InnerText instead of InnerHTML) and parse it as XML.
If the latter, the problem is, HTML isn't XML (unless it's XHTML).
You can convert it to XML with 'tidy' tools, but the accuracy of the result depends on how well-formed the original HTML is.

TidyCOM will clean up HTML to XHTML.
Here's how to use it from C#.

IE's document has an expando property named "XMLDocument". You can access it via its IDispatchEx interface.
You can get the document's COM interface via Document.DomDocument.

XPath + Firebug +XML/HTML +HTML AgilityPack C#

When using Firebug or some of the bookmarklets:
javascript:(function(){var a=document.createElement("script");a.setAttribute("src","http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.js");if(typeof jQuery=="undefined"){document.getElementsByTagName("head")[0].appendChild(a)}(function(){if(typeof jQuery=="undefined"){setTimeout(arguments.callee,100)}else{jQuery("*").one("click",function(d){jQuery(this)[0].scrollIntoView();for(var e="",c=jQuery(this)[0];c&&c.nodeType==1;c=c.parentNode){var b=jQuery(c.parentNode).children(c.tagName).index(c)+1;b>1?(b="["+b+"]"):(b="");e="/"+c.tagName.toLowerCase()+b+e}window.location.hash="#xpath:"+e;prompt('Twoje wyrazenie:',e);d.preventDefault();d.stopPropagation();jQuery("*").unbind("click",arguments.callee)})}})()})();
I get an XPath for the HTML page. In order to parse the HTML via HTML Agility Pack or Sgml, I need to convert it to XHTML (XML).
But the problem is (I think) that the XPath for the XHTML is different from the XPath for the HTML.
That's why Firebug's "XPath Copy" feature doesn't work when its output is used with
HtmlNode valueNode = doc.DocumentNode.SelectSingleNode(Firebugs_XPath);
For example, Firebug or the bookmarklet gives (removing tbody doesn't help):
/html/body/div[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/form/table/tbody/tr[2]/td/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/u
and the XPath that actually works is (give or take):
/html/body/div/table/tr[1]/td[2]/table/tr[1]//td[2]/table[2]/tr[1]//td[2]/table/tr/tr/td[2]/u
My question is: how can I fix this behavior so that a Firebug XPath works with HtmlAgilityPack?
Also, is it possible to use the bookmarklet with the built-in C# WebBrowser component?
I would really appreciate your help.

Firebug's representation of your markup might differ from the actual XHTML because it tries to normalise the markup, and the XPath queries are generated against that normalised version rather than the underlying XHTML. I'm not sure it's possible to change this behaviour; you might just need to tweak the XPaths by hand.
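One common part of that normalisation is that browsers add tbody elements to tables whose source markup has none. As a small, hedged sketch of automating one such hand tweak, the helper below removes tbody steps from a browser-generated XPath. The names XPathTweaks and RemoveTbodySteps are made up for illustration. As the question notes, this alone may not be enough when the element indexes also differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathTweaks
{
    // Removes every "/tbody" step (with an optional [n] index) from an XPath.
    public static string RemoveTbodySteps(string xpath)
    {
        // The lookahead ensures that only whole steps named "tbody" are removed.
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }

    public static void Main()
    {
        string fromBrowser = "/html/body/div[2]/table/tbody/tr/td[2]";
        Console.WriteLine(RemoveTbodySteps(fromBrowser)); // /html/body/div[2]/table/tr/td[2]
    }
}
```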

I had the same issue trying to get the correct XPath from Firebug and from Chrome's and IE's dev tools, so I wrote an application that uses HTML Agility Pack to find the XPath.
http://letschat.info/?p=23
