I get content from an editor, so the content includes HTML tags, like this: "dddd"
I must remove the HTML tags from the content because I write it to a PDF (generated in a C# controller action) using itextsharp.dll. iTextSharp does not render the HTML tags; it prints them as literal text, as you can see in the screenshot below.
There is no Html.Raw or HtmlHelper.Raw function available in a C# controller action.
What should I do? I tried to remove the HTML tags with a regex, but the content is complex and dynamic, so there are a great many different tags.
One approach would be to use an HTML parser like the HTML Agility Pack. I've used this successfully for problems like the one you describe (but am otherwise unaffiliated with its development). From the site:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You'll find lots of examples online to tailor to your needs.
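For the PDF case in the question, a minimal sketch of this approach might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HtmlToText and the method ToPlainText are hypothetical names used only for illustration.

```csharp
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper: returns the text of an HTML fragment with all tags removed.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                        // tolerant of fragments and malformed markup
        string text = doc.DocumentNode.InnerText;  // concatenated text of the nodes, tags dropped
        return HtmlEntity.DeEntitize(text);        // turn entities such as &amp; back into characters
    }
}
```

The returned string can then be passed to iTextSharp as ordinary text. If the editor can emit script or style elements, their contents may also show up in InnerText, so it is probably worth removing those nodes first.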
You can do the equivalent of Html.Raw and Json.Encode in a controller like this.
Example
If I use this in a view:
var attrilist = @Html.Raw(Json.Encode(attriFeildlist));
then I can use this code in the controller as an alternative:
var jsonencode = System.Web.Helpers.Json.Encode(attriFeildlist);
var htmlencode= WebUtility.HtmlEncode(jsonencode);
Related
I have an html code.
I parse it with such regex
MatchCollection matches = Regex.Matches(go, @"photoWrapper""><div><a href=""(?<id>[^""]+?)\?");
I receive:
matches[0].Groups["id"].Value = "/group/47502002094086";
matches[1].Groups["id"].Value = "/dk";
matches[2].Groups["id"].Value = "/prostooglavnom";
How can I change my regex, or what can I add, so that the matches contain only
matches[0].Groups["id"].Value = "47502002094086";
matches[1].Groups["id"].Value = "prostooglavnom";
Any help?
Full html code : http://pastebin.com/xEJNiD4G
You have just discovered for yourself why Regex is a poor choice for parsing HTML.
I suggest you use the HTML Agility Pack to parse and query your HTML.
The source download comes with many example projects.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I have the HTML code of a webpage in a text file. I'd like my program to return the value that is in a tag. E.g. I want to get "Julius" out of
<span class="hidden first">Julius</span>
Do I need a regular expression for this? If not, which string function can do it?
You should be using an HTML parser like HtmlAgilityPack. Regex is not a good choice for parsing HTML files, as HTML is neither strict nor regular in its format.
You can use the code below to retrieve it using HtmlAgilityPack
// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects every span whose class attribute is exactly "hidden first".
// Note: SelectNodes returns null when nothing matches, so check for that in real code.
var itemList = doc.DocumentNode.SelectNodes("//span[@class='hidden first']")
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the text of every matching span.
I would use the Html Agility Pack to parse the HTML in C#.
I'd strongly recommend you look into something like the HTML Agility Pack
I asked the same question a few days ago and ended up using the HTML Agility Pack, but here are the regular expressions that you want
this one will ignore the attributes
<span[^>]*>(.*?)</span>
this one will consider the attributes
<span class="hidden first"[^>]*>(.*?)</span>
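As a rough sketch of how the second pattern could be applied, the snippet below reads the captured group from the first match. The class and method names are hypothetical, and the pattern only works while the class attribute appears exactly as written, which is one reason a parser is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    // Hypothetical helper: returns the text of the first matching span, or null if none matches.
    public static string FirstHiddenSpanText(string html)
    {
        // In a verbatim string, "" stands for a single double quote.
        Match m = Regex.Match(html, @"<span class=""hidden first""[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling SpanRegexDemo.FirstHiddenSpanText on the span from the question should yield the name between the tags.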
I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)
Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It is an imperfect science, but HtmlAgilityPack does it very well.
An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPATH. Additionally, its HTML parser is designed specifically with a variety of purposes in mind and there are several options for parsing HTML: as a full document (missing html, body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally CsQuery indexes documents on class, id, attribute, and tag - making selectors extremely fast compared to Html Agility Pack.
I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.
Use the HTML Agility Pack library.
That's a very fine library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local or web)");
var result = doc.DocumentNode.SelectNodes("//body//text()"); // returns an HtmlNodeCollection
foreach (var node in result)
{
    string achievedText = node.InnerText; // the text you want
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)
You can strip tags using a regular expression such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex(@"<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle escape sequences like &amp;.
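To illustrate the footnote about escape sequences, here is a minimal sketch that strips tags with the same simple pattern and then decodes entities using System.Net.WebUtility.HtmlDecode. The class and method names are hypothetical, and the caveats about JavaScript and angle brackets inside attributes still apply.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes anything that looks like a tag, then decodes entities.
    public static string StripTags(string html)
    {
        // Singleline lets '.' match newlines, so tags that span several lines are removed too.
        string withoutTags = Regex.Replace(html, @"<.+?>", string.Empty, RegexOptions.Singleline);
        // Decode only after stripping, so that &lt;b&gt; in the text is not mistaken for a tag.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Decoding after stripping is a deliberate ordering choice: doing it the other way round would turn escaped angle brackets into something the pattern then deletes.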
I'm trying to parse an HTML page with XPathDocument, but it gives an error because the HTML is not XML.
Is there a way to do this or not?
You should use HtmlAgilityPack. Still the best!
Use something like the Html Agility Pack, which can load your HTML into a DOM object that can be traversed with, for example, XPath queries.
Unless your HTML is in fact XHTML, it is usually not a valid XML structure with correctly matched opening and closing tags.
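The point about XML validity can be checked with a small sketch like the one below, which tries to load markup with System.Xml.Linq and reports whether it parsed. The class and method names are hypothetical; an unclosed br tag, which is fine in HTML, is enough to make the XML parser throw.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlCheck
{
    // Hypothetical helper: true if the markup is well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for mismatched or unclosed tags, among other problems.
            return false;
        }
    }
}
```

If this returns true for a page, the System.Xml classes can be used on it directly; otherwise an HTML parser is the more practical route.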