Convert HTML text to Plain text - c#

I have a text area.
Users may enter any HTML markup in it.
Now I want to convert that HTML to plain text without using a third-party tool. How can this be done?
Currently I am doing it like this:
var desc = Convert.ToString(Html.Raw(Convert.ToString(drJob["Description"])));
drJob["Description"] is the DataRow column from which I fetch the description, and I want to convert that description to plain text.

There is no built-in way to do this in .NET. You either need to resort to a third-party tool like HtmlAgilityPack, or do this in JavaScript.
document.getElementById('myTextContainer').innerText = document.getElementById('myMarkupContainer').innerText;
For your safety, don't use a regex. ( http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html )

You can replace HTML tags with an empty string using System.Text.RegularExpressions.Regex:
String desc = Regex.Replace(drJob["Description"].ToString(), @"<[^>]*>", String.Empty);

You can simply use a replace method using regex "<[^>]+>"

using System.Text.RegularExpressions;
private void button1_Click(object sender, EventArgs e)
{
    string sauce = htm.Text; // htm = your html box
    // Match runs of text that sit between a tag end (or string start) and the next tag start (or string end)
    Regex myRegex = new Regex(@"(?<=^|>)[^><]+?(?=<|$)", RegexOptions.Compiled);
    foreach (Match iMatch in myRegex.Matches(sauce))
    {
        txt.AppendText(Environment.NewLine + iMatch.Value); // txt = your destination box
    }
}
Let me know if you need more clarification.
[EDIT:] Be aware that this is not a clean function, so add a line to clean up empty spaces or line breaks. But the actual getting of text from in-between tags should work fine. If you want to save space - use regex and see if this works for you. Although the person who posted about regex not being clean is right, there might be other ways; Regex is usually better when separating a single type of tag from html. (I use it for rainmeter to parse stuff and never had any issues)
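For reference, the regex suggestions above can be combined into one small helper that uses only the base class library. This is a minimal sketch, not a robust converter: the names PlainText and FromHtml are made up for illustration, and, as the first answer warns, a regex will mishandle comments, script blocks, and attribute values that contain a > character. It replaces each tag with a space so that adjacent words do not run together, decodes entities, and collapses whitespace.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the original posts.
public static class PlainText
{
    public static string FromHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace every tag with a space so neighbouring words stay separated.
        string noTags = Regex.Replace(html, @"<[^>]*>", " ");

        // Turn entities such as &amp; back into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Calling PlainText.FromHtml(Convert.ToString(drJob["Description"])) would be the drop-in use for the question above. Note that replacing tags with spaces also splits a word that has inline markup in the middle of it.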

Related

RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this as I'm completely stuck. I'm needing this for a C# program so if there is any C# specific notation then that would be great. Thanks
TIA
People will tell you you should not parse HTML with REGEX. And I think it is a valid statement.
But sometimes, with well-formatted HTML and a really simple case like yours seems to be, you can use a regex to do the job.
For example you can use this regex and obtain group 1 for the URL and group 2 for the RecordName
<a class="link" href="([^"]+)">([^<]+)<
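In C#, that pattern could be applied roughly as follows. This is only a sketch under the assumption that every link has exactly the attribute order and quoting shown in the question; the class and method names (LinkExtractor, Extract) are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class LinkExtractor
{
    // Group 1 = URL, group 2 = link name, as described above.
    private static readonly Regex LinkRegex =
        new Regex("<a class=\"link\" href=\"([^\"]+)\">([^<]+)<");

    // Returns one (URL, name) pair per matching link in the HTML source.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return links;
    }
}
```

Each returned pair has the URL in Key and the link name in Value. Any change in attribute order or whitespace in the page source would make the pattern miss the link.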
I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).
You can use the TagRegex and EndTagRegex classes (in the System.Web.RegularExpressions namespace) to parse the HTML string and find the tag you want. You need to iterate through all characters in the HTML string to find the desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.Length)
{
    // Try to match an opening tag at the current position
    var match = tagRegex.Match(html, position);
    if (match.Success)
    {
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a")
        { ... }
    }
    else
    {
        // Otherwise try to match a closing tag at the current position
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a")
            { ... }
        }
    }
    position++;
}

How to Remove all the HTML tags and display a plain text using C#

I want to remove all HTML tags from a string. I can achieve this using a regex,
but if the string contains a number inside angle brackets, such as <100>, that number should not be removed.
var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);
Result: hello there
But the needed output is:
hello there 1234
Your example of HTML isn't valid HTML since it contains a non-HTML tag. I figure you intended for the angle-brackets to be encoded.
I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.
Here's an example:
var withHtml = "<p>hello <b>there<1234></b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.
Not sure you can do this in one regular expression, or that a regex is really the correct way as others have suggested. A simple improvement that gets you almost there is:
Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);
This gives "hello there<1234>". You then just need to replace the remaining angle brackets.
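The two steps can also be folded into a single call by using an alternation and a match evaluator. The sketch below is one possible way to do it, not the approach from the answers above: it treats a bracket pair containing only digits as data and keeps the digits (preceded by a space), and removes every other tag. The names TagStripper and StripKeepingNumbers are made up for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class TagStripper
{
    public static string StripKeepingNumbers(string html)
    {
        // First alternative: <digits> is kept as " digits".
        // Second alternative: any other tag is removed.
        return Regex.Replace(html, @"<(\d+)>|<[^>]*>",
            m => m.Groups[1].Success ? " " + m.Groups[1].Value : string.Empty);
    }
}
```

With the string from the question, TagStripper.StripKeepingNumbers("<p>hello <b>there<1234></b></p>") returns "hello there 1234".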

Regex for Removing Comma between <a> tag text C#

I have the following string. I have tried many regexes to remove the comma from the text inside an a tag, but none of them worked. Whenever the text inside an a tag contains a comma, I want the comma replaced by an empty string.
Getty Center, Restaurant at the
I have tried this regex but it is not working. Here, input is a string that contains HTML.
input = Regex.Replace(input, @"<a(\s+[^>]*)?>[^\w\s]</a(\s+[^>]*)?>", "");
Please help me out. Thank you.
You can use the Regex to find and modify the content of the tag like so.
var input = "Getty Center, Restaurant at the";
var regex = new Regex(@"<a[^>]*>(?<content>.*?)</a[^>]*>",
    RegexOptions.Singleline);
var match = regex.Match(input);
while (match.Success) {
    var group = match.Groups["content"];
    // Rebuild the string with the commas removed from the tag content
    input = input.Substring(0, group.Index)
        + group.Value.Replace(",", "")
        + input.Substring(group.Index + group.Length);
    match = regex.Match(input, group.Index);
}
The loop is in place to catch multiple tags in the same string. The Regex however is fairly naive. It will mess with tags nested inside the A tag, and will parse incorrectly if a > is in any of the attributes. (Though that would probably be bad HTML anyway.) A proper HTML parser is recommended for this reason.
I would suggest using an HTML parser. There are plenty available that are open source and free. One of the best I have found is HtmlAgilityPack.
In a nutshell, the following code snippet will give you all the img tags:
HtmlDocument myDoc = new HtmlDocument();
myDoc.Load(path);
HtmlNodeCollection imgs = myDoc.DocumentNode.SelectNodes("//img");
Hope that helps
If you want to directly use the replace, you will have to match only the comma and not the text before or after the comma. You'd have to use look ahead and look behind to check if the comma is in the tag. Although this is doable, it is not advised to do this.
An alternative is to use matching groups to match the whole text in the tag and group the comma if it exists and replace the match.
<a[^>]+>[\w\s]*(,?)[\w\s]*<\/a>
The first capture group captures the comma if present. You can test it here: http://rubular.com/r/K2jjIaObty
The best option would be to use a html parser to capture contents of the a tag, search for comma and replace.
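If a regex is used anyway, the capture-group idea can be written with a match evaluator so that the whole replacement happens in one call. This is a sketch only: the sample link in the usage note is hypothetical (the markup in the question did not survive), the names AnchorText and RemoveCommas are invented, and the pattern shares the limitations already mentioned (nested tags, a > inside an attribute value).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class AnchorText
{
    // Group 1 = opening <a ...> tag, group 2 = inner text, group 3 = closing tag.
    private static readonly Regex AnchorRegex = new Regex(
        @"(<a\b[^>]*>)(.*?)(</a\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveCommas(string html)
    {
        // Rebuild each anchor with the commas stripped from its inner text only.
        return AnchorRegex.Replace(html, m =>
            m.Groups[1].Value + m.Groups[2].Value.Replace(",", "") + m.Groups[3].Value);
    }
}
```

For an assumed input such as <a href="place.html">Getty Center, Restaurant at the</a>, Los Angeles, only the comma inside the link text is removed and the one after the closing tag is left alone.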

How Can I strip HTML from Text in .NET?

I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a full-text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
EDIT
See my answer.
I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">", "> ");
    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // strip html decoded text from html
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}
Take a look at this Strip HTML tags from a string using regular expressions
Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method
TextReader tr = new StreamReader(@"Filepath");
string str = tr.ReadToEnd();
str = Regex.Replace(str, "<(.|\n)*?>", string.Empty);
But you need to reference the namespace, i.e.:
System.Text.RegularExpressions
only take this logic for your website
If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and stripper I once wrote goes something like this:
public static string StripTags(string value)
{
    if (value == null)
        return string.Empty;
    // replace character entities such as &nbsp; with a space
    string pattern = @"&.{1,8};";
    value = Regex.Replace(value, pattern, " ");
    // remove the tags themselves
    pattern = @"<(.|\n)*?>";
    return Regex.Replace(value, pattern, string.Empty);
}
It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
You could:
Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
Use TinyMCE's built-in configuration options for stripping unwanted HTML.
Use HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.
As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?
You can use HTQL COM, and query the source with a query:
<body> &tx;
You can use something like this
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!
Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian
?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)
I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
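Put together, the corrected named capture could be used like this. It is a minimal sketch that assumes the span appears exactly as in the fragment from the question; the names PeopleCount and Extract are made up for the example, and the method returns null when the span is not found.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class PeopleCount
{
    // Same pattern as the corrected version above, without the unneeded escapes.
    private static readonly Regex CountRegex =
        new Regex("<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)</span>");

    public static string Extract(string responseHtml)
    {
        Match match = CountRegex.Match(responseHtml);
        // The named group holds the text between the opening and closing span tags.
        return match.Success ? match.Groups["TextInsideBrackets"].Value : null;
    }
}
```

For the three-line fragment in the question, PeopleCount.Extract(responseHtml) returns "1001".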
