Strip HTML tags? - c#

How can I strip the HTML tags from this text
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<test@test.com>
</body>
</html>
so that it looks like this:
My First Heading
My first paragraph.
<test@test.com>
Using the function
public static string StripHTML(this string htmlText)
{
var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
return reg.Replace(htmlText, "");
}
I get this instead (the email line is stripped too):
My First Heading
My first paragraph.

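Use Html Agility Pack for these kinds of operations. It copes with real-world HTML more reliably than a regex and supports LINQ. The emas helper below first unwraps email addresses written inside angle brackets so they are not treated as tags.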

static void Main(string[] args)
{
string modified_html = emas(input);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(modified_html);
string test1 = doc.DocumentNode.InnerText;
// Print the text extracted by Html Agility Pack.
Console.WriteLine(test1);
var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
Console.WriteLine(reg.Replace(modified_html , ""));
Console.Read();
}
public static string emas(string text)
{
string stripped = text;
const string MatchEmailPattern =
@"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
+ @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
+ @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
+ @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
int noOfMatches = matches.Count;
// Report on each match.
foreach (Match match in matches)
{
stripped = stripped.Replace("<"+ match.Value + ">" , match.Value);
}
return stripped;
}
static string input = " Your html goes here ";
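If adding Html Agility Pack is not an option, the original regex can be narrowed so that it skips anything that looks like an email address in angle brackets. The following is only a sketch: the class and method names are made up for illustration, and the lookahead is a rough heuristic (non-space text, an @, more non-space text), not an email validator.

```csharp
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Matches a tag unless its entire content looks like an email address,
    // e.g. <test@test.com>. Heuristic only; hypothetical helper.
    private static readonly Regex TagButNotEmail = new Regex(
        @"<(?![^<>@\s]+@[^<>@\s]+>)[^<>]+>",
        RegexOptions.Compiled);

    public static string StripTagsKeepEmails(string html)
    {
        // Remove every matched tag; bracketed email addresses are left alone.
        return TagButNotEmail.Replace(html, string.Empty);
    }
}
```

Calling HtmlText.StripTagsKeepEmails on the sample document removes the html, body, h1 and p tags but leaves <test@test.com> in place. Line breaks between the tags are kept, so you may still want to trim the result.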

Related

Replace part of text with html text not working

I want to replace a matching keyword in a text with the same keyword wrapped in <span></span>.
Example: This is the sample text to be searched
The replaced text should look like
This is the <span class="match">sample</span> text to be searched
I am using the following code, but it's not working:
protected String getTitle(object title)
{
string sTitle = title.ToString();
Regex regex = null;
string pattern = @"(\b(?:" + _Keyword.ToString().Trim() + @")\b)(?![^<]*?>)";
regex = new Regex(pattern);
sTitle = regex.Replace(sTitle, "<span class='keyword-highlight'>" + _Keyword + "</span>");
return sTitle;
}
The above code replaces the whole text with the keyword, not just the matching part.
One reason could be that 'sTitle' holds an unexpected value, for example sTitle = "sample".
If the problem is the letter case of "sample", you can use:
regex = new Regex(pattern, RegexOptions.IgnoreCase);
Replace this:
regex = new Regex(pattern);
With this:
regex = new Regex(pattern,RegexOptions.IgnoreCase);
I ran this code:
string _Keyword = "Sample";
string sTitle = " This is the sample text to be searched";
Regex regex = null;
string pattern = @"(\b(?:" + _Keyword.ToString().Trim() + @")\b)(?![^<]*?>)";
regex = new Regex(pattern,RegexOptions.IgnoreCase);
sTitle = regex.Replace(sTitle, "<span class='keyword-highlight'>" + _Keyword + "</span>");
and result is:
This is the <span class='keyword-highlight'>sample</span> text to be searched
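Two details in the code above can bite: the keyword is concatenated into the pattern unescaped, and the replacement inserts the keyword as typed rather than the text that was actually matched. A minimal sketch of one way to handle both, assuming a hypothetical KeywordHighlighter class (not part of the original code), uses Regex.Escape and the $1 substitution:

```csharp
using System.Text.RegularExpressions;

public static class KeywordHighlighter
{
    public static string Highlight(string text, string keyword)
    {
        // Regex.Escape keeps characters such as '.' or '(' in the keyword literal.
        string pattern = @"\b(" + Regex.Escape(keyword.Trim()) + @")\b(?![^<]*?>)";

        // $1 re-inserts the text exactly as it appeared in the input,
        // so the original letter case is preserved.
        return Regex.Replace(
            text,
            pattern,
            "<span class='keyword-highlight'>$1</span>",
            RegexOptions.IgnoreCase);
    }
}
```

With the keyword "Sample" and the sentence above, the output wraps the lowercase "sample" found in the text. Note that \b only helps when the keyword starts and ends with a word character.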

How can i remove HTML Tags from String by REGEX?

I am fetching data from MySQL, but the issue is that HTML tags, i.e.
<p>LARGE</p><p>Lamb;<br>;li;ul;
are also being fetched with my data. I just need "LARGE" and "Lamb" from the line above. How can I separate/remove HTML tags from a string?
I am going to assume that the HTML is intact, perhaps something like the following:
<ul><li><p>LARGE</p><p>Lamb<br></li></ul>
In which case, I would use HtmlAgilityPack to get the content without having to resort to regex.
var html = "<ul><li><p>LARGE</p><p>Lamb</p><br></li></ul> ";
var hap = new HtmlDocument();
hap.LoadHtml(html);
string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
// text is now "LARGELamb "
string[] lines = hap.DocumentNode.SelectNodes("//text()")
.Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
// lines is { "LARGE", "Lamb", " " }
Assuming that you fix your HTML elements first:
static void Main(string[] args)
{
string html = WebUtility.HtmlDecode("<p>LARGE</p><p>Lamb</p>");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
List<HtmlNode> spanNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "p").ToList();
foreach (HtmlNode node in spanNodes)
{
Console.WriteLine(node.InnerHtml);
}
}
You need to use HTML Agility Pack. You can add a reference like this:
Install-Package HtmlAgilityPack
Try this:
// erase html tags from a string
public static string StripHtml(string target)
{
//Regular expression for html tags
Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
return StripHTMLExpression.Replace(target, string.Empty);
}
Call it like this:
string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
Assuming that the original string is always going to be in that specific format, and that you cannot add the HTMLAgilityPack, here is a quick and dirty way of getting what you want:
static void Main(string[] args)
{
// Split original string on the 'separator' string.
string originalString = "<p>LARGE</p><p>Lamb;<br>;li;ul; ";
string[] sSeparator = new string[] { "</p><p>" };
string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);
// Prepare to filter the 'prefix' and 'postscript' strings
string prefix = "<p>";
string postfix = ";<br>;li;ul; ";
int prefixLength = prefix.Length;
int postfixLength = postfix.Length;
// Iterate over the split string and clean up
string s = string.Empty;
for (int i = 0; i < splitString.Length; i++)
{
s = splitString[i];
if (s.Contains(prefix))
{
s = s.Remove(s.IndexOf(prefix), prefixLength);
}
if (s.Contains(postfix))
{
s = s.Remove(s.IndexOf(postfix), postfixLength);
}
splitString[i] = s;
Console.WriteLine(splitString[i]);
}
Console.ReadLine();
}
// Convert &lt; &gt; etc. to HTML
String sResult = HttpUtility.HtmlDecode(sData);
// Remove HTML tags delimited by <>
String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);

Remove HTML tags from string including &nbsp in C#

How can I remove all the HTML tags, including &nbsp, using regex in C#? My string looks like
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
Ideally, you should make another pass through a regex filter that collapses multiple spaces:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.
public static string ScrubHtml(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
var step2 = Regex.Replace(step1, @"\s{2,}", " ");
return step2;
}
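Listing entities one by one in the pattern only covers the ones you thought of. As a hedged alternative sketch (the HtmlScrubber class and ToPlainText name are invented here), the tags can be removed first and the remaining entities decoded with WebUtility.HtmlDecode; in .NET, \s also matches the non-breaking space that &nbsp; decodes to, so the final whitespace pass cleans it up.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlScrubber
{
    public static string ToPlainText(string html)
    {
        // Replace each tag with a space so adjacent blocks do not run together.
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");

        // Decode entities such as &nbsp; and &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse every run of whitespace (including non-breaking spaces)
        // into a single ordinary space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Like the regex answers above, this does not handle script or style blocks, comments, or a literal > inside an attribute value.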
I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.
private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);
//add characters that should not be removed to this regex
private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);
public static String UnHtml(String html)
{
html = HttpUtility.UrlDecode(html);
html = HttpUtility.HtmlDecode(html);
html = RemoveTag(html, "<!--", "-->");
html = RemoveTag(html, "<script", "</script>");
html = RemoveTag(html, "<style", "</style>");
//replace matches of these regexes with space
html = _tags_.Replace(html, " ");
html = _notOkCharacter_.Replace(html, " ");
html = SingleSpacedTrim(html);
return html;
}
private static String RemoveTag(String html, String startTag, String endTag)
{
Boolean bAgain;
do
{
bAgain = false;
Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
if (startTagPos < 0)
continue;
Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
if (endTagPos <= startTagPos)
continue;
html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
bAgain = true;
} while (bAgain);
return html;
}
private static String SingleSpacedTrim(String inString)
{
StringBuilder sb = new StringBuilder();
Boolean inBlanks = false;
foreach (Char c in inString)
{
switch (c)
{
case '\r':
case '\n':
case '\t':
case ' ':
if (!inBlanks)
{
inBlanks = true;
sb.Append(' ');
}
continue;
default:
inBlanks = false;
sb.Append(c);
break;
}
}
return sb.ToString().Trim();
}
var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their versions replace &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.
public static string FormatString(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
var step2 = Regex.Replace(step1, @"&nbsp;", " ");
var step3 = Regex.Replace(step2, @"\s{2,}", " ");
return step3;
}
I wrote &nbsp without the semicolon in the text above because Stack Overflow was formatting it.
This:
(<.+?>|&nbsp;)
will match any tag or &nbsp;
string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();
Then x = hello
Sanitizing an HTML document involves a lot of tricky things. This package may be of help:
https://github.com/mganss/HtmlSanitizer
Well-formed HTML (XHTML) is essentially XML. If your markup is well-formed, you could parse your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also decodes special characters like &lt; in one go. It will fail on markup that is not valid XML, such as an unclosed <br>.
I'm using this syntax to remove HTML tags and &nbsp; (JavaScript):
SessionTitle: result[i].sessionTitle.replace(/<[^>]+>|&nbsp;/g, '')
(<([^>]+)>|&nbsp;)
You can test it here:
https://regex101.com/r/kB0rQ4/1

Regex Replace on a JSON structure

I am currently trying to do a Regex Replace on a JSON string that looks like:
String input = "{\"`####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\",\"Answer_Options2\": \"not a monkey\"}";
The goal is to find and replace all the value fields whose key field starts with `####`
I currently have this:
static Regex _FieldRegex = new Regex(@"`####`\w+" + ".:.\"(.*)\",");
static public string MatchKey(string input)
{
MatchCollection match = _FieldRegex.Matches(input.ToLower());
string match2 = "";
foreach (Match k in match )
{
foreach (Capture cap in k.Captures)
{
Console.WriteLine("" + cap.Value);
match2 = Regex.Replace(input.ToLower(), cap.Value.ToString(), #"CAKE");
}
}
return match2.ToString();
}
Now this isn't working. I guess that is because it picks up the entire `####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\", as a single match and replaces it. I want to replace only match.Groups[1], as you would for a single match on the string.
At the end of the day the JSON string needs to look something like this:
String input = "{\"`####`Answer_Options11\": \"CATS AND CAKE\",\"`####`Answer_Options\": \"CAKE WAS A LIE\",\"Answer_Options2\": \"not a monkey\"}";
Any idea how to do this?
You want a positive lookahead and a positive lookbehind:
(?<=####.+?:).*?(?=,)
The lookaheads and lookbehinds verify that the surrounding text matches those patterns, but do not include it in the match. This site explains the concept pretty well.
Generated code from RegexHero.com :
string strRegex = @"(?<=####.+?:).*?(?=,)";
Regex myRegex = new Regex(strRegex);
string strTargetString = @" ""{\""`####`Answer_Options11\"": \""monkey22\"",\""`####`Answer_Options\"": \""monkey\"",\""Answer_Options2\"": \""not a monkey\""}""";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}
This will match "monkey22" and "monkey" but not "not a monkey".
Working from @Jonesy's answer, I arrived at the following, which does what I wanted. It includes the .Replace on the groups that I required. The lookaheads and lookbehinds were very interesting, but I needed to replace some of those values, hence the groups.
static public string MatchKey(string input)
{
string strRegex = @"(__u__)(.+?:\s*)""(.*)""(,|})*";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
IQS_Encryption.Encryption enc = new Encryption();
int count = 1;
string addedJson = "";
int matchCount = 0;
foreach (Match myMatch in myRegex.Matches(input))
{
if (myMatch.Success)
{
//Console.WriteLine("REGEX MYMATCH: " + myMatch.Value);
input = input.Replace(myMatch.Value, "__e__" + myMatch.Groups[2].Value + "\"c" + count + "\"" + myMatch.Groups[4].Value);
addedJson += "c"+count + "{" +enc.EncryptString(myMatch.Groups[3].Value, Encoding.UTF8.GetBytes("12345678912365478912365478965412"))+"},";
}
count++;
matchCount++;
}
Console.WriteLine("MAC" + matchCount);
return input + addedJson;
}
Thanks again to @Jonesy for the huge help.
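For the original goal (replace only the value of each marked key), Regex.Replace with a match evaluator lets you rebuild each match from its groups instead of replacing the whole matched text. The sketch below is an assumption-laden illustration: JsonFieldReplacer and ReplaceMarkedValues are hypothetical names, the pattern expects the `####` marker right after the opening quote of the key, and values containing escaped quotes are not handled. A real JSON parser would be more robust.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsonFieldReplacer
{
    // Group 1: the marked key, the colon, and the opening quote of the value.
    // Group 2: the value itself (escaped quotes inside values are not supported).
    private static readonly Regex MarkedField = new Regex(
        @"(""`####`\w+""\s*:\s*"")([^""]*)",
        RegexOptions.Compiled);

    public static string ReplaceMarkedValues(string json, Func<string, string> transform)
    {
        // Keep group 1 untouched and substitute only the captured value.
        return MarkedField.Replace(
            json,
            m => m.Groups[1].Value + transform(m.Groups[2].Value));
    }
}
```

Passing v => "CAKE" as the transform changes "monkey22" and "monkey" to "CAKE" and leaves "not a monkey" alone, because its key does not carry the marker.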

how can I get a simpler data

I have a MatchCollection and need the value of group index 1 from each match.
Right now I extract it through a long chain of casts, which I would like to avoid.
Example: startTag = <a>, endTag = </a>
Html = <a>texttexttext</a>
I need to get "texttexttext" without the <a> and </a>.
var regex = new Regex(startTag + "(.*?)" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (var item in matchCollection)
{
string temp = ((Match)(((Group)(item)).Captures.SyncRoot)).Groups[1].Value;
}
I would recommend using Html Agility Pack to parse HTML instead of regex, for various reasons.
So to apply it to your example with finding all anchor text inside an HTML document:
using System;
using System.Net;
using HtmlAgilityPack;
class Program
{
static void Main()
{
string html = "";
using (var client = new WebClient())
{
html = client.DownloadString("http://stackoverflow.com");
}
var doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a"))
{
// Will print all text contained inside all anchors
// on http://stackoverflow.com
Console.WriteLine(link.InnerText);
}
}
}
You could use a capture group. You might also want to use a named group. Notice the parentheses I added to regex.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "((.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
var data = item.Groups[1];
Console.WriteLine(data);
}
This is even a little nicer, because a named group is a little easier to grab.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "(?<txt>(.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
var data = item.Groups["txt"];
Console.WriteLine(data);
}
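If the goal is simply a list of the captured strings, the loop can be folded into a LINQ query. This is a sketch under a few assumptions: the TagText class and Between method are invented for illustration, the tags are passed through Regex.Escape in case they contain regex metacharacters, and Singleline is added so the text may span line breaks.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagText
{
    public static List<string> Between(string html, string startTag, string endTag)
    {
        var regex = new Regex(
            Regex.Escape(startTag) + "(.*?)" + Regex.Escape(endTag),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Cast<Match>() gives typed access to each match, so no manual casts
        // are needed to reach Groups[1].
        return regex.Matches(html)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

For the html value used in the answers above, TagText.Between(html, "<a>", "</a>") yields the two strings "xx yyy" and "bbb cccc".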
