Check if HtmlString is whitespace in C# - c#

I've got a wrapper that adds a header to a field whenever it has a value. The field is actually a string which holds HTML from a tinymce textbox.
Requirement: the header should not display when the field is empty or just whitespace.
Issue: whitespace in html is rendered as <p> </p>, so technically it's not an empty or whitespace value
I simply can't !String.IsNullOrWhiteSpace(Model.ContentField.Value) because it does have a value, albeit whitespace html.
I've tried to convert the value onto #Html.Raw(Model.ContentField.Value) but it's of a type HtmlString, so I can't use String.IsNullOrWhiteSpace.
Any ideas? Thanks!

You can use HtmlAgilityPack, something like this:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(Model.ContentField.Value);
string textValue = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
bool isEmpty = String.IsNullOrWhiteSpace(textValue);

What I eventually did (because I didn't want to add a 3rd party library just for this), is to add a function in a helper class that strips HTML tags:
const string HTML_TAG_PATTERN = "<.*?>";
public static string StripHTML(string inputString)
{
return Regex.Replace
(inputString, HTML_TAG_PATTERN, string.Empty);
}
After which, I combined that with HttpUtility.HtmlDecode to get the inner value:
var innerContent = StringHelper.StripHTML(HttpUtility.HtmlDecode(Model.ContentField.Value));
That variable is what I used to compare. Let me know if this is a bad idea.
Thanks!

I had similar question as your title, but in my case the html string was empty. So I ended up doing the following:
HtmlString someString = new HtmlString("");
string.IsNullOrEmpty(someString.ToString());
Might be obvious, but didn't realize it at first.

Related

how to Convert HTML characters like #amp; to their Proper Form in C#

How to convert these characters to plain text?
â„¢,  ®, â„¢, ® and —
this problem occurs when I get a text from the website during scraping and store it into the database.
But it adds special characters and & like character.
I want to remove these all.
you can use this:
Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(myvalue));
try this:
public static string RemoveUTFCharactes(this string input)
{
string output = string.Empty;
if (!string.IsNullOrEmpty(input))
{
byte[] data = System.Text.Encoding.Default.GetBytes(input);
output = System.Text.Encoding.UTF8.GetString(data);
}
return output;
}
The short solution of your question is below
if you have limited symbols you can use the Replace method in C# language, like this
string symbol="this is the book #amp; laptop";
string formattedterm = symbol.Replace("#amp;","&");

Is there any C# attributes that prevents character conversion when a data member is serialized?

I have an object that contains a string property, which returns a string that contains HTML tags:
[DataMember]
public string SomeProperty
{
return "<HTMLTag>";
}
When this object is serialized by DataContractJsonSerializer.WriteObject(), the '<' and '>' characters are converted to "<" and ">". Is there any attribute (or something else) that can prevent this from happening? I know WriteRaw will do the trick, but I can't change the DataContractJsonSerializer.WriteObject() is beyond my control. Thanks.
There is a similar problem considered. Mentioning this, you can encode your HTML into different data, transfer and then decode.
By using
string noEncoding = new XElement("foo", new XCData("a < b")).ToString();
This will produce
<foo><![CDATA[a < b]]></foo>

C#: Correct Encoding of escaped Unicode-Characters in a local string (i.e.: not \\u20ac)

C#: I have a string that comes from a webpage's sourcecode:
<script type="text/javascript">
var itemsLocalDeals = [{"category":"HEALTHCARE SERVICES",
"dealPermaLink":"/deals/aachen/NLP-Deutschlandde
5510969","dealPrice":"399,00 \u20ac",..........
I do some things with that string, such as extracting the dealPrice and add it to a List<> (in the whole string are more than one dealPrice).
Is there a method to decode all "\u20ac" to their real character ("€")?
There are also other characters, so not only the €-Character has to be decoded.
When I debug my code and look at the local fields/variables the string contains not the "€"-Character but the escaped sequence "\\u20ac".
Something like myString.DecodeUnicodeToRealCharacters.
I'm writing the result to a (UTF-8)result.txt
Thanks alot!
P.S.: Unfortunately .Net 2.0 only...
You can use Regex.Unescape("\u20ac");
But better use a json parser since your string seems to be a json string(starting with [{"category":"HEALTHCARE SERVICES",.....)
public string DecodeUnicodeToRealCharacters(string s)
{
return Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(s));
}
Can you please show the code you're using to write the text?
this one works just fine:
string str = "\u20ac";
using (StreamWriter sw = new StreamWriter(#"C:\trythis.txt", false, Encoding.UTF8)){
sw.Write(str);
}

Replace with .Replace/.Regex

I am using Html.Raw(Html.Encode()) to allow some of html to be allowed. For example I want bold, italic, code etc... I am not sure it's the right method, code seems pretty ugly.
Input
Hello, this text will be [b]bold[/b]. [code]alert("Test...")[/code]
Output
Code
#Html.Raw(Html.Encode(Model.Body)
.Replace(Environment.NewLine, "<br />")
.Replace("[b]", "<b>")
.Replace("[/b]", "</b>")
.Replace("[code]", "<div class='codeContainer'><pre name='code' class='javascript'>")
.Replace("[/code]", "</pre></div>"))
My Solution
I want to make it all a bit different. Instead of using BB-Tags I want to use simpler tags.For example * will stand for bold. That means if I input This text is *bold*. it will replace text to This text is <b>bold</b>.. Kinda like this website is using BTW.
Problem
To implement this I need some Regex and I have little to no experience with it. I've searched many sites, but no luck.
My implementation of it looks something like this, but it fails since I can't really replace a char with string.
static void Main(string[] args)
{
string myString = "Hello, this text is *bold*, this text is also *bold*. And this is code: ~MYCODE~";
string findString = "\\*";
int firstMatch, nextMatch;
Match match = Regex.Match(myString, findString);
while (match.Success == true)
{
Console.WriteLine(match.Index);
firstMatch = match.Index;
match = match.NextMatch();
if (match.Success == true)
{
nextMatch = match.Index;
myString = myString[firstMatch] = "<b>"; // Ouch!
}
}
Console.ReadLine();
}
To implement this I need some Regex
Ah no, you don't need Regex. Manipulating HTML with Regex could lead to some undesired effects. So you could simply use MarkDownSharp which by the way is what this site uses to safely render Markdown markup into HTML.
Like this:
var markdown = new Markdown();
string html = markdown.Transform(SomeTextContainingMarkDown);
Of course to polish this you would write an HTML helper so that in your view:
#Html.Markdown(Model.Body)

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
Go download HTMLAgilityPack, now! ;) Download LInk
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, among better performance the ability to replace named and numbered HTML entities (those like &amp; and &203;) and comment blocks replacement and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NETs XML reader is an option. This can fail on well formatted HTML though so always add a catch with regx as a backup. Note this is NOT fast, but it does provide a nice opportunity for old school step through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(#"<[^>]*(>|$)");
Regex compressSpaces = new Regex(#"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}
string result = Regex.Replace(anytext, #"<(.|\n)*?>", string.Empty);
I've looked at the Regex based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break, let alone mal-formmed HTML from the wild. And what about entities like &? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an html fragment. Also decodes HTML entities like &. Returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability & how to prevent it - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes > etc).
For those who are complining about Michael Tiptop's solution not working, here is the .Net4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter,i.e. keep some tags, you may need some code like this by using HTMLagilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
Simply use string.StripHTML();

Categories

Resources