How to parse my string to XML with &nbsp elements? - c#

I have a string s that looks like this:
<root><p>hello world</p>&nbsp;my name is!</root>
I have the following code:
try
{
    m_Content = XDocument.Load(new StringReader(s));
}
catch (XmlException ex)
{
    ex.Data["myerror"] = s;
    throw;
}
As you can see, I want to load the string with all its elements, including entities like &nbsp;, and then display it. But I get an XmlException:
Reference to undeclared entity 'nbsp'.
Any ideas how to do this correctly?
Added
ChrisShao offered a good idea: put my string in a <![CDATA[ ]]> section, but unfortunately it doesn't solve my problem. I have a big string with lots of tags and a few large texts that can contain &nbsp; entities. If I use System.Web.HttpUtility.HtmlDecode, I lose all these entities and get " " fields instead.

Responding to your Added section: the blank (" ") fields you get are the correct representation of &nbsp; when it is rendered. The correct encoding of &nbsp; for use in XML is &#160; [Reference].
If you really want to see &nbsp; instead of " " once the string is loaded into the XDocument, encode the ampersand character (&) as &amp;, that is, replace &nbsp; with &amp;nbsp; [Reference].
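The first suggestion above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the class name NbspXml and the method ParseWithNbsp are hypothetical, and the sketch assumes &nbsp; is the only HTML-specific entity in the input.

```csharp
using System;
using System.Xml.Linq;

public static class NbspXml
{
    // Hypothetical helper: swaps the HTML-only entity &nbsp; for its numeric
    // character reference &#160;, which an XML parser accepts without a DTD.
    public static XDocument ParseWithNbsp(string s)
    {
        string xmlSafe = s.Replace("&nbsp;", "&#160;");
        return XDocument.Parse(xmlSafe);
    }
}
```

After parsing, the text nodes contain the actual no-break space character (U+00A0), so the markup around it is preserved and only the entity notation changes.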

Use System.Web.HttpUtility.HtmlDecode or System.Net.WebUtility.HtmlDecode.

I think you should put your string into a CDATA block, like this:
<root><![CDATA[<p>hello world</p>&nbsp;my name is!]]></root>

Related

{"'\u0004', hexadecimal value 0x04, is an invalid character

I am trying to convert a file to XML, but the data contains some special characters and the conversion fails because of them.
I already have this regex code, but it is still not working for me. Please help.
The code I have tried:
string filedata = @"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = @"[^\u0000-\u007F]+";
string re5 = @"\p{Cs}";
data = Regex.Replace(input, re1, "");
data = Regex.Replace(input, re5, "");
XmlDocument xmlDocument = new XmlDocument();
try
{
    xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
    var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
    Console.WriteLine(ex);
}
0x04 is a transmission control character and cannot appear in XML text. XmlDocument is right to reject it if it really does appear in your data. This does suggest that the regexes you have don't do what you think they do: [^\u0000-\u007F]+ only matches characters outside the ASCII range, and U+0004 lies inside that range, so it is never removed. In addition, the second Regex.Replace call runs on input again rather than on data, which discards the result of the first call. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.
I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
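If the control characters cannot be fixed at the source, one way to remove them before parsing is sketched below. This is an illustration under stated assumptions: the names XmlTextCleaner and StripForbiddenControls are hypothetical, and the pattern only covers the C0 control characters that XML 1.0 forbids (it does not handle U+FFFE, U+FFFF or unpaired surrogates).

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlTextCleaner
{
    // C0 control characters that XML 1.0 forbids: everything below U+0020
    // except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
    private static readonly Regex ForbiddenControls =
        new Regex(@"[\u0000-\u0008\u000B\u000C\u000E-\u001F]");

    // Hypothetical helper: returns the input with those characters removed.
    public static string StripForbiddenControls(string input)
    {
        return ForbiddenControls.Replace(input, string.Empty);
    }
}
```

Regex.Replace removes every match, not just the first, so a single call over the whole input is enough.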

How to unescape special characters in c#

I have the following code:
XElement element = new XElement("test", "a&b");
where element.LastNode contains the value "a&amp;b".
I want it to be "a&b".
How do I replace this?
Wait a moment,
<test>a&b</test>
is not valid XML. You cannot make XML that looks like this; the XML standard is clear on this point.
& has a special meaning: it introduces an entity or character reference, used to escape characters that would otherwise be invalid. An '&' character is encoded as &amp; in XML.
For what it's worth, this is invalid HTML for the same reason.
<!DOCTYPE html> <html> <body> a&b </body> </html>
If I write the code,
const string Value = "a&b";
var element = new XElement("test", Value);
Debug.Assert(
string.CompareOrdinal(Value, element.Value) == 0,
"XElement is mad");
It runs without error; XElement encodes and decodes to and from XML as necessary.
To unescape or decode the XML element, you simply read XElement.Value.
If you want to make a document that looks like
<test>a&b</test>
you can, but it is not XML or HTML, and tools for working with HTML or XML won't intentionally help you. You'll have to make your own readers, writers and parsers.
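To make the encode/decode behaviour described above concrete, here is a small sketch. The class AmpersandRoundTrip and its two methods are hypothetical names used only for illustration; the calls themselves (XElement.ToString, XElement.Parse, XElement.Value) are standard LINQ to XML.

```csharp
using System;
using System.Xml.Linq;

public static class AmpersandRoundTrip
{
    // Serializing escapes the ampersand, so the markup stays well-formed.
    public static string Serialized(string text)
    {
        return new XElement("test", text).ToString();
    }

    // Reading Value gives back the decoded text, with no manual unescaping.
    public static string Decoded(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

So the escaped form only exists in the serialized markup; code that reads Value never sees it.") throw new Exception("unexpected serialized form");
if (AmpersandRoundTrip.Decoded("<test>a&amp;b</test>") != "a&b") throw new Exception("unexpected decoded value");
</test>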
The & is a reserved character, so it will always be encoded. So you have to decode it.
Is this an option?
HttpUtility.HtmlDecode Method (String)
Usage:
string decoded = HttpUtility.HtmlDecode("a&amp;b");
// returns "a&b"
Try the following:
public static string GetTextFromHTML(string htmlstring)
{
    // replace all tags with spaces...
    htmlstring = Regex.Replace(htmlstring, @"<(.|\n)*?>", " ");
    // ...then collapse all double spaces
    while (htmlstring.Contains("  "))
    {
        htmlstring = htmlstring.Replace("  ", " ");
    }
    // clear out non-breaking spaces and the &amp; character code
    htmlstring = htmlstring.Replace("&nbsp;", " ");
    htmlstring = htmlstring.Replace("&amp;", "&");
    return htmlstring;
}

Check if HtmlString is whitespace in C#

I've got a wrapper that adds a header to a field whenever it has a value. The field is actually a string which holds HTML from a tinymce textbox.
Requirement: the header should not display when the field is empty or just whitespace.
Issue: whitespace in HTML is rendered as <p>&nbsp;</p>, so technically it's not an empty or whitespace value.
I can't simply use !String.IsNullOrWhiteSpace(Model.ContentField.Value), because it does have a value, albeit whitespace HTML.
I've tried to convert the value with @Html.Raw(Model.ContentField.Value), but the result is of type HtmlString, so I can't use String.IsNullOrWhiteSpace.
Any ideas? Thanks!
You can use HtmlAgilityPack, something like this:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(Model.ContentField.Value);
string textValue = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
bool isEmpty = String.IsNullOrWhiteSpace(textValue);
What I eventually did (because I didn't want to add a 3rd party library just for this), is to add a function in a helper class that strips HTML tags:
const string HTML_TAG_PATTERN = "<.*?>";
public static string StripHTML(string inputString)
{
return Regex.Replace
(inputString, HTML_TAG_PATTERN, string.Empty);
}
After which, I combined that with HttpUtility.HtmlDecode to get the inner value:
var innerContent = StringHelper.StripHTML(HttpUtility.HtmlDecode(Model.ContentField.Value));
That variable is what I used to compare. Let me know if this is a bad idea.
Thanks!
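For comparison, the strip-and-decode approach can be wrapped in a single check. This is only a sketch: HtmlWhitespace and IsBlankHtml are hypothetical names, System.Net.WebUtility is used in place of HttpUtility so that no System.Web reference is needed, and the regex is the same rough tag pattern as above, not a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlWhitespace
{
    // Hypothetical helper: true when the HTML has no visible text content.
    public static bool IsBlankHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return true;
        }
        // Strip tags first, then decode entities such as &nbsp;.
        string withoutTags = Regex.Replace(html, "<.*?>", string.Empty);
        string text = WebUtility.HtmlDecode(withoutTags);
        // IsNullOrWhiteSpace treats the no-break space (U+00A0) as white space.
        return string.IsNullOrWhiteSpace(text);
    }
}
```

Note that this sketch strips tags before decoding, the reverse of the order used above. Decoding first would turn encoded text such as &lt;b&gt; into something the tag pattern then removes.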
I had a similar question to the one in your title, but in my case the HTML string was empty. So I ended up doing the following:
HtmlString someString = new HtmlString("");
string.IsNullOrEmpty(someString.ToString());
Might be obvious, but didn't realize it at first.

What would be the best way of checking whether a string contains XML tags?

I know that the following would find potential tags, but is there a better way to check if a string contains XML tags to prevent exceptions when reading/writing the string between XML files?
string testWord = "test<a>";
bool foundTag = Regex.IsMatch(testWord, @"^*<*>*$");
I'd use another Regex for that
Regex.IsMatch(testWord, @"<.+?>");
However, even if it does match, there is no guarantee that your file actually is an xml file, as the regex could also match strings like "<<a>" which is invalid, or "a <= b >= c" which is obviously not xml.
You should consider using the XmlDocument class instead. Note that LoadXml parses a string, whereas Load expects a file name or URL.
XmlDocument xmlDoc = new XmlDocument();
try
{
    xmlDoc.LoadXml(testWord);
}
catch (XmlException)
{
    // not XML
}
Why don't you HtmlEncode the string before sending it via XML? This way you can avoid difficulties with Regex parsing tags.
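As a rough illustration of that suggestion, the sketch below encodes the markup characters so the string travels as plain text. The class and method names are hypothetical, and System.Net.WebUtility.HtmlEncode is assumed as the encoder; the receiving side would call WebUtility.HtmlDecode to get the original string back.

```csharp
using System;
using System.Net;

public static class TagSafeText
{
    // Hypothetical helper: encodes <, >, & and quotes so that the result
    // contains no literal markup characters.
    public static string Encode(string raw)
    {
        return WebUtility.HtmlEncode(raw);
    }

    // Reverses Encode on the receiving side.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

If the string is written through an XML API such as XElement, this step is not needed, because the API escapes text content itself.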

XML, issue with while parsing to ASP.net label control

I have a field called Description in the front-end form. It is a textarea where the user can type or copy and paste text, which may include line breaks as well.
From ASP.NET, all this data goes to SharePoint.
Now I have a search page which returns all these values from SharePoint using web services, in XML format.
The problem is that all of the line breaks in the value in replaced with
I am trying to display the Description field value in a label, but it's not working. I tried the following:
lblDesc.Text = xmlValuesPath.Attribute("ows_Description").Value.Replace("
", "\n");
lblDesc.Text = xmlValuesPath.Attribute("ows_Description").Value.Replace("
", "</p><p>");
The formatting works fine in a textbox, but nothing seems to work for the label. Kindly help.
Did you clear out all HTML tags from it?
public static string ClearHTMLTagsFromString(string htmlString)
{
    string regEx = @"\<[^\<\>]*\>";
    string tagless = Regex.Replace(htmlString, regEx, string.Empty);
    // remove rogue leftovers
    tagless = tagless.Replace("<", string.Empty).Replace(">", string.Empty);
    tagless = tagless.Replace("Body:", string.Empty);
    return tagless;
}
Try to replace "
" with "<br/>" it should work in ASP.NET Label.
By default ASP.NET converts it to \n, which the browser does not render as a line break in HTML at run time, so you just need to replace \n with HTML markup:
xmlValuesPath.Attribute("ows_Description").Value.Replace("\n", "</p><p>")
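A sketch of the same idea using <br/> follows. LabelText and ToHtmlWithBreaks are hypothetical names, and the HTML-encoding step is an addition not mentioned in the answers above: it is there so that user-typed text containing < or & is not interpreted as markup once the result is assigned to Label.Text.

```csharp
using System;
using System.Net;

public static class LabelText
{
    // Hypothetical helper: encodes the text, then turns line breaks into <br/>.
    public static string ToHtmlWithBreaks(string text)
    {
        string encoded = WebUtility.HtmlEncode(text);
        // Handle Windows line endings first so "\r\n" yields one <br/>, not two.
        return encoded.Replace("\r\n", "<br/>").Replace("\n", "<br/>");
    }
}
```

Usage would look like lblDesc.Text = LabelText.ToHtmlWithBreaks(xmlValuesPath.Attribute("ows_Description").Value); assuming the attribute value has already been decoded by the XML parser.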
