I have the following code
XElement element = new XElement("test", "a&b");
where
element.LastNode contains the value "a&b".
i wanted to be it "a&b".
How do i replace this?
Wait a moment,
<test>a&b</test>
is not valid XML. You cannot make XML that looks like this. This is clarified by the XML standard.
& has special meaning, it denotes an escaped character that may otherwise be invalid. An '&' character is encoded as & in XML.
for what its worth, this is invalid HTML for the same reason.
<!DOCTYPE html> <html> <body> a&b </body> </html>
If I write the code,
const string Value = "a&b";
var element = new XElement("test", Value);
Debug.Assert(
string.CompareOrdinal(Value, element.Value) == 0,
"XElement is mad");
it runs without error, XElement encodes and decodes to and from XML as necessary.
To unescape or decode the XML element you simply read XElement.Value.
If you want to make a document that looks like
<test>a&b</test>
you can but it is not XML or HTML, tools for working with HTML or XML won't intentionally help you. You'll have make your own Readers, Writers and Parsers.
The & is a reserved character so it will allways be encoded. So you have to decode:
Is this an option:
HttpUtility.HtmlDecode Method (String)
Usage:
string decoded = HttpUtility.HtmlDecode("a&b");
// returns "a&b"
Try following:
public static string GetTextFromHTML(String htmlstring)
{
// replace all tags with spaces...
htmlstring= Regex.Replacehtmlstring)#"<(.|\n)*?>", " ");
// .. then eliminate all double spaces
while (htmlstring).Contains(" "))
{
htmlstring= htmlstring.Replace(" ", " ");
}
// clear out non-breaking spaces and & character code
htmlstring = htmlstring.Replace(" ", " ");
htmlstring = htmlstring.Replace("&", "&");
return htmlstring;
}
Related
What is the difference between innerHTML, innerText and value in JavaScript?
The examples below refer to the following HTML snippet:
<div id="test">
Warning: This element contains <code>code</code> and <strong>strong language</strong>.
</div>
The node will be referenced by the following JavaScript:
var x = document.getElementById('test');
element.innerHTML
Sets or gets the HTML syntax describing the element's descendants
x.innerHTML
// => "
// => Warning: This element contains <code>code</code> and <strong>strong language</strong>.
// => "
This is part of the W3C's DOM Parsing and Serialization Specification. Note it's a property of Element objects.
node.innerText
Sets or gets the text between the start and end tags of the object
x.innerText
// => "Warning: This element contains code and strong language."
innerText was introduced by Microsoft and was for a while unsupported by Firefox. In August of 2016, innerText was adopted by the WHATWG and was added to Firefox in v45.
innerText gives you a style-aware, representation of the text that tries to match what's rendered in by the browser this means:
innerText applies text-transform and white-space rules
innerText trims white space between lines and adds line breaks between items
innerText will not return text for invisible items
innerText will return textContent for elements that are never rendered like <style /> and `
Property of Node elements
node.textContent
Gets or sets the text content of a node and its descendants.
x.textContent
// => "
// => Warning: This element contains code and strong language.
// => "
While this is a W3C standard, it is not supported by IE < 9.
Is not aware of styling and will therefore return content hidden by CSS
Does not trigger a reflow (therefore more performant)
Property of Node elements
node.value
This one depends on the element that you've targeted. For the above example, x returns an HTMLDivElement object, which does not have a value property defined.
x.value // => null
Input tags (<input />), for example, do define a value property, which refers to the "current value in the control".
<input id="example-input" type="text" value="default" />
<script>
document.getElementById('example-input').value //=> "default"
// User changes input to "something"
document.getElementById('example-input').value //=> "something"
</script>
From the docs:
Note: for certain input types the returned value might not match the
value the user has entered. For example, if the user enters a
non-numeric value into an <input type="number">, the returned value
might be an empty string instead.
Sample Script
Here's an example which shows the output for the HTML presented above:
var properties = ['innerHTML', 'innerText', 'textContent', 'value'];
// Writes to textarea#output and console
function log(obj) {
console.log(obj);
var currValue = document.getElementById('output').value;
document.getElementById('output').value = (currValue ? currValue + '\n' : '') + obj;
}
// Logs property as [propName]value[/propertyName]
function logProperty(obj, property) {
var value = obj[property];
log('[' + property + ']' + value + '[/' + property + ']');
}
// Main
log('=============== ' + properties.join(' ') + ' ===============');
for (var i = 0; i < properties.length; i++) {
logProperty(document.getElementById('test'), properties[i]);
}
<div id="test">
Warning: This element contains <code>code</code> and <strong>strong language</strong>.
</div>
<textarea id="output" rows="12" cols="80" style="font-family: monospace;"></textarea>
Unlike innerText, though, innerHTML lets you work with HTML rich text and doesn't automatically encode and decode text. In other words, innerText retrieves and sets the content of the tag as plain text, whereas innerHTML retrieves and sets the content in HTML format.
InnerText property html-encodes the content, turning <p> to <p>, etc. If you want to insert HTML tags you need to use InnerHTML.
In simple words:
innerText will show the value as is and ignores any HTML formatting which may
be included.
innerHTML will show the value and apply any HTML formatting.
Both innerText and innerHTML return internal part of an HTML element.
The only difference between innerText and innerHTML is that: innerText return HTML element (entire code) as a string and display HTML element on the screen (as HTML code), while innerHTML return only text content of the HTML element.
Look at the example below to understand better. Run the code below.
const ourstring = 'My name is <b class="name">Satish chandra Gupta</b>.';
document.getElementById('innertext').innerText = ourstring;
document.getElementById('innerhtml').innerHTML = ourstring;
.name {
color:red;
}
<p><b>Inner text below. It inject string as it is into the element.</b></p>
<p id="innertext"></p>
<br>
<p><b>Inner html below. It renders the string into the element and treat as part of html document.</b></p>
<p id="innerhtml"></p>
var element = document.getElementById("main");
var values = element.childNodes[1].innerText;
alert('the value is:' + values);
To further refine it and retrieve the value Alec for example, use another .childNodes[1]
var element = document.getElementById("main");
var values = element.childNodes[1].childNodes[1].innerText;
alert('the value is:' + values);
In terms of MutationObservers, setting innerHTML generates a childList mutation due to the browsers removing the node and then adding a new node with the value of innerHTML.
If you set innerText, a characterData mutation is generated.
innerText property sets or returns the text content as plain text of the specified node, and all its descendants, whereas the innerHTML property gets and sets the plain text or HTML contents in the elements. Unlike innerText, innerHTML lets you work with HTML rich text and doesn’t automatically encode and decode text.
InnerText will only return the text value of the page with each element on a newline in plain text, while innerHTML will return the HTML content of everything inside the body tag, and childNodes will return a list of nodes, as the name suggests.
The innerText property returns the actual text value of an html element while the innerHTML returns the HTML content. Example below:
var element = document.getElementById('hello');
element.innerText = '<strong> hello world </strong>';
console.log('The innerText property will not parse the html tags as html tags but as normal text:\n' + element.innerText);
console.log('The innerHTML element property will encode the html tags found inside the text of the element:\n' + element.innerHTML);
element.innerHTML = '<strong> hello world </strong>';
console.log('The <strong> tag we put above has been parsed using the innerHTML property so the .innerText will not show them \n ' + element.innerText);
console.log(element.innerHTML);
<p id="hello"> Hello world
</p>
To add to the list, innerText will keep your text-transform, innerHTML wont.
#rule:
innerHTML
write: whatever String you write to the ele.innerHTML, ele (the code of the element in the html file) will be exactly same as it is written in the String.
read : whatever you read from the ele.innerHTML to a String, the String will be exactly same as it is in ele (the html file).
=> .innerHTML will not make any modification for your read/write
innerText
write: when you write a String to the ele.innerText, any html reserved special character in the String will be encoded into html format first, then stored into the ele.
eg: <p> in your String will become <p> in the ele
read : when you read from the ele.innerText to a String,
any html reserved special character in the ele will be decoded back into a readable text format,
eg: <p> in the ele will become back into <p> in your String
any (valid) html tag in the ele will be removed -- so it becomes "plain text"
eg: if <em>you</em> can in the ele will become if you can in your String
about invalid html tag
if there is an invalid html tag originally in the ele (the html code), and you read from.innerText, how does the tag gets removed?
-- this ("if there is an invalid html tag originally") should not (is not possible to) happen
but its possible that you write an invalid html tag by .innerHTML (in raw) into ele -- then, this may be auto fixed by the browser.
dont take (-interpret) this as step [1.] [2.] with an order
-- no, take it as step [1.] [2.] are executed at the same time
-- I mean, if the decoded characters in [1.] will form a new tag after the conversion, [2.] does not remove it
(-- cuz [2.] considers what characters are in the ele during the conversion, not the characters they become into after the conversion)
then stored into the String.
jsfiddle: with explanation
(^ this contains much more explanations in comments of the js file, + output in console.log
below is a simplified view, with some output.
(try out the code yourself, also there is no guarantee that my explanations are 100% correct.))
<p id="mainContent">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlWrite"></p>
<p id="textWrite"></p>
// > #basic (simple)
// read
var ele_mainContent = document.getElementById('mainContent');
alert(ele_mainContent.innerHTML); // This is a <strong>sample</strong> sentennce for Reading. // >" + => `.innerHTML` will **not make any modification** for your read/write
alert(ele_mainContent.innerText); // This is a sample sentennce for Reading. // >" 2. any (valid) `html tag` in the `ele` will be **removed** -- so it becomes "plain text"
// write
var str_WriteOutput = "Write <strong>this</strong> sentence to the output.";
var ele_htmlWrite = document.getElementById('htmlWrite');
var ele_textWrite = document.getElementById('textWrite');
ele_htmlWrite.innerHTML = str_WriteOutput;
ele_textWrite.innerText = str_WriteOutput;
alert(ele_htmlWrite.innerHTML); // Write <strong>this</strong> sentence to the output. // >" + => `.innerHTML` will **not make any modification** for your read/write
alert(ele_htmlWrite.innerText); // Write this sentence to the output. // >" 2. any (valid) `html tag` in the `ele` will be **removed** -- so it becomes "plain text"
alert(ele_textWrite.innerHTML); // Write <strong>this</strong> sentence to the output. // >" any `html reserved special character` in the String will be **encoded** into html format first
alert(ele_textWrite.innerText); // Write <strong>this</strong> sentence to the output. // >" 1. any `html reserved special character` in the `ele` will be **decoded** back into a readable text format,
// > #basic (more)
// write - with html encoded char
var str_WriteOutput_encodedChar = "What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?";
var ele_htmlWrite_encodedChar = document.getElementById('htmlWrite_encodedChar');
var ele_textWrite_encodedChar = document.getElementById('textWrite_encodedChar');
ele_htmlWrite_encodedChar.innerHTML = str_WriteOutput_encodedChar;
ele_textWrite_encodedChar.innerText = str_WriteOutput_encodedChar;
alert(ele_htmlWrite_encodedChar.innerHTML); // What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?
alert(ele_htmlWrite_encodedChar.innerText); // What if you have <strong>encoded</strong> char in the sentence?
alert(ele_textWrite_encodedChar.innerHTML); // What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?
alert(ele_textWrite_encodedChar.innerText); // What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?
// > #note-advance: read then write
var ele__htmlRead_Then_htmlWrite = document.getElementById('htmlRead_Then_htmlWrite');
var ele__htmlRead_Then_textWrite = document.getElementById('htmlRead_Then_textWrite');
var ele__textRead_Then_htmlWrite = document.getElementById('textRead_Then_htmlWrite');
var ele__textRead_Then_textWrite = document.getElementById('textRead_Then_textWrite');
ele__htmlRead_Then_htmlWrite.innerHTML = ele_mainContent.innerHTML;
ele__htmlRead_Then_textWrite.innerText = ele_mainContent.innerHTML;
ele__textRead_Then_htmlWrite.innerHTML = ele_mainContent.innerText;
ele__textRead_Then_textWrite.innerText = ele_mainContent.innerText;
alert(ele__htmlRead_Then_htmlWrite.innerHTML); // This is a <strong>sample</strong> sentennce for Reading.
alert(ele__htmlRead_Then_htmlWrite.innerText); // This is a sample sentennce for Reading.
alert(ele__htmlRead_Then_textWrite.innerHTML); // This is a <strong>sample</strong> sentennce for Reading.
alert(ele__htmlRead_Then_textWrite.innerText); // This is a <strong>sample</strong> sentennce for Reading.
alert(ele__textRead_Then_htmlWrite.innerHTML); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_htmlWrite.innerText); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_textWrite.innerHTML); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_textWrite.innerText); // This is a sample sentennce for Reading.
// the parsed html after js is executed
/*
<html><head>
<meta charset="utf-8">
<title>Test</title>
</head>
<body>
<p id="mainContent">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlWrite">Write <strong>this</strong> sentence to the output.</p>
<p id="textWrite">Write <strong>this</strong> sentence to the output.</p>
<!-- P2 -->
<p id="htmlWrite_encodedChar">What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?</p>
<p id="textWrite_encodedChar">What if you have <strong>encoded</strong> char in <strong>the</strong> sentence?</p>
<!-- P3 #note: -->
<p id="htmlRead_Then_htmlWrite">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlRead_Then_textWrite">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="textRead_Then_htmlWrite">This is a sample sentennce for Reading.</p>
<p id="textRead_Then_textWrite">This is a sample sentennce for Reading.</p>
</body></html>
*/
innerhtml will apply html codes
innertext will put content as text so if you have html tags it will show as text only
1)innerHtml
sets all the html content inside the tag
returns all the html content inside the tag
includes styling + whitespaces
2)innerText
sets all the content inside the tag (with tag wise line breaks)
returns all html content inside the tag (with tag wise line breaks)
ignores tags (shows only text)
ignores styling + whitespaces
if we have style:"visibility:hidden;" inside tag
|_ innerText includes the styling -> hides content
3)textContent
sets all the content inside the tag (no tag wise line breaks)
returns all content inside the tag (no tag wise line breaks)
includes whitespaces
if we have style:"visibility:hidden;" inside tag
|_ textContent ignores the styling -> shows content
textContent has better performance because its value is not parsed as HTML.
I am trying to convert a file to XML format that contains some special characters but it's not getting converted because of that special characters in the data.
I have already this regex code still it's not working for me please help.
The code what I have tried:
string filedata = #"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = #"[^\u0000-\u007F]+";
string re5 = #"\p{Cs}";
data = Regex.Replace(input, re1, "");
data = Regex.Replace(input, re5, "");
XmlDocument xmlDocument = new XmlDocument();
try
{
xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
0x04 is a transmission control character and cannot appear in a text string. XmlDocument is right to reject it if it really does appear in your data. This does suggest that the regex you have doesn't do what you think it does, if I'm right that regex will find the first instance of one or more of those invalid characters at the beginning of a line and replace it, but not all of them. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.
I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
I have string s and it looks:
<root><p>hello world</p> my name is!</root>
I have next code:
try
{
m_Content = XDocument.Load(new StringReader(s));
}
catch (XmlException ex)
{
ex.Data["myerror"] = s;
throw;
}
As you see, I want to load string with all elements like and make it view. But I've got XmlException:
Reference to undeclared substitution to "nbsp"
Any ideas how to do it right?
Added
ChrisShao offered a good idea: put my string in <![CDATA[ tag, but unfortunately it doesnt solve my problem. I have a big string with lots of tags and few big texts in which I can meet elements. If use System.Web.HttpUtility.HtmlDecode I lose all these elements and get " " fields.
Responding to your Added section. The blank (" ") fields you get is correct representation of when it is rendered. Correct encoding of for use in xml is [Reference].
If you really want to see instead of " " when the string loaded to XDocument, try to encode ampersand char (&) with &. Replace with [Reference].
use System.Web.HttpUtility.HtmlDecode or System.Net.WebUtility.HtmlDecode
I think you should put your string into CDATA block,like this:
<root><![CDATA[<p>hello world</p> my name is!]]></root>
I've got a wrapper that adds a header to a field whenever it has a value. The field is actually a string which holds HTML from a tinymce textbox.
Requirement: the header should not display when the field is empty or just whitespace.
Issue: whitespace in html is rendered as <p> </p>, so technically it's not an empty or whitespace value
I simply can't !String.IsNullOrWhiteSpace(Model.ContentField.Value) because it does have a value, albeit whitespace html.
I've tried to convert the value onto #Html.Raw(Model.ContentField.Value) but it's of a type HtmlString, so I can't use String.IsNullOrWhiteSpace.
Any ideas? Thanks!
You can use HtmlAgilityPack, something like this:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(Model.ContentField.Value);
string textValue = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
bool isEmpty = String.IsNullOrWhiteSpace(textValue);
What I eventually did (because I didn't want to add a 3rd party library just for this), is to add a function in a helper class that strips HTML tags:
const string HTML_TAG_PATTERN = "<.*?>";
public static string StripHTML(string inputString)
{
return Regex.Replace
(inputString, HTML_TAG_PATTERN, string.Empty);
}
After which, I combined that with HttpUtility.HtmlDecode to get the inner value:
var innerContent = StringHelper.StripHTML(HttpUtility.HtmlDecode(Model.ContentField.Value));
That variable is what I used to compare. Let me know if this is a bad idea.
Thanks!
I had similar question as your title, but in my case the html string was empty. So I ended up doing the following:
HtmlString someString = new HtmlString("");
string.IsNullOrEmpty(someString.ToString());
Might be obvious, but didn't realize it at first.
I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Enviroment.NewLine + Enviroment.NewLine, "</p><p>");
html.Replace(Enviroment.NewLine, "<br/>");
html.Append("</p>");
The above code doesn't work right, as in generating correct html, if there are more than 2 line breaks in a row. Having html like <br/></p><p> is not good; the <br/> can be removed.
I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
public static string TextToHtml(string text)
{
text = HttpUtility.HtmlEncode(text);
text = text.Replace("\r\n", "\r");
text = text.Replace("\n", "\r");
text = text.Replace("\r", "<br>\r\n");
text = text.Replace(" ", " ");
return text;
}
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
return text;
}
Your other option is to take the text box contents and instead of trying for line a paragraph breaks just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>
Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.
How about throwing it in a <pre> tag. Isn't that what it's there for anyway?
I know this is an old post, but I've recently been in a similar problem using C# with MVC4, so thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach(var line in description.Split('\n')
{
Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix # before any variable or identifiers, because of the Razor syntax in the ASP.NET MVC framework. However, I've shown this with a Console.Write, but you should be able to figure out how to implement this in your specific project based on this :)
Combining all previous plus considering titles and subtitles within the text comes up with this:
public static string ToHtml(this string text)
{
var sb = new StringBuilder();
var sr = new StringReader(text);
var str = sr.ReadLine();
while (str != null)
{
str = str.TrimEnd();
str.Replace(" ", " ");
if (str.Length > 80)
{
sb.AppendLine($"<p>{str}</p>");
}
else if (str.Length > 0)
{
sb.AppendLine($"{str}</br>");
}
str = sr.ReadLine();
}
return sb.ToString();
}
the snippet could be enhanced by defining rules for short strings
I understand that I was late with the answer for 13 years)
but maybe someone else needs it
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static Regex _breakRegex = new("(\r?\n)+");
private static Regex _paragrahBreakRegex = new("(?:\r?\n){2,}");
public static string ConvertTextToHtml(string description) {
string[] descrptionParagraphs = _paragrahBreakRegex.Split(description.Trim());
if (descrptionParagraphs.Length > 0)
{
description = string.Empty;
foreach (string line in descrptionParagraphs)
{
description += $"<p>{line}</p>";
}
}
return _breakRegex.Replace(description, "<br/>");
}