Ok I feel really stupid asking this. I see plenty of other questions that resemble my question, but none seem to be able to answer it.
I am creating an xml file for a program that is very picky about syntax. Sadly I am making the XML file from scratch. Meaning, I am placing each line in individually (lots of file.WriteLine(String)).
I know this is ugly, but it's the only way I can get the logic to work out.
ANYWAY. I have a few strings that are coming through with '&' in them.
if (value.Contains("&"))
{
value.Replace("&", "&amp;");
}
Does not seem to work. The value.Contains() seems to see it, but the replace does not work. I am using C# .Net 2.0 sp2. VS 2005.
Please help me out here. It's been a long week.
If you really want to go that route, you have to assign the result of Replace (the method returns a new string because strings are immutable) back to the variable:
value = value.Replace("&", "&amp;");
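To make the immutability point concrete, here is a minimal sketch. The class and method names (AmpersandEscaper, EscapeAmpersands) are made up for illustration and are not from the question. It only handles the ampersand, so it is not a complete XML escaper.

```csharp
using System;

public static class AmpersandEscaper
{
    // Returns a copy of 'value' with every '&' replaced by "&amp;".
    // string.Replace never changes the original string; it returns a new
    // one, so the result has to be returned or assigned by the caller.
    public static string EscapeAmpersands(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        return value.Replace("&", "&amp;");
    }
}
```

Calling value.Replace(...) on its own line throws the result away, which is why the original code appeared to do nothing. The Contains check is also unnecessary, since Replace simply returns an equal string when there is nothing to replace.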
I would suggest rethinking the way you're writing your XML though. If you switch to using the XmlTextWriter, it will handle all of the encoding for you (not only the ampersand, but all of the other characters that need to be encoded as well):
using(var writer = new XmlTextWriter(@"C:\MyXmlFile.xml", null))
{
writer.WriteStartElement("someString");
writer.WriteString("This is < a > string & everything will get encoded");
writer.WriteEndElement();
}
Should produce:
<someString>This is &lt; a &gt; string &amp; everything will get encoded</someString>
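If the rest of the file still has to go through file.WriteLine, one possible middle ground is to let an XmlWriter build each element into a string. The sketch below assumes that approach; XmlSnippet and Element are hypothetical names, not part of any library.

```csharp
using System.IO;
using System.Xml;

public static class XmlSnippet
{
    // Hypothetical helper: writes a single element with XmlWriter and
    // returns the resulting markup as a string.
    public static string Element(string name, string text)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var buffer = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            writer.WriteStartElement(name);
            writer.WriteString(text);   // the writer escapes & and < for us
            writer.WriteEndElement();
        }
        return buffer.ToString();
    }
}
```

With this, a line such as file.WriteLine(XmlSnippet.Element("someString", value)) would emit a properly escaped element without any manual Replace calls.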
You should really use something like Linq to XML (XDocument etc.) to solve it. I'm 100% sure you can do it without all your WriteLine´s ;) Show us your logic?
Otherwise you could use this which will be bullet proof (as opposed to .Replace("&")):
var value = "hej&hej<some>";
value = new System.Xml.Linq.XText(value).ToString(); // hej&amp;hej&lt;some&gt;
This will also take care of < which you also HAVE TO escape :)
Update: I have looked at the code for XText.ToString() and internally it creates an XmlWriter + StringWriter and uses XNode.WriteTo. This may be overkill for a given application, so if many strings should be converted, XText.WriteTo would be better. An alternative which should be fast and reliable is System.Web.HttpUtility.HtmlEncode.
Update 2: I found this System.Security.SecurityElement.Escape(xml) which may be the fastest and ensures max compatibility (supported since .Net 1.0 and does not require the System.Web reference).
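A small sketch of the SecurityElement.Escape route, in case it helps. The wrapper class XmlText and its Element method are illustrative names only. The only framework call used is System.Security.SecurityElement.Escape.

```csharp
using System.Security;

public static class XmlText
{
    // Escapes the five characters XML treats specially: < > & " '
    public static string Escape(string value)
    {
        return SecurityElement.Escape(value);
    }

    // Hypothetical convenience for the WriteLine-based approach:
    // builds <name>escaped value</name> as one line of text.
    public static string Element(string name, string value)
    {
        return "<" + name + ">" + Escape(value) + "</" + name + ">";
    }
}
```

Note that the element name itself is not validated here, so this is only suitable when the names are fixed strings in your own code.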
You can also use the HttpUtility.HtmlEncode method in the System.Web namespace instead of doing the replacement yourself.
here you go: http://msdn.microsoft.com/en-us/library/73z22y6h.aspx
You can use Regex for replace char "&" only in node values:
input data example (string)
<select>
<option id="11">Gigamaster&Minimaster</option>
<option id="12">Black & White</option>
<option id="13">Other</option>
</select>
Replace with Regex
Regex rgx = new Regex(">(?<prefix>.*)&(?<sufix>.*)<");
data = rgx.Replace(data, ">${prefix}&amp;${sufix}<");
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(data);
result data
<select>
<option id="11">Gigamaster&amp;Minimaster</option>
<option id="12">Black &amp; White</option>
<option id="13">Other</option>
</select>
I'm Obviously very late to this, but the right answer is:
System.Text.RegularExpressions.Regex.Replace(input, "&(?!amp;)", "&amp;");
Hope this helps somebody!
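One caveat with the negative lookahead above: it only recognises &amp;, so any other entity already in the text gets its ampersand escaped a second time. The sketch below shows both behaviours side by side. AmpFixer and its two methods are hypothetical names, and the wider pattern in AnyEntity is my own assumption of what "already encoded" should cover (the five predefined entities plus numeric references).

```csharp
using System.Text.RegularExpressions;

public static class AmpFixer
{
    // Escapes '&' unless it is already the start of "&amp;".
    public static string AmpOnly(string input)
    {
        return Regex.Replace(input, "&(?!amp;)", "&amp;");
    }

    // Variant: also leaves &lt; &gt; &quot; &apos; and numeric
    // references such as &#38; or &#x26; untouched.
    public static string AnyEntity(string input)
    {
        return Regex.Replace(
            input,
            "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
            "&amp;");
    }
}
```

For the input "a & b &amp; c &lt; d", AmpOnly turns &lt; into &amp;lt; while AnyEntity leaves it alone. Neither version can tell whether a literal "&lt;" in the source text was meant as an entity or as plain text, which is the usual argument for escaping at write time instead.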
You can try:
value = value.Replace("&", "&amp;");
Strings are immutable. You need to write:
value = value.Replace("&", "&amp;");
Note that if you do this and your string contains "&amp;", it's going to get changed to "&amp;amp;".
I've created the following function to encode & and ' without messing up already encoded &amp; or &apos; or &quot;
public static string encodeSelectXMLCharacters(string xmlString)
{
string returnValue = Regex.Replace(xmlString, "&(?!quot;|apos;|amp;|lt;|gt;|#x?.*?;)|'",
delegate(Match m)
{
string encodedValue;
switch (m.Value)
{
case "&":
encodedValue = "&amp;";
break;
case "'":
encodedValue = "&apos;";
break;
default:
encodedValue = m.Value;
break;
}
return encodedValue;
});
return returnValue;
}
not sure if this is useful to anyone... I was fighting this for a while... here is a glorious regex you can use to fix all your links, javascript, content. I had to deal with a ton of legacy content that nobody wanted to correct.
Add this to your Render override in your master page, control or recode to run a string through it. Please don't flame me for putting this in the wrong place:
// remove the & from href="blaw?a=b&b=c" and replace with &amp;
//in urls - this corrects any unencoded & not just those in URL's
// this match will also ignore any matches it finds within <script> blocks AND
// it will also ignore the matches where the link includes a javascript command like
// blaw
html = Regex.Replace(html, "&(?!(?<=(?<outerquote>[\"'])javascript:(?>(?!\\k<outerquote>|[>]).)*)\\k<outerquote>?)(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\\d+);)(?!(?>(?:(?!<script|\\/script>).)*)\\/script>)", "&amp;", RegexOptions.Singleline | RegexOptions.IgnoreCase);
It's a broad stroke for a rendered page, but this can be adapted to many uses without blowing up your page.
What about
Value = Server.HtmlEncode(Value);
I am quite sure it will work if you "embrace" your value with CDATA, so the result is something like
<ampersandData><![CDATA[value with ampersands like …]]></ampersandData>
Hope it helps.
Michael
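One thing to watch with the CDATA approach: the value itself must not contain the sequence "]]>", because that would close the section early. A common workaround is to split that sequence across two adjacent CDATA sections. The sketch below assumes that workaround; CData and Wrap are hypothetical names.

```csharp
public static class CData
{
    // Wraps 'value' in a CDATA section. Ampersands and angle brackets
    // need no escaping inside CDATA, but a literal "]]>" would end the
    // section, so it is split across two sections.
    public static string Wrap(string value)
    {
        return "<![CDATA[" + value.Replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

Whether the consuming program accepts CDATA at all depends on how picky its parser is, so this is worth testing against the real target before relying on it.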
Very late here, but I want to share my solution which handles the cases where you have both & (incorrect xml) and &amp; (valid xml) in the document in addition to other xml character entities.
This solution is only meant for cases where you cannot control generation of the xml, usually because it comes from some external source. If you control the xml generation please use XmlTextWriter as suggested by @Justin Niessner
It is also quite fast and handles all the different xml character entities/references
Predefined character entities:
& quot;
& amp;
& apos;
& lt;
& gt;
Numeric character entities/references:
& #nnnn;
& #xhhhh;
PS! The space after & should not be included in the entities/references, I just added it here to avoid it being encoded in the page rendering
Code
public static string CleanXml(string text)
{
int length = text.Length;
StringBuilder stringBuilder = new StringBuilder(length);
for (int i = 0; i < length; ++i)
{
if (text[i] == '&')
{
var remaining = length - i; // characters left from i to the end, so Substring cannot run past the string
var subStrLength = Math.Min(remaining, 12);
var subStr = text.Substring(i, subStrLength);
var firstIndexOfSemiColon = subStr.IndexOf(';');
if (firstIndexOfSemiColon > -1)
subStr = subStr.Substring(0, firstIndexOfSemiColon + 1);
var matches = Regex.Matches(subStr, "&(?!quot;|apos;|amp;|lt;|gt;|#x?.*?;)|'");
if (matches.Count > 0)
stringBuilder.Append("&amp;");
else
stringBuilder.Append("&");
}
else if (XmlConvert.IsXmlChar(text[i]))
{
stringBuilder.Append(text[i]);
}
else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
{
stringBuilder.Append(text[i]);
stringBuilder.Append(text[i + 1]);
++i;
}
}
return stringBuilder.ToString();
}
Related
I'm trying to figure out a way to count the number of characters in a string, truncate the string, then returns it. However, I need this function to NOT count HTML tags. The problem is that if it counts HTML tags, then if the truncate point is in the middle of a tag, then the page will appear broken.
This is what I have so far...
public string Truncate(string input, int characterLimit, string currID) {
string output = input;
// Check if the string is longer than the allowed amount
// otherwise do nothing
if (output.Length > characterLimit && characterLimit > 0) {
// cut the string down to the maximum number of characters
output = output.Substring(0, characterLimit);
// Check if the character right after the truncate point was a space
// if not, we are in the middle of a word and need to remove the rest of it
if (input.Substring(output.Length, 1) != " ") {
int LastSpace = output.LastIndexOf(" ");
// if we found a space then, cut back to that space
if (LastSpace != -1)
{
output = output.Substring(0, LastSpace);
}
}
// end any anchors
if (output.Contains("<a href")) {
output += "</a>";
}
// Finally, add the "..." and end the paragraph
output += "<br /><br />...<a href='Announcements.aspx?ID=" + currID + "'>see more</a></p>";
}
return output;
}
But I'm not happy with this. Is there a better way to do this? If you could provide a new solution to this, or perhaps suggestions on what to add to what I have so far, that would be great.
Disclaimer: I've never worked with C#, so I'm not familiar with the concepts related to the language... I'm doing this because I have to, not by choice.
Thanks,
Hristo
Use the right tool for the problem.
HTML is not a simple format to parse. I would advise that you use a proven, existing parser rather than rolling your own. If you know that you will only ever parse XHTML - then you could use an XML parser instead.
These are the only reliable ways to perform operations on HTML that will preserve the semantic representation.
Don't try to use regular expressions. HTML is not a regular language and you can only cause yourself grief and misery going in that direction.
Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like to store each token that is enclosed in double quotes in an array. Is there a safe way to do this instead of using the String split() function? Currently I load up the file in a RichTextBox, and then using its Lines[] property, I loop over each Lines[] element and do this:
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, a comma inside a token can easily mess up the split() function. So, instead of ending up with three tokens as I want, I end up with 6 tokens.
Any help will be appreciated!
You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = @"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979
I've done this with my own method. It simply counts the number of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
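As one possible way to "improve this to your needs", here is a sketch of a small state machine that also strips the surrounding quotes, treats a doubled quote ("") as a literal quote, and ignores whitespace around quoted fields as in the sample data. CsvLine and Split are illustrative names. It handles a single line only, so quoted fields containing line breaks are outside its scope.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;      // between an opening and a closing quote
        bool fieldQuoted = false;   // the current field had a quoted section

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);          // commas are plain text here
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');        // doubled quote = literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;           // closing quote
                }
            }
            else if (c == '"')
            {
                // Drop whitespace that preceded the opening quote.
                if (current.ToString().Trim().Length == 0) current.Length = 0;
                inQuotes = true;
                fieldQuoted = true;
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Length = 0;
                fieldQuoted = false;
            }
            else if (!(fieldQuoted && char.IsWhiteSpace(c)))
            {
                current.Append(c);              // skip spaces after a closing quote
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

For the line from the question this yields three fields, with the address kept intact including its commas.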
It's not an exact answer to your question, but why don't you use an existing library to manipulate CSV files? A good example would be LinqToCsv. CSV can be delimited with various punctuation signs. Moreover, there are gotchas which have already been addressed by library creators, such as dealing with the header row, dealing with different date formats, and mapping rows to C# objects.
You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');
If your CSV line is tightly packed it's easiest to use the end and tail removal mentioned earlier and then a simple split on a joining string
string[] tokens = input.Substring(1, input.Length - 2).Split(new[] { "\",\"" }, StringSplitOptions.None);
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"
Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader))
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file
I can't seem to find a good solution to this issue. I've got an array of strings that are fed in from a report that I receive about lost or stolen equipment. I've been using the string.IndexOf function through the rest of the form and it works quite well. The issue is with the field that says if the device was lost or stolen.
Example:
"Lost or Stolen? Lost"
"Lost or Stolen? Stolen"
I need to be able to read this, but when I do string.IndexOf(@"Lost") it will always return lost because it's in the question.
Unfortunately I'm not able to change the form itself in any way, and due to the nature of how it's submitted I can't just write code that knocks the first 15 or so characters off the string because that may be too few in some cases.
I would really like something in C# that would allow me to continue to search a string after the first result is found so that the logic would look like:
string my_string = "Lost or Stolen? Stolen";
searchFor(@"Stolen" in my_string)
{
Found Stolen;
Does it have "or " infront of it? yes;
ignore and keep searching;
Found Stolen again;
return "Equipment stolen";
}
Couple of options here. You could look for the last index of a space and take the rest of the string:
string input = "Lost or Stolen? Stolen";
int lastSpaceIndex = input.LastIndexOf(' ');
string result = input.Substring(lastSpaceIndex + 1);
Console.WriteLine(result);
Or you could split it and take the last word:
string input = "Lost or Stolen? Lost";
string result = input.Split(' ').Last();
Console.WriteLine(result);
Regex is also an option, but overkill given the simpler solutions above. A nice shortcut that fits this scenario is to use the RegexOptions.RightToLeft option to get the first match starting from the right:
string result = Regex.Match(input, @"\w+", RegexOptions.RightToLeft).Value;
If I understand your requirement, you're looking for an instance of Lost or Stolen after a ?:
var q = myString.IndexOf("?");
var lost = q >= 0 && myString.IndexOf("Lost", q) > 0;
var stolen = q >= 0 && myString.IndexOf("Stolen", q) > 0;
// or
var lost = myString.LastIndexOf("Lost") > myString.IndexOf("?");
var stolen = myString.LastIndexOf("Stolen") > myString.IndexOf("?");
// don't forget
var neither = !lost && !stolen;
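Put together as a reusable helper, the "look after the question mark" idea could be sketched like this. LossReport and Parse are hypothetical names, and the sketch assumes the answer is the only text after the '?'.

```csharp
using System;

public static class LossReport
{
    // Returns "Lost", "Stolen", or null when neither word follows the '?'.
    public static string Parse(string field)
    {
        int q = field.IndexOf('?');
        if (q < 0) return null;

        // Only the text after the question mark is the actual answer.
        string answer = field.Substring(q + 1).Trim();
        if (answer.Equals("Lost", StringComparison.OrdinalIgnoreCase)) return "Lost";
        if (answer.Equals("Stolen", StringComparison.OrdinalIgnoreCase)) return "Stolen";
        return null;
    }
}
```

Because the words in the question itself are never inspected, the length of the question text no longer matters.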
You can look for the string 'Lost' and if it occurs twice, then you can confirm it is 'Lost'.
It's possible in this case that you could use IndexOf on a substring, knowing that it is always going to say "Lost or Stolen" first,
so you parse out the "Lost or Stolen", then look for your keyword in the remaining string.
Something like:
int questionIndex = inputValue.IndexOf("?");
string toMatch = inputValue.Substring(questionIndex + 1).Trim();
if(toMatch == "Lost")
If it works for your use case, it might be easier to use .EndsWith().
bool lost = my_string.EndsWith("Lost");
I have a variable in C# holding a string like this:
string myText="my text which contains <div>i am text inside div</div>";
Now I want to replace all "\n" (newline characters) with "<br>" in this variable's data, except for text inside the div.
How do I do this?
Others have suggested using libraries such as HTMLAgilityPack. It is indeed a nice tool, but if you don't need HTML parsing functionality beyond what you have requested, a simple parser should suffice:
string ReplaceNewLinesWithBrIfNotInsideDiv(string input) {
int divNestingLevel = 0;
StringBuilder output = new StringBuilder();
StringComparison comp = StringComparison.InvariantCultureIgnoreCase;
for (int i = 0; i < input.Length; i++) {
if (input[i] == '<') {
if (i < (input.Length - 3) && input.Substring(i, 4).Equals("<div", comp)){
divNestingLevel++;
} else if (divNestingLevel != 0 && i < (input.Length - 5) && input.Substring(i, 6).Equals("</div>", comp)) {
divNestingLevel--;
}
}
if (input[i] == '\n' && divNestingLevel == 0) {
output.Append("<br/>");
} else {
output.Append(input[i]);
}
}
return output.ToString();
}
This should handle nested divs as well.
For something like this you will need to parse the HTML in order to distinguish the parts that you do want to make the replacement in from the ones you don't.
I suggest looking at the HTML agility pack - it can parse HTML fragments as well as malformed HTML. You can then query the resulting parse tree using XPath notation and do your replacement on the selected nodes.
That would require some fairly complicated RegEx, out of my league.
But you could try splitting the string:
string[] parts = myText.Split(new[] { "<div>", "</div>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length; i += 2) // only the even parts
parts[i] = parts[i].Replace("\n", "<br>");
And then use a StringBuilder to re-assemble the parts.
I would split the string on div and then look at the tokens. If a token does not start with "div", replace \n with BR; if it does start with "div", you need to find the closing div and split on that, then take the second token and do what you just did. Of course, you are going to have to keep appending the tokens to a master string. I'll code up a sample here in a few minutes...
Use the string.Replace() method like this:
myText = myText.Replace("\n", "<br>");
You could consider using the Environment.NewLine property to find the newline chars. Are you sure they are not \n\r or \r\n etc...
You may have to pull the text inside the div out first if you don't want to parse that. Use a regex to find it and remove it, then do the Replace() as above, then put the strings back together.
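On the question of which newline characters are actually present, one option is to normalise all common line endings first and only then insert the break tags. The sketch below assumes that approach. LineBreaks and ToBr are made-up names, and the div exclusion from the question is deliberately left out here.

```csharp
public static class LineBreaks
{
    // Converts Windows (\r\n), old Mac (\r) and Unix (\n) line endings
    // to a single "<br>" each. The \r\n pair is handled first so it
    // does not turn into two breaks.
    public static string ToBr(string text)
    {
        return text.Replace("\r\n", "\n")
                   .Replace("\r", "\n")
                   .Replace("\n", "<br>");
    }
}
```

This would be applied only to the parts of the string that sit outside the div, however those parts end up being separated.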
Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
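The regex steps described above could be sketched roughly as follows. TagStripper and Strip are hypothetical names, and System.Net.WebUtility.HtmlDecode assumes .NET 4 or later (on older frameworks HttpUtility.HtmlDecode would be the closest equivalent). The same limitation about > inside attribute values applies.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string Strip(string html)
    {
        // 1. Remove anything that looks like a tag, including an
        //    unterminated tag at the very end of the input.
        string noTags = Regex.Replace(html, "<[^>]*(>|$)", string.Empty);

        // 2. Optionally turn character entities back into characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. Collapse runs of whitespace into one space and trim.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Be aware that after step 2 the result is plain text, not markup: an input containing &lt;b&gt; comes out as <b>, so the output must be encoded again before it is written into an HTML page.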
Go download HTMLAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, besides better performance, the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on well-formatted HTML, though, so always add a catch with a regex as a backup. Note that this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}
string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);
I've looked at the Regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability & how to prevent it - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
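As a minimal illustration of that last point, the built-in encoder can be applied just before the text is written into the page. SafeOutput and ForHtml are illustrative names. WebUtility.HtmlEncode assumes .NET 4 or later.

```csharp
using System.Net;

public static class SafeOutput
{
    // Encodes text so the browser shows it literally instead of
    // interpreting it as markup.
    public static string ForHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText);
    }
}
```

Encoding belongs at the point of output, so the stored or stripped text stays as plain text and is only encoded when it is rendered.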
For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter,i.e. keep some tags, you may need some code like this by using HTMLagilityPack:
public string StripTags(HtmlNode documentNode, IList<string> keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
Simply use string.StripHTML();