XmlException - given illegal XML from 3rd party; must process


There are several SO questions and answers about illegal XML characters when creating an XML file, but I can't find any that cover bad XML handed to you by a 3rd party that you must still process. Note that the 3rd party cannot be held accountable for the illegal XML.
Ultimately, the .InnerText needs to be escaped or encoded (i.e. changed to legal XML characters) and later decoded after proper XML parsing.
QUESTION: Are there any libraries that will Load() invalid/illegal XML files and allow quick navigation for such escaping/encoding? Or am I stuck manually parsing the invalid XML, fixing it along the way?
<?xml version="1.0" encoding="utf-8"?>
<ChunkData>
<Fields>
<Field1>some words < other words</Field1>
<Field2>some words > other words</Field2>
</Fields>
</ChunkData>

Although HtmlAgilityPack is awesome (and I'm using it in another project of my own), I was not given the time to follow Alexei's advice, which is exactly the direction I was looking for: can't parse it as XML? Cool, parse it as HTML ... didn't even cross my mind ...
Ended up with this, which does the trick (but is exactly what Alexei advised against):
private static string EncodeValues(string xml)
{
var doc = new List<string>();
var lines = xml.Split('\n');
foreach (var line in lines)
{
var output = line;
if (line.Contains("<Field") && !line.Contains("Fields>"))
{
var value = line.Parse(">", "</");
var encoded = HttpUtility.UrlEncode(value);
output = line.Replace(value, encoded);
}
doc.Add(output);
}
return string.Join("", doc);
}
private static Hashtable DecodeValues(IDictionary data)
{
var output = new Hashtable();
foreach (var key in data.Keys)
{
var value = (string)data[key];
output.Add(key, HttpUtility.UrlDecode(value));
}
return output;
}
Used in conjunction with an extension method I wrote quite a while ago ...
public static string Parse(this string s, string first, string second)
{
try
{
if (string.IsNullOrEmpty(s)) return "";
var start = s.IndexOf(first, StringComparison.InvariantCulture) + first.Length;
var end = s.IndexOf(second, start, StringComparison.InvariantCulture);
var length = end - start;
return (end > 0 && length < s.Length) ? s.Substring(start, length) : s.Substring(start);
}
catch (Exception) { return ""; }
}
Used as such (kept separate from the Transform and Hashtable creation methods for clarity):
xmlDocs[0] = EncodeValues(xmlDocs[0]); // in order to handle illegal chars in XML, encode InnerText
var doc = TransformXmlDocument(orgName, xmlDocs[0], xmlDocs[1]);
var data = GetHashtableFromXml(doc);
data = DecodeValues(data); // decode the values extracted from the hashtable
Regardless, I'm always looking for insight ... feel free to comment on this solution - or provide another.
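For comparison, here is a minimal sketch of a regex-based variant of the same "escape first, parse second" idea. It is not what I actually shipped. It assumes the leaf elements are named Field1, Field2, ... as in the sample, that they contain only text (no child elements), and that the values are not already XML-escaped (otherwise an existing &amp; would be escaped twice). LenientXml, EscapeFieldValues and Load are names made up for this illustration.

```csharp
using System.Security;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class LenientXml
{
    // Assumption: leaf elements are named Field1, Field2, ... and hold plain text only.
    // "<Fields>" is not matched because the pattern requires at least one digit.
    private static readonly Regex FieldPattern =
        new Regex(@"<(Field\d+)>(.*?)</\1>", RegexOptions.Singleline);

    // Escapes the text of every FieldN element so the document becomes well-formed.
    public static string EscapeFieldValues(string xml)
    {
        return FieldPattern.Replace(xml, m =>
        {
            string name = m.Groups[1].Value;
            string text = SecurityElement.Escape(m.Groups[2].Value);
            return "<" + name + ">" + text + "</" + name + ">";
        });
    }

    // Parses the repaired document; the parser unescapes the values when they are read.
    public static XDocument Load(string xml)
    {
        return XDocument.Parse(EscapeFieldValues(xml));
    }
}
```

Because the values are escaped with real XML entities, no separate decode step is needed afterwards: reading .Value from the parsed document returns the original text.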

Related

Custom Uppercase on String

Hi, I was trying to write a program that changes a word in a string to uppercase.
The word to uppercase is wrapped in a tag like this:
the <upcase>weather</upcase> is very <upcase>hot</upcase>
the result :
the WEATHER is very HOT
my code is like this :
string upKey = "<upcase>";
string lowKey = "</upcase>";
string quote = "the lazy <upcase>fox jump over</upcase> the dog <upcase> something here </upcase>";
int index = quote.IndexOf(upKey);
int indexEnd = quote.IndexOf(lowKey);
while(index!=-1)
{
for (int a = 0; a < index; a++)
{
Console.Write(quote[a]);
}
string upperQuote = "";
for (int b = index + 8; b < indexEnd; b++)
{
upperQuote += quote[b];
}
upperQuote = upperQuote.ToUpper().ToString();
Console.Write(upperQuote);
for (int c = indexEnd+9;c<quote.Length;c++)
{
if (quote[c]=='<')
{
break;
}
Console.Write(quote[c]);
}
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
}
Console.WriteLine();
}
I have also tried this code, together with a while (indexEnd != -1) loop:
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
but that does not work: the program runs into an infinite loop. By the way, I'm a noob, so please give an answer that I can understand :)

You can use a regular expression for this:
string input = "the <upcase>weather</upcase> is very <upcase>hot</upcase>";
var regex = new Regex("<upcase>(?<theMatch>.*?)</upcase>");
var result = regex.Replace(input, match => match.Groups["theMatch"].Value.ToUpper());
// result will be: "the WEATHER is very HOT"
Here's an explanation taken from here for the regular expression used above:
<upcase> matches the characters <upcase> literally (case sensitive)
(?<theMatch>.*?) Named capturing group theMatch
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
< matches the characters < literally
/ matches the character / literally
upcase> matches the characters upcase> literally (case sensitive)

The following will work as long as there are only matching tags and none of them are nested.
public static string Upper(string str)
{
const string start = "<upcase>";
const string end = "</upcase>";
var builder = new StringBuilder();
// Find the first start tag
int startIndex = str.IndexOf(start);
// If no start tag found then return the original
if (startIndex == -1)
return str;
// Append the part before the first tag as is
builder.Append(str.Substring(0, startIndex));
// Continue as long as we find another start tag.
while (startIndex != -1)
{
// Find the end tag for the current start tag
var endIndex = str.IndexOf(end, startIndex);
// Append the text between the start and end as upper case.
builder.Append(
str.Substring(
startIndex + start.Length,
endIndex - startIndex - start.Length).ToUpper());
// Find the next start tag.
startIndex = str.IndexOf(start, endIndex);
// Append the part after the end tag, but before the next start as is
builder.Append(
str.Substring(
endIndex + end.Length,
(startIndex == -1 ? str.Length : startIndex) - endIndex - end.Length));
}
return builder.ToString();
}

I'm not rewriting your code. Just answering your (main) question:
You need to keep a variable of the index you're at, and check for IndexOf from there only (See MSDN). Something like this:
int index = 0;
while (quote.IndexOf(upKey, index) != -1)
{
//Your code, including updating the value of index.
}
(I didn't check this on Visual Studio. This is just to point you in the direction that I think you're looking for.)
The reason for the infinite loop is that you're always testing IndexOf of the same index. Perhaps you mean to have quote.IndexOf(upKey, index += 1); which would change the value of index?
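To make the "keep a variable of the index you're at" idea concrete, here is a minimal sketch of the loop with a moving position. It assumes matching, non-nested tags, and it builds the result in a StringBuilder instead of writing to the console. UpcaseTags and Apply are names invented for the example, not part of the original code.

```csharp
using System;
using System.Text;

public static class UpcaseTags
{
    public static string Apply(string quote)
    {
        const string upKey = "<upcase>";
        const string lowKey = "</upcase>";
        var result = new StringBuilder();
        int position = 0; // everything before this index has already been handled

        while (true)
        {
            // Always search from the current position, never from the start again.
            int open = quote.IndexOf(upKey, position, StringComparison.Ordinal);
            if (open == -1) break;
            int close = quote.IndexOf(lowKey, open + upKey.Length, StringComparison.Ordinal);
            if (close == -1) break; // unmatched opening tag: leave the rest untouched

            // Copy the plain text before the tag, then the upper-cased tag content.
            result.Append(quote, position, open - position);
            int innerStart = open + upKey.Length;
            result.Append(quote.Substring(innerStart, close - innerStart).ToUpperInvariant());

            // Move past the closing tag so the next search makes progress.
            position = close + lowKey.Length;
        }

        // Copy whatever follows the last closing tag.
        result.Append(quote, position, quote.Length - position);
        return result.ToString();
    }
}
```

The loop cannot run forever because position strictly increases on every iteration that finds a tag pair.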

The way to go here is probably to use Regex, but these easy parsing exercises are always fun to do manually. This can be easily solved using a very simple state machine.
What states can we have when dealing with strings of this nature? I can think of 4:
We are either parsing normal text
Or we are parsing an opening format tag '<...>'
Or we are parsing a closing format tag '</...>'
Or we are parsing text to be formatted between tags
I can't think of any other states. Now we need to think about the normal flow / transition between states. What should happen when we parse a string with the correct format?
Parser starts up expecting normal text. That is easy to understand.
If expecting normal text we encounter a '<' then the parser should switch to parsing opening format tag state. There is no other valid state transition.
If in parsing opening format tag state we encounter a '>' then the parser should switch to parsing text to be formatted. There is no other valid state transition.
If in parsing text to be formatted we encounter a '<' then the parser should switch to parsing closing tag. Again, there is no other valid state transition.
If in parsing closing tag we encounter a '>' then the parser should switch to normal text. Once more, there is no other valid transition. Note that we are disallowing nested tags.
Ok, so that seems pretty easy to understand. What do we need to implement this?
First we'll need something to represent the parsing states. A good old enum will do:
private enum ParsingState
{
UnformattedText,
OpenTag,
CloseTag,
FormattedText,
}
Now we need some string buffers to keep track of the final formatted string, the current format tag we are parsing and finally the substring we need to format. We will use several StringBuilders for these, as we don't know how long these buffers will be or how many concatenations will be performed:
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
We will also need to keep track of the parser's state and the current active tag if any (so we can make sure that the parsed closing tag matches the current active tag):
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
And now we are good to go, but before we do, can we generalize this so it works with any format tag?
Yes we can, we just need to tell the parser what to do for each supported tag. We can do this easily by passing along a Dictionary that ties each tag to the action it should perform. We do this the following way:
var formatter = new Dictionary<string, Func<string, string>>();
formatter.Add("upcase", s => s.ToUpperInvariant());
formatter.Add("lcase", s => s.ToLowerInvariant());
Great! Now our implementation could be the following:
public static string Parse(this string str, Dictionary<string, Func<string,string>> formatter)
{
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
foreach (var c in str)
{
switch (state)
{
case ParsingState.UnformattedText:
{
if (c != '<')
{
formattedStringBuffer.Append(c);
}
else
{
state = ParsingState.OpenTag;
}
break;
}
case ParsingState.OpenTag:
{
if (c != '>')
{
tagBuffer.Append(c);
}
else
{
state = ParsingState.FormattedText;
activeFormatTag = tagBuffer.ToString();
tagBuffer.Clear();
}
break;
}
case ParsingState.FormattedText:
{
if (c != '<')
{
formatBuffer.Append(c);
}
else
{
state = ParsingState.CloseTag;
}
break;
}
case ParsingState.CloseTag:
{
if (c!='>')
{
tagBuffer.Append(c);
}
else
{
var expectedTag = $"/{activeFormatTag}";
var tag = tagBuffer.ToString();
if (tag != expectedTag)
throw new FormatException($"Expected closing tag not found: <{expectedTag}>.");
if (formatter.ContainsKey(activeFormatTag))
{
var formatted = formatter[activeFormatTag](formatBuffer.ToString());
formattedStringBuffer.Append(formatted);
tagBuffer.Clear();
formatBuffer.Clear();
state = ParsingState.UnformattedText;
}
else
throw new FormatException($"Format tag <{activeFormatTag}> not recognized.");
}
break;
}
}
}
if (state != ParsingState.UnformattedText)
throw new FormatException($"Bad format in specified string '{str}'");
return formattedStringBuffer.ToString();
}
Is it the most elegant solution? No, Regex will do a much better job, but as you are a beginner I would not recommend that you start solving these kinds of problems that way; you'll learn a whole lot more solving them manually. You'll have plenty of time to learn Regex later on.

How can I Split(',') a string while ignore commas in between quotes?

I am using the .Split(',') method on a string that I know has values delimited by commas and I want those values to be separated and put into a string[] object. This works great for strings like this:
78,969.82,GW440,.
But the values start to look different when that second value goes over 1000, like the one found in this example:
79,"1,013.42",GW450,....
These values come from a spreadsheet control whose built-in ExportToCsv(...) method I use, which explains why I get a formatted version of the actual numerical value.
Question
Is there a way I can get the .Split(',') method to ignore commas inside of quotes? I don't actually want the value "1,013.42" to be split up as "1 and 013.42".
Any ideas? Thanks!
Update
I would really like to do this without incorporating a 3rd party tool, as my use case doesn't involve many other cases besides this one, and even though it is part of my work's solution, incorporating a tool like that doesn't really benefit anyone at the moment. I was hoping there was something quick to solve this particular use case that I was missing, but now that it is the weekend, I'll see if I can give one more update to this question on Monday with the solution I eventually come up with. Thank you everyone for your assistance so far; I will assess each answer further on Monday.

This is a fairly straightforward CSV reader implementation we use in a few projects here. It is easy to use and handles the cases you are talking about.
First the CSV Class
public static class Csv
{
public static string Escape(string s)
{
if (s.Contains(QUOTE))
s = s.Replace(QUOTE, ESCAPED_QUOTE);
if (s.IndexOfAny(CHARACTERS_THAT_MUST_BE_QUOTED) > -1)
s = QUOTE + s + QUOTE;
return s;
}
public static string Unescape(string s)
{
if (s.StartsWith(QUOTE) && s.EndsWith(QUOTE))
{
s = s.Substring(1, s.Length - 2);
if (s.Contains(ESCAPED_QUOTE))
s = s.Replace(ESCAPED_QUOTE, QUOTE);
}
return s;
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
Then a pretty nice Reader implementation - If you need it. You should be able to do what you need with just the CSV class above.
public sealed class CsvReader : System.IDisposable
{
public CsvReader(string fileName)
: this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
}
public CsvReader(Stream stream)
{
__reader = new StreamReader(stream);
}
public System.Collections.IEnumerable RowEnumerator
{
get
{
if (null == __reader)
throw new System.ApplicationException("I can't start reading without CSV input.");
__rowno = 0;
string sLine;
string sNextLine;
while (null != (sLine = __reader.ReadLine()))
{
while (rexRunOnLine.IsMatch(sLine) && null != (sNextLine = __reader.ReadLine()))
sLine += "\n" + sNextLine;
__rowno++;
string[] values = rexCsvSplitter.Split(sLine);
for (int i = 0; i < values.Length; i++)
values[i] = Csv.Unescape(values[i]);
yield return values;
}
__reader.Close();
}
}
public long RowIndex { get { return __rowno; } }
public void Dispose()
{
if (null != __reader) __reader.Dispose();
}
//============================================
private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");
private static Regex rexRunOnLine = new Regex(@"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$");
}
Then you can use it like this.
var reader = new CsvReader(new FileStream(file, FileMode.Open));
Note: This would open an existing CSV file, but can be modified fairly easily to take a string[] like you need.

Since you're reading a CSV file, the best course of action would be to use an existing CSV reader. There's more to CSV than just commas between quotes. Finding all of the cases you need to handle would be more work than it's worth.
Here's a CSV reader question on SO.

You should probably read this article: Regular Expression for Comma Based Splitting Ignoring Commas inside Quotes
Although it is written for Java, the regular expression is the same.
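A rough C# sketch of that regex approach, for the case in the question where a single line should be split without a 3rd party tool. The lookahead only accepts a comma if it is followed by an even number of quote characters, so it assumes the quotes on the line are balanced and that a field does not span multiple lines. QuotedCsv and SplitLine are names made up for the example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedCsv
{
    // A comma counts as a separator only when an even number of quotes follows it,
    // i.e. when the comma itself is not inside a quoted field.
    private static readonly Regex CommaOutsideQuotes =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static string[] SplitLine(string line)
    {
        return CommaOutsideQuotes.Split(line)
            .Select(Unquote)
            .ToArray();
    }

    // Removes the surrounding quotes of a quoted field and un-doubles embedded quotes.
    private static string Unquote(string field)
    {
        bool quoted = field.Length >= 2 && field.StartsWith("\"") && field.EndsWith("\"");
        return quoted
            ? field.Substring(1, field.Length - 2).Replace("\"\"", "\"")
            : field;
    }
}
```

Called as QuotedCsv.SplitLine("79,\"1,013.42\",GW450"), this yields the three values 79, 1,013.42 and GW450. For anything beyond this narrow case, a real CSV reader remains the safer choice.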

String escape into XML

Is there any C# function which could be used to escape and un-escape a string, which could be used to fill in the content of an XML element?
I am using VSTS 2008 + C# + .Net 3.0.
EDIT 1: I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand, for example, I need to put a<b into <foo></foo>, so I need escape string a<b and put it into element foo.

SecurityElement.Escape(string s)

public static string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
public static string XmlUnescape(string escaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerXml = escaped;
return node.InnerText;
}

EDIT: You say "I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand".
I would strongly advise you not to do it by hand. Use the XML APIs to do it all for you - read in the original files, merge the two into a single document however you need to (you probably want to use XmlDocument.ImportNode), and then write it out again. You don't want to write your own XML parsers/formatters. Serialization is somewhat irrelevant here.
If you can give us a short but complete example of exactly what you're trying to do, we can probably help you to avoid having to worry about escaping in the first place.
Original answer
It's not entirely clear what you mean, but normally XML APIs do this for you. You set the text in a node, and it will automatically escape anything it needs to. For example:
LINQ to XML example:
using System;
using System.Xml.Linq;
class Test
{
static void Main()
{
XElement element = new XElement("tag",
"Brackets & stuff <>");
Console.WriteLine(element);
}
}
DOM example:
using System;
using System.Xml;
class Test
{
static void Main()
{
XmlDocument doc = new XmlDocument();
XmlElement element = doc.CreateElement("tag");
element.InnerText = "Brackets & stuff <>";
Console.WriteLine(element.OuterXml);
}
}
Output from both examples:
<tag>Brackets &amp; stuff &lt;&gt;</tag>
That's assuming you want XML escaping, of course. If you're not, please post more details.

Thanks to @sehe for the one-line escape:
var escaped = new System.Xml.Linq.XText(unescaped).ToString();
I add to it the one-line un-escape:
var unescapedAgain = System.Xml.XmlReader.Create(new StringReader("<r>" + escaped + "</r>")).ReadElementString();

George, it's simple. Always use the XML APIs to handle XML. They do all the escaping and unescaping for you.
Never create XML by appending strings.

And if you want, like me when I found this question, to escape XML node names, like for example when reading from an XML serialization, use the easiest way:
XmlConvert.EncodeName(string nameToEscape)
It will also escape spaces and any non-valid characters for XML elements.
http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape%28VS.80%29.aspx

Another take based on Jon Skeet's answer that doesn't return the tags:
void Main()
{
XmlString("Brackets & stuff <> and \"quotes\"").Dump();
}
public string XmlString(string text)
{
return new XElement("t", text).LastNode.ToString();
}
This returns just the value passed in, in XML encoded format:
Brackets &amp; stuff &lt;&gt; and "quotes"

WARNING: Necromancing
Still Darin Dimitrov's answer + System.Security.SecurityElement.Escape(string s) isn't complete.
In XML 1.1, the simplest and safest way is to just encode EVERYTHING.
Like &#09; for \t.
It isn't supported at all in XML 1.0.
For XML 1.0, one possible workaround is to base-64 encode the text containing the character(s).
//string EncodedXml = SpecialXmlEscape("привет мир");
//Console.WriteLine(EncodedXml);
//string DecodedXml = XmlUnescape(EncodedXml);
//Console.WriteLine(DecodedXml);
public static string SpecialXmlEscape(string input)
{
//string content = System.Xml.XmlConvert.EncodeName("\t");
//string content = System.Security.SecurityElement.Escape("\t");
//string strDelimiter = System.Web.HttpUtility.HtmlEncode("\t"); // XmlEscape("\t"); //XmlDecode(" ");
//strDelimiter = XmlUnescape(";");
//Console.WriteLine(strDelimiter);
//Console.WriteLine(string.Format("&#{0};", (int)';'));
//Console.WriteLine(System.Text.Encoding.ASCII.HeaderName);
//Console.WriteLine(System.Text.Encoding.UTF8.HeaderName);
string strXmlText = "";
if (string.IsNullOrEmpty(input))
return input;
System.Text.StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; ++i)
{
sb.AppendFormat("&#{0};", (int)input[i]);
}
strXmlText = sb.ToString();
sb.Clear();
sb = null;
return strXmlText;
} // End Function SpecialXmlEscape
XML 1.0:
public static string Base64Encode(string plainText)
{
var plainTextBytes = System.Text.Encoding.UTF8.GetBytes(plainText);
return System.Convert.ToBase64String(plainTextBytes);
}
public static string Base64Decode(string base64EncodedData)
{
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.UTF8.GetString(base64EncodedBytes);
}

Following functions will do the work. Didn't test against XmlDocument, but I guess this is much faster.
public static string XmlEncode(string value)
{
System.Xml.XmlWriterSettings settings = new System.Xml.XmlWriterSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
StringBuilder builder = new StringBuilder();
using (var writer = System.Xml.XmlWriter.Create(builder, settings))
{
writer.WriteString(value);
}
return builder.ToString();
}
public static string XmlDecode(string xmlEncodedValue)
{
System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
using (var stringReader = new System.IO.StringReader(xmlEncodedValue))
{
using (var xmlReader = System.Xml.XmlReader.Create(stringReader, settings))
{
xmlReader.Read();
return xmlReader.Value;
}
}
}

Using a third-party library (Newtonsoft.Json) as alternative:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped);
}
public static string XmlUnescape(string escaped)
{
if (escaped == null) return null;
return JsonConvert.DeserializeObject(escaped, typeof(string)).ToString();
}
Examples of escaped string:
a<b ==> "a<b"
<foo></foo> ==> "<foo></foo>"
NOTE:
In newer versions, the code written above may not work with escaping, so you need to specify how the strings will be escaped:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped, new JsonSerializerSettings()
{
StringEscapeHandling = StringEscapeHandling.EscapeHtml
});
}
Examples of escaped string:
a<b ==> "a\u003cb"
<foo></foo> ==> "\u003cfoo\u003e\u003c/foo\u003e"

SecurityElement.Escape does this job for you.
Use this method to replace invalid characters in a string before using the string in a SecurityElement. If invalid characters are used in a SecurityElement without being escaped, an ArgumentException is thrown.
The invalid XML characters and their escaped equivalents are: < becomes &lt;, > becomes &gt;, " becomes &quot;, ' becomes &apos;, and & becomes &amp;.
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape?view=net-5.0
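A small sketch of how this could be used for the case in the question (putting a<b into <foo></foo> by string concatenation). XmlText, Escape and Element are hypothetical helper names; the element name is assumed to already be a valid XML name, and only the element content is escaped.

```csharp
using System.Security;

public static class XmlText
{
    // Thin wrapper around the framework method; escapes <, >, ", ' and &.
    public static string Escape(string value)
    {
        return SecurityElement.Escape(value);
    }

    // Builds <name>escaped value</name> by concatenation.
    // Assumption: name is a valid element name and is not escaped here.
    public static string Element(string name, string value)
    {
        return "<" + name + ">" + Escape(value) + "</" + name + ">";
    }
}
```

For example, XmlText.Element("foo", "a<b") produces <foo>a&lt;b</foo>. SecurityElement has no matching unescape method, so reading the value back is best left to an XML parser.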

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
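In C#, the steps described in this answer (strip tags, collapse whitespace, trim, optionally decode entities) could be sketched roughly as follows. HtmlText and StripTags are names invented for the example, and the same limitation about > inside attribute values applies.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Same patterns as described above: one for tags, one for whitespace runs.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");
    private static readonly Regex Whitespace = new Regex(@"[\s\r\n]+");

    public static string StripTags(string html)
    {
        // 1. Remove every tag (including an unterminated one at the end of the input).
        string withoutTags = Tag.Replace(html, string.Empty);

        // 2. Collapse whitespace runs to a single space and trim the result.
        string normalized = Whitespace.Replace(withoutTags, " ").Trim();

        // 3. Optional: turn character entities back into the actual characters.
        //    Skip this step if the result is written back into an HTML page,
        //    because decoding can reintroduce '<' and '>'.
        return WebUtility.HtmlDecode(normalized);
    }
}
```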

Go download HTMLAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}

Regex.Replace(htmlText, "<.*?>", string.Empty);

protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages, besides better performance, include the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more.
Please read the related article on CodeProject.
Thank you.

For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. This can fail even on well-formatted HTML, though, so always add a catch with regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

I've looked at the Regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability & how to prevent it - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).

For those who are complaining about Michael Tiptop's solution not working, here is the .Net4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}

using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);

You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.

For the second parameter, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/

Simply use string.StripHTML();

What's the simplest way to encode List<String> into a plain String and decode it back?

I think I've come across this requirement a dozen times, but I could never find a satisfying solution. For instance, there is a collection of strings which I want to serialize (to disk or over the network) through a channel where only a plain string is allowed.
I almost always end up using "split" and "join" with a ridiculous separator like
":::==--==:::".
like this:
public static string encode(System.Collections.Generic.List<string> data)
{
return string.Join(" :::==--==::: ", data.ToArray());
}
public static string[] decode(string encoded)
{
return encoded.Split(new string[] { " :::==--==::: " }, StringSplitOptions.None);
}
But this simple solution apparently has some flaws. The strings cannot contain the separator string, and consequently the encoded string cannot be re-encoded.
AFAIK, a comprehensive solution should involve escaping the separator on encoding and unescaping it on decoding. While the problem sounds simple, I believe a complete solution can take a significant amount of code. I wonder if there is any trick that would allow me to build the encoder and decoder in very few lines of code?

Add a reference and using to System.Web, and then:
public static string Encode(IEnumerable<string> strings)
{
return string.Join("&", strings.Select(s => HttpUtility.UrlEncode(s)).ToArray());
}
public static IEnumerable<string> Decode(string list)
{
return list.Split('&').Select(s => HttpUtility.UrlDecode(s));
}
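Most languages have a pair of utility functions that do URL "percent" encoding, and this is ideal for reuse in this kind of situation. Because UrlEncode escapes any "&" inside an item as %26, the separator can never appear inside an encoded item.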

You could call the .ToArray() method on the List<> and then serialize the array. The result can be dumped to disk or sent over the network, and reconstituted by deserializing it on the other end.
It is not much code, and you get to use the serialization techniques already coded and tested in the .NET Framework.
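The answer above does not name a serializer, so the following is only a minimal sketch under one assumption: that System.Text.Json is available (it ships with .NET Core 3.0 and later, and as a NuGet package for .NET Framework). The class name ArraySerializationSketch is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: the serializer choice (System.Text.Json) is an assumption.
public static class ArraySerializationSketch
{
    // Serialize the list as a JSON array of strings; the serializer escapes
    // quotes, backslashes, and control characters inside each item.
    public static string Encode(List<string> data)
    {
        return JsonSerializer.Serialize(data.ToArray());
    }

    // Parse the JSON array back into a string array.
    public static string[] Decode(string encoded)
    {
        return JsonSerializer.Deserialize<string[]>(encoded) ?? Array.Empty<string>();
    }
}
```

Any other built-in serializer that accepts a string[] could be swapped in; the point is that the escaping is delegated to tested library code.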

You might like to look at the way CSV files are formatted.
Escape all instances of the delimiter, e.g. ", in the string.
Wrap each item in the list in quotes: "item".
Join using a simple separator like ,
I don't believe there is a silver bullet solution to this problem.
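The CSV-style steps above can be sketched roughly as follows. This is an illustration, not a full CSV implementation: it assumes embedded quotes are escaped by doubling them (the usual CSV convention), and the class name CsvStyleListCodec is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper illustrating CSV-style quoting of a list of strings.
public static class CsvStyleListCodec
{
    // Wrap every item in double quotes, double any embedded quote,
    // then join the quoted items with commas.
    public static string Encode(IEnumerable<string> items)
    {
        return string.Join(",", items.Select(s => "\"" + s.Replace("\"", "\"\"") + "\""));
    }

    // Walk the encoded text one quoted field at a time.
    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        int i = 0;
        while (i < encoded.Length)
        {
            if (encoded[i] != '"')
                throw new FormatException("Expected opening quote at position " + i);
            i++; // skip the opening quote
            var field = new StringBuilder();
            while (true)
            {
                if (i >= encoded.Length)
                    throw new FormatException("Unterminated quoted field");
                char c = encoded[i];
                if (c == '"')
                {
                    if (i + 1 < encoded.Length && encoded[i + 1] == '"')
                    {
                        field.Append('"'); // a doubled quote is a literal quote
                        i += 2;
                        continue;
                    }
                    i++; // closing quote
                    break;
                }
                field.Append(c);
                i++;
            }
            result.Add(field.ToString());
            if (i < encoded.Length)
            {
                if (encoded[i] != ',')
                    throw new FormatException("Expected comma at position " + i);
                i++; // skip the separator
            }
        }
        return result;
    }
}
```

Because every item is quoted, commas inside an item need no escaping of their own, and an empty item is still visible as "".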

Here's an old-school technique that might be suitable -
Serialise by storing the length of each string as a fixed-width prefix in front of it.
So
string[0]="abc"
string[1]="defg"
string[2]=" :::==--==::: "
becomes
0003abc0004defg0014 :::==--==:::
...where the size of the prefix is large enough to cater for the maximum string length.
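A minimal sketch of the fixed-width prefix technique, assuming a four-digit decimal prefix as in the example above (so no item may exceed 9999 characters). The class name FixedWidthPrefixCodec and the constant PrefixWidth are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: each item is stored as a 4-digit length followed by the item.
public static class FixedWidthPrefixCodec
{
    private const int PrefixWidth = 4; // assumption: no item is longer than 9999 chars

    public static string Encode(IEnumerable<string> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
        {
            if (item.Length > 9999)
                throw new ArgumentException("Item too long for a 4-digit prefix");
            // "D4" pads the length with leading zeros, e.g. 3 -> "0003"
            sb.Append(item.Length.ToString("D4", CultureInfo.InvariantCulture));
            sb.Append(item);
        }
        return sb.ToString();
    }

    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            if (pos + PrefixWidth > encoded.Length)
                throw new FormatException("Truncated length prefix");
            // Read the fixed-width length, then exactly that many characters.
            int length = int.Parse(encoded.Substring(pos, PrefixWidth),
                                   NumberStyles.None, CultureInfo.InvariantCulture);
            pos += PrefixWidth;
            if (pos + length > encoded.Length)
                throw new FormatException("Truncated item");
            result.Add(encoded.Substring(pos, length));
            pos += length;
        }
        return result;
    }
}
```

The decoder never searches for a separator, so the items may contain any characters at all, including digits.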

You could use an XmlDocument to handle the serialization. That will handle the encoding for you.
public static string encode(System.Collections.Generic.List<string> data)
{
var xml = new XmlDocument();
xml.AppendChild(xml.CreateElement("data"));
foreach (var item in data)
{
var xmlItem = (XmlElement)xml.DocumentElement.AppendChild(xml.CreateElement("item"));
xmlItem.InnerText = item;
}
return xml.OuterXml;
}
public static string[] decode(string encoded)
{
var items = new System.Collections.Generic.List<string>();
var xml = new XmlDocument();
xml.LoadXml(encoded);
foreach (XmlElement xmlItem in xml.SelectNodes("/data/item"))
items.Add(xmlItem.InnerText);
return items.ToArray();
}

I would just prefix every string with its length and a terminator indicating the end of the length.
abc
defg
hijk
xyz
546
4.X
becomes
3: abc 4: defg 4: hijk 3: xyz 3: 546 3: 4.X
No restriction or limitations at all and quite simple.
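A rough sketch of this length-plus-terminator idea, assuming ':' as the terminator and no padding spaces between entries (the spaces in the example above are treated as formatting only). The class name LengthPrefixCodec is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: each item is stored as "<length>:<item>".
public static class LengthPrefixCodec
{
    public static string Encode(IEnumerable<string> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
        {
            sb.Append(item.Length.ToString(CultureInfo.InvariantCulture));
            sb.Append(':');
            sb.Append(item);
        }
        return sb.ToString();
    }

    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            // The first ':' after pos always ends the length, because the
            // length itself consists of digits only.
            int colon = encoded.IndexOf(':', pos);
            if (colon < 0)
                throw new FormatException("Missing length terminator");
            int length = int.Parse(encoded.Substring(pos, colon - pos),
                                   NumberStyles.None, CultureInfo.InvariantCulture);
            pos = colon + 1;
            if (pos + length > encoded.Length)
                throw new FormatException("Truncated item");
            result.Add(encoded.Substring(pos, length));
            pos += length;
        }
        return result;
    }
}
```

Unlike a fixed-width prefix, this variant puts no upper bound on item length, and items may themselves contain digits and colons.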

Json.NET is a very easy way to serialize just about any object you can imagine. JSON keeps things compact and can be faster than XML.
List<string> foo = new List<string>() { "1", "2" };
string output = JsonConvert.SerializeObject(foo);
List<string> fooToo = (List<string>)JsonConvert.DeserializeObject(output, typeof(List<string>));

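It can be done much more simply if you are willing to use a two-character separator.
In Java:
StringBuilder builder = new StringBuilder();
// Track the first item explicitly so that an empty first item still gets a separator after it
boolean first = true;
for(String s : list) {
if(!first) {
builder.append("||");
}
first = false;
builder.append(s.replace("|", "|p"));
}
And back:
// String.split takes a regex, so the pipes must be escaped; the -1 limit keeps trailing empty items
for(String item : encodedList.split("\\|\\|", -1)) {
list.add(item.replace("|p", "|"));
}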
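For readers working in C#, the same two-character-separator idea might look like the sketch below. The class name PipeEscapeCodec is made up for illustration. One limitation to keep in mind: an empty list and a list holding a single empty string both encode to the empty string, so decoding "" yields one empty item.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical C# port of the "||" separator with "|" escaped as "|p".
public static class PipeEscapeCodec
{
    public static string Encode(IEnumerable<string> items)
    {
        // After escaping, every '|' inside an item is followed by 'p',
        // so "||" can only ever appear as the separator.
        return string.Join("||", items.Select(s => s.Replace("|", "|p")));
    }

    public static string[] Decode(string encoded)
    {
        // This Split overload treats "||" as a literal string, not a regex.
        return encoded.Split(new[] { "||" }, StringSplitOptions.None)
                      .Select(s => s.Replace("|p", "|"))
                      .ToArray();
    }
}
```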

You shouldn't need to do this manually. As the other answers have pointed out, there are plenty of ways, built-in or otherwise, to serialize/deserialize.
However, if you did decide to do the work yourself, it doesn't require that much code:
public static string CreateDelimitedString(IEnumerable<string> items)
{
StringBuilder sb = new StringBuilder();
foreach (string item in items)
{
sb.Append(item.Replace("\\", "\\\\").Replace(",", "\\,"));
sb.Append(",");
}
return (sb.Length > 0) ? sb.ToString(0, sb.Length - 1) : string.Empty;
}
This will delimit the items with a comma (,). Any existing commas will be escaped with a backslash (\) and any existing backslashes will also be escaped.
public static IEnumerable<string> GetItemsFromDelimitedString(string s)
{
bool escaped = false;
StringBuilder sb = new StringBuilder();
foreach (char c in s)
{
if ((c == '\\') && !escaped)
{
escaped = true;
}
else if ((c == ',') && !escaped)
{
yield return sb.ToString();
sb.Length = 0;
}
else
{
sb.Append(c);
escaped = false;
}
}
yield return sb.ToString();
}

Why not use XStream to serialise it, rather than inventing your own serialisation format?
It's pretty simple:
new XStream().toXML(yourobject)

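Add using System.Linq; to your file and change your functions to this:
public static string encode(System.Collections.Generic.List<string> data, out string delimiter)
{
delimiter = ":";
// Grow the delimiter until no item contains it as a substring.
// (List<string>.Contains would only test whether an item equals the delimiter.)
// Caveat: an item that starts or ends with ':' can still run into an adjacent delimiter.
string candidate = delimiter;
while (data.Any(s => s.Contains(candidate))) candidate += ":";
delimiter = candidate;
return string.Join(delimiter, data.ToArray());
}
public static string[] decode(string encoded, string delimiter)
{
return encoded.Split(new string[] { delimiter }, StringSplitOptions.None);
}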

There are loads of textual markup languages out there, and any of them would work.
Many would work trivially given the simplicity of your input. It all depends on:
how human readable you want the encoding to be
how resilient to API changes it should be
how easy it is to parse
how easy it is to write or get a parser for it
If the last one is the most important, then just use the existing XML libraries Microsoft supplies for you:
class TrivialStringEncoder
{
private readonly XmlSerializer ser = new XmlSerializer(typeof(string[]));
public string Encode(IEnumerable<string> input)
{
using (var s = new StringWriter())
{
ser.Serialize(s, input.ToArray());
return s.ToString();
}
}
public IEnumerable<string> Decode(string input)
{
using (var s = new StringReader(input))
{
return (string[])ser.Deserialize(s);
}
}
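public static void Main(string[] args)
{
// Encode and Decode are instance methods, so create an encoder first
var encoder = new TrivialStringEncoder();
var encoded = encoder.Encode(args);
Console.WriteLine(encoded);
var decoded = encoder.Decode(encoded);
foreach(var x in decoded)
Console.WriteLine(x);
}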
}
Running it on the inputs "A", "<", ">" you get (edited for formatting):
<?xml version="1.0" encoding="utf-16"?>
<ArrayOfString
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<string>A</string>
<string>&lt;</string>
<string>&gt;</string>
</ArrayOfString>
A
<
>
Verbose and slow, but extremely simple, and it requires no additional libraries.
