I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and Regex features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I desired and then went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I made a few tests and the method seemed to work great. But when I implemented the method in my code I kept getting KeyNotFoundException. There are no further details so I'm pretty lost. My code looks like this:
WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
I understand that the problem is due to occurrence of non-standard characters. Say, for example, Chinese, Japanese etc.
Once you find out which characters are causing the problem, you could search for a suitable patch to HtmlAgilityPack here
This may be of some help to you in case you want to modify the htmlagilitypack source yourself.
Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:
// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
var sb = new StringBuilder(str);
//TODO: add other replacements, as needed
return sb.Replace("&period;", ".")
.Replace("&abreve;", "ă")
.Replace("&acirc;", "â")
.ToString();
}
In my case, the string contains both html-encoded characters and UTF-8 characters, but the problem is related to some encoded characters only.
This is not an elegant solution, but a quick fix for all those text with a limited (and known) amount of problematic encoded characters.
My HTML had a block of text like so:
... found in sections: 233.9 & 517.3; ...
Despite the spacing and decimal point, it was interpreting & 517.3; as a unicode character.
Simply HTML Encoding the raw text fixed the problem for me.
string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc., before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
In my case I have fixed this by updating HtmlAgilityPack to version 1.5.0
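If you only need the entities in a piece of text decoded, one alternative worth trying is the framework's own decoder. This is a minimal sketch, not a drop-in replacement for HtmlEntity.DeEntitize: the class and method names are made up for illustration, and System.Net.WebUtility.HtmlDecode works from a fixed table of named entities plus numeric references, passing through anything it does not recognise instead of throwing.

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Hypothetical helper: decodes HTML entities with the framework decoder.
    // References it cannot resolve (for example "& 517.3;") are left unchanged.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

With this, the "sections: 233.9 & 517.3;" text from the answer above comes back unchanged, while real references such as &amp; are decoded.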
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working
This is UTF8. Try UTF8 Encoding
using System.Text;
using System.Text.RegularExpressions;
string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591)
.GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes);//It’s working
try this to parse file :
private static Regex _regex = new Regex(@"\\u(?<Value>[a-zA-Z0-9]{4})", RegexOptions.Compiled);
public string decodeString(string value)
{
return _regex.Replace(
value,
m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
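The two answers above can be combined when the text read from the file contains literal \uXXXX sequences that each stand for one UTF-8 byte. The following is only a sketch under that assumption, and the class and method names are invented for illustration: it first turns each escape into a char, then reinterprets those chars as bytes through ISO-8859-1 and decodes the bytes as UTF-8.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class EscapedUtf8Decoder
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex Escape =
        new Regex(@"\\u([0-9a-fA-F]{4})", RegexOptions.Compiled);

    public static string Decode(string value)
    {
        // Step 1: replace each \uXXXX escape with the char it names.
        string unescaped = Escape.Replace(
            value,
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());

        // Step 2: ISO-8859-1 maps chars U+0000..U+00FF to the same byte values,
        // so this recovers the original bytes (chars above U+00FF become '?').
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(unescaped);

        // Step 3: decode those bytes as UTF-8.
        return Encoding.UTF8.GetString(bytes);
    }
}
```

For the string in the question, EscapedUtf8Decoder.Decode(@"It\u00e2\u0080\u0099s working") returns "It’s working". It only makes sense for input that was mis-encoded in this particular way; text that already contains characters above U+00FF would be damaged by step 2.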
That is javascript unicode encoding. Use a C# javascript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an escape syntax used by JavaScript and C# (I didn't know about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits is 4 hex characters, hence \uXXXX, with each X representing one hexadecimal digit.
Note that this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory. It is an older style of encoding: modern source code editors usually support UTF-8, UTF-16, or some other encoding that can store Unicode characters directly in source files, and they can display the character and let you type it right in the editor. So typing \uXXXX is rarely needed and is going out of style.
So that is why I asked where did you get the string initially? You wrote in one comment you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, the result doesn't make sense. 00e2 is the letter a with a circumflex (â), while 0080 and 0099 are control characters, which are not printable.
If e28099 is taken together as three single bytes, i.e. dropping the 00-valued first byte of each \u00XX, then it fits the UTF-8 representation of the Unicode character with hexadecimal value 2019, which is "Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)".
So that is what you are looking for, but whatever generated that string doesn't seem to be using the encoding correctly. If you end up with such strings and have to evaluate them, then the answer above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX escapes with a JavaScript script evaluator, or with CSharpScript.Run() to build a string literal from them, assign it to a variable, and then look at its bytes. But I tried that later, and because those byte values/characters don't make sense I don't get anything meaningful from them. I get an a with a circumflex, and CSharpScript refuses to decode the next two and leaves them as is, because they are control characters when decoded.
Here are three different ways of doing \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript; both are available from NuGet. Note that none of these prints a single apostrophe, due to what I described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which is what Google Translate told me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those scripts, doesn't seem to be doing standard things.
string test = @"It\u00e2\u0080\u0099s working";
// Using JSON deserialization, since \uXXXX is valid encoding JavaScript string literals
// Have to add starting and ending quotes to make it a script literal definition, then deserialize as string
var d = Newtonsoft.Json.JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);
// Another way of JavaScript deserialization. If you are using a stream like reading from file this maybe better:
TextReader reader = new StringReader("\"" + test + "\"");
Newtonsoft.Json.JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);
// lastly overkill and too heavy: Using Roslyn CSharpScript, and letting C# compiler to decode \uXXXX's in string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => { return CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt); }).Result;
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);
I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to '. (From slanted apostrophe to straight apostrophe.)
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work and it changes the slanted apostrophes into ? marks.
I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that the .NET Framework believed the character was Unicode character 65533 (U+FFFD, the replacement character, i.e. the "WTF?" character) before the string replacement even ran. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple - content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
As for why the file reading isn't working properly - you are probably using the wrong encoding when reading the file. (If no encoding is specified then the .Net framework will try to determine the correct encoding for you, however there is no 100% reliable way to do this and so often it can get it wrong). The exact encoding you need depends on the file itself, however in my case the encoding being used was Extended ASCII, and so to read the file I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
(See this question).
You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:
content = content.Replace("\u0092", "'");
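To see why the character code matters, here is a small sketch that decodes the single byte 0x92 in two ways. It assumes the file stores the slanted apostrophe as that one byte, as discussed above; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class ApostropheDecoding
{
    // A lone 0x92 is not valid UTF-8, so the decoder substitutes U+FFFD.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    // ISO-8859-1 maps every byte to the code point with the same value,
    // so 0x92 becomes the control character U+0092.
    public static string DecodeAsLatin1(byte[] raw)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }
}
```

Decoding new byte[] { 0x92 } as UTF-8 gives "\uFFFD", which no longer says which character was in the file, whereas decoding it as ISO-8859-1 gives "\u0092", which the Replace call above can match.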
My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range". (Which is where the slanted apostrophe is located. i.e. 0x92)
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
// This should replace smart single quotes with a straight single quote
content = Regex.Replace(content, @"(\u2018|\u2019)", "'");
// However, the better approach seems to be to read the file with the proper encoding and leave the quotes alone
// (OpenRead rather than Create, since Create would overwrite the existing file)
var sreader = new StreamReader(fileInfo.OpenRead(), Encoding.GetEncoding(1252));
If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works.
Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):
123[]45[]6789
I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?
Is there any way I can check whether the string contains any of these unrecognized characters, via regex or something?
The first thing to remember, is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".
What you have here is either:
Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.
The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use (e.g. 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker; other possible responses are throwing an error, silently ignoring the error, or trying to guess at intent, though those last two bring security issues).
Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.
Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing it's fine", possibly something simple or something hard. Can't say yet.
Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.
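For the "find the integer value of the character" step, something along these lines may help. It is a rough sketch with a made-up class name that lists the code points in a string so you can look them up; surrogate pairs are combined into a single code point.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDump
{
    // Returns the code points of s as "U+XXXX" tokens separated by spaces.
    public static string Describe(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // Two UTF-16 code units forming one supplementary character.
                codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
                i++;
            }
            else
            {
                codePoint = s[i];
            }
            parts.Add("U+" + codePoint.ToString("X4"));
        }
        return string.Join(" ", parts);
    }
}
```

CodePointDump.Describe("1\u2014\uFFFD") returns "U+0031 U+2014 U+FFFD", which tells you the string holds a digit, an em dash, and a replacement character.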
Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.
Verify that the XML document itself is stored using an encoding that supports the characters you need to store. Then verify that when you read the file in you are using the same encoding as the document, i.e. if your XML document is stored as UTF-8 then you need to make sure you are decoding it as UTF-8 when you read it in.
Take a deeper look at the characters themselves: what are the actual char values?
When a character shows up as a square it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside of your current character set.
edit, nope
In your example I'd venture a guess that you're seeing embedded newline characters.
Define the allowed characters and block everything else, i.e.:
// only lowercase letters and digits
if(Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
// allowed
}
But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.
PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.
Edit: possible solution
Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with typographic dashes like "—" (U+2014, em dash) when typed in some fancy editing environment. Since it seems to be unclear how to get SQL Server to accept properly encoded strings, you can also solve this in your XML.
When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:
Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header
But be careful with StringBuilder or StringWriter: they are fixed to UTF-16, so the XmlWriter will always write in that encoding, which is not compatible with SQL Server (more info on that issue at my blog).
Note: when using the ASCII encoding, any character higher than 0x7F will be written as a numeric character reference. So é will look like &#xE9; and the dash may look like &#x2014;, but this means just the same and you should not worry about that. Every XML-capable tool will properly interpret this input.
Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.
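To make the ASCII approach concrete, here is a self-contained sketch that writes one element through an XmlWriter configured for ASCII and returns the resulting markup. The element name "value" and the class name are placeholders, not anything from the OP's service.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class AsciiXml
{
    // Writes <value>text</value> with an ASCII-encoded XmlWriter, so every
    // character above 0x7F is emitted as a numeric character reference.
    public static string Write(string text)
    {
        var settings = new XmlWriterSettings();
        settings.Encoding = Encoding.ASCII;

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartElement("value");
                writer.WriteString(text);
                writer.WriteEndElement();
            }
            // The writer has been flushed and closed; the bytes are pure ASCII.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Calling AsciiXml.Write("caf\u00E9 \u2014 ok") produces markup in which the element content reads caf&#xE9; &#x2014; ok, and an XML parser turns those references back into the original characters.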
public static T DeserializeFromXml<T>(string xml)
{
T result;
XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
XmlSerializer serializer =serializerFactory.CreateSerializer(typeof(T));
using (StringReader sr3 = new StringReader(xml))
{
XmlReaderSettings settings = new XmlReaderSettings()
{
CheckCharacters = false // default value is true;
};
using (XmlReader xr3 = XmlTextReader.Create(sr3, settings))
{
result = (T)serializer.Deserialize(xr3);
}
}
return result;
}
Are there any classes to convert ASCII to the XML character set, preferably open source? I will be using this class in either VC++ or C#.
My ASCII has some printable characters which are not in the XML character set.
I just tried to send a resume which is in the ASCII character set and to store it in an online CRM, and I got this error message:
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "" is an invalid XML character.]
Thanks in advance
I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by replacing the offending characters with a question mark. You could replace with any string or just use an empty string.
The XML spec allows 0x09 (TAB), 0x0A (LF - Line Feed or NL - New Line), and 0x0D (CR - Carriage Return). The regex above takes care not to remove those.
The XML 1.1 Spec allows you to escape some of these characters.
For example: Using for 0x03 appears as in HTML and as L in Office documents and notepad.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the escape markup. So Mike&#x03;TeeVee still shows up as Mike&#x03;TeeVee in Excel instead of something like MikeTeeVee, or MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is this is a bug in the current OpenXML which encodes the allowed XML ASCII characters, but allows the unsupported ASCII characters to slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return m.Value[0] == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("&#x" + string.Format("{0:X2}", (int)m.Value[0]) + ";"); // hexadecimal, to match the &#x prefix
});
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return m.Value[0] == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("_x" + string.Format("{0:X4}", (int)m.Value[0]) + "_"); // four hex digits, as in _x0000_
});
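As an alternative to maintaining the character ranges by hand, the framework can tell you whether a char is allowed in XML 1.0. The sketch below is one possible way to do the question-mark replacement with XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair (both available since .NET 4.0); the class name, method name, and default replacement are my own choices, and unlike the regex it keeps well-formed surrogate pairs intact.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // Replaces every character that XML 1.0 does not allow with 'replacement'.
    public static string Sanitize(string s, char replacement = '?')
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < s.Length && XmlConvert.IsXmlSurrogatePair(s[i + 1], c))
            {
                // A valid high/low pair: keep both code units together.
                sb.Append(c).Append(s[i + 1]);
                i++;
            }
            else
            {
                sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

XmlTextSanitizer.Sanitize("Mike\u0003TeeVee") returns "Mike?TeeVee", while tabs, line feeds and carriage returns pass through unchanged.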
Your text won't have any printable characters which aren't available in XML - but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab, carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
The character reference in your error message is indeed not a valid XML character. You probably want either a carriage return or a line feed reference instead.
Out of curiosity, I took a few minutes to write a simple routine in C# to pump out an XML string of the 128 ASCII characters. To my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway, here is the code (comments are welcome):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000></ASCII_000>
<ASCII_001></ASCII_001>
<ASCII_002></ASCII_002>
<ASCII_003></ASCII_003>
<ASCII_004></ASCII_004>
<ASCII_005></ASCII_005>
<ASCII_006></ASCII_006>
<ASCII_007></ASCII_007>
<ASCII_008></ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011></ASCII_011>
<ASCII_012></ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014></ASCII_014>
<ASCII_015></ASCII_015>
<ASCII_016></ASCII_016>
<ASCII_017></ASCII_017>
<ASCII_018></ASCII_018>
<ASCII_019></ASCII_019>
<ASCII_020></ASCII_020>
<ASCII_021></ASCII_021>
<ASCII_022></ASCII_022>
<ASCII_023></ASCII_023>
<ASCII_024></ASCII_024>
<ASCII_025></ASCII_025>
<ASCII_026></ASCII_026>
<ASCII_027></ASCII_027>
<ASCII_028></ASCII_028>
<ASCII_029></ASCII_029>
<ASCII_030></ASCII_030>
<ASCII_031></ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&amp;</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060>&lt;</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>&gt;</ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>@</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML declaration with "us-ascii" encoding
Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.
You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the common .net library. Can you just give a simple example?
I have a string which contains XML, and I just want to parse it into an XElement, but it has an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
I even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
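Putting the lookahead idea into a runnable form, the sketch below escapes only ampersands that do not already start one of the five predefined XML entities or a numeric character reference, then parses the result. Extending the pattern to numeric references is my own addition, and the class and method names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandRepair
{
    // An '&' that is NOT followed by a predefined entity name or a
    // decimal/hex character reference and a terminating ';'.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|apos|quot|lt|gt|#[0-9]+|#x[0-9a-fA-F]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    // Repairs the markup, parses it, and reads the value attribute back.
    public static string ReadValueAttribute(string brokenXml)
    {
        XElement root = XElement.Parse(EscapeBareAmpersands(brokenXml));
        return root.Descendants("XmlEntry").First().Attribute("value").Value;
    }
}
```

For the string in the question, ReadValueAttribute returns "wow&". HTML-only entities such as &nbsp; would still be escaped to &amp;nbsp; by this pattern, so it is a repair for stray ampersands rather than a general HTML-to-XML converter.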
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&" and re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>
HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;, which is one of the XML entity references). These are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But you might get things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first by making sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool, or code in VB/C#/PHP/Delphi/Lisp/etc., to remove it or to translate it to &amp;.
This is the simplest and best approach. It works with all characters and lets you parse XML for any web service call, e.g. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
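If you control the code that produces the XML, the same idea applies one step earlier: build the markup through an XML API and let it do the escaping, rather than concatenating strings. A small LINQ to XML sketch, reusing the element and attribute names from the question (the class and method names are made up):

```csharp
using System;
using System.Xml.Linq;

public static class XmlBuilding
{
    // Builds the XmlEntry element; XAttribute escapes the value when serialized.
    public static string BuildEntry(string value)
    {
        var entry = new XElement("XmlEntry",
            new XAttribute("Element", "test"),
            new XAttribute("value", value));
        return entry.ToString();
    }
}
```

XmlBuilding.BuildEntry("wow&") returns <XmlEntry Element="test" value="wow&amp;" />, which XElement.Parse reads back without complaint.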
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.