easiest way to remove invalid characters from a xml file?


I have an XML file with invalid characters. I searched the internet and haven't found any way other than reading the file as text and replacing the invalid characters one by one.
Can somebody please tell me the easiest way to remove invalid characters from an XML file?
Example XML stream:
<Year>where 12 > 13 occures </Year>

I would try HtmlAgilityPack. At least it's better than trying to parse manually.
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml("<Year>where 12 > 13 occures </Year>");
using (StringWriter wr = new StringWriter())
{
    using (XmlWriter xmlWriter = XmlWriter.Create(wr,
        new XmlWriterSettings() { OmitXmlDeclaration = true }))
    {
        hdoc.Save(xmlWriter);
    }
    // Read the result only after the XmlWriter has been disposed (and flushed).
    Console.WriteLine(wr.ToString());
}
This outputs:
<year>where 12 &gt; 13 occures </year>

Start by thinking of the question differently. Your problem is that the input isn't valid XML. So you actually want to remove invalid characters from a non-XML file. That might sound pedantic, but it immediately indicates that tools designed for processing XML will be no use to you, because your input is not XML.
Fixing the problem at source is always better than trying to repair the damage later. But if you are going to embark on a repair strategy, the first thing is to define precisely what faults in the data you want to repair and how you intend to repair them. It's also a good idea to say clearly what constraints you apply to the solution: for example, does it matter if your repair accidentally changes the contents of any comments or CDATA sections?
Once you have defined your repair strategy, e.g. "replace any & by &amp; if it is not immediately followed by either #nn; or #xnn; or a name followed by ';'", coding it up becomes quite straightforward.
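As an illustration of how small such a repair can be once the rule is written down, here is a minimal sketch of the quoted ampersand rule. The class and method names are made up for this example, and the name pattern is simplified to ASCII (real XML names allow more characters). It also rewrites ampersands inside comments and CDATA sections, which is exactly the kind of constraint the answer says you need to decide on first.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AmpersandRepair
{
    // Matches '&' that does NOT start a decimal reference (&#38;),
    // a hex reference (&#x26;), or a named reference (&name;).
    // Assumption: names are limited to ASCII letters, digits, '_', ':', '.', '-'.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z_:][A-Za-z0-9_:.\-]*;)",
        RegexOptions.CultureInvariant);

    // Escapes every bare ampersand and leaves existing references untouched.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

Calling AmpersandRepair.EscapeBareAmpersands on the raw file text before handing it to the parser turns "pen & Note" into "pen &amp; Note" while leaving "&amp;", "&#38;" and "&name;" style references alone. Note that it does not check whether a named entity is actually declared.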

Related

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

I faced a problem with reading XML. A solution was found, but there are still some questions. The incorrect XML file is encoded in UTF-8 and declares this in its header. But it also includes a character, 'é', encoded in UTF-16. This code was used to read the XML file in order to validate its content:
var xDoc = XDocument.Load(taxFile);
It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the incorrect file, but the 'é' character is loaded as �. My first question is: why does it work?
Another point is that using XmlReader doesn't raise an exception until the node with 'é' is read.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workaround with StreamReader helps. The same question applies.
It seems the fix is not good enough, because one day XML encoded in another format may appear and be processed incorrectly. BUT I've tried processing a UTF-16 encoded XML file and it worked fine (with the reader configured for UTF-8).
The final question is whether there are any options for XDocument/XmlReader to ignore character encoding, or something similar.
Looking forward to your replies. Thanks in advance.

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without an exception with StreamReader, it's because Encoding contains settings that control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that understands the UTF-16 byte sequence.
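To make the difference between the two behaviours concrete, here is a small sketch that decodes the same bytes once with Encoding.UTF8 (replacement fallback) and once with a UTF8Encoding constructed with throwOnInvalidBytes set to true. The class and method names are invented for this example. It does not implement the custom DecoderFallback suggested above; it only shows why one path silently produces a replacement character while the other raises an error.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8Decoding
{
    // Encoding.UTF8 uses replacement fallback: bytes that are not valid
    // UTF-8 are replaced instead of causing an exception.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // With throwOnInvalidBytes: true the decoder throws a
    // DecoderFallbackException on the first invalid byte sequence.
    public static string DecodeStrict(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        return strict.GetString(bytes);
    }
}
```

Passing a strict instance to the StreamReader constructor instead of Encoding.UTF8 should make the reader fail loudly on such a file rather than hide the problem.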

xml document read having 0x19 hex character

Reading xml document with hex 0x19 character C#
Apparently the above character is invalid in C#, and the XML document I am about to load contains it.
Therefore it throws an XML exception every time I try to read it.
<description><![CDATA[Whenever I run<br>Whenever I run to you lost one<br>Its never done<br>Just hanging on<br><br>Just past has let me be<br>Returning as if dream<br>Shattered as belief<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>Someday Ill follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Your picture out of time<br>Left aching in my mind<br>Shadows kept alive<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>I will follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Long horses we are born<br>Creatures more than torn<br>Mourning our way home]]></description>
Apparently the above line contains that character.

Just had the same problem. 0x19 is a control-character (End of Medium) which is pretty much invisible with text editors. What I did to parse the XML anyway was to replace this char with a space before parsing, like this:
var xmlString = GetXmlString();
xmlString = xmlString.Replace((char)0x19, ' ');
var doc = new XmlDocument();
doc.LoadXml(xmlString);
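If more than one control character can turn up, a more general variant of the same idea is to filter on what XML 1.0 actually allows. The sketch below uses XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair from System.Xml. The class and method names are made up, and unlike the snippet above it drops the offending characters instead of replacing them with a space.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XmlCharFilter
{
    // Returns the input with every character that XML 1.0 does not allow
    // removed. Valid surrogate pairs are kept, lone surrogates are dropped.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // c is a high surrogate followed by its low surrogate
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // anything else (e.g. 0x19) is skipped
        }
        return sb.ToString();
    }
}
```

As with the Replace call, this is applied to the string before it is passed to LoadXml.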

How to remove xml "&" symbol

I am parsing an XML file into a DataSet. I get an error if the XML data contains "&" or some other special character. How do I remove it?
How do I remove "&" from the tag below?
XML:
< department departmentid=1 name="pen & Note" >
string departmentpath = HostingEnvironment.MapPath("~/App_Data/Department.xml");
DataSet departmentDS = new DataSet();
// 'using' closes the file even if ReadXml throws
using (System.IO.FileStream dpReadXml = new System.IO.FileStream(departmentpath, System.IO.FileMode.Open))
{
    try
    {
        departmentDS.ReadXml(dpReadXml);
    }
    catch (Exception ex)
    {
        // log
    }
}

You can replace it with &amp;

The culture of XML is that the person who creates the XML is responsible for delivering well-formed XML that conforms to the spec; the recipient is expected to reject it if they get it wrong. So by trying to repair bad XML and turn it into good XML you are going against the grain. It's like getting served bad food in a restaurant: you should complain, rather than asking the people at the next table how to make it digestible.
The input you've provided has a lot more wrong with it than the ampersands. It's hardly recognizable as XML at all. You're never going to turn this mess into a robust data flow.
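For the producing side, here is a short sketch of what "delivering well-formed XML" can look like in C#: building the element with System.Xml.Linq instead of concatenating strings, so that quoting and escaping are handled by the library. The class name is invented, and the element and attribute names are simply taken from the question's sample.

```csharp
using System.Xml.Linq;

// Hypothetical producer, for illustration only.
public static class DepartmentXml
{
    // Builds <department departmentid="..." name="..." />.
    // XAttribute quotes the values and escapes '&', '<' and '"' on output.
    public static string Build(int id, string name)
    {
        var element = new XElement("department",
            new XAttribute("departmentid", id),
            new XAttribute("name", name));
        return element.ToString();
    }
}
```

DepartmentXml.Build(1, "pen & Note") yields a department element whose name attribute is written as "pen &amp; Note", which DataSet.ReadXml can then read back as "pen & Note".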

The code seems to be C#. But do add the correct language tag!
There are five special characters that often require escaping within XML documents. You can read this SO question.
There are two possibilities:
Let your DataSet::ReadXML method handle these special characters [Recommended]
Change all special characters from your input files [Not recommended]
The second method is not recommended since you cannot possibly always control the incoming data (and you probably would be wasting time pre-processing them if you do want to). In order for ReadXML to properly parse the special characters you will need to define a proper encoding too in your input XML.

xml and & issue

I am new to XML and I am now trying to read an XML file.
I googled and tried this way of reading XML, but I get this error:
Reference to undeclared entity 'Ccaron'. Line 2902, position 9.
When I go to line 2902 I got this,
<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>
This is the way I try
XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
    if (n.SelectSingleNode("H0742") != null)
    {
        row.IrNbr = n.SelectSingleNode("H0742").InnerText;
    }
    .
    .
    .
}
According to w3schools, & is illegal in XML.
EDIT :
This is the encoding. I wonder it's related with xml somehow.
encoding='iso-8859-1'
Thanks in advance.
EDIT :
They gave me an .ENT file and I can reference online ftp.MyPartnerCompany.com/name.ent.
In this .ENT file
I see entities like that
<!ENTITY Cacute "Ć"> <!-- latin capital letter C with acute,
U+0106 Latin Extended-A -->
How can I reference it in my xml Parsing ?
I prefer to reference online since they may add new anytime.
Thanks in advance !!!

The first thing to be aware of is that the problem isn't in your software.
As you are new to XML, I'm going to guess that defining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper and lower case C with caron, if your font can't display it).
However, in XML we only have a few predeclared entities (ampersand, less than, greater than, double quote and apostrophe). Any other character entities need to be declared. In order to parse your file correctly you will need to do one of two things: either replace the character entity with something that doesn't cause the parser issues, or declare the entity.
To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:
<!DOCTYPE root-element
[ <!ENTITY Ccaron "&#x010C;">
<!ENTITY ccaron "&#x010D;">]
>
Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.
Alternatively, simply change the &Ccaron; to &#x010C; and your problem will also be resolved.
The &# notation is a numeric entity, giving appropriate unicode value for the character (the 'x' indicates that it's in hex).
You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.
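To see the internal-subset approach working from C#, here is a minimal sketch. The XML string, the root element name and the class name are made up for the example. One thing to be aware of is that XmlReader.Create prohibits DTDs by default, so DtdProcessing has to be set to Parse explicitly. The resolver is set to null so that nothing external gets fetched.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical demo class, for illustration only.
public static class EntityDemo
{
    public static string ReadAddress()
    {
        // Sample document: the internal subset declares the Ccaron entity.
        const string xml =
            "<!DOCTYPE root [\n" +
            "  <!ENTITY Ccaron \"&#x010C;\">\n" +
            "]>\n" +
            "<root><H0742>&Ccaron;opova 14</H0742></root>";

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // the default is Prohibit
            XmlResolver = null                   // do not resolve external resources
        };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // The reader expands &Ccaron; while loading.
            XDocument doc = XDocument.Load(reader);
            return doc.Root.Element("H0742").Value;
        }
    }
}
```

The returned value is the street name with the entity already expanded to the real character, so downstream code never sees &Ccaron;.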

&Ccaron; isn't XML; it's not even defined in the HTML 4 entity reference (and HTML, by the way, isn't XML). XML doesn't support all those entities; in fact it supports very few of them. But if you look up the entity and find it, you'll be able to use its Unicode equivalent instead. E.g. &Scaron; is invalid XML but &#x160; isn't. (Scaron was the closest I could find to Ccaron.)

Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.
You have two options:
Open that file as a regular text file and fix that symptom.
Fix your XML generator, which is your real problem. That generator isn't producing the file using System.Xml, but probably by concatenating several strings, as if "XML is just a text file". You should repair it, or opening a generated XML file will always be a surprise.
EDIT: As you can't fix your XML generator, I recommend opening the file with File.ReadAllText and running a regular expression to re-encode that & or to strip off the entire entity (as we can't translate it):
Console.WriteLine(
    Regex.Replace("<H0742>&Ccaron;opova 14, { POB & SI-1000 &</H0742>",
        @"&((?!#)\S*?;)?", match =>
        {
            switch (match.Value)
            {
                case "&lt;":
                case "&gt;":
                case "&amp;":
                case "&quot;":
                case "&apos;":
                    return match.Value; // correctly encoded
                case "&":
                    return "&amp;";
                default: // here you can choose:
                    // to remove entire entity:
                    return "";
                    // or just encode that & character:
                    // return "&amp;" + match.Value.Substring(1);
            }
        }));

&Ccaron; is an entity reference. It is likely that the entity reference is intended to be for the character Č, in order to produce: Čopova.
However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.

solution :-
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream and rewind it to the beginning
MemoryStream ms = new MemoryStream(encodedString);
ms.Flush();
ms.Position = 0;
// Build the XmlDocument from the MemoryStream of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);

How to prevent illegal characters to appear in my XML when retrieving it from SQL Server

Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):
123[]45[]6789
I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via a Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?
Is there any way I can check whether the string contains any of these unrecognized characters, via regex or something?

The first thing to remember, is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".
What you have here is either:
Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.
The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
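A quick way to get those integer values is to dump the code points of the suspicious string. The helper below is a minimal sketch with a made-up name. It reports each character as U+XXXX so it can be looked up in a Unicode chart.

```csharp
using System.Collections.Generic;

// Hypothetical diagnostic helper, for illustration only.
public static class CodePoints
{
    // Returns one "U+XXXX" entry per Unicode code point in the string.
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // two UTF-16 code units form one code point
                codePoint = char.ConvertToUtf32(text, i);
                i++;
            }
            else
            {
                codePoint = text[i];
            }
            result.Add("U+" + codePoint.ToString("X4"));
        }
        return result;
    }
}
```

For example, string.Join(" ", CodePoints.Describe(value)) on a value containing a Word-style dash would list U+2014 among the digits, which tells you it is an ordinary character rather than an illegal one.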
An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use (e.g. 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker; other possible responses are throwing an error, silently ignoring the error, or trying to guess at intent, though those last two bring security issues).
Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.
Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing it's fine", possibly something simple or something hard. Can't say yet.
Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.
Verify that the XML document itself is stored using the correct encoding to support the characters you need to store. Then verify that when you read the file in you are using the same encoding as the document, i.e. if your XML document is stored as UTF-8 then you need to make sure that you are decoding it as UTF-8 when you read it.

Take a deeper look at the characters themselves: what are the actual char values?
When a character shows up as a square it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside of your current character set.
edit, nope
In your example I'd venture a guess that you're seeing embedded newline characters.

Define the allowed characters and block everything else, i.e.:
// only lowercase letters and digits
if (Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}
But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and that you don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.
PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.
Edit: possible solution
Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with typographic dashes like "—" (&#8212;) in some fancy editing environments. Since it seems unclear how to make SQL Server accept properly encoded strings, you can also solve this in your XML.
When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:
Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header
But be wary of using StringBuilder or StringWriter: they are fixed to UTF-16, so the XmlWriter will always write in that encoding (more info on that issue at my blog), which is not compatible with SQL Server.
Note: when using the ASCII encoding, any character higher than 0x7F will be encoded as a numeric character reference. So é will look like &#xE9; and the dash may look like &#x2014;, but this means just the same and you should not worry about it. Every XML-capable tool will properly interpret this input.
Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.
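To make the ASCII suggestion above concrete, here is a self-contained sketch that writes a single element through an XmlWriter configured for ASCII and returns the resulting text. The class name, method name and the "value" element are invented for the example. The writer targets a MemoryStream rather than a StringWriter, in line with the UTF-16 warning above.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class AsciiXml
{
    // Serializes <value>...</value> as ASCII-only XML. Characters outside
    // ASCII are emitted as numeric character references by the writer.
    public static string Write(string value)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteElementString("value", value);
            }
            // The writer is disposed (and flushed) before the bytes are read.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

The output contains only 7-bit characters, and any XML parser reading it back should produce the original string, dashes and accents included.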

public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));
    using (StringReader sr3 = new StringReader(xml))
    {
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            CheckCharacters = false // default value is true
        };
        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }
    return result;
}
