How can I parse an XML document in C# and validate a fragment that is not valid, ignoring the binary data at the end? Is it possible to parse only the XML elements enclosed within the root element and ignore the binary data?
You can use the XDocument validation methods to validate the document as a whole. As long as you use the overload that embeds the validation information in the XDocument, you can then go back over specific elements and check their validity.
Sorry I don't have any code to hand for this at the moment...
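A minimal sketch of that approach, assuming the XML portion has already been separated from the trailing binary data (XDocument.Parse will not skip it for you). The schema, the element names and the ValidationSketch class are hypothetical; Validate with addSchemaInfo set to true and GetSchemaInfo are the System.Xml.Schema extension methods for LINQ to XML.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class ValidationSketch
{
    // Hypothetical schema: <root> holds one or more integer <item> elements.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='root'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='item' type='xs:int' maxOccurs='unbounded'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    public static int CountInvalidItems(string xml)
    {
        var schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));

        XDocument doc = XDocument.Parse(xml);

        // Passing true for addSchemaInfo annotates every node with its validation result.
        doc.Validate(schemas, (sender, e) => Console.WriteLine(e.Message), true);

        int invalid = 0;
        foreach (XElement item in doc.Root.Elements("item"))
        {
            IXmlSchemaInfo info = item.GetSchemaInfo();
            if (info != null && info.Validity == XmlSchemaValidity.Invalid)
            {
                Console.WriteLine("Invalid element: " + item);
                invalid++;
            }
        }
        return invalid;
    }

    public static void Main()
    {
        Console.WriteLine(CountInvalidItems("<root><item>1</item><item>abc</item></root>"));
    }
}
```

Supplying a validation event handler keeps Validate from throwing on the first error, so the whole tree is annotated before you inspect individual elements.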
Related
I've got a library project where objects are serialized to XML for later download by users of an ASP.NET application. Additionally, I've used XSD to generate the types for serialization. The number of types is very large, and each type is serialized to its own XML. Some types have string properties, and sometimes those properties contain empty strings. During serialization those properties are written out like this:
<propertyName />
So these properties become invalid according to the XSD (they are not required, but they have restrictions such as a minimum string length).
Is there any way to configure XmlSerializer not to serialize empty strings as empty XML elements, for all types that are being serialized?
For serializing I use System.Xml.XmlSerializer.
You'd need to implement your own XML writer and reader for the serialization to work.
You would also need to make the writer and reader conditional: first check whether a value is an empty string before writing a new XML element and its value.
if (string.IsNullOrEmpty(this.testString)) {
    // Skip this value, assuming we are in a loop over the params (just giving an
    // example); the rest of the XmlWriter implementation would be normal.
    // Note: you might also need to implement the reader a bit differently - unsure of that.
    continue;
}
Reference material:
http://forum.codecall.net/topic/58239-c-tutorial-reading-and-writing-xml-files/
http://www.dotnetperls.com/xmlwriter
I would advise you to go back and read the XML specification carefully. See http://www.w3.org/TR/REC-xml/#sec-starttags
where it says:
[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or
an empty-element tag. [Definition: An empty-element tag takes a special form:]
So this:
<propertyName />
is exactly equivalent to this:
<propertyName></propertyName>
...and any XML processor that treats them differently is not conforming to the specification.
I find that people often confuse the following concepts when dealing with XML and XML schema:
tag with empty content.
Either form is acceptable. Empty is not the same as 'null' or 'nil'.
An element is allowed to be empty or nil even if minOccurs=1 in the schema.
null value / nil value.
Not the same as empty content. XML Schema has a specific attribute (xsi:nil) to indicate that the value is 'nil'.
missing tag.
The tag is entirely omitted from the XML document. Not the same as empty or nil.
This will trigger a validation error if minOccurs=1
If you are fetching the data from a database, you can apply an if condition like this:
if (Dbobject.propertyName == ""){
XMLObject.propertyName = null;
} else {
XMLObject.propertyName = Dbobject.propertyName;
}
The null values will not be serialized and the property name will be skipped during XML Serialization.
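Another option, sketched here under assumptions, is the ShouldSerialize{PropertyName} convention that XmlSerializer honors: if the type exposes a public bool method with that name, the serializer calls it to decide whether to write the member. The Product type and its property names below are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical type, used only for illustration.
public class Product
{
    public string Name { get; set; }
    public string Description { get; set; }

    // XmlSerializer looks for a public bool ShouldSerialize<PropertyName>() method
    // and omits the element when it returns false.
    public bool ShouldSerializeDescription()
    {
        return !string.IsNullOrEmpty(Description);
    }
}

public static class SerializeSketch
{
    public static string ToXml(Product product)
    {
        var serializer = new XmlSerializer(typeof(Product));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, product);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Description is an empty string, so no <Description /> element is written.
        Console.WriteLine(ToXml(new Product { Name = "Widget", Description = "" }));
    }
}
```

This still needs one method per property, so it does not configure the serializer globally. If the generated classes are declared partial, the methods can live in a separate file that survives regeneration.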
I have a Description textbox on the page. When the user enters data in it and submits the page, I pass that string to an XML tag in the XML file.
What if the user enters characters in the textbox that are not allowed in XML? How do I remove or escape them in the string? I need to validate the string as XML data.
If you're using the XmlDocument or XDocument classes to build the XML then you don't need to worry as they'll do the encoding for you.
Otherwise, if you are generating the XML by hand, you can use the SecurityElement.Escape method to encode invalid XML characters.
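A small sketch comparing the two routes. The EscapeSketch class and the Description element name are only illustrative; SecurityElement.Escape and XElement are the framework APIs mentioned above.

```csharp
using System;
using System.Security;
using System.Xml.Linq;

public static class EscapeSketch
{
    // Hand-assembled XML: the text must be escaped explicitly.
    public static string EscapeByHand(string text)
    {
        return "<Description>" + SecurityElement.Escape(text) + "</Description>";
    }

    // XElement escapes the text itself when the element is written out.
    public static string EscapeWithXElement(string text)
    {
        return new XElement("Description", text).ToString();
    }

    public static void Main()
    {
        string input = "a < b & c";
        Console.WriteLine(EscapeByHand(input));
        Console.WriteLine(EscapeWithXElement(input));
    }
}
```

Both routes escape markup characters such as < and &. Neither is shown here stripping characters that XML 1.0 forbids outright, such as most control characters.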
That depends on how you are creating the XML. If you are assembling the XML string yourself, there are A LOT of things you should do and take into consideration.
Thus, you should not be doing that (assembling the string yourself).
.NET provides you with abstraction layers so you don't have to deal with that. Example: XDocument
XmlWriter.WriteRaw will preserve &apos; and not send an actual apostrophe. Is there a method to read in &apos; and keep it as such?
You need to encode it properly. Let's take for example the following XML:
<root>&apos;</root>
The value of the <root> node is ' no matter which XML parser you use to read this XML.
On the other hand if you have the following XML:
<root>&amp;apos;</root>
the value of the <root> node is &apos;.
So in both cases we have properly encoded XML so that when a standard compliant parser reads it, it is able to correctly retrieve the value.
So be very careful when using the WriteRaw method to generate XML. Since it does not encode its argument, it is your responsibility to ensure that you are passing correct data to it.
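The difference can be seen in a short sketch. The RawVsStringSketch class is hypothetical; WriteString and WriteRaw are the XmlWriter methods under discussion.

```csharp
using System;
using System.Text;
using System.Xml;

public static class RawVsStringSketch
{
    public static string Write(string text, bool raw)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("root");
            if (raw)
                writer.WriteRaw(text);     // written verbatim, no escaping
            else
                writer.WriteString(text);  // the ampersand is escaped as &amp;
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Write("&apos;", false)); // <root>&amp;apos;</root>
        Console.WriteLine(Write("&apos;", true));  // <root>&apos;</root>
    }
}
```

Reading the first document back yields the literal text &apos; as the node value, while the second yields a plain apostrophe.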
I'm populating an XElement with information and writing it to an XML file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
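As a small illustration (the ValueSketch class and the password element name are made up for this example), the escaping only exists in the serialized markup; Value returns the original text after a round trip:

```csharp
using System;
using System.Xml.Linq;

public static class ValueSketch
{
    public static string RoundTrip(string secret)
    {
        var element = new XElement("password", secret);

        // The serialized form contains escaped markup characters.
        string xml = element.ToString();

        // Parsing it back and reading Value gives the unescaped text again.
        return XElement.Parse(xml).Value;
    }

    public static void Main()
    {
        Console.WriteLine(new XElement("password", "p>ss&w<rd"));
        Console.WriteLine(RoundTrip("p>ss&w<rd"));
    }
}
```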
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test&gt;Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
Considering that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that processes the XML output of your program instead.
I use an XSL transform to convert an XML file to HTML in .NET. I transform the node values in the XML into HTML tag contents and attributes.
I compose the XML using .NET DOM manipulation, setting the InnerText property of the nodes to arbitrary and possibly malicious text.
Right now, maliciously crafted input strings will make my HTML unsafe. Unsafe in the sense that some JavaScript might come from the user and find its way into a link's href attribute in the output HTML, for example.
The question is simple: what sanitizing, if any, do I have to do with my text before assigning it to the InnerText property? I thought that assigning to InnerText instead of InnerXml would do all the needed sanitization of the text, but that seems not to be the case.
Does my transform need any special characteristics to make this work safely? Are there any .NET-specific caveats that I should be aware of?
Thanks!
You should sanitize your XML before transforming it with XSLT. You probably will need something like:
XmlDocument xml = new XmlDocument();
string encoded = HttpUtility.HtmlEncode("<script>alert('hi')</script>");
XmlElement node = xml.CreateElement("code");
node.InnerText = encoded;
Console.WriteLine(encoded);
Console.WriteLine(node.OuterXml);
With this, you'll get
&lt;script&gt;alert(&#39;hi&#39;)&lt;/script&gt;
When you add this text into your node, you'll get
<code>&amp;lt;script&amp;gt;alert(&amp;#39;hi&amp;#39;)&amp;lt;/script&amp;gt;</code>
Now, if you run your XSLT, this encoded HTML will not cause any problems in your output.
It turns out that the problem came from the XSL itself, which used disable-output-escaping. Without that, the transform itself will do all the necessary encoding.
If you must use disable-output-escaping, you have to use the appropriate encoding function for each element: HtmlEncode for tag contents, HtmlAttributeEncode for attribute values and UrlEncode for HTML attribute values that hold URLs (e.g. href).
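A rough sketch of matching the encoder to the output position. The BuildLink helper, the /search?q= URL shape and the parameter names are assumptions for illustration; note that UrlEncode is applied to a query-string value here, not to a whole URL.

```csharp
using System;
using System.Web;

public static class EncodeSketch
{
    // Hypothetical helper: builds one link from three untrusted strings.
    public static string BuildLink(string query, string title, string text)
    {
        return "<a href=\"/search?q=" + HttpUtility.UrlEncode(query) + "\""
             + " title=\"" + HttpUtility.HtmlAttributeEncode(title) + "\">"
             + HttpUtility.HtmlEncode(text)
             + "</a>";
    }

    public static void Main()
    {
        Console.WriteLine(BuildLink("a b", "say \"hi\"", "<script>alert(1)</script>"));
    }
}
```

Encoding alone does not make a user-supplied href safe if the whole URL comes from the user (for example a javascript: URL); that case needs a scheme check as well.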