How to clean an XML file removing all elements not present in a provided XSD?
This does not work:
public static void Main()
{
    XmlTextReader xsdReader = new XmlTextReader(@"books.xsd");
    XmlSchema schema = XmlSchema.Read(xsdReader, null);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.Schemas.Add(schema);
    settings.ValidationType = ValidationType.Schema;
    settings.ValidationEventHandler += new ValidationEventHandler(ValidationCallBack);
    XmlReader xmlReader = XmlReader.Create(@"books.xml", settings);
    XmlWriter xmlWriter = XmlWriter.Create(@"books_clean.xml");
    xmlWriter.WriteNode(xmlReader, true);
    xmlWriter.Close();
    xmlReader.Close();
}
private static void ValidationCallBack(object sender, ValidationEventArgs args)
{
    ((XmlReader)sender).Skip();
}
When I use the above, instead of removing all "junk" tags, it removes only the first junk tag and leaves the second one. As for why I need to accept this file: I am using an old SQL Server 2012 instance which requires the XML to match the XSD exactly, even if the extra elements in the XML are not used by the application. I do not have control over the source XML, which is provided by a 3rd party tool with an unpublished XSD.
Sample Files:
Books.xsd
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="bookstore">
<xs:complexType>
<xs:sequence>
<xs:element name="book" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
<xs:sequence>
<xs:element type="xs:string" name="title"/>
<xs:element type="xs:float" name="price"/>
</xs:sequence>
<xs:attribute type="xs:string" name="genre" use="optional"/>
<xs:attribute type="xs:string" name="ISBN" use="optional"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Books.xml
<bookstore>
<book genre='novel' ISBN='10-861003-324'>
<title>The Handmaid's Tale</title>
<price>19.95</price>
<junk>skdjgklsdg</junk>
<junk2>skdjgklsdg</junk2>
</book>
<book genre='novel' ISBN='1-861001-57-5'>
<title>Pride And Prejudice</title>
<price>24.95</price>
<junk>skdjgssklsdg</junk>
</book>
</bookstore>
Code mostly copied from: Validating an XML against referenced XSD in C#
If it's simply a question of removing all elements whose names don't appear anywhere in the schema, then it is possibly feasible, as described below. However, in the general case (a) this doesn't ensure the instance will be valid against the schema (the elements might be in the wrong order, for example), and (b) it might remove elements that the schema actually allows (because of wildcards).
If the approach of removing unknown elements looks useful, you could do it as follows:
(a) write an XSLT stylesheet that extracts all the element names from the schema by looking for xs:element[@name] declarations, generating a document with the format:
<allowedElements>
<allow name="book" namespace=""/>
<allow name="isbn" namespace=""/>
</allowedElements>
(b) write a second (streamable) XSLT stylesheet:
<xsl:transform version="3.0" xmlns:xsl="....">
<xsl:mode on-no-match="shallow-copy" streamable="yes"/>
<xsl:key name="k" match="allow" use="@name, @namespace" composite="yes"/>
<xsl:template match="*[not(key('k', (local-name(), namespace-uri()), doc('allowed-elements.xml')))]"/>
</xsl:transform>
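For readers who would rather stay in C#, the same two-step idea can be sketched with LINQ to XML instead of XSLT. This is only an illustration under stated assumptions: the class and method names are hypothetical, the whitelist is built from every xs:element that has a name attribute, and names are compared by local name only (namespaces are ignored). The caveats above still apply: the result is not guaranteed to be valid, and wildcards in the schema are not honoured.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class SchemaWhitelistCleaner
{
    private static readonly XNamespace Xs = "http://www.w3.org/2001/XMLSchema";

    // Step (a): collect the names of all named element declarations in the schema.
    public static HashSet<string> AllowedElementNames(XDocument schema)
    {
        return new HashSet<string>(
            schema.Descendants(Xs + "element")
                  .Select(e => (string)e.Attribute("name"))
                  .Where(name => name != null));
    }

    // Step (b): copy the instance and drop every element whose local name
    // is not in the whitelist (assumption: namespaces are ignored).
    public static XDocument Clean(XDocument instance, XDocument schema)
    {
        HashSet<string> allowed = AllowedElementNames(schema);
        var cleaned = new XDocument(instance);
        cleaned.Descendants()
               .Where(e => !allowed.Contains(e.Name.LocalName))
               .Remove(); // the extension method snapshots the sequence before removing
        return cleaned;
    }
}
```

With the sample files, something like SchemaWhitelistCleaner.Clean(XDocument.Load("books.xml"), XDocument.Load("books.xsd")).Save("books_clean.xml") would write the cleaned document. Unlike the streamable stylesheet, this loads the whole document into memory.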
The below successfully removes all of the junk tags from the provided examples. The second xsl:template tag is applied first and matches everything except the specifically white-listed tags. Then the first xsl:template tag writes a copy of the nodes to XmlWriter.
Code:
public static void Main()
{
XmlReader xmlReader = XmlReader.Create("books.xml");
XslCompiledTransform myXslTrans = new XslCompiledTransform();
myXslTrans.Load("books.xslt");
XmlTextWriter myWriter = new XmlTextWriter("books_clean.xml", null);
myXslTrans.Transform(xmlReader, null, myWriter);
xmlReader.Close();
myWriter.Close();
}
books.xslt
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode streamable="yes"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[
not(name()='bookstore') and
not(name()='book') and
not(name()='title') and
not(name()='price')
]" />
</xsl:stylesheet>
I have a requirement to create a XSD file to safely process XML that we are getting. The XML will be like this:
<Books>
<Book>
<abc>hello</abc>
<xyz>crazy</xyz>
<q123>world</q123>
...
</Book>
<Book>
<abc>bye</abc>
<xyz>bye</xyz>
<q123></q123>
...
</Book>
</Books>
The <Books> element is the root element so there will only be one.
The <Book> element occurs between 1 and 100 times.
The trouble is with the subelements of the <Book> element.
The number of subelements must be between 1 and 500.
The subelements can be any name.
The name must be 1 to 100 characters in length.
They can be in any order.
The subelement cannot have any attributes.
The subelement value can be 0 to 100 characters in length.
The good news is each <Book> element will have the same number and the same order of subelements. Below is the XSD I have so far. This is based on the answer from XSD for varying element names
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning" vc:minVersion="1.1">
<xs:element name="Books">
<xs:complexType>
<xs:sequence>
<xs:element name="Book" minOccurs="0" maxOccurs="1000">
<xs:complexType>
<xs:sequence>
<xs:any processContents="strict" namespace="##local" minOccurs="0" maxOccurs="500"/>
</xs:sequence>
<xs:assert test="every $e in * satisfies matches(local-name($e), '.{1,100}')"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
When I try to parse the XML I'm getting this error: The 'http://www.w3.org/2001/XMLSchema:assert' element is not supported in this context.
I'm not sure why I am getting that error. Also I have no idea how to test the value of the subelement is between 0 and 100 characters, and how to make sure it has 0 attributes.
I'm working in C# .NET 4.6. Thanks in advance!
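For context, xs:assert is an XSD 1.1 feature, and the System.Xml schema classes in the .NET Framework implement XSD 1.0 only, which is consistent with the error above. One possible workaround, sketched here under that assumption, is to check the listed rules in code with LINQ to XML rather than in the schema. The class name, method name, and messages are hypothetical, and the limits are taken from the list in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical rule checker for the <Books> layout described in the question.
public static class BookRules
{
    public static List<string> Check(XDocument doc)
    {
        var problems = new List<string>();
        XElement root = doc.Root;
        if (root == null || root.Name != "Books")
        {
            problems.Add("Root element must be <Books>.");
            return problems;
        }

        List<XElement> books = root.Elements().ToList();
        if (books.Count < 1 || books.Count > 100)
            problems.Add("Expected 1 to 100 <Book> elements, found " + books.Count + ".");

        foreach (XElement book in books)
        {
            if (book.Name != "Book")
            {
                problems.Add("Unexpected element <" + book.Name.LocalName + "> under <Books>.");
                continue;
            }

            List<XElement> fields = book.Elements().ToList();
            if (fields.Count < 1 || fields.Count > 500)
                problems.Add("Expected 1 to 500 subelements, found " + fields.Count + ".");

            foreach (XElement field in fields)
            {
                string name = field.Name.LocalName;
                if (name.Length > 100)
                    problems.Add("Subelement name is longer than 100 characters.");
                if (field.HasAttributes)
                    problems.Add("<" + name + "> must not have attributes.");
                if (field.HasElements)
                    problems.Add("<" + name + "> must not contain child elements.");
                else if (field.Value.Length > 100)
                    problems.Add("<" + name + "> value is longer than 100 characters.");
            }
        }
        return problems;
    }
}
```

An empty list means every rule passed. The check that each <Book> has the same subelements in the same order is not included in this sketch.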
I searched and did not find any questions addressing this problem.
I am attempting to validate various XML against a schema and it seems to be validating ALL well-formed XML, instead of just XML that conforms to the schema. I have posted the code I am using, the Schema, a sample valid XML and a sample invalid XML.
I have been struggling with this for a while. I am in the dark on most of this. I've had to learn how to write an XSD, write the XSD, then learn how to parse XML in C#, none of which I have ever done before. I have used many tutorials and the Microsoft website to come up with the following. I think this should work, but it doesn't.
What am I doing wrong?
private bool ValidateXmlAgainstSchema(string sourceXml, string schemaUri)
{
bool validated = false;
try
{
// REF:
// This should create a SCHEMA-VALIDATING XMLREADER
// http://msdn.microsoft.com/en-us/library/w5aahf2a(v=vs.110).aspx
XmlReaderSettings xmlSettings = new XmlReaderSettings();
xmlSettings.Schemas.Add("MySchema.xsd", schemaUri);
xmlSettings.ValidationType = ValidationType.Schema;
xmlSettings.ValidationFlags = XmlSchemaValidationFlags.None;
XmlReader xmlReader = XmlReader.Create(new StringReader(sourceXml), xmlSettings);
// parse the input (not sure this is needed)
while (xmlReader.Read()) ;
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(xmlReader);
validated = true;
}
catch (XmlException e)
{
// load or parse error in the XML
validated = false;
}
catch (XmlSchemaValidationException e)
{
// Validation failure in XML
validated = false;
}
catch (Exception e)
{
validated = false;
}
return validated;
}
The XSD / Schema. The intent is to accept XML that contains either an Incident or a PersonOfInterest.
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="MySchema.xsd"
xmlns="MySchema.xsd"
elementFormDefault="qualified"
>
<xs:element name="Incident" type="IncidentType"/>
<xs:element name="PersonOfInterest" type="PersonOfInterestType"/>
<xs:complexType name="IncidentType">
<xs:sequence>
<xs:element name="Description" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="PersonOfInterest" type="PersonOfInterestType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="PersonOfInterestType">
<xs:sequence>
<xs:element name="Name" type="xs:string" minOccurs="1" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
Here is a sample of valid XML
<?xml version="1.0" encoding="utf-8" ?>
<Incident
xmlns="MySchema.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3schools.com MySchema.xsd"
>
<Description>something happened</Description>
<PersonOfInterest>
<Name>Joe</Name>
</PersonOfInterest>
<PersonOfInterest>
<Name>Sue</Name>
</PersonOfInterest>
</Incident>
This is a sample of well-formed invalid XML which should throw an exception (I thought), but when I try it, the code returns true, indicating it is valid against the schema.
<ghost>Boo</ghost>
The reason your <ghost>Boo</ghost> validates is that the parser cannot find any schema declaration matching the XML. If there is no matching schema, the parser assumes validity, provided the XML is well-formed. It's counter-intuitive, I know, and the behaviour will probably differ between parser implementations.
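In System.Xml specifically, the "no schema found for this element" condition is raised as a warning, and warnings are only reported when XmlSchemaValidationFlags.ReportValidationWarnings is set. A minimal sketch, assuming you want to treat any reported event as a failure (the class and method names are hypothetical, and the schema is passed as a string to keep the example self-contained):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper: collects every validation error AND warning.
public static class StrictValidator
{
    public static List<string> Validate(string xml, string xsd)
    {
        var problems = new List<string>();
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };

        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            // A null targetNamespace means "use the one declared in the schema".
            settings.Schemas.Add(null, schemaReader);
        }

        // Without this flag, elements that have no matching declaration pass silently.
        settings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        settings.ValidationEventHandler +=
            (sender, e) => problems.Add(e.Severity + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // validation happens as the reader advances
        }
        return problems;
    }
}
```

With this in place, a document such as <ghost>Boo</ghost> should produce at least one warning instead of an empty result, so the caller can return false whenever the list is non-empty.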
This notwithstanding, there are several problems with your code:
Two Root Elements
This is a big no-no in xsd - you can only have a single root element. Some parsers will actually throw an exception, others tolerate it but will only use the first root element (in your case Incident) for any subsequent validation.
Use of schemaLocation attribute
This should take the value (namespace) (URI) where the namespace is the targetNamespace of the schema and the URI is the location of the schema. In your case you appear to be using the schema file name as your target namespace. Additionally, looking at your code, you are loading the schema into your xml reader so you don't actually need the schemaLocation attribute at all. This is an optional attribute and some parsers completely ignore it.
I would suggest the following changes:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://MyMoreMeaningfulNamespace"
xmlns="http://MyMoreMeaningfulNamespace"
elementFormDefault="qualified"
>
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="Incident" type="IncidentType"/>
<xs:element maxOccurs="unbounded" name="PersonOfInterest" type="PersonOfInterestType"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="IncidentType">
<xs:sequence>
<xs:element name="Description" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="PersonOfInterest" type="PersonOfInterestType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="PersonOfInterestType">
<xs:sequence>
<xs:element name="Name" type="xs:string" minOccurs="1" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
Which validates this instance
<Root xmlns="http://MyMoreMeaningfulNamespace">
<Incident>
<Description>something happened</Description>
<PersonOfInterest>
<Name>Joe</Name>
</PersonOfInterest>
<PersonOfInterest>
<Name>Sue</Name>
</PersonOfInterest>
</Incident>
<Incident>
...
</Incident>
<PersonOfInterest>
<Name>Manny</Name>
</PersonOfInterest>
<PersonOfInterest>
...
</PersonOfInterest>
</Root>
I'm having a problem validating XML against schema. Simplified code and examples:
Verification code:
public static void ValidateXmlAgainstSchema(StreamReader xml, XmlSchema xmlSchema)
{
var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
settings.Schemas.Add(xmlSchema);
settings.ValidationType = ValidationType.Schema;
settings.ValidationEventHandler += (obj, args) => { if (args.Exception != null) throw args.Exception; };
using (var reader = XmlReader.Create(xml, settings))
using (XmlReader validatingReader = XmlReader.Create(reader, settings))
{
while (validatingReader.Read()){}
}
}
Schema:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://foo.com/"
xmlns="http://foo.com/">
<xs:simpleType name="myBool">
<xs:restriction base="xs:string">
<xs:enumeration value="true"/>
<xs:enumeration value="false"/>
<xs:enumeration value="file_not_found"/>
</xs:restriction>
</xs:simpleType>
<xs:complexType name="dataType">
<xs:sequence>
<xs:element name="id" type="xs:string" minOccurs="1" maxOccurs="1" />
<xs:element name="name" type="xs:string" minOccurs="0" maxOccurs="1" />
</xs:sequence>
</xs:complexType>
<xs:element name="foo">
<xs:complexType>
<xs:sequence>
<xs:element name="data" type="dataType" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="myBool" type="myBool" use="optional" />
</xs:complexType>
</xs:element>
</xs:schema>
XML:
1.
<?xml version="1.0"?>
<foo xmlns="http://foo.com/" myBool="true">
<data>
<id>1</id>
<name>abc</name>
</data>
</foo>
This example throws an exception:
System.Xml.Schema.XmlSchemaValidationException: The element 'foo' in namespace 'http://foo.com/' has invalid child element 'data'
in namespace 'http://foo.com/'. List of possible elements expected: 'data'.
My understanding is that if the namespace is defined for an element, all child elements will have the same namespace, unless defined otherwise. It doesn't work though. I can make it validate by adding elementFormDefault="qualified" to the schema, which makes all elements default to targetNamespace. Is that a good way of doing it?
2.
<?xml version="1.0"?>
<a:foo xmlns:a="http://foo.com/" a:myBool="true">
<a:data>
<a:id>1</a:id>
<a:name>abc</a:name>
</a:data>
</a:foo>
This example fails with the message:
The 'http://foo.com/:myBool' attribute is not declared.
Every element and attribute has an explicit namespace, so the XML should be valid. Even the error message suggests the parser is looking for the attribute I expect it to, but fails to find it. I can make it validate by changing a:myBool to myBool. Why doesn't it work in the first form when it works in the other?
elementFormDefault won't do anything to attributes; to set the equivalent for those you need attributeFormDefault. However, by default both of these are set to "unqualified".
The reason approach 2 - a:myBool="true" - failed is that the attributeFormDefault value wasn't overridden. If you want to namespace attributes, you can either set this to "qualified" or set the form attribute on the attribute declaration itself to "qualified", like so:
<xs:attribute name="myBool" type="myBool" use="optional" form="qualified"/>
This should make this a valid element start for approach 2:
<a:foo xmlns:a="http://foo.com/" a:myBool="true">
As for why approach 1 failed, I'm not sure; your XSD and XML appear to match. It might be worth explicitly setting the attributeFormDefault attribute on the root XSD element to "unqualified", just in case the schema validator doesn't apply the default settings when they aren't declared. Like so:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://foo.com/"
attributeFormDefault="unqualified"
xmlns="http://foo.com/">
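To see how the two form defaults interact, here is a small C# experiment. It is a sketch only: the class, the BuildSchema helper, and the cut-down schema are made up for illustration and reuse just the foo, data, and myBool names from the question. It validates the two instance styles against schemas built with different elementFormDefault and attributeFormDefault values and collects the errors.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical demo class, not from the original posts.
public static class FormDefaultDemo
{
    // Builds a reduced version of the question's schema with the given form defaults.
    public static string BuildSchema(string elementForm, string attributeForm)
    {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'" +
               " targetNamespace='http://foo.com/'" +
               " elementFormDefault='" + elementForm + "'" +
               " attributeFormDefault='" + attributeForm + "'>" +
               "<xs:element name='foo'><xs:complexType><xs:sequence>" +
               "<xs:element name='data' type='xs:string' minOccurs='0'/>" +
               "</xs:sequence><xs:attribute name='myBool' type='xs:string'/>" +
               "</xs:complexType></xs:element></xs:schema>";
    }

    // Returns the validation error messages (warnings are not reported by default).
    public static List<string> Validate(string xsd, string xml)
    {
        var errors = new List<string>();
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            settings.Schemas.Add(null, schemaReader);
        }
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }
        }
        return errors;
    }

    // Instance 1: default namespace on foo, unprefixed attribute.
    public const string DefaultNs =
        "<foo xmlns='http://foo.com/' myBool='true'><data>x</data></foo>";

    // Instance 2: everything, including the attribute, carries the a: prefix.
    public const string Prefixed =
        "<a:foo xmlns:a='http://foo.com/' a:myBool='true'><a:data>x</a:data></a:foo>";
}
```

Calling FormDefaultDemo.Validate(FormDefaultDemo.BuildSchema("qualified", "unqualified"), FormDefaultDemo.DefaultNs) should return no errors, which matches the asker's observation that elementFormDefault="qualified" makes example 1 validate. The Prefixed instance should only pass once attributeFormDefault is also "qualified".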
<products>
<product id="1">
<photos>
<photo addr="1.jpg" />
</photos>
<parameters>
<parameter name="name" />
</parameters>
</product>
</products>
Hello, I have this XML and I want to read values such as the photo addr or the parameter name, but I can't.
In the DataGrid, each nested element shows up as a new table. How can I read these parameters by product id?
foreach (DataTable t in dataSet.Tables)
{
Console.WriteLine(t);
}
I am getting: product, photos, photo, parameters, parameter.
Sorry for my English
OK, you need to use schema inferencing to generate an XSD file, then run xsd.exe to generate the DataSet that you're looking for.
To begin with, the XML you've provided won't work off the bat, because you don't have a root node. I'm taking a wild guess here, but if <products> is the root node, we could reformat your XML to look like this:
<products>
<product id="1">
<photos>
<photo addr="1.jpg" />
</photos>
<parameters>
<parameter name="name1" />
</parameters>
</product>
<product id="2">
<photos>
<photo addr="2.jpg" />
</photos>
<parameters>
<parameter name="name2" />
</parameters>
</product>
</products>
We could then take this XML and infer the XSD from it via the XmlSchemaInference class. When I inferred the schema from the XML above, I got this:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="products">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="product">
<xs:complexType>
<xs:sequence>
<xs:element name="photos">
<xs:complexType>
<xs:sequence>
<xs:element name="photo">
<xs:complexType>
<xs:attribute name="addr" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="parameters">
<xs:complexType>
<xs:sequence>
<xs:element name="parameter">
<xs:complexType>
<xs:attribute name="name" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="id" type="xs:unsignedByte" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
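The inference step can be sketched as follows. This is an illustrative snippet rather than the exact code used for the schema above; the class and method names are hypothetical, and the XML is passed in as a string to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical wrapper around XmlSchemaInference.
public static class SchemaInferenceSketch
{
    // Infers a schema from the given XML text and returns it as XSD text.
    public static string InferXsd(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            XmlSchemaSet schemaSet = new XmlSchemaInference().InferSchema(reader);

            var output = new StringWriter();
            foreach (XmlSchema schema in schemaSet.Schemas())
            {
                schema.Write(output); // one schema per target namespace
            }
            return output.ToString();
        }
    }
}
```

The returned text can be saved to a .xsd file and passed to xsd.exe. Inference only sees the sample it is given, so types such as xs:unsignedByte for id reflect the sample values and may need widening by hand.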
I then ran xsd.exe with the /dataset option, which gave me a fully functional DataSet representing the original XML. Populating the DataSet and then calling DataSet.GetXml() returns the expected XML result.
Final note: I'm not advocating the use of DataSets over domain objects, etc etc. I just wanted to show you what would be the steps for accomplishing your stated objective.
By default DataSet assumes a schema based on elements, not attributes. You'd need to build an xsd for your data to get DataSet to read that XML correctly.
Due to the way this XML is constructed there's no easy way to read this into a DataSet directly. You'd have to use xsd.exe to generate an xsd file and then hand edit it with appropriate attributes from the msdata namespace. It's a quite painful process.
You could transform the document to a form which would probably parse correctly into a DataSet, either by rewriting the XML using LINQ or by applying an XslCompiledTransform:
<products>
<product id="1">
<photo addr="1.jpg" />
<parameter name="name" />
</product>
</products>
You could also consider using LINQ to XML on it instead of trying to use a DataSet:
// productsElement is assumed to be an XElement holding the <products> root,
// e.g. obtained via XElement.Load or XElement.Parse.
var productsInfo = from product in productsElement.Descendants("product")
                   from photo in product.Descendants("photo")
                   from parameter in product.Descendants("parameter")
                   let id = product.Attribute("id")
                   let addr = photo.Attribute("addr")
                   let name = parameter.Attribute("name")
                   select new { ID = id.Value, Addr = addr.Value, Name = name.Value };
If you're developing a new XML layout rather than working with an existing one, you can create the DataSet first using the designer. Populate it with some dummy data, then you can use WriteXml to get the XSD schema and sample data. To preserve relationships in the XML output, be sure to set the Nested property on your Relations to true (see Nesting DataRelations (ADO.NET)).
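As a rough illustration of the Nested setting, the sketch below builds a tiny DataSet in code instead of in the designer. The table, column, and relation names are made up for the example; only DataRelation.Nested and WriteXml come from the paragraph above.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical example: a product table with a nested photo table.
public static class NestedDataSetSketch
{
    public static string BuildXml()
    {
        var ds = new DataSet("products");

        DataTable product = ds.Tables.Add("product");
        DataColumn productId = product.Columns.Add("id", typeof(int));

        DataTable photo = ds.Tables.Add("photo");
        DataColumn photoProductId = photo.Columns.Add("product_id", typeof(int));
        photo.Columns.Add("addr", typeof(string));

        // Nested = true makes WriteXml emit each photo inside its parent product.
        DataRelation relation = ds.Relations.Add("product_photo", productId, photoProductId);
        relation.Nested = true;

        product.Rows.Add(1);
        photo.Rows.Add(1, "1.jpg");

        var writer = new StringWriter();
        ds.WriteXml(writer);
        return writer.ToString();
    }
}
```

With Nested left at its default of false, the photo rows would instead be written as siblings of the product rows under the root element.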
Hi all, my XML file (XMLFile2.xml) is as follows:
<?xml version="1.0"?>
<Product ProductID="123"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="Product.xsd">
<ProductName>XYZ</ProductName>
</Product>
My XSD is as follows
<?xml version="1.0" encoding="utf-8"?>
<xs:schema id="Product"
targetNamespace="http://tempuri.org/Product.xsd"
elementFormDefault="qualified"
xmlns="http://tempuri.org/Product.xsd"
xmlns:mstns="http://tempuri.org/Product.xsd"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Product">
<xs:complexType>
<xs:sequence>
<xs:element name="ProductName" type="xs:string"></xs:element>
</xs:sequence>
<xs:attribute name="ProductID" type="xs:int" use="required"/>
</xs:complexType>
</xs:element>
</xs:schema>
This is my code
string strPath = Server.MapPath("XMLFile2.xml");
XmlTextReader r = new XmlTextReader(strPath);
XmlValidatingReader v = new XmlValidatingReader(r);
v.ValidationType = ValidationType.Schema;
v.ValidationEventHandler +=
new ValidationEventHandler(MyValidationEventHandler);
while (v.Read())
{
}
v.Close();
if (isValid)
Response.Write("Document is valid");
else
Response.Write("Document is invalid");
I am getting the following errors
Validation event
The targetNamespace parameter '' should be the same value as the targetNamespace 'http://tempuri.org/Product.xsd' of the schema.Validation event
The 'Product' element is not declared.Validation event
Could not find schema information for the attribute 'ProductID'.Validation event
The 'ProductName' element is not declared.Document is invalid
Can anyone tell me where I went wrong?
Your XSD is set to validate the "http://tempuri.org/Product.xsd" namespace, but your XML contains only elements from the "" namespace.
You need to either (a) change the XML file to use the "http://tempuri.org/Product.xsd" namespace, or (b) change the XSD file to use the "" namespace, depending on your user requirements.
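If you go with option (a) but cannot edit the XML file by hand, one possibility is to move the elements into the schema's namespace in code before validating. The following is a sketch using LINQ to XML. The class and method names are hypothetical, and the schema is read from a string here only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

// Hypothetical helper for option (a).
public static class NamespaceFixSketch
{
    // Puts every element that is in no namespace into the given namespace.
    // Attributes are left alone, since unprefixed attributes stay unqualified.
    public static XDocument MoveIntoNamespace(string xml, XNamespace ns)
    {
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement element in doc.Descendants().ToList())
        {
            if (element.Name.Namespace == XNamespace.None)
            {
                element.Name = ns + element.Name.LocalName;
            }
        }
        return doc;
    }

    // Validates the in-memory document and returns the error messages.
    public static List<string> Validate(XDocument doc, string xsd)
    {
        var schemas = new XmlSchemaSet();
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            schemas.Add(null, schemaReader);
        }
        var errors = new List<string>();
        doc.Validate(schemas, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

For the files in the question, MoveIntoNamespace(xmlText, "http://tempuri.org/Product.xsd") followed by Validate should report no errors, while a non-numeric ProductID would still be reported.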