Xelement.XPathSelectElement - c#

I have the following XPath String
"(//DEAL_SETS/DEAL_SET/DEALS/DEAL/PARTIES/PARTY[ROLES/ROLE/
ROLE_DETAIL/PartyRoleType='Borrower'])
[(ROLES/PARTY_ROLE_IDENTIFIERS/PARTY_ROLE_IDENTIFIER/
PartyRoleIdentifier='1' or position() = 1)]/ROLES/ROLE/BORROWER/
RESIDENCES/RESIDENCE/ADDRESS/PostalCode[../../RESIDENCE_DETAIL/
BorrowerResidencyType='Current']"
which works when I put it in Altova XML Spy and gives me a result.
But when I use it directly as it is Xelement.XPathSelectElement(XPath), it does not work, but what works is Xelement.XPathSelectElement(collective, nameSpaceManager) where collective is namesspace prefix (say named "ns") plus my XPath string.
But the problem is I have to change XPath string to something like this
"(//ns:DEAL_SETS/ns:DEAL_SET/ns:DEALS/ns:DEAL/ns:PARTIES/ns:PARTY[ns:ROLES/
ns:ROLE/ns:ROLE_DETAIL/ns:PartyRoleType='Borrower'])[(ns:ROLES/
ns:PARTY_ROLE_IDENTIFIERS/ns:PARTY_ROLE_IDENTIFIER/
ns:PartyRoleIdentifier='1' or position() = 1)]/ns:ROLES/
ns:ROLE/ns:BORROWER/ns:RESIDENCES/ns:RESIDENCE/
ns:ADDRESS/ns:PostalCode[../../ns:RESIDENCE_DETAIL/
ns:BorrowerResidencyType='Current']"
Is there any way to avoid having to put the namespace prefix (ns:) at each node.
Sorry for not posting the sample xml earlier,you may replicate the Party tag and its elements and fill up with false data, PartyRoleType gives me what party it is could be borrower,appraiser etc partyroleidentifier gives me a way to differentiate between two party with same partyroletypes for eg there could be 2 borrowers,the partyroleidentifier differentiates them as 1 and 2
<?xml version="1.0" encoding="utf-8"?>
<MESSAGE xmlns="http://www.example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xxxxReferenceModelIdentifier="3.0.0.263.12" xsi:schemaLocation="http://www.example.org/schemas C:\Subversion\xxx_3_0.xsd">
<DEAL_SETS>
<DEAL_SET>
<DEALS>
<DEAL>
<PARTIES>
<PARTY>
<ROLES>
<ROLE>
<PARTY_ROLE_IDENTIFIERS>
<PARTY_ROLE_IDENTIFIER>
<PartyRoleIdentifier>1</PartyRoleIdentifier>
</PARTY_ROLE_IDENTIFIER>
</PARTY_ROLE_IDENTIIERS>
<BORROWER>
<RESIDENCES>
<RESIDENCE>
<ADDRESS>
<PostalCode>56236</PostalCode>
</ADDRESS>
</RESIDENCE>
</RESIDENCES>
</BORROWER>
<ROLE_DETAIL>
<PartyRoleType>Borrower</PartyRoleType>
</ROLE_DETAIL>
</ROLE>
</ROLES>
</PARTY>
</PARTIES>
</DEAL>
</DEALS>
</DEAL_SET>
</DEAL_SETS>
</MESSAGE>

According to the MSDN documentation, when you call AddNamespace for your namespace manager, you can use String.Empty as the prefix for your namespace URI. That should enable you to do an XPath expression without all those ns: prefixes.
Update:
OK, I remember encountering this issue before. The behavior of empty namespace prefixes in XPath expressions is not consistent with the rest of the .NET XML suite. There is some evidence that some Microsoft folks are aware of the issue. See this forum thread. I'll look for a while more for solutions, but I'm thinking that you'll have to make do with using namespace prefixes.
Update 2:
Feel free to vote on this Microsoft Connect Bug that I've just created. I see that they have closed similar bugs in the past, but none of them focused on usability concerns before. Maybe we'll get some traction with more votes.
Update 3:
Just some clarifications, my filed bug is not and never was about compliance with XPath standards, but rather about usability. No matter the standards, users need a way to make XPath queries with namespaced XML more succinct, and that is currently not supported. As far as standards support goes, XSLT 2.0 has a way to do XPath 2.0 with empty namespace prefixes associated with a namespace URI, and XMLSpy has that capability as well. The MSDN documentation claims that .NET has XPath 2.0 support, but the veracity of that claim is under dispute. One could argue that .NET is properly implementing XPath 1.0 by disallowing the modification of what an empty prefix represents.

The currently accepted answer is wrong !
Any compliant XPath processor interprets a non-prefixed name to belong to "no-namespace".
This is the biggest FAQ in XPath:
There is no way in XPath to select elements in a default namespace, by using their un-prefixed name.
The reason is that XPath considers any unprefixed name to reside in "no namespave" and it is looking to find elements with such name that are in "no namespace" Such elements aren't found and selected, because all elements are in a non-empty, default namespace.
Here is a precise quote from the W3C XPath 1.0 specification:
"if the QName does not have a prefix, then the namespace URI is null".
There are two main solutions:
Using the XPath-related API available in your programming language, register an association between the default namespace and a string prefix (lets say "x"). Then modify your wxpression and replace anyunprefixed someElementName with x:someElementName. In .NET, to register the association, an XmlNamespaceManager object typically has to be created, and its AddNamespace() method needs to be used for registering the association. Then, when calling the XPath-selecting method, the XmlNamespaceManager instance is passed as an additional argument.
If one wants to avoid registering the default namespace, one workaround is, instead of:
..
/a/b/c
to use:
/*[name()='a']/*[name()='b']/*[name()='c']

Related

how to skip invalid xml nodes when reading with xmlreader? [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌​}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

Convert string to valid XML [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌​}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

C# multiple writeattributestring to xml not showing in correct order

xtw.WriteStartElement("cXML");
xtw.WriteAttributeString("payloadID", payloadidstr);
xtw.WriteAttributeString("timestamp", utctime());
xtw.WriteAttributeString("version", "1.2.024");
above code working fine to generate xml attribute. if open xml file in notepad shows the following string which is correct.
cXML payloadID="1392408819113-4172669982087053277#123.456.789.10" timestamp="2014-02-14T12:13:39-08:00" version="1.2.024"
but when open xml file in any browser the attribute order is changed showing like this.
cXML version="1.2.024" timestamp="2015-01-15T16:54:48-08:00" payloadID="150120150454480293-832257153#123.456.789.10"
Can someone let me know why browser not showing in correct order or how to shows multiple string under one element.
XML does not define ordering of attributes, so there is no "correct" order - compliant reader/writers are free to order the way they pleased.
Per the spec section 3.1:
the order of attribute specifications in a start-tag or empty-element tag is not significant

Get all attributes with the same name

I'm using XDocument and I need to parse my XML file to retrieve all attribute with the same name event if its node's name is different from the other.
For example, for this XML :
<document>
<person name='jame'/>
<animals>
<dog name='robert'/>
</animals>
</document>
I want to retrieve all attributes named 'name'.
Can I do that with one request XPath or do I need to parse every node to find thos attributes ?
Thanks for your help !
The XPath expression
//#name
will select all attributes called name, regardless of where they appear.
By the way, 'parsing' is something that happens to the XML document before XPath ever enters the picture. So when you say "do I need to parse every node", I think this isn't really what you mean. The entire document is typically already parsed before you run an XPath query. However, I'm not sure what you do mean instead of 'parse'. Probably something like "do I need to visit every element" to find those attributes? In which case the answer is no, unless in some vague implementation-dependent sense that doesn't make any difference to you.

How do I match complete XML objects in a string?

I'm attempting to find complete XML objects in a string. They have been placed in the string by an XmlSerializer, but may or may not be complete. I've toyed with the idea of using a regular expression, because it seems like the kind of thing they were built for, except for the fact that I'm trying to parse XML.
I'm trying to find complete objects in the form:
<?xml version="1.0"?>
<type>
<field>value</field>
...
</type>
My thought was a regex to find <?xml version="1.0"?><type> and </type>, but if a field has the same name as type, it obviously won't work.
There's plenty of documentation on XML parsers, but they seem to all need a complete, fully-formed document to parse. My XML objects can be in a string surrounded by pretty much anything else (including other complete objects).
hw<e>reR#lot$0fr#ndm&nchrs%<?xml version="1.0"?><type><field>...</field>...</type>#ndH#r$omOre!!>nuT6erjc?y!<?xml version="1.0"?><type><field>...</field>...</type>ty!=]
A regex would be able to match a string while excluding the random characters, but not find a complete XML object. I'd like some way to extract an object, parse it with a serializer, then repeat until the string contains no more valid objects.
Can you use a regular expression to search for the "<?xml" piece and then assume that's the beginning of an XML object, then use an XMLReader to read/check the remainder of the string until you have parsed one entire element at the root level (then stop reading from the stream with XMLReader after the root node has been completely parsed)?
Edit: For more information about using XMLReader, I suggest one of the questions I asked: I can never predict xmlreader behavior, any tips on understanding?
My final solution was to stick with the "Read" method when parsing XML and avoid other methods that actually read from the stream advancing the current position.
You could try using the Html Agility Pack, which can be used to parse "malformed XML" and make it accessible with a DOM.
It would be necessary to know which element you are looking for (like <type> in your example), because it will be parsing the accidental elements too (like <e> in your example).

Categories

Resources