Parsing incomplete OpenXML document - c#

Here is a sample of a document with an ".xls" extension that I assume is part of some OpenXML zip archive (I have only the .xls file):
<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
<Author>Some Author</Author>
<LastAuthor>Something</LastAuthor>
<Company>Some random company</Company>
</DocumentProperties>
<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
</ExcelWorkbook>
<!--Styles removed-->
<Worksheet ss:Name="WB1">
<Table>
<Column ss:AutoFitWidth="1" ss:Width="100" ss:Span="100"/>
<Row ss:StyleID="tableHeader">
<Cell>
<Data ss:Type="String">Random column 1</Data>
</Cell>
<Cell>
<Data ss:Type="String">Random column 2</Data>
</Cell>
<Cell>
<Data ss:Type="String">Random column 3</Data>
</Cell>
</Row>
<Row ss:StyleID="tableRow">
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">1</Data>
</Cell>
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">random value 1</Data>
</Cell>
</Row>
<Row ss:StyleID="tableRow">
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">2</Data>
</Cell>
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">random value 2</Data>
</Cell>
</Row>
<Row ss:StyleID="tableRow">
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">3</Data>
</Cell>
<Cell ss:StyleID="tableRow">
<Data ss:Type="String">random value 3</Data>
</Cell>
</Row>
</Table>
<WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
<Selected/>
<Zoom>90</Zoom>
<FreezePanes/>
<FrozenNoSplit/>
<SplitHorizontal>1</SplitHorizontal>
<TopRowBottomPane>1</TopRowBottomPane>
<ActivePane>2</ActivePane>
</WorksheetOptions>
</Worksheet>
</Workbook>
The question is: what would be a fast way to parse this document?
What I've tried so far:
Opening an OleDbConnection using the Microsoft provider with the connection string Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0}; Extended Properties='Excel 12.0; HDR = Yes; IMEX = 1'; This throws System.Data.OleDb.OleDbException: 'External table is not in the expected format.' when the connection is opened. Switching between Excel 12.0 and Excel 8.0 gives the same result.
Opening the file using DocumentFormat.OpenXml.Packaging, which throws System.IO.InvalidDataException: 'End of Central Directory record could not be found.', most likely because the file is not a zip archive with the OpenXML package structure:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
Even when I open the file with Excel, it warns that the format and extension don't match, and I agree. It's basically just an XML file renamed to .xls, but it still opens as an Excel spreadsheet.
Should I treat this as plain XML, or is there some parsing tool that can help me read this so-called "xls" file?
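The sample above is SpreadsheetML (the "XML Spreadsheet 2003" format): a single XML file, not a part of an Open XML zip package, which is consistent with both errors. One option is to treat it as plain XML and read it with LINQ to XML. The following is a minimal sketch under that assumption; the names SpreadsheetMlReader and ReadRows are illustrative, and it assumes rows contain only Cell/Data children as in the sample (it does not handle skipped cells marked with ss:Index).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SpreadsheetMlReader
{
    // Default namespace of the <Workbook> element in the sample.
    private static readonly XNamespace Ss = "urn:schemas-microsoft-com:office:spreadsheet";

    // Returns one list of cell texts per <Row>, in document order.
    public static List<List<string>> ReadRows(XDocument doc)
    {
        return doc.Descendants(Ss + "Row")
            .Select(row => row.Elements(Ss + "Cell")
                // A <Cell> without <Data> yields an empty string.
                .Select(cell => (string)cell.Element(Ss + "Data") ?? string.Empty)
                .ToList())
            .ToList();
    }

    public static void Main()
    {
        // Trimmed-down version of the sample; use XDocument.Load(path) for a file.
        const string xml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<Workbook xmlns=""urn:schemas-microsoft-com:office:spreadsheet""
          xmlns:ss=""urn:schemas-microsoft-com:office:spreadsheet"">
  <Worksheet ss:Name=""WB1""><Table>
    <Row><Cell><Data ss:Type=""String"">Random column 1</Data></Cell></Row>
    <Row><Cell><Data ss:Type=""String"">1</Data></Cell></Row>
  </Table></Worksheet>
</Workbook>";

        foreach (var row in ReadRows(XDocument.Parse(xml)))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

Every element lookup has to include the namespace; querying for a plain "Row" would return nothing because the elements live in the workbook's default namespace.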

Related

I created a SpreadsheetML file, but Excel is unable to open it

I created a SpreadsheetML file and tried to open it in Excel, but Excel was unable to open it because the Data node contains nested XML.
Is there any solution for this other than the Excel interop DLL?
<?xml version=""?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" ------>
---
---
<Row>
<Cell ss:StyleID="-----">
<Data>
<?xml version="1.0" encoding="UTF-8"?>
<ExecutionDate> 10/10/2017 </ExecutionDate>
</Data>
</Cell>
</Row>
<Row>
---
---
</Worksheet>
</Workbook>
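An XML declaration may only appear at the very start of a document, so a raw XML fragment with its own declaration inside the Data node makes the whole file malformed. One possible workaround, sketched below, is to store the fragment as text so that it is escaped when the file is written. This assumes the cell is built with LINQ to XML; NestedXmlCell and BuildCell are hypothetical names, and the declaration line of the fragment is left out here.

```csharp
using System;
using System.Xml.Linq;

public static class NestedXmlCell
{
    private static readonly XNamespace Ss = "urn:schemas-microsoft-com:office:spreadsheet";

    // Builds a <Cell> whose <Data> holds the XML fragment as plain text.
    public static XElement BuildCell(string xmlFragment)
    {
        return new XElement(Ss + "Cell",
            // Declare the ss prefix so the serialized output uses it.
            new XAttribute(XNamespace.Xmlns + "ss", Ss),
            new XElement(Ss + "Data",
                new XAttribute(Ss + "Type", "String"),
                xmlFragment)); // string content is escaped when serialized
    }

    public static void Main()
    {
        var cell = BuildCell("<ExecutionDate> 10/10/2017 </ExecutionDate>");
        Console.WriteLine(cell);
    }
}
```

Because the fragment is passed as a string rather than as an XElement, it ends up as character data in the cell instead of as child elements.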

How to return a list of serializable XML elements

I know how to make a function return a "pretty" result as XML, and I want just that, but without the extra parent element.
Using this function
[HttpGet]
public XElement MyFunction() { ... }
gets me half way. I get this:
<Result>
<row>
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
<row>
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
</Result>
I want a result like this:
<row>
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
<row>
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
And using a function like this:
[HttpGet]
public IEnumerable<XElement> MyFunction() { ... }
gets me a result like this:
<ArrayOfXElement
xmlns:i="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://schemas.datacontract.org/2004/07/System.Xml.Linq">
<XElement>
<row xmlns="">
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
</XElement>
<XElement>
<row xmlns="">
<cell>...</cell>
<cell>...</cell>
<cell>...</cell>
</row>
</XElement>
etc...
Which is, I mean... look at it... it's hideous.
I believe there must be a simple solution to this, or at least a workaround.
EDIT: As @Fabio suggested in the first comment, XML is invalid if it has no root element, so this question becomes silly, since I do need other systems to be able to read the result of my functions. Even so, I will not remove this question, because I hope that somebody else who needs a solution like this will, like me, realise that this kind of XML document is invalid. Also, if anyone answers the question, I will accept the answer.
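Create a reader from the element and read its inner XML. This should get you what you need.
// XElementA is the XElement whose outer (parent) tag should be dropped.
var reader = XElementA.CreateReader();
reader.MoveToContent(); // position the reader on the root element
var content = reader.ReadInnerXml(); // markup of the children only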
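The reader approach above can be written out as a complete helper. This is only a sketch; InnerXmlDemo and InnerXml are illustrative names, and the returned string is an XML fragment rather than a well-formed document, as the EDIT in the question points out.

```csharp
using System;
using System.Xml.Linq;

public static class InnerXmlDemo
{
    // Serializes the children of 'wrapper' without the wrapper element itself.
    public static string InnerXml(XElement wrapper)
    {
        using (var reader = wrapper.CreateReader())
        {
            reader.MoveToContent();       // position on the wrapper's start tag
            return reader.ReadInnerXml(); // markup of everything inside it
        }
    }

    public static void Main()
    {
        var result = XElement.Parse(
            "<Result><row><cell>a</cell></row><row><cell>b</cell></row></Result>");
        Console.WriteLine(InnerXml(result));
    }
}
```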

XmlException Text node cannot appear in this state (Boolean notWhiteSpace)

I get this error when I try to create an XDocument from a file, but only about 20% of the time. My program needs to call the function that creates the XDocument every 20 seconds, so this is critical to fix.
This is a small snippet of my function that loads the file into the XDocument:
//Read all patterns
DirectoryInfo directory = new DirectoryInfo ("Assets/_Scripts/Items/Orb Patterns");
orbPatterns = directory.GetFiles ().Cast<FileInfo> ().ToList ();
//Pick a random pattern
XDocument xmlDoc = XDocument.Load (orbPatterns [Random.Range (0, orbPatterns.Count - 1)].FullName);
The error occurs on the XDocument.Load() line.
The actual exception being thrown is
XmlException: Text node cannot appear in this state.
file:///Assets/_Scripts/Items/Orb Patterns/pattern1.xml.meta Line 1, position 1.
Mono.Xml2.XmlTextReader.ReadText (Boolean notWhitespace)
Mono.Xml2.XmlTextReader.ReadContent ()
Mono.Xml2.XmlTextReader.Read ()
System.Xml.XmlTextReader.Read ()
Mono.Xml.XmlFilterReader.Read ()
System.Xml.Linq.XDocument.ReadContent (System.Xml.XmlReader reader, LoadOptions options)
System.Xml.Linq.XDocument.LoadCore (System.Xml.XmlReader reader, LoadOptions options)
System.Xml.Linq.XDocument.Load (System.String uri, LoadOptions options)
System.Xml.Linq.XDocument.Load (System.String uri)
As I said, it works about 80% of the time, the other 20% it'll throw the exception. However, this just means the function won't run for the current iteration. After 20 seconds it will try again and usually work.
My XML document should be perfectly fine, here is an example
<?xml version="1.0" encoding="UTF-8"?>
<table>
<cell column="1" row="1">Red</cell>
<cell column="1" row="2">Red</cell>
<cell column="1" row="3">Red</cell>
<cell column="1" row="4">Red</cell>
<cell column="2" row="1">Red</cell>
<cell column="2" row="2">Blue</cell>
<cell column="2" row="3">Blue</cell>
<cell column="2" row="4">Red</cell>
<cell column="3" row="1">Red</cell>
<cell column="3" row="2">Multi</cell>
<cell column="3" row="3">Multi</cell>
<cell column="3" row="4">Red</cell>
<cell column="4" row="1">Red</cell>
<cell column="4" row="2">Blue</cell>
<cell column="4" row="3">Blue</cell>
<cell column="4" row="4">Red</cell>
</table>
I've read about similar issues, and they seem to be related to the encoding and to saving the file without a BOM. I've tried all of that, but the issue still occurs. I figure that if it runs most of the time, then encoding shouldn't be the issue. Any ideas?
I think that maybe the XML encoding (UTF-8 in this case) is causing the error. The solution is to convert the file to ASCII (removing the BOM), or to encode it in UTF-8 without a BOM.
Take a look at this question, which asks for help with the same problem: http://answers.unity3d.com/questions/10904/xmlexception-text-node-canot-appear-in-this-state.html
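Another thing worth checking: the exception message in the question names pattern1.xml.meta, not pattern1.xml. That suggests the random pick sometimes lands on one of the .meta files that Unity keeps next to each asset, since directory.GetFiles() with no filter returns every file in the folder. Separately, Unity's integer Random.Range(min, max) excludes max, so passing Count - 1 would never select the last file. The sketch below shows both points outside Unity, using System.Random in place of UnityEngine.Random; PatternPicker and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PatternPicker
{
    // Keeps only files whose final extension is ".xml",
    // so "pattern1.xml.meta" (extension ".meta") is skipped.
    public static List<FileInfo> XmlFilesOnly(IEnumerable<FileInfo> files)
    {
        return files
            .Where(f => string.Equals(f.Extension, ".xml", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    // Picks one entry; the upper bound of Random.Next is exclusive,
    // so pass Count rather than Count - 1.
    public static FileInfo PickRandom(IList<FileInfo> files, Random rng)
    {
        return files[rng.Next(0, files.Count)];
    }

    public static void Main()
    {
        // FileInfo objects for illustration; the files need not exist.
        var files = new[]
        {
            new FileInfo("pattern1.xml"),
            new FileInfo("pattern1.xml.meta"),
            new FileInfo("pattern2.xml"),
        };
        var xmlFiles = XmlFilesOnly(files);
        Console.WriteLine(PickRandom(xmlFiles, new Random()).Name);
    }
}
```

In the original snippet the same filter could be applied to the result of directory.GetFiles() before the random pick.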

XML to LINQ with metadata

I'm very new to both XML and LINQ. I've read several XML to LINQ tutorials, but none of the XML documents seem to be formatted the way mine is. I'm not sure if (and how) it changes things.
All the examples I've read on the internet seem to follow this format:
<data>
<row>
<Term>201320</Term>
<Subj>ACCT</Subj>
<Subj_desc>Accounting</Subj_desc>
</row>
<row>
<Term>201320</Term>
<Subj>ACCT</Subj>
<Subj_desc>Accounting</Subj_desc>
</row>
</data>
If I wanted to read that I think the code would look something like this:
XDocument document = XDocument.Load("URLHERE.xml");
var term = from row in document.Descendants("row")
select new
{
Term = row.Element("Term").Value,
Subject = row.Element("Subj").Value,
Subject_Description = row.Element("Subj_desc").Value,
};
Here's the problem: my XML document doesn't follow the same format. Instead of repeating the different tags for each term, it has a set of metadata at the top and then uses the SAME tag for each set of data.
Here's a sample of my XML document:
<metadata>
<item name="TERM" type="xs:string" length="128"/>
<item name="SUBJ" type="xs:string" length="128"/>
<item name="SUBJECT_DESC" type="xs:string" length="512"/>
</metadata>
<data>
<row>
<value>201320</value>
<value>ACCT</value>
<value>Accounting</value>
</row>
<row>
<value>201320</value>
<value>ACCT</value>
<value>Accounting</value>
</row>
</data>
How would I extract data from it?
var term = from row in document.Descendants("row")
select new
{
Term = row.Element("value").Value,
Subject = row.Element("value").Value,
};
Doesn't seem right. I'm using C# BTW (not sure if I need to say that or if the tag's sufficient).
Your XML is not well-formed: you need a single root element that encapsulates the entire document, such as:
<?xml version='1.0' encoding='utf-8'?>
<root>
<metadata>
<item name="TERM" type="xs:string" length="128"/>
<item name="SUBJ" type="xs:string" length="128"/>
<item name="SUBJECT_DESC" type="xs:string" length="512"/>
</metadata>
<data>
<row>
<value>201320</value>
<value>ACCT</value>
<value>Accounting</value>
</row>
<row>
<value>201320</value>
<value>ACCT</value>
<value>Accounting</value>
</row>
</data>
</root>
Then, using XDocument, you can load the file:
var doc = XDocument.Load("file.xml");
From there you can extract the data. What to extract depends on what you need, which you never specified. For example, to get the metadata:
var item = doc.Descendants("metadata");
To get the rows, each containing an IEnumerable of values:
XDocument document = XDocument.Load("c:\\tmp\\test.xml");
var rows = from i in document.Descendants("row")
select new {values = i.Elements("value").Select(a=>a.Value)};
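If the goal is to label each value with its column name, the metadata items can be paired with the values by position. The following is a rough sketch, assuming the n-th item under metadata describes the n-th value in every row and that no row has more values than there are items; MetadataRows and ReadRows are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MetadataRows
{
    // Pairs the n-th <item name="..."> under <metadata>
    // with the n-th <value> of each <row>.
    public static List<Dictionary<string, string>> ReadRows(XDocument doc)
    {
        var names = doc.Descendants("metadata").Elements("item")
            .Select(item => (string)item.Attribute("name"))
            .ToList();

        return doc.Descendants("row")
            .Select(row => row.Elements("value")
                .Select((value, index) => new { Name = names[index], Text = value.Value })
                .ToDictionary(pair => pair.Name, pair => pair.Text))
            .ToList();
    }

    public static void Main()
    {
        var doc = XDocument.Parse(
            "<root><metadata><item name=\"TERM\"/><item name=\"SUBJ\"/></metadata>" +
            "<data><row><value>201320</value><value>ACCT</value></row></data></root>");

        foreach (var row in ReadRows(doc))
            Console.WriteLine(row["TERM"] + " " + row["SUBJ"]);
    }
}
```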

How do I convert this XPath query to LINQ to XML?

I have some data which looks like:
<data>
<row>
<v>0.0264</v>
<v>1073655665.0000</v> <!-- select this -->
<v>1073749988.0000</v>
</row>
<row>
<v>0.0056</v>
<v>1073655714.0000</v> <!-- select this -->
<v>1073751235.0000</v>
</row>
<row>
<v>0.0052</v>
<v>1073655812.0000</v> <!-- select this -->
<v>1073741221.0000</v>
</row>
</data>
How do I select every nth <v> element in each <row> using LINQ to XML?
Using XPath I'd just do /data/row/v[2] to select every 2nd <v> element but I can't seem to figure out how to do this using LINQ to XML.
var qry = from row in dataNode.Elements("row")
select row.Elements("v").ElementAt(1);
Should do? (untested)
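One difference from the XPath version is worth noting: ElementAt(1) throws an ArgumentOutOfRangeException for a row with fewer than two <v> children, whereas /data/row/v[2] simply skips such rows. The sketch below shows a variant that skips them, plus the XPath query run directly through the System.Xml.XPath extension methods; NthChild and its method names are hypothetical, and n is 1-based as in XPath.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class NthChild
{
    // LINQ to XML: rows with fewer than n <v> children are skipped.
    public static List<string> NthValues(XElement data, int n)
    {
        return data.Elements("row")
            .Select(row => row.Elements("v").ElementAtOrDefault(n - 1))
            .Where(v => v != null)
            .Select(v => v.Value)
            .ToList();
    }

    // The same selection expressed as XPath, relative to the <data> element.
    public static List<string> NthValuesXPath(XElement data, int n)
    {
        return data.XPathSelectElements("row/v[" + n + "]")
            .Select(v => v.Value)
            .ToList();
    }

    public static void Main()
    {
        var data = XElement.Parse(
            "<data><row><v>0.0264</v><v>1073655665.0000</v><v>1073749988.0000</v></row>" +
            "<row><v>0.0056</v></row></data>");

        Console.WriteLine(string.Join(", ", NthValues(data, 2)));
        Console.WriteLine(string.Join(", ", NthValuesXPath(data, 2)));
    }
}
```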
