There are things that I don't like in DataTable.WriteXml() and .ReadXml(), namely:
- It's not possible to configure the number of rows or columns to load/save; I need this when processing huge amounts of data.
- It's not even possible to get rid of the XML element that carries the dataset name (DocumentElement by default).
- There's no way to do custom processing on nodes as they are read/written.
Is there any good serializer for DataRows and DataTables? I can write my own, but it seems strange that one doesn't already exist (I tried to find one).
Thanks!
I've been working on a VB application which parses multiple XML files and creates an Excel file from them.
The main problem is that I am simply reading each line of each XML file and writing it out to the Excel file when a specific node is found. I would like to know whether there is any way to store the data from each element, so that I can use it once everything (all the XML files) has been parsed.
I was thinking about databases, but I think that's excessive and unnecessary. Maybe you can give me some ideas to make it work.
System.Data.DataSet can be used as an "in memory database".
You can use a DataSet to store information in memory - a DataSet can contain multiple DataTables and you can add columns to those at runtime, even if there are already rows in the DataTable. So even if you don't know the XML node names ahead of time, you can add them as columns as they appear.
You can also use DataViews to filter the data inside the DataSet.
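For example, a minimal sketch of that approach might look like this (the file names and the filter column are assumptions, not part of your project):

using System;
using System.Data;
using System.Xml.Linq;

class XmlToDataSetSketch
{
    static void Main()
    {
        var dataSet = new DataSet("ParsedXml");
        var table = new DataTable("Records");
        dataSet.Tables.Add(table);

        // Hypothetical file names; substitute your own list of XML paths.
        foreach (var path in new[] { "file1.xml", "file2.xml" })
        {
            var doc = XDocument.Load(path);
            foreach (var record in doc.Root.Elements())
            {
                // Add a column the first time a node name appears, even though
                // rows may already exist in the table.
                foreach (var child in record.Elements())
                {
                    if (!table.Columns.Contains(child.Name.LocalName))
                        table.Columns.Add(child.Name.LocalName, typeof(string));
                }

                var row = table.NewRow();
                foreach (var child in record.Elements())
                    row[child.Name.LocalName] = child.Value;
                table.Rows.Add(row);
            }
        }

        // Once everything is parsed, filter with a DataView before writing to Excel.
        if (table.Columns.Contains("City"))
        {
            var view = new DataView(table) { RowFilter = "City = 'City1'" };
            Console.WriteLine("{0} matching rows", view.Count);
        }
    }
}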
My typical way of pre-parsing XML is to create a two-column DataTable with the XPATH address of each node and its value. You can then do a second pass that matches XPATH addresses to your objects/dataset.
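That pre-parsing pass might look something like this (my sketch of the idea, not the poster's actual code; the path format is simplified):

using System.Data;
using System.Linq;
using System.Xml.Linq;

static class XmlFlattener
{
    // Produces one row per leaf node: its path-like address and its value.
    // The path format here has no positional predicates, so repeated sibling
    // names will share an address; add predicates if you need uniqueness.
    public static DataTable ToPathValueTable(XDocument doc)
    {
        var table = new DataTable("Flattened");
        table.Columns.Add("XPath", typeof(string));
        table.Columns.Add("Value", typeof(string));

        foreach (var leaf in doc.Descendants().Where(e => !e.HasElements))
        {
            var names = leaf.AncestorsAndSelf()
                            .Reverse()
                            .Select(e => e.Name.LocalName)
                            .ToArray();
            table.Rows.Add("/" + string.Join("/", names), leaf.Value);
        }
        return table;
    }
}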
I am working on optimizing some code I have been assigned from a previous employee's code base. Beyond the fact that the code is pretty well "spaghettified" I did run into an issue where I'm not sure how to optimize properly.
The below snippet is not an exact replication, but should detail the question fairly well.
He is taking one DataTable from an Excel spreadsheet and placing rows into a consistently formatted DataTable which later updates the database. This seems logical to me; however, the way he is copying data seems convoluted, and it is a royal pain to modify, maintain, or extend with new formats.
Here is what I'm seeing:
private void VendorFormatOne()
{
    // dtSubmit is declared with its column schema elsewhere
    for (int i = 0; i < dtFromExcelFile.Rows.Count; i++)
    {
        dtSubmit.Rows.Add(i);
        dtSubmit.Rows[i]["reference_no"] = dtFromExcelFile.Rows[i]["VENDOR REF"];
        dtSubmit.Rows[i]["customer_name"] = dtFromExcelFile.Rows[i]["END USER ID"];
        // etc etc etc
    }
}
To me this is completely overkill for mapping columns to a different schema, but I can't think of a way to do this more gracefully. In the actual solution, there are about 20 of these methods, all using different formats from dtFromExcelFile and the column list is much longer. The column schema of dtSubmit remains the same across the board.
I am looking for a way to avoid having to manually map these columns every time the company needs to load a new file from a vendor. Is there a way to do this more efficiently? I'm sure I'm overlooking something here, but did not find any relevant answers on SO or elsewhere.
This might be overkill, but you could define an XML file that describes which Excel column maps to which database field, then input that along with each new Excel file. You'd want to whip up a class or two for parsing and consuming that file, and perhaps another class for validating the Excel file against the XML file.
Depending on the size of your organization, this may give you the added bonus of being able to offload that tedious mapping to someone less skilled. However, it is quite a bit of setup work and if this happens only sparingly, you might not get a significant return on investment for creating so much infrastructure.
Alternatively, if you're using MS SQL Server, this is basically what SSIS is built for, though in my experience, most programmers find SSIS quite tedious.
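As a rough illustration of that idea (the element names, vendor name, and columns below are made up, not an existing format), the mapping file can be as simple as a list of from/to pairs loaded into a dictionary at runtime:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class ColumnMapLoader
{
    // A made-up mapping document; in practice one of these would ship per
    // vendor alongside the Excel file.
    const string SampleMapping = @"
        <columnMap vendor='VendorOne'>
          <map from='VENDOR REF'  to='reference_no' />
          <map from='END USER ID' to='customer_name' />
        </columnMap>";

    // Loads the from/to pairs into a dictionary keyed by the Excel column name.
    public static Dictionary<string, string> Load(string mappingXml)
    {
        return XDocument.Parse(mappingXml)
                        .Descendants("map")
                        .ToDictionary(m => (string)m.Attribute("from"),
                                      m => (string)m.Attribute("to"));
    }
}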
I had originally intended this just as a comment but ran out of space. It's in reply to Micah's answer and your first comment therein.
The biggest problem here is the amount of XML mapping would equal that of the manual mapping in code
Consider building a small tool that, given an Excel file with two columns, produces the XML mapping file. Now you can offload the mapping work to the vendor, or an intern, or indeed anyone who has a copy of the requirement doc for a particular vendor project.
Since the file is then loaded at runtime in your import app or whatever, you can change the mappings without having to redeploy the app.
Having used exactly this kind of system many, many times in the past, I can tell you this: you will be very glad you took the time to do it - especially the first time you get a call right after deployment along the lines of "oops, we need to add a new column to the data we've given you, and we realised that we've misspelled the 19th column by the way."
About the only thing that can perhaps go wrong is data type conversions, but you can build that into the mapping file (type from/to) and generalise your import routine to perform the conversions for you.
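For concreteness, the generalised import routine driven by such a mapping might look roughly like this (a sketch only: dtSubmit comes from the question, the from/to dictionary from a loader like the earlier sketch, and the conversion step is illustrative):

using System;
using System.Collections.Generic;
using System.Data;

static class MappedImporter
{
    // Copies rows from the vendor's table into the fixed-schema dtSubmit,
    // driven entirely by the loaded column mapping.
    public static void CopyRows(DataTable source, DataTable dtSubmit,
                                IDictionary<string, string> columnMap)
    {
        foreach (DataRow sourceRow in source.Rows)
        {
            DataRow target = dtSubmit.NewRow();
            foreach (var pair in columnMap)
            {
                object value = sourceRow[pair.Key];
                Type targetType = dtSubmit.Columns[pair.Value].DataType;

                // Generalised conversion so, e.g., Excel text lands in numeric
                // or date columns without a per-vendor method.
                target[pair.Value] = value == DBNull.Value
                    ? DBNull.Value
                    : Convert.ChangeType(value, targetType);
            }
            dtSubmit.Rows.Add(target);
        }
    }
}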
Just my 2c.
A while ago I ran into a similar problem where I had over 400 columns from 30-odd tables to be mapped to about 60 in the actual table in the database. I had the same dilemma of whether to go with a schema or write something custom.
There was so much duplication that I ended up writing a simple helper class with a couple of overloaded methods that basically took in a column name from the import table and spat out the database column name. Also, for column names, I built a separate class of the following form:
public static class ColumnName
{
    public const string FirstName = "FirstName";
    public const string LastName = "LastName";
    ...
}
Same thing goes for TableNames as well.
This made it much simpler to maintain table names and column names. Also, it handled duplicate columns across different tables really well, avoiding duplicate code.
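The helper itself wasn't shown above, but the idea might look roughly like this (the dictionary contents are purely illustrative):

using System.Collections.Generic;

public static class ColumnMapper
{
    // One dictionary per import table; the names here are made up.
    private static readonly Dictionary<string, string> PersonImportMap =
        new Dictionary<string, string>
        {
            { "FIRST_NAME", ColumnName.FirstName },
            { "SURNAME",    ColumnName.LastName  },
        };

    // Returns the database column for a given import column, or the
    // original name unchanged if no mapping is defined.
    public static string ToDatabaseColumn(string importColumn)
    {
        string mapped;
        return PersonImportMap.TryGetValue(importColumn, out mapped)
            ? mapped
            : importColumn;
    }
}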
I have a requirement to generate an XML file. This is easy-peasy in C#. The problem (aside from slow database query [separate problem]) is that the output file reaches 2GB easily. On top of that, the output XML is not in a format that can easily be done in SQL. Each parent element aggregates elements in its children and maintains a sequential unique identifier that spans the file.
Example:
<level1Element>
  <recordIdentifier>1</recordIdentifier>
  <aggregateOfLevel2Children>11</aggregateOfLevel2Children>
  <level2Children>
    <level2Element>
      <recordIdentifier>2</recordIdentifier>
      <aggregateOfLevel3Children>92929</aggregateOfLevel3Children>
      <level3Children>
        <level3Element>
          <recordIdentifier>3</recordIdentifier>
          <level3Data>a</level3Data>
        </level3Element>
        <level3Element>
          <recordIdentifier>4</recordIdentifier>
          <level3Data>b</level3Data>
        </level3Element>
      </level3Children>
    </level2Element>
    <level2Element>
      <recordIdentifier>5</recordIdentifier>
      <aggregateOfLevel3Children>92929</aggregateOfLevel3Children>
      <level3Children>
        <level3Element>
          <recordIdentifier>6</recordIdentifier>
          <level3Data>h</level3Data>
        </level3Element>
        <level3Element>
          <recordIdentifier>7</recordIdentifier>
          <level3Data>e</level3Data>
        </level3Element>
      </level3Children>
    </level2Element>
  </level2Children>
</level1Element>
The schema in use actually goes five levels deep; for the sake of brevity, I'm including only three. I do not control this schema, nor can I request changes to it.
It's a simple, even trivial matter to aggregate all of this data in objects and serialize out to XML based on this schema. But when dealing with such large amounts of data, out-of-memory exceptions occur with this strategy.
The strategy that is working for me is this: I populate a collection of entities through an ObjectContext that hits a view in a SQL Server database (a most ineffectively indexed database at that). I group this collection and iterate through it, then group the next level and iterate through that, until I get to the highest-level element. I then organize the data into objects that reflect the schema (effectively just mapping) and set the sequential recordIdentifier (I've considered doing this in SQL, but the number of nested joins or CTEs would be ridiculous, given that the identifier spans from the header elements into the child elements). I write a higher-level element (say, a level2Element) with its children to the output file. Once I'm done writing at this level, I move to the parent group and insert the header with the aggregated data and its identifier.
Does anyone have any thoughts on a better way to output such a large XML file?
As far as I understand your question, your problem is not limited storage space (i.e. HDD); the difficulty is holding a large XDocument object in memory (i.e. RAM). To deal with this, you can avoid building such a huge object: for each recordIdentifier element, call .ToString() to get a string, and simply append those strings to a file. Put the declaration and root tag in the file yourself and you're done.
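In the same spirit, here is a sketch of how XmlWriter can stream the document straight to disk (my example, not the answerer's code; the element names follow the sample schema above and the aggregate values are placeholders):

using System.Xml;

class StreamedXmlWriterSketch
{
    // Writes one level3Element; only the element currently being written is
    // held in memory, and the shared counter keeps recordIdentifier values
    // sequential across the whole file.
    static void WriteLevel3(XmlWriter writer, ref int recordId, string data)
    {
        writer.WriteStartElement("level3Element");
        writer.WriteElementString("recordIdentifier", (recordId++).ToString());
        writer.WriteElementString("level3Data", data);
        writer.WriteEndElement();
    }

    static void Main()
    {
        int recordId = 1;
        var settings = new XmlWriterSettings { Indent = true };

        using (XmlWriter writer = XmlWriter.Create("output.xml", settings))
        {
            writer.WriteStartElement("level1Element");
            writer.WriteElementString("recordIdentifier", (recordId++).ToString());
            // The aggregates appear before the children they summarize, so they
            // must be computed (or queried) before this point; "11" is a placeholder.
            writer.WriteElementString("aggregateOfLevel2Children", "11");

            writer.WriteStartElement("level2Children");
            writer.WriteStartElement("level2Element");
            writer.WriteElementString("recordIdentifier", (recordId++).ToString());
            writer.WriteElementString("aggregateOfLevel3Children", "92929");

            writer.WriteStartElement("level3Children");
            WriteLevel3(writer, ref recordId, "a");
            WriteLevel3(writer, ref recordId, "b");
            writer.WriteEndElement();   // level3Children

            writer.WriteEndElement();   // level2Element
            writer.WriteEndElement();   // level2Children
            writer.WriteEndElement();   // level1Element
        }
    }
}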
I have to extract data from a saved search and drop each column into a CSV file. The search routinely returns over 300 records, and I need to parse each record into a separate CSV file (so 300+ CSV files need to be created).
With all the previous searches I have done this with, the number of columns required was small (fewer than 10) and the joins were minimal to none, so efficiency wasn't a large concern.
I now have a project with 42 fields in the saved search. The search is built off of a sales order and includes joins to the customer record and item records.
The search makes extensive use of custom fields as well as formulas.
What is the most efficient way for me to step through all of this?
I am thinking that the easiest method (and maybe the quickest) is to wrap it in a
foreach (TransactionSearchRow row in searchResult.searchRowList)
{
    using (var sw = System.IO.File.CreateText(path + filename))
    {
        ....
    }
}
block, but I want to try and avoid
if (customFieldRef is SelectCustomFieldRef)
{
    SelectCustomFieldRef selectCustomFieldRef = (SelectCustomFieldRef)customFieldRef;
    if (selectCustomFieldRef.internalId.Equals("custom_field_name"))
    {
        ....
    }
}
as I expect this code to become excessively long with this process. So any ideas are appreciated.
Using the NetSuite WSDL-generated API, there is no alternative to the nested type/name tests when reading custom fields. It just sucks and you have to live with it.
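You can at least centralize the ugliness so it lives in one place. Something like this (a sketch only; it assumes the SuiteTalk proxy types shown in your snippet and that string-ish values are all you need) flattens a row's custom fields into a dictionary keyed by internalId:

using System.Collections.Generic;

static class CustomFieldHelper
{
    // Extend with the other *CustomFieldRef types you actually encounter
    // (BooleanCustomFieldRef, DateCustomFieldRef, ...).
    public static Dictionary<string, string> Flatten(CustomFieldRef[] customFields)
    {
        var values = new Dictionary<string, string>();
        if (customFields == null)
            return values;

        foreach (CustomFieldRef field in customFields)
        {
            if (field is StringCustomFieldRef)
            {
                values[field.internalId] = ((StringCustomFieldRef)field).value;
            }
            else if (field is SelectCustomFieldRef)
            {
                // The selected value is a list/record reference; take its display
                // name (swap in its internal id if that's what your CSV needs).
                var selected = ((SelectCustomFieldRef)field).value;
                values[field.internalId] = selected == null ? null : selected.name;
            }
        }
        return values;
    }
}

Each row's custom-field handling then collapses to one call and a few dictionary lookups instead of a ladder of type checks.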
You could drop down to manual SOAP and parse the XML response yourself. That sounds like torture to me but with a few helper functions you could make the process of reading custom fields much more logical.
The other alternative would be to ditch SuiteTalk entirely and do your search in a SuiteScript RESTlet. The JavaScript API has much simpler, more direct access to custom fields than the SOAP API. You could do whatever amount of pre-processing you wanted on the server side before returning data (which could be JSON, XML, plain text, or even the final CSV) to the calling application.
Here's the deal. I have an XML document with a lot of records. Something like this:
print("<?xml version="1.0" encoding="utf-8" ?>
<Orders>
<Order>
<Phone>1254</Phone>
<City>City1</City>
<State>State</State>
</Order>
<Order>
<Phone>98764321</Phone>
<City>City2</City>
<State>State2</State>
</Order>
</Orders>");
There's also an XSD schema file. I would like to extract the data from this file and insert the records into a database table, but first I want to validate each order record. For example, if there are 5 orders in the file and 2 of them fail validation, I would like to insert the 3 that passed validation into the DB and leave out the other 2. There can be thousands of records in one XML file. What would be the best approach here, and how should the validation work, given that I need to discard the failed records and only use the ones that pass? At the moment I'm using XmlReaderSettings to validate the XML document records. Should I extract the records into another XML file, a DataSet, or a custom object before I insert them into the DB? I'm using .NET 3.5. Any code or links are welcome.
If the data maps fairly cleanly to an object model, you could try using xsd.exe to generate some classes from the .xsd, and process the classes into your DAL of choice. The problem is that if the volume is high (you mention thousands of records), you will most likely have a lot of round-trips.
Another option might be to pass the data "as is" through to the database and use SQL/XML to process the data in TSQL - presumably as a stored procedure that accepts a parameter of type xml (SQL Server 2005 etc).
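For that second option, the hand-off from C# can stay very small (the connection string and the dbo.ImportOrders procedure with an xml parameter are assumptions; the shredding and validation would then live in T-SQL):

using System.Data;
using System.Data.SqlClient;
using System.IO;

static class OrderXmlUploader
{
    // Passes the whole document through to the database in one round-trip.
    static void SendXmlToDatabase(string connectionString, string xmlPath)
    {
        string xml = File.ReadAllText(xmlPath);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.ImportOrders", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.Add("@orders", SqlDbType.Xml).Value = xml;

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}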
I agree with the idea that you should use an XmlReader, but I thought I'd try something a little different.
Basically, I am first validating the whole XDocument, then if there are errors, I enumerate through the orders and bin them as needed. It's not pretty, but maybe it'll give you some ideas.
// Requires: System.Collections.Generic, System.Linq, System.Xml.Linq, System.Xml.Schema
XDocument doc = XDocument.Load("sample.xml");

XmlSchemaSet schemas = new XmlSchemaSet();
schemas.Add("", "sample.xsd");

bool errors = false;
// addSchemaInfo: true, so GetSchemaInfo() below has something to return
doc.Validate(schemas, (sender, e) =>
{
    errors = true;
}, true);

List<XElement> good = new List<XElement>();
List<XElement> bad = new List<XElement>();
var orders = doc.Descendants("Order");

if (errors)
{
    foreach (var order in orders)
    {
        errors = false;
        order.Validate(order.GetSchemaInfo().SchemaElement, schemas, (sender, e) =>
        {
            errors = true;
        });

        if (errors)
            bad.Add(order);
        else
            good.Add(order);
    }
}
else
{
    good = orders.ToList();
}
Instead of the lambda expressions, you could use a common function, but I just threw this together. Also, you could build two XDocuments instead of shoving the order elements into a list. I'm sure there are a ton of other problems here too, but maybe this will spark something.
You have a couple of options:
1. XmlDataDocument or XmlDocument. The downside to this approach is that the data will be cached in memory, which is bad if there is a lot of it. On the other hand, you get good in-memory querying facilities with DataSet. XmlDocument requires that you use XPath queries to work with the data, whereas XmlDataDocument gives you an experience closer to the DataSet functionality.
2. XmlReader. This is a good, fast approach because the data isn't cached; you read it in a bit at a time as a stream. You move from one element to the next and query information about that element in your application to decide what to do with it. This does mean you have to keep track, in your application, of where you are in the tree, but with a simple XML file structure like yours this should be very simple.
I recommend option 2 in your case. It should scale well in terms of memory usage, and should provide the simplest implementation for processing a file.
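To make option 2 concrete, a sketch along these lines (the file names are assumptions; validation hooks in through the XmlReaderSettings you're already using) streams one Order at a time:

using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

class OrderStreamReader
{
    static void Main()
    {
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas.Add(null, "orders.xsd");
        // Fires as the stream is read; use it to flag the record being processed.
        settings.ValidationEventHandler += (s, e) => Console.WriteLine(e.Message);

        using (XmlReader reader = XmlReader.Create("orders.xml", settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Order")
                {
                    // Materialize just this <Order>; the rest of the file is never
                    // held in memory, so this scales to thousands of records.
                    var order = (XElement)XNode.ReadFrom(reader);
                    string phone = (string)order.Element("Phone");
                    string city = (string)order.Element("City");
                    Console.WriteLine(phone + " / " + city); // placeholder for the DB insert
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}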
A lot of that depends on what "validation" means in your scenario. I assume, since you're using an .xsd, you are already validating that the data is syntactically correct.
So, validation probably means you'll be calling other services or procedures to determine if an order is valid?
You might want to look at SQL Server Integration Services. The XML Task in SSIS lets you do things like XPath queries and merging, and likely anything else you'd need to do with that document. You could also use it to do all of your up-front validation against the schema file.
Marc's option of passing the data to a stored procedure might work in this scenario too, but SSIS (or even DTS, though you'd give up too much XML functionality for it to be as nice an option) lets you visually orchestrate all of this work. Plus, it makes it easier to run these things out of process, so you should end up with a much more scalable solution.
By validation I mean validating each node. The nodes that have at least one error need to be inserted into a new XML document. Basically, at the end I should have two XML documents: one containing the successful nodes and the other containing the failed nodes. Is there any way I can accomplish that? I'm using LINQ.
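If you go with the good/bad lists from the snippet above, splitting them into two documents is then only a couple of lines (the root element names and file names here are arbitrary):

using System.Xml.Linq;

// "good" and "bad" are the List<XElement> collections built during validation;
// the elements are copied into the new trees because they still belong to the
// original document.
XDocument passedDoc = new XDocument(new XElement("Orders", good));
XDocument failedDoc = new XDocument(new XElement("Orders", bad));

passedDoc.Save("orders-valid.xml");
failedDoc.Save("orders-invalid.xml");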