Very large XML file generation

I have a requirement to generate an XML file. This is easy-peasy in C#. The problem (aside from a slow database query [separate problem]) is that the output file easily reaches 2GB. On top of that, the output XML is not in a format that can easily be produced in SQL. Each parent element aggregates elements in its children and maintains a sequential unique identifier that spans the file.
Example:
<level1Element>
  <recordIdentifier>1</recordIdentifier>
  <aggregateOfLevel2Children>11</aggregateOfLevel2Children>
  <level2Children>
    <level2Element>
      <recordIdentifier>2</recordIdentifier>
      <aggregateOfLevel3Children>92929</aggregateOfLevel3Children>
      <level3Children>
        <level3Element>
          <recordIdentifier>3</recordIdentifier>
          <level3Data>a</level3Data>
        </level3Element>
        <level3Element>
          <recordIdentifier>4</recordIdentifier>
          <level3Data>b</level3Data>
        </level3Element>
      </level3Children>
    </level2Element>
    <level2Element>
      <recordIdentifier>5</recordIdentifier>
      <aggregateOfLevel3Children>92929</aggregateOfLevel3Children>
      <level3Children>
        <level3Element>
          <recordIdentifier>6</recordIdentifier>
          <level3Data>h</level3Data>
        </level3Element>
        <level3Element>
          <recordIdentifier>7</recordIdentifier>
          <level3Data>e</level3Data>
        </level3Element>
      </level3Children>
    </level2Element>
  </level2Children>
</level1Element>
The schema in use actually goes up five levels. For the sake of brevity, I'm including only 3. I do not control this schema, nor can I request changes to it.
It's a simple, even trivial matter to aggregate all of this data in objects and serialize out to XML based on this schema. But when dealing with such large amounts of data, out of memory exceptions occur while using this strategy.
The strategy that is working for me is this: I populate a collection of entities through an ObjectContext that hits a view in a SQL Server database (a most ineffectively indexed database at that). I group this collection and iterate through it, then group the next level and iterate through that, and so on until I reach the highest-level element. I then organize the data into objects that reflect the schema (effectively just mapping) and set the sequential recordIdentifier (I've considered doing this in SQL, but the number of nested joins or CTEs would be ridiculous, considering that the identifier spans the header elements and the child elements). I write a higher-level element (say the level2Element) with its children to the output file. Once I'm done writing at this level, I move to the parent group and insert the header with the aggregated data and its identifier.
Does anyone have any thoughts concerning a better way to output such a large XML file?

As far as I understand your question, the problem is not limited storage space (i.e. the HDD); it is the difficulty of holding a large XDocument object in memory (i.e. RAM). To deal with this, you can avoid building such a huge object. For each recordIdentifier element you can call .ToString() to get a string, then simply append those strings to a file. Put the declaration and root tag in this file and you're done.
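A minimal sketch of that idea, with one swap named plainly: instead of calling .ToString() and appending strings, it hands each small XElement to an XmlWriter, which also takes care of the declaration and root tag. The class name StreamingXmlExport, the method Write and the rootName parameter are hypothetical, and the chunks are assumed to come from a lazy source (for example an iterator using yield return) so that only one chunk is in memory at a time.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingXmlExport
{
    // Writes the declaration and a root element, then streams each chunk into it.
    // rootName and the chunk source are assumptions; the real schema supplies them.
    public static void Write(Stream output, string rootName, IEnumerable<XElement> chunks)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement(rootName);
            foreach (XElement chunk in chunks)
            {
                // Only the current chunk is referenced here; earlier chunks
                // are no longer held by this method and can be collected.
                chunk.WriteTo(writer);
            }
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }
}
```

Passing a FileStream as output gives the append-to-file behaviour; the largest object ever built is then a single chunk rather than the whole document.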

Related

DocumentDB: How to better structure data for updates

So for example I have Collection of Documents like this:
{
hotField1 : 0,
hotField2 : "",
coldField1 : 0,
...
coldFieldN : ""
}
In this scope, cold properties are written once and accessed occasionally; hot properties are written and then accessed/updated fairly often (but in different use cases; they are not the same sub-document or parts of the same object).
The number of documents is fairly large (1M and more), and the hot data is at least ten times smaller than the cold data.
Since partial update is still the most wanted, yet not implemented, feature, the only way to update hotField1 is:
Request full document
Change either hotField1 or hotField2
Write back whole document
This is costly in terms of RUs and doesn't scale well.
So the question is: how should such data and calls be organized in DocumentDB to minimize costs?
Discovered alternatives:
1. Obviously best: retrieve one property, change it, update it. Not yet possible.
2. Split into two collections and use stored procedures to retrieve from the Main collection, then from the Dictionary?
3. Put hotFields 1-2 in a subdocument ({ sub: {hf1:0, hf2:""}}) and somehow update only that? (I'm not sure whether this is possible.)
PS. C# is in the tags because that is the client library we use. If it lacks something, it's OK to use the REST interface instead.

While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With fewer properties, less bandwidth would be consumed during updates, and there would be less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings over storing everything in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.

Document-based NoSQL stores replace the whole document whenever you change one or all of its properties.
Cost is calculated on a per-collection basis.
So, if you have a DB with two collections in it, each with a performance tier of S1 (i.e., $25/month):
$25 x 2 = $50
If you need better performance and change one of them to S2, you'll be charged:
$50 + $25 = $75

API - filter big list with word fragment

I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the word fragment. I see two approaches here. The first approach is to load the data into a cache (HttpRuntimeCache, Redis or something like this) to reduce loading time, and filter with LINQ. But I think there will be problems with the time required for serializing/deserializing. Another approach: for instance, I have a pair 22:some title here, so I need to provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
Update: I've changed my mind. I don't want to query the database, because requests happen quite often. So now I think the best solution is:
load the entire list into memory
build a trie structure that keeps a hashset of values in each node
in the case of one text fragment, just return the hashset from the trie node; in the case of several fragments, find all the hashsets and take their intersection

You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, '"bla*"');
This looks for words starting with "bla" anywhere in the string, so it also matches the string "Monkeys blabla". (In SQL Server, the prefix term has to be wrapped in double quotes for the asterisk to act as a wildcard.)

I don't really understand your question, but if you want to query any table you can do so, since you already have the query string. You can try this out:
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.

Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

XMLSerialized Object in Database Field. Is it good design?

Suppose I have one table that holds Blogs.
The schema looks like :
ID (int)| Title (varchar 50) | Value (longtext) | Images (longtext)| ....
In the field Images I store an XML-serialized list of images that are associated with the blog.
Should I use another table for this purpose?

Yes, you should put the images in another table. Having several values in the same field indicates denormalized data and makes it hard to work with the database.
As with all rules, there are exceptions where it makes sense to put XML with multiple values in one field in the database. The first rule is that:
The data should always be read/written together. No need to read or update just one of the values.
If that is fulfilled, there can be a number of reasons to put the data together in one field:
Storage efficiency, if space has proved to be a problem.
Retrieval efficiency, if performance has proved to be a problem.
Schema flexibility, where one XML field can eliminate tens or hundreds of different tables.

I would certainly use another table. If you use XML, what happens when you need to go through and update the references to all images? (Would you rather just do an Update blog_images Set ..., or parse through the XML for each row, make the update, then re-generate the updated XML for each?)

Well, it is a bit "inner platform", but it will work. A separate table would allow better image querying, although on some RDBMS platforms this could also be achieved via an XML-type column and SQL/XML.
If this data only has to be opaque storage, then maybe. However, keep in mind you'll generally have to bring back the entire XML to the app-tier to do anything interesting with it (or: depending on platform, use SQL/XML, but I advise against this, as the DB isn't the place to do such processing in most cases).
My advice in all other cases: separate table.

That depends on whether you'd need to query on the actual image data itself. If you see a possible need to query on certain images, or images with certain attributes, then it would probably be best to store that image data in a different way.
Otherwise, leave it the way it is.
But remember, only include the fields in your SELECT when you need them.

Should I use another table for this purpose?
Not necessarily. You just have to ensure that you are not selecting the images field in your queries when you don't need it. But if you wanted to normalize your schema, you could use another table and perform a join when you need the images.

What is the most space efficient way to store an N-ary tree while preserving hierarchy traversal?

I read this paper.
But I'd love to avoid a ton of research to solve this problem if someone has already done it. I need this space efficient tree for a reasonably (conceptually) simple GUI control: TreeDataGridView with virtual mode
An example tree might look like:
(RowIndexHierarchy),ColumnIndex
(0),0
(0,0),0
(0,0,0),0
(0,0,0,0),0
(0,0,0,0,0),0
(0,0,0,1),0
(0,0,0,2),0
(0,0,1),0
(0,0,2),0
(0,0,2,0),0
(0,0,2,1),0
(0,0,2,2),0
(0,1),0
(0,2),0
(0,2,0),0
(0,2,0,0),0
(0,2,0,1),0
(0,2,0,2),0
(0,2,1),0
(0,2,2),0
(0,2,2,0),0
(0,2,2,1),0
(0,2,2,2),0
(1),0
I need operations like "find flat row index from row hierarchy" and "find row hierarchy from flat row index". Also, to support expand/collapse, I need "find next node with the same or less depth".

For a read-only tree, you can store the nodes in an array sorted by parent index.
0 a
1 (a/)b
1 (a/)c
2 (a/b/)d
2 (a/b/)e
2 (a/b/)f
3 (a/c/)c
Each time you need to find the child nodes, you can use binary search to find the lower and upper boundaries of the range of nodes with that parent.
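A minimal sketch of that lookup, assuming the first column of the listing above is the parent's 1-based node number (0 for the root, so node 1 is a, node 2 is b, and so on). The names ParentSortedTree and ChildRange are hypothetical.

```csharp
using System;

public static class ParentSortedTree
{
    // Returns the half-open range [Start, End) of array positions whose
    // parent number equals parentId; an empty range means no children.
    public static (int Start, int End) ChildRange(int[] parents, int parentId)
    {
        return (LowerBound(parents, parentId), LowerBound(parents, parentId + 1));
    }

    // First position whose value is >= key (classic lower-bound binary search).
    private static int LowerBound(int[] sorted, int key)
    {
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] < key) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

With the parent column from the example, new[] { 0, 1, 1, 2, 2, 2, 3 }, the children of node 2 (b) occupy positions 3 to 5, that is d, e and f. Each lookup costs two binary searches, so O(log n), and the structure needs only one integer per node beyond the payload.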

I'm not sure if I am following your needs exactly, but we access a database that has a tree-view UI. The tree runs from
Top Level (the user's company);
Direct Client Company;
Office Location;
Employee;
Indirect Client Company;
Proposal;
Specific Vendor Bid;
Detail Financials (invoices, adjustments, etc)
As we have thousands of direct clients and the tree branches pretty heavily at each tier, we don't load the entire data set at any time. Instead we only load Type, Guid, DisplayName and some administrative data for each child and load a "details" pane for the currently focused item. Unexplored paths through the tree simply don't exist in memory. To avoid loading the full lists of names, we have a "rolodex" level that just divides the dataset into batches of 100 records or fewer. (So "Z" stands alone, but "Sa-St" subdivides S.) These auto-insert when a subset grows beyond the 100-record threshold.
Periodically (when the system is idle) we check the loaded count and if it exceeds a threshold we drop the least recently used nodes until we are below the threshold.
The actual data access is done when the user navigates: we access the database and refresh the subset they are navigating through. We do have the advantage that Type determines the table to query (both for that level and the children) and thus we can keep the individual record types indexed and accessible.
This has given the user the ability to navigate through the data in any way they want, while keeping the in-memory retained data minimized. Of course we offer standard search modes and a menu of "recently used history" (for each type) as well, but often the work they do requires moving up and down a narrow chain of nodes, so the tree view keeps it all in front of them while working with a given client and the subset.
With that background, I become curious as to the nature of data that would have such undifferentiated levels that such a tier by tier data store wouldn't be appropriate. The advantage that tier by tier storage has is that all I need is the current node's Guid and I can search the child table on that as the foreign key (which is indexed, so quick to return the subset).
I guess this amounts to "unasking the question", but it seems that most tree structures have distinct data at each level, so it would seem far easier to work with something established (like a table query on an indexed field, which keeps the whole affair out of memory in the first place) than making a custom structure.
For example, I have never asked for the "next node at the current level" except within a given parent (because leaving a given parent takes me to another context). But within a parent I already have the children and their order.
Perhaps it is because of the space I'm in, but I find a tree control that knows how to bind to different tables based on parent->child relationships of tables more useful, which is why I wrote one. I also find lazy loading of data and aggressive dismissal of data keep the memory footprint minimized. Finally I find the programming model incredibly simple: just create a new "treenode" subclass for any table I want to access and make the treenode responsible for loading their children.
(Clarifying, due to question below:)
In my implementation each TreeNode is actually a SpecificTreeNode, derived from BaseNode, which in turn is derived from TreeNode. Being inherited from TreeNode, they can be used directly by the tree, but because they have overrides of the BaseNode properties such as LoadChildren and display properties, the display and retrieval of any given set of data is implied by the type of node (and the Guid that the item represents).
This means that as the user navigates the tree, the SpecificTreeNode generates the required ORM query on the fly. For performance, child tables have any parent IDs as indexes, so navigating down the tree (even by multiple layers, if using a SpecificTreeNode that does rollups) is just a quick index lookup.
In this way, I keep very little of the data in memory at any time, pulling only what we need from the database. Likewise, queries against the tree are converted to ORM queries against our database, pulling only the results and limiting the amount any query can pull (if you are using a Tree UI and you pull over 100 records at once, the UI isn't really the optimal place for whatever you are doing).
When your dataset is hundreds of GB in size, it seems the only reasonable recourse. The advantage I feel this has is that the Tree itself has no idea that different levels and paths render and query differently... it just asks the BaseNode (from its perspective) to do something, and the overrides on SpecificTreeNode actually do the lifting. Thus, the "data structure" is simply the way the tree works already combined with data queries on my tables and views.
(End of clarification.)
Meanwhile all the tree controls on the market seem to miss that and go with something far more complex.
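A rough sketch of the node design described in this answer, not the author's actual code. To stay self-contained it does not derive from the WinForms TreeNode, and QueryNode with its query delegate is a hypothetical stand-in for a SpecificTreeNode that would issue an ORM query keyed on the parent Guid.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical base class; in a WinForms app it would derive from TreeNode.
public abstract class BaseNode
{
    private List<BaseNode> _children; // stays null until the node is first expanded

    protected BaseNode(Guid id, string displayName)
    {
        Id = id;
        DisplayName = displayName;
    }

    public Guid Id { get; }
    public string DisplayName { get; }
    public bool IsLoaded => _children != null;

    // Lazy loading: the first access runs the node-specific query.
    public IReadOnlyList<BaseNode> Children
    {
        get
        {
            if (_children == null) _children = new List<BaseNode>(LoadChildren());
            return _children;
        }
    }

    // Aggressive dismissal: drop the subtree; it is re-queried on the next access.
    public void Unload() => _children = null;

    protected abstract IEnumerable<BaseNode> LoadChildren();
}

// Stand-in for a SpecificTreeNode; the delegate plays the role of the ORM query.
public sealed class QueryNode : BaseNode
{
    private readonly Func<Guid, IEnumerable<BaseNode>> _query;

    public QueryNode(Guid id, string displayName, Func<Guid, IEnumerable<BaseNode>> query)
        : base(id, displayName)
    {
        _query = query;
    }

    protected override IEnumerable<BaseNode> LoadChildren() => _query(Id);
}
```

The tree only ever talks to BaseNode, so each node type can decide on its own which table to query, and an idle-time sweep can call Unload on the least recently used nodes.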

The most space-efficient way to store a balanced N-ary tree is in an array... zero space-overhead! And actually very efficient to traverse ... just some simple math required to compute your parent index from your index... or your N children's indices from your index.
To find some code for this, look up heapsort... the heap structure (nothing to do with memory heap) is a balanced binary tree... and most people implement it in an array. Although it is binary, the same thing can be done for N-ary.
If your N-ary tree is not kept balanced, but tends to be fairly dense, then the array implementation will still be more space-efficient than most... the empty nodes being the only space overhead. However, if your trees are ever highly imbalanced, then the array implementation can be very space-inefficient.
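The "simple math" can be sketched as below, assuming a complete N-ary tree stored level by level with the root at index 0 (the usual heap layout). The names NaryArrayTree, Parent and Child are hypothetical.

```csharp
public static class NaryArrayTree
{
    // Index of the parent of the node at index, or -1 for the root.
    public static int Parent(int index, int arity) =>
        index == 0 ? -1 : (index - 1) / arity;

    // Index of the k-th child (k = 0 .. arity - 1) of the node at index.
    public static int Child(int index, int arity, int k) =>
        index * arity + 1 + k;
}
```

For arity 3, the root's children sit at indices 1 to 3 and node 1's children at 4 to 6. A slot whose computed index is outside the array, or is marked empty, is a missing node, which is where the space overhead for unbalanced trees comes from.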

Data structure for relationships

I am converting a VB6 application to C# and I want to make my data structure that holds values and relationships more efficient. In VB I have a collection of values and another collection of relationships between those values, with priorities for those relationships. I also have an algorithm that, when a set of values is passed to it, returns all relationships required to join those values together. For example, say the values collection contains 1-10 and the relationship collection contains
1,2
3,2
5,2
2,8
8,10
9,10
If the input was 1,9,10 the returned relationships would be --
1,2
2,8
8,10
9,10
Since there may be multiple paths, the fewest relationships would be returned, but there is a caveat of relationship priorities. If a relationship has a higher priority, then that relationship would be added and the rest of the relationships would be added from there. I am thinking of using a disjoint-set data structure, but I am not sure.
Any ideas?
More information --
The number of values would normally be less than 100 and the relationships less than 500. The collections are static and the algorithm will be used again and again to find paths. Also, I did not ask this but would the algorithm in Disjoint-set data structure be the most efficient?

It sounds like what you have is a Graph. This is a structure with Nodes and Edges. There are many many libraries and tools that deal with Graphs. Microsoft even put out a paper on how to deal with them. I think graphs are great and extremely useful in many situations.
One big benefit with graphs is the ability to assign priorities to the edges between the nodes. Then when you want to find the path between two nodes, boom, the graph can choose the path with the ideal priority.
In your situation, your values are the nodes and your relationships are the edges.
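As an illustration of that mapping, here is a small sketch that stores the relationships as an undirected graph and uses breadth-first search to find a path with the fewest relationships between two values. It ignores priorities entirely; handling them would mean putting weights on the edges and using a weighted search such as Dijkstra's algorithm. The names RelationshipGraph, AddRelationship and ShortestPath are hypothetical.

```csharp
using System.Collections.Generic;

public sealed class RelationshipGraph
{
    private readonly Dictionary<int, List<int>> _adjacent = new Dictionary<int, List<int>>();

    // Relationships are treated as undirected edges between two values.
    public void AddRelationship(int a, int b)
    {
        Link(a, b);
        Link(b, a);
    }

    private void Link(int from, int to)
    {
        if (!_adjacent.TryGetValue(from, out List<int> list))
        {
            list = new List<int>();
            _adjacent.Add(from, list);
        }
        list.Add(to);
    }

    // Breadth-first search: returns the values on a path with the fewest
    // relationships from start to goal, or an empty list if none exists.
    public List<int> ShortestPath(int start, int goal)
    {
        var cameFrom = new Dictionary<int, int> { [start] = start };
        var queue = new Queue<int>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            if (current == goal) break;
            if (!_adjacent.TryGetValue(current, out List<int> neighbours)) continue;
            foreach (int next in neighbours)
            {
                if (cameFrom.ContainsKey(next)) continue; // already visited
                cameFrom[next] = current;
                queue.Enqueue(next);
            }
        }

        var path = new List<int>();
        if (!cameFrom.ContainsKey(goal)) return path;
        for (int node = goal; node != start; node = cameFrom[node]) path.Add(node);
        path.Add(start);
        path.Reverse();
        return path;
    }
}
```

With the six relationships from the question loaded, the path from 1 to 9 runs through 2, 8 and 10, which corresponds to the relationships 1,2 / 2,8 / 8,10 / 9,10 listed as the expected result. With fewer than 100 values and 500 relationships, each search touches at most every edge once, so repeated queries stay cheap.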

You need to ask yourself (and tell us) what kind of pattern of use you expect. Do these relations get added in order or randomly, do your queries come in order (as you show them) or randomly, and is this essentially a batch process -- load them up, read off the queries -- or do you expect to do it "on line" in the sense that you may add some, then query some, then add some more and query some more?
Will you know how many you want to store beforehand, and how many do you expect to store? Dozens? Thousands? Tens of millions?
Here are some suggestions:
If you know beforehand how many you expect to store, it's not a really big number, you don't expect to add them after first loading up, there aren't any duplicates in the left-hand side of the pair, and they're reasonably "dense" in the sense that there aren't big gaps between numbers in the left-hand one of the pair, then you probably want an array. Insertion is O(1), access is O(1), but you can't have duplicate indices and expanding it after you build it is a pain.
If the number is really large, like > 10^8, you probably want some kind of database. Databases are relatively very slow -- 4 to 5 orders of magnitude slower than in-memory data structures -- but handle really big data.
If you have insertions after the first load, and you care about order, you're going to want some sort of tree, like a 2-3 tree. Insertion and access are both O(lg n). You'd probably find an implementation under a name like "ordered list". (I'm not a C# guy.)
Most any other case, you probably want a hash. Average insertion and access are both O(1), like an array; worst case [which you won't hit with this data] is O(n).
