My task is to load a new set of data (written in an XML file) and then compare it to the 'old' set (also in XML). All the changes are written to another file.
My program loads the new and old files into two DataSets, then row by row I compare the primary key from the new set with the old one. When I find a corresponding row, I check all fields; if any of them differ from the old ones, I write the row to a third set and finally write that set to a file.
Right now I use:
newDS.ReadXml("data.xml");
oldDS.ReadXml("old.xml");
and then I just find rows with a corresponding primary key and compare the other fields. It works quite well for small files.
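Roughly, the comparison step looks like this (the key column name is simplified here; in reality I have to work it out from the user-defined schema):

// using System.Data;
DataTable newTable = newDS.Tables[0];
DataTable oldTable = oldDS.Tables[0];
oldTable.PrimaryKey = new[] { oldTable.Columns["Id"] };    // ReadXml does not infer keys by itself

DataTable changes = newTable.Clone();                      // same columns, no rows
foreach (DataRow newRow in newTable.Rows)
{
    DataRow oldRow = oldTable.Rows.Find(newRow["Id"]);
    if (oldRow == null) { changes.ImportRow(newRow); continue; }   // record is new

    foreach (DataColumn col in newTable.Columns)
    {
        if (!Equals(newRow[col.ColumnName], oldRow[col.ColumnName]))
        {
            changes.ImportRow(newRow);                     // any field difference marks the row as changed
            break;
        }
    }
}
changes.WriteXml("changes.xml");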
The problem is that my files may be up to about 4GB each. If both my new and old data are that big, it is quite problematic to load 8GB of data into memory.
I would like to load my data in parts, but to compare I need the whole old data set (or is there a way to get a specific row with a corresponding primary key straight from the XML file?).
Another problem is that I don't know the structure of the XML file; it is defined by the user.
What is the best way to work with such big files? I thought about using LINQ to XML, but I don't know whether it has features that can help with my problem. Maybe it would be better to leave XML and use something different?
You are absolutely right that you should leave XML. It is not a good tool for datasets this size, especially if the dataset consists of many 'records' all with the same structure. Not only are 4GB files unwieldy, but almost anything you use to load and parse them is going to need even more memory than the size of the file itself.
I would recommend that you look at solutions involving an SQL database, but I have no idea how it can make sense to be analysing a 4GB file where you "don't know the structure [of the file]" because "it is defined by the user". What meaning do you ascribe to 'rows' and 'primary keys' if you don't understand the structure of the file? What do you know about the XML?
It might make sense, e.g., to read one file, store all the records with primary keys in a certain range, do the same for the other file, compare that data, then carry on. By segmenting the key space you make sure that you always find matches if they exist. It could also make sense to break your files into smaller chunks in the same way (although I still think XML storage at this size is usually inappropriate). Can you say a little more about the problem?
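To make the key-space segmentation concrete, something along these lines could work, streaming each file with XmlReader and only keeping the records whose keys fall into the current range (the element and attribute names here are invented, since the real structure is defined by the user):

// using System.Xml; using System.Collections.Generic;
static Dictionary<string, string> LoadKeyRange(string path, string from, string to)
{
    var records = new Dictionary<string, string>();
    using (var reader = XmlReader.Create(path))
    {
        while (reader.ReadToFollowing("record"))            // forward-only, so memory use stays flat
        {
            string key = reader.GetAttribute("id");
            if (key != null && string.Compare(key, from) >= 0 && string.Compare(key, to) < 0)
                records[key] = reader.GetAttribute("value"); // or whatever fields you compare
        }
    }
    return records;
}

// One pass per key range, e.g. keys starting with "A" through "M":
var newPart = LoadKeyRange("data.xml", "A", "N");
var oldPart = LoadKeyRange("old.xml", "A", "N");
// ...compare newPart against oldPart, write the differences, then move on to the next range.

Both files get read once per range, but each pass only holds a slice of the data in memory.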
Related
Given a large (74GB) XML file, I need to read specific XML nodes by a given alphanumeric ID. It takes too long to read from top to bottom of the file looking for the ID.
Is there an analogue of an index for XML files like there is for relational databases? I imagine a small index file where the alphanumeric ID is quick to find and points to the location in the larger file.
Do index files for XML exist? How can they be implemented in C#?
XML databases such as BaseX, eXistDB, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.
Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.
If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect processing 75GB to take about an hour, depending, obviously, on the speed of your machine.
No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite DB once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend for huge files like this.
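A minimal sketch of that rewrite, using XmlReader (the .NET equivalent of SAX-style forward-only reading) and the Microsoft.Data.Sqlite package; the element and attribute names are placeholders for whatever your document actually contains:

// using System.Xml; using Microsoft.Data.Sqlite;
using (var conn = new SqliteConnection("Data Source=data.db"))
{
    conn.Open();
    using (var create = new SqliteCommand(
        "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, xml TEXT)", conn))
        create.ExecuteNonQuery();

    using (var tx = conn.BeginTransaction())
    using (var insert = new SqliteCommand(
        "INSERT INTO items (id, xml) VALUES ($id, $xml)", conn, tx))
    using (var reader = XmlReader.Create("huge.xml"))
    {
        var pId  = insert.Parameters.Add("$id",  SqliteType.Text);
        var pXml = insert.Parameters.Add("$xml", SqliteType.Text);

        bool more = reader.ReadToFollowing("item");
        while (more)
        {
            pId.Value  = reader.GetAttribute("id");
            pXml.Value = reader.ReadOuterXml();             // also advances past this element
            insert.ExecuteNonQuery();
            // ReadOuterXml may already have left us on the next <item>, so don't skip it
            more = (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                   || reader.ReadToFollowing("item");
        }
        tx.Commit();
    }
}

After that, a lookup is just SELECT xml FROM items WHERE id = ..., which the primary key index makes fast.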
Theoretically, you could create a sort-of index by remembering the locations of already discovered IDs and then parsing on your own, but that would be very brittle. XML is not simple enough for you to parse it by hand and still hope to be standards compliant.
Of course, I am assuming here that you can't do anything about the larger design itself: as others noted, the size of that file suggests that there is an architectural problem.
I am creating a very simple database in C# which I use to store playlists and an overview of all my music. Since I want to make this C-compatible in the future, I plan to make it completely text based. The idea is that every text file is a table, and the contents are in JSON format, where every line of text is a record.
I don't want to have loose files for each database, so I was thinking about something like a zip file. I don't want to extract and compress every time I access a file. Is there some way I can use a stream reader/writer in C# on different files while Windows only sees one file?
I'm not completely convinced that this is the way to go. So I'm open to suggestions.
Update: I'm currently messing around with the "Local database" item in C#. I never paid any attention to it before. It could very well be the solution.
Update 2: SQLite seems to be very simple. I have some experience with MySQL from past PHP projects, so that will give me a head start.
You want to use a file as a container holding different files? If so, there are a lot of ways to accomplish this. These are techniques I have used in the past:
Zip:
A compressed archive such as Zip behaves exactly that way and can be used as a solution here. It can store virtual files, which can vary in size up to at least 1 gigabyte (tested, but I currently don't know whether there are implementation-based size limits).
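For example, with System.IO.Compression you can read and write individual entries through streams without ever extracting the archive to disk (a rough sketch; file and entry names are just examples):

// using System; using System.IO; using System.IO.Compression;
// write (or replace) one "table"
using (var archive = ZipFile.Open("music.zip", ZipArchiveMode.Update))
{
    var existing = archive.GetEntry("playlists.json");
    if (existing != null) existing.Delete();                // replace old content cleanly
    var entry = archive.CreateEntry("playlists.json");
    using (var writer = new StreamWriter(entry.Open()))
        writer.WriteLine("{\"name\":\"road trip\",\"tracks\":[]}");
}

// read it back later, still without extracting anything
using (var archive = ZipFile.Open("music.zip", ZipArchiveMode.Read))
using (var reader = new StreamReader(archive.GetEntry("playlists.json").Open()))
    Console.WriteLine(reader.ReadToEnd());

Keep in mind that entries modified in Update mode are buffered in memory until the archive is disposed, so this works best for reasonably small tables.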
SQLite:
SQLite sounds old-school, but it stores everything database-related in one physical file. Creating a database with a table for each virtual file should do the trick. This approach is useful if you know that your virtual files won't be very large or hit any limits of SQLite's field data types. Since your virtual files are going to be lines of text, you may be able to turn them into attributes and tuples. That way you can even use SQL statements to query and filter your data as you wish.
There are still more ways to implement that kind of container format on your own, but it would probably take more time and work than it is worth. Stay tuned for better ideas and maybe ready-to-use implementations :-)
Will you ever need to search through your data? Then use a real database engine; in C# the built-in local database file is the simplest choice (if you are familiar with SQL).
The zip file is a good choice for space and compactness (a single file instead of many files) but it is very slow: for each database operation the whole zip file has to be reorganized. Even a tar file (without compression) needs continuous reallocation when the content changes, and a zip file needs extra computation and relocation on top of that.
If you want something that is compressed and still standard, you can use OpenXML (ods or xlsx, it does not matter) to store your data, but the save operation will be slow and will get slower as your database grows.
I want to write a small app that manages file tags for my personal files. It's going to be pretty straightforward, but I am not sure whether I should be storing filenames for each unique tag, i.e.:
"sharp":
file0.ext file1.ext file2.ext file3.ext
"cold":
file1.ext file2.ext
"ice":
file3.ext
Or whether I should be storing tags for each filename, i.e.:
file0.ext:
"sharp"
file1.ext:
"sharp" "cold"
file2.ext:
"sharp" "cold"
file3.ext:
"sharp" "ice"
I want to use the method that will give me the best performance and/or the best design. Since I have never done anything like this, the method I think is right might not be optimal.
Just to give more info about the app:
I will search files by tag. All I need is to be able to type my tags so I can see which files match, and double click to open them, etc.
I will use protocol buffers (Marc Gravell's protobuf-net) to save and load the database.
Database size is not important as I will use it on my PC.
I don't think I will ever have more than 50K files. Most likely I will have 20K max as these are mostly personal files so it's not possible for me to create/collect more than that.
EDIT: I forgot to mention another feature. Since this will be the same app used to define tags for files, when I select a file I need it to load all the tags that file has, so I can show them in case I want to edit them.
It all depends on how you want to search the data... Since you say you want to search files by tag, your first method will be the simplest, since you will only need to read a small part of the data file.
If you really wanted to keep it simple, you could have a separate data file for each tag (e.g. sharp.txt, cold.txt, ice.txt) and then just keep a list of filenames in each file.
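A rough sketch of that idea (paths and names just for illustration):

// using System.IO;
// tags/sharp.txt, tags/cold.txt, ... each holding one filename per line
static void AddTag(string file, string tag)
{
    Directory.CreateDirectory("tags");
    File.AppendAllText(Path.Combine("tags", tag + ".txt"), file + Environment.NewLine);
}

static string[] FilesWithTag(string tag)
{
    string path = Path.Combine("tags", tag + ".txt");
    return File.Exists(path) ? File.ReadAllLines(path) : new string[0];
}

Looking up a tag then only touches one small file.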
If you're searching by tag, that seems like the more appropriate index. You may incur some performance penalty for finding all tags on a file if that's something you need to do.
Alternatively, if you do want to support both scenarios: store both, and you can query whichever one you need. This creates some data duplication and you'll need extra logic to update both data sets when a file is changed or added, but it should be pretty straightforward.
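Something like this, as a sketch (the names are illustrative; serialize the two dictionaries with protobuf-net or whatever you prefer):

// using System.Collections.Generic;
class TagIndex
{
    // tag -> files and file -> tags; keeping both makes both lookups cheap
    public Dictionary<string, HashSet<string>> FilesByTag = new Dictionary<string, HashSet<string>>();
    public Dictionary<string, HashSet<string>> TagsByFile = new Dictionary<string, HashSet<string>>();

    public void Add(string file, string tag)
    {
        if (!FilesByTag.ContainsKey(tag)) FilesByTag[tag] = new HashSet<string>();
        if (!TagsByFile.ContainsKey(file)) TagsByFile[file] = new HashSet<string>();
        FilesByTag[tag].Add(file);
        TagsByFile[file].Add(tag);
        // remember to remove from both maps as well when a tag is taken off a file
    }
}

At 20-50K files the duplication costs almost nothing, and both "files for this tag" and "tags for this file" become single dictionary lookups.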
In case you have a lot of tags, a lot of files, and a lot of relations, I would suggest using a relational database. If you don't have a lot of data, I don't think you need to worry about it.
Anyway, I suppose that even if you do want to save the relations in plain text files, the same principles as in database normalization apply. The main goal is to avoid data repetition. In your model, a tag and a file have a many-to-many relation. I would imitate the structure of a relational database, even if the data is stored in plain text files. I would have one file holding the filenames, one ID per filename, and another file holding the tags, one ID per tag. A third file would contain the relationships. Simple, and it keeps the files to a minimum size.
Hope I helped!
Ok, I have searched about this and read a few points of view about storing binary data in a [MySQL] database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.
However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.
I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage into this system. Again, it works well - I chop files into 64KB BLOBs and store them in a table with a file_id reference (linked to a separate table which contains metadata such as file name/size/MIME type).
This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
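For reference, the chunking side is roughly this (table and column names simplified, using the MySql.Data connector; fileId, path and connectionString come from the surrounding code):

// using System; using System.IO; using MySql.Data.MySqlClient;
using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    var cmd = new MySqlCommand(
        "INSERT INTO file_blobs (file_id, seq, data) VALUES (@file_id, @seq, @data)", conn);
    cmd.Parameters.Add("@file_id", MySqlDbType.Int32).Value = fileId;
    var seq  = cmd.Parameters.Add("@seq",  MySqlDbType.Int32);
    var data = cmd.Parameters.Add("@data", MySqlDbType.Blob);

    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[64 * 1024];
        int read, chunk = 0;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            byte[] piece = buffer;
            if (read < buffer.Length) { piece = new byte[read]; Array.Copy(buffer, piece, read); }
            seq.Value = chunk++;
            data.Value = piece;
            cmd.ExecuteNonQuery();      // each 64KB piece can be pushed whenever a connection is available
        }
    }
}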
So far I have not found any issues with this, and have successfully imported and transferred over 1GB of data in both directions (across about 10-15 files / 16,000 rows), but I worry about its scalability - will it slow down once there is 20GB+ of data in there, or can MySQL handle it provided my queries are well structured?
Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.
I would very much appreciate any views or comments on whether this is a good or bad approach, and whether I have missed any obvious problems I'm likely to see once it is used in a production environment.
edit: I forgot to mention, the file sizes could range from 1KB to ~1GB
[Rough] Conclusion
Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.
In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an OK solution (I still can't help wondering why they bother including the BLOB types, though).
As an alternative, I am torn between Nick Coons' file-system approach and tadman's suggestion of a hybrid using a lightweight key/value database engine such as LevelDB. Provided the practicalities of using LevelDB in this project are not an issue, this is most likely the approach I will work towards.
I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.
That being said, and for those who are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15GB of binary data without any noticeable negative side effects from inserting into/retrieving from large tables (with careful queries). However, I am certain this is still very inefficient, and either of the alternative methods mentioned will be significantly better.
I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.
The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there's many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite because an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.
That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.
Short Answer:
I'm not sure there's a hard and fast way to answer this. You mentioned files being from 1KB to 1GB. I wouldn't store binary data in a DB if it's going to be anywhere near 1KB, let alone 1GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially data that doesn't need to be searched, should be stored in the filesystem:
When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.
Example Implementation:
There are better methods to distribute file data than to put it into a DB, such as distributed filesystems (look into GlusterFS or MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not).
Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:
./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7
That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
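In C#, deriving that path might look something like this (a rough sketch, not exact production code):

// using System; using System.IO; using System.Security.Cryptography;
static string Store(string storageRoot, byte[] contents)
{
    string hash;
    using (var sha1 = SHA1.Create())
        hash = BitConverter.ToString(sha1.ComputeHash(contents)).Replace("-", "").ToLowerInvariant();

    // ./98/a7/98a75af5... -- two directory levels keep any single directory from getting huge
    string dir = Path.Combine(storageRoot, hash.Substring(0, 2), hash.Substring(2, 2));
    Directory.CreateDirectory(dir);

    string path = Path.Combine(dir, hash);
    if (!File.Exists(path))                                 // identical content is only stored once
        File.WriteAllBytes(path, contents);

    return hash;                                            // this hash goes into the metadata table
}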
Then I create a DB table with these columns to hold the metadata:
file_id, an auto_increment field
created, a field with a default value of current_timestamp
prev_id, more on this below
hash, the SHA1 hash on the filesystem
name, a textual name of the file (such as the original name that the file would have taken on disk).
When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.
If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk, I create a new one (because the new file contents would be represented by a new SHA1 hash), and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
If I need this to be available in a distributed fashion, I set up MySQL replication and then use GlusterFS to replicate the filesystem across multiple servers.
I think you will find a fair amount of debate on this as I did when I began looking into this. I tend to lean toward storing in the file system and maintaining a reference. However, that is not to say that there is never a time to store binary data in a database.
I would say that simply keeping things in sync is not, in itself, an argument for storing binary data in a database. There certainly are ways to keep file systems in sync so that, as the database is kept in sync, so is the file system.
The bottom line is that there is a fair amount of debate on this topic and you have to go with what works for you. If what you have set up works, use it. Do performance and load testing to make sure it holds up. If it doesn't, change it.
I have an idea for how to solve this problem, but I wanted to know if there's something easier and more extensible to my problem.
The program I'm working on has two basic forms of data: images, and the information associated with those images. The information associated with the images was previously stored in a JET database of extreme simplicity (four tables), which turned out to be both slow and incomplete in the fields it stored. We're moving to a new implementation of data storage. Given the simplicity of the data structures involved, I was thinking that a database was overkill.
Each image will have information of its own (capture parameters), will be part of a group of interrelated images (taken in the same thirty-minute period, say), and then part of a larger group altogether (taken of the same person). Right now, I'm storing people in a dictionary with a unique identifier. Each person then has a List of the different groups of pictures, and each picture group has a List of pictures. All of these classes are serializable, and I'm just serializing and deserializing the dictionary. Fairly straightforward stuff. Images are stored separately, so that the dictionary doesn't become astronomical in size.
The problem is: what happens when I need to add new information fields? Is there an easy way to setup these data structures to account for potential future revisions? In the past, the way I'd handle this in C was to create a serializable struct with lots of empty bytes (at least a k) for future extensibility, with one of the bytes in the struct indicating the version. Then, when the program read the struct, it would know which deserialization to use based on a massive switch statement (and old versions could read new data, because extraneous data would just go into fields which are ignored).
Does such a scheme exist in C#? Like, if I have a class that's a group of String and Int objects, and then I add another String object to the struct, how can I deserialize an object from disk, and then add the string to it? Do I need to resign myself to having multiple versions of the data classes, and a factory which takes a deserialization stream and handles deserialization based on some version information stored in a base class? Or is a class like Dictionary ideal for storing this kind of information, as it will deserialize all the fields on disk automatically, and if there are new fields added in, I can just catch exceptions and substitute in blank Strings and Ints for those values?
If I go with the dictionary approach, is there a speed hit associated with file read/writes as well as parameter retrieval times? I figure that if there's just fields in a class, then field retrieval is instant, but in a dictionary, there's some small overhead associated with that class.
Thanks!
SQLite is what you want. It's a fast, embeddable, single-file database that has bindings for most languages.
With regard to extensibility, you can store your models with default attributes and then have a separate table of attribute extensions for future changes.
A year or two down the road, if the code is still in use, you'll be happy that 1) other developers won't have to learn a customized code structure to maintain the code, 2) you can export, view, and modify the data with standard database tools (there's an ODBC driver for SQLite files and various query tools), and 3) you'll be able to scale up to a full database with minimal code changes.
Just a wee word of warning: SQLite, Protocol Buffers, mmap, et al. are all very good, but you should prototype and test each implementation and make sure you're not going to hit the same perf issues or different bottlenecks.
The simplest option may be just to upsize to SQL Server Express (you may be surprised at the perf gain) and fix whatever's missing from the present database design. Then, if perf is still an issue, start investigating these other technologies.
My brain is fried at the moment, so I'm not sure I can advise for or against a database, but if you're looking for version-agnostic serialization, you'd be a fool to not at least check into Protocol Buffers.
Here's a quick list of implementations I know about for C#/.NET:
protobuf-net
Proto#
jskeet's dotnet-protobufs
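As a rough illustration of why this solves the versioning problem, here is what a protobuf-net contract might look like (the class and its fields are made up): fields are identified by number, so an old reader simply ignores numbers it doesn't know, and a new reader gets a default value when an old file doesn't contain them.

// using System.IO; using ProtoBuf;
[ProtoContract]
public class CaptureInfo
{
    [ProtoMember(1)] public string Camera { get; set; }
    [ProtoMember(2)] public int Iso { get; set; }
    [ProtoMember(3)] public string Lens { get; set; }
    [ProtoMember(4)] public string AddedInVersion2 { get; set; }   // safe to add later
}

var info = new CaptureInfo { Camera = "X100", Iso = 200 };
using (var file = File.Create("capture.bin"))
    Serializer.Serialize(file, info);
using (var file = File.OpenRead("capture.bin"))
    info = Serializer.Deserialize<CaptureInfo>(file);

No version switch, no reserved padding bytes.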
There's a database schema, whose name I can't remember, that can handle this sort of situation. You basically have two tables. One table stores the variable name, and the other stores the variable value. If you want to group the variables, add a third table that has a one-to-many relationship with the variable-name table. This setup has the advantage of letting you keep adding different variables without having to keep changing your database schema. It has saved my bacon quite a few times when dealing with departments that change their minds frequently (like Marketing).
The only drawback is that the variable value table will need to store the actual value as a string column (varchar or nvarchar actually). Then you have to deal with the hassle of converting the values back to their native representations. I currently maintain something like this. The variable table currently has around 800 million rows. It's still fairly fast, as I can still retrieve certain variations of values in under one second.
I'm no C# programmer but I like the mmap() call and saw there is a project doing such a thing for C#.
See Mmap
Structured files perform very well when tailored to a specific application, but they are difficult to manage and are hardly a reusable code resource. A better solution is a virtual-memory-like implementation:
Up to 4 gigabytes of information can be managed.
Space can be optimized to the real data size.
All the data can be viewed as a single array and accessed with read/write operations.
No need to define a structure before storing; just use and store.
Can be cached.
Is highly reusable.
So go with SQLite, for the following reasons:
1. You don't need to read/write the entire database from disk every time
2. Much easier to add to even if you don't leave enough placeholders at the beginning
3. Easier to search based on anything you want
4. Easier to change data in ways beyond what the application was originally designed for
Problems with the Dictionary approach:
1. Unless you made a smart dictionary you need to read/write the entire database every time (unless you carefully design the data structure it will be very hard to maintain backwards compatibility)
    a) If you did not leave enough placeholders, bye-bye backwards compatibility
2. It appears as if you'd have to do a linear search through all the photos in order to search on one of the capture attributes
3. Can a picture be in more than one group? Can a picture be under more than one person? Can two people be in the same group? With dictionaries these things can get hairy....
With a database table, if you get a new attribute you can just say ALTER TABLE Picture ADD <attribute> <datatype>. Then, as long as you don't make a rule saying the attribute has to have a value, you can still load and save older versions. At the same time, the newer versions can use the new attributes.
Also you don't need to save the picture in the database. You could just store the path to the picture in the database. Then when the app needs the picture, just load it from a disk file. This keeps the database size smaller. Also the extra seek time to get the disk file will most likely be insignificant compared to the time to load the image.
Probably your table should be
Picture(PictureID, GroupID?, File Path, Capture Parameter 1, Capture Parameter 2, etc..)
If you want more flexibility you could make a table
CaptureParameter(PictureID, ParameterName, ParameterValue) ... I would advise against this because it is a lot less efficient than just putting them in one table (not to mention the queries to retrieve/search the Capture Parameters would be more complicated).
Person(PersonID, Any Person Attributes like Name/Etc.)
Group(GroupID, Group Name, PersonID?)
PersonGroup?(PersonID, GroupID)
PictureGroup?(GroupID, PictureID)