Given a large (74GB) XML file, I need to read specific XML nodes by a given Alphanumeric ID. It takes too long to read from top-to-bottom of the file looking for the ID.
Is there an analogue of an index for XML files, like there is for relational databases? I imagine a small index file where the alphanumeric ID is quick to find, and which points to the corresponding location in the larger file.
Do index files for XML exist? If so, how can they be implemented in C#?
XML databases such as BaseX, eXistDB, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.
Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.
If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect processing 75GB to take about an hour, depending, obviously, on the speed of your machine.
No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite DB once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend with huge files like this.
Theoretically, you could create a sort-of index by remembering the locations of already discovered IDs and then parsing on your own, but that would be very brittle. XML is not simple enough for you to parse it on your own and still hope to be standards-compliant.
Of course, I assume here that you can't do anything about the larger design itself: as others noted, the size of that file suggests an architectural problem.
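To make the SQLite suggestion concrete, here is a rough sketch of the once-per-change rewrite, streaming the big file with XmlReader rather than loading it. The element name "record", the "id" attribute, the file names, and the Microsoft.Data.Sqlite package are all assumptions to adapt to your actual document.

using System.Xml;
using Microsoft.Data.Sqlite;

// Target database: one row per addressable node, keyed by the alphanumeric ID.
using var conn = new SqliteConnection("Data Source=index.db");
conn.Open();
using (var create = conn.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)";
    create.ExecuteNonQuery();
}

// One transaction for the whole load keeps the inserts fast.
using var tx = conn.BeginTransaction();
using var insert = conn.CreateCommand();
insert.Transaction = tx;
insert.CommandText = "INSERT OR REPLACE INTO records (id, xml) VALUES ($id, $xml)";
var pId = insert.Parameters.Add("$id", SqliteType.Text);
var pXml = insert.Parameters.Add("$xml", SqliteType.Text);

// Stream the huge file forward-only; only one element is ever held in memory.
using (var reader = XmlReader.Create("huge.xml"))
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
        {
            string id = reader.GetAttribute("id");
            string xml = reader.ReadOuterXml();   // also advances past the element
            if (id != null)
            {
                pId.Value = id;
                pXml.Value = xml;
                insert.ExecuteNonQuery();
            }
        }
        else
        {
            reader.Read();
        }
    }
}
tx.Commit();

Looking up a node afterwards is then a single indexed SELECT by id instead of a scan of the 74GB file.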
I'm working on refactoring a document storage service's site to go from a proprietary storage system to SQL. Everything is going fairly well, but I need to find a way to search through our repository for specific strings of text. We use a multitude of different file types (.xls,.xlsx,.doc,.txt, etc). They're displayed to the user by first converting them to a PDF, via line-by-line rebuilding using PDFSharp.
The speed isn't a consideration for viewing/searching a single file, but I have concerns about scalability. I was able to make a functioning text search by copying and then hooking into our conversion process, but I am fairly sure that this will not work for searching through a customer's entire document list (thousands and thousands of documents). If these were all of a uniform file type, it might be easier to do, but they aren't.
Is there an efficient way to do this of which I am unaware?
EDIT: The documents are stored on the server and referenced via document URLs in the DB
My recommendation is to build an index, either in SQL or in a file: one that maps each file to all the search terms of interest it contains.
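As a hedged illustration of what such an index could look like, here is a minimal in-memory sketch; the AddDocument/Search names are made up, and it assumes you can obtain plain text for each document from your existing conversion step. The same shape maps directly to a two-column SQL table (term, document URL) with an index on the term column.

using System;
using System.Collections.Generic;

// Term -> set of document URLs containing that term.
var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

void AddDocument(string documentUrl, string extractedText)
{
    // Naive tokenization; a real indexer would also strip punctuation and stem words.
    foreach (var term in extractedText.Split(
        new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':' },
        StringSplitOptions.RemoveEmptyEntries))
    {
        if (!index.TryGetValue(term, out var docs))
            index[term] = docs = new HashSet<string>();
        docs.Add(documentUrl);
    }
}

IEnumerable<string> Search(string term) =>
    index.TryGetValue(term, out var docs) ? (IEnumerable<string>)docs : Array.Empty<string>();

For a large repository, SQL Server's built-in full-text indexing over a table of extracted text is also worth evaluating before rolling your own.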
I am writing a C# / VB program that is to be used for reporting data based upon information received in XMLs.
My situation is that I receive many XMLs per month (about 100-200), each ranging in size from 10MB to 350MB. For each of these XMLs, I only need a small subset of its data (less than 5% of any one file's entire data) to produce the necessary reports.
Also, that subset of data will always be held in the same key structure (it may exist within multiple keys and at differing levels down, but it will always exist within the same key names, and the keys containing it will always have the same attributes, such as "name", etc.).
So, my current idea of how to go about doing this is to:
To create a "scraper" that will pull the necessary data from the XMLs using XPath.
Store that small subset of necessary data in a SQL Server table along with file characteristic data stored in a separate table so as to know which file this scraped data came from
Query out the data into a program for reporting it.
My main question here is really what is the best way to scrape that data out?
I am most familiar with XPath, but for multiple files of 200MB in size, I'm afraid of performance issues loading in the entire file.
Other things I have seen / researched are:
Creating an XSLT file to transform / pull from the XML only the data I want
Using Linq to XML
Somehow linking the XMLs to SQL server and then being able to query them directly
Using ADO to query the XMLs from within the program
Doing it using the XMLReader class (rather than loading in each XML entirely)
Maybe there is a native .Net component that does this very well already
Quite honestly, I just have no clue what the standard approach is, given the high number of XMLs and the large variance in file sizes. I'm also not familiar with any of the other ways of doing this - such as linking the XMLs to SQL Server directly or using ADO to query the XML - and therefore don't know their possible benefits/drawbacks.
If any of you have been in a similar situation, I'd really appreciate any kind of pointers in the right direction / at least validation that my method isn't the worst one out there :)
Thanks!!!
As for the memory consumption and performance concerns, a nice feature of the .NET XML APIs is that you can combine XmlReader with XPathDocument, XmlDocument, or XElement to selectively read only part of a document into memory, and then have the XPath or LINQ to XML features available on that part. LINQ to XML has XNode.ReadFrom (http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom%28v=vs.110%29.aspx) for doing that; DOM/XmlDocument has XmlDocument.ReadNode (http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.readnode%28v=vs.110%29.aspx). So, depending on your XML structure, you might be able to use an XmlReader to read forward through the XML quickly without consuming much memory and then, when you reach an element you are interested in, read it into an XElement (LINQ to XML) or XmlNode (DOM) and apply LINQ to XML and/or XPath to pull out the details.
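To make that concrete, here is a sketch of the XmlReader plus XNode.ReadFrom combination; the element name "item", the "name" attribute, and the file path are placeholders for whatever your XMLs actually contain.

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

IEnumerable<XElement> StreamElements(string path, string elementName)
{
    using (XmlReader reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // ReadFrom materializes just this element and leaves the reader
                // positioned after it, so memory stays proportional to one element.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// Usage: pull only the values you need from each element as it streams by,
// then write them to your SQL Server staging table.
foreach (var item in StreamElements("input.xml", "item"))
{
    var name = (string)item.Attribute("name");
    // Further LINQ to XML (or XPath via System.Xml.XPath) queries work on 'item' here.
}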
I want to write a small app that manages file tags for my personal files. It's gonna be pretty straightforward but I am not sure if I should be storing filenames for each unique tag, i.e.:
"sharp":
file0.ext file1.ext file2.ext file3.ext
"cold":
file1.ext file2.ext
"ice":
file3.ext
Or if I should be storing tags for each file name, i.e.:
file0.ext:
"sharp"
file1.ext:
"sharp" "cold"
file2.ext:
"sharp" "cold"
file3.ext:
"sharp" "ice"
I want to use the method that will give me the best performance and/or best design. Since I never did anything like this, the method I think is right might not be optimal.
Just to give more info about the app:
I will search files by tag. All I need is to be able to type my tags so I can see which files match, and double click to open them, etc.
I will use protobuffers (Marc's version) to save and load the database.
Database size is not important as I will use it on my PC.
I don't think I will ever have more than 50K files. Most likely I will have 20K max as these are mostly personal files so it's not possible for me to create/collect more than that.
EDIT: I forgot to mention another feature. Since this will also be the app used to define tags for files, when I select a file I need to load all the tags that file has, so I can show them in case I want to edit them.
It all depends on how you want to search the data... Since you say that you want to search files by tag, your first method will be the simplest, since you will only need to read a small part of the data file.
If you really wanted to be simple, you could have a separate data file for each tag (i.e. sharp.txt, cold.txt, ice.txt) and then just have a list of filenames in the file.
If you're searching by tag, that seems like the more appropriate index. You may incur some performance penalty for finding all tags on a file if that's something you need to do.
Alternatively if you do want to support either scenario: store both, and you can query on them as needed. This creates some data duplication and you'll need extra logic to update both data sets when a file is changed/added, but it should be pretty straightforward.
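A minimal sketch of the store-both-directions idea, assuming plain in-memory collections; the AddTag/FilesFor/TagsFor names are invented, and saving/loading via protobuf-net is left out.

using System;
using System.Collections.Generic;

// Two maps kept in sync: tag -> files for searching, file -> tags for editing.
var filesByTag = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
var tagsByFile = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

void AddTag(string file, string tag)
{
    if (!filesByTag.TryGetValue(tag, out var files))
        filesByTag[tag] = files = new HashSet<string>();
    files.Add(file);

    if (!tagsByFile.TryGetValue(file, out var tags))
        tagsByFile[file] = tags = new HashSet<string>();
    tags.Add(tag);
}

// Search files by tag (the common case) ...
IEnumerable<string> FilesFor(string tag) =>
    filesByTag.TryGetValue(tag, out var f) ? (IEnumerable<string>)f : Array.Empty<string>();

// ... and list all tags on a selected file (the editing case from the EDIT above).
IEnumerable<string> TagsFor(string file) =>
    tagsByFile.TryGetValue(file, out var t) ? (IEnumerable<string>)t : Array.Empty<string>();

At 20-50K files this fits comfortably in memory, so the duplication mostly costs a little bookkeeping when tags are added or removed.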
In case you have a lot of tags, a lot of files, and a lot of relations, I would suggest using a relational database. If you don't have a lot of data, I don't think you need to worry about it.
Anyway, I suppose that even if you do want to save the relations in plain text files, the same principles as database normalization apply. The main goal is to avoid data repetition. In your model, a tag and a file would have a many-to-many relation. I would imitate the structure of a relational database even if the data is stored in plain text files: I would have a file holding the filenames, one ID per filename, another file holding the tags, one ID per tag, and a third file containing the relationships. Simple, and it keeps the files to a minimum size.
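For illustration only, the same normalized layout could look like this in code; all IDs and names below are made up.

using System.Collections.Generic;

// One store per concept: filenames with IDs, tags with IDs, and ID pairs for the relations.
var files = new Dictionary<int, string> { [0] = "file0.ext", [1] = "file1.ext", [3] = "file3.ext" };
var tags  = new Dictionary<int, string> { [0] = "sharp", [2] = "ice" };
var links = new List<(int FileId, int TagId)> { (0, 0), (3, 0), (3, 2) };

// "Which files have the tag with ID 0?" -> scan the relation list, resolve file IDs.
var sharpFiles = new List<string>();
foreach (var (fileId, tagId) in links)
    if (tagId == 0)
        sharpFiles.Add(files[fileId]);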
Hope I helped!
My task is to load new set of data (which is written in XML file) and then compare it to the 'old' set (also in XML). All the changes are written to another file.
My program loads new and old file into two datasets, then row after row I compare primary key from the new set with the old one. When I find corresponding row, I check all fields and if there are differences with the old one, I write it to third set and then this set to a file.
Right now I use:
newDS.ReadXml("data.xml");
oldDS.ReadXml("old.xml");
and then I just find rows with corresponding primary keys and compare the other fields. It works quite well for small files.
The problem is that my files may be up to about 4GB each. If my new and old data are that big, it is quite problematic to load 8GB of data into memory.
I would like to load my data in parts, but to compare I need the whole old data set (or is there a way to get a specific row with a matching primary key directly from the XML file?).
Another problem is that I don't know the structure of the XML file; it is defined by the user.
What is the best way to work with such big files? I thought about using LINQ to XML, but I don't know if it has options that can help with my problem. Maybe it would be better to leave XML and use something different?
You are absolutely right that you should leave XML. It is not a good tool for datasets this size, especially if the dataset consists of many 'records' all with the same structure. Not only are 4GB files unwieldy, but almost anything you use to load and parse them is going to use even more memory overhead than the size of the file.
I would recommend that you look at solutions involving an SQL database, but I have no idea how it can make sense to be analysing a 4GB file where you "don't know the structure [of the file]" because "it is defined by the user". What meaning do you ascribe to 'rows' and 'primary keys' if you don't understand the structure of the file? What do you know about the XML?
It might make sense, e.g., to read one file, store all the records with primary keys in a certain range, do the same for the other file, compare that data, then carry on. By segmenting the key space you make sure that you always find matches if they exist. It could also make sense to break your files into smaller chunks in the same way (although I still think XML storage at this size is usually inappropriate). Can you say a little more about the problem?
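As a rough sketch of the key-space segmentation idea, assuming purely for illustration that each record is a <row> element with a "key" attribute, and that comparing the raw element text is enough (a real version would compare field by field, as you do now):

using System.Collections.Generic;
using System.Xml;

const int BucketCount = 16;

// Stream one file and keep only the rows whose key falls into the current bucket.
Dictionary<string, string> LoadBucket(string path, int bucket)
{
    var rows = new Dictionary<string, string>();
    using var reader = XmlReader.Create(path);
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
        {
            string key = reader.GetAttribute("key");
            string xml = reader.ReadOuterXml();   // also advances past the element
            if (key != null && (key.GetHashCode() & 0x7fffffff) % BucketCount == bucket)
                rows[key] = xml;
        }
        else
        {
            reader.Read();
        }
    }
    return rows;
}

// Each pass holds only roughly 1/16 of each file in memory instead of the whole 8GB.
for (int bucket = 0; bucket < BucketCount; bucket++)
{
    var oldRows = LoadBucket("old.xml", bucket);
    var newRows = LoadBucket("data.xml", bucket);
    foreach (var pair in newRows)
    {
        if (oldRows.TryGetValue(pair.Key, out var oldXml) && oldXml != pair.Value)
        {
            // Row exists in both files but changed: write it to the diff output here.
        }
    }
}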
I am creating an RSS reader as a hobby project, and I'm at the point where the user adds their own URLs.
I was thinking of two things.
A plaintext file where each url is a single line
SQLite, where I can have unique IDs and descriptions alongside the URL
Is the SQLite idea too much overhead, or is there a better way to do things like this?
What about an OPML file? It's XML, so if you need to store more data than the OPML specification supplies, you can always add your own namespace.
Additionally, importing and exporting from other RSS readers is all done via OPML, and there is often library support for it. If you're interested in having users switch, then you have to support OPML. Thanks to jamesh for bringing that point up.
Why not XML?
If you're dealing with RSS anyway, you may as well :)
Do you plan to store just URLs? Or do you plan to add data like last_fetch_time?
If it's just a simple URL list that your program will read line-by-line and download data from, store it in a file, or even better as a serialized object written to a file.
If you plan to extend it, add comments/time of last fetch, etc, I'd go for SQLite, it's not that much overhead.
If it's a single user application that only has one instance, SQLite might be overkill.
You've got a few options as I see it:
SQLite / database layer. Increases the dependencies your code needs to run, but allows concurrent access.
Roll your own text parser. Complexity increases as you want to save more data and you're re-inventing the wheel. Less dependency and initially, while your data is simple, it's trivial for a novice user of your application to edit.
Use XML. It's well formed & defined and text editable. Could be overkill for storing just a URL though.
Use something like pickle to serialize your objects and save them to disk. Changes to your data structure means "upgrading" the pickle files. Not very intuitive to edit for a novice user, but extremely easy to implement.
I'd go with the XML text file option. You can use the XSD tool built into Visual Studio to create a DataTable out of the XML data, and it easily serializes back into the file when needed.
The other caveat is that I'm sure you're going to want the end user to be able to categorize their RSS feeds and potentially search/sort them, and having that kind of DataTable structure will help with this.
You'll get easy file storage and access, the benefit of a "database" structure, but not quite the overhead of SQLite.
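A small sketch of the DataTable-backed XML file idea; the column names here are invented, and a typed DataSet generated with the XSD tool would normally replace this ad-hoc table definition.

using System.Data;

var feeds = new DataTable("Feed");
feeds.Columns.Add("Id", typeof(int));
feeds.Columns.Add("Url", typeof(string));
feeds.Columns.Add("Category", typeof(string));

feeds.Rows.Add(1, "https://example.com/rss.xml", "News");

// Persist with the schema inline so it can be read back without extra code.
feeds.WriteXml("feeds.xml", XmlWriteMode.WriteSchema);

// Later: reload, then filter/sort it like a tiny database.
var loaded = new DataTable();
loaded.ReadXml("feeds.xml");
DataRow[] news = loaded.Select("Category = 'News'", "Url ASC");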