Search multiple XML files for string - c#

I have a folder with 400k+ XML documents and many more to come, each file named 'ID'.xml, and each belonging to a specific user. In a SQL Server database I have the 'ID' from the XML file matched with a userID, which is how I connect the XML document to the user. A user can have any number of XML documents attached (but let's say a maximum of around 10k documents).
All XML-documents have a few common elements, but the structure can vary a little.
Now, each user will need to search the XML documents belonging to her, and what I've tried so far (looping through each file and reading it with a StreamReader) is too slow. I don't care whether it reads and matches the whole file, attributes and all, or just the text in each element. What should be returned in the first place is a list of the IDs from the filenames.
What is the fastest and smartest method here, if any?

I think LINQ-to-XML is probably the direction you want to go.
Assuming you know the names of the tags that you want, you would be able to do a search for those particular elements and return the values.
var xDoc = XDocument.Load("yourFile.xml");
var result = from dec in xDoc.Descendants()
             where dec.Name == "tagName"
             select dec.Value;
result would then contain an IEnumerable of the values of any XML tags that have a name matching "tagName"
The query could also be written like this:
var result = from dec in xDoc.Descendants("tagName")
             select dec.Value;
or this:
var result = xDoc.Descendants("tagName").Select(tag => tag.Value);
The output would be the same; it is just a different way of filtering based on the element name.
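Put together for your case, a minimal sketch might look like this (it assumes you already have the user's file IDs from the database; the FindMatchingDocuments name, folder path, and the "match any element text" check are illustrative placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static List<string> FindMatchingDocuments(IEnumerable<string> userFileIds,
                                          string folder, string searchText)
{
    var matches = new List<string>();
    foreach (string id in userFileIds)
    {
        string path = Path.Combine(folder, id + ".xml");
        if (!File.Exists(path))
            continue;

        var xDoc = XDocument.Load(path);

        // Match against the text of every element; narrow this to specific
        // tag names if you know which elements matter.
        bool hit = xDoc.Descendants()
                       .Any(e => e.Value.IndexOf(searchText,
                                 StringComparison.OrdinalIgnoreCase) >= 0);
        if (hit)
            matches.Add(id);
    }
    return matches;
}

This still opens every one of the user's files, so it mainly tidies up the per-file check; the indexing suggestions below are what actually remove the per-search file scan.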

You'll have to open each file that contains relevant data, and if you don't know which files contain it, you'll have to open every file that might match. So the only performance gain will be in the parsing routine.
When parsing XML, if speed is the requirement, you could use XmlReader, as it performs far better than the other parsers (most read the entire XML file before you can query them). The fact that it is forward-only should not be a limitation in this case.
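As a rough sketch of that approach (the method name, file path handling, and case-insensitive match are placeholders, not a prescribed design):

using System;
using System.Xml;

static bool FileContains(string path, string searchText)
{
    var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
    using (var reader = XmlReader.Create(path, settings))
    {
        while (reader.Read())
        {
            // Only look at text and CDATA nodes; attributes could be checked too.
            if (reader.NodeType == XmlNodeType.Text || reader.NodeType == XmlNodeType.CDATA)
            {
                if (reader.Value.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true; // stop at the first hit
            }
        }
    }
    return false;
}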
If parsing takes about as long as the disk I/O, you could try parsing files in parallel, so one thread can wait for a file to be read while another parses the already loaded data. I don't think there is that big a win to be made there, though.
Also what is "too slow" and what is acceptable? Would this solution of many files become slower over time?

Use LINQ to XML.
Check out this article over at MSDN.
XDocument doc = XDocument.Load("C:\file.xml");
And don't forget that reading so many files will always be slow; you may want to try writing a multi-threaded program...
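If you go down that route, a minimal sketch of parallelising the per-file search with Parallel.ForEach could look like this (the folder path, element name, and search text are placeholders):

using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

var matchingIds = new ConcurrentBag<string>();

Parallel.ForEach(Directory.EnumerateFiles(@"C:\xmlfolder", "*.xml"), file =>
{
    var doc = XDocument.Load(file);
    if (doc.Descendants("tagName").Any(e => e.Value.Contains("searchText")))
        matchingIds.Add(Path.GetFileNameWithoutExtension(file)); // the file's ID
});

Note that with this many files the disk is likely the bottleneck, so the gain may be modest.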

If I understood correctly, you don't want to open each XML file for a particular user because it's too slow, whether you are using LINQ to XML or some other method.
Have you considered saving some values (tags) in both the XML file and the relational database, together with the XML ID?
In that case you could search for those values in the DB first and select only the XML files that contain the searched values.
for example:
ID, tagName1, tagName2
xmlDocID, value1, value2
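A rough sketch of that idea, querying such a lookup table first and then opening only the matching files (the table and column names, connection string, and the assumption that XmlDocId is stored as a string are all illustrative):

using System.Collections.Generic;
using System.Data.SqlClient;

// Narrow the candidate set in SQL first, then open only those XML files.
static List<string> GetCandidateDocIds(string connectionString, string searchValue)
{
    var ids = new List<string>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT XmlDocId FROM XmlTagIndex WHERE TagValue = @value", conn))
    {
        cmd.Parameters.AddWithValue("@value", searchValue);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                ids.Add(reader.GetString(0)); // use GetInt32 if the ID is an int
        }
    }
    return ids;
}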
My other question is: why have you chosen to store the XML documents in the file system? If you are using SQL Server 2005/2008, it has very good support for storing and searching through xml columns (and can even index some values in the XML).

Are you just looking for files that have a specific string in the content somewhere?
WARNING - Not a pure .NET solution. If this scares you, then stick with the other answers. :)
If that's what you're doing, another alternative is to get something like grep to do the heavy lifting for you. Shell out to it with the "-l" argument to specify that you are only interested in filenames, and you are onto a winner. (For more usage examples, see this link.)
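A sketch of shelling out to grep from C# (this assumes grep is installed and on the PATH; the folder and pattern are placeholders):

using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "grep",
    Arguments = "-l -r \"searchText\" \"C:\\xmlfolder\"", // -l: list matching file names only
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};

using (var process = Process.Start(psi))
{
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    // Each line of output is the path of a file containing the search text;
    // strip the directory and extension to get back to your IDs.
}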

L.B has already made a valid point.
This is a case where Lucene.Net (or any indexer) would be a must. It would give you steady (very fast) performance across all searches. Handling a very large amount of arbitrary data is one of the primary benefits of indexers.
Or is there any reason why you wouldn't use Lucene?
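Very roughly, and against the classic Lucene.Net 3.x API (class names and constructors vary between versions, and the index path, field names, and analyzer choice here are assumptions), indexing and searching could look something like this:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\luceneIndex"));
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

// Indexing: one Lucene document per XML file, storing only the file ID.
using (var writer = new IndexWriter(indexDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (string file in System.IO.Directory.EnumerateFiles(@"C:\xmlfolder", "*.xml"))
    {
        var doc = new Document();
        doc.Add(new Field("id", Path.GetFileNameWithoutExtension(file),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("content", File.ReadAllText(file),
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
}

// Searching: return the IDs of documents containing the term.
using (var searcher = new IndexSearcher(indexDir, true))
{
    var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
    var hits = searcher.Search(parser.Parse("searchText"), 100).ScoreDocs;
    foreach (var hit in hits)
    {
        string id = searcher.Doc(hit.Doc).Get("id");
    }
}

In practice you would also store the userID as a field so each user's search can be filtered to her own documents.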

Lucene.NET (and Lucene) support incremental indexing. If you can re-open the index for reading every so often, then you can keep adding documents to the index all day long -- your searches will be up-to-date with the last time you re-opened the index for searching.

Related

Get certain row by searching for a string

I am very new to C# and am trying to feel it out. Slow going so far! What I am trying to achieve should be relatively simple; I want to read a row from a CSV file with a search. I.e. if I search for username "Toby" it would fetch the entire row, preferably as an array.
Here is my users.csv file:
Id,Name,Password
1,flugs,password
2,toby,foo
I could post the code that I've tried, but I haven't even come close in previous attempts. It's a bit easier to do such a thing in Python, it may be easy in C# too but I'm far too new to know!
Does anyone have any ideas as to how I should approach/code this? Many thanks.
Easy to do in c# too:
var lineAsArray = File.ReadLines("path").First(s => s.Contains(",toby,")).Split(',');
If you want case insens, use e.g. Contains(",toby,", StringComparison.OrdinalIgnoreCase)
If your user is going to type in "Toby", you can either concatenate a comma onto the start/end of it to follow this simplistic searching (which will find Toby anywhere on the line), or you can split the line first and check whether the second element is Toby:
var lineAsArray = File.ReadLines("path").Select(s => s.Split(',')).First(a => a[1].Equals("toby"));
To make this one case insensitive, put a suitable StringComparison argument into the Equals using the same approach as above
The sky's the limit with how involved you want to get with it; using a library that parses CSV into objects that represent your lines with named, typed properties is probably where I'd stop. Take a look at CsvHelper from Josh Close or ServiceStack.Text, though there is no shortage of CSV parser libraries; it's been done to death!
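For completeness, a sketch with CsvHelper (the User class and file path are assumptions, and recent CsvHelper versions require a CultureInfo):

using System;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Password { get; set; }
}

// ...

using (var reader = new StreamReader("users.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    // The header row (Id,Name,Password) is mapped to the User properties by name.
    var toby = csv.GetRecords<User>()
                  .FirstOrDefault(u => u.Name.Equals("toby", StringComparison.OrdinalIgnoreCase));
}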

Reducing memory and increasing speed while parsing XML files

I have a directory with about 30 randomly named XML files. So the name is no clue about their content. And I need to merge all of these files into a single file according to predefined rules. And unfortunately, it is too complex to use simple stylesheets.
Each file can have up to 15 different elements within its root. So, I have 15 different methods that each take an XDocument as parameter and search for a specific element in the XML. It will then process that data. And because I call these methods in a specific order, I can assure that all data is processed in the correct order.
Example nodes are e.g. a list of products, a list of prices for specific product codes, a list of translations for product names, a list of countries, a list of discounts on product in specific country and much, much more. And no, these aren't very simple structures either.
Right now, I'm doing something like this:
List<XmlFileData> files = ImportFolder.EnumerateFiles("*.xml", SearchOption.TopDirectoryOnly).Select(f => new XmlFileData(f.FullName)).ToList();
files.ForEach(MyXml, FileInformation);
files.ForEach(MyXml, ParseComments);
files.ForEach(MyXml, ParsePrintOptions);
files.ForEach(MyXml, ParseTranslations);
files.ForEach(MyXml, ParseProducts);
// etc.
MyXml.Save(ExportFile.FullName);
I wonder if I can do this in a way that requires reading less into memory and generates the result faster. Speed is more important than memory, though; the current solution works, I just need something faster that uses less memory.
Any suggestions?
One approach would be to create a separate List<XElement> for each of the different data types. For example:
List<XElement> Comments = new List<XElement>();
List<XElement> Options = new List<XElement>();
// etc.
Then for each document you can go through the elements in that document and add them to the appropriate lists. Or, in pseudocode:
for each document
for each element in document
add element to the appropriate list
This way you don't have to load all of the documents into memory at the same time. In addition, you only do a single pass over each document.
Once you've read all of the documents, you can concatenate the different elements into your single MyXml document. That is:
MyXml = create empty document
Add Comments list to MyXml
Add Options list to MyXml
// etc.
Another benefit of this approach is that if the total amount of data is larger than will fit in memory, those lists of elements could be files. You'd write all of the Comment elements to the Comments file, the Options to the Options file, etc. And once you've read all of the input documents and saved the individual elements to files, you can then read each of the element files to create the final XML document.
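In code, a minimal sketch of the in-memory variant could look like this (the element names, folder, and output layout are placeholders; copying the elements detaches them so each source document can be garbage collected):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

var comments = new List<XElement>();
var printOptions = new List<XElement>();
// ... one list per element type

foreach (string file in Directory.EnumerateFiles(@"C:\import", "*.xml"))
{
    var doc = XDocument.Load(file);
    // Copy the elements of interest so the source document can be collected.
    comments.AddRange(doc.Root.Elements("Comment").Select(e => new XElement(e)));
    printOptions.AddRange(doc.Root.Elements("PrintOption").Select(e => new XElement(e)));
    // ... one AddRange per element type
}

// Concatenate the collected elements into the single output document.
var myXml = new XDocument(
    new XElement("Export",
        new XElement("Comments", comments),
        new XElement("PrintOptions", printOptions)));

myXml.Save(@"C:\export\merged.xml");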
Depending on the complexity of your rules, and how interdependent the data is between the various files, you could probably process each file in parallel (or at least certain chunks of it).
Given that the XDocuments aren't being changed during the read, you could most certainly gather your data in parallel, which would likely offer a speed advantage.
See https://msdn.microsoft.com/en-us/library/dd460693%28v=vs.110%29.aspx
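A rough sketch of gathering in parallel (the folder and element names are placeholders; the important point is that the shared collection must be thread-safe):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using System.Xml.Linq;

var comments = new ConcurrentBag<XElement>();

Parallel.ForEach(Directory.EnumerateFiles(@"C:\import", "*.xml"), file =>
{
    var doc = XDocument.Load(file);               // each iteration parses its own document
    foreach (var comment in doc.Root.Elements("Comment"))
        comments.Add(new XElement(comment));      // copy so nothing is shared between threads
});
// ConcurrentBag does not preserve order; sort the collected elements
// afterwards if the output order matters.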
You should examine the data you're loading and whether you can work on it in any special way to keep memory usage low (and even gain some speed).

Iterate Large XML File and Copy Select Nodes

I need to iterate through a large XML file (~2GB) and selectively copy certain nodes to one or more separate XML files.
My first thought is to use XPath to iterate through matching nodes and for each node test which other file(s) the node should be copied to, like this:
var doc = new XPathDocument(@"C:\Some\Path.xml");
var nav = doc.CreateNavigator();
var nodeIter = nav.Select("//NodesOfInterest");
while (nodeIter.MoveNext())
{
    foreach (Thing thing in ThingsThatMightGetNodes)
    {
        if (thing.AllowedToHaveNode(nodeIter.Current))
        {
            thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
        }
    }
}
In this implementation, Thing defines public System.Xml.XmlDocument WorkingXmlDoc to hold nodes that it is AllowedToHave(). I don't understand, though, how to create a new XmlNode that is a copy of nodeIter.Current.
If there's a better approach I would be glad to hear it as well.
Evaluation of an XPath expression requires that the whole XML document (XML Infoset) be in RAM.
For an XML file whose textual representation exceeds 2GB, typically more than 10GB of RAM should be available just to hold the XML document.
Therefore, while not impossible, it may be preferable (especially on a server that must have resources quickly available to many requests) to use another technique.
The XmlReader-based classes are an excellent tool for this scenario. They are fast, forward-only, and don't require retaining the nodes that have already been read in memory. Also, your logic will remain almost the same.
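A sketch of that approach, including how to copy the current element into a target XmlDocument (this assumes AllowedToHaveNode can accept an XmlNode and that each WorkingXmlDoc already has a root element; Thing and ThingsThatMightGetNodes are taken from the question):

using System.Xml;

using (var reader = XmlReader.Create(@"C:\Some\Path.xml"))
{
    var buffer = new XmlDocument();
    while (reader.ReadToFollowing("NodesOfInterest"))
    {
        // Materialise just this element (and its children) from the stream.
        using (var subtree = reader.ReadSubtree())
        {
            subtree.Read();                          // position on the element
            XmlNode node = buffer.ReadNode(subtree); // builds an XmlNode without loading the whole file

            foreach (Thing thing in ThingsThatMightGetNodes)
            {
                if (thing.AllowedToHaveNode(node))
                {
                    // ImportNode makes a copy owned by the target document.
                    XmlNode copy = thing.WorkingXmlDoc.ImportNode(node, true);
                    thing.WorkingXmlDoc.DocumentElement.AppendChild(copy);
                }
            }
        }
    }
}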
You should consider LINQ to XML. Check this blog post for details and examples:
http://james.newtonking.com/archive/2007/12/11/linq-to-xml-over-large-documents.aspx
Try an XQuery processor that implements document projection (an idea first published by Marian and Siméon). It's implemented in a number of processors including Saxon-EE. Basically, if you run a query such as //x, it will filter the input event stream and build a tree that only contains the information needed to handle this query; it will then execute the query in the normal way, but against a much smaller tree. If this is a small part of the total document, you can easily reduce the memory requirement by 95% or so.

How to read a text file into a List in C#

I have a text file that has the following format:
1234
ABC123 1000 2000
The first integer value is a weight and the next line has three values, a product code, weight and cost, and this line can be repeated any number of times. There is a space in between each value.
I have been able to read in the text file, store the first value on the first line in a variable, and then store the subsequent lines in an array and then in a list, using ReadLine() and Split(' ').
To me this seems an inefficient way of doing it, and I have been trying to find a way to read from the second line onwards (where the product codes, weights and costs are listed) straight into a list, without needing an array. My list contains an object in which I am only storing the weight and cost, not the product code.
Does anyone know how to read in a text file, take in some values from the file straight into a list control?
Thanks
What you are doing is correct. There is no generalized way of doing it, since what you did is describe the algorithm for it, and that has to be coded or parametrized somehow.
Since your text file isn't as structured as a CSV file, this kind of manual parsing is probably your best bet.
C# doesn't have a Scanner class like Java, so what you want doesn't exist in the BCL, though you could write your own.
The other answers are correct - there's no generalized solution for this.
If you've got a relatively small file, you can use File.ReadAllLines(), which will at least get rid of a lot of cruft code, since it immediately converts the file into a string array for you.
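As a sketch of that (the Item class and file name are assumptions; the single weight on the first line is parsed separately and the remaining lines go straight into the list):

using System.Collections.Generic;
using System.IO;
using System.Linq;

public class Item
{
    public decimal Weight { get; set; }
    public decimal Cost { get; set; }
}

// ...

string[] lines = File.ReadAllLines("input.txt");
int firstLineWeight = int.Parse(lines[0]);

// Skip the first line, split each remaining line on spaces,
// and keep only the weight (index 1) and cost (index 2).
List<Item> items = lines
    .Skip(1)
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Select(line => line.Split(' '))
    .Select(parts => new Item
    {
        Weight = decimal.Parse(parts[1]),
        Cost = decimal.Parse(parts[2])
    })
    .ToList();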
If you don't want to parse strings from the file and reserve extra memory for the split strings, you can use a binary format to store the information in the file. Then you can use the BinaryReader class with methods like ReadInt32(), ReadDouble() and others. It is more efficient than reading characters.
One caveat: a binary format is hard for humans to read. It will be difficult to edit the file in an editor. Programmatically, though, it is no problem.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV files. My little C# console application loads up the two text/CSV files and compares them for differences, then saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .Compare() against the ArrayList as each line is read from the second CSV file. It then saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and the files have the same number of records, you could skip the data structure entirely and do in-place analysis.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load up a couple of byte[] arrays from a FileStream and do a byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I reckon byte differences are where it's at.
This is an adaptation of David Sokol's code to work with varying number of lines, outputing the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();
while (!one.EndOfStream || !two.EndOfStream)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (two.EndOfStream || string.CompareOrdinal(lineOne, lineTwo) < 0)
    {
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    if (one.EndOfStream || string.CompareOrdinal(lineTwo, lineOne) < 0)
    {
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but an ArrayList will maintain its elements in the order in which you added them. This means you can compare the two ArrayLists in a single pass: just increment the two scanning indices according to the comparison results.
One question I have: have you considered "out-sourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences output like in a WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be caused by swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could read and write off two file streams as a previous answer posted and write out your results "in real time" as you find them; this does not explicitly store anything in memory. Alternatively, you could dump chunks of each file into memory, say a thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, using itemId as the key and the rest of the line as the value. You can then quickly check whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
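A quick sketch of the dictionary idea (assuming the first comma-separated field is the record's ID and that file 1 fits in memory):

using System;
using System.Collections.Generic;
using System.IO;

// Index file 1 by its first field (the assumed primary key).
var first = new Dictionary<string, string>();
foreach (string line in File.ReadLines(@"C:\file1.csv"))
{
    string id = line.Split(',')[0];
    first[id] = line;
}

using (var differences = new StreamWriter("Output.csv"))
{
    foreach (string line in File.ReadLines(@"C:\file2.csv"))
    {
        string id = line.Split(',')[0];
        string other;
        if (!first.TryGetValue(id, out other))
            differences.WriteLine("IN FILE 2 ONLY: " + line);
        else if (!string.Equals(other, line, StringComparison.Ordinal))
            differences.WriteLine("CHANGED: " + line);
    }
}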
