Optimize XML Reading

I have an event that occurs over 50 000 times. Each time this event occurs, I need to look in an XML file and grab some data. This XML file has over 20 000 possible values.
Before I needed to read the XML, the rest of the work for the 50 000 events was very fast. As soon as I implemented the XML read, the whole operation slowed down very significantly.
To read the XML, I am doing this:
XElement xelement = XElement.Load(SomePath);
IEnumerable<XElement> elements = xelement.Elements();
foreach (var element in elements)
{
    if (element.Element("SomeNode").Value == value)
    {
        valueToSet = element.Element("SomeOtherNode").Value;
        break;
    }
}
What can I do to optimize this?

It sounds to me like you should load the file once, preprocess it to a Dictionary<string, string>, and then just perform a lookup in that dictionary each time the event occurs. You haven't given us enough context to say exactly where you'd cache the dictionary, but that's unlikely to be particularly hard.
The loading and converting to a dictionary is really simple:
var dictionary = XElement.Load(SomePath)
.Elements()
.ToDictionary(x => x.Element("SomeNode").Value,
x => x.Element("SomeOtherNode").Value);
That's assuming that all of the secondary elements contain both a SomeNode element and a SomeOtherNode element... if that's not the case, you'd need to do some filtering first.
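The filtering mentioned above could look roughly like this. It is only a sketch: the class name XmlLookupCache is made up, and skipping incomplete children and letting the first duplicate key win are assumptions about what you would want.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class XmlLookupCache
{
    // Builds the lookup once from an already loaded root element.
    // Assumption: children missing either element are skipped,
    // and on duplicate keys the first entry wins.
    public static Dictionary<string, string> Build(XElement root)
    {
        var map = new Dictionary<string, string>();
        foreach (XElement child in root.Elements())
        {
            // The explicit string conversion yields null when the element is absent.
            string key = (string)child.Element("SomeNode");
            string value = (string)child.Element("SomeOtherNode");
            if (key == null || value == null) continue;
            if (!map.ContainsKey(key)) map[key] = value;
        }
        return map;
    }
}
```

You would call XmlLookupCache.Build(XElement.Load(SomePath)) once before the events start, and then use TryGetValue on the result inside the event handler.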

If you always need to load the (refreshed?) file each time, you may want to consider the XStreamingElement so that you don't have to load the entire document upfront (as XElement.Load does). With XElement.Load, you load the entire XML into memory and then iterate over that. With the XStreamingElement, you only bring the amount of the document you need into memory. In addition, to then get the first match, use .FirstOrDefault.
If you don't need to constantly refresh the document, consider a caching mechanism like the one @jonskeet offered.
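For the "reload every time" case, here is a hedged sketch of reading only as much of the file as needed. Note that it swaps in XmlReader together with XNode.ReadFrom rather than XStreamingElement, because as far as I know XStreamingElement is aimed at deferred output rather than at reading. The class and method names are made up; the element names come from the question.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class StreamingSearch
{
    // Returns the SomeOtherNode value of the first child whose SomeNode
    // equals 'wanted', or null if there is no such child.
    public static string FindFirst(XmlReader reader, string wanted)
    {
        reader.MoveToContent();   // position on the root element
        reader.Read();            // step inside the root
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                // ReadFrom materializes only this child and leaves the
                // reader positioned just after it.
                var child = (XElement)XNode.ReadFrom(reader);
                if ((string)child.Element("SomeNode") == wanted)
                    return (string)child.Element("SomeOtherNode");
            }
            else
            {
                reader.Read();    // skip whitespace, comments, end tags
            }
        }
        return null;
    }
}
```

With a file you would wrap the call in using (XmlReader reader = XmlReader.Create(SomePath)) so the reader is closed as soon as the match is found.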

Related

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.

Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.

Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.
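To make the trie step a bit more concrete, here is a minimal sketch. It is not the implementation linked above, just a small hand-rolled one; PrefixTrie and its members are made-up names. The idea would be to keep one such trie per (year, period) key, add each row under its ryp value, and query it with the trimmed form.

```csharp
using System.Collections.Generic;

public sealed class PrefixTrie<T>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<T> Items = new List<T>();
    }

    private readonly Node _root = new Node();

    // Stores 'item' at the node reached by walking the characters of 'key'.
    public void Add(string key, T item)
    {
        Node node = _root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Items.Add(item);
    }

    // Yields every item whose key starts with 'prefix' (in no particular order).
    public IEnumerable<T> FindByPrefix(string prefix)
    {
        Node node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                yield break;   // no key has this prefix
        }
        var pending = new Stack<Node>();
        pending.Push(node);
        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            foreach (T item in current.Items) yield return item;
            foreach (Node child in current.Children.Values) pending.Push(child);
        }
    }
}
```

The cost of a query then depends on the length of the prefix and the number of matches, not on the number of rows in the bucket.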

Performance issue with Linq to object taking more time

I have some XML that gets updated based on object values. The 'foreach' loop here takes almost 12-15 minutes to process a 200 KB XML file. Please suggest how I can improve the performance. (The XML file has four levels of nesting, with 10 child tags at the fourth level in each case.)
Code:
IEnumerable<XElement> elements = xmlDoc.Descendants();
foreach (DataSource Data in DataLst)
{
XElement xmlElem = (from xmlData in elements
where Data.Name == xmlData.Name.LocalName //Name
&& Data.Store == xmlData.Element(XName.Get("Store", "")).Value
&& Data.Section == xmlData.Element(XName.Get("Section", "")).Value
select xmlData.Element(XName.Get("Val", ""))).Single();
xmlElem.ReplaceWith(new XElement(XName.Get("Val", ""), Data.Value));
}

It looks like you have an O(n)×O(m) issue here, for n = size of DataLst and m = size of the xml. To make this O(n)+O(m), you should index the data; for example:
var lookup = elements.ToLookup(
x => new {
Name = x.Name.LocalName,
Store = x.Element(XName.Get("Store", "")).Value,
Section = x.Element(XName.Get("Section", "")).Value},
x => x.Element(XName.Get("Val", ""))
);
foreach (DataSource Data in DataLst)
{
XElement xmlElem = lookup[
new {Data.Name, Data.Store, Data.Section}].Single();
xmlElem.ReplaceWith(new XElement(XName.Get("Val", ""), Data.Value));
}
(untested - to show general approach only)

I think a better approach would be to deserialize the XML into C# classes and then use LINQ on those, which should be fast.

"Well thanks everyone for your precious time and effort"
Problem answer: Actually the object 'DataLst' was of the type IEnumerable<> which was taking time in obtaining the values but after I changed it to the List<> type the performance improved drastically (now running in 20 seconds)
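One plausible explanation for that improvement (an assumption, since the question does not show how DataLst was built) is deferred execution: an IEnumerable<> produced by a LINQ query redoes its work every time it is enumerated, while a List<> holds the results. A small sketch of the effect, with made-up names:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static int Evaluations;

    // Hypothetical stand-in for an expensive per-item computation.
    private static int Expensive(int x)
    {
        Evaluations++;
        return x * 2;
    }

    // Enumerates a lazy sequence and a materialized list three times each
    // and reports how often Expensive ran in each case.
    public static (int lazyCount, int listCount) Run()
    {
        long sink = 0;

        Evaluations = 0;
        IEnumerable<int> lazy = Enumerable.Range(0, 100).Select(Expensive);
        for (int pass = 0; pass < 3; pass++)
            foreach (int item in lazy) sink += item;
        int lazyCount = Evaluations;      // the query re-runs on every pass

        Evaluations = 0;
        List<int> materialized = Enumerable.Range(0, 100).Select(Expensive).ToList();
        for (int pass = 0; pass < 3; pass++)
            foreach (int item in materialized) sink += item;
        int listCount = Evaluations;      // the query ran once, in ToList()

        return (lazyCount, listCount);
    }
}
```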

If it really takes this long to run, then maybe do something like this:
Don't iterate both. Only iterate the XML file, and load the data from your DataLst into a dictionary (use a SQL command or a simple LINQ statement to load the data based on Name/Store/Section). Make a simple struct/class for your key with this data (Name/Store/Section), and don't forget to implement Equals and GetHashCode.
Iterate through your XML elements and use the dictionary to find the values to replace.
This way you will only iterate the XML file once, not once for every item in your DataSource.
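A rough sketch of that single pass, under some assumptions: the DataSource shape is guessed from the question, the key is a value tuple (which already implements Equals and GetHashCode, so no hand-written key type is needed), keys are assumed unique, and the elements have no namespaces.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical shape, inferred from the properties used in the question.
public sealed class DataSource
{
    public string Name, Store, Section, Value;
}

public static class XmlUpdater
{
    public static void Apply(XDocument xmlDoc, IEnumerable<DataSource> dataLst)
    {
        // Index the data once. ToDictionary throws on duplicate keys.
        var byKey = dataLst.ToDictionary(d => (d.Name, d.Store, d.Section), d => d.Value);

        // Snapshot the elements so the tree can be modified while looping.
        foreach (XElement element in xmlDoc.Descendants().ToList())
        {
            XElement store = element.Element("Store");
            XElement section = element.Element("Section");
            XElement val = element.Element("Val");
            if (store == null || section == null || val == null) continue;

            var key = (element.Name.LocalName, store.Value, section.Value);
            if (byKey.TryGetValue(key, out string newValue))
                val.Value = newValue;   // replace the content of <Val>
        }
    }
}
```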

It's not clear why it's taking that long - that's a very long time. How many elements are in DataLst? I would rewrite the query for simplicity to start with though:
IEnumerable<XElement> elements = xmlDoc.Descendants();
foreach (DataSource data in DataLst)
{
XElement valElement = (from element in xmlDoc.Descendants(data.Name)
where data.Store == element.Element("Store").Value
&& data.Section == element.Element("Section").Value
select element.Element("Val")).Single();
valElement.ReplaceWith(new XElement("Val", data.Value));
}
(I'm assuming none of your elements actually have namespaces, by the way.)
Next up: consider replacing the contents of valElement instead of replacing the element itself. Change it to:
valElement.ReplaceAll(data.Value);
Now, this has all been trying to keep to the simplicity of avoiding precomputation etc... because it sounds like it shouldn't take this long. However, you may need to build lookups as Marc and Carsten suggested.

Try replacing the Single() call in the LINQ query with First().

At the risk of flaming, have you considered writing this in XQuery instead? There's a good chance that a decent XQuery processor would have a join optimiser that handles this query efficiently.

List<T> FirstOrDefault() bad performance - is dictionary possible in this case?

I have set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations), and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I'm inserting (using SqlBulkCopy) a million rows.
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to lookup the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.

Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, that it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, then create it).

A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z=>z.CODE);
//your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).

We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing date times or some primitive value (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory or should you be putting all these rows in to SQL and letting it be queried on there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. a Dictionary that maps each code to a List of its entries. If you have many evenly distributed codes, your lists will each be about N/M in size (where N = total event count and M = number of codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further. The List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly. (This of course has a trade-off in the time spent building the collection of data.) This should provide very quick lookups.
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that were started between 12:00 and 12:01 might be in one bucket, etc. If you have a very small number of events and a lot of highly frequent (but not pathologically so) events this might give you very good lookup performance.
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
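As an illustration of the first idea (partition by code, then binary search by start date), here is a sketch. The Code class is a guess at the record shape based on the question's fields, CodeIndex is a made-up name, and the search assumes that the validity periods of one code never overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape based on the fields used in the question.
public sealed class Code
{
    public int Id;
    public string CODE;
    public DateTime VALIDFROM, VALIDUNTIL;
}

public sealed class CodeIndex
{
    private readonly Dictionary<string, List<Code>> _byCode;

    public CodeIndex(IEnumerable<Code> codes)
    {
        // One list per code, each sorted by start date.
        _byCode = codes.GroupBy(c => c.CODE)
                       .ToDictionary(g => g.Key, g => g.OrderBy(c => c.VALIDFROM).ToList());
    }

    // Assumes the periods of a single code do not overlap.
    public int? Find(string code, DateTime date)
    {
        if (!_byCode.TryGetValue(code, out List<Code> list)) return null;

        // Binary search for the last entry that starts on or before 'date'.
        int lo = 0, hi = list.Count - 1, candidate = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (list[mid].VALIDFROM <= date) { candidate = mid; lo = mid + 1; }
            else hi = mid - 1;
        }

        if (candidate >= 0 && list[candidate].VALIDUNTIL >= date)
            return list[candidate].Id;
        return null;
    }
}
```

Building the index costs a sort per code, so it only pays off when it is built once and queried many times, as in the million-row loop described in the question.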

This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.

How to load a Hashtable from a simple XML file using XmlTextReader

Using XmlTextReader, how would I load a Hashtable?
XML:
<base><user name="john">2342343</user><user name="mark">239099393</user></base>
This was asked before but it was using some funky linq that I am not fully comfortable with just yet.

Well, the LINQ to XML solution is really easy, so I suggest we try to make you comfortable with that instead of creating a more complex solution. Here's the code, with plenty of explanation...
// Load the whole document into memory, as an element
XElement root = XElement.Load(xmlReader);
// Get a sequence of users
IEnumerable<XElement> users = root.Elements("user");
// Convert this sequence to a dictionary...
Dictionary<string, string> userMap = users.ToDictionary(
element => element.Attribute("name").Value, // Key selector
element => element.Value); // Value selector
Of course you could do this all in one go - and I'd probably combine the second and third statements. But that's about as conceptually simple as it's likely to get. It would become more complicated if you wanted to put error handling around the possibility that a user element might not have a name, admittedly. (This code will throw a NullReferenceException in that case.)
Note that this assumes you want the name as the key and id as value. If you want the hashtable the other way round, just switch the order of the lambda expressions.
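If you would still like to see the reader-based version the question asked for, a sketch might look like the following. UserTableLoader is a made-up name, the method accepts any XmlReader (XmlTextReader is one), and it assumes each user element contains only text, as in the sample.

```csharp
using System.Collections;
using System.Xml;

public static class UserTableLoader
{
    // Maps each user's name attribute to the text content of the element.
    public static Hashtable Load(XmlReader reader)
    {
        var table = new Hashtable();
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "user")
            {
                string name = reader.GetAttribute("name");
                // Reads the text and leaves the reader just after </user>,
                // so no extra Read() is needed in this branch.
                string id = reader.ReadElementContentAsString();
                if (name != null) table[name] = id;
            }
            else
            {
                reader.Read();
            }
        }
        return table;
    }
}
```

It is noticeably more fiddly than the LINQ to XML version, which is part of why I'd suggest getting comfortable with LINQ instead.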

Fastest way to find objects from a collection matched by condition on string member

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:
class ClassFoo
{
public string word;
public float score;
//... etc ...
}
Assume there's going to be like 50.000 items in the collection, all in memory.
Now I want to obtain as fast as possible all the instances in the collection that obey a condition on its word member, for example like this:
List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
result.Add(cf);
}
How do I get the results as fast as possible? Should I consider some advanced indexing techniques and datastructures?
The application domain for this problem is an autocompleter, that gets a query and gives a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there's going to be a lot of searches.

With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.
For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
So, it really comes down to what queries you want to do against that data.
In Databases, this problem is the burden of the "query optimizer". In a typical database, if you have a database with no indexes, obviously every query is going to be a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans to better get to the data. That's essentially the problem you're describing.
Once you have a more concrete subset of the types of queries then you can make a better decision as to what structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements each less than 100 byte, a scan of everything may well be the fastest thing you can do since you have such a small amount of data. Obviously that doesn't scale to a 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.
EDIT, based on the comment
If it's an auto completer, if the data is static, then sort it and use a binary search. You're really not going to get faster than that.
If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep adding data in arbitrary order.
Anything else is some specialization on these concepts.

var answers = myList.Where(item => item.word.StartsWith(query) || item.word.EndsWith(query));
That's the easiest in my opinion, and it should execute rather quickly.

Not sure I understand... All you can really do is optimize the rule, that's the part that needs to be fastest. You can't speed up the loop without just throwing more hardware at it.
You could parallelize if you have multiple cores or machines.

I'm not up on my Java right now, but I would think about the following things.
How are you creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.
If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.
For storing the results, depending on how you are collecting them, the structure could make a difference (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume that the generic linked list would keep a tail pointer. In this case, it wouldn't really make a difference. Someone with more knowledge of the underlying array vs linked list implementation and how it ends up looking in the byte code could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, you would need to know the size of your result set or sacrifice some storage space and make it as big as the whole collection you are iterating through if you wanted to use an array.
Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. For example, if in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with the query, you would want to do the end comparison first.

For your particular example, sorting the collection would help, as you could binary chop to the first item that starts with query and terminate early when you reach the next one that doesn't; you could also produce a table of pointers to collection items sorted by the reverse of each string for the second clause.
In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.
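A sketch of the two sorted indexes described above, working on plain strings for brevity (in practice you would store the ClassFoo instances alongside). AffixIndex and its members are made-up names, and ordinal (case-sensitive) comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class AffixIndex
{
    private readonly string[] _sorted;           // words in ordinal order
    private readonly string[] _sortedReversed;   // reversed words in ordinal order

    public AffixIndex(IEnumerable<string> words)
    {
        _sorted = words.OrderBy(w => w, StringComparer.Ordinal).ToArray();
        _sortedReversed = _sorted.Select(ReverseString)
                                 .OrderBy(w => w, StringComparer.Ordinal)
                                 .ToArray();
    }

    private static string ReverseString(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    private static IEnumerable<string> WithPrefix(string[] sorted, string prefix)
    {
        // Binary chop to the first entry that is >= prefix.
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sorted[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        // All matches are contiguous; stop at the first non-match.
        for (int i = lo; i < sorted.Length && sorted[i].StartsWith(prefix, StringComparison.Ordinal); i++)
            yield return sorted[i];
    }

    public IEnumerable<string> StartsWith(string query) => WithPrefix(_sorted, query);

    // A suffix match is a prefix match on the reversed strings.
    public IEnumerable<string> EndsWith(string query) =>
        WithPrefix(_sortedReversed, ReverseString(query)).Select(ReverseString);
}
```

For the autocompleter you would take the union of StartsWith(query) and EndsWith(query); a word that both starts and ends with the query shows up in both results.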

If it's something where you populate the list once and then do many lookups (thousands or more) then you could create some kind of lookup dictionary that maps starts with/ends with values to their actual values. That would be a fast lookup, but would use much more memory. If you aren't doing that many lookups or know you're going to be repopulating the list at least semi-frequently I'd go with the LINQ query that CQ suggested.

You can create some sort of index and it might get faster.
We can build an index like this:
var indexByFirstLetter = new Dictionary<char, List<ClassFoo>>();
foreach (var cf in collection) {
    // Index each (non-empty) word under both its first and its last letter.
    foreach (char c in new[] { cf.word[0], cf.word[cf.word.Length - 1] }) {
        if (!indexByFirstLetter.TryGetValue(c, out var bucket))
            indexByFirstLetter[c] = bucket = new List<ClassFoo>();
        bucket.Add(cf);
    }
}
Then use it like this (prefix matches are in the bucket for the query's first letter, suffix matches in the bucket for its last letter; the HashSet avoids adding the same item twice):
var seen = new HashSet<ClassFoo>();
foreach (char c in new[] { query[0], query[query.Length - 1] }) {
    if (!indexByFirstLetter.TryGetValue(c, out var candidates)) continue;
    foreach (ClassFoo cf in candidates) {
        if ((cf.word.StartsWith(query) || cf.word.EndsWith(query)) && seen.Add(cf))
            result.Add(cf);
    }
}
Now we possibly do not have to loop through as many ClassFoo instances as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definitely more complicated.

Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?
If the collection will get large, I would definitely use an index.
In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress for ORM or SQLite.Net for in-memory database comes to mind.
If you don't want to go this far, make a simple index consisting of the "bar" member references mapping to class references.

If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the set of the criteria. When you create an element/add it to the list, you check which criteria it satisfies and then set the corresponding bits in the bitmask of this element. Matching the elements from the list will be as easy as matching their bitmasks with the target bitmask. A more general method is the Bloom filter.
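A small sketch of the bitmask idea. The three criteria are invented purely for illustration, and BitmaskFilter is a made-up name; the point is only that each element is classified once and that matching is then a single AND plus a comparison per element.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical criteria, chosen only for illustration.
[Flags]
public enum Criteria
{
    None = 0,
    StartsWithVowel = 1,
    LongerThanFive = 2,
    ContainsDigit = 4
}

public static class BitmaskFilter
{
    // Computed once, when an element is added to the list.
    public static Criteria Classify(string word)
    {
        Criteria mask = Criteria.None;
        if (word.Length > 0 && "aeiouAEIOU".IndexOf(word[0]) >= 0) mask |= Criteria.StartsWithVowel;
        if (word.Length > 5) mask |= Criteria.LongerThanFive;
        foreach (char c in word)
        {
            if (char.IsDigit(c)) { mask |= Criteria.ContainsDigit; break; }
        }
        return mask;
    }

    // An element matches when every required bit is set in its mask.
    public static List<string> Match(IList<string> words, IList<Criteria> masks, Criteria required)
    {
        var result = new List<string>();
        for (int i = 0; i < words.Count; i++)
        {
            if ((masks[i] & required) == required) result.Add(words[i]);
        }
        return result;
    }
}
```

Note that this is still a linear scan over the masks; it makes each check cheap rather than reducing the number of checks.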
