I'm writing a program that matches a user submitted query against a list of keywords. The list has about 2000 words and performance is most important.
Is it faster to store this list in a SQL table or hard-code it in the source code? The list does not need to be updated often.
If a SQL table is faster, which data types would be best? (Int, Nvarchar?)
If a hardcoded list is faster, what data type would be best? (List?)
Any suggestions?
What is the best in-memory data structure for fast lookups?
It doesn't matter for performance where you store this data.
When your program starts, you load the string array once from whichever data store you chose, and then you can use that array for as long as the program runs.
IMO, if the list doesn't get updated often, store it in a file (text/XML), then cache it in your application so that subsequent requests are faster.
Okay, to respond to your edit (and basically lifting my comment into an answer):
Specify in advance the performance that you are expecting.
Code your application against a sorted array, using a binary search to search the array for a keyword. This is very simple to implement and gives decent performance. Then profile to see if it matches the performance that you demand. If this performance is acceptable, move on. The worst-case performance here is O(m log n), where n is the number of keywords and m is the maximum length of your keywords.
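A minimal sketch of that step, assuming case-insensitive matching (the keyword array and comparer are illustrative):

using System;

class KeywordMatcher
{
    private readonly string[] _keywords;

    public KeywordMatcher(string[] keywords)
    {
        _keywords = (string[])keywords.Clone();
        Array.Sort(_keywords, StringComparer.OrdinalIgnoreCase);   // one-time cost at startup
    }

    // O(log n) comparisons per lookup, each comparison up to O(m) in keyword length.
    public bool Contains(string query) =>
        Array.BinarySearch(_keywords, query, StringComparer.OrdinalIgnoreCase) >= 0;
}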
If the performance in step two is not acceptable, use a trie (also known as a prefix tree). The expected performance here is m where m is the maximum length of your keywords. Profile to see if this meets your expected performance. If it does not, revisit your performance criteria; they might have been unreasonable.
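A bare-bones trie sketch for exact membership tests, using simple per-character dictionary nodes (one of several possible layouts):

using System.Collections.Generic;

sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new Node();
            node = next;
        }
        node.IsWord = true;
    }

    // O(m) in the length of the query, independent of how many keywords are stored.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (var c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return node.IsWord;
    }
}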
If you are still not meeting your performance specifications, consider using a hashtable (in .NET you would use a HashSet<string>). While a hashtable will have worse worst-case performance, it could have better average-case performance (if there are no collisions, a hashtable lookup is O(1), while computing the hash is O(m) where m is the maximum length of your keywords). This might be faster (on average) but probably not noticeably so.
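A sketch of that option, assuming case-insensitive matching is wanted (swap the comparer otherwise):

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var keywords = new HashSet<string>(
            new[] { "alpha", "beta", "gamma" },            // ~2000 entries in practice
            StringComparer.OrdinalIgnoreCase);

        Console.WriteLine(keywords.Contains("BETA"));      // True
    }
}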
You might even consider skipping directly to the last step (it's less complex to implement than the trie). It all depends on your needs. Tries have the advantage that you can easily spit out the closest matching keyword, for example.
The important thing here is to have a specification of your performance requirements and to profile! Use the simplest implementation that meets your performance requirements (for maintainability, readability and implementability (if that's not a word, it is now!)).
The list does not need to be updated often
I say if it ever needs to be updated it does not belong in the source code.
The hardcoded list is faster. A database hit to retrieve the list will undoubtedly be slower than pulling the list from an in-memory object.
As for which datatype to store the values in, an array would probably be faster and take up less memory than a List, but trivially so.
If the list is largely static and you can afford to spend some time in prep (i.e. on application start), you'd probably be best off storing the list of keywords in a text file, then using, say, a B* tree to store the keywords internally (assuming you only care about exact matching and not partial matching or Levenshtein distance).
Working in C#, I would like to write an efficient sorting algorithm that would take as input a text file containing an unsorted list of server and path combinations and output a sorted file.
As an exercise, I am working under assumption that the input data size will exceed available memory, so I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.
The format of the input file is up to my discretion. It can be just a list of UNC paths (server and path as single string) or it can be a CSV with servers and paths as separate fields.
My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?
Having server and path separate would eliminate having to compare the server names during the path comparison run, but require additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk IO overhead.
Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?
Any other optimization techniques that I might consider given the nature of the dataset?
EDIT: This is a one-time task. I do not need to look up the entries later.
I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.
That's a perfectly reasonable plan.
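For concreteness, here is a rough sketch of that plan; the chunk size, temp-file handling and ordinal comparison are assumptions, and a priority queue would scale better than the linear scan used here to pick the next line during the merge:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    public static void Sort(string inputPath, string outputPath, int linesPerChunk = 1000000)
    {
        // Phase 1: read a chunk at a time, sort it in memory, write it to a temp file.
        var chunkFiles = new List<string>();
        using (var reader = new StreamReader(inputPath))
        {
            var chunk = new List<string>(linesPerChunk);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == linesPerChunk)
                {
                    chunkFiles.Add(WriteSortedChunk(chunk));
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0) chunkFiles.Add(WriteSortedChunk(chunk));
        }

        // Phase 2: k-way merge of the sorted chunks by repeatedly emitting the smallest head line.
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var writer = new StreamWriter(outputPath))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null &&
                        (min == -1 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min == -1) break;               // every chunk is exhausted
                writer.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }

        foreach (var r in readers) r.Dispose();
        foreach (var f in chunkFiles) File.Delete(f);
    }

    static string WriteSortedChunk(List<string> chunk)
    {
        chunk.Sort(StringComparer.Ordinal);          // in-memory sort of one chunk
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, chunk);
        return path;
    }
}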
An alternate solution would be: create an on-disk b-tree, and insert all your data one record at a time into the b-tree. You never need to have more than a few pages of the b-tree in memory and you can read the records one at a time from the unsorted list. Once it's in the b-tree, read it back out in order.
Having server and path separate would eliminate having to compare the server names during the path comparison run, but require additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk IO overhead.
OK.
My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?
You just said what the pros and cons are. You've already listed them. Why are you asking this question if you already know the answer?
Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?
Probably, yes.
How can I know for sure?
Write the code both ways and run it. The one that is better will be observed to be better.
Any other optimization techniques that I might consider given the nature of the dataset?
Your question and speculations are premature.
Start by setting a performance goal.
Then implement the code as clearly and correctly as you can.
Then carefully measure to see if you met your goal.
If you did, knock off early and go to the beach.
If you did not, get a profiler and use it to analyze your program to find the worst-performing part. Then optimize that part.
Keep doing that until either you meet your goal, or you give up.
I'm certainly not going to out-answer Eric Lippert, but from a novice's perspective I wonder if you're not looking for the most complex answer first.
You don't need to read the file into memory all at once; with File.ReadLines your input arrives one line at a time. Use the Uri object to quickly parse each string into its component parts: host and path.
If you are thinking of an OO approach, how about a 'serverUri' object that implements IComparable and holds a SortedList of path strings? Make a SortedList of the serverUri objects so that the server part of the string is stored only once, and for each path with that server URI, add it to the sub-collection. Voilà... it's all sorted... spit it out to disk.
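A rough sketch of that idea, using SortedDictionary/SortedSet in place of the serverUri class for brevity; unlike the constraint in the question, it assumes the whole data set fits in memory, and the file names are made up:

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main()
    {
        // Server name -> sorted set of paths on that server; both levels stay sorted as we insert.
        var byServer = new SortedDictionary<string, SortedSet<string>>(StringComparer.OrdinalIgnoreCase);

        foreach (var line in File.ReadLines("input.txt"))        // streams one line at a time
        {
            var uri = new Uri(line);                             // e.g. \\server\share\folder\file
            SortedSet<string> paths;
            if (!byServer.TryGetValue(uri.Host, out paths))
                byServer[uri.Host] = paths = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
            paths.Add(uri.AbsolutePath);                         // "/share/folder/file" (URI-escaped)
        }

        using (var writer = new StreamWriter("output.txt"))
            foreach (var server in byServer)
                foreach (var path in server.Value)
                    writer.WriteLine(@"\\" + server.Key + Uri.UnescapeDataString(path).Replace('/', '\\'));
    }
}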
I have done some performance testing in C# using a for loop and a while loop over an ArrayList to do a comparison search.
It seems to have quadratic time consumption.
However, if I use LastIndexOf or IndexOf to search the list, it is faster than anticipated.
Does anyone know the reason?
I don't know any C#, and yet I can put forward the likely answer.
A programming language's built-in methods are generally written in ways that take advantage of shortcuts made available by the processor they run on, which your own code will not (e.g. you'll have to declare local variables that are kept on the stack, requiring slower lookups, instead of just being temporary register variables). Thus anything the language does natively will generally be quicker than your own code.
Use ILSpy and take a look at the internals of LastIndexOf/IndexOf methods. There lies your answer as to why they are faster.
I have a hunch that the List internally uses a B-tree or some other tree, which has a lookup of log(n). What you are doing with for/foreach is performing a linear lookup with some extra overhead. If you remember your maths class then you'd know that log(n) is flatter than a linear line, thus having faster lookup...
I am trying to get to grips with LINQ. The thing that bothers me most is that even as I understand the syntax better, I don't want to unwittingly sacrifice performance for expressiveness.
Are there any good centralized repositories of information or books for 'Effective LINQ'? Failing that, what is your own personal favourite high-performance LINQ technique?
I am primarily concerned with LINQ to Objects, but all suggestions on LINQ to SQL and LINQ to XML are also welcome, of course. Thanks.
Linq, as a built-in technology, has performance advantages and disadvantages. The code behind the extension methods has had considerable performance attention paid to it by the .NET team, and its ability to provide lazy evaluation means that the cost of performing most manipulations on a set of objects is spread across the larger algorithm requiring the manipulated set. However, there are some things you need to know that can make or break your code's performance.
First and foremost, Linq doesn't magically save your program the time or memory needed to perform an operation; it just may delay those operations until absolutely needed. OrderBy() performs a QuickSort, which will take n log n time just the same as if you'd written your own QuickSorter or used List.Sort() at the right time. So, always be mindful of what you're asking Linq to do to a series when writing queries; if a manipulation is not necessary, look to restructure the query or method chain to avoid it.
By the same token, certain operations (sorting, grouping, aggregates) require knowledge of the entire set they are acting upon. The very last element in a series could be the first one the operation must return from its iterator. On top of that, because Linq operations should not alter their source enumerable, but many of the algorithms they use will (i.e. in-place sorts), these operations end up not only evaluating, but copying the entire enumerable into a concrete, finite structure, performing the operation, and yielding through it. So, when you use OrderBy() in a statement, and you ask for an element from the end result, EVERYTHING that the IEnumerable given to it can produce is evaluated, stored in memory as an array, sorted, then returned one element at a time. The moral is, any operation that needs a finite set instead of an enumerable should be placed as late in the query as possible, allowing for other operations like Where() and Select() to reduce the cardinality and memory footprint of the source set.
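A small illustration of that point; the Order type and values are made up:

using System;
using System.Linq;

class Demo
{
    class Order { public int Id; public decimal Total; }

    static void Main()
    {
        var orders = new[]
        {
            new Order { Id = 1, Total = 50m },
            new Order { Id = 2, Total = 2500m },
            new Order { Id = 3, Total = 1200m }
        };

        // OrderBy must buffer and sort every order before Where sees anything:
        var slower = orders.OrderBy(o => o.Total).Where(o => o.Total > 1000m).Select(o => o.Id);

        // Where first: OrderBy only has to copy and sort the orders that survive the filter.
        var better = orders.Where(o => o.Total > 1000m).OrderBy(o => o.Total).Select(o => o.Id);

        Console.WriteLine(string.Join(", ", better));            // 3, 2
    }
}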
Lastly, Linq methods drastically increase the call stack size and memory footprint of your system. Each operation that must know of the entire set keeps the entire source set in memory until the last element has been iterated, and the evaluation of each element will involve a call stack at least twice as deep as the number of methods in your chain or clauses in your inline statement (a call to each iterator's MoveNext() or yielding GetEnumerator, plus at least one call to each lambda along the way). This is simply going to result in a larger, slower algorithm than an intelligently-engineered inline algorithm that performs the same manipulations. Linq's main advantage is code simplicity. Creating, then sorting, a dictionary of lists of grouped values is not very easy-to-understand code (trust me). Micro-optimizations can obfuscate it further. If performance is your primary concern, then don't use Linq; it will add approximately 10% time overhead and several times the memory overhead of manipulating a list in-place yourself. However, maintainability is usually the primary concern of developers, and Linq DEFINITELY helps there.
On the performance kick: If performance of your algorithm is the sacred, uncompromisable first priority, you'd be programming in an unmanaged language like C++; .NET is going to be much slower just by virtue of it being a managed runtime environment, with JIT native compilation, managed memory and extra system threads. I would adopt a philosophy of it being "good enough"; Linq may introduce slowdowns by its nature, but if you can't tell the difference, and your client can't tell the difference, then for all practical purposes there is no difference. "Premature optimization is the root of all evil"; Make it work, THEN look for opportunities to make it more performant, until you and your client agree it's good enough. It could always be "better", but unless you want to be hand-packing machine code, you'll find a point short of that at which you can declare victory and move on.
Simply understanding what LINQ is doing internally should yield enough information to know whether you are taking a performance hit.
Here is a simple example where LINQ helps performance. Consider this typical old-school approach:
List<Foo> foos = GetSomeFoos();
List<Foo> filteredFoos = new List<Foo>();

foreach (Foo foo in foos)
{
    if (foo.SomeProperty == "somevalue")
    {
        filteredFoos.Add(foo);
    }
}

myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
So the above code will iterate twice and allocate a second container to hold the filtered values. What a waste! Compare with:
var foos = GetSomeFoos();
var filteredFoos = foos.Where(foo => foo.SomeProperty == "somevalue");
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
This only iterates once (when the repeater is bound); it only ever uses the original container; filteredFoos is just an intermediate enumerator. And if, for some reason, you decide not to bind the repeater later on, nothing is wasted. You don't even iterate or evaluate once.
When you get into very complex sequence manipulations, you can potentially gain a lot by leveraging LINQ's inherent use of chaining and lazy evaluation. Again, as with anything, it's just a matter of understanding what it is actually doing.
There are various factors which will affect performance.
Often, developing a solution using LINQ will offer pretty reasonable performance, because the system can build an expression tree to represent the query without actually running it. Only when you iterate over the results does it use this expression tree to generate and run a query.
In terms of absolute efficiency, you may see some performance hit compared with running against predefined stored procedures, but generally the approach to take is to develop a solution using a system that offers reasonable performance (such as LINQ) and not worry about a few percent loss of performance. If a query is then running slowly, then perhaps you look at optimisation.
The reality is that the majority of queries will not have the slightest problem with being done via LINQ. The other fact is that if your query is running slowly, it's probably more likely to be issues with indexing, structure, etc, than with the query itself, so even when looking to optimise things you'll often not touch the LINQ, just the database structure it's working against.
For handling XML, if you've got a document being loaded and parsed into memory (like anything based on the DOM model, or an XmlDocument or whatever), then you'll get more memory usage than systems that do something like raising events to indicate finding a start or end tag, but not building a complete in-memory version of the document (like SAX or XmlReader). The downside is that the event-based processing is generally rather more complex. Again, with most documents there won't be a problem - most systems have several GB of RAM, so taking up a few MB representing a single XML document is not a problem (and you often process a large set of XML documents at least somewhat sequentially). It's only if you have a huge XML file that would take up hundreds of MB that you need to worry about the particular choice.
Bear in mind that LINQ allows you to iterate over in-memory lists and so on as well, so in some situations (like where you're going to use a set of results again and again in a function), you may use .ToList or .ToArray to return the results. Sometimes this can be useful, although generally you want to try to use the database's querying rather than querying in-memory.
As for personal favourites - NHibernate LINQ - it's an object-relational mapping tool that allows you to define classes, define mapping details, and then get it to generate the database from your classes rather than the other way round, and the LINQ support is pretty good (certainly better than the likes of SubSonic).
In LINQ to SQL you don't need to care that much about performance. You can chain all your statements in the way you think is most readable. LINQ translates all your statements into a single SQL statement in the end, which only gets executed when it has to (for example, when you call .ToList()).
A variable can hold the query without executing it, so you can attach extra statements under different conditions. Execution only happens at the end, when you want to translate your statements into a result such as an object or a list of objects.
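A small sketch of that point; the Customer type and the flags are made-up names, and nothing is sent to the database until ToList() is called:

using System.Collections.Generic;
using System.Linq;

class Customer
{
    public string Name { get; set; }
    public bool IsActive { get; set; }
}

static class CustomerQueries
{
    public static List<Customer> Load(IQueryable<Customer> customers, bool onlyActive, bool orderByName)
    {
        IQueryable<Customer> query = customers;               // just builds up the expression tree
        if (onlyActive) query = query.Where(c => c.IsActive);
        if (orderByName) query = query.OrderBy(c => c.Name);
        return query.ToList();                                // the single SQL statement runs here
    }
}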
There's a codeplex project called i4o which I used a while back which can help improve the performance of Linq to Objects in cases where you're doing equality comparisons, e.g.
from p in People
where p.Age == 21
select p;
http://i4o.codeplex.com/
I haven't tested it with .NET 4, so I can't say for sure that it will still work, but it's worth checking out.
To get it to work its magic you mostly just have to decorate your class with some attributes to specify which property should be indexed. When I used it before, it only worked with equality comparisons, though.
I find myself often with a situation where I need to perform an operation on a set of properties. The operation can be anything from checking if a particular property matches anything in the set to a single iteration of actions. Sometimes the set is dynamically generated when the function is called, some built with a simple LINQ statement, other times it is a hard-coded set that will always remain the same. But one constant always exists: the set only exists for one single operation and has no use before or after it.
My problem is, I have so many points in my application where this is necessary, but I appear to be very, very inconsistent in how I store these sets. Some of them are arrays, some are lists, and just now I've found a couple of linked lists. Now, none of the operations I'm specifically concerned about have to care about indices, container size, order, or any other functionality that is bestowed by any of the individual container types. I picked resource efficiency because it's a better idea than flipping coins. I figured that since an array's size is fixed and it's a very elementary container, it might be my best choice, but I figure it is a better idea to ask around. Alternatively, if there's a better choice for this kind of situation that isn't driven strictly by resource efficiency, that would be nice to know as well.
With your acknowledgement that this is more about coding consistency than performance or efficiency, I think the general practice is to use a List<T>. Its actual backing store is an array, so you aren't really losing much (if anything noticeable) to container overhead. Without more qualifications, I'm not sure that I can offer anything more than that.
Of course, if you truly don't care about the things that you list in your question, just type your variables as IEnumerable<T> and you're only dealing with the actual container when you're populating it; where you consume it will be entirely consistent.
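A tiny illustration of that suggestion (the names are made up):

using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Consumers only depend on IEnumerable<string>; the backing container
    // (an array here, but it could just as well be a List<T> or LinkedList<T>) is a private detail.
    static IEnumerable<string> BuildCandidates() => new[] { "alpha", "beta", "gamma" };

    static void Main()
    {
        IEnumerable<string> candidates = BuildCandidates();
        Console.WriteLine(candidates.Contains("beta"));   // True
    }
}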
There are two basic principles to be aware of regarding resource efficiency.
Runtime complexity
Memory overhead
You said that indices and order do not matter and that a frequent operation is matching. A Dictionary<TKey, TValue> (which is a hash table) is an ideal candidate for this type of work. Lookups on the keys are very fast, which would be beneficial in your matching operation. The disadvantage is that it will consume a little more memory than what would be strictly required. The usual load factor is around .8, so we are not talking about a huge increase or anything.
For your other operations you may find that an array or List<T> is a better option especially if you do not need to have the fast lookups. As long as you are not needing high performance on specialty operations (lookups, sorting, etc.) then it is hard to beat the general resource characteristics of array based containers.
List is probably fine in general. It's easy to understand (in the literate programming sense) and reasonably efficient. The keyed collections (e.g. Dictionary, SortedList) will throw an exception if you add an entry with a duplicate key, though this may not be a problem for what you're working on now.
Only if you find that you're running into a CPU-time or memory-size problem should you look at improving the "efficiency", and then only after determining that this is the bottleneck.
No matter which approach you use, there will still be creation and deletion of the underlying objects (collection or iterator) that will eventually be garbage collected, if the application runs long enough.
I'm using a Dictionary<> to store a bazillion items. Is it safe to assume that as long as the server's memory has enough space to accommodate these bazillion items that I'll get near O(1) retrieval of items from it? What should I know about using a generic Dictionary as huge cache when performance is important?
EDIT: I shouldn't rely on the default implementations? What makes for a good hashing function?
It depends, just about entirely, on how good a hashing function your "bazillion items" support -- if their hashing function is not excellent (so that many collisions result), your performance will degrade with the growth of the dictionary.
You should measure it and find out. You're the one who has knowledge of the exact usage of your dictionary, so you're the one who can measure it to see if it meets your needs.
A word of advice: I have in the past done performance analysis on large dictionary structures, and discovered that performance did degrade as the dictionary became extremely large. But it seemed to degrade here and there, not consistently on each operation. I did a lot of work trying to analyze the hash algorithms, etc, before smacking myself in the forehead. The garbage collector was getting slower because I had so much live working set; the dictionary was just as fast as it always was, but if a collection happened to be triggered, then that was eating up my cycles.
That's why it is important to not do performance testing in unrealistic benchmark scenarios; to find out what the real-world performance cost of your bazillion-item dictionary is, well, that's going to be gated on lots of stuff that has nothing to do with your dictionary, like how much collection triggering is happening throughout the rest of your program, and when.
Yes, you will have O(1) access times. In fact, to be pedantic, it will be exactly O(1).
You need to ensure that all your objects that are used as keys have a good GetHashCode implementation and should likely override Equals.
Edit to clarify: In reality, access times will get slower the more items you have, unless you can provide a "perfect" hash function.
Yes, you will have near-O(1) access no matter how many objects you put into the Dictionary. But for the Dictionary to be fast, your key objects should provide a sufficient GetHashCode implementation, because Dictionary uses a hash table internally.
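For illustration, a minimal sketch of a key type that overrides Equals and GetHashCode as these answers suggest; the KeywordKey type and the hash-combining constant are assumptions, not a prescribed recipe:

using System;
using System.Collections.Generic;

sealed class KeywordKey : IEquatable<KeywordKey>
{
    public string Category { get; private set; }
    public string Word { get; private set; }

    public KeywordKey(string category, string word)
    {
        Category = category;
        Word = word;
    }

    public bool Equals(KeywordKey other)
    {
        return other != null && Category == other.Category && Word == other.Word;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as KeywordKey);
    }

    // Combine the fields' hash codes; a poor combination means more collisions
    // and slower dictionary lookups as the table grows.
    public override int GetHashCode()
    {
        int h1 = Category == null ? 0 : Category.GetHashCode();
        int h2 = Word == null ? 0 : Word.GetHashCode();
        return (h1 * 397) ^ h2;
    }
}

class Demo
{
    static void Main()
    {
        var cache = new Dictionary<KeywordKey, int>();
        cache[new KeywordKey("sports", "football")] = 42;
        Console.WriteLine(cache[new KeywordKey("sports", "football")]);   // 42: equal keys hash alike
    }
}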