Fastest way to detect non-equal strings (without storing the string)? - c#

I am writing a templating engine and I am searching for a good way to detect if a template has changed.
For this I have the following requirements (in order of importance):
non-equal strings must be detected as different
as fast as possible
as little memory as possible (=> do not store the whole string for comparison)
high probability of detecting equal strings as equal
It is not a big problem if equal strings are occasionally not detected as equal, as this would just trigger an unnecessary "re-rendering"; but because rendering is heavy work, it should happen as rarely as possible.
I first thought of using String.GetHashCode(), but the probability of getting the same hash code for two non-equal strings seems pretty high.
Are there any good combinations, like checking hash code and Length, that push the probability of two non-equal strings being wrongly detected as equal down to an unrealistically low number?
Or is using a hashing algorithm like MD5 or SHA a good alternative (after the hash codes are equal)?
My rendering looks something like the following:
public string RenderTemplate(string name, string template)
{
    var cachedTemplate = Cache.Get(name);
    if (cachedTemplate == null || !cachedTemplate.Equals(template)) // <= Equals
    {
        cachedTemplate = new Template(name, template);
        cachedTemplate.Render();
        Cache.Set(name, cachedTemplate);
    }
    return cachedTemplate.Result;
}
The Equals is the point I am asking about.
I am also open to other suggestions for how this could be solved.
UPDATE:
To add some numbers to get more context:
I expect to have >1000 individual templates, and each template can be several thousand characters long.
This is why I would like to avoid storing the whole template-string "in memory" only for the comparison.
Most of the templates are stored in the DB.
UPDATE 2:
What do you think about extending my RenderTemplate method with a timestamp as suggested by Nikola:
public string RenderTemplate(string name, string template, DateTime timestamp)
Then I could compare name, GetHashCode and timestamp, which does not need much memory, should be pretty fast, and makes the probability of a "wrongly detected equality" practically zero. I can read the timestamp from the DB (it is already there) or use the "last changed date" from the file system for a file-based template.
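A rough sketch of such a fingerprint (the type and member names are just illustrative, not existing types in my code):

using System;

// Rough sketch: store only a small fingerprint instead of the whole template string.
public readonly struct TemplateFingerprint
{
    public readonly int HashCode;       // template.GetHashCode()
    public readonly int Length;         // template.Length
    public readonly DateTime Timestamp; // "last changed" from the DB or file system

    public TemplateFingerprint(string template, DateTime timestamp)
    {
        HashCode = template.GetHashCode();
        Length = template.Length;
        Timestamp = timestamp;
    }

    public bool Matches(string template, DateTime timestamp) =>
        Timestamp == timestamp
        && Length == template.Length
        && HashCode == template.GetHashCode();
}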

You don't have much choice. If you don't compare strings by comparing their content, use a hash algorithm to determine whether they are equal. Personally, I would probably use a hash algorithm. If you are a bit paranoid and afraid of a collision, choose an algorithm with the widest space (e.g. SHA512).
Why do you need to compare strings to determine that a template has changed? Why not use a different approach?
If the file is stored on disk, why not use a file watcher?
If it is stored in a database, why not use a timestamp to detect when it was saved?
If the application is restarted, reload the templates anyway.
Also, it's worrying that a UI template changes so often that you must make checks like this. I think you have bigger design problems than comparing strings.
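For the file-based case, a minimal FileSystemWatcher sketch could look like this (the path, the filter and the InvalidateTemplate callback are placeholders you would replace with your own):

using System.IO;

// Minimal sketch: invalidate a cached template when its file changes on disk.
var watcher = new FileSystemWatcher(@"C:\Templates", "*.tpl")
{
    NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName,
    EnableRaisingEvents = true
};
watcher.Changed += (sender, e) => InvalidateTemplate(e.FullPath); // InvalidateTemplate is a placeholder
watcher.Renamed += (sender, e) => InvalidateTemplate(e.FullPath);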

Related

Replacement .net Dictionary

Given (Simplified description)
One of our services has a lot of instances in memory. About 85% are unique.
We need very fast key-based access to these items, as they are queried very often in a single stack/call. This single context is extremely performance-optimized.
So we started to put them into a dictionary. The performance was OK.
Access to the items as fast as possible is the most important thing in this case. It is ensured that there are no write operations when reads occur.
Problem
In the meantime we have hit the limit on the number of items a dictionary can store.
Die Arraydimensionen haben den unterstützten Bereich überschritten.
bei System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
bei System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
Which translates to The array dimensions have exceeded the supported range.
Solutions like Memcached are just too slow in this specific case. It is an isolated, very specific use case encapsulated in a single service.
So we are looking for a replacement of the dictionary for this specific scenario.
Currently I can't find one supporting this. Am I missing something? Can someone point me to one?
As an alternative, if none exists, we are thinking about implementing one ourselves.
We thought about two possibilities: building it from scratch, or wrapping multiple dictionaries.
Wrapping multiple dictionaries
When an item is searched for, we could look at the key's HashCode and use its leading digits as an index into a list of wrapped dictionaries. Although this seems easy, it smells to me, and it would mean that the hash code is calculated twice (once by us, once by the inner dictionary) - and this scenario is really performance-critical.
I know that exchanging a base type like the dictionary is the absolute last resort, and I want to avoid it. But currently it looks like there is no way to make the objects more unique, to get dictionary-like performance out of a database, or to save performance somewhere else.
I'm also aware of "beware of premature optimization", but lower performance here would hit the business requirements behind it very badly.
Before I finished reading your question, the simple multiple-dictionaries approach came to my mind. But you already know this solution. I am assuming you are really hitting the maximum number of items in a dictionary, not some other limit.
I would say go for it. I do not think you should be worried about computing a hash twice. If the keys are somehow long and getting the hash is really a time-consuming operation (which I doubt, but can't be sure since you did not mention what the keys are), you do not need to use the whole key for your hash function. Just pick whatever part you are OK with processing in your own hashing and distribute the items based on that.
The only thing you need to make sure of here is an even spread of items among your multiple dictionaries. How hard this is to achieve really depends on what your keys are. If they were completely random numbers, you could just use the first byte and it would be fine (unless you needed more than 256 dictionaries). If they are not random numbers, you have to think about the distribution in their domain and write your first hash function so that it achieves an even distribution.
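A rough sketch of the wrapping idea (ShardedDictionary is just an illustrative name, and only the operations mentioned in the question are shown; yes, the hash is computed once here and once inside the inner dictionary, which I would not worry about):

using System.Collections.Generic;

// Sketch: shard by a cheap function of the key's hash code so that no single
// inner Dictionary grows past the framework's size limit.
public sealed class ShardedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _shards;

    public ShardedDictionary(int shardCount = 16)
    {
        _shards = new Dictionary<TKey, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<TKey, TValue>();
    }

    private Dictionary<TKey, TValue> ShardFor(TKey key) =>
        _shards[(key.GetHashCode() & 0x7FFFFFFF) % _shards.Length];

    public void Add(TKey key, TValue value) => ShardFor(key).Add(key, value);

    public bool TryGetValue(TKey key, out TValue value) =>
        ShardFor(key).TryGetValue(key, out value);
}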
I've looked at the implementation of the .NET Dictionary, and it seems like you should be able to store 2^32 values in your dictionary. (Next to the list of buckets, which are themselves linked lists, there is a single array that stores all items, probably for quick iteration; that might be the limiting factor.)
If you haven't added 2^32 values, it might be that there is a limit on the items in a bucket (it's a linked list, so it's probably limited by the maximum stack-frame size). In that case you should double-check that your hashing function spreads the items evenly over the dictionary. See this answer for more info: What is the best algorithm for an overridden System.Object.GetHashCode?

monitor html change using hash func

I want to write an application that gets a list of URLs.
For each of them I need to periodically monitor whether the content has changed.
I thought:
to use HtmlAgilityPack to fetch the HTML content (any other recommendations?)
I don't need to spot the change itself,
so I thought to hash the content, save it in the DB,
and re-compare the hash in the future.
How would you suggest hashing? .NET's GetHashCode()?
I saw this documentation: http://support.microsoft.com/kb/307020
which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
why?
You should absolutely not use GetHashCode() for this. The documentation explicitly states:
Furthermore, the .NET Framework does not guarantee that the default implementation of the GetHashCode method, and the value it returns, will be the same between different versions of the .NET Framework.
The results of GetHashCode can change between runs - all that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different to the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc - something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too - you don't need to bother decoding the data as text.)
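For example, hashing the raw response bytes with a stable algorithm could look like the sketch below (the URL and variable names are illustrative, and SHA-256 is just one choice):

using System;
using System.Net;
using System.Security.Cryptography;

// Sketch: fingerprint a page by hashing the raw response bytes with a stable,
// documented algorithm instead of relying on String.GetHashCode().
byte[] content = new WebClient().DownloadData("http://example.com");
string fingerprint;
using (var sha = SHA256.Create())
{
    fingerprint = Convert.ToBase64String(sha.ComputeHash(content));
}
// Store 'fingerprint' in the DB and compare it against the value from the next fetch.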
It's not clear to me whether refetching periodically is really the best idea though - do these servers not support last modified times, etags etc?
As you have asked for suggestions, I would have used this method instead:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("http://google.com");
And I would have saved this string in my DB. After the given interval I could compare them again.
But yes, I do agree the string would be really large.
If I just wanted an alert that the content has changed somehow, I would use MD5, since an MD5 hash is only 128 bits (32 hex characters).
Hence it is easier to compare and store in the DB.

Fast data structure for small sets

I need a data structure that can handle small sets (10-20 strings of varying length, at most 50) very fast. False positives are OK, but false negatives are not.
The last requirement makes Bloom filters seem like a good fit, but I'm not sure about their speed; any other recommendations?
Edit: The set only needs to support insert + membership test.
How about an array of strings that you loop over with a for-loop, checking membership with String.Equals?
For sets this small, fancy data structures may incur too much overhead, and big-O does not apply. Have you tried doing the simplest possible thing and measuring it?
(If false positives are ok, you might also keep e.g. an array of 1024 bools, where you compute a poor 'hash' of strings by looking at just the first two characters' lowest 5 bits to give you a 10-bit index into the boolean array. Seems like this would be just a few instructions long.)
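A sketch of that idea (TinyStringFilter is a made-up name, and it assumes non-empty strings):

// Sketch of the "poor hash into a bool array" idea above: membership tests can
// give false positives, but never false negatives for strings that were added.
public sealed class TinyStringFilter
{
    private readonly bool[] _seen = new bool[1024];

    // 10-bit "hash": lowest 5 bits of each of the first two characters.
    private static int PoorHash(string s) =>
        ((s[0] & 0x1F) << 5) | (s.Length > 1 ? (s[1] & 0x1F) : 0);

    public void Add(string s) => _seen[PoorHash(s)] = true;

    public bool MightContain(string s) => _seen[PoorHash(s)];
}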
Depending on what operations you wish to perform against the set, the fastest will likely be a HashSet<string>. See HashSet for more.
ADDITION
Asking Mr. Google, here's an article written by a gentleman who wrote a Bloom filter function in C#. However, he's still using (multiple) hash codes to populate the filter. I would expect that on small data sets it will be slower than a HashSet.
If the set of strings to check for membership is much larger than the set of valid strings then a Trie might give you better performance than a HashSet. The speed of a lookup in a hashset is dependent on the run time of the hashing algorithm which is usually O(k) where k is the length of the string. This is true whether the string is in the hashset or not.
With a Trie, lookup is still O(k), but if the string is not in the Trie, it will terminate the lookup as soon as a single character doesn't match. So best-case, a lookup for an invalid string is O(1).
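A minimal trie sketch along those lines (insert and membership only, plain per-node child dictionaries, not tuned for performance):

using System.Collections.Generic;

// Minimal trie: lookup bails out as soon as a character has no matching child.
public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Insert(string word)
    {
        Node current = _root;
        foreach (char c in word)
        {
            Node next;
            if (!current.Children.TryGetValue(c, out next))
            {
                next = new Node();
                current.Children.Add(c, next);
            }
            current = next;
        }
        current.IsWord = true;
    }

    public bool Contains(string word)
    {
        Node current = _root;
        foreach (char c in word)
        {
            if (!current.Children.TryGetValue(c, out current))
                return false; // mismatch: stop without examining the rest
        }
        return current.IsWord;
    }
}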
Why not use a Radix Tree? It's a specialized set data structure based on the trie that is used to store a set of strings.
Check out the System.Collections.Specialized Namespace on MSDN.
Especially the HybridDictionary and the StringDictionary.
I know they're not sets, but you can use null values for each key. (Java does the same with its out-of-the-box Sets and is still "fast".)
If HashSet is too slow for you, you can use the classic LZ compressor technique: a fixed-size array of hash buckets, where each entry points to a linked list of strings.
If you know the domain of your data, just construct a perfect hash function and use it.
If that's not your case, you can use string.GetHashCode() or something like Murmur hash
and use hash(str) % array.Length as the array index.
I suppose an array size of 256-512 entries is good enough for a data structure with 50 strings.
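A rough sketch of that layout (the class name, the bucket count of 256 and the use of List<string> instead of a hand-rolled linked list are my own choices):

using System.Collections.Generic;

// Sketch: a fixed-size array of buckets, each holding the few strings that hash
// into it. With ~50 strings, most buckets stay empty or hold a single entry.
public sealed class SmallStringSet
{
    private readonly List<string>[] _buckets = new List<string>[256];

    private int BucketIndex(string s) =>
        (s.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

    public void Add(string s)
    {
        int i = BucketIndex(s);
        var bucket = _buckets[i] ?? (_buckets[i] = new List<string>());
        if (!bucket.Contains(s))
            bucket.Add(s);
    }

    public bool Contains(string s)
    {
        var bucket = _buckets[BucketIndex(s)];
        return bucket != null && bucket.Contains(s);
    }
}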
The main benefit of bloom filters over hash tables is that their size depends on the number of objects in the database and the permitted probability for false positives, but not on the size of the objects themselves. Since your database is so small I doubt its size is your main concern.
HashSets are theoretically the best data structure for your requirement, but since the database is so small, an O(log n) structure like a SortedDictionary is often preferable, or maybe even just linear search (as mentioned). I recall stories where switching from hash-based collections to tree-based ones drastically increased performance for small sets.
The best way is to switch between them and compare the performance of each.

What is the fastest way to count the unique elements in a list of billion elements?

My problem is not a usual one. Let's imagine a few billion strings. The strings are usually less than 15 characters long. In this list I need to find the number of unique elements.
First of all, what object should I use? Don't forget that if I add a new element I have to check whether it already exists in the list. It is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought a Hashtable would be ideal for this task, because checking for an element is ideally only O(1). Unfortunately, a single object in .NET can only be 2 GB.
The next step would be to implement a custom hashtable which contains a list of 2 GB hashtables.
I am wondering whether some of you know a better solution.
(The computer has an extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug? Just use a database. Databases are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
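A compact sketch of the counting-sort-per-position approach described above (the alphabet size, the convention of treating missing characters as 0, and all names here are my own choices):

using System;

// Sketch: LSD radix sort with a stable counting sort per character position,
// for strings of length <= maxLen. Characters past a string's end count as 0.
// After sorting, equal strings are adjacent, so distinct values can be counted
// in one linear pass.
static class StringRadixSort
{
    public static void Sort(string[] a, int maxLen)
    {
        const int R = char.MaxValue + 1;   // alphabet size
        var aux = new string[a.Length];

        for (int pos = maxLen - 1; pos >= 0; pos--)
        {
            var count = new int[R + 1];
            foreach (string s in a)
                count[CharAt(s, pos) + 1]++;        // count frequencies
            for (int r = 0; r < R; r++)
                count[r + 1] += count[r];           // frequencies -> start indices
            foreach (string s in a)
                aux[count[CharAt(s, pos)]++] = s;   // stable distribute
            Array.Copy(aux, a, a.Length);
        }
    }

    private static int CharAt(string s, int pos) => pos < s.Length ? s[pos] : 0;
}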
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it contains only its object and links to its parent and two children), so you can have a whole slew of them.
Also, because it's sorted, retrieval and insertion time are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit int, yielding roughly 4 billion unique hash values. If you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some google research turns up that the probability of a collision for a perfectly distributed 32-bit hash is (N - 1) / 2^32, where N is the number of unique things that are hashed.
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
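As an illustration of that partitioning (using HashSet<string> here, and a small sample array as a stand-in for the real data source):

using System.Collections.Generic;

// Sketch: 256 sets selected by hash % 256, so no single collection has to hold
// everything on its own; the unique count is then the sum of the set sizes.
string[] inputStrings = { "alpha", "beta", "alpha" };   // stand-in for the real data

var partitions = new HashSet<string>[256];
for (int i = 0; i < partitions.Length; i++)
    partitions[i] = new HashSet<string>();

foreach (string s in inputStrings)
    partitions[(s.GetHashCode() & 0x7FFFFFFF) % partitions.Length].Add(s);

long uniqueCount = 0;
foreach (var set in partitions)
    uniqueCount += set.Count;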
Divide and conquer - partition the data by the first 2 letters (say):
a dictionary of xx => dictionary of string => count
I would use a database; any database would do.
It is probably the fastest option, because modern databases are optimized for speed and memory usage.
You need only one column with an index, and then you can count the number of records.
+1 for the SQL/DB solutions; it keeps things simple and will allow you to focus on the real task at hand.
But just for academic purposes, I would like to add my 2 cents.
-1 for hashtables. (I cannot vote down yet.) Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus, I agree with Eric J. that the chances of collisions will undermine the time-efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case if you also need to perform search-like operations on the set of strings in the future and have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for one-time duplicate counting with limits on how much space can be used.
If we have enough storage available to hold the 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then traverse the collection once in O(n) time, doing this:
for (int i = 0; i < a.Length - 1; i++)
    if (a[i] == a[i + 1])
        dupCount++;
If we do not have that much memory available, we can divide the input file on disk into smaller files (till the size becomes small enough to hold the collection in memory); then sort each such small file by using the above technique; then merge them together. This requires many passes on the main input file.
I would like to keep away from quicksort because the dataset is huge. If I could squeeze some extra memory into the second case, I would rather use it to reduce the number of passes than waste it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a hash map (Dictionary in .NET)?
Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer into the string pool, 1 for the byte), which is about 400M elements. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bet would be to sort the data in place on disk (after which counting unique elements is trivial), or use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.

Should we store format strings in resources?

For the project that I'm currently on, I have to deliver specially formatted strings to a 3rd party service for processing. And so I'm building up the strings like so:
string someString = string.Format("{0}{1}{2}: Some message. Some percentage: {3}%", token1, token2, token3, number);
Rather than hardcode the string, I was thinking of moving it into the project resources:
string someString = string.Format(Properties.Resources.SomeString, token1, token2, token3, number);
The second option is, in my opinion, not as readable as the first one, i.e. the person reading the code would have to pull up the string resources to work out what the final result should look like.
How do I get around this? Is the hardcoded format string a necessary evil in this case?
I do think this is a necessary evil, one I've used frequently. Something smelly that I do is:
// "{0}{1}{2}: Some message. Some percentage: {3}%"
string someString = string.Format(Properties.Resources.SomeString
,token1, token2, token3, number);
...at least until the code is stable enough that I might be embarrassed to have others see that.
There are several reasons that you would want to do this, but the only great reason is if you are going to localize your application into another language.
If you are using resource strings there are a couple of things to keep in mind.
Include format strings whenever possible in the set of resource strings you want localized. This will allow the translator to reorder the position of the formatted items to make them fit better in the context of the translated text.
Avoid having strings in your format tokens that are in your language. It is better to use
these for numbers. For instance, the message:
"The value you specified must be between {0} and {1}"
is great if {0} and {1} are numbers like 5 and 10. If you are formatting in strings like "five" and "ten" this is going to make localization difficult.
You can get around the readability problem you are talking about by simply naming your resources well.
string someString = string.Format(Properties.Resources.IntegerRangeError, minValue, maxValue );
Evaluate whether you are generating user-visible strings at the right abstraction level in your code. In general I tend to group all the user-visible strings in the code as close to the user interface as possible. If some low-level file I/O code needs to report errors, it should do so with exceptions, which you handle in your application with consistent error messages. This also consolidates all of the strings that require localization instead of having them peppered throughout your code.
One thing you can do to help with hard-coded strings, or even to speed up adding strings to a resource file, is to use CodeRush Xpress, which you can download for free here: http://www.devexpress.com/Products/Visual_Studio_Add-in/CodeRushX/
Once you write your string, you can access the CodeRush menu and extract it to a resource file in a single step. Very nice.
Resharper has similar functionality.
I don't see why including the format string in the program is a bad thing. Unlike traditional undocumented magic numbers, it is quite obvious what it does at first glance. Of course, if you are using the format string in multiple places it should definitely be stored in an appropriate read-only variable to avoid redundancy.
I agree that keeping it in the resources is unnecessary indirection here. A possible exception would be if your program needs to be localized, and you are localizing through resource files.
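For example, a named constant next to the code that uses it keeps the format readable without a resource file (the names are illustrative; token1 through number are the variables from the question):

// Illustrative only: a named, read-only format string close to the code that uses it.
const string StatusMessageFormat = "{0}{1}{2}: Some message. Some percentage: {3}%";

string someString = string.Format(StatusMessageFormat, token1, token2, token3, number);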
Yes, you can.
Now let's see how:
String.Format(Resource_en.PhoneNumberForEmployeeAlreadyExist, letterForm.EmployeeName[i])
This will give me a dynamic message every time.
By the way, I'm using ResXManager.
