I need to store about 60,000 IP-address-like things such that I can quickly determine whether the store contains an address, or return all the addresses that match a pattern like 3.4.*.* or *.5.*.*. The previous implementation used Hashtables nested four levels deep. It's not fully thread safe, and this is causing us bugs. I could make it thread safe by locking on the outer layer, or I could change all those to ConcurrentDictionaries, but neither option seems quite right. Using a byte as a key in a dictionary has never felt quite right to me in general, especially in a heavyweight dictionary. Suggestions?
Guava uses a prefix trie for storing IP lookup matches. You can see the code here:
https://code.google.com/p/google-collections/source/browse/trunk/src/com/google/common/collect/PrefixTrie.java?r=2
This is Java code, but I'm sure you could easily adapt it to C#. The prefix-trie technique is applicable independent of the language and gets you trailing-wildcard matches for free. If you want arbitrary wildcards, you'll still need to implement that yourself. Alternatively, you could build a data structure similar to a directed acyclic word graph (DAWG), which lets you implement arbitrary wildcard matches more directly.
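For illustration, here's a rough C# sketch of that prefix-trie idea, assuming IPv4-style addresses stored as four bytes, one trie level per octet, and a coarse lock for thread safety (the class and member names are made up for the example, not from any library). A trailing-wildcard query such as 3.4.*.* walks the concrete prefix and then collects everything below that node; patterns with interior wildcards like *.5.*.* would still need the extra work mentioned above.

using System.Collections.Generic;

public class IpPrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<byte, Node> Children = new Dictionary<byte, Node>();
        public bool IsTerminal;   // true if a complete address ends at this node
        public byte[] Address;    // the stored address, kept for enumeration
    }

    private readonly Node _root = new Node();
    private readonly object _sync = new object();   // coarse lock; fine for a sketch

    public void Add(byte[] address)   // expects address.Length == 4
    {
        lock (_sync)
        {
            Node node = _root;
            foreach (byte octet in address)
            {
                Node child;
                if (!node.Children.TryGetValue(octet, out child))
                {
                    child = new Node();
                    node.Children[octet] = child;
                }
                node = child;
            }
            node.IsTerminal = true;
            node.Address = address;
        }
    }

    public bool Contains(byte[] address)
    {
        lock (_sync)
        {
            Node node = _root;
            foreach (byte octet in address)
            {
                if (!node.Children.TryGetValue(octet, out node))
                    return false;
            }
            return node.IsTerminal;
        }
    }

    // Matches patterns whose wildcards are trailing, e.g. 3.4.*.* => prefix { 3, 4 }.
    public List<byte[]> MatchPrefix(byte[] prefix)
    {
        var results = new List<byte[]>();
        lock (_sync)
        {
            Node node = _root;
            foreach (byte octet in prefix)
            {
                if (!node.Children.TryGetValue(octet, out node))
                    return results;
            }
            Collect(node, results);
        }
        return results;
    }

    private static void Collect(Node node, List<byte[]> results)
    {
        if (node.IsTerminal)
            results.Add(node.Address);
        foreach (Node child in node.Children.Values)
            Collect(child, results);
    }
}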
I'm writing a bot that will analyse posts and reply with a vaguely related string from a database. I'm not aiming for coherence, just for vague similarity that could pass as coming from someone ignorant of the topic (but knowledgeable enough to try to reply). What are some methods that would help me choose the right reply?
One thing I've come up with is to create a vocabulary list, check which elements of the list appear in the post, and pick a reply from the database based on those results. This crude method has been successful about 10% of the time (based on 100 replies to random posts). I might expand the list with more words, but this method has its limits. Any better ones?
(P.S. The database is sizeable: about 500,000 replies.)
First of all, I think the best you can hope for will be about a 50% answer rate, unless you're prepared to write a lot of code.
If you're willing to get your hands dirty with some statistics, check out term frequency–inverse document frequency (tf-idf). Basically, you use the frequency of uncommon words to determine which keywords are critical to a document, and then use those keywords as the input to the tf-idf scoring to pull out other replies that share them.
You can then combine this with whitelisting and blacklisting techniques to ignore common words and prioritize certain keywords, and keep tuning those lists to improve the algorithm as you see it work.
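If it helps, here's a very rough C# sketch of that idea, under the assumption that the corpus is just the raw reply strings and that whitespace/punctuation tokenization is good enough. It's a simplified idf-weighted overlap rather than a full tf-idf vector model, and the names (ReplyPicker, BestReply) are invented for the example.

using System;
using System.Collections.Generic;
using System.Linq;

// Score each stored reply by the rarity (inverse document frequency) of the
// words it shares with the post, and return the highest-scoring reply.
public static class ReplyPicker
{
    private static HashSet<string> Tokenize(string text)
    {
        char[] separators = { ' ', '\t', '\n', '\r', '.', ',', '!', '?', ';', ':' };
        return new HashSet<string>(
            text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }

    public static string BestReply(string post, IList<string> replies)
    {
        // Document frequency: in how many replies does each word appear?
        var documentFrequency = new Dictionary<string, int>();
        var replyTokens = replies.Select(Tokenize).ToList();
        foreach (HashSet<string> tokens in replyTokens)
        {
            foreach (string word in tokens)
            {
                int count;
                documentFrequency.TryGetValue(word, out count);
                documentFrequency[word] = count + 1;
            }
        }

        double replyCount = replies.Count;
        HashSet<string> postWords = Tokenize(post);

        string best = null;
        double bestScore = double.MinValue;
        for (int i = 0; i < replies.Count; i++)
        {
            // Rare shared words (low document frequency) contribute the most.
            double score = replyTokens[i]
                .Where(postWords.Contains)
                .Sum(word => Math.Log(replyCount / documentFrequency[word]));
            if (score > bestScore)
            {
                bestScore = score;
                best = replies[i];
            }
        }
        return best;
    }
}

With 500,000 replies you would want an inverted index rather than this linear scan, but the scoring idea stays the same.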
There are also simpler string metrics you can use to test basic similarity. Take a look at this list of string metrics.
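As one concrete example of such a metric, Levenshtein edit distance is easy to drop in; a lower distance means the two strings are more similar. This is a textbook dynamic-programming implementation, not taken from any particular library:

using System;

public static class StringMetrics
{
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                            d[i, j - 1] + 1),     // insertion
                                   d[i - 1, j - 1] + cost);       // substitution
            }
        }
        return d[a.Length, b.Length];
    }
}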
You might want to look into vector-space mapping and resemblance. The "vaguely related" problem could most likely be handled by resemblance-based statistical analysis.
Check out this novel use of resemblance:
http://www.cromwell-intl.com/security/attack-study/
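If it's useful, here's a rough sketch of resemblance scoring in C#: break each text into word shingles (3-word windows here, an arbitrary choice) and compare the two shingle sets with the Jaccard coefficient. The class and method names are just for the example.

using System;
using System.Collections.Generic;
using System.Linq;

// Resemblance: Jaccard similarity (intersection size / union size) over word shingles.
public static class Resemblance
{
    private static HashSet<string> Shingles(string text, int size)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);

        var shingles = new HashSet<string>();
        for (int i = 0; i + size <= words.Length; i++)
            shingles.Add(string.Join(" ", words, i, size));
        return shingles;
    }

    public static double Score(string a, string b)
    {
        HashSet<string> setA = Shingles(a, 3);
        HashSet<string> setB = Shingles(b, 3);
        if (setA.Count == 0 && setB.Count == 0)
            return 0.0;

        int intersection = setA.Count(setB.Contains);
        int union = setA.Count + setB.Count - intersection;
        return (double)intersection / union;   // 0 = unrelated, 1 = identical shingle sets
    }
}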
There is a PHP function called similar_text(), used like $percent_similar = similar_text($str1, $str2);. It works fairly well, but I couldn't find anything similar in C#. If you can get hold of the source for the PHP function, you might try translating it. I think there may be a Java version as well.
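As a starting point, similar_text() is (as far as I know) based on a greedy longest-common-substring approach, which is simple enough to approximate by hand in C#. This is a hedged sketch of that approach rather than a faithful port of the PHP source:

using System;

// Find the longest common substring, recurse on the pieces to its left and
// right, and report the matched character count as a percentage of the
// combined length. Quadratic per level, so not for very long strings.
public static class SimilarText
{
    public static double PercentSimilar(string s1, string s2)
    {
        if (s1.Length + s2.Length == 0)
            return 0.0;
        int common = CommonChars(s1, s2);
        return common * 2.0 / (s1.Length + s2.Length) * 100.0;
    }

    private static int CommonChars(string s1, string s2)
    {
        if (s1.Length == 0 || s2.Length == 0)
            return 0;

        // Find the longest common substring of s1 and s2.
        int bestLen = 0, best1 = 0, best2 = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            for (int j = 0; j < s2.Length; j++)
            {
                int len = 0;
                while (i + len < s1.Length && j + len < s2.Length && s1[i + len] == s2[j + len])
                    len++;
                if (len > bestLen) { bestLen = len; best1 = i; best2 = j; }
            }
        }

        if (bestLen == 0)
            return 0;

        // Count the match, then recurse on the unmatched left and right pieces.
        return bestLen
             + CommonChars(s1.Substring(0, best1), s2.Substring(0, best2))
             + CommonChars(s1.Substring(best1 + bestLen), s2.Substring(best2 + bestLen));
    }
}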
This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5 (or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine that computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5-dimensional (or higher) matrix or array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5 (or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?
I'll comment on 1 and 3 as well. It may be preferable to use a fixed-width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed-width data file can be done without reading the entire file, which is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
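To illustrate the "look up without reading the whole file" point (and the edit in the question), here's a C# sketch assuming the records really are fixed width; the file name, the 16-byte record width, and the 20-values-per-variable indexing are invented example values:

using System.IO;
using System.Text;

// Read a single record from a fixed-width lookup file without loading the
// whole file: the record's byte offset is simply index * width.
public static class FixedWidthLookup
{
    public static string ReadRecord(string path, long recordIndex, int recordWidth)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            stream.Seek(recordIndex * recordWidth, SeekOrigin.Begin);
            var buffer = new byte[recordWidth];
            int read = stream.Read(buffer, 0, recordWidth);
            return Encoding.ASCII.GetString(buffer, 0, read).Trim();
        }
    }
}

// Usage: look up the outcome stored for the combination of variable indices
// (i0, i1, i2, i3, i4), each ranging over 20 sample values, assuming the file
// was written in that nested-loop order:
//   long flatIndex = (((i0 * 20 + i1) * 20 + i2) * 20 + i3) * 20 + i4;
//   string value = FixedWidthLookup.ReadRecord("table.dat", flatIndex, 16);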
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.
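A minimal sketch of that, assuming 20 sample values per variable and a placeholder ComputeOutcome; the traversal order of the nested loops is the contract the importer has to agree on:

using System.IO;

class TableDumper
{
    // Stand-in for the real algorithm being tabulated.
    static double ComputeOutcome(int a, int b, int c, int d, int e)
    {
        return a + b * 0.5 + c * 0.25 + d * 0.125 + e * 0.0625;  // placeholder
    }

    static void Main()
    {
        using (var writer = new StreamWriter("table.csv"))
        {
            for (int a = 0; a < 20; a++)
                for (int b = 0; b < 20; b++)
                    for (int c = 0; c < 20; c++)
                        for (int d = 0; d < 20; d++)
                            for (int e = 0; e < 20; e++)
                                writer.WriteLine("{0},{1},{2},{3},{4},{5}",
                                    a, b, c, d, e, ComputeOutcome(a, b, c, d, e));
        }
    }
}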
I've got a huge (>>10m) list of entries. Each entry offers two hash functions:
Cheap: quickly computes hash, but its distribution is terrible (may put 99% of items in 1% of hash space)
Expensive: takes a lot of time to compute, but its distribution is a lot better
An ordinary Dictionary lets me use only one of these hash functions. I'd like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.
It seems like a good idea to use a dictionary inside a dictionary for this. I currently use this monstrosity:
Dictionary<int, Dictionary<int, List<Foo>>>;  // outer key: cheap hash, inner key: expensive hash, list: entries colliding on both
I improved this design so the expensive hash gets called only if there are actually two items of the same cheap hash.
It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.
To my knowledge, this functionality is not included in the basic framework. I am about to write a DoubleHashedDictionary class, but I wanted to get your opinions first.
As for my specific case:
First hash function = number of files in a file system directory (fast)
Second hash function = sum of the sizes of the files (slow)
Edits:
Changed title and added more information.
Added a quite important missing detail.
In your case, you are technically using a modified hash function (A|B), not true double hashing. However, depending on how huge your "huge" list of entries is and the characteristics of your data, consider the following:
A 20% full hash table with a not-so-good distribution can have more than an 80% chance of collision. This means your expected function cost could be roughly (0.8 × expensive + 0.2 × cheap) + (cost of lookups). So if your table is more than 20% full, it may not be worth using the (A|B) scheme.
Coming up with a perfect hash function is possible, but the construction is O(n^3), which makes it impractical.
If performance is supremely important, you can make a specifically tuned hash table for your specific data by testing various hash functions on your key data.
First off, I think you're on the right path to implementing your own hashtable, if what you are describing is truly what you want. But to play the critic, I'd like to ask a few questions:
Have you considered using something more unique for each entry?
I am assuming that each entry holds information about a file system directory; have you considered using its full path as the key, perhaps prefixed with the computer name or IP address?
On the other hand, if you're using the number of files as the hash key, are those directories never going to change? Because if the hash key/result changes, you will never be able to find the entry again.
While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time of actually calculating it?
Just my two cents.
Have you taken a look at the Power Collections or C5 Collections libraries? The Power Collections library hasn't had much action recently, but the C5 stuff seems to be fairly up to date.
I'm not sure whether either library has what you need, but they're pretty useful and they're open source, so they may provide a decent base implementation for you to extend with your desired functionality.
You're basically talking about a hash table of hash tables, each using a different GetHashCode implementation. While it's possible, I think you'd want to consider seriously whether you'll actually get a performance improvement over just doing one or the other.
Will there actually be a substantial number of objects located via the quick-hash mechanism alone, without having to resort to the more expensive one to narrow things down further? If you can't locate a significant number purely off the first calculation, you really save nothing by doing it in two steps (not knowing the data, it's hard to predict whether this is the case).
If a significant number will be located in one step, then you'll probably have to do a fair bit of tuning to work out how many records to store at each hash location of the outer table before resorting to an inner "expensive" hashtable lookup rather than the more usual treatment of the hashed data. Under certain circumstances I can see how you'd get a performance gain from this (the circumstances would be few and far between, but aren't inconceivable).
Edit
I just saw your amendment to the question - you plan to do both lookups regardless. In that case, I doubt you'll get any performance benefit that you can't get just by configuring the main hash table a bit better. Have you tried using a single dictionary with an appropriate capacity passed in the constructor, and perhaps an XOR of the two hash codes as your hash code?
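For what it's worth, a sketch of that single-dictionary idea might look like the following; DirectoryEntry and its fields are stand-ins for the real entry type, and the capacity is just an example number:

using System.Collections.Generic;

public class DirectoryEntry
{
    public int FileCount;      // cheap hash input
    public long TotalSize;     // expensive hash input, computed once when the entry is built
    public string Path;
}

// One comparer whose hash code is the XOR of the cheap and the expensive hash.
public class CombinedHashComparer : IEqualityComparer<DirectoryEntry>
{
    public int GetHashCode(DirectoryEntry e)
    {
        return e.FileCount ^ e.TotalSize.GetHashCode();
    }

    public bool Equals(DirectoryEntry a, DirectoryEntry b)
    {
        return a.FileCount == b.FileCount
            && a.TotalSize == b.TotalSize
            && a.Path == b.Path;
    }
}

// Usage: pass an initial capacity so the table starts out sparse.
//   var map = new Dictionary<DirectoryEntry, Foo>(20000000, new CombinedHashComparer());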
Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:
var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);
If there is no such function, why bother with making strings immutable?
Or, alternatively, if strings are already immutable for other reasons, why not provide this method?
The specific reason I'm looking into this is file parsing. Simple recursive descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means that if you give them a large file to parse, memory churn is unbelievable. Sure, there are workarounds - basically, roll your own SubString class, and then of course forget about being able to use String methods such as StartsWith or String libraries such as Regex, so you need to roll your own versions of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify using such a monster tool. Even TinyPG is probably overkill.
Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...
No, there's nothing like that.
.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.
Both solutions have "wins" in some situations, and losses in others.
If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.
As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?
The .NET framework supports string interning. This is a partial solution, but it does not offer the possibility of reusing parts of a string. I think reusing substrings would cause some problems that are not that obvious at first glance. If you have to do a lot of string manipulation, StringBuilder is the way to go.
Nothing in C# provides you the out-of-the-box functionality you're looking for.
What you want is a rope data structure, an immutable data structure that supports O(1) concatenation and O(log n) substrings. I can't find any C# implementations of a rope, but here is a Java one.
Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.
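In case it helps as a starting point, here's a bare-bones C# sketch of a rope's core shape (leaves holding string fragments, concat nodes caching lengths). It is nowhere near a full implementation - no balancing, no substring, no enumeration - just the skeleton a real one would hang off:

public abstract class Rope
{
    public abstract int Length { get; }
    public abstract char CharAt(int index);

    public static Rope FromString(string text) { return new LeafNode(text); }

    // Concatenation just allocates a new node: O(1).
    public Rope Concat(Rope right) { return new ConcatNode(this, right); }

    private sealed class LeafNode : Rope
    {
        private readonly string _text;
        public LeafNode(string text) { _text = text; }
        public override int Length { get { return _text.Length; } }
        public override char CharAt(int index) { return _text[index]; }
    }

    private sealed class ConcatNode : Rope
    {
        private readonly Rope _left, _right;
        private readonly int _weight;   // length of the left subtree
        private readonly int _length;   // total length, cached

        public ConcatNode(Rope left, Rope right)
        {
            _left = left;
            _right = right;
            _weight = left.Length;
            _length = left.Length + right.Length;
        }

        public override int Length { get { return _length; } }

        // Indexing walks the tree using the cached weight.
        public override char CharAt(int index)
        {
            return index < _weight ? _left.CharAt(index) : _right.CharAt(index - _weight);
        }
    }
}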
Well you could use "unsafe" to do the memory management yourself, which might allow you to do what you are looking for. Also the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.
You could easily write a trivial class to represent "cheap". It would just hold the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed - a string cast operator would be ideal as you could use
string text = myCheapObject;
and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one-liners).
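A minimal sketch of such a wrapper might look like this (the type name and members are invented for the example; an implicit conversion gives you the string text = myCheapObject; usage described above):

using System;

// Records only the source string, a start index and a length, and
// materializes a real string only when one is actually needed.
public struct CheapSubstring
{
    private readonly string _source;
    private readonly int _start;
    private readonly int _length;

    public CheapSubstring(string source, int start, int length)
    {
        _source = source;
        _start = start;
        _length = length;
    }

    public int Length { get { return _length; } }

    public char this[int index] { get { return _source[_start + index]; } }

    // One of the "one-liner" helpers: compare without allocating a new string.
    public bool StartsWith(string prefix)
    {
        return prefix.Length <= _length
            && string.CompareOrdinal(_source, _start, prefix, 0, prefix.Length) == 0;
    }

    // Lets the wrapper be used where a string is expected: string text = myCheapObject;
    public static implicit operator string(CheapSubstring c)
    {
        return c._source.Substring(c._start, c._length);
    }
}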
The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.
I am getting to the last stage of my rope (a more scalable version of String) implementation. Obviously, I want all operations to give the same result as the operations on Strings whenever possible.
Doing this for ordinal operations is pretty simple, but I am worried about implementing culture-sensitive operations correctly. Especially since I know only two languages and in both of them culture-sensitive operations behave precisely the same as ordinal operations do!
So are there any specific things that I could test to get at least some confidence that I am doing things correctly? I know, for example, about ß being equal to SS when ignoring case in German, and about the dotted and dotless i in Turkish.
Surrogate pairs, if you plan to support them - including invalid combinations (e.g. only one part of one).
If you're doing encoding and decoding, make sure you retain enough state to cope with being given arbitrary blocks of binary data to decode, which may end halfway through a character, with the remaining half coming in the next block.
The Turkish test is the best I know :)
You should mimic the String method implementations and use the core library to do the work for you. It is very hard to take every possible aspect of every culture into account. Instead of reinventing the wheel, use Reflector on the String methods and look at the internal calls. For example, String.Compare uses CultureInfo.CurrentCulture.CompareInfo.Compare to compare two strings in the current culture.
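As a concrete way to apply that advice, you can feed the tricky cases you already know about straight to CompareInfo and treat its answers as the expected results for your rope's culture-sensitive operations. The exact results can vary with the runtime's collation data, which is precisely why deferring to the framework beats hand-rolling the rules. A small sketch:

using System;
using System.Globalization;

class CultureComparisonChecks
{
    static void Main()
    {
        CompareInfo turkish = new CultureInfo("tr-TR").CompareInfo;
        CompareInfo german = new CultureInfo("de-DE").CompareInfo;

        // Turkish dotted/dotless i: case-insensitive "i" vs "I".
        int turkishResult = turkish.Compare("i", "I", CompareOptions.IgnoreCase);

        // German sharp s: "straße" vs "strasse".
        int germanResult = german.Compare("straße", "strasse", CompareOptions.None);

        Console.WriteLine("tr-TR  'i' vs 'I' (ignore case): " + turkishResult);
        Console.WriteLine("de-DE  'straße' vs 'strasse':    " + germanResult);

        // A rope implementation would assert that its own Compare returns the
        // same sign as these framework results for the same inputs and culture.
    }
}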