Merging 2 lists using Levenshtein Distance on terms in list - c#

Good afternoon,
I'm hoping I can get an assist on this from someone. If not some example code, then some general direction I should be going with this.
Essentially I have two large lists (roughly 10,000-20,000 records each) of string terms and IDs. These lists come from two different data providers. The lists are obviously related to one another topically, but each data provider has slight variations in its term naming conventions. For example, list1 would have the term "The Term (Some Subcategory)" and list2 would have "the term - some subcategory". Additionally, list1 could have "The Term (Some Subcategory)" and "The Term (Some Subcategory 2)" while list2 only has "the term - some subcategory".
Both lists have the following properties: "term" and "id". What I need to do is compare every term in both lists and, if a reasonable match is found, generate a new list containing "term", "list1id", and "list2id" properties. If no match is found for a term, I need it added to the list as well, with either "list1id" or "list2id" null/blank (which will indicate the origin of the unmatched term).
I'm willing to use a NuGet package to accomplish this, or if anyone has a good example of what I need that would be helpful too. Essentially I'm attempting to generate a new merged list based on fuzzy matches of the terms in each while somehow retaining the IDs of the matched terms.
My research has dug up some similar articles and source code, such as https://matthewgladney.com/blog/data-science/using-levenshtein-distance-in-csharp-to-associate-lists-of-terms/ and https://github.com/wolfgarbe/symspell, but neither seems to fit what I need.
Where do I go from here with this? Any help would be awesome!
Nugs

Your question is pretty broad, but I will attempt a broad answer to at least get you started. I've done this sort of thing before.
Do it in two stages: first normalize, then match. By doing this you eliminate known but irrelevant causes of differences. To normalize, for example, convert everything to upper case, remove whitespace, strip non-alphanumeric characters, etc. You'll need to be a little creative and work within any constraints you might have (is "Amy (pony)" the same thing as "Amy pony"?). Then calculate the distance.
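For example, a minimal normalization sketch along those lines (the exact rules depend on your data):

using System;
using System.Linq;

// Collapse case, punctuation, and whitespace differences before comparing.
static string Normalize(string term)
{
    var letters = term.ToUpperInvariant()
        .Select(c => char.IsLetterOrDigit(c) ? c : ' ');
    return string.Join(" ", new string(letters.ToArray())
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
}

// Normalize("The Term (Some Subcategory)") == "THE TERM SOME SUBCATEGORY"
// Normalize("the term - some subcategory") == "THE TERM SOME SUBCATEGORY"

With that, both of the question's example spellings collapse to the same string before any distance is computed.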
Create a class with a few properties to contain the value from the left list, the value from the right list, the normalized values, the score, etc.
When you get a match, create an instance of that class, add it to a list or equivalent, remove the old values from the original lists, then keep going.
Try to write your code so you keep track of intermediate values (e.g. the normalized values, the scores). This will make it easier to debug, and will allow you to log everything after processing is done.
Once you're done, you can then throw away intermediate values and keep just the things you identified as a match.
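Putting it together, a minimal sketch; the value-tuple shapes, the maxDistance threshold, and the greedy first-best matching are all assumptions you would tune:

using System;
using System.Collections.Generic;
using System.Linq;

class TermMatch
{
    public string Term { get; set; }            // the spelling we keep (list1's here)
    public string NormalizedTerm { get; set; }  // intermediate value, useful for logging
    public int? List1Id { get; set; }           // null => term only existed in list2
    public int? List2Id { get; set; }           // null => term only existed in list1
    public int Score { get; set; }              // edit distance of the winning match
}

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[a.Length, b.Length];
}

static List<TermMatch> Merge(List<(int Id, string Term)> list1,
                             List<(int Id, string Term)> list2,
                             int maxDistance)
{
    var results = new List<TermMatch>();
    var remaining = list2.Select(x => (x.Id, x.Term, Norm: Normalize(x.Term))).ToList();

    foreach (var (id1, term1) in list1)
    {
        var norm1 = Normalize(term1);
        // Closest list2 candidate by edit distance on the normalized terms.
        var best = remaining
            .Select(c => (Candidate: c, Dist: Levenshtein(norm1, c.Norm)))
            .OrderBy(p => p.Dist)
            .FirstOrDefault();

        if (best.Candidate.Term != null && best.Dist <= maxDistance)
        {
            results.Add(new TermMatch { Term = term1, NormalizedTerm = norm1,
                List1Id = id1, List2Id = best.Candidate.Id, Score = best.Dist });
            remaining.Remove(best.Candidate);   // consume the match so it can't pair twice
        }
        else
        {
            results.Add(new TermMatch { Term = term1, NormalizedTerm = norm1, List1Id = id1 });
        }
    }

    // Anything left in list2 never matched a list1 term.
    results.AddRange(remaining.Select(c => new TermMatch
        { Term = c.Term, NormalizedTerm = c.Norm, List2Id = c.Id }));
    return results;
}

Note the naive scan costs O(n*m) distance computations; at 10,000-20,000 terms per side you would want to pre-bucket candidates (by length band, first letter of the normalized term, etc.) before scoring.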

Related

String likeness algorithms

I have two strings (they're going to be descriptions in a simple database eventually), let's say they're
String A: "Apple orange coconut lime jimmy buffet"
String B: "Car
bicycle skateboard"
What I'm looking for is this: I want a function that will take the input "cocnut" and have the output be "String A".
We could have differences in capitalization, and the spelling won't always be spot on. The goal is a 'quick and dirty' search if you will.
Are there any .NET (or third-party) libraries, or recommended 'likeness algorithms' for strings, so I could check that the input has a 'pretty close fragment' and return it? My database is going to have like 50 entries, tops.
What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.
Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.
This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).
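A sketch of that local variant in C# (to match the question's .NET context): leave the first row at zero so a match may start anywhere in the text, and take the minimum over the last row so it may end anywhere:

using System;
using System.Linq;

// Minimal edit distance of `pattern` against any substring of `text`.
static int FuzzySubstringDistance(string pattern, string text)
{
    int m = pattern.Length, n = text.Length;
    var d = new int[m + 1, n + 1];
    for (int i = 0; i <= m; i++) d[i, 0] = i;
    // The first row stays 0: no penalty for whatever precedes the match.
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + (pattern[i - 1] == text[j - 1] ? 0 : 1));
    // Best value anywhere in the last row: no penalty for whatever follows.
    return Enumerable.Range(0, n + 1).Min(j => d[m, j]);
}

With both sides lower-cased, FuzzySubstringDistance("cocnut", stringA) scores 1 against String A from the question, so returning the string with the lowest score gives the desired behaviour.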

Compare 10 Million Entities

I have to write a program that compares 10'000'000+ Entities against one another. The entities are basically flat rows in a database/csv file.
The comparison algorithm has to be pretty flexible, it's based on a rule engine where the end user enters rules and each entity is matched against every other entity.
I'm thinking about how I could possibly split this task into smaller workloads but I haven't found anything yet. Since the rules are entered by the end user pre-sorting the DataSet seems impossible.
What I'm trying to do now is fit the entire DataSet in memory and process each item. But that's not highly efficient and requires approx. 20 GB of memory (compressed).
Do you have an idea how I could split the workload or reduce its size?
Thanks
If your rules sit at the highest level of abstraction (e.g. an arbitrary unknown comparison function), you can't achieve your goal. 10^14 comparison operations will run for ages.
If the rules are not completely general, I see three solutions to optimize different cases:
If the comparison is transitive and you can calculate a hash (somebody already recommended this), do it. Hashes can also be complicated, not only your rules =). Find a good hash function and it might help in many cases (see the sketch after this list).
If the entities are sortable, sort them. For this purpose I'd recommend not sorting in place but building an array of indexes (or IDs) of the items. If your comparison can be transformed to SQL (as I understand it, your data is in a database), you can perform this on the DBMS side more efficiently and read back the sorted indexes (for example 3,1,2, which means that the item with ID=3 is the lowest, ID=1 is in the middle, and ID=2 is the largest). Then you only need to compare adjacent elements.
If neither of those holds, I would try heuristic sorting or hashing. That is, I would create a hash that doesn't necessarily uniquely identify equal elements but can split your dataset into groups between which there is definitely no pair of equal elements. Then all equal pairs will be inside the groups, and you can read the groups one by one and run the expensive comparison on groups of, say, 100 elements rather than 10,000,000. The other sub-approach is heuristic sorting, with the same purpose: to guarantee that equal elements don't end up at opposite ends of the dataset. After that you can read elements one by one and compare each with, for example, the 1,000 previous elements (already read and kept in memory). I would keep, say, 1,100 elements in memory and free the oldest 100 every time a new 100 arrives; this optimizes your DB reads. Another implementation of this is possible if your rules contain clauses like (Attribute1 = Value1) AND (...) or (Attribute1 < Value2) AND (...): you can cluster by those criteria first and then compare the items within the resulting clusters.
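A sketch of the bucketing idea in C#; the ruleHash and ruleEquals delegates are hypothetical stand-ins for whatever the rule engine produces, and the hash must never split a pair the rules consider equal:

using System;
using System.Collections.Generic;
using System.Linq;

// Group entities by a coarse hash so the expensive rule comparison
// runs only inside each (hopefully small) bucket.
static IEnumerable<(T A, T B)> FindMatches<T>(
    IEnumerable<T> entities,
    Func<T, int> ruleHash,        // must agree for any pair the rules deem equal
    Func<T, T, bool> ruleEquals)  // the expensive rule-engine comparison
{
    foreach (var bucket in entities.GroupBy(ruleHash))
    {
        var items = bucket.ToList();
        for (int i = 0; i < items.Count; i++)
            for (int j = i + 1; j < items.Count; j++)
                if (ruleEquals(items[i], items[j]))
                    yield return (items[i], items[j]);
    }
}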
By the way, what if your rules consider all 10,000,000 elements equal? Would you like to get 10^14 result pairs? This case shows that you can't solve the task in the general case. Try making some limitations and assumptions.
I would try to think about a rule hierarchy. Let's say, for example, that rule A is "Color" and rule B is "Shape". If you first divide the objects by color, then there is no need to compare a red circle with a blue triangle. This will reduce the number of comparisons you need to do.
I would create a hash code from each entity. You probably have to exclude the ID from the hash generation and then test for equality. Once you have the hashes, you can order all the hash codes alphabetically. Having all entities in order means that it's pretty easy to check for duplicates.
If you want to compare each entity with all the others, then effectively you need to cluster the data; there is little reason to compare totally unrelated things (comparing Clothes with Human does not make sense). I think your rules will effectively cluster the data.
So you need to cluster the data; try some clustering algorithms like K-Means.
Also see Apache Mahout.
Are you looking for the most suitable sorting algorithm, of a sort, for this? I think divide and conquer seems good.
If the algorithm fits, you have plenty of other ways to do the calculation; in particular, parallel processing using MPICH or something similar may get you to the finish line.
But before deciding how to execute, you have to determine whether the algorithm fits first.

Predicting new (unknown) sequence values using aforge GA

I've been messing around with the AForge time series genetic algorithm sample and I've got my own version working; at the moment it's just 'predicting' Fibonacci numbers.
The problem is when I ask it to predict new values beyond the array I've given it (which contains the first 21 numbers of the sequence, using a window size of 5) it won't do it, it throws an exception that says "Data size should be enough for window and prediction".
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right? Is there an easier way? Am I overlooking something massively obvious?
I'd ask on the AForge forum, but the developer no longer supports it.
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right?
What you call a "bizarre formula" is called a model in data analysis. You learn such a model from past data, and you can feed it new data to get a predicted outcome. Whether that new outcome makes sense or is just garbage depends on how general your model is. Many techniques can learn models that explain the observed data very well but are not generalizable, and these will return useless results when you feed new data into them. You need to find a model that explains both the given data and potentially unobserved data, which is a non-trivial process. Usually people estimate the generalization error of a model by splitting the known data into two partitions: one on which the model is learned and another on which the learned model is tested. You then want to select the model that is accurate on both partitions. You can also check out the answer I gave on another question here, which also treats the topic of machine learning: https://stackoverflow.com/a/3764893/189767
I don't think you're "overlooking something massively obvious", but rather you're faced with a problem that is not trivial to solve.
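In case it helps, here is a sketch of that hold-out split (the 80/20 ratio and the fixed seed are arbitrary choices):

using System;
using System.Collections.Generic;
using System.Linq;

// Split known data into a training partition (to learn the model)
// and a test partition (to estimate its generalization error).
static (List<T> Train, List<T> Test) Split<T>(IList<T> data, double trainFraction = 0.8)
{
    var rng = new Random(42);  // fixed seed so the split is reproducible
    var shuffled = data.OrderBy(_ => rng.Next()).ToList();
    int cut = (int)(shuffled.Count * trainFraction);
    return (shuffled.Take(cut).ToList(), shuffled.Skip(cut).ToList());
}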
Btw, you can also use genetic programming (GP) in HeuristicLab. The model of GP is a mathematical formula, and in HeuristicLab you can export that model to e.g. MATLAB.
Regarding Fibonacci: the closed formula for Fibonacci numbers is F(n) = (phi^n - psi^n) / sqrt(5), where phi = (1 + sqrt(5))/2 and psi = (1 - sqrt(5))/2 (the golden ratio and its conjugate, per Wikipedia). If you want to find that with GP you need one variable (n), three constants, and the power function. However, it's very likely that you'll find a vastly different formula that is similar in output. The problem in machine learning is that very different models can produce the same output. The recursive form requires that you include the values of the past two n in the data set; this is similar to learning a model for a time series regression problem.
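A quick check of that closed form in C# (exact for moderate n, up to floating-point rounding):

using System;

static long Fibonacci(int n)
{
    double sqrt5 = Math.Sqrt(5.0);
    double phi = (1 + sqrt5) / 2;  // golden ratio
    double psi = (1 - sqrt5) / 2;  // its conjugate
    return (long)Math.Round((Math.Pow(phi, n) - Math.Pow(psi, n)) / sqrt5);
}

// Fibonacci(10) == 55, Fibonacci(22) == 17711: values beyond the question's
// 21-number training window follow directly from the formula, no GA required.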

Methodologies or algorithms for filling in missing data

I am dealing with datasets with missing data and need to be able to fill forward, backward, and gaps. So, for example, if I have data from Jan 1, 2000 to Dec 31, 2010, and some days are missing, when a user requests a timespan that begins before, ends after, or encompasses the missing data points, I need to "fill in" these missing values.
Is there a proper term for this concept of filling in data? Imputation is one term; I don't know if it is "the" term for it though.
I presume there are multiple algorithms and methodologies for filling in missing data (use the last measured value, use the median/average/moving average between two known numbers, etc.).
Does anyone know the proper term for this problem, any online resources on this topic, or ideally links to open source implementations of some algorithms (C# preferably, but any language would be useful)?
The term you're looking for is interpolation. (obligatory wiki link)
You're asking for a C# solution with datasets but you should also consider doing this at the database level like this.
A simple, brute-force approach in C# could be to build an array of consecutive dates with your beginning and ending values as the min/max. Then use that array to merge "interpolated" date values into your data set by inserting rows where the data set has no date matching your date array.
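A sketch of that brute-force fill, assuming the data lives in an IDictionary<DateTime, double> and forward-filling each gap with the last measured value:

using System;
using System.Collections.Generic;

static SortedDictionary<DateTime, double> FillForward(
    IDictionary<DateTime, double> known, DateTime start, DateTime end)
{
    var filled = new SortedDictionary<DateTime, double>();
    double? last = null;
    for (var day = start; day <= end; day = day.AddDays(1))
    {
        if (known.TryGetValue(day, out var value))
            last = value;              // a real measurement: remember it
        if (last.HasValue)
            filled[day] = last.Value;  // fill the gap with the last known value
        // Days before the first measurement stay missing; a backward
        // fill (or interpolation) pass would handle those.
    }
    return filled;
}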
Here is an SO post that gets close to what you need: interpolating missing dates with C#. There is no accepted solution, but reading the question and the attempted answers may give you an idea of what to do next. E.g. use the DateTime data in terms of Ticks (a long value type), apply an interpolation scheme to that data, then convert the interpolated long values back to DateTime values.
The algorithm you use will depend a lot on the data itself, the size of the gaps compared to the available data, and its predictability based on existing data. It could also incorporate other information you might know about what's missing, as is common in statistics, when your actual data may not reflect the same distribution as the universe across certain categories.
Linear and cubic interpolation are typical algorithms that are not difficult to implement; try googling those.
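For reference, linear interpolation between two known samples, using the Ticks idea mentioned above, is a one-liner (a minimal sketch):

using System;

// Linearly interpolate the value at time t between samples (t0, v0) and (t1, v1).
static double Lerp(DateTime t0, double v0, DateTime t1, double v1, DateTime t)
{
    double fraction = (double)(t.Ticks - t0.Ticks) / (t1.Ticks - t0.Ticks);
    return v0 + fraction * (v1 - v0);
}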
Here's a good primer with some code:
http://paulbourke.net/miscellaneous/interpolation/
The context of the discussion in that link is graphics but the concepts are universally applicable.
For the purpose of feeding statistical tests, a good search term is imputation - e.g. http://en.wikipedia.org/wiki/Imputation_%28statistics%29

Efficient Datastructure for tags?

Imagine you wanted to serialize and deserialize stackoverflow posts including their tags as space efficiently as possible (in binary), but also for performance when doing tag lookups. Is there a good datastructure for that kind of scenario?
Stack Overflow has about 28,532 different tags. You could create a table with all tags and assign each an integer; furthermore, you could sort them by frequency so that the most common tags have the lowest numbers. Still, storing them simply as a string in the format "1 32 45" seems a bit inefficient, both from a searching and from a storing perspective.
Another idea would be to save the tags as a variable-length bit array, which is attractive from a lookup and serialization perspective. Since the most common tags come first, you could potentially fit the tags into a small amount of memory.
The problem, of course, is that uncommon tags would yield huge bit arrays. Is there any standard for "compressing" bit arrays with large spans of 0s? Or should one use some other structure entirely?
EDIT
I'm not looking for a DB solution or a solution where I need to keep entire tables in memory, but a structure for filtering individual items
Not to undermine your question but 28k records is really not all that many. Are you perhaps optimizing prematurely?
I would first stick to using 'regular' indices on a DB table. The hashing heuristics they use are typically very efficient and not trivial to beat (and if you can beat them, is it really worth the time, and are the gains large enough?).
Also, depending on where you actually run the tag query, will the user really notice the 200 ms gain you optimized for?
First measure then optimize :-)
EDIT
Without a DB I would probably have a master table holding all tags together with an ID (if possible hold it in memory). Keep a regular sorted list of IDs together with each post.
Not sure how much storage based on commonality would help. A sorted list in which you can do a regular binary search may prove fast enough; measure :-)
Here you would need to iterate all posts for every tag query though.
If this ends up being too slow, you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large, though, and may require a file to seek and read against.
For a smaller table you could resort to building one based on a hashed value (with duplicates). This way you could use it to quickly get down to a smaller candidate list of posts that need further checking to see whether they match.
You need a second table with two fields: tag_id and question_id.
That's it. Then you create indexes on (tag_id, question_id) and (question_id, tag_id); those are covering indexes, so all your queries would be very fast.
I have a feeling you abstracted your question too much; you didn't say very much about how you want to access the data structure, which is very important.
That being said, I suggest counting the number of occurrences of each tag and then using Huffman coding to come up with the shortest encoding for the tags. This is not entirely perfect, but I'd stick with it until you've demonstrated that it's inappropriate. You can then associate the codes with each question.
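As a lighter-weight alternative that speaks to the "large spans of 0s" part of the question, you could store each post's sorted tag IDs as gaps encoded with 7-bit varints (the same scheme as .NET's BinaryWriter.Write7BitEncodedInt): small IDs for common tags and small gaps both become single bytes. A sketch:

using System.Collections.Generic;
using System.IO;

// Encode a sorted list of tag IDs as deltas, each written as a 7-bit varint.
static byte[] EncodeTags(IEnumerable<int> sortedTagIds)
{
    using var ms = new MemoryStream();
    int previous = 0;
    foreach (int id in sortedTagIds)
    {
        int delta = id - previous;
        previous = id;
        while (delta >= 0x80) { ms.WriteByte((byte)(delta | 0x80)); delta >>= 7; }
        ms.WriteByte((byte)delta);
    }
    return ms.ToArray();
}

static List<int> DecodeTags(byte[] data)
{
    var ids = new List<int>();
    int previous = 0, pos = 0;
    while (pos < data.Length)
    {
        int delta = 0, shift = 0;
        byte b;
        do { b = data[pos++]; delta |= (b & 0x7F) << shift; shift += 7; }
        while ((b & 0x80) != 0);
        previous += delta;  // undo the delta encoding
        ids.Add(previous);
    }
    return ids;
}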
If you want to efficiently look up the questions within a specific tag, you will need some kind of index. Maybe all Tag objects could hold an array of references (references, pointers, numeric IDs, etc.) to all the questions that are tagged with that particular tag. This way you simply need to find the tag object and you have an array pointing to all the questions of that tag.
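A minimal sketch of that kind of inverted index (the class shape and int question IDs are assumptions):

using System.Collections.Generic;

// Inverted index: tag -> IDs of every question carrying that tag.
class TagIndex
{
    private readonly Dictionary<string, List<int>> _index = new Dictionary<string, List<int>>();

    public void Add(int questionId, IEnumerable<string> tags)
    {
        foreach (var tag in tags)
        {
            if (!_index.TryGetValue(tag, out var posts))
                _index[tag] = posts = new List<int>();
            posts.Add(questionId);
        }
    }

    // One dictionary probe per lookup; unknown tags return an empty list.
    public IReadOnlyList<int> Lookup(string tag) =>
        _index.TryGetValue(tag, out var posts) ? posts : new List<int>();
}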
