Approximate string matching

Approximate string matching - c#

I know this question have been asked a lot of time.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the company end name part and short named part
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurance)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max

There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance (normalize("foo corp."),
normalize("FOO CORPORATION") )
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).

Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.

In these simple examples, just removing all non-alpha-numeric characters gives you a match, and is the easiest to do as you can pre-compute the data on each side, then do a straight equals match which will be a lot faster than cross multiplying and calculating the edit distance.

I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on really large scale system with similar name matching requirements that you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about the Approximate String matching algorithms, then there are many. Few of them are: Jaro-Winkler, Edit distance(Levenshtein), Jaccard similarity, Soundex/Phonetics based algorithms etc. A simple googling would give us all the details.
You can implement all of them in C#
Irony is, they work while you try to match two given input strings. Alright theoretically and to demonstrate the way fuzzy or approximate string matching works.
However, grossly understated point is, how do we use the same in production settings. Not everybody that I know of who were scouting for an approximate string matching algorithm knew how they could solve the same in the production environment.
I might have just talked about Lucene which is specific to Java but there is Lucene for .Net also.
https://lucenenet.apache.org/

Related

Facebook Sentiment Analysis API

i want to try and create an application which rates the user's facebook posts based on the content (Sentiment Analysis).
I tried creating an algorithm myself initially but i felt it wasn't that reliable.
Created a dictionary list of words and scanned the posts against the dictionary and rate if it was positive or negative.
However, i feel this is minimal. I would like to rate the mood or feelings/personality traits of the person based on the posts. Is this possible to be done?
Would hope to make use of some online APIs, please assist. Thanks ;)

As #Jared pointed out, using a dictionary-based approach can work quite well in some situations, depending on the quality of your training corpus. This is actually how CLIPS pattern and TextBlob's implementations work.
Here's an example using TextBlob:
from text.blob import TextBlob
b = TextBlob("StackOverflow is very useful")
b.sentiment # returns (polarity, subjectivity)
# (0.39, 0.0)
By default, TextBlob uses pattern's dictionary-based algorithm. However, you can easily swap out algorithms. You can, for example, use a Naive Bayes classifier trained on a movie reviews corpus.
from text.blob import TextBlob
from text.sentiments import NaiveBayesAnalyzer
b = TextBlob("Today is a good day", analyzer=NaiveBayesAnalyzer())
b.sentiment # returns (label, prob_pos, prob_neg)
# ('pos', 0.7265237431528468, 0.2734762568471531)

The algorithm you describe should actually work well, but the quality of the result depends greatly on the word list used. For Sentimental, we take comments on Facebook posts and score them based on sentiment. Using the AFINN 111 word list to score the comments word by word, this approach is (perhaps surprisingly) effective. By normalizing and stemming the words first, you should be able to do even better.

There are lots of sentiment analysis APIs that you can easily incorporate into your app, also many have a free usage allowance (usually, 500 requests a day). I started a small project that compares how each API (currently supporting 10 different APIs: AIApplied, Alchemy, Bitext, Chatterbox, Datumbox, Lymbix, Repustate, Semantria, Skyttle, and Viralheat) classifies a given set of texts into positive, negative or neutral: https://github.com/skyttle/sentiment-evaluation
Each specific API can offer lots of other features, like classifying emotions (delight, anger, sadness, etc) or linking sentiment to entities the sentiment is attributed to. You just need to go through available features and pick the one that suits your needs.
TextBlob is another possiblity, though it will only classify texts into pos/neg/neu.

If you are looking for an Open Source implementation of sentiment analysis engine based on Naive Bayes classifier in C#, take a peek at https://github.com/amrishdeep/Dragon. It works best on large corpus of words like blog posts or multi-paragraph product reviews. However, I am not sure if it would work for facebook posts that have a handful of words

Predicting new (unknown) sequence values using aforge GA

I've been messing around with the aforge time series genetic algorithm sample and I've got my own version working, atm it's just 'predicting' Fibonacci numbers.
The problem is when I ask it to predict new values beyond the array I've given it (which contains the first 21 numbers of the sequence, using a window size of 5) it won't do it, it throws an exception that says "Data size should be enough for window and prediction".
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right? Is there an easier way? Am I overlooking something massively obvious?
I'd ask on the aforge forum but the developer is not supporting it anymore.

As far as I can tell I'm supposed to decipher the bizarre formula
contained in "population.BestChromosome" and use that to extrapolate
future values, is that right?
What you call a "bizarre formula" is called a model in data analysis. You learn such a model from past data and you can feed it new data to get a predicted outcome. Whether that new outcome makes sense or is just garbage depends on how general your model is. Many techniques can learn very good models that explain the observed data very well, but which are not generalizable and will return unuseful results when you feed new data into the model. You need to find a model that both explains the given data as well as potentially unobserved data which is a non-trivial process. Usually people estimate the generalization error of that model by splitting the known data into two partitions: one with which the model is learned and another one on which the learned models are tested. You then want to select that model which is accurate on both data. You can also check out the answer I gave on another question here which also treats the topic of machine learning: https://stackoverflow.com/a/3764893/189767
I don't think you're "overlooking something massively obvious", but rather you're faced with a problem that is not trivial to solve.
Btw, you can also use genetic programming (GP) in HeuristicLab. The model of GP is a mathematical formula and in HeuristicLab you can export that model to e.g. MatLab.
Ad Fibonacci, the closed formula for Fibonacci numbers is F(n) = (phi^n - psi^n) / sqrt(5) where phi and psi are special magic numbers according to wikipedia. If you want to find that with GP you need one variable (n), three constants, and the power function. However, it's very likely that you find a vastly different formula that is similar in output. The problem in machine learning is that very different models can produce the same output. The recursive form requires that you include the values of the past two n into the data set. This is similar to learning a model for a time series regression problem.

LibSVM turns all my training vectors into support vectors, why?

I am trying to use SVM for News article classification.
I created a table that contains the features (unique words found in the documents) as rows.
I created weight vectors mapping with these features. i.e if the article has a word that is part of the feature vector table that location is marked as 1 or else 0.
Ex:- Training sample generated...
1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1
10:1 11:1 12:1 13:1 14:1 15:1 16:1
17:1 18:1 19:1 20:1 21:1 22:1 23:1
24:1 25:1 26:1 27:1 28:1 29:1 30:1
As this is the first document all the features are present.
I am using 1, 0 as class labels.
I am using svm.Net for classification.
I gave 300 weight vectors manually classified as training data and the model generated is taking all the vectors as support vectors, which is surely overfitting.
My total features (unique words/row count in feature vector DB table) is 7610.
What could be the reason?
Because of this over fitting my project is now in pretty bad shape. It is classifying every article available as a positive article.
In LibSVM binary classification is there any restriction on the class label?
I am using 0, 1 instead of -1 and +1. Is that a problem?

You need to do some type of parameter search, also if the classes are unbalanced the classifier might get artificially high accuracies without doing much. This guide is good at teaching basic, practical things, you should probably read it

As pointed out, a parameter search is probably a good idea before doing anything else.
I would also investigate the different kernels available to you. The fact that you input data is binary might be problematic for the RBF kernel (or might render it's usage sub-optimal, compared to another kernel). I have no idea which kernel could be better suited, though. Try a linear kernel, and look around for more suggestions/idea :)
For more information and perhaps better answers, look on stats.stackexchange.com.

I would definitely try using -1 and +1 for your labels, that's the standard way to do it.
Also, how much data do you have? Since you're working in 7610-dimensional space, you could potentially have that many support vectors, where a different vector is "supporting" the hyperplane in each dimension.
With that many features, you might want to try some type of feature selection method like principle component analysis.

Methodologies or algorithms for filling in missing data

I am dealing with datasets with missing data and need to be able to fill forward, backward, and gaps. So, for example, if I have data from Jan 1, 2000 to Dec 31, 2010, and some days are missing, when a user requests a timespan that begins before, ends after, or encompasses the missing data points, I need to "fill in" these missing values.
Is there a proper term to refer to this concept of filling in data? Imputation is one term, don't know if it is "the" term for it though.
I presume there are multiple algorithms & methodologies for filling in missing data (use last measured, using median/average/moving average, etc between 2 known numbers, etc.
Anyone know the proper term for this problem, any online resources on this topic, or ideally links to open source implementations of some algorithms (C# preferably, but any language would be useful)

The term you're looking for is interpolation. (obligatory wiki link)
You're asking for a C# solution with datasets but you should also consider doing this at the database level like this.
An simple, brute-force approach in C# could be to build an array of consecutive dates with your beginning and ending values as the min/max values. Then use that array to merge "interpolated" date values into your data set by inserting rows where there is no matching date for your date array in the dataset.
Here is an SO post that gets close to what you need: interpolating missing dates with C#. There is no accepted solution but reading the question and attempts at answers may give you an idea of what you need to do next. E.g. Use the DateTime data in terms of Ticks (long value type) and then use an interpolation scheme on that data. The convert the interpolated long values to DateTime values.

The algorithm you use will depend a lot on the data itself, the size of the gaps compared to the available data, and its predictability based on existing data. It could also incorporate other information you might know about what's missing, as is common in statistics, when your actual data may not reflect the same distribution as the universe across certain categories.
Linear and cubic interpolation are typical algortihms that are not difficult to implement, try googling those.
Here's a good primer with some code:
http://paulbourke.net/miscellaneous/interpolation/
The context of the discussion in that link is graphics but the concepts are universally applicable.

For the purpose of feeding statistical tests, a good search term is imputation - e.g. http://en.wikipedia.org/wiki/Imputation_%28statistics%29

Algorithm for text classification

I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible, that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already been categorized. What algorithm should I use to do the job. I don't need to do it fast. I need to be sure that the algorithm categorizes correctly (as far as possible).
What algorithm should I use? Is there an implementation of in in C#?
Thank you for your help!

Take a look at term frequency and inverse document frequency also cosine similarity to find important words to create categories and assign documents to categories based on similarity
EDIT:
Found an example here

Interesting articles :
A self-organizing semantic map for information retrieval
WEBSOM - self-organizing maps of document collections

The major issue IMHO here is the length of the documents. I think I would call it phrase classification and there is work going on on this because of the twitter thing. You could bring in additional text performing a web search on the 30 words and then analyzing the top matches. There is a paper about this but I can't find it right now. Then I would try a feature vector approach (tdf-idf as in Jimmy's answer) and a multiclass SVM for classification.

Perhaps a decision tree combined with a NN?

You can use SVM Algorithm for Classify text in C# with libsvm.net library.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.