High Dimensional Data Clustering

What are the best clustering algorithms for data with more than 100 dimensions (sometimes even 1000)? I would appreciate pointers to any implementation in C, C++ or especially C#.

It depends heavily on your data. See curse of dimensionality for common problems. Recent research (Houle et al.) showed that you can't really go by the numbers. There may be thousands of dimensions and the data clusters well, and of course there is even one-dimensional data that just doesn't cluster. It's mostly a matter of signal-to-noise.
This is why for example clustering of TF-IDF vectors works rather well, in particular with cosine distance.
But the key point is that you first need to understand the nature of your data. You then can pick appropriate distance functions, weights, parameters and ... algorithms.
In particular, you also need to know what constitutes a cluster for you. There are many definitions, in particular for high-dimensional data. They may be in subspaces, they may or may not be arbitrarily rotated, they may overlap or not (k-means for example, doesn't allow overlaps or subspaces).
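To make the cosine-distance remark concrete, here is a minimal sketch in Python (standard library only). It assumes sparse vectors stored as term-to-weight dictionaries; the function name and the example weights are made up for illustration and are not taken from any particular library.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse vectors.

    u and v are dicts mapping a term to its weight (for example TF-IDF).
    """
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here: an empty vector matches nothing
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for three short documents.
doc_a = {"cluster": 0.8, "dimension": 0.6}
doc_b = {"cluster": 0.4, "dimension": 0.3}   # same direction as doc_a
doc_c = {"redis": 1.0}                       # no shared terms
print(cosine_distance(doc_a, doc_b), cosine_distance(doc_a, doc_c))
```

Because only the direction of the vectors matters, document length drops out, which is one reason this distance tends to suit TF-IDF data.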

Well, I know of something called vector quantization; it's a nice algorithm for clustering data with many dimensions.

I've used k-means on data with hundreds of dimensions. It is very common, so I'm sure there's an implementation in any language; worst case, it is very easy to implement yourself.
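As a rough illustration of how little code plain k-means needs, here is a sketch of Lloyd's algorithm in Python using only the standard library. The function name, the random initialization and the iteration cap are choices made for this example; a real implementation would likely use a smarter initialization and vectorized arithmetic.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with squared Euclidean distance.

    points: list of equal-length numeric tuples. Returns (labels, centers).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k distinct start points

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))

    labels = None
    for _ in range(iterations):
        new_labels = [nearest(p) for p in points]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                   # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels, centers)
```

The same code works unchanged for 100 or 1000 dimensions; whether the result is meaningful depends on the data, as discussed above.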

It might also be worth trying a dimensionality reduction technique such as Principal Component Analysis or an auto-associative neural net before you try to cluster the data. It can turn a huge problem into a much smaller one.
After that, go k-means or mixture of gaussians.

The EM-tree and K-tree algorithms in the LMW-tree project can cluster high dimensional problems like this. It is implemented in C++ and supports many different representations.
We have novel algorithms clustering binary vectors created by LSH / Random Projections, or anything else that emits binary vectors that can be compared via Hamming distance for similarity.
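For readers unfamiliar with binary vectors produced by random projections, the following Python sketch shows the general idea only; it is not code from the LMW-tree project. Each bit records on which side of a random hyperplane the input falls, and signatures are then compared by Hamming distance. All names and the parameter values are assumptions made for this example.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplane normals of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def signature(vec, planes):
    """Pack one sign bit per hyperplane into a Python int."""
    sig = 0
    for plane in planes:
        dot = sum(x * p for x, p in zip(vec, plane))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(dim=100, bits=64)
v = [float(i + 1) for i in range(100)]
w = [x + 0.5 for x in v]                  # a slightly perturbed copy of v
print(hamming(signature(v, planes), signature(w, planes)))
```

Vectors separated by a small angle disagree on few bits, so the Hamming distance between signatures serves as a cheap proxy for angular distance.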

Related

.Net sorted sets - efficient range dissection operations

In an attempt to not reinvent the wheel, I've been looking for a fast, efficient data structure that I can use in my code which is somewhat analogous to Redis' sorted sets.
Adding ranges of items is easy enough, but I've discovered that there doesn't seem to be anything that allows me to drop ranges of entries based on an upper and lower bound. For example, if my set is keyed on a double, I'd like to be able to drop all values between 0.3 and 0.7. I was quite surprised that I couldn't find any easy way to do this that felt like the correct approach. Also, performance is important for my use case, both in terms of speed and memory usage.
Examples and implementations from NuGet, CodeProject, GitHub, etc. are all acceptable.
p.s. For what it's worth, I'm wanting to maintain a cache of items indexed across 4 dimensions, and need to be able to discard items across different dimensional ranges when usage strays too far from a given range. Feel free to suggest something to this effect as well.

Ensemble learning, multiple classifier system

I am trying to use an MCS (multiple classifier system) to get better, i.e. more accurate, results from limited data.
I am using k-means clustering at the moment but may switch to FCM (fuzzy c-means). The data is clustered into groups (clusters); it could represent anything, colours for example. After pre-processing and normalization I cluster the data and get some distinct clusters with a lot of points in between. I then use the clusters as the data for Bayes classifiers: each cluster represents a distinct colour, and the data from each cluster is put through a separate Bayes classifier, so each Bayes classifier is trained on only one colour. Take, for example, a colour spectrum where 3 - 10 is blue and 13 - 20 is red, and where 0 - 3 is white up to 1.5 and then gradually turns blue through 1.5 - 3, and likewise for the transition from blue to red.
What I would like to know is what kind of aggregation method (if that is what you would use) could be applied so that the Bayes classifier becomes stronger, and how it works. Does the aggregation method already know the answer, or would a human correct the outputs, with those answers then going back into the Bayes training data? Or a combination of both? Bootstrap aggregating (bagging) has each model in the ensemble vote with equal weight, so I am not sure whether I would use bagging as my aggregation method in this instance. Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified; I am not sure whether this would be a better alternative to bagging, as I am unsure how it incrementally builds upon new instances. The last option is Bayesian model averaging, an ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law; however, I am completely unsure how you would sample hypotheses from the search space.
I know that usually you would use a competitive approach to bounce between the two classification algorithms: one says yes, one says maybe, a weighting is applied, and if it is correct you get the best of both classifiers. For now, however, I don't want a competitive approach.
Another question: would using these two methods together in this way be beneficial? I know the example I provided is very primitive and it may not apply there, but could it be beneficial for more complex data?

I have some issues about the method you are following:
K-means puts in each cluster the points that are nearest to its centre, and then you train a classifier on the output data. I think the classifier may outperform the classification implicit in the clustering, but only by taking into account the number of samples in each cluster. For example, if your training data after clustering is typeA (60%), typeB (20%), typeC (20%), your classifier will prefer to assign ambiguous samples to typeA in order to reduce the classification error.
K-means depends on which "coordinates"/"features" you take from the objects. If you use features in which objects of different types are mixed, K-means performance will decrease. Deleting these kinds of features from the feature vector may improve your results.
Your "feature"/"coordinates" that represent the objects that you want to classify may be measured in different units. This fact can affect your clustering algorithm since you are implicitly setting a unit conversion between them through the clustering error function. The final set of clusters is selected with multiple clustering trials (that were obtained upon different cluster initializations), using an error function. Thus, an implicit comparison is made upon the different coordinates of your feature vector (potentially introducing the implicit conversion factor).
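One common way to reduce the unit problem described above is to standardize every feature to zero mean and unit variance before clustering. The Python sketch below shows that idea under the assumption that the data is a list of equal-length numeric rows; the function name is made up for this example, and standardization is only one of several possible scalings.

```python
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead to avoid an error.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# First feature in metres, second in millimetres: same information, different scale.
raw = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
print(standardize(raw))
```

After scaling, both columns contribute equally to a Euclidean distance, so the choice of unit no longer decides which feature dominates the clustering error.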
Taking these three points into account, you will probably increase the overall performance of your algorithm by adding preprocessing stages. For example, in object recognition for computer vision applications, most of the information taken from the images comes only from borders in the image. All the color information and part of the texture information are not used. The borders are extracted by processing the image to obtain Histogram of Oriented Gradients (HOG) descriptors. This descriptor gives back "features"/"coordinates" that separate the objects better, thus increasing classification (object recognition) performance. In theory, descriptors throw away information contained in the image. However, they have two main advantages: (a) the classifier will deal with lower-dimensional data and (b) descriptors calculated from test data can be more easily matched with training data.
In your case, I suggest that you try to improve your accuracy taking a similar approach:
Give richer features to your clustering algorithm
Take advantage of prior knowledge in the field to decide what features you should add and delete from your feature vector
Always consider the possibility of obtaining labeled data, so that supervised learning algorithms can be applied
I hope this helps...

Computing, storing, and retrieving values to and from an N-Dimensional matrix

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now, are:
1. What format(s) might be acceptable for storing the data?
2. What programs (MATLAB, C#, etc.) might be best suited to compute the data?
3. C# must be used to import the data on the device - is this possible given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?

I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.
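To show why fixed-width records allow lookups without reading the whole file, here is a small Python sketch (the same seek-based approach should carry over to C#, but that port is not shown). It assumes one 8-byte little-endian double per entry written in row-major order; the record format, function names and file name are choices made for this example.

```python
import itertools
import struct

RECORD = struct.Struct("<d")   # assumed layout: one little-endian float64 per entry

def write_table(path, func, dims):
    """Evaluate func at every index combination and write fixed-width records.

    itertools.product varies the last index fastest, i.e. row-major order.
    """
    with open(path, "wb") as fh:
        for idx in itertools.product(*(range(n) for n in dims)):
            fh.write(RECORD.pack(func(*idx)))

def read_entry(path, dims, idx):
    """Read a single entry by seeking directly to its byte offset."""
    flat = 0
    for n, i in zip(dims, idx):
        if not 0 <= i < n:
            raise IndexError(idx)
        flat = flat * n + i           # row-major flattening of the N-D index
    with open(path, "rb") as fh:
        fh.seek(flat * RECORD.size)
        return RECORD.unpack(fh.read(RECORD.size))[0]

dims = (20, 20, 20, 20, 20)           # 5 variables, 20 sampled values each
write_table("lookup.bin", lambda a, b, c, d, e: a + b * c - d / (e + 1), dims)
print(read_entry("lookup.bin", dims, (3, 4, 5, 6, 7)))
```

With these dimensions the file holds 3.2 million entries (about 25.6 MB), yet each lookup touches only 8 bytes.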

I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.

.NET Neural Network or AI for Future Predictions

I am looking for some kind of intelligent (I was thinking AI or Neural network) library that I can feed a list of historical data and this will predict the next sequence of outputs.
As an example I would like to feed the library the following figures 1,2,3,4,5
and based on this, it should predict the next sequence is 6,7,8,9,10 etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is use historical sales data to predict what amount a specific client is most likely to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I merely need to base it on the sales history and then plot a graph showing past sales and predicted sales.

If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Run multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power and if you combine the right ones, then you might really get something useful.

If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
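If you are willing to assume the function is linear, the fit and the extrapolation can be sketched in a few lines of Python with ordinary least squares. The function name is made up for this example, and the linearity assumption is exactly the kind of prior knowledge discussed above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line([0, 1, 2, 3, 4], [1, 2, 3, 4, 5])
print(slope * 5 + intercept)   # extrapolated value at x = 5
```

For noisy data such as sales figures the fitted line will not pass through every point, and the further you extrapolate, the less the prediction can be trusted.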

I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not feed it additional information.
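To illustrate the claim, the Python sketch below uses Lagrange interpolation to evaluate the unique polynomial of degree at most 5 that reproduces 1, 2, 3, 4, 5 and then continues with whatever sixth value you choose. Lagrange interpolation is just one convenient construction picked for this example; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate at x the interpolating polynomial through the given (xi, yi) points."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

history = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
for wanted in (6, 42, -7):
    curve = history + [(5, wanted)]
    # Each curve matches the history exactly but disagrees about the next value.
    print([int(lagrange_eval(curve, x)) for x in range(6)])
```

All three curves fit the observed data perfectly, so the data alone cannot tell them apart.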
Edit:
In the case that you suspect your data to have a known functional dependence, and there are uncontrollable outside factors, maybe regression analysis will have good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)

Text difference algorithm

I need an algorithm that can compare two text files and highlight their differences and (even better!) compute their difference in a meaningful way (two similar files should have a higher similarity score than two dissimilar files, with the word "similar" defined in the normal sense). It sounds easy to implement, but it's not.
The implementation can be in C# or Python.
Thanks.

I can recommend to take a look at Neil Fraser's code and articles:
google-diff-match-patch
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes

In Python, there is difflib, as others have also suggested.
difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
    import difflib

    def text_compare(text1, text2, isjunk=None):
        return difflib.SequenceMatcher(isjunk, text1, text2).ratio()

Look at difflib. (Python)
That will calculate the diffs in various formats. You could then use the size of the context diff as a measure of how different two documents are?
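A rough sketch of the diff-size idea in Python follows. It uses difflib's unified diff with zero context lines rather than a context diff, because the added and removed lines are easier to count in that format; the function name is made up for this example.

```python
import difflib
import itertools

def changed_line_count(text1, text2):
    """Count the lines added or removed between two texts."""
    diff = difflib.unified_diff(
        text1.splitlines(), text2.splitlines(), lineterm="", n=0)
    # The first two lines are the '---' and '+++' file headers; skip them.
    body = itertools.islice(diff, 2, None)
    return sum(1 for line in body if line.startswith(("+", "-")))

old = "alpha\nbeta\ngamma"
new = "alpha\nBETA\ngamma\ndelta"
print(changed_line_count(old, new))
```

Dividing the count by the total number of lines in both texts gives a normalized score, though a single changed character still counts as a whole changed line.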

My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers' "middle-snake" method with the Hirschberg linear space refinement.
The Myers algorithm is described in:
E. Myers, "An O(ND) Difference Algorithm and Its Variations," Algorithmica 1, 2 (1986), 251-266.
The GNU diff utility uses the Myers algorithm.
The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.
Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and vice versa.
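As a simplified illustration, the Python sketch below implements only the basic greedy forward pass of the O(ND) algorithm. It returns the insert/delete edit distance but not the edit script, and it omits the middle-snake and linear-space refinements mentioned above. The function names are made up for this example.

```python
def myers_distance(a, b):
    """Length of the shortest insert/delete edit script between sequences a and b."""
    n, m = len(a), len(b)
    # v[k] holds the furthest x reached so far on diagonal k (where k = x - y).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # step down: insert an element of b
            else:
                x = v[k - 1] + 1      # step right: delete an element of a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1                # follow the diagonal ("snake") for free
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

def similarity(a, b):
    """Map the distance to a 0..1 score (1.0 means identical)."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 1.0 - myers_distance(a, b) / total

print(myers_distance("ABCABBA", "CBABAC"))   # 5
```

The running time grows with the number of differences D rather than with the product of the lengths, which is why this approach suits files that are mostly alike. Passing lists of lines instead of strings gives a line-level distance.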
For a good Myers / Hirschberg implementation look at:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.
Mike

Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page) which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the bazaar distribution is a simple command line front end.

If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straightforward measure of how similar two texts are.
You can also use it to extract the edit log and get a very fine-grained diff, similar to that on the edit history pages of SO.
Be warned, though, that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib, as Douglas Leder suggested, is most likely going to be faster.
Cf. also this answer.
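For reference, here is a minimal sketch of the Levenshtein distance in Python. It keeps only two rows of the dynamic-programming table, which reduces memory to the length of the shorter string, but the running time is still proportional to the product of the two lengths. The function name is made up for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))   # 3
```

Recovering the actual edit log requires keeping the full table (or backpointers), which is where the memory cost mentioned above comes from.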

There are a number of distance metrics. As paradoja mentioned, there is the Levenshtein distance, but there are also phonetic algorithms such as NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.

As stated, use difflib. Once you have the diffed output, you can compute the Levenshtein distance of the differing strings to give a "value" of how different they are.

You could use the solution to the Longest Common Subsequence (LCS) problem. See also the discussion about possible ways to optimize this solution.

One method I've employed for a different functionality, to calculate how much data was new in a modified file, could perhaps work for you as well.
I have a diff/patch implementation in C# that allows me to take two files, presumably an old and a new version of the same file, and calculate the "difference", but not in the usual sense of the word. Basically I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.
To use this for the functionality initially described, to see how much data was new, I simply ran through the operations: every operation that copied from the old file verbatim had a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) had a 1-factor. Every character was given this factor, which gave me basically a long list of 0's and 1's.
All I then had to do was to tally up the 0's and 1's. In your case, with my implementation, a low number of 1's compared to 0's would mean the files are very similar.
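The 0/1 tally can be approximated in Python with difflib's opcodes, as sketched below. This is a stand-in for illustration and not the C# implementation described here: characters of the new text inside "equal" blocks count as 0, and inserted or replaced characters count as 1. The function name is made up for this example.

```python
import difflib

def new_fraction(old, new):
    """Fraction of characters in `new` that were not copied from `old`."""
    if not new:
        return 0.0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    new_chars = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            new_chars += j2 - j1      # these characters had to be shipped in the patch
    return new_chars / len(new)

print(new_fraction("abcdef", "abcXYdef"))   # 0.25
```

Unlike the implementation described in this answer, difflib aligns the two texts in order, so a block moved or duplicated elsewhere in the file will generally be counted as new text.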
This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicates (ie. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part from the old file.
I experimented with weighing copies, so that the first copy counted as 0, and subsequent copies of the same characters had progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it as the project was scrapped.
If you're interested, my diff/patch code is available from my Subversion repository.

Take a look at the Fuzzy module. It has fast, C-based implementations of Soundex, NYSIIS and Double Metaphone.
A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528
