I am trying to use an MCS (multi-classifier system) to do better work on limited data, i.e. become more accurate.
I am using K-means clustering at the moment, but may switch to FCM (fuzzy c-means). With that, the data is clustered into groups (clusters); the data could represent anything, colours for example. I first cluster the data after pre-processing and normalization and get some distinct clusters with a lot in between. I then use the clusters as the data for Bayes classifiers: each cluster represents a distinct colour, the data from each cluster is put through its own Bayes classifier, and so each Bayes classifier is trained on only one colour. As an example, take a colour spectrum where 3 - 10 is blue and 13 - 20 is red; in the range in between, 0 - 3 is white up to 1.5 and then turns blue gradually through 1.5 - 3, and the same happens between blue and red.
What I would like to know is what kind of aggregation method (if that is the right term) could be applied so that the combined Bayes classifiers become stronger, and how it works. Does the aggregation method already know the answer, or would human interaction correct the outputs, with those answers then fed back into the Bayes training data? Or a combination of both? Looking at bootstrap aggregating (bagging), it involves having each model in the ensemble vote with equal weight, so I am not quite sure bagging is the aggregation method I would use in this particular instance. Boosting, however, involves incrementally building an ensemble by training each new model to emphasize the training instances that previous models misclassified; I am not sure whether this would be a better alternative to bagging, as I am unsure how it incrementally builds on new instances. The last one would be Bayesian model averaging, an ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law; however, I am completely unsure how you would sample hypotheses from the hypothesis space.
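To make the setup concrete, here is roughly what I have in mind as a sketch (Python/scikit-learn with toy data; Gaussian Naive Bayes stands in for the Bayes classifiers, and the plain averaging at the end is just a placeholder for whichever aggregation method turns out to be appropriate):

```python
# Sketch only: cluster the data, train one Naive Bayes model per cluster,
# then aggregate non-competitively by averaging the class posteriors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))              # toy "colour" measurements
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels
classes = np.unique(y)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
models = []
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    models.append(GaussianNB().fit(X[mask], y[mask]))   # one classifier per cluster

def full_proba(model, X_new):
    # A cluster may not contain every class, so pad that model's posterior
    # matrix out to the global class set before averaging.
    p = np.zeros((len(X_new), len(classes)))
    p[:, np.searchsorted(classes, model.classes_)] = model.predict_proba(X_new)
    return p

def aggregate_predict(X_new):
    # Non-competitive aggregation: a simple average of the posteriors
    # (equal weights; cluster-membership or fuzzy weights could go here).
    avg = np.mean([full_proba(m, X_new) for m in models], axis=0)
    return classes[avg.argmax(axis=1)]

print(aggregate_predict(X[:5]), y[:5])
```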
I know that usually you would use a competitive approach to bounce between the two classification algorithms: one says yes, one says maybe, a weighting could be applied, and if it's correct you get the best of both classifiers. But for the sake of this question I don't want a competitive approach.
Another question: would using these two methods together in such a way be beneficial? I know the example I provided is very primitive and the benefit may not show there, but could it be beneficial on more complex data?
I have a few issues with the method you are following:
K-means puts into each cluster the points that are nearest to its centroid, and then you train a classifier on the output data. The classifier may outperform the implicit classification done by the clustering, but only by taking into account the number of samples in each cluster. For example, if after clustering your training data is typeA (60%), typeB (20%), typeC (20%), your classifier will prefer to assign ambiguous samples to typeA in order to obtain a lower classification error.
K-means depends on which "coordinates"/"features" you take from the objects. If you use features in which objects of different types are mixed, K-means performance will decrease. Deleting these kinds of features from the feature vector may improve your results.
The "features"/"coordinates" that represent the objects you want to classify may be measured in different units. This can affect your clustering algorithm, since you are implicitly setting a unit conversion between them through the clustering error function. The final set of clusters is selected over multiple clustering trials (obtained from different cluster initializations) using an error function, so an implicit comparison is made across the different coordinates of your feature vector (potentially introducing that implicit conversion factor).
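As a small illustration of this last point, standardising the features before clustering removes the implicit unit conversion; a sketch with scikit-learn and made-up data:

```python
# Sketch: put features on a common scale so that no single unit
# dominates the K-means squared-error objective.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features in wildly different units, e.g. metres vs. milliseconds.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X))      # zero mean, unit variance per feature

# Without scaling, the second feature decides almost everything;
# after scaling, both features contribute equally to the distances.
print(raw.cluster_centers_)
print(scaled.cluster_centers_)
```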
Taking these three points into account, you will probably increase the overall performance of your algorithm by adding preprocessing stages. For example, in object recognition for computer vision applications, most of the useful information in the images comes from the edges alone; all of the colour information and part of the texture information goes unused. The edges are extracted by processing the image to obtain Histogram of Oriented Gradients (HOG) descriptors. This descriptor returns "features"/"coordinates" that separate the objects better, thus increasing classification (object recognition) performance. Theoretically, descriptors throw away information contained in the image. However, they present two main advantages: (a) the classifier deals with lower-dimensional data, and (b) descriptors calculated from test data can be more easily matched with training data.
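A minimal sketch of what extracting such a descriptor looks like with scikit-image's HOG implementation (the image is a random stand-in and the parameter values are only illustrative):

```python
# Sketch: compute a HOG descriptor and use it as the feature vector
# that is fed to the clustering/classification stage.
import numpy as np
from skimage.feature import hog

image = np.random.rand(64, 128)            # stand-in for a greyscale image
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys')        # 1-D feature vector
print(features.shape)
```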
In your case, I suggest that you try to improve your accuracy by taking a similar approach:
- Give richer features to your clustering algorithm
- Take advantage of prior knowledge in the field to decide which features you should add to and delete from your feature vector
- Always consider the possibility of obtaining labeled data, so that supervised learning algorithms can be applied
I hope this helps...
Related
I am trying to build an application which would take in an array of [time, pressure] pairs (say about 200 of them).
And I have several more constants, such as:
- Viscosity
- Density
- Volume
- Area
There would be about 3 outputs.
Would it be possible to use a neural network (either Encog or Accord.NET) to feed in the time-pressure data and the constants with the expected outputs, so that the program would be able to estimate the outputs for different time-pressure data and different constant values?
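To make the shape of the problem concrete, this is roughly the data layout I have in mind, sketched in Python with scikit-learn purely for illustration (in the real application it would be Encog or Accord.NET, and every value below is made up):

```python
# Sketch: each training row = the ~200 time-pressure samples flattened,
# plus the constants (viscosity, density, volume, area); targets = 3 outputs.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_timesteps = 500, 200

pressure_series = rng.normal(size=(n_samples, n_timesteps))   # flattened [time, pressure] data
constants = rng.normal(size=(n_samples, 4))                    # viscosity, density, volume, area
X = np.hstack([pressure_series, constants])
Y = rng.normal(size=(n_samples, 3))                            # the 3 expected outputs (toy values)

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 32),
                                   max_iter=2000, random_state=0))
model.fit(X, Y)
print(model.predict(X[:2]))   # estimates for new time-pressure data and constants
```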
Every application in data mining is different, but a great place to start is with Weka. It has a Java and C# API, and it's dead easy to apply different machine learning algorithms with it. Many researchers in my old research team have used it really successfully in the past.
Defining your features, using only discriminative features and cleaning any noise from your feature set is the first place to start, as the algorithms will only work well with a good feature set. The first step to good data mining is preprocessing of the data.
What are the best clustering algorithms to use in order to cluster data with more than 100 dimensions (sometimes even 1000)? I would appreciate it if you know of any implementation in C, C++ or, especially, C#.
It depends heavily on your data. See the curse of dimensionality for common problems. Recent research (Houle et al.) showed that you can't really go by the numbers: there may be thousands of dimensions and the data may still cluster well, and of course there is one-dimensional data that just doesn't cluster. It's mostly a matter of signal-to-noise ratio.
This is why for example clustering of TF-IDF vectors works rather well, in particular with cosine distance.
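For illustration, a small sketch (scikit-learn, placeholder documents): TF-IDF vectors come out L2-normalised, so ordinary Euclidean k-means on them behaves much like cosine (spherical) k-means:

```python
# Sketch: cluster TF-IDF vectors; with unit-length vectors, Euclidean
# distance ranks pairs the same way cosine distance does.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["red blue colour spectrum", "blue sky blue sea",
        "stock market prices fall", "market prices rise sharply"]

X = TfidfVectorizer().fit_transform(docs)   # sparse, very high-dimensional on real data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```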
But the key point is that you first need to understand the nature of your data. You then can pick appropriate distance functions, weights, parameters and ... algorithms.
In particular, you also need to know what constitutes a cluster for you. There are many definitions, in particular for high-dimensional data. They may live in subspaces, they may or may not be arbitrarily rotated, and they may or may not overlap (k-means, for example, doesn't allow overlaps or subspaces).
Well, I know something called vector quantization; it's a nice algorithm for clustering stuff with many dimensions.
I've used k-means on data with hundreds of dimensions; it is very common, so I'm sure there's an implementation in any language. Worst-case scenario, it is very easy to implement yourself.
It might also be worth trying some dimensionality reduction techniques like Principal Component Analysis or an auto-associative neural net before you try to cluster the data. It can turn a huge problem into a much smaller one.
After that, go with k-means or a mixture of Gaussians.
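A minimal sketch of that pipeline (scikit-learn, random stand-in data; the dimensionality and cluster counts are arbitrary):

```python
# Sketch: reduce the dimensionality first, then cluster the reduced data
# with k-means or a Gaussian mixture model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))                    # stand-in for 100-1000 dimensional data

X_reduced = PCA(n_components=20).fit_transform(X)   # keep the strongest directions

km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
gm_labels = GaussianMixture(n_components=5, random_state=0).fit_predict(X_reduced)
print(km_labels[:10], gm_labels[:10])
```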
The EM-tree and K-tree algorithms in the LMW-tree project can cluster high-dimensional problems like this. It is implemented in C++ and supports many different representations.
We have novel algorithms for clustering binary vectors created by LSH / random projections, or anything else that emits binary vectors that can be compared via Hamming distance for similarity.
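To illustrate the kind of input meant here, a generic sketch (NumPy, not the LMW-tree code itself) of turning dense vectors into binary codes with random projections and comparing them by Hamming distance:

```python
# Sketch: LSH-style binary codes via random hyperplanes and sign thresholding.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))          # high-dimensional input vectors
planes = rng.normal(size=(500, 64))       # 64 random hyperplanes

codes = X @ planes > 0                    # 64-bit binary code per vector

def hamming(a, b):
    # Number of differing bits between two binary codes.
    return np.count_nonzero(a != b)

print(hamming(codes[0], codes[1]), hamming(codes[0], codes[0]))
```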
What would be the best way to store subway data in the application?
Data consists of subway station positions, length of the tunnels between stations, alignment of the labels while rendering, types of arcs to draw while rendering tunnels, junctions, etc...
Right now I'm thinking of a heavily extended graph, but (just curious) maybe there is something more convenient? (Obviously, the subway model is used for path finding and routing.)
I would suggest creating different data models that treat different parts of your problem (because you have different bounded contexts).
Using a directed graph is a no-brainer. You should implement it in a very abstract manner so that you can reuse decent, proven path-finding algorithms. Depending on the algorithm you choose (A* is likely a good candidate), your data model needs to be optimized for that algorithm. In the case of A*, this starts by defining a meaningful, practically relevant heuristic between your subway stations (Euclidean distance is fine for a start, but by analyzing the nature of your data and tuning it you are likely to gain a decent boost in performance). Another aspect is using caches for various calculations and quickly discarding stations that are out of the question.
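To make that concrete, here is a compact sketch of A* over an abstract station graph; all station names, coordinates and tunnel lengths are invented, and the heuristic is the plain Euclidean distance mentioned above:

```python
# Sketch: A* shortest path on a small directed graph of stations.
import heapq
import math

coords = {"A": (0, 0), "B": (2, 0), "C": (2, 2), "D": (4, 2)}
graph = {"A": [("B", 2.0)],
         "B": [("C", 2.0), ("D", 3.5)],
         "C": [("D", 2.0)],
         "D": []}                                   # (neighbour, tunnel length)

def heuristic(u, v):
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x1 - x2, y1 - y2)             # straight-line distance

def a_star(start, goal):
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for nxt, length in graph[node]:
            new_cost = cost + length
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt, goal),
                                          new_cost, nxt, path + [nxt]))
    return None, float("inf")

print(a_star("A", "D"))    # (['A', 'B', 'D'], 5.5) on this toy network
```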
For representation, you want to create another model of your graph that can carry all the information relevant to presentation (colors, texts, etc.).
I am looking for some kind of intelligent library (I was thinking AI or a neural network) that I can feed a list of historical data into, and that will predict the next sequence of outputs.
As an example, I would like to feed the library the figures 1, 2, 3, 4, 5, and based on this it should predict that the next sequence is 6, 7, 8, 9, 10, etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is use historical sales data to predict what amount a specific client is most likely to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I merely need to base it on the sales history and then plot a graph showing past sales and predicted sales.
If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Do multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power, and if you combine the right ones you might really get something useful.
If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
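To illustrate: if you assume the function is linear, a least-squares fit of the five known values recovers f(x) = x + 1 and extrapolates f(5) = 6 (a tiny NumPy sketch); a different assumption about the function would give a different answer.

```python
# Sketch: fit a degree-1 polynomial to the known points and extrapolate.
import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = np.array([1, 2, 3, 4, 5])

slope, intercept = np.polyfit(x, y, deg=1)   # assume the function is linear
print(slope, intercept)                      # ~1.0, ~1.0  ->  f(x) = x + 1
print(slope * 5 + intercept)                 # extrapolated f(5) ~ 6.0
```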
I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not want to feed it additional information.
Edit:
If you suspect your data to have a known functional dependence while there are uncontrollable outside factors, regression analysis may give good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)
We have some example pictures.
We also have an input set of pictures. Every input picture is one of the examples after a combination of the following:
1) Rotating
2) Scaling
3) Cutting part of it
4) Adding noise
5) Applying a color filter
It is guaranteed that a human can recognize the picture easily.
I need a simple but effective algorithm to recognize which of the base examples an input picture came from.
I am writing in C# and Java
I don't think there is a single simple algorithm which will enable you to recognise images under all the conditions you mention.
One technique which might cover most is to Fourier transform the image, but this can't be described as simple by any stretch of the imagination, and will involve some pretty heavy mathematical concepts.
You might find it useful to search in the field of Digital Signal Processing which includes image processing since they're just two dimensional signals.
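If you do want to experiment with the Fourier route, the basic step is small (a NumPy sketch with a random stand-in image); the magnitude spectrum is unchanged when the image is shifted, which is one reason Fourier-based matching can tolerate some of the distortions listed in the question:

```python
# Sketch: 2-D Fourier transform of an image and its translation-invariant magnitude.
import numpy as np

image = np.random.rand(128, 128)               # stand-in for a greyscale image
spectrum = np.fft.fftshift(np.fft.fft2(image)) # centre the zero frequency
magnitude = np.abs(spectrum)                   # invariant to (circular) shifts
print(magnitude.shape, magnitude.max())
```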
EDIT: Apparently the problem is limited to recognising MONEY (notes and coins) so the first problem of searching becomes avoiding websites which mention money as the result of using their image-recognition product, rather than as the source of the images.
Anyway, I found more useful hits by searching for 'Currency Image Recognition', including some which mention Hidden Markov Models (whatever that means). That may be the algorithm you're searching for.
The problem is simplified by having a small set of target images, but complicated by the need to detect counterfeits.
I still don't think there's a 'simple algorithm' for this job. Good luck in your searching.
There is some good research going on in the field of computer vision. One of the problems being solved is identification of an object irrespective of scale changes, noise additions and skews introduced because the photo has been taken from a different view. I did a small assignment on this two years ago as part of a computer vision course. There is a transformation called the scale-invariant feature transform (SIFT) by which you can extract various features at corner points. Corner points are those which differ from all of their neighbouring pixels. As you can observe, if a photo has been taken from two different views, some edges may disappear or appear to be something else, but corners remain almost the same. This transformation explains how a feature vector of size 128 can be extracted for every corner point and how to use these feature vectors to measure the similarity between two images. In your case:
You can extract those features for each of the currency notes you have and check for the existence of those corner points in the test image you are supposed to test.
As this transformation is robust to rotation, scaling, cropping, noise addition and color filtering, I guess this is the best I can suggest. You can check this demo to get a better picture of what I explained.
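If you want to try this, OpenCV ships a SIFT implementation. A hedged sketch (Python bindings for brevity; the file names are placeholders, and the same calls exist in OpenCV's C++ and Java APIs) of extracting descriptors from a reference note and a test image and matching them with a ratio test:

```python
# Sketch: SIFT keypoint matching between a reference note and a test image.
import cv2

reference = cv2.imread("reference_note.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
test = cv2.imread("test_image.png", cv2.IMREAD_GRAYSCALE)           # placeholder path

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(reference, None)
kp2, des2 = sift.detectAndCompute(test, None)

matcher = cv2.BFMatcher()
pairs = matcher.knnMatch(des1, des2, k=2)
good = [m for pair in pairs if len(pair) == 2
        for m, n in [pair] if m.distance < 0.75 * n.distance]   # Lowe's ratio test
print(len(good), "matching keypoints")   # more matches -> more likely the same note
```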
OpenCV has lots of algorithms and features; I guess it should be suitable for your problem. However, you'll have to play with P/Invoke to consume it from C# (it's a C library) - doable, but it requires some work.
You would need to build a set of functions that compute the probability of a particular transform between two images, f(A,B). A number of transforms have previously been suggested as answers, e.g. Fourier. You would probably not be able to compute the probability of multiple transforms in one go, fgh(A,B), with any reliability. So you would compute the probability that each transform was independently applied, f(A,B), g(A,B), h(A,B), and those with P above a threshold are the solution.
If the order is important, i.e. you need to know that f(A,B), then g(f,B), then h(g,B) was performed, then you would need to adopt a state-based probability framework such as Hidden Markov Models or a Bayesian network (which is a generalization of HMMs) to model the likelihood of moving between states. See the BNT toolbox for Matlab (http://people.cs.ubc.ca/~murphyk/Software/BNT/bnt.html) or any good modern AI book for more details on these.