Find right features in multiclass SVM without PCA - C#

I'm classifying users with a multiclass SVM (one-against-one), 3 classes. In the binary case, I would be able to plot the distribution of the weight of each feature in the hyperplane equation over different training sets. In that case I don't really need a PCA to see the stability of the hyperplane and the relative importance of the features (which are reduced and centered, by the way). What would the alternative be in a multiclass SVM, given that for each training set you have 3 classifiers and you choose one class according to the result of the three classifiers (how does that work again: the class that appears the most times, or the biggest discriminant? whichever it is does not really matter here)? Does anyone have an idea?
And if it matters, I am writing in C# with Accord.
Thank you !

In a multi-class SVM that uses the one-vs-one strategy, the problem is divided into a set of smaller binary problems. For example, if you have three possible classes, the one-vs-one strategy requires the creation of n(n-1)/2 binary classifiers. In your example, this would be
n(n-1)/2 = 3(3-1)/2 = (3*2)/2 = 3
Each of those will be specialized in the following problems:
Distinguishing between class 1 and class 2 (let's call it svma).
Distinguishing between class 1 and class 3 (let's call it svmb).
Distinguishing between class 2 and class 3 (let's call it svmc).
Now, I see that you have actually asked multiple questions in your original post, so I will address them separately. First I will clarify how the decision process works, and then explain how you could detect which features are the most important.
Since you mentioned Accord.NET, there are two ways this framework might be computing the multi-class decision. The default one is to use a Decision Directed Acyclic Graph (DDAG), which is nothing more than the sequential elimination of classes. The other way is to solve all binary problems and take the class that won most of the time. You can choose between them at the moment you classify a new sample, by setting the method parameter of the SVM's Compute method.
Since the winning-most-of-the-time version is straightforward to understand, I will explain a little more about the default approach, the DDAG.
Decision using directed acyclic graphs
In this algorithm, we test each of the SVMs and eliminate the class that lost at each round. So for example, the algorithm starts with all possible classes:
Candidate classes: [1, 2, 3]
Now it asks svma to classify x, it decides for class 2. Therefore, class 1 lost and is not considered anymore in further tests:
Candidate classes: [2, 3]
Now it asks svmb to classify x, it decides for class 2. Therefore, class 3 lost and is not considered anymore in further tests:
Candidate classes: [2]
The final answer is thus 2.
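If it helps, here is a minimal C# sketch of that elimination loop. The function decide(a, b, x) is a hypothetical stand-in for evaluating the binary machine trained on classes a and b; everything else follows the steps above.

using System;
using System.Collections.Generic;

static int DecideByDdag(List<int> classes, double[] x,
                        Func<int, int, double[], int> decide)
{
    var candidates = new List<int>(classes);    // e.g. [1, 2, 3]
    while (candidates.Count > 1)
    {
        int a = candidates[0];
        int b = candidates[candidates.Count - 1];
        int winner = decide(a, b, x);           // run the binary SVM for (a, b)
        candidates.Remove(winner == a ? b : a); // the losing class is eliminated
    }
    return candidates[0];                       // the last class standing
}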
Detecting which features are the most useful
Now, since the one-vs-one SVM is decomposed into n(n-1)/2 binary problems, the most straightforward way to analyse which features are the most important is to consider each binary problem separately. Unfortunately it might be tricky to globally determine which features are the most important for the entire problem, but it is possible to detect which ones are the most important to discriminate between class 1 and 2, or class 1 and 3, or class 2 and 3.
However, here I can offer a suggestion if you are using DDAGs. Using DDAGs, it is possible to extract the decision path that led to a particular decision. This means it is possible to estimate how many times each of the binary machines was used when classifying your entire database. If you can estimate the importance of a feature for each of the binary machines, and estimate how many times a machine is used during the decision process over your database, you could take their weighted sum as an indicator of how useful a feature is in your decision process, as sketched below.
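For illustration only, a hypothetical sketch of that weighted sum. It assumes you already have, for each binary machine m, the absolute weights of a linear machine (weights[m], one entry per feature) and a count of how often that machine appeared on a decision path (usage[m]):

using System;

static double[] FeatureImportance(double[][] weights, int[] usage)
{
    int features = weights[0].Length;
    var importance = new double[features];
    for (int m = 0; m < weights.Length; m++)    // for each binary machine
        for (int f = 0; f < features; f++)      // for each feature
            importance[f] += usage[m] * Math.Abs(weights[m][f]);
    return importance; // higher = more useful across the whole decision process
}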
By the way, you might also be interested in trying a logistic regression support vector machine using L1-regularization with a high C to perform sparse feature selection:
// Create a new linear machine
var svm = new SupportVectorMachine(inputs: 2);

// Create a new instance of the sparse logistic learning algorithm
var smo = new ProbabilisticCoordinateDescent(svm, inputs, outputs)
{
    // Set learning parameters
    Complexity = 100,
};

// Run the learning algorithm to train the machine; with L1-regularization
// many of the machine's weights should be driven to exactly zero
double error = smo.Run();

I'm not an expert in ML or SVMs; I am self-taught. However, my prototype outperformed some similar commercial and academic software in accuracy, while its training time is about 2 hours, in contrast to the days and weeks(!) of some competitors.
My recognition system (patterns in bio-cells) uses the following approach to select the best features:
1) Extract features and calculate the mean and variance for all classes.
2) Select those features where the means of the classes are most distanced and the variances are minimal (a rough sketch of one possible scoring follows below).
3) Remove redundant features - those whose mean histograms over the classes are similar.
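A rough C# sketch of one way to score step 2 (a Fisher-like ratio; this formalization is only illustrative, not necessarily the exact criterion used in the prototype):

using System;
using System.Linq;

// Score one feature given its per-class means and variances:
// reward well-separated class means, penalize noisy (high-variance) classes.
static double SeparationScore(double[] classMeans, double[] classVariances)
{
    double spread = classMeans.Max() - classMeans.Min();
    double noise = classVariances.Sum() + 1e-12;   // avoid division by zero
    return spread * spread / noise;                // higher = better feature
}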
In my prototype I'm using parametric features, e.g. a feature "circle" with parameters diameter, threshold, etc.
The training is controlled by scripts defining which features, with which argument ranges, are to be used. So the software tests all possible combinations.
For some training-time optimization:
The software begins with 5 instances per class for extracting the features, and increases the number once the 2nd condition is met.
Probably there are academic names for some of these steps. Unfortunately I'm not aware of them; I've "reinvented the wheel" myself.

Related

Predicting new (unknown) sequence values using AForge GA

I've been messing around with the AForge time series genetic algorithm sample and I've got my own version working; at the moment it's just 'predicting' Fibonacci numbers.
The problem is that when I ask it to predict new values beyond the array I've given it (which contains the first 21 numbers of the sequence, using a window size of 5), it won't do it; it throws an exception that says "Data size should be enough for window and prediction".
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right? Is there an easier way? Am I overlooking something massively obvious?
I'd ask on the AForge forum but the developer is not supporting it anymore.
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right?
What you call a "bizarre formula" is called a model in data analysis. You learn such a model from past data, and you can feed it new data to get a predicted outcome. Whether that new outcome makes sense or is just garbage depends on how general your model is. Many techniques can learn models that explain the observed data very well but that are not generalizable, and these will return useless results when you feed new data into them. You need to find a model that explains both the given data and potentially unobserved data, which is a non-trivial process. Usually people estimate the generalization error of a model by splitting the known data into two partitions: one on which the model is learned, and another on which the learned model is tested. You then want to select the model that is accurate on both partitions. You can also check out the answer I gave on another question here, which also treats the topic of machine learning: https://stackoverflow.com/a/3764893/189767
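For example, a minimal holdout split in C# ('data' stands in for your array of samples):

using System;
using System.Linq;

var rnd = new Random(42);
var shuffled = data.OrderBy(_ => rnd.Next()).ToArray(); // shuffle the samples
int cut = (int)(shuffled.Length * 0.8);
var trainSet = shuffled.Take(cut).ToArray();  // learn the model on this part
var testSet  = shuffled.Skip(cut).ToArray();  // estimate generalization error here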
I don't think you're "overlooking something massively obvious", but rather you're faced with a problem that is not trivial to solve.
Btw, you can also use genetic programming (GP) in HeuristicLab. The model in GP is a mathematical formula, and in HeuristicLab you can export that model to e.g. MATLAB.
Regarding Fibonacci: the closed formula for Fibonacci numbers is F(n) = (phi^n - psi^n) / sqrt(5), where phi and psi are special magic numbers, according to Wikipedia. If you want to find that with GP, you need one variable (n), three constants, and the power function. However, it's very likely that you will find a vastly different formula that is similar in output. The problem in machine learning is that very different models can produce the same output. The recursive form requires that you include the values at the two previous n in the data set. This is similar to learning a model for a time series regression problem.
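For reference, that closed formula translated directly into C#:

using System;

static double Fibonacci(int n)
{
    double phi = (1 + Math.Sqrt(5)) / 2;  // the golden ratio
    double psi = (1 - Math.Sqrt(5)) / 2;  // its conjugate
    return (Math.Pow(phi, n) - Math.Pow(psi, n)) / Math.Sqrt(5);
}

// Fibonacci(10) == 55 (up to floating-point rounding)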

Looking for tips to build "TestMaker" (Questions and Responses) application with Evaluation Engine

I'm working on a new project.
My best analogy would be a psychological evaluation test maker.
Aspect #1.
The end-business-user needs to create test questions, with question types, and possible responses to the questions when applicable.
Examples:
1. Do you have red hair? (T/F)
2. What is your favorite color? (Single Response/Multiple Choice)
(Possible Responses: Red, Yellow, Black, White)
3. What size shoe do you wear (rounded to next highest size)? (Integer)
4. How much money do you have on you right now? (Dollar Amount (decimal))
So I need to be able to create questions, their question type, and for some of the questions, possible answers.
Here:
Number 1 is a known type of "True or False".
Number 2 is a known type of "Single Response/Multiple Choice" AND the end-business-user will create the possible responses.
Number 3 is a known type of "Integer". The end-user (person taking the evaluation) can basically put in any integer value.
Number 4 is a known type of "Decimal". Same thing as Integer.
Aspect #2.
The end-business-user needs to evaluate the person's responses. And assign some scalar value to a set of responses.
Example:
If someone responded:
1. T
2. Red
3. >=8
4. (Not considered for this situation)
Some psychiatrist-expert figures out that if someone answered with the above responses, they are 85% more at risk for depression than normal. (Where 85% is a number the end-business-user (psychiatrist) can enter as a parameter.)
So Aspect #2 is running through someone's responses, and determining a result.
The "response grid" would have to be setup so that it will go through (some or all) the possibilities in a priority ranking order, and then after all conditions are met (on a single row), exit out with the result.
Like this:
(1.) T (2.) Red (3.) >=8 ... Result = 85%
(1.) T (2.) Yellow (3.) >=8 ... Result = 70%
(1.) F (2.) Red (3.) >=8 ... Result = 50%
(1.) F (2.) Yellow (3.) >=8 ... Result = 40%
Once a match is found, you exit with the percentage. If you don't find a match, you go to the next rule.
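To illustrate those first-match semantics, here is a hedged C# sketch; the Rule type and the answer dictionary (keyed by question number) are hypothetical, invented only for the example:

using System;
using System.Collections.Generic;
using System.Linq;

class Rule
{
    // Conditions over the answers; all must hold for the rule to match.
    public List<Func<Dictionary<int, string>, bool>> Conditions =
        new List<Func<Dictionary<int, string>, bool>>();
    public double Result;   // e.g. 0.85 for "85% more at risk"
}

static double? Evaluate(List<Rule> rules, Dictionary<int, string> answers)
{
    foreach (var rule in rules)                   // rules already priority-ordered
        if (rule.Conditions.All(c => c(answers)))
            return rule.Result;                   // exit on the first full match
    return null;                                  // no rule matched
}

// Encoding the first grid row: (1.) T  (2.) Red  (3.) >= 8  =>  85%
// var r = new Rule { Result = 0.85 };
// r.Conditions.Add(a => a[1] == "T");
// r.Conditions.Add(a => a[2] == "Red");
// r.Conditions.Add(a => int.Parse(a[3]) >= 8);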
Also, running with this psych evaluation mock example, I don't need to define every permutation. A lot of questions on psych evaluations are not actually used; they are just "fluff". So in my grid above, I have purposely left out question #4. It has no bearing on the results.
There can also be a "Did this person take this seriously?" evaluation grid:
(3.) >=20 (4.) >=$1,000 ... Result = False
(The possibility of having a shoe size >= 20 and having big dollars in your pocket is very low, thus you probably did not take the psych test seriously.)
If no rule is found, (in my real application, not this mock up), I would throw an exception or just not care. I would not need a default or fall-through rule.
In the above, Red and Yellow are "worrisome" favorite colors. If your favorite color is black or white, it has no bearing upon your depression risk factor.
I have used Business Rule Engines in the past. (InRule for example).
They are very expensive, and it is not in the budget.
BizTalk Business Rules Framework is a possibility. Not de$irable, but possible.
My issue with any rules engine (I have limited experience with business rules engines, mind you) is that the "vocabulary" is based on objects with static properties.
public class Employee
{
    public string LastName { get; set; }
    public DateTime HireDate { get; set; }
    public decimal AnnualSalary { get; set; }

    public void AdjustSalary(int percentage)
    {
        // Increase the annual salary by the given percentage
        this.AnnualSalary = this.AnnualSalary + (this.AnnualSalary * percentage / 100m);
    }
}
With a class like this, it would be easy to create business rules:
If
    the (Employee's HireDate) is before (10 years ago)
then
    (Increase Their Salary) By (5) Percent
else
    (Increase Their Salary) By (3) Percent
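That rule translates directly into C# against the class above:

if (employee.HireDate < DateTime.Now.AddYears(-10))
    employee.AdjustSalary(5);   // hired more than 10 years ago
else
    employee.AdjustSalary(3);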
But in my situation, the Test is composed of (dynamic) Questions and (dynamic) Responses, not predetermined properties.
So I guess I'm looking for some ideas to investigate on how to pull this off.
I know I can build a "TestMaker" application fairly quickly.
The biggest issue is integrating the Questions and (Possible Responses) into "evaluating rules".
Thanks for any tips.
Technologies:
DotNet 4.0 Framework
Sql Server 2008 Backend Database
VS2010 Pro, C#
If it's a small application, i.e. tens of users as opposed to thousands, and it's not business critical, then I would recommend Excel.
The advantages are that most business users are very familiar with Excel and most probably have it on their machines. Basically, you ship an Excel workbook to your business users. They enter the questions and the type (Decimal, True/False, etc.), then click a button which triggers an Excel macro. This could generate an XML configuration file or put the questions into SQL. Your application just reads it, displays it, and collects responses as usual.
The main advantage of Excel comes in Aspect #2, the dynamic user-chosen business rules. In another sheet of the same Excel document, the end-business-user can specify as many of the permutations of the responses/questions as they feel like. Excel users are very familiar with entering simple formulas like =IF(AND(A1>20, A2<50), ...). Again, the user pushes a button and either generates a configuration file or inputs the data into SQL Server.
In your application you iterate through the rules table and apply the rules to the responses.
Given the number of users/surveys etc., this simple solution would be much simpler than BizTalk or a full-on custom rules engine. BizTalk would be good if you need to talk to multiple systems, integrate their data, enforce rules, create workflows, etc. Most of the other rules engines are also geared towards really big, complex rule systems. The complexity in these rule systems isn't just the number of rules or permutations; it is mostly having to talk to multiple systems, or "end-dating" certain rules, or putting in rules with future kick-off dates, etc.
In your case, an Excel-based system or a similar datagrid on a webpage would work fine. Unless of course you are doing a project for Gartner, some other global data survey firm, or a major government statistical organisation. In that case I would suggest BizTalk or other commercial rules engines.
In your case, it's SQL tables with Questions, Answer Types, and Rules To Apply. Input is made user-friendly via Excel (or "Excel on the web" via a datagrid), and then you just iterate through the rules and apply them to the response table.
Are you sure you want to use a business rule engine for this problem?
As far as I understand it, the use case for a BRE is:
Mostly static execution flow
Some decisions do change often
In your use case, the whole system (Q/A flow and evaluation) is dynamic, so IMHO a simple domain-specific language for the whole system would be a better solution.
You might want to take some inspiration from testMaker - web-based software for exactly this workflow. (Disclaimer: I contributed a bit to this project.) Its scoring rules are quite basic, so this might not help you that much. It was designed to export data to SPSS and build reports from there...
Be sure you are modeling a database suitable for hierarchical objects. This article would help.
Create a table for tests.
Create a table for questions, with columns question, testid, and questiontype.
Create a table for answers, with columns answerid, questionid, answer, and istrue.
Answers belong to questions; one question can have many answers.
Create a table for users or employees.
Create a table for user answers, with columns answerid and selectedanswer.
For the evaluation, use IComparable parameters so both boolean and integer answers are covered, and wrap the call in try/catch for exception coverage:

using System;

// The condition type can be "truefalse", "biggerthan", "smallerthan" or "equals".
static bool Evaluate(string questionType, IComparable answer, IComparable userAnswer)
{
    switch (questionType)
    {
        case "biggerthan":  return userAnswer.CompareTo(answer) > 0;
        case "smallerthan": return userAnswer.CompareTo(answer) < 0;
        case "truefalse":
        case "equals":      return userAnswer.CompareTo(answer) == 0;
        default:            throw new ArgumentException("Unknown question type: " + questionType);
    }
}
Get the output as a data dictionary and post it here, please.
Without a data schema, the help you get will be limited.

LibSVM turns all my training vectors into support vectors, why?

I am trying to use SVM for News article classification.
I created a table that contains the features (unique words found in the documents) as rows.
I created weight vectors mapping to these features, i.e. if the article has a word that is part of the feature vector table, that location is marked as 1, or else 0.
Ex: a generated training sample...
1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1
As this is the first document all the features are present.
I am using 1, 0 as class labels.
I am using svm.Net for classification.
I gave 300 manually classified weight vectors as training data, and the generated model is taking all the vectors as support vectors, which is surely overfitting.
My total number of features (unique words / row count in the feature vector DB table) is 7610.
What could be the reason?
Because of this overfitting my project is now in pretty bad shape. It is classifying every article available as a positive article.
In LibSVM binary classification, is there any restriction on the class labels?
I am using 0, 1 instead of -1 and +1. Is that a problem?
You need to do some type of parameter search. Also, if the classes are unbalanced, the classifier might get artificially high accuracies without doing much. This guide is good at teaching basic, practical things, so you should probably read it.
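As a starting point, a minimal grid-search sketch in C#; evaluate(c, gamma) is a placeholder you would fill in with your own train-and-validate call (svm.Net, libsvm, or similar):

using System;

// Hypothetical helper: train with (c, gamma) on one data split and
// return the accuracy measured on a held-out split.
Func<double, double, double> evaluate = (c, gamma) => { /* ... */ return 0; };

double bestAcc = 0, bestC = 0, bestGamma = 0;
foreach (double c in new[] { 0.01, 0.1, 1.0, 10.0, 100.0 })
    foreach (double gamma in new[] { 1e-4, 1e-3, 1e-2, 1e-1 })
    {
        double acc = evaluate(c, gamma);
        if (acc > bestAcc) { bestAcc = acc; bestC = c; bestGamma = gamma; }
    }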
As pointed out, a parameter search is probably a good idea before doing anything else.
I would also investigate the different kernels available to you. The fact that your input data is binary might be problematic for the RBF kernel (or might render its usage sub-optimal compared to another kernel). I have no idea which kernel would be better suited, though. Try a linear kernel, and look around for more suggestions/ideas :)
For more information and perhaps better answers, look on stats.stackexchange.com.
I would definitely try using -1 and +1 for your labels, that's the standard way to do it.
Also, how much data do you have? Since you're working in 7610-dimensional space, you could potentially have that many support vectors, where a different vector is "supporting" the hyperplane in each dimension.
With that many features, you might want to try some type of feature selection method like principal component analysis.

Permutation/Algorithm to Solve Conditional Fill Puzzle

I've been digging around to see if something similar has been done previously, but have not seen anything with the mirrored conditions. To make the problem a little easier to swallow, I'm going to present it in the context of filling a baseball team roster.
The given roster structure is organized as such: C, 1B, 2B, 3B, SS, 2B/SS (either or), 1B/3B, OF, OF, OF, OF, UT (can be any position)
Every player is eligible at at least one of the specific positions (as opposed to the flex slots, which accept more than one position), and in many cases more than one (i.e. a player that can play 1B and OF, etc.). Say that you are the manager of a team which already has some players on it, and you want to see if you have room for a particular player at any of your slots, or if you can move one or more players around to open up a slot where he is eligible.
My initial attempt was to use a conditional permutation and collect in a list all the possible unique "lineups" for each player, updating the open slots before moving to the next player. This also required (since the order in which a player was moved would affect what positions were available for the next player) that the list being looped through be reordered and then looped through again. I still think that this is the way to go, but a number of pitfalls have snagged the function.
The data to start the loop that you assume is given is:
1. List of positions the player being evaluated can play (the one being checked if he can fit)
2. List of players currently on the roster and the positions each of those is eligible at (I'm currently storing a list of lists and using the list index as the unique identifier of the player)
3. A list of the positions open as the roster currently is
It's proven a bigger headache than I originally anticipated. A colleague even suggested that the situation I have (which involves, on a much larger scale, conditional assignments for each object) was NP-complete. I am certain that it is not, since once a player has been repositioned in a particular lineup being tested, the entire roster should not need to be iterated over again after another player has moved. That's the long and short of it, and I finally decided to open it up to the forums.
Thanks for any help anyone can provide. Due to restrictions, I can't post portions of the code (some of it is legacy). It is, however, being translated to .NET (C# at the moment). If additional information is necessary, I'll try to rewrite some of the short pieces of the function to post.
Joseph G.
EDITED 07/24/2010
Thank you very much for the responses. I actually did look into using a genetic algorithm, but ultimately abandoned it because the amount of work that would go into determining ordinal results was superfluous. The ultimate aim of the test is to determine whether there is, in fact, a scenario that returns a positive. There's no need to determine the relative benefit of each working solution.
I appreciate the feedback on the likely lack of familiarity with the context I presented the problem. The actual model is in the distribution of build commands across multiple platform-specific build servers. It's accessible, but I'd rather not get into why certain build tasks can only be executed on certain systems and why certain systems can only execute certain types of build commands.
It appears that you have gotten the gist of what I was presenting, but here's a different model that's a little less specific. There is a set of discrete positions in an ordered array of lists, as such (I'll refer to these as "positions"):
((2), (2), (3), (4), (5), (6), (4, 6), (3, 5), (7), (7), (7), (7), (7), (2, 3, 4, 5, 6, 7))
Additionally, there is an unordered array of lists (which I'll refer to as "employees"); an employee can only occupy a slot if its array has a member in common with the ordered list to which it would be assigned. After the initial assignments have been made, if an additional employee comes along, I need to determine if he can fill one of the open positions, and if not, whether the current employees can be rearranged so that one of the positions the new employee CAN fill becomes available.
Brute force is something I'd like to avoid, because with this being on the order of 40-50 objects (and soon to be increasing), individual determinations will be very expensive to calculate at runtime.
I don't understand baseball at all so sorry if I'm on the wrong track. I do like rounders though, but there are only 2 positions to play in rounders, a batter or everyone else.
Have you considered using genetic algorithms to solve this problem? They are very good at solving NP-hard problems and work surprisingly well for rota and time-schedule type problems as well.
You have a solution model which can easily be scored and easily manipulated which is a great start for a genetic algorithm.
For more complex problems, where the total permutations are too many to calculate, a genetic algorithm should find a near-optimum or excellent solution (along with lots and lots of other valid solutions) in a fairly short amount of time. Although if you wish to find the optimum solution every time, you are going to have to brute-force it in all likelihood (I have only skimmed the problem so this may not be the case, but it sounds like it probably is).
In your example, you would have a solution class; this represents a solution, i.e. a line-up for the baseball team. You randomly generate, say, 20 solutions, regardless of whether they are valid or not. Then you have a rating algorithm that rates each solution. In your case, a better player in the line-up would score more than a worse player, and any invalid line-up (for whatever reason) would force a score of 0.
Any 0-scoring solutions are killed off and replaced with new random ones, and the rest of the solutions breed together to form new solutions. Theoretically, after enough time, the pool of solutions should improve.
This has the benefit of not only finding lots of valid unique line-ups, but also rating them. You didn't specify in your problem the need to rate the solutions, but it offers plenty of benefits (for example, if a player is injured, he can be temporarily rated at -10 or whatever). All other players score based on their quantifiable stats.
It's scalable and performs well.
It sounds as though you have a bipartite matching problem. One partition has a vertex for each player on the roster. The other has a vertex for each roster position. There is an edge between a player vertex and a position vertex if and only if the player can play that position. You are interested in matchings: collections of edges such that no endpoint is repeated.
Given an assignment of players to positions (a matching) and a new player to be accommodated, there is a simple algorithm to determine if it can be done. Direct each edge in the current matching from the position to the player; direct the others from the player to the position. Now, using breadth-first search, look for a path from the new player to an unassigned position. If you find one, it tells you one possible series of reassignments. If you don't, there's no matching with all of the players.
For example, suppose player A can play positions 1 or 2
A--1
 \
  \
   2
We provisionally assign A to 2. Now B shows up and can only play 2. Direct the graph (the matched edge now points from position 2 back up to A):
A->1
 ^
  \
B->2
We find a path B->2->A->1, which means "assign B to 2, displacing A to 1".
There is lots of pretty theory for dealing with hypothetical matchings. Genetic algorithms need not apply.
EDIT: I should add that because of the use of BFS, it computes the least disruptive sequence of reassignments.
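For concreteness, here is a C# sketch of the matching search. It uses the DFS-based variant (Kuhn's algorithm) rather than the BFS version described above, so it finds a chain of reassignments, not necessarily the least disruptive one. The names canPlay and matchOfSlot are illustrative:

// canPlay[p][s]  : player p is eligible for slot s
// matchOfSlot[s] : player currently assigned to slot s, or -1 if open
// visited        : must be cleared to all-false before each top-level call
static bool TryAssign(int player, bool[][] canPlay, int[] matchOfSlot, bool[] visited)
{
    for (int s = 0; s < matchOfSlot.Length; s++)
    {
        if (!canPlay[player][s] || visited[s]) continue;
        visited[s] = true;
        // Take the slot if it is free, or if its current occupant can
        // itself be displaced into some other slot he is eligible for.
        if (matchOfSlot[s] == -1 ||
            TryAssign(matchOfSlot[s], canPlay, matchOfSlot, visited))
        {
            matchOfSlot[s] = player;
            return true;
        }
    }
    return false;   // no augmenting path: the new player cannot be accommodated
}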

Possible Combination of Knapsack problem and?

Alright, a quick overview:
I have looked into the knapsack problem
http://en.wikipedia.org/wiki/Knapsack_problem
and I know it is what I need for my project, but the complicated part of my project is that I need multiple sacks inside a main sack.
The large knapsack that holds all the "bags" can only carry x number of "bags" (let's say 9 for the sake of example). Each bag has different values:
Weight
Cost
Size
Capacity
and so on; all of those values are integers. Let's assume they range from 0 to 100.
Each inner bag will also be assigned a type, and there can only be one of each type within the outer bag, although the program input will contain multiples of the same type.
I need to assign a maximum weight that the main bag can hold, and all the other properties of the smaller bags need to be combined by weighted values.
Example
Outer Bag:
Can hold 9 smaller bags
Weight no more than 98 [Give or take 5 either side]
Must hold one of each type; can only hold one of each type at a time.
Inner Bags:
Cost, Weighted at 100%
Size, Weighted at 67%
Capacity, Weighted at 44%
The program will be given an input of multiple bags and then must work out combinations of smaller bags to go into the larger bag. There will be multiple solutions depending on the input, and the program should output the best solutions for me.
I am wondering what you guys think the best way for me to approach this would be.
I will be programming it in either Java or C#. I would love to program it in PHP, but I'm afraid the algorithm would be very inefficient on web servers.
Thanks for any help you can give
-Zack
Okay, well, knapsack is NP-hard, so I'm pretty certain this will be NP-hard as well (if it weren't, you could solve knapsack by doing this with only one outer bag). So for an exactly optimal solution, you're probably going to be able to do no better than searching all combinations. The outline of the program you want will be like
for each possible combination do
    if current combination is better than best previous
        save current combination as best so far
    fi
od
and the run time will be exponential. It sounds, though, like you might be able to get a near-optimal solution with dynamic programming.
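To make that outline concrete, here is a brute-force sketch in C#. It picks one bag per type and keeps the best pick under the weight limit; the Bag type and the scoring weights come from the example above (Cost 100%, Size 67%, Capacity 44%), everything else is an assumption:

using System;
using System.Collections.Generic;
using System.Linq;

class Bag { public int Weight, Cost, Size, Capacity; }

static class BagSearch
{
    static List<Bag> best;
    static double bestScore = double.MinValue;

    // Weighted score of a combination, per the example weightings.
    static double Score(List<Bag> bags)
    {
        return bags.Sum(b => b.Cost * 1.00 + b.Size * 0.67 + b.Capacity * 0.44);
    }

    // bagsByType[i] holds the candidate bags of type i (9 types = 9 slots).
    static void Search(List<List<Bag>> bagsByType, int type, List<Bag> chosen)
    {
        if (type == bagsByType.Count)
        {
            // Keep the best combination under the outer bag's weight limit.
            if (chosen.Sum(b => b.Weight) <= 98 && Score(chosen) > bestScore)
            {
                bestScore = Score(chosen);
                best = new List<Bag>(chosen);
            }
            return;
        }
        foreach (Bag bag in bagsByType[type])   // exactly one bag of each type
        {
            chosen.Add(bag);
            Search(bagsByType, type + 1, chosen);
            chosen.RemoveAt(chosen.Count - 1);  // backtrack
        }
    }
}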
Consider using Prolog for your logic programming. There are multiple implementations of it, including P# on Mono (.NET). There's a bit of a learning curve, but once you get used to it, it's pretty much in a league of its own for this kind of problem solving.
Hope this helps. Cheers!
link to P#
