I have been doing some research on neural networks, and the concept and theory as a whole make sense to me. The one question I haven't been able to find an answer to is how many neurons a neural net should use to achieve proper/efficient results, including the number of hidden layers, neurons per hidden layer, etc. Do more neurons necessarily mean more accurate results (while being more taxing on the system), or will fewer neurons still be sufficient? Is there some sort of governing rule to help determine those numbers? Does it depend on the type of training/learning algorithm that is being implemented in the neural net? Does it depend on the type of data/input that is being presented to the network?
If it makes the questions easier to answer, I will most likely be using a feedforward network with backpropagation as the main method for training and prediction.
On a side note, is there a prediction algorithm/firing rule or learning algorithm that is generally regarded as "the best/most practical", or is that also dependent on the type of data being presented to the network?
Thanks to anyone with any input, it's always appreciated!
EDIT: Regarding the C# tag, that is the language in which I'll be putting together my neural network, if that information helps at all.
I specialized in AI / NN in college and have had some amateur experience working on them for games, and here is what I found as a guide for getting started. Realize, however, that each NN will take some tweaking to work best in your chosen environment. (One potential solution is to expose your program to 1000s of different NNs, set up a testable criterion for performance, and then use a Genetic Algorithm to propagate the more useful NNs and cull the less useful ones - but that is a whole other very large post...)
In general, I found the following (AN = artificial neuron):
Input Layer - One AN for each input vector + 1 Bias (always 1)
Inner Layer - Double the Input Layer
Output Layer - One AN for each Action or Result
Example: Character Recognition
If you are examining a 10x10 grid for character recognition;
start with 101 Input AN (one for each pixel, plus one bias)
202 Inner AN
and 26 Output AN (one for each letter of the alphabet)
Example: Blackjack
If you are building a NN to "win at blackjack";
start with 16 Input AN (13 to count each occurrence of a card, 1 for player hand value, 1 for dealer "up-card", and 1 bias)
32 Inner AN
and 6 Output AN (one each for "Hit", "Stay", "Split", "Double", "Surrender" and "Insurance")
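The rule of thumb above is simple arithmetic, so here is a minimal sketch of it in Python (the function name `starting_layer_sizes` is made up for illustration). It only gives a starting point for tweaking, not a guaranteed good architecture.

```python
def starting_layer_sizes(num_inputs, num_outputs):
    """Rule-of-thumb starting sizes: inputs + 1 bias, hidden = 2 x input layer."""
    input_layer = num_inputs + 1      # one neuron per input value, plus the bias
    hidden_layer = 2 * input_layer    # "double the input layer"
    output_layer = num_outputs        # one neuron per action or result
    return input_layer, hidden_layer, output_layer


# The two examples from the text
print(starting_layer_sizes(10 * 10, 26))  # character recognition on a 10x10 grid
print(starting_layer_sizes(15, 6))        # blackjack: 13 card counts + hand value + up-card
```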
Here are some general rules, based on the paper 'Approximating Number of Hidden layer neurons in Multiple Hidden Layer BPNN Architecture' by Saurabh Karsoliya. Source here
The number of hidden layer neurons are 2/3 (or 70% to 90%) of the size of the input layer. If this is insufficient then number of output layer neurons can be added later on.
The number of hidden layer neurons should be less than twice of the number of neurons in input layer.
The size of the hidden layer neurons is between the input layer size and the output layer size.
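To compare what these three rules suggest for a given network, they can be computed side by side. This is a rough sketch in Python; the function and key names are invented here, and rounding 2/3 to the nearest integer is my own assumption.

```python
def hidden_size_candidates(n_in, n_out):
    """Candidate hidden-layer sizes from the three rules of thumb quoted above."""
    two_thirds = round(2 * n_in / 3)  # rounding is an assumption of this sketch
    return {
        "two_thirds_of_input": two_thirds,
        # "if this is insufficient", add the number of output neurons
        "two_thirds_plus_output": two_thirds + n_out,
        # the hidden layer should stay strictly below this value
        "upper_bound_exclusive": 2 * n_in,
        # the hidden size should fall somewhere in this range
        "between_in_and_out": (min(n_in, n_out), max(n_in, n_out)),
    }


print(hidden_size_candidates(100, 26))
```

The rules do not always agree with each other, so treat the results as a range to experiment within.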
I have a neural network with a lot of inputs, and I want to train it to realise that only one of the inputs matters. First I train it with input[1] = 1 and a target result of 10,
then I train with the exact same inputs except input[1] = 0 and a target result of 0.
I train on each one until the error is 0 before I switch to the other, but the network just keeps moving different weights up and down until the output equals the target. It never figures out that only the weights related to input[1] need to change.
Is this a common error, so to speak, that can be bypassed somehow?
PS: I'm using sigmoid and its derivative.
What you are doing is incremental (or selective) learning. Each time you retrain the network on new data for several epochs, you overfit to that new data. If you don't care about incremental learning and only care about the result on your data set, it is better to train on batches drawn from the whole data set over several iterations, until your network converges without overfitting the training data.
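Here is a minimal sketch of the interleaved approach in Python. It is not the asker's network: it assumes a single linear output unit (a sigmoid output cannot reach a target of 10), and the input values are made up. Both examples are shuffled into every epoch instead of being trained to zero error one at a time.

```python
import random


def train_interleaved(examples, epochs=2000, lr=0.1, seed=0):
    """Train one linear unit with per-example gradient steps, mixing all examples."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        order = examples[:]
        rng.shuffle(order)  # every epoch sees both examples, in random order
        for x, target in order:
            pred = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = target - pred
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


# Hypothetical data: the two examples differ only in input[1]
examples = [
    ([0.5, 1.0, 0.3, 0.8], 10.0),
    ([0.5, 0.0, 0.3, 0.8], 0.0),
]
weights, bias = train_interleaved(examples)
print(weights, bias)
```

Starting from zero weights, only the weight on input[1] should end up far from zero, because that input is the only thing that distinguishes the two examples.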
I'm classifying users with a multiclass SVM (one-against-one), 3 classes. In the binary case, I would be able to plot the distribution of the weight of each feature in the hyperplane equation for different training sets. In that case, I don't really need a PCA to see the stability of the hyperplane and the relative importance of the features (centered and reduced, by the way). What would the alternative be for a multiclass SVM, given that for each training set you have 3 classifiers and you choose one class according to the result of the three classifiers? (Which rule is it again: the class that appears the maximum number of times, or the largest discriminant? Either way, it does not really matter here.) Does anyone have an idea?
And if it matters, I am writing in C# with Accord.
Thank you !
In a multi-class SVM that uses the one-vs-one strategy, the problem is divided into a set of smaller binary problems. For example, if you have n possible classes, using the one-vs-one strategy requires the creation of (n(n-1))/2 binary classifiers. In your example, with n = 3, this would be
(n(n-1))/2 = (3(3-1))/2 = (3*2)/2 = 3
Each of those will be specialized in the following problems:
Distinguishing between class 1 and class 2 (let's call it svma).
Distinguishing between class 1 and class 3 (let's call it svmb)
Distinguishing between class 2 and class 3 (let's call it svmc)
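The list of binary problems is just every unordered pair of classes, which is why there are n(n-1)/2 of them. A small Python sketch (the function name is made up) that enumerates the pairs:

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """Return every unordered pair of classes; each pair gets its own binary SVM."""
    return list(combinations(classes, 2))


pairs = one_vs_one_pairs([1, 2, 3])
print(pairs)       # the three problems listed above
print(len(pairs))  # n(n-1)/2 for n = 3
```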
Now, I see that you have actually asked multiple questions in your original post, so I will answer them separately. First I will clarify how the decision process works, and then explain how you could detect which features are the most important.
Since you mentioned Accord.NET, there are two ways this framework might be computing the multi-class decision. The default one is to use a Decision Directed Acyclic Graph (DDAG), which is nothing more than the sequential elimination of classes. The other way is to solve all binary problems and take the class that won most of the time. You can choose between them at the moment you classify a new sample, by setting the method parameter of the SVM's Compute method.
Since the winning-most-of-the-time version is straightforward to understand, I will explain a little more about the default approach, the DDAG.
Decision using directed acyclic graphs
In this algorithm, we test each of the SVMs and eliminate the class that lost at each round. So for example, the algorithm starts with all possible classes:
Candidate classes: [1, 2, 3]
Now it asks svma to classify x, it decides for class 2. Therefore, class 1 lost and is not considered anymore in further tests:
Candidate classes: [2, 3]
Now it asks svmc (the machine for class 2 versus class 3) to classify x, and it decides for class 2. Therefore, class 3 lost and is not considered anymore in further tests:
Candidate classes: [2]
The final answer is thus 2.
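Both decision strategies can be sketched in a few lines of Python. This is an illustration of the idea only, not Accord.NET's implementation: `winner_of` is a hypothetical callback standing in for the trained binary machines, and testing the first two remaining candidates each round is a choice made here to mirror the walkthrough above.

```python
from itertools import combinations


def ddag_decide(classes, winner_of):
    """Eliminate one class per round; return the survivor and the pairs tested."""
    candidates = list(classes)
    path = []
    while len(candidates) > 1:
        a, b = candidates[0], candidates[1]
        winner = winner_of(a, b)  # ask the binary machine for (a, b)
        path.append((a, b))
        candidates.remove(b if winner == a else a)  # drop the loser
    return candidates[0], path


def vote_decide(classes, winner_of):
    """Run every binary machine and return the class with the most wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[winner_of(a, b)] += 1
    return max(votes, key=votes.get)


# Toy outcomes for one sample x (made up for this example)
toy = {(1, 2): 2, (1, 3): 3, (2, 3): 2}
winner_of = lambda a, b: toy[(a, b)]

print(ddag_decide([1, 2, 3], winner_of))
print(vote_decide([1, 2, 3], winner_of))
```

The DDAG needs only n-1 evaluations per sample, while voting needs all n(n-1)/2. The returned path records which binary machines took part in the decision.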
Detecting which features are the most useful
Now, since the one-vs-one SVM is decomposed into (n(n-1)/2) binary problems, the most straightforward way to analyse which features are the most important is by considering each binary problem separately. Unfortunately it might be tricky to globally determine which are the most important for the entire problem, but it will be possible to detect which ones are the most important to discriminate between class 1 and 2, or class 1 and 3, or class 2 and 3.
However, here I can offer a suggestion if you are using DDAGs. Using DDAGs, it is possible to extract the decision path that led to a particular decision. This means that it is possible to estimate how many times each of the binary machines was used when classifying your entire database. If you can estimate the importance of a feature for each of the binary machines, and estimate how many times a machine is used during the decision process in your database, perhaps you could take their weighted sum as an indicator of how useful a feature is in your decision process.
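A rough Python sketch of that weighted sum follows. All names and numbers are hypothetical: `importance_per_machine` holds one importance value per feature for each binary machine (for a linear machine, perhaps the absolute weights), and `decision_paths` lists the machines consulted for each sample.

```python
from collections import Counter


def weighted_feature_importance(importance_per_machine, decision_paths):
    """Weight each machine's feature importances by how often the machine was used."""
    usage = Counter(m for path in decision_paths for m in path)
    total = sum(usage.values())
    n_features = len(next(iter(importance_per_machine.values())))
    scores = [0.0] * n_features
    for machine, importances in importance_per_machine.items():
        share = usage[machine] / total  # fraction of all evaluations
        for i, value in enumerate(importances):
            scores[i] += share * value
    return scores


importance = {"svma": [0.5, 0.25], "svmb": [1.0, 0.0], "svmc": [0.25, 0.75]}
paths = [["svma", "svmc"], ["svma", "svmc"]]  # svmb was never consulted
print(weighted_feature_importance(importance, paths))
```

A machine that is never on a decision path contributes nothing, no matter how large its weights are.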
By the way, you might also be interested in trying one of the Logistic Regression Support Vector Machines using L1-regularization with a high C to perform sparse feature selection:
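// Assumes these Accord.NET namespaces, and that `inputs` (double[][])
// and `outputs` (int[]) already hold your training data
using Accord.MachineLearning.VectorMachines;
using Accord.MachineLearning.VectorMachines.Learning;
// Create a new linear machine
var svm = new SupportVectorMachine(inputs: 2);
// Create a new instance of the sparse logistic learning algorithm
var smo = new ProbabilisticCoordinateDescent(svm, inputs, outputs)
{
    // Set learning parameters
    Complexity = 100,
};
// Run the learning algorithm
double error = smo.Run();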
I'm not an expert in ML or SVMs; I am self-taught. However, my prototype outperformed some similar commercial and academic software in accuracy, while the training time is about 2 hours, in contrast to the days and weeks(!) of some competitors.
My recognition system (patterns in bio-cells) uses following approach to select best features:
1) Extract features and calculate the mean and variance for all classes
2) Select those features where the class means are farthest apart and the variances are minimal
3) Remove redundant features - those whose mean-histograms over the classes are similar
In my prototype I'm using parametric features, e.g. the feature "circle" with parameters diameter, threshold, etc.
The training is controlled by scripts defining which features to use and with which argument ranges, so the software tests all possible combinations.
For some training-time optimization:
The software begins with 5 instances per class for extracting the features and increases the number when the 2nd condition is met.
There are probably academic names for some of these steps. Unfortunately I'm not aware of them; I "reinvented the wheel" myself.
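Step 2 resembles what is usually called a Fisher-style separation score: the spread of the class means divided by the spread within the classes. A minimal Python sketch under that assumption (this is not the author's code, and the feature values are invented):

```python
from statistics import mean, pvariance


def separation_score(values_by_class):
    """Between-class variance of the means divided by the mean within-class variance."""
    class_means = [mean(v) for v in values_by_class.values()]
    within = mean(pvariance(v) for v in values_by_class.values())
    between = pvariance(class_means)
    return between / within if within > 0 else float("inf")


def rank_features(features):
    """Sort feature names from most to least separating."""
    return sorted(features, key=lambda name: separation_score(features[name]),
                  reverse=True)


# Hypothetical measurements: feature -> class -> sample values
features = {
    "diameter": {"A": [1.0, 1.1, 0.9], "B": [5.0, 5.1, 4.9]},
    "threshold": {"A": [1.0, 5.0, 3.0], "B": [2.0, 4.0, 3.0]},
}
print(rank_features(features))
```

This covers only the selection step. Removing redundant features (step 3) would need an extra comparison between features, which is not shown here.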
I'm implementing Ng's example of OCR neural network in C#.
I think I've got all formulas correctly implemented [vectorized version] and my app is training the network.
Any advice on how can I see my network improving in recognition - without testing examples manually by drawing them after the training is done? I want to see where my training is going while it's being trained.
I've tested my trained weights on a drawn digit. The output is quite similar on all neurons (approx. 0.077 or something like that on every neuron), and the largest value is on the wrong neuron, so the result doesn't match the drawn image.
This is the only test I'm doing so far: Cost Function changes with epochs
So, this is what happens with Cost function (some call it objective function? ) in 50 epochs.
My lambda value is set to 3.0 and the learning rate is 0.01, with 5000 examples. I do a batch update after each epoch, i.e. after those 5000 examples. Activation function: sigmoid.
input: 400
hidden: 25
output: 10
I don't know what proper values are for lambda and learning rate so that my network can learn without overfitting or underfitting.
Any suggestions how to find out my network is learning well?
Also, what value should J cost function have after all this training?
Should it approach zero?
Should I have more epochs?
Is it bad that my examples are all ordered by digits?
Any help is appreciated.
Q: Any suggestions how to find out my network is learning well?
A: Split the data into three groups: training, cross validation, and test. Validate your result with the test data. This is actually addressed later in the course.
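A minimal Python sketch of such a split is shown here. The 60/20/20 proportions and the function name are assumptions of this example, not something the course prescribes. Shuffling first also stops examples that are ordered by digit from landing in a single partition.

```python
import random


def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle, then cut into training, cross-validation and test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


train_set, val_set, test_set = split_dataset(range(5000))
print(len(train_set), len(val_set), len(test_set))
```

Track the cost on the cross-validation set while training. If it rises while the training cost keeps falling, the network is overfitting.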
Q: Also, what value should J cost function have after all this training? Should it approach zero?
A: I recall that Ng mentioned the expected value in the homework. The regularized cost should not be zero, since it includes a sum over all the squared weights.
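To see why, here is a sketch in Python of the L2-regularized cost in the form used in the course, J = data cost + (lambda / (2m)) * sum of squared weights. The function name is made up, and the caller is assumed to pass only the non-bias weights.

```python
def regularized_cost(data_cost, weights, lam, m):
    """Add the L2 penalty (lam / (2 * m)) * sum(w^2) to the unregularized cost."""
    penalty = (lam / (2 * m)) * sum(w * w for w in weights)
    return data_cost + penalty


# Even a perfect fit (data cost 0) leaves a positive total while weights are non-zero
print(regularized_cost(0.0, [1.0, 2.0], lam=3.0, m=6))
```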
Q: Should I have more epochs?
A: If you run your program long enough (less than 20 minutes?) you will see that the cost stops getting smaller. I assume that means it has reached a local/global optimum, so more epochs would not be necessary.
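Rather than watching the numbers by eye, a simple check can flag when the cost has stopped improving. This is a sketch in Python with arbitrary defaults; the window size and tolerance are assumptions you would tune.

```python
def has_plateaued(costs, window=5, tolerance=1e-4):
    """True when the cost improved by less than `tolerance` over the last `window` epochs."""
    if len(costs) <= window:
        return False  # not enough history yet
    improvement = costs[-window - 1] - costs[-1]
    return improvement < tolerance


print(has_plateaued([10, 9, 8, 7, 6, 5, 4]))  # still dropping by 1 per epoch
print(has_plateaued([1.0] * 10))              # flat for the whole window
```

Call it once per epoch with the list of costs so far and stop training when it returns True.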
Q: Is it bad that my examples are all ordered by digits?
A: The algorithm modifies the weights for every example, so a different order of the data does affect each step within a batch. However, the final result should not differ much.
I've been messing around with the AForge time series genetic algorithm sample and I've got my own version working. At the moment it's just 'predicting' Fibonacci numbers.
The problem is that when I ask it to predict new values beyond the array I've given it (which contains the first 21 numbers of the sequence, using a window size of 5), it won't do it. It throws an exception that says "Data size should be enough for window and prediction".
As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right? Is there an easier way? Am I overlooking something massively obvious?
I'd ask on the aforge forum but the developer is not supporting it anymore.
"As far as I can tell I'm supposed to decipher the bizarre formula contained in "population.BestChromosome" and use that to extrapolate future values, is that right?"
What you call a "bizarre formula" is called a model in data analysis. You learn such a model from past data and you can feed it new data to get a predicted outcome. Whether that new outcome makes sense or is just garbage depends on how general your model is. Many techniques can learn models that explain the observed data very well, but which do not generalize and will return useless results when you feed new data into the model. You need to find a model that explains both the given data and potentially unobserved data, which is a non-trivial process. Usually people estimate the generalization error of a model by splitting the known data into two partitions: one on which the model is learned and another on which the learned models are tested. You then want to select the model that is accurate on both partitions. You can also check out the answer I gave on another question here, which also treats the topic of machine learning: https://stackoverflow.com/a/3764893/189767
I don't think you're "overlooking something massively obvious", but rather you're faced with a problem that is not trivial to solve.
Btw, you can also use genetic programming (GP) in HeuristicLab. The model of GP is a mathematical formula and in HeuristicLab you can export that model to e.g. MatLab.
Regarding Fibonacci: the closed formula for Fibonacci numbers is F(n) = (phi^n - psi^n) / sqrt(5), where, according to Wikipedia, phi = (1 + sqrt(5)) / 2 (the golden ratio) and psi = (1 - sqrt(5)) / 2. If you want to find that with GP you need one variable (n), three constants, and the power function. However, it's very likely that you will find a vastly different formula that is similar in output. The problem in machine learning is that very different models can produce the same output. The recursive form requires that you include the values of the past two n in the data set. This is similar to learning a model for a time series regression problem.
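Both forms can be sketched in Python. `recursive_model` below is a stand-in for whatever formula the GA or GP evolves, and `extrapolate` is a hypothetical helper, not part of AForge. It shows the general idea of predicting beyond the known data: feed the last window into the model, append the prediction, and slide the window forward.

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2  # golden ratio
PSI = (1 - sqrt(5)) / 2


def fib_closed_form(n):
    """Closed formula; rounding removes the floating-point error for small n."""
    return round((PHI ** n - PSI ** n) / sqrt(5))


def extrapolate(series, model, steps, window=5):
    """Predict `steps` new values, feeding each prediction back into the window."""
    values = list(series)
    for _ in range(steps):
        values.append(model(values[-window:]))
    return values[len(series):]


# Stand-in for an evolved formula: the recursive form uses the past two values
recursive_model = lambda w: w[-1] + w[-2]

known = [fib_closed_form(n) for n in range(21)]  # F(0) .. F(20)
future = extrapolate(known, recursive_model, steps=3)
print(future)
```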
I am trying to use SVM for News article classification.
I created a table that contains the features (unique words found in the documents) as rows.
I created weight vectors that map onto these features, i.e. if the article contains a word that is part of the feature vector table, that position is marked as 1, otherwise 0.
Ex:- Training sample generated...
1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1
As this is the first document all the features are present.
I am using 1, 0 as class labels.
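For reference, here is a sketch in Python of how such a line can be produced from an article's words. The function name and the tiny vocabulary are made up. In the LIBSVM sparse format, features with value 0 are simply left out and indices appear in ascending order.

```python
def to_libsvm_line(label, article_words, vocabulary):
    """Encode one article as 'label index:1 ...' with 1-based feature indices."""
    present = set(article_words)
    pairs = [f"{i}:1" for i, word in enumerate(vocabulary, start=1)
             if word in present]
    return " ".join([str(label)] + pairs)


vocabulary = ["market", "goal", "election"]  # hypothetical feature table
print(to_libsvm_line(1, ["goal", "market", "goal"], vocabulary))
print(to_libsvm_line(-1, ["election"], vocabulary))
```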
I am using svm.Net for classification.
I gave 300 weight vectors manually classified as training data and the model generated is taking all the vectors as support vectors, which is surely overfitting.
My total features (unique words/row count in feature vector DB table) is 7610.
What could be the reason?
Because of this overfitting, my project is now in pretty bad shape. It is classifying every available article as a positive article.
In LibSVM binary classification is there any restriction on the class label?
I am using 0, 1 instead of -1 and +1. Is that a problem?
You need to do some type of parameter search. Also, if the classes are unbalanced, the classifier might get artificially high accuracies without doing much. This guide is good at teaching basic, practical things; you should probably read it.
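A parameter search can be as simple as a grid over C and the RBF gamma. In this Python sketch, `evaluate` is a hypothetical callback that you would implement yourself to train an SVM with the given parameters and return its cross-validation accuracy. The exponentially spaced grids are a common choice, not a requirement.

```python
def grid_search(evaluate, c_values, gamma_values):
    """Try every (C, gamma) pair and return (best_score, best_c, best_gamma)."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            score = evaluate(c, gamma)  # e.g. cross-validation accuracy
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best


# Exponentially spaced grids: C = 2^-5 .. 2^15, gamma = 2^-15 .. 2^3
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]
```

Score each pair on held-out data rather than on the training set. With unbalanced classes, plain accuracy can look good for a classifier that always predicts the majority class.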
As pointed out, a parameter search is probably a good idea before doing anything else.
I would also investigate the different kernels available to you. The fact that your input data is binary might be problematic for the RBF kernel (or might render its usage sub-optimal compared to another kernel). I have no idea which kernel would be better suited, though. Try a linear kernel, and look around for more suggestions/ideas :)
For more information and perhaps better answers, look on stats.stackexchange.com.
I would definitely try using -1 and +1 for your labels, that's the standard way to do it.
Also, how much data do you have? Since you're working in 7610-dimensional space, you could potentially have that many support vectors, where a different vector is "supporting" the hyperplane in each dimension.
With that many features, you might want to try some type of feature selection method, such as principal component analysis.