I am trying to build an application that takes in
an array of [time, pressure] pairs, about 200 of them.
I also have several constants, such as
- Viscosity
- density
- volume
- area
There would be about 3 outputs.
Would it be possible to use a neural network (either Encog or Accord.NET), feeding in
the time-pressure data and the constants together with the expected outputs,
so that the program can then estimate the outputs for
different time-pressure data and different constant values?
Every application in data mining is different, but a great place to start is Weka. It has a Java and C# API, and it's dead easy to apply different machine learning algorithms. Many researchers in my old research team have used it really successfully in the past.
Defining your features, using only discriminative features, and cleaning any noise from your feature set is the first place to start, as the algorithms will only work with a good feature set. The first step to good data mining is preprocessing the data.
I'm collecting FFT sample data from a hardware device with a .NET Core application. The data structure is a float array of 4096 values per sample. Each of these values represents a small frequency band and its amplitude. These float arrays come in regularly as the live measurement continues.
For analysis I need to be able to train a model on data captured during a learning period. When the learning is done, new samples must be compared and evaluated. If the sample (FFT frequency spectrum) looks similar to what's been observed during training, the result is good. If it looks considerably different, the result is bad and this should be indicated by an existing UI.
I only need the training and classification part. I've already read the ML.NET website and I guess I need the anomaly detection feature of ML.NET. It can be trained with good data only and should detect bad data automatically without having seen it in advance. Unfortunately, all the examples I could find use samples consisting of a single number. None of them use a whole array of numbers as data, so I have no idea how to accomplish that. Is ML.NET even capable of this, and is it a suitable tool for the task? I don't have the time or business competence to thoroughly study and understand the examples provided on the ML.NET website. I only have this float[4096] data set and would like to use it for my first practical steps into ML.
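To make the "train on good data only, flag what looks different" idea concrete, here is a minimal sketch in Python rather than ML.NET. It is not the ML.NET API and not necessarily what an ML.NET anomaly detector does internally; it is a plain statistical baseline that learns a per-bin mean and standard deviation from good spectra and scores new spectra by how far they deviate. The function names, the 4-bin toy spectra and the safety margin of 1.5 are all assumptions made for illustration; a real spectrum would have 4096 bins.

```python
import math

def fit_baseline(samples):
    """Per-bin mean and standard deviation over the (good) training spectra."""
    n = len(samples)
    bins = len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(bins)]
    stds = []
    for i in range(bins):
        var = sum((s[i] - means[i]) ** 2 for s in samples) / n
        stds.append(math.sqrt(var) or 1e-9)  # guard against division by zero
    return means, stds

def anomaly_score(sample, means, stds):
    """Root-mean-square z-score of one spectrum across all bins."""
    total = sum(((x - m) / s) ** 2 for x, m, s in zip(sample, means, stds))
    return math.sqrt(total / len(sample))

def fit_threshold(samples, means, stds, margin=1.5):
    """Worst score seen during training times a safety margin (an assumption)."""
    return margin * max(anomaly_score(s, means, stds) for s in samples)

# Toy data: four "good" spectra with 4 bins each (hypothetical values).
good = [
    [1.0, 5.0, 1.0, 0.5],
    [1.1, 4.9, 0.9, 0.5],
    [0.9, 5.1, 1.1, 0.6],
    [1.0, 5.0, 1.0, 0.4],
]
means, stds = fit_baseline(good)
limit = fit_threshold(good, means, stds)

similar = [1.05, 5.0, 1.0, 0.5]   # close to the training spectra
shifted = [1.0, 1.0, 5.0, 0.5]    # the peak has moved to another bin
print(anomaly_score(similar, means, stds) > limit)
print(anomaly_score(shifted, means, stds) > limit)
```

The array is simply treated as one feature vector, so nothing in the approach depends on the input being a single number. How the threshold is chosen is the part that needs domain judgement.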
I am trying to use an MCS (multi-classifier system) to get better results from limited data, i.e. to become more accurate.
I am using K-means clustering at the moment, but may choose to go with FCM (fuzzy c-means). The data is clustered into groups (clusters), and it could represent anything, colours for example. I first cluster the data after pre-processing and normalization and get some distinct clusters with a lot in between. I then use the clusters as the data for a Bayes classifier: each cluster represents a distinct colour, and the data from each cluster is put through a separate Bayes classifier, so each Bayes classifier is trained on only one colour. Take a colour spectrum where 3 - 10 is blue and 13 - 20 is red; in 0 - 3 the colour is white up to 1.5 and then gradually turns blue through 1.5 - 3, and the same goes for the transition from blue to red.
What I would like to know is what kind of aggregation method (if that is what you would use) could be applied so that the Bayes classifier becomes stronger, and how it works. Does the aggregation method already know the answer, or would human interaction correct the outputs, with those answers then going back into the Bayes training data? Or a combination of both?
Bootstrap aggregating (bagging) involves having each model in the ensemble vote with equal weight, so I am not sure whether I would use bagging as my aggregation method in this particular instance.
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. I am not sure whether this would be a better alternative to bagging, as I'm unsure how it incrementally builds upon new instances.
The last option is Bayesian model averaging, an ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law. However, I am completely unsure how you would sample hypotheses from the hypothesis space.
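For reference, the equal-weight voting in bagging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base learner is a toy nearest-centroid classifier on 1-D values (not a Bayes classifier), the labelled colour values are made up to match the spectrum described above, and all function names are hypothetical. The aggregation itself needs no knowledge of the right answer; it only needs labelled training data to resample from.

```python
import random
from collections import Counter

def train_centroids(samples):
    """Toy base learner: the mean value per label for 1-D data."""
    sums, counts = {}, Counter()
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, value):
    """Pick the label whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

def bagging_fit(samples, n_models=25, seed=0):
    """Train each model on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(samples) for _ in samples]
        models.append(train_centroids(boot))
    return models

def bagging_predict(models, value):
    """Every model gets one vote of equal weight; the majority wins."""
    votes = Counter(predict(m, value) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical labelled data following the colour ranges in the question.
data = [(0.2, "white"), (0.8, "white"), (1.3, "white"),
        (3.5, "blue"), (6.0, "blue"), (9.0, "blue"),
        (14.0, "red"), (17.0, "red"), (19.5, "red")]
models = bagging_fit(data)
print(bagging_predict(models, 6.5))
print(bagging_predict(models, 18.0))
```

Boosting differs in that the models are trained one after another and the resampling weights are shifted towards the examples the previous models got wrong, so it also relies on labelled training data rather than on a human correcting outputs.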
I know that usually you would use a competitive approach to bounce between the two classification algorithms: one says yes, one says maybe, a weighting is applied, and if it is correct you get the best of both classifiers. For my purposes, though, I don't want a competitive approach.
Another question: would using these two methods together in this way be beneficial? I know the example I provided is very primitive and the benefit may not show there, but could it be beneficial with more complex data?
I have some concerns about the method you are following:
K-means puts each point into the cluster whose centre is nearest to it. You then train a classifier on the output data. I think the classifier may outperform the implicit classification done by the clustering, but only by taking into account the number of samples in each cluster. For example, if your training data after clustering is typeA (60%), typeB (20%), typeC (20%), your classifier will prefer to assign ambiguous samples to typeA in order to obtain a lower classification error.
K-means depends on which "coordinates"/"features" you take from the objects. If you use features in which objects of different types are mixed, K-means performance will decrease. Deleting these kinds of features from the feature vector may improve your results.
The "features"/"coordinates" that represent the objects you want to classify may be measured in different units. This can affect your clustering algorithm, since you are implicitly setting a unit conversion between them through the clustering error function. The final set of clusters is selected from multiple clustering trials (obtained from different cluster initializations) using an error function. Thus, an implicit comparison is made across the different coordinates of your feature vector (potentially introducing the implicit conversion factor).
Taking these three points into account, you will probably increase the overall performance of your algorithm by adding preprocessing stages. For example, in object recognition for computer vision applications, most of the information taken from the images comes only from the borders in the image. All the color information and part of the texture information are not used. The borders are extracted by processing the image to obtain Histogram of Oriented Gradients (HOG) descriptors. This descriptor gives back "features"/"coordinates" that separate the objects better, thus increasing classification (object recognition) performance. In theory, descriptors throw away information contained in the image. However, they have two main advantages: (a) the classifier deals with lower-dimensional data, and (b) descriptors calculated from test data can be matched more easily with training data.
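The unit problem is easy to see with a small Python sketch. It assumes Euclidean distance (the usual choice for K-means) and uses made-up objects described by a length in millimetres and a weight in kilograms; the function names are hypothetical. Standardizing each feature to zero mean and unit variance is one common way to remove the implicit conversion factor, though not the only one.

```python
import math

def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects: [length in mm, weight in kg].
rows = [[1000.0, 1.0], [1010.0, 9.0], [2000.0, 1.2]]
scaled = standardize(rows)

# Raw distances are dominated by the millimetre column.
print(euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2]))
# After scaling, a large weight difference counts as much as a large length difference.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw values, object 1 looks almost identical to object 0 even though it is nine times heavier, simply because the weight is expressed in a unit with small numbers.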
In your case, I suggest that you try to improve your accuracy taking a similar approach:
Give richer features to your clustering algorithm
Take advantage of prior knowledge in the field to decide what features you should add and delete from your feature vector
Always consider the possibility of obtaining labeled data, so that supervised learning algorithms can be applied
I hope this helps...
Hi, I need some help getting started with creating my first algorithm. I want to create a NN/genetic algorithm for use as an intrusion detection system.
But I’m struggling with some points (never written an algorithm before.)
1. I want to develop in C#. Would it be possible as a console app? If so, as a precursor, roughly how big would the programme be in its most simplistic form? Is it even possible in C#?
2. How do I connect the programme to read in data from the network? Also, how can packets be converted to readable data for the algorithm?
3. How do I get the programme to write rules for Snort or some other form of firewall and block what the programme deems a potential threat? (i.e. it spots a threat from No. 2, then it writes a rule into the Snort rules page blocking that specific traffic)
4. How do I track the data? (what it has blocked, what it is observing, how it came to that conclusion)
5. Where do I place it on the network? (can the programme connect to other algorithms and share data on the same network, and would that be beneficial?)
If anyone can help start me off in the right direction, or explain what other alternatives there are, like fuzzy logic etc., and why it is deemed a black box, that would be great.
Yes, a console app and C# can be used to create a neural network. Of course, if you want more visual aspects to the UI, you'll want to use WinForms/WPF/Silverlight etc. It's impossible to tell how big the program will be, as there's not enough information on what you want to do. Also, the size shouldn't really be a problem as long as it's efficient.
I assume this is some sort of final year project? What type of Neural Network are you using? You should read some academic papers /whitepapers on using NN with intrusion detection to get an idea. For example, this PDF has some information that might help.
You should take this one step at a time. Creating a Neural Network is separate from creating a new rule in Snort. Work on one topic at a time otherwise you'll just get overwhelmed. Considering the hard part will most likely be the NN, you should focus on that first.
It's unlikely anyone's going to go through each step with you as it's quite a large project. Show what you've done and explain where you need help.
My core realization when I started learning about neural networks was that they are just function approximators. I think that's a crucial thing to keep in mind. Whether you're using genetic algorithms or neural nets (or combining them as mentioned by @Ben Voigt, even though neural networks are typically associated with other training techniques), what you get in the end is a function where you put in a number of real values and get out a single value.
Keeping this in mind, you can design your program and, for the testing part, just think of the network as a black box providing those predictions. During training, think of another black box where you put in pairs of inputs and outputs, and assume it's going to get better the more pairs you show it.
Maybe you find this trivial, but with all the theory and mystique associated with this type of algorithm, I found it reassuring (though a bit disappointing ;) to reduce them to those kinds of boxes.
I am looking for some kind of intelligent library (I was thinking AI or a neural network) that I can feed a list of historical data and that will predict the next sequence of outputs.
As an example I would like to feed the library the following figures 1,2,3,4,5
and based on this, it should predict the next sequence is 6,7,8,9,10 etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is, using historical sales data, predict what amount a specific client is most likely going to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I merely need to base it on the sales history and then plot a graph showing past sales and predicted sales.
If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Run multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power and if you combine the right ones, then you might really get something useful.
If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
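If you are willing to assume the function is linear, the fit and the extrapolation take only a few lines. The sketch below is in Python for brevity and uses ordinary least squares; `fit_line` is a hypothetical helper name, and the linearity assumption is exactly the kind of prior knowledge discussed above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The known values f(0) = 1 ... f(4) = 5 from the example.
xs = [0, 1, 2, 3, 4]
ys = [1, 2, 3, 4, 5]
slope, intercept = fit_line(xs, ys)

# Extrapolate one step ahead under the linear assumption.
print(slope * 5 + intercept)  # 6.0
```

With real sales data the points will not lie on a line, and the same code then gives the best straight-line trend rather than an exact match.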
I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules, and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not feed it additional information.
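To back that claim up, here is a small Python sketch using Lagrange interpolation: it builds the unique polynomial of degree at most 5 through the five known values plus one arbitrary "next" value. Placing the sequence at positions 0 to 4 and choosing 42 as the next value are assumptions made for the demonstration; `lagrange` is a hypothetical helper name.

```python
from fractions import Fraction

def lagrange(points):
    """Return a function that evaluates the unique polynomial through the points."""
    pts = [(Fraction(x), Fraction(y)) for x, y in points]

    def f(x):
        x = Fraction(x)
        total = Fraction(0)
        for i, (xi, yi) in enumerate(pts):
            term = yi
            for j, (xj, _) in enumerate(pts):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return f

# 1, 2, 3, 4, 5 followed by whatever we like, here 42.
f = lagrange([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 42)])
print([int(f(x)) for x in range(6)])  # [1, 2, 3, 4, 5, 42]
```

Exact fractions are used so that the polynomial reproduces the inputs without rounding error. Replace 42 with any other number and the same construction still fits the first five values perfectly.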
Edit:
In the case that you suspect your data to have a known functional dependence, and there are uncontrollable outside factors, maybe regression analysis will have good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)
I found this very cool C++ sample, literally the "Hello World!" of genetic algorithms.
So I decided to re-code the whole thing in C#, and this is the result.
Now I am asking myself: is there any practical application along the lines of generating a target string starting from a population of random strings?
EDIT: my buddy on twitter just tweeted that "is useful for transcription type things such as translation. Does not have to be Monkey's". I wish I had a clue.
Is there any practical application along the lines of generating a target string starting from a population of random strings?
Sure. Imagine any scenario in which you know how to evaluate the fitness of a particular string, and in which the choices are discrete and constrained in some way:
Picking pronounceable names ("Xhjkxc" has low fitness; "Artekzo" has high fitness)
Trying out a series of chess moves
Guessing the combination to a safe, assuming you can tell how close you are to unlocking each tumbler
Picking phone numbers that evaluate to words (e.g. "843-2378" has high fitness because it spells "THE-BEST")
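The common thread in these examples is that only the fitness function changes. Here is a minimal Python sketch of a string-evolving GA with a pluggable fitness function. It is not the C++ or C# sample mentioned in the question; the operators (tournament selection, single-point crossover, per-character mutation, elitism), the parameter values and all names are assumptions chosen for illustration. It is demonstrated with the "match a target" fitness, but any function that scores a string could be passed in instead.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def match_fitness(target):
    """Fitness = number of positions that agree with the target string."""
    return lambda s: sum(a == b for a, b in zip(s, target))

def evolve(fitness, length, max_fitness, pop_size=100, generations=1000,
           mutation_rate=0.05, seed=0):
    """Evolve strings of the given length (at least 2) towards high fitness."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best = population[0]
        history.append(fitness(best))
        if history[-1] >= max_fitness:
            break
        children = [best]  # elitism: the best string survives unchanged
        while len(children) < pop_size:
            # Tournament selection: the fitter of three random individuals.
            a = max(rng.sample(population, 3), key=fitness)
            b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, length)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = "".join(
                rng.choice(ALPHABET) if rng.random() < mutation_rate else c
                for c in child)
            children.append(child)
        population = children
    return best, history

best, history = evolve(match_fitness("HELLO"), length=5, max_fitness=5)
print(best, len(history))
```

To score pronounceable names or phone numbers that spell words, you would swap `match_fitness` for a function that rates those properties; the evolution loop stays the same.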
No. Each time you run the GA, you are giving it the eventual answer. This is great for showing how a GA works and to show how powerful it can be, but it does not have any purpose beyond that.
You could write an EA that writes code in a dynamic language like IronPython with the goal of creating code that a) executes without crashing and b) analyzes the stock market and intelligently buys and sells stock.
That's a very simplistic take on what would be necessary, but it's possible. You would need a host that provides a lot of methods for the IronPython code (technical indicators, etc) and a database of ticks.
It would also be smart not to just generate any old random code, lest you format your own hard drive. You need a sandbox, you need to limit the namespaces that are accessible, and you need to provide a time limit to avoid infinite loops. You could also provide semantic guidelines that allow it to choose appropriate approved keywords instead of just stringing random letters together; this would greatly speed up evolution.
So, I was involved with a project that did everything but the EA. We had a satellite dish that got real-time stock ticks from the NASDAQ, a service for trading that had an API, and a primitive decision making "brain" that made decisions as the ticks came in.
Sadly, one of the partners flipped out, quit his job, forked the project (got his own dish, etc), and started trading with logic that wasn't ready. He lost a bunch of money. It turns out that for some people this type of project is only a step away from common gambling. But anyway, the project kind of fizzled out after that. Evolving the logic part is the missing link though. And I know there are people out there doing this type of thing.
I have used GA in 2 real life research problems.
One was a power optimization problem (maximize number of appliances turned on, meeting the available power constraint and service guarantee for each appliance)
Another was for radio network optimization, maximizing the coverage area given a fixed equipment budget
GA has one main disadvantage: it usually works at genetic speed, that is, slowly, so using it in serious time-dependent projects is quite risky.