Neural Network, Genetic Algorithm as an Intrusion Detection System - C#

Hi, I need some help getting started with creating my first algorithm: I want to build a neural network / genetic algorithm for use as an intrusion detection system. But I'm struggling with some points (I have never written an algorithm before):
1. I want to develop in C#. Would it be possible as a console app? If so, roughly how big would the program be in its most simplistic form? Is it even possible in C#?
2. How do I connect the program to read in data from the network, and how can packets be converted into readable data for the algorithm?
3. How do I get the program to write rules for Snort or some other form of firewall, and block what the program deems a potential threat? (i.e. it spots a threat from No. 2, then it writes a rule into the Snort rules file blocking that specific traffic.)
4. How do I track the data (what it has blocked, what it is observing, how it came to that conclusion)?
5. Where should it be placed on the network? (Can the program connect to other algorithms and share data on the same network, and would that be beneficial?)
If anyone can help start me off in the right direction, or explain what other alternatives there are (like fuzzy logic etc.) and why this is deemed a black box, I would appreciate it.

Yes, a console app in C# can be used to create a neural network. Of course, if you want more visual aspects to the UI, you'll want to use WinForms/WPF/Silverlight etc. It's impossible to tell how big the program will be, as there's not enough information on what you want to do. Also, the size shouldn't really be a problem as long as it's efficient.
I assume this is some sort of final year project? What type of neural network are you using? You should read some academic papers/whitepapers on using NNs for intrusion detection to get an idea. For example, this PDF has some information that might help.
You should take this one step at a time. Creating a neural network is separate from creating a new rule in Snort. Work on one topic at a time, otherwise you'll just get overwhelmed. Considering the hard part will most likely be the NN, you should focus on that first.
It's unlikely anyone is going to go through each step with you, as it's quite a large project. Show what you've done and explain where you need help.
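On point 3 of the question, the Snort side is comparatively easy: Snort reads plain-text rule files, so "writing a rule" mostly amounts to appending a line and having Snort reload its rules. A minimal sketch of that step (the file path, rule fields and SID counter are illustrative; check them against your own Snort setup):

using System.IO;

class SnortRuleWriter
{
    // local.rules is the conventional file for site-specific rules; adjust the path to your install.
    private const string RulesFile = @"C:\Snort\rules\local.rules";
    private static int nextSid = 1000001;   // SIDs of 1,000,000 and up are reserved for local rules

    // Append a rule that alerts on all traffic from one source IP.
    // In inline mode you could use "drop" instead of "alert" to actually block it.
    public static void BlockSource(string sourceIp)
    {
        string rule = string.Format(
            "alert ip {0} any -> any any (msg:\"Flagged by NN/GA IDS\"; sid:{1}; rev:1;)",
            sourceIp, nextSid++);
        File.AppendAllText(RulesFile, rule + System.Environment.NewLine);
        // Snort only re-reads its rules on a restart/reload, so trigger that separately.
    }
}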

My core realization when I started learning about neural networks was that they are just function approximators. I think that's a crucial thing to keep in mind. Whether you're using genetic algorithms or neural nets (or combining them, as mentioned by @Ben Voigt, even though neural networks are typically associated with other training techniques) - what you get in the end is a function where you put in a number of real values and get out a single value.
Keeping this in mind, you can design your program and just think of the network as a black box providing those predictions for the testing part. During training, think of another black box where you put in input/output pairs and assume it will get better the more pairs you show it.
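To make the black-box view concrete, here is a minimal sketch of such a function approximator in C#: a single hidden layer evaluated on a vector of inputs. The weights here are made-up placeholders; in practice they would come from whatever training method you pick (backpropagation, a genetic algorithm, ...).

using System;

class TinyNetwork
{
    // weightsHidden[j][i] connects input i to hidden unit j;
    // weightsOut[j] connects hidden unit j to the single output.
    private readonly double[][] weightsHidden;
    private readonly double[] weightsOut;

    public TinyNetwork(double[][] weightsHidden, double[] weightsOut)
    {
        this.weightsHidden = weightsHidden;
        this.weightsOut = weightsOut;
    }

    private static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

    // The whole "black box": several real values in, one real value out.
    public double Predict(double[] inputs)
    {
        double output = 0;
        for (int j = 0; j < weightsHidden.Length; j++)
        {
            double sum = 0;
            for (int i = 0; i < inputs.Length; i++)
                sum += weightsHidden[j][i] * inputs[i];
            output += weightsOut[j] * Sigmoid(sum);
        }
        return output;
    }

    static void Main()
    {
        // Two inputs (e.g. packet features), two hidden units, invented weights.
        var net = new TinyNetwork(
            new[] { new[] { 0.5, -0.2 }, new[] { -0.3, 0.8 } },
            new[] { 1.0, -1.0 });
        Console.WriteLine("score = {0:F3}", net.Predict(new[] { 0.7, 0.1 }));
    }
}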
Maybe you find this trivial, but with all the theory and mystique associated with this type of algorithm, I found it reassuring (though a bit disappointing ;)) to reduce them to those kinds of boxes.

Related

Location Based Product Recommendation Service Using Content Based Recommendations with ASP.Net & C#

I am in a bit of a crisis here. I would really appreciate your help on the matter.
My Final Year Project is a "Location Based Product Recommendation Service". Now, due to some communication gap, we got stuck with an extremely difficult algorithm. Here is how it went:
We had done some research on recommendation systems prior to the project defense. We knew there were two approaches, "Collaborative Filtering" and "Content-Based Recommendation", and we had planned on using whichever technique gave us the best results. So, in essence, we were more focused on the end product than on the actual process.
The HOD asked us which algorithms OUR product would use. But my group members thought he meant which algorithms are used for "Content-Based Recommendations" in general, so they answered with "Rule Mining, Classification and Clustering". He was astonished that we planned on using all these algorithms in our project, and told us that he would accept our project proposal only if we used his algorithm. He gave us his research paper, without any other resources such as data, simulations or samples. The algorithm is named "Context Based Positive and Negative Spatio-Temporal Association Rule Mining"; in the paper it was used to recommend sites for hydrocarbon taps and mining, with extremely accurate results. Now here are a few issues I face:
I am not sure how or IF this algorithm fits in our project scenario
I cannot find spatio-temporal data, market baskets, documentation, or indeed any other helpful resource
I tried asking the HOD for the data he used in the paper, as a reference. He was unable to provide it
I tried coding the algorithm myself, in an incremental fashion, but found I was completely out of my depth. I divided the algorithm into three phases: positive spatio-temporal association rule mining, negative spatio-temporal association rule mining, and context-based adjustments. Alas, the code I write is not mature enough; I couldn't even generate frequent itemsets properly. I understand the theory quite well, but I am not able to translate it into efficient code.
When the algorithm has been coded, I need to develop a web service. We also need a client website to access the web service. But with the code not even 10% done, I really am panicking. The project submission is in a fortnight.
Our supervisor is an expert in artificial intelligence, but he cannot guide us in the algorithm development. He stresses the importance of reuse and of utilizing open-source resources, but I am unable to find anything of actual use.
My group members are waiting on me to deliver the algorithm so they can deploy it as a web service. There are other adjustments that need to be done, but with the algorithm not available, there is nothing we can do.
I have found a data set of market baskets. It's a simple Excel file with about 9000 transactions. There is no spatial or temporal data in it, and I fear adding artificial data would compromise the integrity of the data set.
I would appreciate it if somebody could guide me. I guess the best approach would be to use an open-source API to partially implement the algorithm and then build the service and client application. We need to demonstrate something on the 17th of June. I am really looking forward to your help, guidance and constructive criticism. Some solutions that I have considered are:
Use "User Clustering" as a "Collaborate Filtering" technique. Then
recommend the products from similar users via an alternative "Rule
Mining" algorithm. I need all these algorithms to be openly available
either as source code or an API, if I have any chance of making this
project on time.
Drop the algorithm altogether and make a project that actually works
as we intended, using available resources. I am 60% certain that we
would fail or marked extremely low.
Pay a software house to develop the algorithm for us and then
over-fit it into our project. I am not inclined to do this because it
would be unethical to do this.
As you can clearly see, my situation is quite dire. I really do need extensive help and guidance if I am to complete this project properly and in time. The project needs to be completely deployed and operational. I really am in a loop here.
"Collaborative Filtering", "Content Based Recommendation", "Rule Mining, Classification and Clustering"
None of these are algorithms. They are tasks, or subtasks, for each of which several algorithms exist.
I think you got off to a bad start by not really knowing well enough what you were proposing... but granted, the advice from your advisor was also not at all helpful.
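Since you mention failing to generate frequent itemsets, here is a minimal sketch of plain Apriori in C# over toy market-basket data. The spatio-temporal and context extensions from the paper are not shown, and the candidate-generation step is the naive quadratic join, so treat this as a starting point only.

using System;
using System.Collections.Generic;
using System.Linq;

class Apriori
{
    // Count how many transactions contain every item of the candidate set.
    static int Support(List<HashSet<string>> transactions, HashSet<string> candidate)
        => transactions.Count(t => candidate.IsSubsetOf(t));

    static void Main()
    {
        // Toy data; replace with the rows from your Excel file.
        var transactions = new List<HashSet<string>>
        {
            new HashSet<string> { "bread", "milk" },
            new HashSet<string> { "bread", "diapers", "beer" },
            new HashSet<string> { "milk", "diapers", "beer" },
            new HashSet<string> { "bread", "milk", "diapers", "beer" },
        };
        int minSupport = 2;

        // Pass 1: frequent 1-itemsets.
        var frequent = transactions
            .SelectMany(t => t)
            .GroupBy(item => item)
            .Where(g => g.Count() >= minSupport)
            .Select(g => new HashSet<string> { g.Key })
            .ToList();

        // Pass k: join frequent (k-1)-itemsets into k-itemset candidates, then prune by support.
        while (frequent.Count > 0)
        {
            foreach (var set in frequent)
                Console.WriteLine("{{{0}}} support={1}", string.Join(",", set), Support(transactions, set));

            var candidates = new List<HashSet<string>>();
            for (int i = 0; i < frequent.Count; i++)
                for (int j = i + 1; j < frequent.Count; j++)
                {
                    var union = new HashSet<string>(frequent[i]);
                    union.UnionWith(frequent[j]);
                    // Two k-itemsets whose union has k+1 items differ in exactly one element.
                    if (union.Count == frequent[i].Count + 1 &&
                        !candidates.Any(c => c.SetEquals(union)))
                        candidates.Add(union);
                }
            frequent = candidates.Where(c => Support(transactions, c) >= minSupport).ToList();
        }
    }
}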

Need an audio analysis library to create real time feedback from audio file?

Real-time is not necessarily required; however, I am creating a game for my final year project and I wish to use the power of audio to create dynamic levels based solely on a music track that is playing. I aim to create this game for the PS Vita using PlayStation Mobile and C#, but if I want, I can switch to C++ and the PSP.
I can use a WAV file and hopefully extract the amplitude of the waveform, as well as calculate other characteristics like average frequency and approximate BPM from this data, to create a level.
I have no qualms about trying to work with this raw data, I just want to know a way I can actually GET that information first. If I can extract the samples and ascertain different characteristics of them, I can store them and work out changes in loudness, pitch and more to create notes etc.
I am using C#, but if at all possible I can either use P/Invoke or switch my project to another device that uses C++ instead of C#.
I'm panicking a bit here, cos I really am a bit stumped.
Many thanks guys.
Unfortunately I don't think you'll be able to use C# to do this - AFAIK, there is no JIT compiler for it on that platform. I remember reading about something for Mono which would make it available to use with C#, but I'm not sure right now.
That said - I would go with C++. If you go that way, you can make use of a vast number of audio analysis libraries, like CLAM (http://clam-project.org/).
Don't panic (imagine big, friendly letters.) Envision the necessary parts for the project step by step, tackle one by one, and you'll be done in no time. =)
The problem you describe here is one of music/audio feature extraction and a substantial body of academic work exists that you can draw on. Another useful term of art with which to search is Music Information Retrieval (MIR).
The list of 'features' that researchers have attempted to retrieve from recordings is large and varied, from deterministic things such as pitch and key through emotional characteristics, such as 'energy'.
Most of these turn out to be more difficult than you might imagine, and typically only about 60-70% accurate - although for your requirements, this is probably adequate.
A good entry point might be to download Sonic Visualiser, for which a large number of open-source feature extraction plug-ins exist. You'll at least get a feel for what's possible.
Update: Another useful term of art is Onset detection - this is typically used to describe beat detection algorithms.
Aubio is a C/C++ library that does pitch tracking, onset detection and bpm tracking, among other things.
As for "extracting the amplitude of the waveform", the waveform is amplitude, i.e., you could just pick the audio sample with the greatest absolute value every n samples and use that value to do the "amplitude" part of the visualization.
Here's some code that might help you get started reading WAVE data in C#.
Here's some information about writing a C# wrapper for the FFTW library.

XXX image recognition software/algorithm [duplicate]

Akismet does an amazing job at detecting spam comments. But comments are not the only form of spam these days. What if I wanted something like Akismet to automatically detect porn images on a social networking site which allows users to upload their pics, avatars, etc.?
There are already a few image-based search engines as well as face recognition stuff available, so I am assuming it wouldn't be rocket science and it could be done. However, I have no clue how that stuff works or how I should go about it if I want to develop it from scratch.
How should I get started?
Is there any open source project for this going on?
This is actually reasonably easy. You can programmatically detect skin tones - and porn images tend to have a lot of skin. This will create false positives, but if this is a problem you can pass the detected images through actual moderation. This not only greatly reduces the work for the moderators but also gives you lots of free porn. It's win-win.
import os, glob
from PIL import Image

def get_skin_ratio(im):
    # Crop away a 20% border so the measurement focuses on the centre of the image.
    im = im.crop((int(im.size[0] * 0.2), int(im.size[1] * 0.2),
                  im.size[0] - int(im.size[0] * 0.2), im.size[1] - int(im.size[1] * 0.2)))
    # Count pixels whose RGB values fall inside a rough "skin tone" envelope
    # (assumes the image is already in RGB mode, as loaded JPEGs are).
    skin = sum(count for count, rgb in im.getcolors(im.size[0] * im.size[1])
               if rgb[0] > 60
               and rgb[0] * 0.4 < rgb[1] < rgb[0] * 0.85
               and rgb[0] * 0.2 < rgb[2] < rgb[0] * 0.7)
    return float(skin) / float(im.size[0] * im.size[1])

for image_dir in ('porn', 'clean'):
    for image_file in glob.glob(os.path.join(image_dir, "*.jpg")):
        skin_percent = get_skin_ratio(Image.open(image_file)) * 100
        if skin_percent > 30:
            print("PORN {0} has {1:.0f}% skin".format(image_file, skin_percent))
        else:
            print("CLEAN {0} has {1:.0f}% skin".format(image_file, skin_percent))
This code measures skin tones in the center of the image. I've tested it on 20 relatively tame "porn" images and 20 completely innocent images. It flags 100% of the "porn" and 4 of the 20 clean images. That's a pretty high false positive rate, but the script aims to be fairly cautious and could be further tuned. It works on light, dark and Asian skin tones.
Its main weaknesses with false positives are brown objects like sand and wood, and of course it doesn't know the difference between "naughty" and "nice" flesh (like face shots).
Weaknesses with false negatives would be images without much exposed flesh (like leather bondage), painted or tattooed skin, B&W images, etc.
source code and sample images
This was written in 2000; I'm not sure if the state of the art in porn detection has advanced at all, but I doubt it.
http://www.dansdata.com/pornsweeper.htm
PORNsweeper seems to have some ability to distinguish pictures of people from pictures of things that aren't people, as long as the pictures are in colour. It is less successful at distinguishing dirty pictures of people from clean ones.
With the default, medium sensitivity, if Human Resources sends around a picture of the new chap in Accounts, you've got about a 50% chance of getting it. If your sister sends you a picture of her six-month-old, it's similarly likely to be detained.
It's only fair to point out amusing errors, like calling the Mona Lisa porn, if they're representative of the behaviour of the software. If the makers admit that their algorithmic image recogniser will drop the ball 15% of the time, then making fun of it when it does exactly that is silly.
But PORNsweeper only seems to live up to its stated specifications in one department - detection of actual porn. It's half-way decent at detecting porn, but it's bad at detecting clean pictures. And I wouldn't be surprised if no major leaps were made in this area in the near future.
I would rather allow users to report bad images. Image recognition development would take too much effort and time, and won't be as accurate as human eyes. It's much cheaper to outsource that moderation job.
Take a look at: Amazon Mechanical Turk
"The Amazon Mechanical Turk (MTurk) is one of the suite of Amazon Web Services, a crowdsourcing marketplace that enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do."
Bag-of-Visual-Words Models for Adult Image Classification and Filtering
What is the best way to programatically detect porn images?
A Brief Survey of Porn-Detection/Porn-Removal Software
Detection of Pornographic Digital Images (2011!)
BOOM! Here is the whitepaper containing the algorithm.
Does anyone know where to get the source code for a Java (or any language) implementation? That would rock.
One algorithm called WISE has a 98% accuracy rate but a 14% false positive rate. So what you do is let the users flag the 2% false negatives, ideally with automatic removal if a certain number of users flag an image, and have moderators view the 14% false positives.
Nude.js, based on the whitepaper by Rigan Ap-apid from De La Salle University.
There is software that detects the probability of porn, but this is not an exact science, as computers can't recognize what is actually in pictures (a picture is only a big set of values on a grid with no meaning). You can only teach the computer what is porn and what is not by giving it examples. This has the disadvantage that it will only recognize those or similar images.
Given the repetitive nature of porn, you have a good chance if you train the system with few false positives. For example, if you train the system with nude people, it may flag pictures of a beach with "almost" naked people as porn too.
A similar piece of software is the Facebook face-recognition software that recently came out. It's just specialized for faces. The main principle is the same.
Technically, you would implement some kind of feature detector that utilizes Bayesian filtering. The feature detector may look for features like the percentage of flesh-colored pixels, if it's a simple detector, or just compute the similarity of the current image with a set of saved porn images.
This is of course not limited to porn, it's actually more a corner case. I think more common are systems that try to find other things in images ;-)
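As a rough sketch of that Bayes-filter idea with a single feature - say, the skin percentage produced by the script earlier in this thread - where the class statistics and prior are invented purely for illustration:

using System;

class SkinBayesClassifier
{
    // Gaussian likelihood of a feature value under one class's distribution.
    static double Likelihood(double x, double mean, double stdDev)
        => Math.Exp(-Math.Pow(x - mean, 2) / (2 * stdDev * stdDev)) / (stdDev * Math.Sqrt(2 * Math.PI));

    static void Main()
    {
        // Invented statistics: skin percentage tends to be high for porn, low for clean images.
        double pornMean = 55, pornStd = 15, cleanMean = 15, cleanStd = 12;
        double priorPorn = 0.1;   // assume 10% of uploads are porn

        double skinPercent = 42;  // feature measured from a new upload
        double pPorn = Likelihood(skinPercent, pornMean, pornStd) * priorPorn;
        double pClean = Likelihood(skinPercent, cleanMean, cleanStd) * (1 - priorPorn);

        // Posterior probability via Bayes' rule; flag for human moderation above a threshold.
        double posterior = pPorn / (pPorn + pClean);
        Console.WriteLine("P(porn | skin={0}%) = {1:F2}", skinPercent, posterior);
    }
}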
The answer is really easy: it's pretty safe to say that it won't be possible in the next two decades. Before that, we will probably get good translation tools. The last time I checked, the AI guys were struggling to identify the same car in two photographs shot from a slightly different angle. Take a look at how long it took them to get good enough OCR or speech recognition together. Those are recognition problems which can benefit greatly from dictionaries, and they are still far from having completely reliable solutions despite the multi-million man-months thrown at them.
That being said, you could simply add an "offensive?" link next to user-generated content and have a mod cross-check the incoming complaints.
edit:
I forgot something: if you are going to implement some kind of filter, you will need a reliable one. If your solution were 50% right, 2000 out of 4000 users with decent images would get blocked. Expect an outrage.
A graduate student from National Cheng Kung University in Taiwan did research on this subject in 2004. He was able to achieve a success rate of 89.79% in detecting nude pictures downloaded from the Internet. Here is the link to his thesis: The Study on Naked People Image Detection Based on Skin Color. It's in Chinese, so you may need a translator in case you can't read it.
short answer: use a moderator ;)
Long answer: I don't think there's a project for this, because what is porn? Only legs, full nudity, midgets, etc.? It's subjective.
Add an "offensive" link and store the MD5 (or other hash) of the offending image so that it can be automatically tagged in the future.
How cool would it be if somebody had a large public database of image MD5s along with descriptive tags, running as a web service? A lot of porn isn't original work (in that the person who has it now probably didn't make it), and the popular images tend to float around different places, so this could really make a difference.
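A minimal sketch of the hash idea in C#; the in-memory set stands in for whatever database a real site would use, and note that a plain MD5 only catches byte-identical re-uploads (any resize or re-encode defeats it):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

class ImageBlocklist
{
    // Hashes of images a moderator has already flagged.
    private readonly HashSet<string> blockedHashes = new HashSet<string>();

    private static string Md5Of(byte[] imageBytes)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(imageBytes));
    }

    public void Flag(byte[] imageBytes) => blockedHashes.Add(Md5Of(imageBytes));

    // True if this exact file was flagged before, so it can be auto-tagged on re-upload.
    public bool IsBlocked(byte[] imageBytes) => blockedHashes.Contains(Md5Of(imageBytes));
}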
If you really have time and money:
One way of doing it is 1) writing an image detection algorithm to find whether an object is human or not. This can be done by bitmasking an image to retrieve its "contours" and seeing if the contours fit a human contour.
2) Data-mining a lot of porn images and using data mining techniques such as the C4.5 algorithm or Particle Swarm Optimization to learn to detect patterns that match porn images.
This will require that you identify how the contours of a naked human body must look in digitized format (this can be achieved in the same way OCR image recognition algorithms work).
Hope you have fun! :-)
Seems to me like the main obstacle is defining a "porn image". If you can define it easily, you could probably write something that would work. But even humans can't agree on what is porn. How will the application know? User moderation is probably your best bet.
I've seen a web filtering application which does porn image filtering, sorry I can't remember the name. It was pretty prone to false positives, but most of the time it worked.
I think the main trick is detecting "too much skin in the picture" :)
Detecting porn images is still a definite AI task, and very much theoretical as yet.
Harvest collective power and human intelligence by adding a "Report spam/abuse" button or link. Or employ several moderators to do this job.
P.S. I'm really surprised how many people ask questions assuming software and algorithms are almighty, without even thinking about whether what they want could be done. Are they representatives of that new breed of programmers who have no understanding of hardware, low-level programming and all that "magic behind"?
P.S. #2. I also remember that periodically, cases where people themselves cannot decide whether a picture is porn or art are taken to court. Even after the court rules, chances are half the people will consider the decision wrong. The last silly situation of this kind was quite recent, when a Wikipedia page was banned in the UK because of a CD cover image that features some nakedness.
Two options I can think of (though neither of them programmatically detects porn):
Block all uploaded images until one of your administrators has looked at them. There's no reason why this should take a long time: you could write some software that shows 10 images a second, almost as a movie - even at this speed, it's easy for a human being to spot a potentially pornographic image. Then you rewind in this software and have a closer look.
Add the usual "flag this image as inappropriate" option.
The BrightCloud web service API is perfect for this. It's a REST API for doing website lookups just like this. It contains a very large and very accurate web filtering DB and one of the categories, Adult, has over 10M porn sites identified!
I've heard about tools which use a very simple but quite effective algorithm. The algorithm calculates the relative amount of pixels with color values near predefined "skin" colors. If that amount is higher than some predefined value, the image is considered to be of erotic/pornographic content. Of course the algorithm will give false positives for close-up face photos and many other things.
Since you are writing about social networking, there will be lots of "normal" photos with a high amount of skin color in them, so you shouldn't use this algorithm to deny all pictures with a positive result. But you can use it to provide some help for moderators, for example by flagging these pictures with higher priority, so that a moderator who wants to check some new pictures for pornographic content can start from these.
This one looks promising. Basically they detect skin (with calibration by recognizing faces) and determine "skin paths" (i.e. measuring the proportion of skin pixels vs. face skin pixels / skin pixels). This has decent performance.
http://www.prip.tuwien.ac.at/people/julian/skin-detection
Look at the file name and any attributes. There's not nearly enough information there to detect even 20% of naughty images, but a simple keyword blacklist would at least detect images with descriptive labels or metadata. Twenty minutes of coding for a 20% success rate isn't a bad deal, especially as a prescreen that can at least catch some simple ones before you pass the rest to a moderator for judging.
The other useful trick is of course the opposite: maintain a whitelist of image sources to allow without moderation or checking. If most of your images come from known safe uploaders or sources, you can just accept them blindly.
I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description ["hard-core pornography"]; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.
— United States Supreme Court Justice Potter Stewart, 1964
You can find many whitepapers on the net dealing with this subject.
It is not rocket science. Not anymore. It is very similar to face recognition. I think that the easiest way to deal with it is to use machine learning, and since we are dealing with images, I can point you towards neural networks, because these seem to be preferred for images. You will need training data, and you can find tons of it on the internet, but you have to crop the images to the specific parts that you want the algorithm to detect. Of course you will have to break the problem into different body parts that you want to detect and create training data for each, and this is where things become amusing.
Like someone above said, it cannot be done 100%. There will be cases where such algorithms fail. The actual precision will be determined by your training data, the structure of your neural networks and how you choose to cluster the training data (penises, vaginas, breasts, etc., and combinations of such). In any case, I am very confident that this can be achieved with high accuracy for explicit porn imagery.
This is a nudity detector. I haven't tried it. It's the only OSS one I could find.
https://code.google.com/p/nudetech
There is no way you could do this 100% (I would say maybe 1-5% would be plausible) with today's knowledge. You would get much better results (than those 1-5%) just by checking the image names for sex-related words :).
@SO Troll: So true.

.NET Neural Network or AI for Future Predictions

I am looking for some kind of intelligent library (I was thinking AI or a neural network) that I can feed a list of historical data, and that will predict the next sequence of outputs.
As an example, I would like to feed the library the figures 1,2,3,4,5 and, based on this, it should predict that the next sequence is 6,7,8,9,10 etc.
The inputs will be a lot more complex and contain much more information.
This will be used in a C# application.
If you have any recommendations or warnings, that would be great.
Thanks
EDIT
What I am trying to do is, using historical sales data, predict what amount a specific client is most likely going to spend in the next period.
I do understand that there are dozens of external factors that can influence a client's purchases, but for now I need to merely base it on the sales history and then plot a graph showing past sales and predicted sales.
If you're looking for a .NET API, then I would recommend you try AForge.NET http://code.google.com/p/aforge/
If you just want to try various machine learning algorithms on a data set that you have at your disposal, then I would recommend that you play around with Weka; it's (relatively) easy to use and it implements a lot of ML/AI algorithms. Run multiple runs with different settings for each algorithm and try as many algorithms as you can. Most of them will have some predictive power and if you combine the right ones, then you might really get something useful.
If I understand your question correctly, you want to approximate and extrapolate an unknown function. In your example, you know the function values
f(0) = 1
f(1) = 2
f(2) = 3
f(3) = 4
f(4) = 5
A good approximation for these points would be f(x) = x+1, and that would yield f(5) = 6... as expected. The problem is, you can't solve this without knowledge about the function you want to extrapolate: Is it linear? Is it a polynomial? Is it smooth? Is it (approximately or exactly) cyclic? What is the range and domain of the function? The more you know about the function you want to extrapolate, the better your predictions will be.
I just have a warning, sorry. =)
Mathematically, there is no reason for your sequence above to be followed by a "6". I can easily give you a simple function whose next value is any value you like. It's just that humans like simple rules, and therefore tend to see a connection in these sequences that in reality is not there. Therefore, this is an impossible task for a computer if you do not feed it additional information.
Edit:
In the case that you suspect your data to have a known functional dependence, and there are uncontrollable outside factors, maybe regression analysis will have good results. To start easy, look at linear regression first.
If you cannot assume linear dependence, there is a nice application that looks for functions fitting your historical data... I'll update this post with its name as soon as I remember. =)
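To make the linear-regression suggestion concrete, here is a minimal ordinary-least-squares sketch in C#; the sales figures are invented, and x is simply the period index:

using System;

class LinearTrend
{
    static void Main()
    {
        // Hypothetical sales per period.
        double[] y = { 100, 120, 135, 160, 170 };
        int n = y.Length;

        // Ordinary least squares: slope = cov(x, y) / var(x), intercept = meanY - slope * meanX.
        double meanX = (n - 1) / 2.0, meanY = 0, cov = 0, varX = 0;
        for (int x = 0; x < n; x++) meanY += y[x] / n;
        for (int x = 0; x < n; x++)
        {
            cov += (x - meanX) * (y[x] - meanY);
            varX += (x - meanX) * (x - meanX);
        }
        double slope = cov / varX, intercept = meanY - slope * meanX;

        // Extrapolate one period beyond the historical data.
        Console.WriteLine("Predicted next-period sales: {0:F1}", intercept + slope * n);
    }
}

For the toy figures above this fits roughly y = 101 + 18x and predicts about 191 for the next period; with real sales data you would of course validate the linearity assumption first, as the answer warns.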

What's the "Hello World!" of genetic algorithms good for?

I found this very cool C++ sample, literally the "Hello World!" of genetic algorithms.
I so decided to re-code the whole thing in C# and this is the result.
Now I am asking myself: is there any practical application along the lines of generating a target string starting from a population of random strings?
EDIT: my buddy on Twitter just tweeted that it "is useful for transcription type things such as translation. Does not have to be Monkey's". I wish I had a clue.
Is there any practical application along the lines of generating a target string starting from a population of random strings?
Sure. Imagine any scenario in which you know how to evaluate the fitness of a particular string, and in which the choices are discrete and constrained in some way:
Picking pronounceable names ("Xhjkxc" has low fitness; "Artekzo" has high fitness)
Trying out a series of chess moves
Guessing the combination to a safe, assuming you can tell how close you are to unlocking each tumbler
Picking phone numbers that evaluate to words (e.g. "843-2378" has high fitness because it spells "THE-BEST")
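For reference, a minimal C# sketch of the string-evolving "Hello World!" itself, in the spirit of Dawkins' weasel program: mutation plus selection against a fixed target (the population size and mutation rate are arbitrary):

using System;
using System.Linq;

class HelloWorldGa
{
    const string Target = "Hello World!";
    const string Genes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ !";
    static readonly Random Rng = new Random();

    // Fitness = number of characters already matching the target.
    static int Fitness(string s) => s.Zip(Target, (a, b) => a == b ? 1 : 0).Sum();

    // Replace each character with a random gene with the given probability.
    static string Mutate(string s, double rate)
        => new string(s.Select(c => Rng.NextDouble() < rate ? Genes[Rng.Next(Genes.Length)] : c).ToArray());

    static void Main()
    {
        // Start from a fully random candidate and evolve by mutation + selection.
        var parent = Mutate(new string('x', Target.Length), 1.0);
        for (int generation = 0; Fitness(parent) < Target.Length; generation++)
        {
            // Breed a brood of mutants and keep the fittest (a simple (1+lambda) strategy).
            var best = Enumerable.Range(0, 50)
                                 .Select(_ => Mutate(parent, 0.05))
                                 .OrderByDescending(Fitness)
                                 .First();
            if (Fitness(best) >= Fitness(parent)) parent = best;
            if (generation % 20 == 0) Console.WriteLine("{0,4}: {1}", generation, parent);
        }
        Console.WriteLine("Done: " + parent);
    }
}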
No. Each time you run the GA, you are giving it the eventual answer. This is great for showing how a GA works and how powerful it can be, but it does not have any purpose beyond that.
You could write an EA that writes code in a dynamic language like IronPython with the goal of creating code that a) executes without crashing and b) analyzes the stock market and intelligently buys and sells stock.
That's a very simplistic take on what would be necessary, but it's possible. You would need a host that provides a lot of methods for the IronPython code (technical indicators, etc) and a database of ticks.
It would also be smart not to just generate any old random code, lest you format your own hard drive. You need a sandbox, you need to limit the namespaces that are accessible, and you would need to impose a time limit to avoid infinite loops. You could also provide semantic guidelines that allow it to choose appropriate approved keywords instead of just stringing random letters together - this would greatly speed up evolution.
So, I was involved with a project that did everything but the EA. We had a satellite dish that got real-time stock ticks from the NASDAQ, a service for trading that had an API, and a primitive decision making "brain" that made decisions as the ticks came in.
Sadly, one of the partners flipped out, quit his job, forked the project (got his own dish, etc), and started trading with logic that wasn't ready. He lost a bunch of money. It turns out that for some people this type of project is only a step away from common gambling. But anyway, the project kind of fizzled out after that. Evolving the logic part is the missing link though. And I know there are people out there doing this type of thing.
I have used GAs in two real-life research problems.
One was a power optimization problem (maximize number of appliances turned on, meeting the available power constraint and service guarantee for each appliance)
Another was for radio network optimization, maximizing the coverage area given a fixed equipment budget
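As an illustration of the first problem, a rough sketch of what such a fitness function might look like; the power values, budget and penalty weight are invented:

using System;

class PowerFitness
{
    // Chromosome: one bit per appliance (on = true). Fitness rewards the number of
    // appliances turned on, but penalizes solutions that exceed the power budget
    // so the GA can still climb back toward the feasible region.
    static double Fitness(bool[] genome, double[] powerDraw, double budget)
    {
        double used = 0;
        int turnedOn = 0;
        for (int i = 0; i < genome.Length; i++)
        {
            if (!genome[i]) continue;
            used += powerDraw[i];
            turnedOn++;
        }
        return used <= budget ? turnedOn : turnedOn - 10 * (used - budget);
    }

    static void Main()
    {
        var powerDraw = new[] { 1.5, 0.8, 2.0, 0.3 };        // kW per appliance (invented)
        var genome = new[] { true, true, false, true };      // one candidate solution
        Console.WriteLine(Fitness(genome, powerDraw, 3.0));  // 2.6 kW <= 3.0 kW budget -> fitness 3
    }
}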
GAs have one main disadvantage: they usually work at "evolutionary" speed, so using them in serious time-dependent projects is quite risky.
