Is there any DECAPTCHA library in .NET?

Is there any DECAPTCHA library in .NET? - c#

I'm looking for some sample projects to read CAPTCHA images. Is there any in C# or VB ?
pseudo code:
String captchaText = CaptchaDecoder(Image captchaImage);

Take a look to:
Text-based CAPTCHA Strengths and Weaknesses. ACM Computer and Communication security 2011 (CSS’2011). link
The authors present a CAPTCHA breaker and explain a generic algorithm to crack standard CAPTCHAs
In this section we present our captcha breaker, Decaptcha, which is
able to break many popular captchas including eBay, Wikipedia and Digg
[...] Decaptcha implements a reﬁned version of the three stage
approach in 15,000 lines of code in C# [...]

This is easier said than done.
This involves either brute-forcing the captcha or running OCR algorithms on it to try and detect what is written in the captcha.
You might want to check this related question: Has reCaptcha been cracked / hacked / OCR'd / defeated / broken?
It also depends on what techniques were used to produce the CAPTCHA. Some distort the text and some squeeze the text.
Your question is a little vague.
Additional reading here: http://en.wikipedia.org/wiki/CAPTCHA
Christian

There are so many types of Captchas out there that you won't find a single library to read them all. If you are only interested in one type though, you might have more luck. Even then, there are lots of variations on Captchas, and the engines frequently produce (whether on purpose or incidentally) tricky ones which even humans can't figure out. Humans can click the little icon to get a new one; your program might not be able to.

Related

LSODA to dll - Fortran (F77) to dll to call from C#

a little question in which I hope you can help me. (To make my life simpler)
One of the most respected numerical solvers for differential equations is LSODA, however it is written in Fortran... ( http://www.netlib.org/odepack/index.html )
There does not seem to be a decent solver for C#, and writing my own is too time consuming in C#, especially as I have very stiff equations that need to be solved.
The NAG libraries for net do not contain an ODE solver (they lack D02 routines). In terms of "university side" libraries that's it.
However NAG Support suggested calling their dll, which is fine for simple variables, but has me rather perplexed with its external functions and dummy parameters which made me give up.
This leaves LSODA still, which is fortran, but a lot simpler in its calling sequence - so I wonder, how can the Odepack (the solvers that include the lsoda routine) be turned into a dll with little work, so that it may be called from C#?
(Which will leave me worried about the Jacobian, being a matrix, i.e. 2D array.)
Specifically, I would like a situation similar to that with the Fortran NAG library, but instead offering me access to lsoda: http://www.nag.co.uk/numeric/csharpinfo.asp
Please keep in mind that I am a mathematician - so if your responses loose me, please be patient with me. And why am I so focused on C# - well, it is simple, especially when one has VisualStudio 2010.
Many thanks for any responses in advance.

SmartMathLibrary looks dead, but it claims to have ODEPACK bindings. You could also check out Wikipedia's List of .NET Numerical Packages.
If you're open to other languages, Python's SciPy library contains a binding to LSODA: enter link description here. It's available on Windows, easy to use, free, and widely embraced by the scientific community.

It's not a full solution, but f2c (Fortran-to-C converter) should be able to give you working C code from the Fortran source. That might at least be easier to get working from C#.
Disclaimer: I've never used f2c to convert a routine, I've only used some of the routines that someone else converted.

XXX image recognition software/algorithm [duplicate]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Akismet does an amazing job at detecting spam comments. But comments are not the only form of spam these days. What if I wanted something like akismet to automatically detect porn images on a social networking site which allows users to upload their pics, avatars, etc?
There are already a few image based search engines as well as face recognition stuff available so I am assuming it wouldn't be rocket science and it could be done. However, I have no clue regarding how that stuff works and how I should go about it if I want to develop it from scratch.
How should I get started?
Is there any open source project for this going on?

This is actually reasonably easy. You can programatically detect skin tones - and porn images tend to have a lot of skin. This will create false positives but if this is a problem you can pass images so detected through actual moderation. This not only greatly reduces the the work for moderators but also gives you lots of free porn. It's win-win.
#!python
import os, glob
from PIL import Image
def get_skin_ratio(im):
im = im.crop((int(im.size[0]*0.2), int(im.size[1]*0.2), im.size[0]-int(im.size[0]*0.2), im.size[1]-int(im.size[1]*0.2)))
skin = sum([count for count, rgb in im.getcolors(im.size[0]*im.size[1]) if rgb[0]>60 and rgb[1]<(rgb[0]*0.85) and rgb[2]<(rgb[0]*0.7) and rgb[1]>(rgb[0]*0.4) and rgb[2]>(rgb[0]*0.2)])
return float(skin)/float(im.size[0]*im.size[1])
for image_dir in ('porn','clean'):
for image_file in glob.glob(os.path.join(image_dir,"*.jpg")):
skin_percent = get_skin_ratio(Image.open(image_file)) * 100
if skin_percent>30:
print "PORN {0} has {1:.0f}% skin".format(image_file, skin_percent)
else:
print "CLEAN {0} has {1:.0f}% skin".format(image_file, skin_percent)
This code measures skin tones in the center of the image. I've tested on 20 relatively tame "porn" images and 20 completely innocent images. It flags 100% of the "porn" and 4 out of the 20 of the clean images. That's a pretty high false positive rate but the script aims to be fairly cautious and could be further tuned. It works on light, dark and Asian skin tones.
It's main weaknesses with false positives are brown objects like sand and wood and of course it doesn't know the difference between "naughty" and "nice" flesh (like face shots).
Weakness with false negatives would be images without much exposed flesh (like leather bondage), painted or tattooed skin, B&W images, etc.
source code and sample images

This was written in 2000, not sure if the state of the art in porn detection has advanced at all, but I doubt it.
http://www.dansdata.com/pornsweeper.htm
PORNsweeper seems to have some ability to distinguish pictures of people from pictures of things that aren't people, as long as the pictures are in colour. It is less successful at distinguishing dirty pictures of people from clean ones.
With the default, medium sensitivity, if Human Resources sends around a picture of the new chap in Accounts, you've got about a 50% chance of getting it. If your sister sends you a picture of her six-month-old, it's similarly likely to be detained.
It's only fair to point out amusing errors, like calling the Mona Lisa porn, if they're representative of the behaviour of the software. If the makers admit that their algorithmic image recogniser will drop the ball 15% of the time, then making fun of it when it does exactly that is silly.
But PORNsweeper only seems to live up to its stated specifications in one department - detection of actual porn. It's half-way decent at detecting porn, but it's bad at detecting clean pictures. And I wouldn't be surprised if no major leaps were made in this area in the near future.

I would rather allow users report on bad images. Image recognition development can take too much efforts and time and won't be as much as accurate as human eyes. It's much cheaper to outsource that moderation job.
Take a look at: Amazon Mechanical Turk
"The Amazon Mechanical Turk (MTurk) is one of the suite of Amazon Web Services, a crowdsourcing marketplace that enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do."

Bag-of-Visual-Words Models for Adult Image Classiﬁcation and Filtering
What is the best way to programatically detect porn images?
A Brief Survey of Porn-Detection/Porn-Removal Software
Detection of Pornographic Digital Images (2011!)

BOOM! Here is the whitepaper containing the algorithm.
Does anyone know where to get the source code for a java (or any language) implementation?
That would rock.
One algorithm called WISE has a 98% accuracy rate but a 14% false positive rate. So what you do is you let the users flag the 2% false negatives, ideally with automatic removal if a certain number of users flag it, and have moderators view the 14% false positives.

Nude.js based on the whitepaper by Rigan Ap-apid from De La Salle University.

There is software that detects the probability for porn, but this is not an exact science, as computers can't recognize what is actually on pictures (pictures are only a big set of values on a grid with no meaning). You can just teach the computer what is porn and what not by giving examples. This has the disadvantage that it will only recognize these or similar images.
Given the repetitive nature of porn you have a good chance if you train the system with few false positives. For example if you train the system with nude people it may flag pictures of a beach with "almost" naked people as porn too.
A similar software is the facebook software that recently came out. It's just specialized on faces. The main principle is the same.
Technically you would implement some kind of feature detector that utilizes a bayes filtering. The feature detector may look for features like percentage of flesh colored pixels if it's a simple detector or just computes the similarity of the current image with a set of saved porn images.
This is of course not limited to porn, it's actually more a corner case. I think more common are systems that try to find other things in images ;-)

The answer is really easy: It's pretty safe to say that it won't be possible in the next two decades. Before that we will probably get good translation tools. The last time I checked, the AI guys were struggling to identify the same car on two photographs shot from a slightly altered angle. Take a look on how long it took them to get good enough OCR or speech recognition together. Those are recognition problems which can benefit greatly from dictionaries and are still far from having completely reliable solutions despite of the multi-million man months thrown at them.
That being said you could simply add an "offensive?" link next to user generated contend and have a mod cross check the incoming complaints.
edit:
I forgot something: IF you are going to implement some kind of filter, you will need a reliable one. If your solution would be 50% right, 2000 out of 4000 users with decent images will get blocked. Expect an outrage.

A graduate student from National Cheng Kung University in Taiwan did a research on this subject in 2004. He was able to achieve success rate of 89.79% in detecting nude pictures downloaded from the Internet. Here is the link to his thesis: The Study on Naked People Image Detection Based on Skin Color It's in Chinese therefore you may need a translator in case you can't read it.

short answer: use a moderator ;)
Long answer: I dont think there's a project for this cause what is porn? Only legs, full nudity, midgets etc. Its subjective.

Add an offensive link and store the md5 (or other hash) of the offending image so that it can automatically tagged in the future.
How cool would it be if somebody had a large public database of image md5 along with descriptive tags running as a webservice? Alot of porn isn't original work (in that the person who has it now, didn't probably make it) and the popular images tend to float around different places, so this could really make a difference.

If you're really have time and money:
One way of doing it is by 1) Writing an image detection algorithm to find whether an object is human or not. This can be done by bitmasking an image to retrieve it's "contours" and see if the contours fits a human contour.
2) Data mine a lot of porn images and use data mining techniques such as the C4 algorithms or Particle Swarm Optimization to learn to detect pattern that matches porn images.
This will require that you identify how a naked man/woman contours of a human body must look like in digitized format (this can be achieved in the same way OCR image recognition algorithms works).
Hope you have fun! :-)

Seems to me like the main obstacle is defining a "porn image". If you can define it easily, you could probably write something that would work. But even humans can't agree on what is porn. How will the application know? User moderation is probably your best bet.

I've seen a web filtering application which does porn image filtering, sorry I can't remember the name. It was pretty prone to false positives however most of the time it was working.
I think main trick is detecting "too much skin on the picture :)

Detecting porn images is still a definite AI task which is very much theoretical yet.
Harvest collective power and human intelligence by adding a button/link "Report spam/abuse". Or employ several moderators to do this job.
P.S. Really surprised how many people ask questions assuming software and algorithms are all-mighty without even thinking whether what they want could be done. Are they representatives of that new breed of programmers who have no understanding of hardware, low-level programming and all that "magic behind"?
P.S. #2. I also remember that periodically it happens that some situation when people themselves cannot decide whether a picture is porn or art is taken to the court. Even after the court rules, chances are half of the people will consider the decision wrong. The last stupid situation of the kind was quite recently when a Wikipedia page got banned in UK because of a CD cover image that features some nakedness.

Two options I can think of (though neither of them is programatically detecting porn):
Block all uploaded images until one of your administrators has looked at them. There's no reason why this should take a long time: you could write some software that shows 10 images a second, almost as a movie - even at this speed, it's easy for a human being to spot a potentially pornographic image. Then you rewind in this software and have a closer look.
Add the usual "flag this image as inappropriate" option.

The BrightCloud web service API is perfect for this. It's a REST API for doing website lookups just like this. It contains a very large and very accurate web filtering DB and one of the categories, Adult, has over 10M porn sites identified!

I've heard about tools which were using very simple, but quite effective algorithm. The algorithm calculated relative amount of pixels with color value near to some predefined "skin" colours. If that amount is higher than some predefined value then image is considered to be of erotic/pornographic content. Of course that algorithm will give false positive results for close-up face photos and many other things.
Since you are writing about social networking there will be lots of "normal" photos with high amount of skin colour on it, so you shouldn't use this algorithm to deny all pictures with positive result. But you can use it provide some help for moderators, for example flag these pictures with higher priority, so if moderator want to check some new pictures for pornographic content he can start from these pictures.

This one looks promising. Basically they detect skin (with calibration by recognizing faces) and determine "skin paths" (i.e. measuring the proportion of skin pixels vs. face skin pixels / skin pixels). This has decent performance.
http://www.prip.tuwien.ac.at/people/julian/skin-detection

Look at file name and any attributes. There's not nearly enough information to detect even 20% of naughty images, but a simple keyword blacklist would at least detect images with descriptive labels or metadata. 20 minutes of coding for a 20% success rate isn't a bad deal, especially as a prescreen that can at least catch some simple ones before you pass the rest to a moderator for judging.
The other useful trick is the opposite of course, maintain a whitelist of image sources to allow without moderation or checking. If most of your images come from known safe uploaders or sources, you can just accept them bindly.

I shall not today attempt further to
define the kinds of material I
understand to be embraced within that
shorthand description ["hard-core
pornography"]; and perhaps I could
never succeed in intelligibly doing
so. But I know it when I see it, and
the motion picture involved in this
case is not that.
— United States Supreme Court Justice Potter Stewart, 1964

You can find many whitepapers on the net dealing with this subject.

It is not rocket science. Not anymore. It is very similar to face recognition. I think that the easiest way to deal with it is to use machine learning. And since we are dealing with images, I can point towards neuronal networks, because these seem to be preferred for images. You will need training data. And you can find tons of training data on the internet but you have to crop the images to the specific part that you want the algorithm to detect. Of course you will have to break the problem into different body parts that you want to detect and create training data for each, and this is where things become amusing.
Like someone above said, it cannot be done 100% percent. There will be cases where such algorithms fail. The actual precision will be determined by your training data, the structure of your neuronal networks and how you will choose to cluster the training data (penises, vaginas, breasts, etc, and combinations of such). In any case I am very confident that this can be achieved with high accuracy for explicit porn imagery.

This is a nudity detector. I haven't tried it. It's the only OSS one I could find.
https://code.google.com/p/nudetech

There is no way you could do this 100% (i would say maybe 1-5% would be plausible) with nowdays knowledge. You would get much better result (than those 1-5%) just checking the image-names for sex-related-words :).
#SO Troll: So true.

If it is possible to auto-format code before and after a source control commit, checkout, diff, etc. does a company really need a standard code style?

If it is possible to auto-format code before and after a source control commit, checkout, diff, etc. does a company really need a standard code style?
It feels like standard coding style debates that have been raging since programming began like "put the bracket on the following line" or "properly indent your (" are no longer essential.
I realize in languages where white space matters the diff will have to consider it but for languages where the style is a personal preference is there really a need to worry about it anymore?

Auto-format can really only address whitespace.
It wont address developers giving variables bizarre nonsensical names.
It won't address some developers having functions return null on an error vs throwing an exception.
I'm sure others can think of more examples.

This is what we do at my work:
We all use Eclipse. We don't have a policy for using Eclipse but somehow none of us is an IDEA/IntelliJ guy. We also think our code should be written with legacy in mind. This means our code has to be readable in a certain way even years after (#1) no matter who wrote it and if that person even is in the company anymore.
Eclipse has couple handy features, automatic format on save and a specific Formatter tool. As you can see from the linked screenshot, it can be configured with XML. Thus there's a bunch of premade XML:s available for every worker in our company so that when a new guy comes in, we walk him through of the whole process and configure their Eclipse for them (yes, it's slightly evil thing to do) so that it actually uses those formatting XML:s we have provided. We do not enforce automatic format on save, we don't want to be completely intrusive, we just want to push all our developers into the right directions. For even increased compatability, we mostly use rules defined in JCC.
Next comes the important part, the actual builds. We are those who embrace automatic builds and for that we use Hudson Continuous Integration Server. There's two important parts in our configurations beyond this:
We use CVS loginfo to trigger builds whenever something is committed.
We utilize several plugins available for Hudson, including Continuous Integration Game in conjuction with the most important one, Checkstyle.
The Checkstyle Plugin is the magician in our code style enforcement guide line:
After commiting code to CVS, Hudson build is triggered
After build has been completed succesfully (all unit tests pass etc.), Checkstyle inspects the actual source files
Checkstyle ranks the code based rules we have defined for it
Continuous Integration Game sees the result of Checkstyle and awards/takes away points for the person who has the ownership for the relevant part of the code
Leaderboard shows total points for every commiter in system
Basically this means that when anyone commits ugly code into our CVS, our build server automatically reduces that person's points.
This means that eventually any one of us can be ranked on the Leaderboard based on the general code quality in both look and OO principles such as Law of Demeter, Cyclomatic complexity etc. etc. Naturally this isn't a completely serious statistic, but it's a good indication you're doing something wrong when causing a build to be initiated in our CI won't reduce your points - most of our commits are worth between 1 and 5 points.
And is it working? Sort of, I don't think anyone of us at my work writes ugly or unmaintainable code and personally I love to hunt all kinds of scores so it's definitely motivating me to make code that looks nice and follows all the OO paradigms I know of.
And do we as a company really need it? I think we do as you should see from reading this entire answer, it can be considered a good practice for the advancements it brings.
#1: in a related note, I refactored legacy code from 2002 today which used those standards, didn't look "bad" at all even in its original form and certainly not worse in its new form

No, not really.
If you can actually get it to work consistently and not make it flag code has changed due to a different style of laying the code out.
However, this is just a small part of coding standards. It won't cover multiple return statements, the use or not of ternary operators, etc.

It is always nice if the coding style that the shop uses is the same one that is also followed by the development tools.
Otherwise, if there is a large body of code that already follows a shop standard which is NOT the same as that of the tools you have two choices:
Modify all of the code to follow the tool standard, or
Maintain the existing shop standard.
Many shops do the latter. Regardless, there does need to be some kind of standard, and it does need to be followed.
Some development tools allow you to tweak their standard. In some cases you may be able to bring the tools in alignment with the shop standard.

It probably doesn't matter that much anymore if you can ensure that everybody in the team sees the source code "correctly" formatted, whatever they think it is. However I've not seen a system that can do that - you can do parts of it (say, reformat before and after checkin/checkout) but these days you also have to consider web interfaces into the version control, external code review systems that interact directly with the version control system etc.
The main purpose of a standard code style is (IMHO) to ensure that you can read other team members' code easily without having to start reverse engineering it because all the code is written using the same sort of guiding principles. Indenting and parentheses placement seem to be a major hangup on this but they are only a very small and in my opinion, somewhat overblown and not very important part of the need to make code consistent.
Unfortunately I'm not aware of any tools that can automatically apply consistent coding principles to source code...

Yes, coding styles are needed if there is a desire to have a homogeneous code base. Such a code base can be useful in preventing individual ownership of parts of the code base, which can cause problems when people leave the team. If you can't imagine having wildly different styles and problems understanding all of it, just look at all the different ways English text can be organized in various communications, all written but quite different such as tweets, e-mail, text messages, IM, message board posts, etc. and changes in fonts, capitalization, decorations, etc.

Detect language of text [duplicate]

This question already has answers here:
How to detect the language of a string?
(9 answers)
Closed 8 years ago.
Is there any C# library which can detect the language of a particular piece of text? i.e. for an input text "This is a sentence", it should detect the language as "English". Or for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?

Yes indeed, TextCat is very good for language identification. And it has a lot of implementations in different languages.
There were no ports in .Net. So I have written one: NTextCat (NuGet, Online Demo).
It is pure .NET Standard 2.0 DLL + command line interface to it. By default, it uses a profile of 14 languages.
Any feedback is very appreciated! New ideas and feature requests are welcomed too :)

Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’, is used exclusively [I think] in Turkish, t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.
Some digraphs, trigraphs and tetragraphs are also a good give-away. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
If such a library exists I would like to know about it, since I'm working on one myself.

Please find a C# implementation based on of 3grams analysis here:
http://idsyst.hu/development/language_detector.html

Here you have a simple detector based on bigram statistics (basically means learning from a big set which bigrams occur more frequently on each language and then count those in a piece of text, comparing to your previously detected values):
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
Of course it will perform worse than Google's or Bing's algorithm (which themselves aren't great). If you need excellent detection performance you would have to do both a lot of hard work and over huge amounts of data.
The other option would be to leverage Google's or Bing APIs if your app has Internet access.

You'll want a machine learning algorithm based on hidden markov chains, process a bunch of texts in different languages.
Then when it gets to the unidentified text, the language that has the closer 'score' is the winner.

There is a simple tool to identify text language:
http://www.detectlanguage.com/

I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.

artificial intelligence - Creative Writing

I am trying to find information (and hopefully c# source code) about trying to create a basic AI tool that can understand english words, grammar and context.
The Idea is to train the AI by using as many written documents as possible and then based on these documents, for the AI to create its own creative writitng in proper english that makes sense to a human.
While the idea is simple, I do realise that the hurdles are huge, any starting points or good resoueces will be appriacted.

A basic AI tool that you can use to do something like this is a Markov Chain. It's actually not too tricky to write!
See: http://pscode.com/vb/scripts/ShowCode.asp?txtCodeId=2031&lngWId=10
If that's not enough, you might be able to store WordNet synsets in your Markov chain instead of just words. This gives you some sense of the meaning of the words.

To be able to recompose a document you are going to have to have away to filter through the bad results.
Which means:
You are going to have to write a program that can evaluate if the output is valid (grammatically and syntactically is the best you can do reliablily) (This would would NLP)
You would need lots of training data and test data
You would need to watch out for overtraining (take a look at ROC curves)
Instead of writing a tool you could:
Manually score the output (will take a long time to properly train the algorigthm)
With this using the Amazon Mechanical Turk might be a good idea
The irony of this: The computer would have a difficult time "Creatively" composing something new. All of its worth will be based on its previous experiences [training data]

Some good references and reading at this Natural Language article.

As others said, Markov chain seems to be most suitable for such a task. Nice description of implementing Markov chain can be found in Kernighan & Pike, The Practice of Programming, section 3.1. Nice description of text-generating is also present in Programming Pearls.

One thing, though not quite what you need, would be a Markov chain of words. Here's a link I found by a quick search: http://blog.figmentengine.com/2008/10/markov-chain-code.html, but you can find much more information by searching for it.

Take a look at http://www.nltk.org/ (Natural Language Toolkit), lots of powerful tools there. They use Python (not C#) but Python is easy enough to pick up. Much easier to pick up than the breadth and depth of natural language processing, at least.

I agree, that you will have troubles in creating something creative. You could possibly also use a keyword spinner on certain words. You might also want to implement a stop word filter to remove anything colloquial.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.