Speech recognition of a large name list - C#

We have a C#.NET solution where someone can call a phone number and speak a person's first name and then last name. The name is then entered in a guest registry on our website. We use an XML dictionary file with 5,000 first names and 89,000 last names that we got from the US Census. We are using the Microsoft.Speech.Recognition library (maybe that's the problem).
Our problem is that even with relatively easy names like Joshua McDaniels we are getting about a 30% fail rate. The performance (speed-wise) is fine; it just doesn't recognize a good portion of the names.
Now, I understand that ultimately the quality of the spoken name will dictate, sorry for the pun, how well the system performs, but we would like to get close to 99% in "laboratory" conditions with perfect enunciation and no accent, and then call it good. But even after several trials with the same person speaking the same name, on the same phone, in the same environment, we are getting a 25% fail rate.
My question is: does anyone have an idea of a better way to approach this? We thought about using an API instead, so that the matches would be more relevant and current.

The current state of the technology is that it is very hard to recognize names, let alone a large list of them. You can recognize names from a phone book (500 entries) with good quality, but for thousands of entries it is very hard. Speech recognition engines are certainly not designed for that, in particular offline ones like System.Speech.
You might get way better results with online systems like https://www.projectoxford.ai which use advanced DNN acoustic models and bigger vocabularies.
Whole companies have been built around the capability to recognize large name lists; Novauris, for example, used patented technology for that. You might consider building something like that using an open source engine, but it would be a large undertaking in any case.
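For reference, here is a minimal sketch (assuming System.Speech, which Microsoft.Speech largely mirrors) of the constrained name-grammar setup the question describes. With a handful of entries per Choices list this works well; loading 5,000 first names and 89,000 last names into the same structure is exactly the scale these offline engines were not designed for.

using System;
using System.Speech.Recognition;

class NameGrammarSketch
{
    static void Main()
    {
        // Hypothetical name lists; in practice these would be read from the XML dictionary files.
        string[] firstNames = { "Joshua", "Maria", "David" };
        string[] lastNames  = { "McDaniels", "Garcia", "Smith" };

        var builder = new GrammarBuilder();
        builder.Append(new Choices(firstNames));   // first name slot
        builder.Append(new Choices(lastNames));    // last name slot

        using (var engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new Grammar(builder));
            engine.SetInputToDefaultAudioDevice(); // a phone integration would feed an audio stream instead

            engine.SpeechRecognized += (s, e) =>
                Console.WriteLine($"Heard: {e.Result.Text} (confidence {e.Result.Confidence:F2})");

            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening; press Enter to stop.");
            Console.ReadLine();
        }
    }
}

Swapping the recognizer for one of the online services mentioned above keeps the same overall flow, but moves the acoustic modelling to much larger server-side models.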

Related

Implement Full Text Search

I am implementing full text search on a single entity, Document, which contains a name and content. The content can be quite big (20+ pages of text). I am wondering how to do it.
Currently I am looking at using Redis and RediSearch, but I am not sure if it can handle searching in big chunks of text. We are talking about a multitenant application with each customer having more than 1,000 documents that are quite big.
TL;DR: What should I use to search big chunks of text content?
This space is a bit unclear to me, sorry for the confusion. Will update the question when I have more clarity.
I can't tell you what the right answer is, but I can give you some ideas about how to decide.
Normally, if I had documents/content in a DB I'd be inclined to search there - assuming that the search functionality I could implement (a) was functionally effective enough, (b) didn't require code that was super ugly, and (c) wasn't going to kill the database. There's usually a lot of messing around trying to implement the search features and filters you want to provide to the user - UI components, logic components, and then translating all that into what the database & query language can actually do.
So, based on what you've said, the key trade-offs are probably:
Functionality / functional fit (creating the features you need, to work in a way that's useful).
Ease of development & maintenance.
Performance - purely on the basis that gathering search results across "documents" is not necessarily the fastest thing you can do with an IT system.
Have you tried doing a simple whiteboard "options analysis" exercise? If not try this:
Get a small number of interested and smart people around a whiteboard. You can do this exercise alone, but bouncing ideas around with others is almost always better.
Agree what the high level options are. In your case you could start with two: one based on MSSQL, the other based on Redis.
Draw up a big table - each option has its own column (starting at column 2).
In column 1, list all the important things that will drive your decision, e.g. functional fit, ease of development & maintenance, performance, cost, etc.
For each driver in column 1, do a score for each option.
How you do it is up to you: you could use a 1-5 point system (optionally with a planning-poker-type approach to avoid anchoring), or you could write down a few key notes.
Be ready to note down any questions that come up, important assumptions, etc so they don't get lost.
Sometimes as you work through the exercise the answer becomes obvious. If it's really close you can rely on scores - but that's not ideal. It's more likely that of all the drivers listed some will be more important than others, so don't ignore the significance of those.
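To make the "functional fit" and "performance" rows a bit more concrete, here is a rough sketch of what the MSSQL option could look like from C#, assuming a full-text index already exists on a hypothetical Documents.Content column (the connection string, table, and column names are made up for illustration):

using System;
using System.Data.SqlClient;

class FullTextSearchSketch
{
    static void Main()
    {
        // Hypothetical connection string and schema.
        const string connectionString = "Server=.;Database=Docs;Integrated Security=true";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // CONTAINS uses the full-text index instead of a LIKE '%...%' table scan,
            // and the TenantId filter keeps the search within one customer's documents.
            var command = new SqlCommand(
                @"SELECT TOP 20 Id, Name
                  FROM Documents
                  WHERE TenantId = @tenantId AND CONTAINS(Content, @term)",
                connection);
            command.Parameters.AddWithValue("@tenantId", 42);
            command.Parameters.AddWithValue("@term", "\"search phrase\"");

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(1)}");
            }
        }
    }
}

The Redis/RediSearch column would get a similar sketch with its own client library, and the comparison then comes down to which of the two stays "functionally effective enough" for the filters you need.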

How to determine how unique a word is?

I have a text file with around 300,000 words. Each word is 5 letters.
I'd like to be able to determine how unique each word is on the internet.
An idea I had was to Google the word and see how many results it yielded. Unfortunately, this is against their TOS.
I was trying to think of any other way but it would have to involve querying some website a lot and I doubt they would appreciate that much.
Anyone have any other ideas? Programming language doesn't matter that much but I would prefer C#.
To look up the frequency 'in books' you could use the Google Ngram dataset, but that's not 'for the internet'. If this is for academic purposes, the Bing alternative might also work, and it is based on internet frequencies.
If your words do not contain slang, I would recommend looking at public domain books. The issue here is that most of these books will be older, so you really will be getting a snapshot in time of how popular a word is (or I guess was). The plus side is that these books are freely available in text file format allowing you to easily mine them for data.
One thing to note: if you're in the US and plan on using Project Gutenberg to get the books, they have a rule that the website is intended only for human users. There is a page that tells you how to get the same data via a mirror.
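If you go the public-domain-books route, the mining itself is straightforward. Here is a minimal sketch (file and folder names are hypothetical) that counts how often each of your 5-letter words appears across a folder of plain-text books:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class WordFrequencySketch
{
    static void Main()
    {
        // Assumed inputs: the 300,000-word list (one word per line) and a folder of .txt books.
        var words = new HashSet<string>(File.ReadLines("words.txt"), StringComparer.OrdinalIgnoreCase);
        var counts = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

        foreach (var path in Directory.EnumerateFiles("books", "*.txt"))
        {
            foreach (Match match in Regex.Matches(File.ReadAllText(path), @"[A-Za-z]+"))
            {
                if (!words.Contains(match.Value))
                    continue;
                counts.TryGetValue(match.Value, out var n);
                counts[match.Value] = n + 1;
            }
        }

        // Lower counts suggest rarer ("more unique") words, at least within this corpus.
        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}\t{pair.Value}");
    }
}

Note that words which never appear in the corpus are arguably the "most unique" of all, so you may want to initialize every word's count to zero before scanning.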

Text classification: extract tags from text

I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from the description and generate tags to classify each item based on it. Is there a Lucene.NET library for doing this, or any other library for text classification?
No. Lucene.NET can do search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so a more detailed description might be needed.
But generally, the easiest way is to use external services. They all have REST APIs, and it's very easy to interact with them from C#.
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
There are also good Java SDKs like Mahout. As I recall, interaction with Mahout can also be done as with a service, so integrating with it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not the way for you. Then take a look at the DBpedia project and RDF.
And lastly, you can always use an implementation of the Naive Bayes algorithm. It's easy, and everything stays under your control.
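If you do go the Naive Bayes route, the core of a multinomial classifier with add-one smoothing fits in a few dozen lines of C#. This is only a rough sketch (the tokenization is a placeholder and the class names are made up), but it shows how little machinery is needed:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesClassifier
{
    // Per-label word counts, per-label document counts, and the overall vocabulary.
    private readonly Dictionary<string, Dictionary<string, int>> _wordCounts = new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> _docCounts = new Dictionary<string, int>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();
    private int _totalDocs;

    private static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ', ',', '.', ';', ':', '!', '?' },
                                      StringSplitOptions.RemoveEmptyEntries);

    public void Train(string label, string text)
    {
        _totalDocs++;
        _docCounts.TryGetValue(label, out var docs);
        _docCounts[label] = docs + 1;

        if (!_wordCounts.TryGetValue(label, out var counts))
            _wordCounts[label] = counts = new Dictionary<string, int>();

        foreach (var word in Tokenize(text))
        {
            _vocabulary.Add(word);
            counts.TryGetValue(word, out var n);
            counts[word] = n + 1;
        }
    }

    public string Classify(string text)
    {
        string best = null;
        var bestScore = double.NegativeInfinity;

        foreach (var label in _docCounts.Keys)
        {
            var counts = _wordCounts[label];
            double totalWords = counts.Values.Sum();

            // log P(label) + sum over words of log P(word | label), with add-one smoothing.
            var score = Math.Log((double)_docCounts[label] / _totalDocs);
            foreach (var word in Tokenize(text))
            {
                counts.TryGetValue(word, out var n);
                score += Math.Log((n + 1.0) / (totalWords + _vocabulary.Count));
            }

            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}

Train it with a handful of (tag, description) pairs and Classify returns the most likely tag; it improves as you feed it more labelled descriptions per tag, which is the whole appeal of keeping it under your own control.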
This is a very hard problem, but if you don't want to spend much time on it you can take all words that have between 5% and 10% frequency in the whole document. Or you can simply take the 5 most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example pairs) which you can use to find multi-word tags.
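A rough sketch of that heuristic in C# (the stopword list here is a tiny placeholder; a real one would come from one of the lists mentioned above): tokenize the description, drop stopwords, count what is left, and take the top 5 as tags.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TagExtractionSketch
{
    // Placeholder stopword list; use a full list obtained from the internet in practice.
    private static readonly HashSet<string> Stopwords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on", "with"
    };

    public static IEnumerable<string> ExtractTags(string description, int count = 5)
    {
        return Regex.Matches(description.ToLowerInvariant(), @"[a-z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Where(w => w.Length > 2 && !Stopwords.Contains(w))
                    .GroupBy(w => w)
                    .OrderByDescending(g => g.Count())
                    .Take(count)
                    .Select(g => g.Key);
    }

    static void Main()
    {
        var tags = ExtractTags("A lightweight wireless keyboard with a wireless receiver and long battery life.");
        Console.WriteLine(string.Join(", ", tags)); // "wireless" comes first, since it appears twice
    }
}

Extending the grouping to word pairs instead of single words gives you the multi-word (N-gram) tags mentioned above.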

Neural Network, Genetic algorithm as an Intrusion detection system

Hi, I need some help getting started with creating my first algorithm; I want to create a NN/genetic algorithm for use as an intrusion detection system.
But I'm struggling with some points (I've never written an algorithm before).
I want to develop in C#; would it be possible as a console app? If so, roughly how big would the programme be in its most simplistic form? Is it even possible in C#?
How do I connect the programme to read in data from the network? Also, how can packets be converted to readable data for the algorithm?
How do I get the programme to write rules for Snort or some other form of firewall and block what the programme deems a potential threat? (i.e. it spots a threat as in point 2, then it writes a rule into the Snort rules file blocking that specific traffic)
How do I track the data? (what it has blocked, what it is observing, and how it came to that conclusion)
Where should it be placed on the network? (Can the programme connect to other algorithms and share data on the same network, and would that be beneficial?)
Can anyone help start me off in the right direction, or explain what other alternatives there are, like fuzzy logic etc., and why this kind of system is deemed a black box?
Yes, a console app and C# can be used to create a neural network. Of course, if you want more visual aspects to the UI, you'll want to use WinForms/WPF/Silverlight etc. It's impossible to tell how big the program will be, as there's not enough information on what you want to do. Also, the size shouldn't really be a problem as long as it's efficient.
I assume this is some sort of final-year project? What type of neural network are you using? You should read some academic papers/whitepapers on using NNs for intrusion detection to get an idea. For example, this PDF has some information that might help.
You should take this one step at a time. Creating a Neural Network is separate from creating a new rule in Snort. Work on one topic at a time otherwise you'll just get overwhelmed. Considering the hard part will most likely be the NN, you should focus on that first.
It's unlikely anyone's going to go through each step with you as it's quite a large project. Show what you've done and explain where you need help.
My core realization when I started learning about neural networks is that they are just function approximators. I think that's a crucial thing to keep in mind. Whether you're using genetic algorithms or neural nets (or combining them, as mentioned by @Ben Voigt, even though neural networks are typically associated with other training techniques) - what you get in the end is a function where you put in a number of real values and get out a single value.
Keeping this in mind, you can design your program and, for the testing part, just think of the network as a black box providing those predictions. During training, think of another black box where you put in pairs of inputs and outputs and assume it's going to get better the more pairs you show it.
Maybe you find this trivial, but with all the theory and mystique associated with this type of algorithm, I found it reassuring (though a bit disappointing ;) to reduce them to those kinds of boxes.
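One way to keep that black-box view honest in code is to program against a minimal interface like the sketch below (the names are made up); whether it is backed by a neural network, a GA-evolved network, or something else entirely then becomes an implementation detail:

// A conceptual sketch, not a real library interface.
public interface IFunctionApproximator
{
    // Training: show pairs of input vectors and the value each should map to,
    // and assume it gets better the more pairs it sees.
    void Train(double[][] inputs, double[] expectedOutputs);

    // Testing: put in a number of real values (e.g. features of one network flow)
    // and get a single value out (e.g. an intrusion-likelihood score).
    double Predict(double[] input);
}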

How to search for a word in a book programmatically?

I need to develop an application that can search through a book and list out all the pages and lines that contain a given keyword.
For books that are split up in some other way, such as a Bible, which is split up by chapter and verse, users would be able to search for all verses that contain a certain keyword, or alternatively search within certain chapters and verses for a keyword.
What format should I store the book in? Should it be stored in a SQL database?
What format would be easiest for searching as opposed to easiest for storage?
It kind of depends on the environment you want to run it in, and how many queries you expect per second.
The fastest approach is to store every word in an in-memory hashtable, with the values containing references to the chapters/verses (or whatever you call them) you want to retrieve.
But this may not scale well if the book is very large, or the client is very thin.
You could store every verse in a database record and search with full-text search. But if you need to host the app on a website, you need to ensure that the hosting costs of the database of your choice do not exceed your budget.
If your application load can handle it, you can also store every verse in a text file (plain text, XML, or any other format) and scan each file, preferably with XPath or regular expressions. A very cheap and easy solution that you can make as advanced as you like, but probably slower. Then again, if you only need to service one request per hour, why not?
I would use the database with full-text search, since that scales best.
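As an illustration of the in-memory hashtable option (the verse text and reference format here are just examples), the index is a dictionary from each word to the chapter:verse references it appears in, so a keyword lookup is a single dictionary hit:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class VerseIndexSketch
{
    // word -> set of "chapter:verse" references containing that word
    private static readonly Dictionary<string, HashSet<string>> Index =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    private static void AddVerse(string reference, string text)
    {
        foreach (Match match in Regex.Matches(text, @"[A-Za-z']+"))
        {
            if (!Index.TryGetValue(match.Value, out var refs))
                Index[match.Value] = refs = new HashSet<string>();
            refs.Add(reference);
        }
    }

    static void Main()
    {
        AddVerse("1:1", "In the beginning God created the heaven and the earth.");
        AddVerse("1:3", "And God said, Let there be light: and there was light.");

        Console.WriteLine(Index.TryGetValue("light", out var hits)
            ? string.Join(", ", hits)   // prints: 1:3
            : "no matches");
    }
}

Persisting the same structure to a database (one row per word-reference pair, or full-text indexing on the verse text) trades the memory footprint for hosting cost, which is the trade-off described above.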
Years ago there was a Bible already stored in an Access database that I used to make an application exactly like what you're talking about. The Access DB was a free download. A few years back I ran across one in XML. I can't do it from work, but I would recommend searching for "Access Bible" or "XML Bible" and seeing if you can find it. (I think the original Access one may have been called ASP Bible.) At any rate, if you can find it, it should give you a good idea of how you can structure your database.
Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.
I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.
Do you expect multiple queries for the same book? I.e., do you want to do per-book preprocessing that may take a lot of time but has to be done only once per book? Otherwise, Boyer-Moore is probably the best way to go.
Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hashtable is probably fastest. If you want to look for parts of words, I'd suggest a suffix tree.
When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.
You could look into the Boyer-Moore algorithm (this also contains a link to the original paper).
Unfortunately, the Boyer-Moore algorithm is much faster on longer strings than it is on short 'keyword' searches. So, for keyword searching you might want to implement some sort of crawler that could index likely search terms.
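For completeness, here is a small sketch of the Boyer-Moore-Horspool variant (a simplified form of Boyer-Moore that keeps only the bad-character shift table), which is enough to see where the speed comes from:

using System;
using System.Collections.Generic;

static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;

        // Bad-character table: how far we may shift when the text character under the
        // pattern's last position is c (characters not in the table allow a full shift).
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos; // full match

            char last = text[pos + pattern.Length - 1];
            pos += shift.TryGetValue(last, out var s) ? s : pattern.Length;
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOf("In the beginning God created the heaven", "heaven")); // 33
    }
}

The longer the pattern, the larger the typical shift, which is why the algorithm shines on long search strings and gains little on short keywords.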
Another troubling consideration is that in most books chapters are contained on only certain pages, whereas with a bible, the chapters and verses could be split across multiple pages, and the pages could contain multiple verses and chapters.
This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).
A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.
I think the first and most important task is to hammer down and harden your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.
def findWord(keyword):
    f = open("book.txt")
    for line in f:  # horribly bad performance for a large block of text
        if line.find(keyword) > -1:
            print(line)
Substitute each line with a block of text for your specific Bible example. How you store the text is really irrelevant; all you're doing is searching some given text (most likely in a loop) for a keyword.
If you want to search by line numbers and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on whichever field is relevant.
FYI - the code above is Python.
