I have a text file with around 300,000 words. Each word is 5 letters.
I'd like to be able to determine how unique each word is on the internet.
An idea I had was to Google the word and see how many results it yielded. Unfortunately, this is against their TOS.
I was trying to think of another way, but it would have to involve querying some website a lot, and I doubt they would appreciate that much.
Anyone have any other ideas? Programming language doesn't matter that much, but I would prefer C#.
To look up the frequency 'in books' you could use the Google Ngram dataset, but that's not 'for the internet'. If this is for academic purposes, the Bing alternative might also work, and it is based on internet frequencies.
If your words do not contain slang, I would recommend looking at public domain books. The issue here is that most of these books will be older, so you really will be getting a snapshot in time of how popular a word is (or I guess was). The plus side is that these books are freely available in text file format, allowing you to easily mine them for data.
One thing to note: if you're in the US and plan on using Project Gutenberg to get the books, they have a rule that the website is intended only for human users. There is a page that tells you how to get the same data via a mirror.
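As a rough illustration of the mining step, here is a minimal C# sketch (the file and folder names are hypothetical) that counts how often each of your 5-letter words appears across a folder of downloaded plain-text books; lower counts suggest rarer, more "unique" words in that corpus:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class WordFrequency
{
    static void Main()
    {
        // Your 300,000 five-letter words, one per line (hypothetical path).
        var counts = File.ReadLines("words.txt")
                         .Select(w => w.Trim().ToLowerInvariant())
                         .Distinct()
                         .ToDictionary(w => w, _ => 0L);

        // Each downloaded public-domain book as plain text (hypothetical folder).
        foreach (var book in Directory.EnumerateFiles("books", "*.txt"))
        {
            foreach (Match m in Regex.Matches(File.ReadAllText(book), @"[A-Za-z]+"))
            {
                var token = m.Value.ToLowerInvariant();
                if (counts.ContainsKey(token))
                    counts[token]++;
            }
        }

        // Print the 20 rarest words found in this corpus.
        foreach (var pair in counts.OrderBy(p => p.Value).Take(20))
            Console.WriteLine($"{pair.Key}\t{pair.Value}");
    }
}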
We use a solution in C#.NET where someone can call a phone number and speak a person's first and then last name. The name is then entered on a guest registry on our website. We use an XML dictionary file with 5,000 first names and 89,000 last names that we got from the US Census. We are using the Microsoft.Speech.Recognition library (maybe that's the problem).
Our problem is that even with relatively easy names like Joshua McDaniels we are getting about a 30% fail rate. The performance (speed-wise) is fine; it just doesn't recognize a good portion of the names.
Now, I understand that ultimately the quality of the spoken name will dictate (sorry for the pun) how well the system performs, but what we would like is to get close to 99% under "laboratory" conditions, with perfect enunciation and no accent, and then call it good. But even after several trials with the same person speaking the same name, on the same phone, in the same environment, we are getting a 25% fail rate.
My question is: does anyone have an idea of a better way to go after this? We thought of maybe trying to use an API; that way the matches would be more relevant and current.
The current state of the technology is that it is very hard to recognize names, let alone a large list of them. You can recognize names from a phone book (500 entries) with good quality, but for thousands of them it is very hard. Speech recognition engines are certainly not designed for that, in particular offline ones like System.Speech.
You might get way better results with online systems like https://www.projectoxford.ai, which use advanced DNN acoustic models and bigger vocabularies.
Whole companies were built around the capability to recognize large name lists; Novauris, for example, used patented technology for that. You might consider building something like that using an open source engine, but it would be a large undertaking in any case.
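If you stay with the offline engine, constraining it to an explicit, as-small-as-possible name grammar is where it does best. A minimal sketch using System.Speech (the name lists here are hypothetical placeholders; in practice you would load them from your XML dictionary, ideally trimmed to the most likely candidates):

using System;
using System.Speech.Recognition;

class NameGrammarDemo
{
    static void Main()
    {
        // Hypothetical short lists; load and trim your real dictionary here.
        var firstNames = new Choices("Joshua", "Maria", "David");
        var lastNames  = new Choices("McDaniels", "Garcia", "Nguyen");

        var builder = new GrammarBuilder();
        builder.Append(firstNames);
        builder.Append(lastNames);

        using (var engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new Grammar(builder));
            engine.SetInputToDefaultAudioDevice();

            // Single synchronous recognition pass for the spoken name.
            RecognitionResult result = engine.Recognize();
            if (result != null)
                Console.WriteLine($"{result.Text} (confidence {result.Confidence:F2})");
        }
    }
}

The smaller and more closed the Choices lists are, the better an offline engine tends to do, which is why large open name lists are such a hard case.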
I wasn't quite sure how to word this question, as this is a field I am not very familiar with, and I'm seeking less of a specific solution and more of what I should be looking to learn to better understand the problem...
If this is to be closed as a result, please suggest ways I can better express the question, as I would very much like to get some input.
Basically the problem is this: I have several different tables of data, each of which identifies different properties of a user. For example, one table might define a user's demographic data (gender, location, etc.), another their interests, and another perhaps their favorite songs.
I want to be able to issue different searches of this data via an ASP.NET MVC application, but rather than finding specific matches (such as, say, a song title), I want to be able to do something like "women who like burgers and live in Texas".
Clearly this is a more dynamic search than a simple keyword search, because the criteria can vary by which data is being searched, what combinations of data are being aggregated, and what actually constitutes a match on each parameter.
If I want to research the different ways something like this can be accomplished, what should I look for? Is this something functional programming could help resolve? Or perhaps dynamic LINQ? I've seen some docs on expression trees which went completely over my head but looked promising. However, I wasn't sure this would fit, because the data may change as well (such as new tables being added), and I'm not sure if that is something that needs to be fully defined ahead of time.
What concepts, algorithms and patterns should I explore that might help me create such a system?
I'm happy to learn, but this is something I'm completely in the dark about and don't even know where to begin, so any introductory concepts that I can start exploring would be greatly appreciated.
EDIT: I just realized I missed one important requirement, which is that these searches also need to be saved. So in addition to dynamically searching the data, I also need a way to persist these searches.
The closest thing I can think of that does something like this is a CRM or project management tool which lets you build queries on the fly and save them to be run on demand or on a schedule...
What are some of the strategies that these systems use? The more time I spend researching Dynamic LINQ the better it seems, but I'm not sure if I am on the right track. To make the expression-tree idea concrete for myself, the sketch below is roughly what I understand it allows.
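Something like this, with hypothetical entity and property names, seems to be what expression trees enable: a predicate is assembled at runtime from a set of criteria, and the criteria themselves could be serialized to persist a saved search.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

class User
{
    public string Gender { get; set; }
    public string Location { get; set; }
}

class DynamicSearchSketch
{
    // Builds something like: u => u.Gender == "F" && u.Location == "Texas"
    static Func<User, bool> BuildPredicate(Dictionary<string, string> criteria)
    {
        var parameter = Expression.Parameter(typeof(User), "u");
        Expression body = Expression.Constant(true);

        foreach (var criterion in criteria)
        {
            var comparison = Expression.Equal(
                Expression.Property(parameter, criterion.Key),
                Expression.Constant(criterion.Value));
            body = Expression.AndAlso(body, comparison);
        }

        return Expression.Lambda<Func<User, bool>>(body, parameter).Compile();
    }

    static void Main()
    {
        var users = new List<User>
        {
            new User { Gender = "F", Location = "Texas" },
            new User { Gender = "M", Location = "Ohio" }
        };

        // The criteria dictionary is plain data, so it can be stored and re-run later.
        var criteria = new Dictionary<string, string>
        {
            ["Gender"] = "F",
            ["Location"] = "Texas"
        };

        foreach (var match in users.Where(BuildPredicate(criteria)))
            Console.WriteLine($"{match.Gender} in {match.Location}");
    }
}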
I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from the description and generate tags to classify each item based on it. Is there a Lucene.NET library for doing this, or any other library for text classification?
No, Lucene.NET provides search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest to you depends on your requirements, so maybe more description is needed.
But generally the easiest way is to use external services. They all have a REST API, and it's very easy to interact with one from C# (see the sketch after the list below).
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
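Calling any of these from C# looks roughly like the following. The URL, header name, and JSON payload are placeholders only, since each service documents its own endpoint and request format:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ClassifyClient
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        {
            // Placeholder auth header; check the service's documentation.
            http.DefaultRequestHeaders.Add("x-api-key", "YOUR_API_KEY");

            var content = new StringContent(
                "{\"text\": \"description of the item to tag\"}",
                Encoding.UTF8, "application/json");

            // Placeholder URL; substitute the real endpoint of the chosen service.
            var response = await http.PostAsync("https://api.example.com/v1/classify", content);

            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}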
There are also good Java toolkits like Mahout. As I remember, Mahout can also be run as a service, so integrating with it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not your way. Then take a look at the DBpedia project and RDF.
Finally, you can use an implementation of the Naive Bayes algorithm. It's simple, and everything will be under your control.
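If you go that route, a minimal multinomial Naive Bayes sketch looks like this. The categories and training descriptions are purely illustrative; you would train it on your own hand-labeled items:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayes
{
    readonly Dictionary<string, Dictionary<string, int>> wordCounts = new Dictionary<string, Dictionary<string, int>>();
    readonly Dictionary<string, int> docCounts = new Dictionary<string, int>();
    readonly HashSet<string> vocabulary = new HashSet<string>();
    int totalDocs;

    static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);

    public void Train(string label, string text)
    {
        if (!wordCounts.ContainsKey(label)) { wordCounts[label] = new Dictionary<string, int>(); docCounts[label] = 0; }
        docCounts[label]++; totalDocs++;
        foreach (var word in Tokenize(text))
        {
            vocabulary.Add(word);
            wordCounts[label].TryGetValue(word, out int c);
            wordCounts[label][word] = c + 1;
        }
    }

    public string Classify(string text)
    {
        var tokens = Tokenize(text).ToList();
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (var label in wordCounts.Keys)
        {
            // Log prior plus log likelihood of each token.
            double score = Math.Log((double)docCounts[label] / totalDocs);
            double totalWords = wordCounts[label].Values.Sum();
            foreach (var word in tokens)
            {
                wordCounts[label].TryGetValue(word, out int c);
                // Laplace smoothing so unseen words don't zero out the score.
                score += Math.Log((c + 1.0) / (totalWords + vocabulary.Count));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    static void Main()
    {
        var nb = new NaiveBayes();
        nb.Train("electronics", "quad core processor fast chip");
        nb.Train("kitchen", "stainless steel pan non stick");
        Console.WriteLine(nb.Classify("fast quad core chip")); // electronics
    }
}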
This is a very hard problem, but if you don't want to spend much time on it, you can take all words which have between 5% and 10% frequency in the whole document, or simply take the 5 most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example pairs) which you can use to find multi-word tags.
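A quick sketch of the cheap version of this (frequency count plus stopword removal); the stopword list here is a tiny placeholder, so use a full list from the internet in practice:

using System;
using System.Linq;

class SimpleTagger
{
    // Placeholder stopword list; replace with a complete one.
    static readonly string[] StopWords = { "the", "a", "an", "and", "or", "of", "to", "in", "is", "over" };

    static string[] ExtractTags(string description, int count = 5)
    {
        return description
            .ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', ';', ':', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !StopWords.Contains(word))
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())   // most frequent words first
            .Take(count)
            .Select(g => g.Key)
            .ToArray();
    }

    static void Main()
    {
        var tags = ExtractTags("The quick brown fox jumps over the lazy dog and the quick cat.");
        Console.WriteLine(string.Join(", ", tags)); // e.g. quick, brown, fox, jumps, lazy
    }
}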
I need to develop an application that can search through a book and list out all the pages and lines that contain a given keyword.
For books that are split up in some other way, such as the Bible, which is split up by chapter and verse, users would be able to search for all verses that contain a certain keyword, or alternatively search within certain chapters and verses for a keyword.
What format should I store the book into? Should it be stored into a SQL database?
What format would be easiest for searching as opposed to easiest for storage?
It kind of depends on the environment you want to run it on, and how many queries you expect per second.
The fastest approach is to store every word in an in-memory hashtable, where the values contain references to the chapters/verses (or whatever you call them) that you want to retrieve.
But this may not scale well if the book is very large, or the client is very thin.
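A minimal sketch of that in-memory index, assuming you already have the verses as (reference, text) pairs; once it is built, a lookup is a single dictionary access:

using System;
using System.Collections.Generic;
using System.Linq;

class VerseIndex
{
    // Maps each word to the references of the verses that contain it.
    readonly Dictionary<string, List<string>> index =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string reference, string text)
    {
        var words = text.Split(new[] { ' ', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words.Distinct(StringComparer.OrdinalIgnoreCase))
        {
            if (!index.TryGetValue(word, out var refs))
                index[word] = refs = new List<string>();
            refs.Add(reference);
        }
    }

    public IReadOnlyList<string> Find(string keyword) =>
        index.TryGetValue(keyword, out var refs) ? refs : new List<string>();

    static void Main()
    {
        var idx = new VerseIndex();
        idx.Add("John 3:16", "For God so loved the world");
        idx.Add("Genesis 1:1", "In the beginning God created the heavens and the earth");

        Console.WriteLine(string.Join(", ", idx.Find("God"))); // John 3:16, Genesis 1:1
    }
}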
You could store every verse in a database record and search with full-text search. But if you need to host the app on a website, you need to ensure that the hosting cost of the database of your choice does not exceed your budget.
If your application load can handle it, you can also store every verse in a text file (plain text, XML, or any other format) and scan each file, preferably with XPath or regular expressions. A very cheap and easy solution that you can make as advanced as you like, but probably slower. Then again, if you need to service only one request per hour, why not?
I would use the database with full-text search, since that scales the best.
Years ago there was a Bible already stored in an Access database that I used to make an application exactly like what you're talking about. The Access DB was a free download. A few years back, I ran across one in XML. I can't do it from work, but I would recommend searching for "Access Bible" or "XML Bible" and seeing if you can find it. (I think the original Access one may have been called ASP Bible.) At any rate, if you can find it, it should give you a good idea of how you can structure your database.
Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.
I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.
Do you expect multiple queries for the same book? That is, do you want to do per-book preprocessing that may take a lot of time but has to be done only once per book? If not, Boyer-Moore is probably the best way to go.
Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hashtable is probably fastest. If you want to look for parts of words, I'd suggest a suffix tree.
When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.
You could look into the Boyer-Moore algorithm (the linked page also contains a link to the original paper).
Unfortunately, the Boyer-Moore algorithm is much faster on longer strings than it is on short 'keyword' searches. So, for keyword searching you might want to implement some sort of crawler that could index likely search terms.
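For reference, here is a minimal sketch of the simplified Boyer-Moore-Horspool variant (bad-character rule only), just to show the skipping behavior the algorithm is known for; it returns the index of the first match or -1:

using System;
using System.Collections.Generic;

class HorspoolSearch
{
    static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (text.Length < pattern.Length) return -1;

        // Characters that occur in the pattern (except its last position) get a
        // shift equal to their distance from the end; all other characters
        // allow a shift by the full pattern length.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare the pattern right-to-left against the current window.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos; // full match found at pos

            // Shift based on the character under the window's last position.
            char last = text[pos + pattern.Length - 1];
            pos += shift.TryGetValue(last, out int s) ? s : pattern.Length;
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOf("In the beginning God created the heavens", "God")); // 17
    }
}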
Another troubling consideration is that in most books chapters are contained on only certain pages, whereas with a bible, the chapters and verses could be split across multiple pages, and the pages could contain multiple verses and chapters.
This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).
A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.
I think the first and most important task is to hammer down and harden your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.
def findWord(keyword):
    # Simple linear scan; performance will be poor for a very large text.
    with open("book.txt") as f:
        for line in f:
            if keyword in line:
                print(line)
Substitute a block of text (e.g., a verse) for each line in your specific Bible example. How you store the text is really irrelevant: all you're doing is searching the given text (most likely in a loop) for a keyword.
If you want to search by line numbers and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on whichever field is relevant.
FYI - the code above is Python.
Does anyone know of a "similar words or keywords" algorithm available in open source or via an API? I am looking for something sort of like a thesaurus but smarter.
So for example:
intel
returns:
processor,
i7 core chip,
quad core chip,
.. etc
Any ideas or even something to point me in the right direction in C#?
Edit:
I would love to hear your thoughts, but why can't we just use the Google AdWords API to generate keywords relevant to those entered?
Why not send a search query out to Google and parse what it returns?
Also, check out Google Sets.
There is no algorithm for such a thing. You are going to have to acquire data for a thesaurus and load it into a data structure; then it is a simple dictionary lookup (you can use the C# Dictionary class for that). Maybe you can look at WordNet or the Moby Thesaurus as a source of data. Another option is using a thesaurus server and getting the information online as needed.
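A minimal sketch of the lookup step, assuming you have already parsed one of those sources into word-to-related-words pairs (the sample entries here are purely illustrative):

using System;
using System.Collections.Generic;

class RelatedWords
{
    static void Main()
    {
        // In practice this dictionary would be loaded from WordNet, Moby, etc.
        var thesaurus = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            ["intel"] = new[] { "processor", "i7 core chip", "quad core chip" },
            ["processor"] = new[] { "cpu", "chip", "microprocessor" }
        };

        // Once the data is loaded, lookup is a plain dictionary access.
        if (thesaurus.TryGetValue("intel", out var related))
            Console.WriteLine(string.Join(", ", related));
    }
}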
You will need a large database containing this information. The rest is simple: look up the input and see what related words are stored.
The hard part is generating the database. Doing it manually might take years if you want to cover a large number of words and topics.
Generating it is surely non-trivial. Maybe you could try to download web pages and analyze which words frequently appear together, but I assume this would still take months to build, tune, and finally gather good-quality data from. Extracting links from Wikipedia might also be a good source of information because of its semi-structured nature.
I've made the OpenOffice thesaurus functions available for .NET in the NHunspell project. You can use the OpenOffice thesaurus files.
Here is the NHunspell Project