I need to develop an application that can search through a book and list out all the pages and lines that contain a given keyword.
For books that are divided some other way, such as a Bible, which is split into chapters and verses, users should be able to search for all verses that contain a certain keyword, or alternatively search for a keyword only within certain chapters and verses.
What format should I store the book into? Should it be stored into a SQL database?
What format would be easiest for searching as opposed to easiest for storage?
It kind of depends on the environment you want to run it in, and how many queries you expect per second.
The fastest approach is to keep every word in an in-memory hashtable, where each value holds references to the chapters and verses (or whatever units you want to retrieve) in which that word occurs.
But this may not scale well if the book is very large, or the client is very thin.
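A minimal sketch of such an in-memory index, assuming the book has already been split into a dict keyed by (chapter, verse). The names build_index and lookup and the sample text are made up for illustration:

```python
import re
from collections import defaultdict

def build_index(verses):
    """Map each lowercased word to the set of references that contain it.

    `verses` is a hypothetical dict of {(chapter, verse): text}.
    """
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(ref)
    return index

def lookup(index, keyword):
    """Return the sorted references for a whole-word keyword."""
    return sorted(index.get(keyword.lower(), ()))

# Placeholder text, not taken from any real book.
verses = {
    (1, 1): "In the beginning was the word",
    (1, 2): "The word was with the author",
    (2, 1): "A new chapter begins here",
}
index = build_index(verses)
print(lookup(index, "word"))  # [(1, 1), (1, 2)]
```

Building the index costs one pass over the book; after that each whole-word lookup is a single dictionary access, which is why memory use rather than speed becomes the limit.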
You could store every verse in a database record and search with full-text search. But if you need to host the app on a website, make sure the hosting costs of your chosen database do not exceed your budget.
If your application load can handle it, you can also store every verse in a text file (plain text, XML, or any other format) and scan each file, preferably with XPath or regular expressions. This is a very cheap and easy solution that you can make as advanced as you like, but it is probably slower. Then again, if you only need to serve one request per hour, why not?
I would use the database with full-text search, since that scales best.
Years ago there was a Bible already stored in an Access database that I used to make an application exactly like what you're talking about. The Access DB was a free download. A few years back, I ran across one in XML. I can't do it from work, but I would recommend searching for Access Bible or XML Bible to see if you can find it. (I think the original Access one may have been called ASP Bible.) At any rate, if you can find it, it should give you a good idea of how you can structure your database.
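As a rough illustration of the file-scanning approach, here is a sketch that parses a small XML book and filters verses with a regular expression. The element and attribute names (book, chapter, verse, n) and the sample text are assumptions, not a real Bible schema. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical layout; a real XML book will use its own schema.
BOOK_XML = """
<book>
  <chapter n="1">
    <verse n="1">The river ran through the valley.</verse>
    <verse n="2">No one crossed it in winter.</verse>
  </chapter>
  <chapter n="2">
    <verse n="1">In spring the river rose again.</verse>
  </chapter>
</book>
"""

def search_xml(xml_text, keyword, chapter=None):
    """Return (chapter, verse) pairs whose text contains the whole word."""
    root = ET.fromstring(xml_text)
    # Restrict to one chapter with an attribute predicate when requested.
    path = "chapter" if chapter is None else f"chapter[@n='{chapter}']"
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for ch in root.findall(path):
        for verse in ch.findall("verse"):
            if pattern.search(verse.text or ""):
                hits.append((int(ch.get("n")), int(verse.get("n"))))
    return hits

print(search_xml(BOOK_XML, "river"))             # [(1, 1), (2, 1)]
print(search_xml(BOOK_XML, "river", chapter=2))  # [(2, 1)]
```

Every query re-reads and re-parses the whole document, which is what makes this approach slow but also free of any setup.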
Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.
I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.
Do you expect multiple queries for the same book? That is, are you willing to do per-book preprocessing that may take a lot of time but has to be done only once per book? Otherwise, Boyer-Moore is probably the best way to go.
Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hashtable is probably fastest. If you want to look for parts of words, I'd suggest a suffix tree.
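A full suffix tree is more code than fits here, so the sketch below swaps in a simpler relative: a sorted list of every suffix of every word, searched with binary search. It answers the same "which words contain this fragment" question, but uses more memory than a real suffix tree. The function names are made up for illustration:

```python
import bisect

def build_suffix_list(words):
    """Return a sorted list of (suffix, word) pairs for every suffix of every word."""
    pairs = set()
    for word in words:
        w = word.lower()
        for i in range(len(w)):
            pairs.add((w[i:], w))
    return sorted(pairs)

def words_containing(suffixes, fragment):
    """Find all words that contain `fragment` anywhere inside them.

    A fragment occurs in a word exactly when it is a prefix of one of that
    word's suffixes, so the matches form one contiguous run in sorted order.
    """
    fragment = fragment.lower()
    start = bisect.bisect_left(suffixes, (fragment,))
    found = set()
    for suffix, word in suffixes[start:]:
        if not suffix.startswith(fragment):
            break
        found.add(word)
    return sorted(found)

suffixes = build_suffix_list(["light", "night", "lightning", "garden"])
print(words_containing(suffixes, "ight"))  # ['light', 'lightning', 'night']
```

Once you have the matching words, a word-to-location hashtable can map them back to pages or verses.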
When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.
You could look into the Boyer-Moore algorithm (the linked description also points to the original paper).
Unfortunately, the Boyer-Moore algorithm is much faster on longer strings than it is on short 'keyword' searches. So, for keyword searching you might want to implement some sort of crawler that could index likely search terms.
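For reference, here is a sketch of the Boyer-Moore-Horspool variant, a simplification of Boyer-Moore that keeps only the bad-character shift table. It is not the full algorithm from the original paper. The shift can never exceed the pattern length, which shows why short keywords gain little:

```python
def horspool_search(text, pattern):
    """Return the start index of every occurrence of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: how far to slide the pattern when the text
    # character under the pattern's last position is `ch`.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters that do not occur in the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool_search("abracadabra", "abra"))  # [0, 7]
```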
Another troubling consideration is that in most books chapters are contained on only certain pages, whereas with a bible, the chapters and verses could be split across multiple pages, and the pages could contain multiple verses and chapters.
This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).
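One possible way around the boundary problem, sketched under the assumption that the verses are available in reading order: search one joined string and map each match back to the verses it touches. The helper names and sample text are invented for illustration:

```python
import bisect

def build_text(verses):
    """Join verses into one string and record where each verse starts.

    `verses` is a hypothetical list of ((chapter, verse), text) pairs.
    """
    parts, starts, refs = [], [], []
    offset = 0
    for ref, text in verses:
        starts.append(offset)
        refs.append(ref)
        parts.append(text)
        offset += len(text) + 1  # +1 for the joining space
    return " ".join(parts), starts, refs

def find_phrase(full_text, starts, refs, phrase):
    """Return, for each match, the list of verse references it spans."""
    results = []
    pos = full_text.find(phrase)
    while pos != -1:
        # The verse containing an offset is the last one starting at or before it.
        first = bisect.bisect_right(starts, pos) - 1
        last = bisect.bisect_right(starts, pos + len(phrase) - 1) - 1
        results.append(refs[first:last + 1])
        pos = full_text.find(phrase, pos + 1)
    return results

verses = [
    ((1, 1), "The sun set over"),
    ((1, 2), "the quiet hills and"),
    ((1, 3), "the town slept"),
]
full_text, starts, refs = build_text(verses)
print(find_phrase(full_text, starts, refs, "set over the quiet"))
# [[(1, 1), (1, 2)]]
```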
A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.
I think the first and most important task is to hammer down and harden your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.
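def find_word(keyword, path="book.txt"):
    # Linear scan: horribly bad performance for a large block of text,
    # but it needs no preprocessing at all.
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            if keyword in line:
                print(line_number, line.rstrip())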
Substitute each line for a block of text for your specific bible example. How you store the text is really irrelevant. All you're doing is searching some given text (most likely in a loop), for a keyword.
If you want to search by line number and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on whichever fields are relevant.
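A small sketch of that database layout using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per line, with the fields you want to filter on.
conn.execute(
    "CREATE TABLE book_lines (page INTEGER, line INTEGER, chapter INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO book_lines VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "The day began with rain over the harbour"),
        (1, 2, 1, "and the boats stayed tied to the quay"),
        (2, 1, 2, "By the next day, nothing had changed"),
    ],
)

def search(conn, keyword, chapter=None):
    """Return (page, line) pairs containing the keyword, optionally in one chapter."""
    sql = "SELECT page, line FROM book_lines WHERE text LIKE ?"
    params = [f"%{keyword}%"]  # note: % and _ inside keyword act as wildcards
    if chapter is not None:
        sql += " AND chapter = ?"
        params.append(chapter)
    sql += " ORDER BY page, line"
    return conn.execute(sql, params).fetchall()

print(search(conn, "day"))             # [(1, 1), (2, 1)]
print(search(conn, "day", chapter=2))  # [(2, 1)]
```

LIKE with a leading wildcard scans every row, so for a large book a full-text index would be the next step.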
FYI - the code above is Python.
I am implementing full-text search on a single entity, document, which contains a name and content. The content can be quite big (20+ pages of text). I am wondering how to do it.
Currently I am looking at using Redis and RediSearch, but I am not sure if it can handle search in big chunks of text. We are talking about a multitenant application with each customer having more than 1000 documents that are quite big.
TLDR: What to use to search into big chunks of text content.
This space is a bit unclear to me, sorry for the confusion. Will update the question when I have more clarity.
I can't tell you what the right answer is, but I can give you some ideas about how to decide.
Normally, if I had documents/content in a DB I'd be inclined to search there, assuming that the search functionality I could implement (a) was functionally effective enough, (b) didn't require code that was super ugly, and (c) wasn't going to kill the database. There's usually a lot of messing around in implementing the search features and filters you want to provide to the user: UI components, logic components, and then translating all of that into how the database and query language actually work.
So, based on what you've said, the key trade-offs are probably:
Functionality / functional fit (creating the features you need, to work in a way that's useful).
Ease of development & maintenance.
Performance - purely on the basis that gathering search results across "documents" is not necessarily the fastest thing you can do with an IT system.
Have you tried doing a simple whiteboard "options analysis" exercise? If not try this:
Get a small number of interested and smart people around a whiteboard. You can do this exercise alone, but bouncing ideas around with others is almost always better.
Agree what the high level options are. In your case you could start with two: one based on MSSQL, the other based on Redis.
Draw up a big table - each option has its own column (starting at column 2).
In column 1, list all the important things that will drive your decision, e.g. functional fit, ease of development & maintenance, performance, cost, etc.
For each driver in column 1, do a score for each option.
How you do it is up to you: you could use a 1-5 point system (optionally you could use planning poker type approach to avoid anchoring) or you could write down a few key notes.
Be ready to note down any questions that come up, important assumptions, etc so they don't get lost.
Sometimes as you work through the exercise the answer becomes obvious. If it's really close you can rely on scores - but that's not ideal. It's more likely that of all the drivers listed some will be more important than others, so don't ignore the significance of those.
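If you do end up leaning on scores, weighting the drivers is one way to reflect that some matter more than others. The sketch below is only an illustration: the option names, weights, and 1-5 scores are placeholders, not an assessment of any real product:

```python
def weighted_totals(scores, weights):
    """Sum each option's scores, multiplied by the importance of each driver.

    scores:  {option: {driver: score from 1 to 5}}
    weights: {driver: relative importance}
    """
    return {
        option: sum(weights[driver] * score for driver, score in by_driver.items())
        for option, by_driver in scores.items()
    }

# Placeholder numbers purely to show the arithmetic.
weights = {"functional fit": 3, "ease of development": 2, "performance": 2, "cost": 1}
scores = {
    "Option A": {"functional fit": 4, "ease of development": 3, "performance": 2, "cost": 5},
    "Option B": {"functional fit": 3, "ease of development": 4, "performance": 5, "cost": 2},
}
print(weighted_totals(scores, weights))  # {'Option A': 27, 'Option B': 29}
```

A two-point gap like this is exactly the "really close" case, where the notes and assumptions gathered at the whiteboard should carry more weight than the totals.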
We use a solution in C#.NET where someone can call a phone number and speak a person's first and then last name. The name is then entered in a guest registry on our website. We use an XML dictionary file with 5,000 first names and 89,000 last names that we got from the US Census. We are using the Microsoft.Speech.Recognition library (maybe that's the problem).
Our problem is that even with relatively easy names like Joshua McDaniels we are getting about a 30% fail rate. The performance (speed-wise) is fine; it just fails to recognize a good portion of the names.
Now, I understand that ultimately the quality of the spoken name will dictate, sorry for the pun, how well the system performs, but we would like to get close to 99% in "laboratory" conditions with perfect enunciation and no accent, and then call it good. But even after several trials with the same person speaking the same name on the same phone in the same environment, we are getting a 25% fail rate.
My question is: Does anyone have an idea of a better way to go after this? We thought of maybe trying to use an API, that way the matches would be more relevant and current.
The current state of the technology is that it is very hard to recognize names, let alone a large list of them. You can recognize names from a phone book (500 entries) with good quality, but with thousands of them it is very hard. Speech recognition engines are certainly not designed for that, in particular offline ones like System.Speech.
You might get way better results with online systems like https://www.projectoxford.ai which use advanced DNN acoustic models and bigger vocabularies.
There were whole big companies built around the capability to recognize large name lists; for example, Novauris used patented technology for that. You might consider building something like that using an open source engine, but it would be a large undertaking anyway.
I have a text file with around 300,000 words. Each word is 5 letters.
I'd like to be able to determine how unique each word is on the internet.
An idea I had was to Google the word and see how many results it yielded. Unfortunately, this is against their TOS.
I was trying to think of any other way but it would have to involve querying some website a lot and I doubt they would appreciate that much.
Anyone have any other ideas? Programming language doesn't matter that much but I would prefer C#.
To look up the frequency 'in books' you could use the Google Ngram dataset, but that's not 'for the internet'. If this is for academic purposes the Bing alternative might work also and it is based on internet-frequencies.
If your words do not contain slang, I would recommend looking at public domain books. The issue here is that most of these books will be older, so you really will be getting a snapshot in time of how popular a word is (or I guess was). The plus side is that these books are freely available in text file format allowing you to easily mine them for data.
One thing to note, if you're in the US and plan on using Project Gutenberg to get the books, they have a rule that the website is intended only for human users. There is a page that tells you how to get the same data via mirror.
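Once you have some public domain text on disk, counting how often each of your words appears is straightforward. This sketch assumes the corpus has already been read into a string. The function name and sample sentence are made up:

```python
import re
from collections import Counter

def word_frequencies(corpus_text, words):
    """Return each word's share of all word tokens in the corpus (0.0 if absent)."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    total = sum(counts.values())
    return {w: (counts[w.lower()] / total if total else 0.0) for w in words}

# Placeholder corpus; in practice this would be the text of many books.
corpus = "The quick brown fox. The lazy dog!"
freqs = word_frequencies(corpus, ["quick", "zebra"])
print(freqs["zebra"])  # 0.0
```

Words with a frequency of zero or close to it in a large corpus would be your most "unique" candidates, with the caveat about older vocabulary mentioned above.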
I have a Lucene index with a lot of text data, and each item has a description. I want to extract the most common words from each description and generate tags to classify each item. Is there a Lucene.NET library for doing this, or any other library for text classification?
No. Lucene.NET can do search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so a more detailed description may be needed.
But generally, the easiest way is to use an external service. All of these services have a REST API, and it's very easy to interact with them from C#.
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
There are also good Java SDKs such as Mahout. As I remember, Mahout can also be used as a service, so integrating with it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not the way for you. Then take a look at the DBpedia project and RDF.
And last, you can at least use an implementation of the Naive Bayes algorithm. It's easy, and everything will be under your control.
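To give a feel for how little code a basic Naive Bayes classifier needs, here is a minimal multinomial version with add-one smoothing. It is written in Python for brevity rather than C#, and the class name, labels, and training sentences are invented for illustration:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("cheap flights and hotel deals", "travel")
nb.train("book a hotel near the beach", "travel")
nb.train("install the compiler and run the tests", "programming")
nb.train("debug the code with unit tests", "programming")
print(nb.classify("hotel by the beach"))  # travel
```

The labels here play the role of tags; real descriptions would need far more training examples per tag than this toy set.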
This is a very hard problem but if you don't want to spend time on it you can take all words which have between 5% and 10% frequency in the whole document. Or, you simply take the most common 5 words.
Doing tag extraction well is very very hard. It is so hard that whole companies live from webservices exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example pairs) which you can use to find multi-word tags.
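The quick-and-dirty route described above (stopword removal, most common words, repeated word pairs) might look like the following sketch. The stopword list is a tiny placeholder; a real one would come from a published list:

```python
import re
from collections import Counter

# Tiny placeholder list; substitute a full stopword list in practice.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}

def extract_tags(text, top_words=5, top_pairs=3):
    """Return the most common non-stopwords plus any repeated adjacent word pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Only count pairs of words that were adjacent in the original text.
    pair_counts = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    )
    tags = [word for word, _ in word_counts.most_common(top_words)]
    tags += [" ".join(pair) for pair, n in pair_counts.most_common(top_pairs) if n > 1]
    return tags

text = "The search engine indexes text. A search engine ranks text for the user."
tags = extract_tags(text)
print(tags)
```

With this sample, "search engine" appears twice and so is picked up as a multi-word tag alongside the single words.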
I am looking for a good method to extract relevant keywords from text on a page using SQL or C#. I intend to use this to link these keywords to other parts of the website to navigate to relevant content. This seems pretty common across some blogs.
One simple approach might be to download the page into memory using C#, filter out HTML tags, JavaScript, etc. (i.e. identify the real content), break that up into individual words, filter out words that appear with a high frequency in any generic written document, count how often each remaining word occurs in the document, and take the words that appear most often as keywords.
You would need to develop your filtered word list over time.
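The steps above can be sketched with the standard library alone. This version is in Python rather than C#, and the class name, the short common-word list, and the sample page are all placeholders:

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

# Placeholder list that you would grow over time.
COMMON_WORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "is"}

def page_keywords(html, top=3):
    """Return the most frequent words in the page text, excluding common words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return [word for word, _ in counts.most_common(top)]

html = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var cycling = 1;</script></head>"
    "<body><h1>Cycling routes</h1>"
    "<p>The best cycling routes for a weekend of cycling.</p></body></html>"
)
print(page_keywords(html, top=2))  # ['cycling', 'routes']
```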
Depending on your domain it might be more appropriate to go about this the opposite way and build up a list of domain-specific keywords (or groups of keywords, so that "seatbelt" and "safety belt" etc would be recognised as the same word), and find how many times each word or word group appears in a given document. Those above a certain threshold, or top 5 or something, would be the keywords associated with that document.
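A sketch of the keyword-group idea, assuming a hand-built mapping from a canonical keyword to its variants. The groups, the threshold, and the crude plural handling are illustrative choices only:

```python
import re

# Hypothetical domain list: canonical keyword -> spellings that count as the same thing.
KEYWORD_GROUPS = {
    "seatbelt": ["seatbelt", "seat belt", "safety belt"],
    "airbag": ["airbag", "air bag"],
}

def group_counts(text, groups=KEYWORD_GROUPS):
    """Count how often any variant of each keyword group appears in the text."""
    text = text.lower()
    counts = {}
    for name, variants in groups.items():
        alternatives = "|".join(re.escape(v) for v in variants)
        # Optional trailing "s" is a crude way to also catch simple plurals.
        pattern = r"\b(?:" + alternatives + r")s?\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

def document_keywords(text, threshold=2):
    """Return the groups that appear at least `threshold` times."""
    return sorted(name for name, n in group_counts(text).items() if n >= threshold)

text = "Always wear a seatbelt. A safety belt saves lives, and seat belts work with the airbag."
print(group_counts(text))       # {'seatbelt': 3, 'airbag': 1}
print(document_keywords(text))  # ['seatbelt']
```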
There's a good informative answer from Joseph Turian to a more general version of this question on: How do I extract keywords used in text?