I have a website with over 400,000 items, some similar, some vastly different. We want to provide the best possible way to search these items. When the website was delivered to us it used full text indexing, a solution that is basic at best and woefully inadequate at worst.
So what is the best way to search these items? They are stored in a SQL Server database (2005), and our website is written in C# 2.0.
Currently here is the process:
User enters value into text box.
We 'clean' this entry, removing 'scary' characters that could be an attempted attack and stripping keywords (and, or, etc.).
Pass value into a stored procedure to return results.
Return results.
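For reference, a minimal sketch of that flow in C# 2.0, assuming a hypothetical SearchItems stored procedure and that connectionString and userInput already exist. A parameterized call like this also removes most of the need to strip 'scary' characters by hand, since the input is never concatenated into SQL text:

using System.Data;
using System.Data.SqlClient;

class ItemSearch
{
    static void Search(string connectionString, string userInput)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("SearchItems", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            // The user's input travels as a parameter, never as SQL text.
            cmd.Parameters.Add("@SearchTerm", SqlDbType.NVarChar, 200).Value = userInput;

            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // map columns to item objects here
                }
            }
        }
    }
}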
Look at Lucene.NET. I think it's a vast improvement over full-text search in SQL Server.
SQL Server Central has a nice article on creating a Google-like Full Text Search using SQL Server. Unfortunately you have to register to view the full article, but registration is free and they post a lot of good information. Here is the link:
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/
Excerpt:
...
Google Style

The key to a successful application is to make it easy to use but powerful. Google has done this with their Web search engine. The syntax for queries is simple and intuitive, but full-featured. Though the basic building blocks of a Google query are simple you can combine them in powerful ways. I'll begin with basic Google query syntax and add some additional operators to take advantage of the power of SQL Server CONTAINS predicate syntax. The full Google syntax is defined in the Google Help:Cheat Sheet at http://www.google.com/help/cheatsheet.html.
...
The article has full example code and even a link to download it. It's an interesting read even if you don't plan on implementing it.
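As a rough illustration of the article's idea (this is not the article's code), a helper like the following turns space-separated terms into a CONTAINS search condition with prefix matching; the Items table and Description column are placeholders:

using System;
using System.Data;
using System.Data.SqlClient;

static class SearchHelper
{
    // Turns `quad core intel` into the condition '"quad*" AND "core*" AND "intel*"'.
    public static string ToContainsCondition(string userInput)
    {
        string[] terms = userInput.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        string[] quoted = new string[terms.Length];
        for (int i = 0; i < terms.Length; i++)
            quoted[i] = "\"" + terms[i].Replace("\"", "") + "*\"";
        return string.Join(" AND ", quoted);
    }

    // Usage: the condition is passed as a parameter, not concatenated into SQL.
    // SqlCommand cmd = new SqlCommand(
    //     "SELECT ItemId, Title FROM Items WHERE CONTAINS(Description, @condition)", conn);
    // cmd.Parameters.Add("@condition", SqlDbType.NVarChar, 400).Value = ToContainsCondition(input);
}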
You can have a look at Lucene.Net; it will minimize the calls to the database for search queries.
From http://incubator.apache.org/lucene.net/:
Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.

Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.

In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
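A minimal indexing-and-searching sketch against the Lucene.Net 3.0.3-style API (names shift slightly between versions):

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class LuceneSketch
{
    static void Main()
    {
        // Build the index once (or incrementally as items change).
        Directory dir = FSDirectory.Open(new System.IO.DirectoryInfo("item-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            Document doc = new Document();
            doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("description", "Intel quad core chip", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        // Searching afterwards never touches the database.
        IndexSearcher searcher = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "description", analyzer);
        TopDocs hits = searcher.Search(parser.Parse("quad core"), 10);
        foreach (ScoreDoc sd in hits.ScoreDocs)
            Console.WriteLine(searcher.Doc(sd.Doc).Get("description"));
    }
}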
You could use Google site search to deliver your search results. It doesn't always give you the flexibility to display the results as you want, but for many sites it is good enough.
The second step is quite controversial - which words do you consider 'scary'? If you use SQL Server's built-in full text search, then instead of manually removing keywords from the input query you can set up lists of noise/stop words inside SQL Server.
Here is one feature I want to see here on Stack Overflow, as well as on any other site that provides search functionality:
give more priority (weight) to some fields of your documents
(in the case of Stack Overflow, search should prioritize the topic title; see the sketch below)
Also consider using a third-party solution for FTS such as Lucene or Sphinx - they can provide a much better user experience than the built-in functionality.
Some advantages of third-party FTS components are: reduced database load, better relevance of search results, faster indexing, and a smaller database.
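In Lucene (and therefore Lucene.Net), the field weighting described above is essentially a one-liner; a minimal sketch with an illustrative boost value:

using Lucene.Net.Documents;

class BoostExample
{
    static Document BuildDoc()
    {
        Document doc = new Document();
        Field title = new Field("title", "Intel quad core chip", Field.Store.YES, Field.Index.ANALYZED);
        title.Boost = 2.0f; // hits in the title weigh roughly twice as much at scoring time
        doc.Add(title);
        doc.Add(new Field("body", "Full description text here", Field.Store.YES, Field.Index.ANALYZED));
        // Query-time alternative: parser.Parse("title:(quad core)^2 OR body:(quad core)")
        return doc;
    }
}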
We are using Azure Search to find courses from a list. We search on three fields. We need fuzzy searches on the Coursename and Keywords fields, but want to include only exact matches for the course code (which has sequential codes like "RB046").
Using the Search Explorer, you can do something like this with the URL:
https://xxx.search.windows.net/indexes/prospectussearchindexlive/docs?api-version=2016-09-01&search=CourseCode:"HCN_6_006" OR Coursename:"HCN_6_006~1" OR Keywords:"HCN_6_006~1"
But in the API it seems you can only have one search term applied to all specified columns. Does anyone know of a way you can do this with the API without performing two searches?
As Bruce Johnston pointed out in the comments, the feature set (especially with respect to search query syntax) should be largely identical between the REST API and the Azure Search .NET SDK. The Search explorer on the Azure portal is literally a call into the REST API, so there shouldn't be any differences there.
The following search API call might translate to what you are looking for (I have included the POST version; you should be able to use GET as well if you'd like).
POST /indexes/prospectussearchindexlive/docs/search?api-version=2016-09-01
{
    "search": "CourseCode:HCN_6_006 OR Coursename:HCN_6_006~1 OR Keywords:HCN_6_006~1",
    "queryType": "full",
    "searchMode": "all"
}
You should take a look at the Lucene query syntax for Azure Search, documented here: https://learn.microsoft.com/en-us/rest/api/searchservice/lucene-query-syntax-in-azure-search - it will help you write different search queries.
You can also refer to the SDK documentation here: https://learn.microsoft.com/en-us/azure/search/search-howto-dotnet-sdk which describes how to use the .NET SDK to perform search queries. Look at the Documents.Search method for more details.
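For illustration, the same query through the Microsoft.Azure.Search .NET SDK might look like the following; the service name, query key, and field names are placeholders taken from the question:

using System;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

class AzureSearchExample
{
    static void Main()
    {
        SearchIndexClient client = new SearchIndexClient(
            "xxx", "prospectussearchindexlive", new SearchCredentials("<query-key>"));

        SearchParameters parameters = new SearchParameters
        {
            QueryType = QueryType.Full,  // enable the full Lucene query syntax
            SearchMode = SearchMode.All
        };

        DocumentSearchResult results = client.Documents.Search(
            "CourseCode:HCN_6_006 OR Coursename:HCN_6_006~1 OR Keywords:HCN_6_006~1",
            parameters);

        foreach (SearchResult result in results.Results)
            Console.WriteLine(result.Document["Coursename"]);
    }
}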
Is there any way to do full-text search on Redis with C#?
Yes, see http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/.
The code is about 100 lines of Python and can be ported to C#.
The code uses the Metaphone Python library; you can find C# implementations online, such as this one: http://code.google.com/p/doublemetaphone/
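For a rough idea of the approach in C#, here is a sketch using StackExchange.Redis; the DoubleMetaphone helper is a stand-in for whichever C# port you choose and is deliberately left unimplemented:

using System;
using System.Linq;
using StackExchange.Redis;

class FuzzyIndex
{
    readonly IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

    public void Index(string docId, string text)
    {
        foreach (string word in text.ToLowerInvariant().Split(' '))
            db.SetAdd("phon:" + DoubleMetaphone(word), docId); // one set per phonetic code
    }

    public RedisValue[] Search(string query)
    {
        RedisKey[] keys = query.ToLowerInvariant().Split(' ')
            .Select(w => (RedisKey)("phon:" + DoubleMetaphone(w))).ToArray();
        return db.SetCombine(SetOperation.Intersect, keys); // docs matching every word
    }

    static string DoubleMetaphone(string word)
    {
        throw new NotImplementedException("plug in a C# Double Metaphone port here");
    }
}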
RediSearch implements a search engine on top of Redis.
This also enables more advanced features, like exact phrase matching and numeric filtering for text queries, that are not possible or efficient with traditional Redis search approaches.
RediSearch supports most of the leading programming languages, including C#; see: https://libraries.io/nuget/NRediSearch
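A small NRediSearch sketch; the exact signatures vary between NRediSearch versions, so treat this as illustrative rather than definitive:

using NRediSearch;
using StackExchange.Redis;

class RediSearchExample
{
    static void Main()
    {
        var muxer = ConnectionMultiplexer.Connect("localhost");
        var client = new Client("items", muxer.GetDatabase());

        // One weighted text field; schema building is fluent.
        var schema = new Schema().AddTextField("description", 1.0);
        client.CreateIndex(schema, new Client.ConfiguredIndexOptions());

        client.AddDocument(new Document("item:1").Set("description", "Intel quad core chip"));

        SearchResult result = client.Search(new Query("quad core"));
    }
}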
I have a Lucene index with a lot of text data, and each item has a description. I want to extract the most common words from the description and generate tags to classify each item based on its description. Is there a Lucene.NET library for doing this, or any other library for text classification?
No, Lucene.NET can do search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so more detail may be needed.
Generally, though, the easiest way is to try an external service. They all have REST APIs, and it's very easy to interact with them from C#.
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
There are also good Java SDKs like Mahout. As I remember, interaction with Mahout can also be done as with a service, so integrating with it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; as an example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not your way. Then take a look at the DBpedia project and RDF.
And lastly, you can use an implementation of the Naive Bayes algorithm. It's easy, and everything stays under your control.
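For a sense of how little code that takes, here is a minimal, illustrative multinomial Naive Bayes classifier in C# (Laplace-smoothed; train it on texts you have already labeled):

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesClassifier
{
    readonly Dictionary<string, Dictionary<string, int>> wordCounts = new Dictionary<string, Dictionary<string, int>>();
    readonly Dictionary<string, int> docCounts = new Dictionary<string, int>();
    readonly HashSet<string> vocabulary = new HashSet<string>();
    int totalDocs;

    public void Train(string label, string text)
    {
        totalDocs++;
        if (!docCounts.ContainsKey(label)) { docCounts[label] = 0; wordCounts[label] = new Dictionary<string, int>(); }
        docCounts[label]++;
        foreach (string word in Tokenize(text))
        {
            vocabulary.Add(word);
            int count;
            wordCounts[label].TryGetValue(word, out count);
            wordCounts[label][word] = count + 1;
        }
    }

    public string Classify(string text)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (string label in docCounts.Keys)
        {
            // log P(label) + sum over words of log P(word | label), Laplace-smoothed
            double score = Math.Log((double)docCounts[label] / totalDocs);
            int labelTotal = wordCounts[label].Values.Sum();
            foreach (string word in Tokenize(text))
            {
                int count;
                wordCounts[label].TryGetValue(word, out count);
                score += Math.Log((count + 1.0) / (labelTotal + vocabulary.Count));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }
}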
This is a very hard problem, but if you don't want to spend much time on it you can take all words that have between 5% and 10% frequency in the whole document, or simply take the five most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example, pairs), which you can use to find multi-word tags.
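A minimal sketch of the frequency-plus-stopwords approach in C# (the stopword list here is a tiny placeholder; use a real list from the internet):

using System;
using System.Collections.Generic;
using System.Linq;

static class TagExtractor
{
    static readonly HashSet<string> Stopwords =
        new HashSet<string> { "the", "a", "an", "and", "or", "of", "to", "in", "is" };

    // Returns the most common non-stopwords in the description as candidate tags.
    public static List<string> TopWords(string description, int howMany)
    {
        return description
            .ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !Stopwords.Contains(w))
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Take(howMany)
            .Select(g => g.Key)
            .ToList();
    }
}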
What is the easiest way to programmatically extract structured data from a bunch of web pages?
I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off the subsequent pages. This actually works fine, and for programmers I think this (or another language) provides a reasonable approach, to be written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly; if so, I would be interested in knowing what it is.
Also, do any tools exist that would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without the need to do a bunch of copy and paste?
If you search Stack Overflow for WWW::Mechanize & pQuery you will see many examples using these Perl CPAN modules.
However, since you mention "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like and so perhaps easier for a "non-programmer" to pick up.
Here is an example from the documentation for retrieving tweets from Twitter:
use URI;
use Web::Scraper;

my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content", body => 'TEXT';
        process ".entry-date", when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet (@{$res->{tweets}}) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}
I found YQL to be very powerful and useful for this sort of thing. You can point it at any web page on the internet and it will make the page valid and then allow you to use XPath to query sections of it. You can output the result as XML or JSON ready for loading into another script/application.
I wrote up my first experiment with it here:
http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/
Since then YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run it on Yahoo!'s servers before returning the data to you.
A more detailed writeup of YQL is here.
You could create a data table for YQL to get at the basics of the information you are trying to grab, and then the person in charge of data acquisition could write very simple queries (in a DSL which is pretty much English) against that table. It would be easier for them than "proper programming", at least...
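For completeness, a C# sketch of calling YQL's public REST endpoint (the endpoint and the html table are Yahoo's, as documented at the time; the page URL and XPath are made up):

using System;
using System.Net;

class YqlExample
{
    static void Main()
    {
        string yql = "select * from html where url='http://example.com' and xpath='//h1'";
        string endpoint = "http://query.yahooapis.com/v1/public/yql?format=xml&q="
                          + Uri.EscapeDataString(yql);

        using (WebClient client = new WebClient())
        {
            string xml = client.DownloadString(endpoint); // matched nodes, ready to parse
            Console.WriteLine(xml);
        }
    }
}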
There is Sprog, which lets you graphically build processes out of parts (Get URL -> Process HTML Table -> Write File), and you can put Perl code in any stage of the process, or write your own parts for non-programmer use. It looks a bit abandoned, but still works well.
I use a combination of Ruby with Hpricot and Watir; it gets the job done very efficiently.
If you don't mind it taking over your computer, and you happen to need JavaScript support, WatiN is a pretty damn good browsing tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for running through and getting text from pages.
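A small WatiN sketch; the page URL and the field and button names are made up for illustration:

using System;
using WatiN.Core;

class WatiNExample
{
    [STAThread] // WatiN's IE automation requires a single-threaded apartment
    static void Main()
    {
        // Drives a real Internet Explorer window, so JavaScript on the page runs.
        using (IE browser = new IE("http://example.com/search"))
        {
            browser.TextField(Find.ByName("q")).TypeText("quad core");
            browser.Button(Find.ByValue("Search")).Click();
            Console.WriteLine(browser.Text); // visible text of the result page
        }
    }
}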
Are commercial tools viable answers? If so, check out http://screen-scraper.com/ - it is super easy to set up and use to scrape websites. They have a free version which is actually fairly complete. And no, I am not affiliated with the company :)
Does anyone know of a "similar words or keywords" algorithm available in open source or via an API? I am looking for something sort of like a thesaurus but smarter.
So for example:
intel
returns:
processor,
i7 core chip,
quad core chip,
.. etc
Any ideas or even something to point me in the right direction in C#?
Edit:
I would love to hear your thoughts, but why can't we just use the Google AdWords API to generate keywords relevant to those entered?
Why not send a search query out to Google and parse what it returns?
Also, check out Google Sets.
There is no algorithm for such a thing. You are going to have to acquire data for a thesaurus and load it into a data structure; then it is a simple dictionary lookup (you can use the C# Dictionary class for that). Maybe you can look at WordNet or the Moby Thesaurus as sources of data. Another option is using a thesaurus server and getting the information online as needed.
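The lookup part really is trivial once the data is loaded; a minimal sketch with made-up entries:

using System;
using System.Collections.Generic;

class ThesaurusLookup
{
    static void Main()
    {
        // Imagine `related` populated from WordNet, Moby, or similar data.
        Dictionary<string, List<string>> related =
            new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        related["intel"] = new List<string> { "processor", "i7 core chip", "quad core chip" };

        List<string> matches;
        if (related.TryGetValue("intel", out matches))
            Console.WriteLine(string.Join(", ", matches.ToArray()));
    }
}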
You will need a large database containing this information. The rest is simple - look up the input and see what related words are stored.
The hard part is generating the database. Doing it manually might take years if you want to cover a large number of words and topics.
Generating it automatically is surely non-trivial. Maybe you could try downloading web pages and analyzing words that frequently appear together, but I assume this would still take months to build, tune, and finally gather good-quality data. Extracting links from Wikipedia might be a good source of information because of its semi-structure.
I've made the OpenOffice thesaurus functions available for .NET in the NHunspell project. You can use the OO thesaurus files.
Here is the NHunspell Project
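A minimal sketch of the thesaurus lookup with NHunspell, assuming the en_US Hunspell dictionaries and the OpenOffice thesaurus data file sit next to the executable (the file names are illustrative):

using System;
using NHunspell;

class ThesaurusExample
{
    static void Main()
    {
        using (Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic"))
        {
            MyThes thes = new MyThes("th_en_US_new.dat");
            ThesResult result = thes.Lookup("processor", hunspell);
            if (result != null)
                foreach (ThesMeaning meaning in result.Meanings)
                    Console.WriteLine(string.Join(", ", meaning.Synonyms.ToArray()));
        }
    }
}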