I am looking to create a search engine that will be based on 5 columns in a SQL 2000 DB. I have looked into Lucene.NET and read the documentation on it, but I am wondering whether anyone has any previous experience with it?
Thanks
IMHO it's not so much about performance as about maintainability. In order to index your content using Lucene.NET, you'll have to create some mechanism (a service or a trigger) which will add new rows to, and remove deleted rows from, the Lucene index.
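For illustration, such a mechanism might look roughly like this with Lucene.Net (a sketch only: the table, column, and path names are placeholders, and the exact API varies a little between Lucene.Net versions):

    using System.Data.SqlClient;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class SearchIndexer
    {
        // Rebuilds the index from scratch by reading the five columns out of SQL Server.
        // "Products" and the column names are placeholders for your own schema.
        public static void RebuildIndex(string connectionString, string indexPath)
        {
            var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand("SELECT Id, Name, Description, Category, Tags FROM Products", conn);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        var doc = new Document();
                        // Store the key un-analyzed so a row can later be re-indexed or
                        // removed with writer.UpdateDocument / writer.DeleteDocuments.
                        doc.Add(new Field("Id", reader["Id"].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                        doc.Add(new Field("Name", reader["Name"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
                        doc.Add(new Field("Description", reader["Description"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
                        doc.Add(new Field("Category", reader["Category"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
                        doc.Add(new Field("Tags", reader["Tags"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
                        writer.AddDocument(doc);
                    }
                }
                writer.Commit();
            }
        }
    }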
From a beginner's perspective I think it's probably easier to use the SQL Server built-in full text search engine.
I haven't dealt with Lucene yet, but a friend of mine has, and he said that their performance was 4 to 5 times better with Lucene than with full-text indexing.
Performance better? I think that largely depends on volume and how you expect the data to scale.
SQL Server Full Text is far superior, in my opinion. To get this to work with Lucene, you will need a process to maintain the index by extracting data from the SQL database.
You can either use a Lucene index or a SQL FTS index. I personally lean toward Lucene from a simplicity standpoint; it is also not a black box. A lot of which solution will work (and they both may work) depends on query load, data size, and data update frequency. Lucene does provide a well-worn path to building very scalable search solutions for websites. In the future, please include some more information about your problem.
I'm currently using embedded SQLite to store relatively big lists of data (starting from 100'000 rows per table). Queries include only:
paging
sorting by a field
The amount of data in a row is relatively small. Performance is really bad, especially for the first query, which is critical for my application. I have already tried all kinds of tuning and pre-caching and have reached the practical limit.
Is there an alternative embedded data store library that can do these simple queries in a very fast and efficient way? There's no requirement for it to support SQL at all.
If it is (predominantly) read-only, consider using memory-mapped views of a file.
It will be possible to achieve maximum performance by rolling your own indexes.
Obviously, rolling your own will also be the most work-intensive and error-prone option.
May I suggest a traditional RDBMS with good indexes, or perhaps a newfangled NoSQL-style DB that supports your workload?
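As a rough illustration of the memory-mapped idea (a sketch only, assuming fixed-size records in a flat file you write yourself, already sorted by the paging field; all names here are made up):

    using System.IO.MemoryMappedFiles;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential)]
    struct Record
    {
        public int Id;
        public long SortKey;
        // ... other fixed-size fields
    }

    class MappedStore
    {
        // Reads one page of records straight out of a memory-mapped file.
        // Assumes the file is a contiguous array of Record structs, already
        // sorted on disk by the field you page on. Bounds checks are omitted.
        public static Record[] ReadPage(string path, int pageIndex, int pageSize)
        {
            int recordSize = Marshal.SizeOf(typeof(Record));
            long offset = (long)pageIndex * pageSize * recordSize;

            using (var mmf = MemoryMappedFile.CreateFromFile(path, System.IO.FileMode.Open))
            using (var view = mmf.CreateViewAccessor(offset, (long)pageSize * recordSize))
            {
                var page = new Record[pageSize];
                view.ReadArray(0, page, 0, pageSize);
                return page;
            }
        }
    }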
You can try Lucene.Net; it is blazing fast, does not require any installation, supports paging and sorting by fields, and much, much more.
http://incubator.apache.org/lucene.net/
With the SimpleLucene wrapper it is also quite easy to use: http://blogs.planetcloud.co.uk/mygreatdiscovery/post/SimpleLucene-e28093-Lucenenet-made-easy.aspx
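For example, paging and sorting with Lucene.Net look roughly like this (a sketch; the field name is made up and the exact API depends on the Lucene.Net version):

    using System.IO;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    class PagedSearch
    {
        // Returns one page of results, sorted by an (un-analyzed) "name" field.
        public static void PrintPage(string indexPath, int pageIndex, int pageSize)
        {
            var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
            using (var searcher = new IndexSearcher(dir, true)) // true = read-only
            {
                var sort = new Sort(new SortField("name", SortField.STRING));
                // Ask for enough hits to cover the requested page, then skip to it.
                TopDocs hits = searcher.Search(new MatchAllDocsQuery(), null, (pageIndex + 1) * pageSize, sort);

                for (int i = pageIndex * pageSize; i < hits.ScoreDocs.Length; i++)
                {
                    var doc = searcher.Doc(hits.ScoreDocs[i].Doc);
                    System.Console.WriteLine(doc.Get("name"));
                }
            }
        }
    }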
I am new to the open source game.
I had a question before I dive into what I plan to do. Assuming I plan to use C# with a NoSQL store (I have not decided which one yet: RavenDB or MongoDB), I want to do indexing for a site in ASP.NET.
I would like to use Lucene.Net for indexing data and page links on my site. When do you actually tell Lucene.Net to start indexing?
I mean, is it a background process that starts indexing every night, just like the SharePoint indexer, or should you index a record the moment you call insert on the NoSQL store?
How about links on pages: when should the crawl engine run? I guess I am thinking in terms of the SharePoint world and need to be corrected by some people on this board.
I am particularly interested in the sequence of steps; I am sorry, I am failing to understand the when and the why.
Any explanation or links to examples would help.
Appreciate your help.
Thanks
Sweety
Lucene is a search engine, not a crawler. So you would need to find a crawler which inserts the data into the Lucene index.
Think of Lucene as a SQL server. It can store data and retrieve data based on queries. But you have to create the application which actually inserts and queries the data.
You could very well use Solr (built on top of Lucene) and Nutch, both Java projects, and use web services between your C# app and the search index. The Java version of Lucene is also under constant development, while the .NET version is somewhat up in the air.
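Talking to Solr from C# is just HTTP; a minimal sketch of a query call (the host, port, and handler path are assumptions based on a default Solr install):

    using System;
    using System.Net;

    class SolrQuerySketch
    {
        // Sends a query to Solr's standard select handler and returns the raw JSON.
        public static string Search(string userQuery)
        {
            string url = "http://localhost:8983/solr/select"
                       + "?q=" + Uri.EscapeDataString(userQuery)
                       + "&rows=10&wt=json";

            using (var web = new WebClient())
            {
                return web.DownloadString(url); // parse the JSON on the C# side
            }
        }
    }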
I have a little over 1 million records in my Lucene database and would like to move them into a new database so I can more easily do advanced searching, join them with my existing tables, etc. I have done some searching and haven't found a good/fast way to take my existing Lucene database files and move them into a SQL database.
Any help, or a pointer in the right direction, would be appreciated.
Details: my SQL database is Microsoft SQL Server (managed through SQL Server Management Studio). The application that creates the Lucene database is a web scraper written in C#.
EDIT: I am using Lucene.net
Not the answer you're looking for, but I'd just like to point out that an index and a relational database are two vastly different things. Unless you're storing all the data in the index as well, I really don't think what you're trying to do is possible.
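If you did store every field you need in the index, a rough sketch of such an export might look like this (the field, table, and column names are made up, and the exact Lucene.Net API varies slightly by version):

    using System.Data.SqlClient;
    using System.IO;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    class LuceneToSqlExport
    {
        public static void Export(string indexPath, string connectionString)
        {
            var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
            using (var reader = IndexReader.Open(dir, true)) // true = read-only
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                for (int i = 0; i < reader.MaxDoc; i++)
                {
                    if (reader.IsDeleted(i)) continue; // skip deleted documents

                    var doc = reader.Document(i);
                    var cmd = new SqlCommand(
                        "INSERT INTO ScrapedPages (Url, Title, Body) VALUES (@url, @title, @body)", conn);
                    // Get() only returns fields that were stored (Field.Store.YES) at index time.
                    cmd.Parameters.AddWithValue("@url", doc.Get("url") ?? "");
                    cmd.Parameters.AddWithValue("@title", doc.Get("title") ?? "");
                    cmd.Parameters.AddWithValue("@body", doc.Get("body") ?? "");
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }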
Putting your Lucene index in the DB negates the purpose of indexing. The main advantage of Lucene is extremely fast, relevant searches over huge amounts of text. Instead of putting the index into the DB, you might as well just use MSSQL Server full-text search.
I think you should consider your requirements once again and either switch to MSSQL full text search or use standard Lucene searching mechanisms.
Is it possible to use Lucene as a full-fledged data store (like the other NoSQL variants, Mongo and Couch)?
I know there are some limitations, like documents newly updated by one indexer not being shown to another indexer, so we need to restart the indexer to get the updates.
But I stumbled upon Solr lately, and it seems these problems are avoided by some kind of snapshot replication.
So I thought I could use Lucene as a data store, since it also uses the same kind of documents (JSON-based) that Mongo and Couch use internally to manage documents, and its proven indexing algorithm fetches records super fast.
But I am curious: has anybody tried that before? If not, what are the reasons for not choosing this approach?
There is also the problem of durability. While a Lucene index should not get corrupted ever, I've seen it happen. And the approach Lucene takes to repairing a broken index is "throw it away and rebuild from the original data". Which makes perfect sense for an indexing tool. But it does require you to have the data stored somewhere else.
I've only worked with Solr, the Lucene derivative (and I would recommend Solr to just about anyone), so my opinion may be a little biased, but it should be possible to use Solr as a datastore, yes; however, it wouldn't be very useful without something more permanent in the background.
The problem you may encounter is that entering data into Solr does not guarantee you will get it back when you expect it. Barring the use of pretty strict faceting, you may encounter problems retrieving your data simply because the indexer has decided to lump your results in a certain way.
I've experimented a little with this approach, but the only real benefit I saw was in situations where you want the search index on the client side so that they can search quickly internally and then query the database for extended information.
My suggestion is to use Solr for search and then have it return a short sample of the data you may want, as well as an index for further querying in a traditional data store.
TL;DR: Yes, but I wouldn't recommend it.
The Guardian uses Solr as their data store. You can see some of their reasons in that slideshow.
In any case, I think their website is very heavily trafficked (certainly more so than anything I work on), so I would feel comfortable saying that Solr will probably work for you, since it scales to their requirements.
I've been tasked with setting up a search service on an eCommerce site.
Currently, it uses full-text indexing on SQL Server, which isn't ideal, as it's slow and not all that flexible.
How would you suggest I approach changing this over to Lucene?
By that, I mean: how would I initially load all the data into the indexes, and how would they be maintained? In my "insert product" methods, would I also have them insert the product into the index?
Any information is of great help!
I'm currently using Solr, which is built on top of Lucene, as the search engine for one of my e-commerce projects. It works great.
http://lucene.apache.org/solr/
Also, as far as keeping the products in sync between the DB and Solr goes, you can either build your own "sweeper" or use the DataImportHandler in Solr.
http://wiki.apache.org/solr/DataImportHandler
We built our own sweeper that reads a DB view at some interval and checks to see whether there are new products or whether any product data has been updated. It's a brute-force method, and I wish I had known about the DataImportHandler before.
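Roughly, that kind of sweeper looks like the sketch below (the view name, the LastModified convention, and the Solr URL are assumptions for illustration, not our actual code):

    using System;
    using System.Data.SqlClient;
    using System.Net;
    using System.Xml.Linq;

    class ProductSweeper
    {
        static DateTime _lastRun = DateTime.MinValue;

        // Called from a timer or Windows service every few minutes.
        public static void Sweep(string connectionString, string solrUpdateUrl)
        {
            var add = new XElement("add");

            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "SELECT Id, Name, Description FROM vw_SearchProducts WHERE LastModified > @since", conn);
                cmd.Parameters.AddWithValue("@since", _lastRun);

                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        add.Add(new XElement("doc",
                            new XElement("field", new XAttribute("name", "id"), reader["Id"]),
                            new XElement("field", new XAttribute("name", "name"), reader["Name"]),
                            new XElement("field", new XAttribute("name", "description"), reader["Description"])));
                    }
                }
            }

            // Solr's XML update handler, e.g. http://localhost:8983/solr/update
            using (var web = new WebClient())
            {
                web.Headers[HttpRequestHeader.ContentType] = "text/xml";
                web.UploadString(solrUpdateUrl, add.ToString()); // add/replace the changed docs
                web.UploadString(solrUpdateUrl, "<commit/>");    // make them searchable
            }

            _lastRun = DateTime.UtcNow;
        }
    }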
Facets are also a really powerful part of Solr. I highly recommend using them.
If you do decide to use Lucene.NET for your search, you need to do some of the following:
create your initial index by iterating through all your records and writing the data that you want searched into your index
if the number of records and the amount of data that you are writing to your indexes make for large indexes, then consider splitting them into multiple indexes (this means you will have to write a more complex search program, as you need to search each index and then merge the results!)
when a product is updated or created, you need to update your index (there is a process here to create additional index parts and then merge the indexes)
if you have a high-traffic site and there is the possibility of multiple searches occurring at the exact same moment, then you need to create a wrapper that can do the search for you across multiple duplicate indexes (or sets of indexes) (think singleton pattern here), as the index can only be accessed (opened) for one search at a time
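A minimal sketch of that kind of singleton wrapper (a lazily created, shared IndexSearcher behind a lock that is invalidated after index updates; the index path is made up and the exact API depends on the Lucene.Net version):

    using System.IO;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    // Shared search entry point: one IndexSearcher is reused by all requests and is
    // thrown away (so the next call reopens it) whenever the index has been rewritten.
    class SearcherHolder
    {
        static readonly object Sync = new object();
        static IndexSearcher _searcher;
        const string IndexPath = @"C:\indexes\products"; // assumption

        public static IndexSearcher Get()
        {
            lock (Sync)
            {
                if (_searcher == null)
                {
                    var dir = FSDirectory.Open(new DirectoryInfo(IndexPath));
                    _searcher = new IndexSearcher(dir, true); // true = read-only
                }
                return _searcher;
            }
        }

        // Call after the index has been updated/merged so the next search sees the
        // new documents. (Production code would keep the old searcher alive until
        // in-flight searches have finished.)
        public static void Invalidate()
        {
            lock (Sync)
            {
                if (_searcher != null)
                {
                    _searcher.Dispose();
                    _searcher = null;
                }
            }
        }
    }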
This is a great platform. We initially tried to use the full-text search and found it to be a pain to create, update, and manage the indexes. The searches were not that much faster than a standard SQL search. It did provide some flexibility in the search query... but even this pales in comparison to the power of Lucene!