I have an MVC site that uses Lucene.Net for its searching capabilities. The site has over 100k products, and the indexes for the site are already built. However, the site also has 2 data feeds that update the database on a regular basis (potentially every 15 minutes), so the data changes a lot. How should I go about updating the Lucene indexes, or do I not have to at all?
Use a process scheduler (like Quartz.Net) to run every so often (potentially every 15 minutes) to fetch the items in the database that aren't indexed yet.
Use a field as an ID to compare against (like a sequence number or a date/time). Fetch the latest added document from the index and the latest from the database, and index everything in between. You have to be careful not to index duplicates (or worse, skip over un-indexed documents).
Alternatively, synchronize your indexing with the 2 data feeds and index the documents as they are stored in the database, which saves you from the pitfalls above (duplicates/missing documents). I'm not sure how these feeds update your database, but you can intercept them and update the index accordingly.
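For the scheduler route, here is a minimal sketch of the timestamp-comparison approach, assuming each product row carries a LastModified column that is also stored in the index. The Product type, the field names, and the getProductsModifiedSince delegate are illustrative placeholders, and the scheduling itself (Quartz.Net or a plain timer) is left out:

    using System;
    using System.Collections.Generic;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    public class IncrementalIndexer
    {
        // Index everything modified since the newest document already in the index.
        public void UpdateIndex(Lucene.Net.Store.Directory dir,
                                Func<DateTime, IEnumerable<Product>> getProductsModifiedSince)
        {
            DateTime lastIndexed = GetLatestIndexedTimestamp(dir);
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var product in getProductsModifiedSince(lastIndexed))
                {
                    var doc = new Document();
                    doc.Add(new Field("Id", product.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("Name", product.Name, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new NumericField("LastModified", Field.Store.YES, true)
                                .SetLongValue(product.LastModified.Ticks));

                    // UpdateDocument deletes any existing doc with the same Id before
                    // adding, which guards against indexing duplicates.
                    writer.UpdateDocument(new Term("Id", product.Id.ToString()), doc);
                }
                writer.Commit();
            }
        }

        // The newest indexed timestamp marks the upper bound already covered by the index.
        private static DateTime GetLatestIndexedTimestamp(Lucene.Net.Store.Directory dir)
        {
            using (var searcher = new IndexSearcher(dir, true))
            {
                var sort = new Sort(new SortField("LastModified", SortField.LONG, true));
                var top = searcher.Search(new MatchAllDocsQuery(), null, 1, sort);
                if (top.TotalHits == 0) return DateTime.MinValue;
                var newest = searcher.Doc(top.ScoreDocs[0].Doc);
                return new DateTime(long.Parse(newest.Get("LastModified")));
            }
        }
    }

    public class Product
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public DateTime LastModified { get; set; }
    }

Note that querying strictly greater-than lastIndexed can skip rows that share the same timestamp, so it is safer to fetch from lastIndexed inclusive and let UpdateDocument de-duplicate.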
Take a look at this solution. I had the same requirement, used the solution from this link, and it worked for me. Using a timer, it recreates the index every so often, so there won't be any overlap/skipping issues. Give it a try.
Making Lucene.Net thread safe in the code
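For reference, a minimal sketch of that timer-plus-lock idea: Monitor.TryEnter skips a tick if the previous rebuild is still running, so runs never overlap. RebuildIndex is a placeholder for whatever indexing code you already have.

    using System;
    using System.Threading;

    public class IndexRebuildScheduler : IDisposable
    {
        private readonly Timer _timer;
        private readonly object _gate = new object();

        public IndexRebuildScheduler(TimeSpan interval)
        {
            _timer = new Timer(_ => TryRebuild(), null, TimeSpan.Zero, interval);
        }

        private void TryRebuild()
        {
            // If a rebuild is already running, skip this tick instead of queuing up.
            if (!Monitor.TryEnter(_gate)) return;
            try
            {
                RebuildIndex(); // placeholder for your actual indexing routine
            }
            finally
            {
                Monitor.Exit(_gate);
            }
        }

        private void RebuildIndex() { /* ... */ }

        public void Dispose()
        {
            _timer.Dispose();
        }
    }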
Thanks.
I have some doubts about what is better in terms of performance and best practices.
My system will do:
Per-document insert and update, or
Batch document insert
As I figured out (in previous systems), #2 is straightforward:
Bulk delete the old docs and add the new ones, 10k docs max with no more than 20 fields per doc
Commit
Optimize
But #1 still puzzles me, as some customers will add docs one by one.
What is the penalty of committing and optimizing on every insert and update? Or can I just ignore it, since it only happens 20 times per day?
The Java version is 3.5; the .NET version is 3.0.3.
I just saw a blog post and want to know what the community has to say about it.
I see no need to .Optimize() at all. Lucene will handle segment merges automatically, and you can provide your own logic to change how the merges are calculated. You could write something that merged away deleted documents when 10% of your documents are marked for deletion. There's no need for the functionality of Lucene to merge away every single deleted document.
Sure, you'll end up with more segment files and they will consume file descriptors, but have you ever run into problems where you had too many files open? I tried googling for the maximum number of open files on a Windows server installation, but the answers vary from several thousand to limited by available memory.
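To make the commit cost concrete, here is a sketch that indexes documents one by one but commits in batches and never calls Optimize(); the field names, the KeyValuePair input shape, and the batch size of 1000 are illustrative assumptions:

    using System.Collections.Generic;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    public static class BatchedIndexing
    {
        // Commit every batchSize documents instead of after every AddDocument;
        // segment merges happen automatically under the default merge policy.
        public static void IndexAll(Lucene.Net.Store.Directory dir,
                                    IEnumerable<KeyValuePair<string, string>> idToContent,
                                    int batchSize = 1000)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                int pending = 0;
                foreach (var item in idToContent)
                {
                    var doc = new Document();
                    doc.Add(new Field("Id", item.Key, Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("Content", item.Value, Field.Store.NO, Field.Index.ANALYZED));
                    writer.UpdateDocument(new Term("Id", item.Key), doc);

                    if (++pending >= batchSize)
                    {
                        writer.Commit(); // amortize the flush cost over the whole batch
                        pending = 0;
                    }
                }
                writer.Commit();
                // Deliberately no writer.Optimize() here.
            }
        }
    }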
I'm working on a web application in ASP.NET MVC which involves a fairly complex (I think) search situation. Basically, I have a bunch of entries with a title and content. These are the fields that I want to provide full-text search for. The catch is that I also keep track of a rating on these entries (like up-vote/down-vote). I'm using MongoDB as my database, and I have a separate collection for all these votes. I plan on using a map/reduce function to turn all of the documents in the votes collection into a single "score" for the article. When I perform a search, I want the article's score to be influential on the rankings.
I've been looking at many different full-text search services, and it looks like all the cool kids are using Lucene (and in my case, Lucene.NET). The problem is that since the score is not part of the document when I will first create the index, I don't know how I would set up Lucene. Each time somebody votes for an article, do I need to update the Lucene index? I'm a little lost here.
I haven't written any of this code yet, so if you have a better way to solve this problem, please share.
The problem is that since the score is not part of the document when I will first create the index, I don't know how I would set up Lucene
What's the problem? Just use a default value for the rating/votes (probably 0), and update it later when people vote it up.
Each time somebody votes for an article, do I need to update the Lucene index?
No, that can be expensive and slow. Your app will probably see a huge volume of updates, and Lucene can be slow when you flush to disk often. In general, for almost any full-text search scenario, real-time updates are not that important. So I suggest the following strategy:
Solution #1:
1. Create a collection in MongoDB where you will store all the updates related to Lucene:
    {
        _id,
        title,
        content,
        rating,                        // increment it
        status (new, updated, deleted) // you need this for Lucene
    }
2. After this, you need to create a tool that processes all these updates in the background (once every 10 minutes, for example); see the sketch below. Just keep in mind that you should flush data to disk only after, say, 10,000 Lucene updates/inserts/deletes, to keep the index updates fast.
With the above solution your data can be stale for up to 10 minutes, but inserts will be faster.
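A rough sketch of that background tool, where PendingUpdate mirrors the collection above; fetching the pending documents from MongoDB and marking them processed are left as placeholders rather than real driver calls:

    using System.Collections.Generic;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    public class PendingUpdate
    {
        public string Id;
        public string Title;
        public string Content;
        public int Rating;
        public string Status; // "new", "updated" or "deleted"
    }

    public class LuceneSweeper
    {
        public void ProcessPending(Lucene.Net.Store.Directory dir, IEnumerable<PendingUpdate> pending)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var u in pending)
                {
                    var idTerm = new Term("_id", u.Id);
                    if (u.Status == "deleted")
                    {
                        writer.DeleteDocuments(idTerm);
                        continue;
                    }

                    var doc = new Document();
                    doc.Add(new Field("_id", u.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("title", u.Title, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("content", u.Content, Field.Store.NO, Field.Index.ANALYZED));
                    doc.Add(new NumericField("rating", Field.Store.YES, true).SetIntValue(u.Rating));

                    // Covers both "new" and "updated": delete-then-add by _id.
                    writer.UpdateDocument(idTerm, doc);
                }

                // One flush for the whole batch, as suggested above.
                writer.Commit();
            }
        }
    }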
Solution #2:
Send an async message for each update related to Lucene.
Handle these messages and update Lucene each time a message comes in.
Async handling is very important; otherwise it can affect application performance.
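One hypothetical shape for the async handling: queue index updates on a BlockingCollection and drain them on a single background thread, so web requests never block on Lucene. IndexMessage and ApplyToIndex are placeholders, not a real messaging API.

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public class AsyncIndexUpdater
    {
        private readonly BlockingCollection<IndexMessage> _queue = new BlockingCollection<IndexMessage>();

        public AsyncIndexUpdater()
        {
            // One long-running consumer serializes all index writes.
            Task.Factory.StartNew(Consume, TaskCreationOptions.LongRunning);
        }

        // Called from the web request; returns immediately.
        public void Enqueue(IndexMessage message) { _queue.Add(message); }

        private void Consume()
        {
            foreach (var message in _queue.GetConsumingEnumerable())
            {
                ApplyToIndex(message); // placeholder: one Lucene update per message
            }
        }

        private void ApplyToIndex(IndexMessage message) { /* ... */ }
    }

    public class IndexMessage
    {
        public string DocumentId;
        public int NewRating;
    }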
I would go with #1, because it should be less expensive for the server.
Choose what you like more.
Go straight to MongoDB (or the database) and increment and decrement the votes. In my view, you have to be constantly updating the database; no need to get complicated. If something is added, add something in the database. Update, insert, and delete all the time if there is a change on the website. Changes need to be tracked, and the place to track them is MongoDB or the SQL database. For searching fields, use MongoDB's field search parameters, combine all the fields it returns, and rank them yourself.
I am interested in what the best practices are for paging large datasets (100 000+ records) using ASP.NET and SQL Server.
I have used SQL Server to perform the paging before, and although this seems like an ideal solution, issues arise around dynamic sorting (CASE statements in the ORDER BY clause to determine the column, and CASE statements for ASC/DESC order). I am not a fan of this: not only does it couple the application to the SQL details, it is a maintainability nightmare.
Open to other solutions...
Thanks all.
In my experience, 100 000+ records are too many for the user looking at them. The last time I did this, I provided filters, so users could see a filtered (smaller) set of records and order them; paging and ordering then became much faster than paging/ordering over the whole 100 000+ records. If the user didn't use filters, I showed a "warning" that a large number of records would be returned and there would be delays. Adding an index on the column being ordered, as suggested by Erick, would also definitely help.
I wanted to add a quick suggestion to Raj's answer. If you create a temp table with the format ##table, it will survive. However, it will also be shared across all connections.
If you create an index on the column that will be sorted, the cost of this method is far lower.
Erick
If you use the ORDER BY technique, every time you page through you put the same load on the server, because you are running the full query and then filtering the data.
When I have had the luxury of non-connection-pooled environments, I created the connection and left it open until paging was done. I created a #Temp table on the connection with just the IDs of the rows that need to come back, added an IDENTITY field to this rowset, and then did the paging using this table to get the fastest returns.
If you are restricted to a connection-pooled environment, the #Temp table is lost as soon as the connection is closed. In that case, you will have to cache the list of IDs on the server - never send them to the client to be cached.
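For what it's worth, a sketch of that #Temp-table approach, assuming a connection you keep open for the whole paging session; the table and column names (Products, ProductId, Name) are invented for illustration:

    using System.Data.SqlClient;

    public static class TempTablePaging
    {
        // Run once per sort order: materialize the sorted IDs with an IDENTITY
        // row number into a #Temp table that lives as long as the connection.
        public static void BuildKeyTable(SqlConnection openConnection, string orderByColumn)
        {
            // NOTE: orderByColumn must come from a whitelist, never from user input.
            var sql = @"
                CREATE TABLE #PageKeys (RowNum INT IDENTITY(1,1) PRIMARY KEY, ProductId INT);
                INSERT INTO #PageKeys (ProductId)
                SELECT ProductId FROM Products ORDER BY " + orderByColumn + ";";
            using (var cmd = new SqlCommand(sql, openConnection))
                cmd.ExecuteNonQuery();
        }

        // Each page is then a cheap range seek on RowNum.
        public static SqlDataReader GetPage(SqlConnection openConnection, int pageIndex, int pageSize)
        {
            var sql = @"
                SELECT p.ProductId, p.Name
                FROM #PageKeys k
                JOIN Products p ON p.ProductId = k.ProductId
                WHERE k.RowNum BETWEEN @first AND @last
                ORDER BY k.RowNum;";
            var cmd = new SqlCommand(sql, openConnection);
            cmd.Parameters.AddWithValue("@first", pageIndex * pageSize + 1);
            cmd.Parameters.AddWithValue("@last", (pageIndex + 1) * pageSize);
            return cmd.ExecuteReader();
        }
    }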
I've been tasked with setting up a search service on an eCommerce site.
Currently it uses full-text indexing on SQL Server, which isn't ideal, as it's slow and not all that flexible.
How would you suggest I approach changing this over to Lucene?
By that I mean: how would I initially load all the data into the indexes, and how would they be maintained? In my "insert product" methods, would I also have them insert the product into the index?
Any information is of great help!
I'm currently using Solr, which is built on top of Lucene, as the search engine for one of my e-commerce projects. It works great.
http://lucene.apache.org/solr/
Also, as far as keeping the products in sync between the DB and Solr, you can either build your own "sweeper" or implement the DataImportHandler in Solr.
http://wiki.apache.org/solr/DataImportHandler
We built our own sweeper that reads a DB view at some interval and checks whether there are new products or any product data has been updated. It's a brute-force method, and I wish I had known about the DataImportHandler before.
Facets are also a really powerful part of Solr. I highly recommend using them.
If you do decide to use Lucene.NET for your search, you need to do some of the following:
create your initial index by iterating through all your records and writing the data that you want searched into your index (see the sketch after this list)
if the amount of records and data that you are writing to your indexes makes for large indexes, then consider splitting them into multiple indexes (this means you will have to write a more complex search program, as you need to search each index and then merge the results!)
when a product is updated or created, you need to update your index (there is a process here to create additional index parts and then merge the indexes)
if you have a high-traffic site and there is the possibility of multiple searches occurring at the exact same moment, then you need to create a wrapper that can do the search for you across multiple duplicate indexes (or sets of indexes) (think singleton pattern here), as the index can only be opened for one search at a time
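As a starting point, here is a minimal sketch of the initial index build from the first bullet; the Product type and field names are assumptions about your schema:

    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    public static class InitialIndexBuilder
    {
        public static void Build(string indexPath, IEnumerable<Product> allProducts)
        {
            var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

            // true = create a new index, overwriting anything already there.
            using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var p in allProducts)
                {
                    var doc = new Document();
                    doc.Add(new Field("Id", p.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("Name", p.Name, Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("Description", p.Description, Field.Store.NO, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }
        }
    }

    public class Product
    {
        public int Id;
        public string Name;
        public string Description;
    }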
This is a great platform. We initially tried to use the free-text search and found it a pain to create, update, and manage the indexes, and the searches were not that much faster than a standard SQL search. They did provide some flexibility in the search query... but even this pales in comparison to the power of Lucene!
I have a US city/state list table in my SQL Server 2005 database which has a million records. My web application pages have a location textbox which uses an AJAX autocomplete feature. I need to show the complete city/state when the user types in 3 characters.
For example:
Input: bos
Output: Boston, MA
Currently, performance-wise, this functionality is pretty slow. How can I improve it?
Thanks for reading.
Have you checked the indexes on your database? If your query is formatted correctly and you have the proper indexes on your table, you can query a 5-million-row table and get your results in less than a second. I would suggest checking whether you have an index on City with the State column added to the index; that way, when you query by city, the index returns both the city and the state.
If you run your query in SQL Server Management Studio and press Ctrl+M, you can see the execution plan for your query. If you see something like a table scan or an index scan, then you have the wrong index on your table. You want your plan to show an index seek, which means the query goes straight to the proper pages in the database to find your data.
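For illustration, one way to add the suggested index; the table and column names (CityState, City, State) are guesses at your schema. With State as an included column, the query can be answered from the index alone, as an index seek:

    using System.Data.SqlClient;

    public static class CityIndexSetup
    {
        public static void CreateCityIndex(SqlConnection openConnection)
        {
            // Seek on City; State rides along in the leaf pages, so no lookup
            // back to the base table is needed.
            const string sql = @"
                CREATE NONCLUSTERED INDEX IX_CityState_City
                ON dbo.CityState (City)
                INCLUDE (State);";
            using (var cmd = new SqlCommand(sql, openConnection))
                cmd.ExecuteNonQuery();
        }
    }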
Hope this helps.
My guess would be that the problem you're having is not the database itself (although you should check it for index problems), but the amount of time it takes to retrieve the information from the database, put it into the appropriate objects, and send it to the browser. If this is the case, there aren't a lot of options without some real work.
You can cache frequently accessed information on the web server. If you know there are a lot of cities that are frequently accessed, you can store them up front and then check the database only when what the user is looking for isn't in the cache. We use prefix trees to store information when a user is typing something and we need to find it in a list.
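A rough sketch of the prefix-tree idea, caching display strings like "Boston, MA" along each prefix path so lookups never touch the database:

    using System.Collections.Generic;

    public class PrefixTree
    {
        private class Node
        {
            public Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public List<string> Completions = new List<string>(); // e.g. "Boston, MA"
        }

        private readonly Node _root = new Node();

        public void Insert(string key, string completion)
        {
            var node = _root;
            foreach (var c in key.ToLowerInvariant())
            {
                Node child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Children[c] = child;
                }
                node = child;
                // Store the display string along the path so any prefix lookup
                // returns matches without walking the subtree; cap per prefix.
                if (node.Completions.Count < 10)
                    node.Completions.Add(completion);
            }
        }

        public IList<string> Find(string prefix)
        {
            var node = _root;
            foreach (var c in prefix.ToLowerInvariant())
            {
                if (!node.Children.TryGetValue(c, out node))
                    return new string[0];
            }
            return node.Completions;
        }
    }

Load it once at startup (tree.Insert("boston", "Boston, MA") per row); tree.Find("bos") then returns the cached completions instantly.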
You can start to pull information from the database as soon as the user starts to type, and then pare the full result set down as you get more information from the user. This is a little trickier, as you'll have to store the information in memory between requests: if the user types "B", you start the retrieval and store the result in the session; when the user finishes typing "BOS", the result set from the initial query is already in memory, and you can loop through it and pull the subset that matches the final request.
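Here is a hypothetical MVC-flavored sketch of that idea; LookupCities stands in for the real database query:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web.Mvc;

    public class AutocompleteController : Controller
    {
        public JsonResult Suggest(string term)
        {
            if (string.IsNullOrEmpty(term))
                return Json(new string[0], JsonRequestBehavior.AllowGet);

            string firstLetter = term.Substring(0, 1).ToLowerInvariant();
            string cacheKey = "cities_" + firstLetter;

            // Hit the database once per first letter, then reuse the superset.
            var superset = Session[cacheKey] as List<string>;
            if (superset == null)
            {
                superset = LookupCities(firstLetter);
                Session[cacheKey] = superset;
            }

            // Longer prefixes are answered by filtering in memory.
            var matches = superset
                .Where(c => c.StartsWith(term, StringComparison.OrdinalIgnoreCase))
                .Take(10)
                .ToList();

            return Json(matches, JsonRequestBehavior.AllowGet);
        }

        private List<string> LookupCities(string prefix)
        {
            // placeholder for the real query against the city/state table
            return new List<string>();
        }
    }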
Use parent/child dropdowns.