Full Text Search with constantly updating data - C#

I'm working on a web application in ASP.NET MVC which involves a fairly complex (I think) search situation. Basically, I have a bunch of entries with a title and content. These are the fields that I want to provide full-text search for. The catch is that I also keep track of a rating on these entries (like up-vote/down-vote). I'm using MongoDB as my database, and I have a separate collection for all these votes. I plan on using a map/reduce function to turn all of the documents in the votes collection into a single "score" for the article. When I perform a search, I want the article's score to influence the rankings.
I've been looking at many different full-text search services, and it looks like all the cool kids are using Lucene (and in my case, Lucene.NET). The problem is that since the score is not part of the document when I first create the index, I don't know how I would set up Lucene. Each time somebody votes for an article, do I need to update the Lucene index? I'm a little lost here.
I haven't written any of this code yet, so if you have a better way to solve this problem, please share.

The problem is that since the score is not part of the document when I first create the index, I don't know how I would set up Lucene
What's the problem? Just index a default value for the rating/votes (probably 0), and update it later as people vote.
Each time somebody votes for an article, do I need to update the Lucene index?
No, that can be expensive and slow. Your app will probably see a huge volume of updates, and Lucene gets slow when you flush to disk often. In general, for almost any application, real-time index updates matter much less than the full-text search itself. So I suggest the following strategy:
Solution #1:
1. Create a collection in MongoDB where you store all pending updates destined for Lucene:
{
    _id,
    title,
    content,
    rating,  // increment this on every vote
    status   // new, updated, deleted -- Lucene needs this to know what to do
}
2. Then create a tool that processes all these updates in the background (once per 10 minutes, for example). Keep in mind that you should only flush data to disk after a large batch, say 10,000 Lucene updates/inserts/deletes, to keep index updates fast. A sketch of such a sweeper follows below.
With the above solution your data can be up to 10 minutes stale, but inserts will be faster.
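A minimal sketch of such a background sweeper, assuming the official MongoDB C# driver and Lucene.NET 4.8; the PendingUpdate class mirrors the collection shape above, and the field names are illustrative:

using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using MongoDB.Driver;

// Hypothetical shape of the pending-updates collection described above.
public class PendingUpdate
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public int Rating { get; set; }
    public string Status { get; set; }  // "new", "updated", or "deleted"
}

public static class LuceneSweeper
{
    // Drain the pending-updates collection into the index in one batch.
    public static void RunSweep(IMongoCollection<PendingUpdate> pending, IndexWriter writer)
    {
        foreach (var update in pending.Find(FilterDefinition<PendingUpdate>.Empty).ToList())
        {
            var idTerm = new Term("id", update.Id);
            if (update.Status == "deleted")
            {
                writer.DeleteDocuments(idTerm);
            }
            else
            {
                var doc = new Document();
                doc.Add(new StringField("id", update.Id, Field.Store.YES));
                doc.Add(new TextField("title", update.Title, Field.Store.YES));
                doc.Add(new TextField("content", update.Content, Field.Store.NO));
                // Store the score so it can influence ranking at query time.
                doc.Add(new Int32Field("rating", update.Rating, Field.Store.YES));
                // UpdateDocument removes any existing doc with this id, then adds.
                writer.UpdateDocument(idTerm, doc);
            }
            pending.DeleteOne(p => p.Id == update.Id); // processed; drop from the queue
        }
        writer.Commit(); // one flush per sweep, not one per vote
    }
}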
Solution #2:
Send an async message for each update related to Lucene.
Handle these messages and update Lucene as each message arrives.
Async handling is very important; otherwise it can affect application performance.
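A minimal sketch of the async handling, using an in-process System.Threading.Channels queue as the message transport (a real deployment might use a message broker instead); the IndexMessage type and the applyToIndex delegate are hypothetical:

using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical message describing one index change.
public record IndexMessage(string Id, string Title, string Content, int Rating, string Status);

public class IndexUpdateQueue
{
    private readonly Channel<IndexMessage> _channel = Channel.CreateUnbounded<IndexMessage>();

    // Called from the web request path; returns immediately, so requests never block.
    public void Enqueue(IndexMessage msg) => _channel.Writer.TryWrite(msg);

    // Runs as a background task and applies each message to the Lucene index.
    public async Task ConsumeAsync(Action<IndexMessage> applyToIndex, CancellationToken ct)
    {
        await foreach (var msg in _channel.Reader.ReadAllAsync(ct))
            applyToIndex(msg); // e.g. the same UpdateDocument logic as in Solution #1
    }
}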
I would go with #1, because it should be less expensive for the server.
Choose what you like more.

Go straight to MongoDB (or whatever database you use) and increment and decrement the votes there. In my view you have to be constantly updating the database; there's no need to get complicated. When something is added on the website, add something in the database; update, insert, and delete all the time as things change. Changes need to be tracked, and the place to track them is MongoDB or the SQL database. For searching fields, use MongoDB's field search parameters, combine all the fields it returns, and rank them yourself.

Related

Is it possible for Lucene to monitor a Sql Table and keep itself updated?

I am trying to understand some basics of Lucene, the full text search engine. More specifically I am looking at Lucene.Net.
Today I have an old legacy .NET 4.8 web app. Some of it is MVC, but the newer parts follow a pretty nice API-first pattern. The app holds a lot of records (approximately half a million) with tons of different fields. The search functionality there is outdated, to say the least: a ton of old Linq2SQL queries that fan out into LIKE queries.
I would like to introduce a new and better way to search records, so I started looking at Lucene.Net. But I am trying to understand one key concept, and I can't seem to find the answer anywhere, and I think it might be because it cannot be done, but I would like to make sure.
Is it possible to set up Lucene to monitor a SQL table or view so I don't have to maintain the Lucene index from within my code? The code of this app does not lend itself to easily keeping a Lucene index updated when things are added, changed, or deleted, but the database is a good source of truth. I can live with a small delay in keeping the index up to date. Basically, I would like to define for each business model which fields are part of the index and what the id is, and then be able to query with that index from the C# server-side code of my web app.
Is such a scenario even possible or am I asking too much?
It's totally possible, but not out of the box. You have to implement it if you want it. Fundamentally you need to implement three things.
1. A way to know every time a piece of relevant data in the SQL database changes.
2. A place to capture information about that change; call it a change log.
3. A routine that reads the change log, applies those changes to the Lucene.NET index, and then marks the record in the change log as processed.
There are of course lots of different ways to handle each of these.
The SO answer "Lucene.Net index updates, when a manual change is done in SQL Database" provides more details on one way this can be accomplished.
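As a rough illustration of those three pieces, here is a minimal sketch that assumes changes are captured into a ChangeLog table (for example by triggers on the monitored tables) and polled periodically; the table, columns, and the applyToIndex delegate are all hypothetical:

using System;
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

public static class ChangeLogProcessor
{
    // Reads unprocessed change-log rows, applies them to the Lucene.NET
    // index via the supplied delegate, then marks each row as processed.
    public static void ProcessPending(string connString, Action<long, string> applyToIndex)
    {
        using var conn = new SqlConnection(connString);
        conn.Open();

        var changes = new List<(long LogId, long RecordId, string ChangeType)>();
        using (var select = new SqlCommand(
            "SELECT Id, RecordId, ChangeType FROM ChangeLog WHERE Processed = 0", conn))
        using (var reader = select.ExecuteReader())
            while (reader.Read())
                changes.Add((reader.GetInt64(0), reader.GetInt64(1), reader.GetString(2)));

        foreach (var change in changes)
        {
            applyToIndex(change.RecordId, change.ChangeType); // add/update/delete in the index

            using var mark = new SqlCommand(
                "UPDATE ChangeLog SET Processed = 1 WHERE Id = @id", conn);
            mark.Parameters.AddWithValue("@id", change.LogId);
            mark.ExecuteNonQuery();
        }
    }
}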

Update Lucene.net Indexes Regularly

I have an MVC site that uses Lucene.net for its searching capabilities. The site has over 100k products. The indexes for the site are already built. However, the site also has 2 data feeds that update the database on a regular basis (potentially every 15 minutes), so the data is changing a lot. How should I go about updating the Lucene indexes, or do I not have to at all?
Use a process scheduler (like Quartz.Net) to run every so often (potentially, every 15 minutes) to fetch the items in the database that aren't indexed.
Use a field as an ID to compare against (like a sequence number or a date time). You would fetch the latest added document from the index and the latest from the database and index everything in between. You have to be careful not to index duplicates (or worse, skip over un-indexed documents).
Alternatively, synchronize your indexing with the 2 data feeds and index the documents as they are stored in the database, saving you from the pitfalls above (duplicates/missing). I'm unsure how these feeds are updating your database, but you can intercept them and update the index accordingly.
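Here is a minimal sketch of the first suggestion (a scheduled job indexing everything past a high-water mark), assuming Quartz.NET 3.x; IndexStore and Db are hypothetical stand-ins for your own index and data-access code:

using System.Threading.Tasks;
using Quartz;

// Quartz.NET job: index everything added since the last indexed sequence id.
public class IncrementalIndexJob : IJob
{
    public Task Execute(IJobExecutionContext context)
    {
        long lastIndexedId = IndexStore.GetMaxIndexedId();  // high-water mark from the index
        foreach (var product in Db.FetchProductsAfter(lastIndexedId))
            IndexStore.UpdateProduct(product);              // update-by-id avoids duplicates
        IndexStore.Commit();
        return Task.CompletedTask;
    }
}

// Scheduling it to run every 15 minutes, matching the feed cadence:
// var trigger = TriggerBuilder.Create()
//     .WithSimpleSchedule(s => s.WithIntervalInMinutes(15).RepeatForever())
//     .StartNow().Build();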
Take a look at this solution; I had the same requirement, used the solution from this link, and it worked for me. Using a timer, it recreates the index every so often, so there won't be any overlap/skipping issue. Give it a try.
Making Lucene.Net thread safe in the code
Thanks.

MongoDB, C# and NoRM + Denormalization

I am trying to use MongoDB, C# and NoRM to work on some sample projects, but at this point I'm having a much harder time wrapping my head around the data model. With RDBMSs, related data is no problem. In MongoDB, however, I'm having a difficult time deciding what to do with it.
Let's use StackOverflow as an example... I have no problem understanding that the majority of data on a question page should be included in one document. Title, question text, revisions, comments... all good in one document object.
Where I start to get hazy is on the question of user data like username, avatar, reputation (which changes especially often)... Do you denormalize and update thousands of document records every time there is a user change or do you somehow link the data together?
What is the most efficient way to accomplish a user relationship without causing tons of queries to happen on each page load? I noticed the DbReference<T> type in NoRM, but haven't found a great way to use it yet. What if I have nullable optional relationships?
Thanks for your insight!
The balance that I have found is using SQL as the normalized database and Mongo as the denormalized copy. I use an ESB to keep them in sync with each other. I use a concept that I call "prepared documents" and "stored documents". Stored documents are data that is only kept in Mongo; useful for data that isn't relational. The prepared documents contain data that can be rebuilt using the data within the normalized database. They act as living caches in a way: they can be rebuilt from scratch if the data ever falls out of sync (in complicated documents this is an expensive process, because these documents require many queries to be rebuilt). They can also be updated one field at a time. This is where the service bus comes in: it responds to events sent after the normalized database has been updated and then updates the relevant Mongo prepared documents.
Use each database according to its strengths. Allow SQL to be the write database that ensures data integrity. Let Mongo be the read-only database that is blazing fast and can contain sub-documents so that you need fewer queries.
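As a small illustration of updating a prepared document one field at a time, here is a sketch of a bus handler using the official MongoDB C# driver; the event type, collection, and field names are hypothetical:

using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical event published by the ESB after the SQL write succeeded.
public record UserRenamed(string UserId, string NewName);

public class PreparedDocumentUpdater
{
    private readonly IMongoCollection<BsonDocument> _posts;

    public PreparedDocumentUpdater(IMongoDatabase db) =>
        _posts = db.GetCollection<BsonDocument>("posts");

    // Patch the one denormalized field instead of rebuilding whole documents.
    public void Handle(UserRenamed evt) =>
        _posts.UpdateMany(
            Builders<BsonDocument>.Filter.Eq("authorId", evt.UserId),
            Builders<BsonDocument>.Update.Set("authorName", evt.NewName));
}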
** EDIT **
I just re-read your question and realized what you were actually asking for. I'm leaving my original answer above in case it's helpful at all.
The way I would handle the Stack Overflow example you gave is to store the user id in each comment. You would load up the post, which would have all of the comments in it. That's one query.
You would then traverse the comment data and pull out an array of user ids that you need to load, then load those as a batch query (using the Q.In() query operator). That's two queries total. You would then need to merge the data together into a final form. There is a balance you need to strike between when to do it like this and when to use something like an ESB to manually update each document. Use what works best for each individual scenario of your data structure.
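A sketch of that two-query pattern, written against the modern official MongoDB C# driver (whose Filter.In plays the same role as NoRM's Q.In()); the model classes are hypothetical:

using System.Collections.Generic;
using System.Linq;
using MongoDB.Driver;

// Hypothetical minimal models.
public class User    { public string Id { get; set; } public string Name { get; set; } }
public class Comment { public string UserId { get; set; } public string Text { get; set; } public User Author { get; set; } }
public class Post    { public string Id { get; set; } public List<Comment> Comments { get; set; } }

public static class PostLoader
{
    public static Post LoadPostWithAuthors(
        IMongoCollection<Post> posts, IMongoCollection<User> users, string postId)
    {
        // Query 1: the post document already embeds all of its comments.
        var post = posts.Find(p => p.Id == postId).First();

        // Query 2: batch-load every referenced user in a single $in query.
        var ids = post.Comments.Select(c => c.UserId).Distinct().ToList();
        var byId = users.Find(Builders<User>.Filter.In(u => u.Id, ids))
                        .ToList()
                        .ToDictionary(u => u.Id);

        // Merge the two result sets into the final display form.
        foreach (var c in post.Comments)
            c.Author = byId[c.UserId];
        return post;
    }
}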
I think you need to strike a balance.
If I were you, I'd just reference the userid instead of their name/reputation in each post.
Unlike an RDBMS, though, you would opt to have the comments embedded in the document.
Why do you want to avoid denormalization and updating 'thousands of document records'? MongoDB is designed for denormalization, and Stack Overflow handles millions of pieces of data this way in the background. Some of that data can be stale for a short period, and that's okay.
So the main idea of the above is that you should keep denormalized documents in order to display them quickly in the UI.
You can't query by referenced document; one way or another you need denormalization.
I also suggest having a look at the CQRS and event sourcing architectures; they will allow you to apply all these updates through a queue.

Best practices for implementing a Lucene search in asp.net eCommerce site

I've been tasked with setting up a search service on an eCommerce site.
Currently it uses full-text indexing on SQL Server, which isn't ideal, as it's slow and not all that flexible.
How would you suggest I approach changing this over to Lucene?
By that I mean: how would I initially load all the data into the indexes, and how would it be maintained? In my "insert product" methods, would I also have them insert the product into the index?
Any information is of great help!
I'm currently using Solr, which is built on top of Lucene, as the search engine for one of my e-commerce projects. It works great.
http://lucene.apache.org/solr/
Also as far as keeping the products in sync between the DB and Solr, you can either build your own "sweeper" or implement the DataImportHandler in Solr.
http://wiki.apache.org/solr/DataImportHandler
We built our own sweeper that reads a DB view at some interval and checks whether there are new products or any product data has been updated. It's a brute-force method, and I wish I had known about the DataImportHandler before.
Facets are also a really powerful part of Solr. I highly recommend using them.
If you do decide to use Lucene.NET for your search, you need to do some of the following:
1. Create your initial index by iterating through all your records and writing the data that you want searched into your index (see the sketch after this list).
2. If the amount of records and data that you are writing to your indexes makes for large indexes, then consider splitting them into multiple indexes (this means you will have to make a more complex search program, as you need to search each index and then merge the results!).
3. When a product is updated or created, you need to update your index (there is a process here to create additional index parts and then merge the indexes).
4. If you have a high-traffic site and there is the possibility of multiple searches occurring at the exact same moment, then you need to create a wrapper that can do the search for you across multiple duplicate indexes (or sets of indexes); think singleton pattern here, as an index can only be opened for one search at a time.
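A minimal sketch of the initial load from the first point, assuming Lucene.NET 4.8; LoadAllProducts() is a hypothetical data-access helper:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Build the initial index by iterating over every product record.
using var dir = FSDirectory.Open("product-index");
using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
using var writer = new IndexWriter(dir,
    new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer));

foreach (var product in LoadAllProducts()) // hypothetical helper
{
    var doc = new Document();
    doc.Add(new StringField("id", product.Id.ToString(), Field.Store.YES)); // exact-match key
    doc.Add(new TextField("name", product.Name, Field.Store.YES));          // analyzed, searchable
    doc.Add(new TextField("description", product.Description, Field.Store.NO));
    // UpdateDocument keeps the load idempotent: re-running replaces by id.
    writer.UpdateDocument(new Term("id", product.Id.ToString()), doc);
}
writer.Commit();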
This is a great platform. We initially tried to use the free-text search and found it to be a pain to create, update, and manage the indexes. The searches were not that much faster than a standard SQL search. It did provide some flexibility in the search query... but even this pales in comparison to the power of Lucene!

How can I search data in a database efficiently without using full-text search

I want to search for a sentence (a combination of words) in some table or view of the DB. I don't want to use the full-text search feature of the DB. Is there any alternative, efficient way?
Without the use of an index, a database has to perform a "full table scan". This is rather like you looking through a book one page at a time to find what you need.
That being said, computers are a lot faster than humans. It really depends on how much load your system has. Using MySQL, we successfully implemented a search system on a table of lead information. The nature of the problem was one that could not be solved by normal indexes (including full-text), so we designed it to be powered by a full table scan.
That involved creating tables as narrow as possible with the search data, and joining them to a larger table with related, but non-search data.
At the time (4 years ago), 100,000 records could be scanned in .06 seconds. 1,000,000 records took about .6 seconds. The system is still in heavy production use with millions of records.
If your data needs exceed six digits of records (i.e., beyond the hundreds of thousands), you may want to re-evaluate using a full-text index, or do some research on inverted indexes.
Please comment if you would like any more info.
Edit: The search tables were kept as narrow as possible. Ideally 50-100 bytes per record. ENUMS and TINYINT are great space savers if you can use them to "map" to string values another way.
The search queries were generated using a PHP class. They were simply:
-- MainTable is the big table that holds all of the data
-- SearchTable is the narrow table that holds the bits of searchable data
SELECT
    MainTable.ID,
    MainTable.Name,
    MainTable.Whatever
FROM
    MainTable, SearchTable
WHERE
    MainTable.ID = SearchTable.ID
    AND SearchTable.State IN ('PA', 'DE')
    AND SearchTable.Age < 40
    AND SearchTable.Status = 3
Essentially, the two tables were joined on a primary key (fast), and the filtering was done by full table scan on the SearchTable (pretty fast). We were using MySQL.
We found that by having the record format == "FIXED" in the MyISAM tables, we could increase performance by 3x. This meant no BLOBs, no VARCHARs, etc...
Let me know if this helps.
None is as efficient as full-text search.
Basically it boils down to WHERE with LIKE and its derivatives, and since indexes are tossed away in most of those scenarios, it becomes a very expensive query.
If you are using Java, have a look at Lucene.
If you are using .NET, have a look at Lucene.Net; it will minimize the calls to the database for search queries.
The following is from http://incubator.apache.org/lucene.net/:

Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.

Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.

In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
You could break up the text into individual words, stick them in a separate table, and use that to find PK IDs that have all the words in your search sentence (i.e., but not necessarily in the right order), and then search just those rows for the sentence. That should avoid having to do a table scan every time; see the sketch below.
Please ask if you need me to explain further
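A minimal sketch of that word-table idea in C# with ADO.NET; the EntryWords table and column names are hypothetical, and the GROUP BY/HAVING trick implements the "has all the words" test:

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.SqlClient;

public static class WordIndex
{
    // Split an entry's text into distinct words and store one row per word.
    public static void IndexEntry(SqlConnection conn, int entryId, string text)
    {
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', ';', ':', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Distinct();
        foreach (var word in words)
        {
            using var cmd = new SqlCommand(
                "INSERT INTO EntryWords (EntryId, Word) VALUES (@id, @w)", conn);
            cmd.Parameters.AddWithValue("@id", entryId);
            cmd.Parameters.AddWithValue("@w", word);
            cmd.ExecuteNonQuery();
        }
    }

    // Find ids of entries containing ALL of the search words, in any order.
    public static List<int> FindCandidateIds(SqlConnection conn, string[] searchWords)
    {
        var names = searchWords.Select((_, i) => "@w" + i).ToArray();
        using var cmd = new SqlCommand(
            $@"SELECT EntryId FROM EntryWords
               WHERE Word IN ({string.Join(", ", names)})
               GROUP BY EntryId
               HAVING COUNT(DISTINCT Word) = @n", conn);
        for (int i = 0; i < searchWords.Length; i++)
            cmd.Parameters.AddWithValue(names[i], searchWords[i].ToLowerInvariant());
        cmd.Parameters.AddWithValue("@n", searchWords.Length);

        var ids = new List<int>();
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            ids.Add(reader.GetInt32(0));
        return ids; // then scan only these rows for the exact sentence
    }
}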
