I have a collection in MongoDB which I am indexing into Elasticsearch. I am doing this in a C# process. The collection has 100 million documents, and for each document, I have to query other documents in order to denormalise into the Elasticsearch index.
This all takes time. Reading from MongoDB is the slow part (indexing is relatively quick). I am batching the data from MongoDB as efficiently as I can but the process takes over 2 days.
This only has to happen when the mapping in Elasticsearch changes, but that has happened a couple of times over the last month.
Are there any ways of improving the performance for this?
Maybe you don't need to relaunch the import from scratch (I mean the import from MongoDB) when you change mappings. Read this: Elasticsearch Reindex API
When you need to change a mapping, you must:
Create a new index with the new mapping
Reindex the data from the old index into the new index using Elasticsearch's built-in feature.
After this, the old documents will be indexed with the new mapping inside the new index. The built-in reindex in Elasticsearch will be much quicker than re-importing from MongoDB via the HTTP API.
If you use reindex, don't forget the wait_for_completion parameter (described in the documentation): setting wait_for_completion=false will run the reindex as a background task instead of blocking the request.
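As a rough illustration, this is what kicking off a server-side reindex from C# could look like with a plain HttpClient; the index names and the localhost:9200 address are assumptions for the example, not details from your setup:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ReindexExample
{
    static async Task Main()
    {
        // Assumed: Elasticsearch is reachable on localhost:9200 and the new
        // index ("products_v2") has already been created with the new mapping.
        var client = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

        var body = @"{
            ""source"": { ""index"": ""products_v1"" },
            ""dest"":   { ""index"": ""products_v2"" }
        }";

        // wait_for_completion=false makes Elasticsearch run the reindex as a
        // background task and return a task id immediately.
        var response = await client.PostAsync(
            "_reindex?wait_for_completion=false",
            new StringContent(body, Encoding.UTF8, "application/json"));

        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}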
Will this approach solve your problem?
I found NEST for ElasticSearch, but I don't understand the relationship between Redis and ElasticSearch. I'm building a social network and would like to know which parts of the project should use Redis, which parts should use ElasticSearch, and which parts should use a combination of the two.
I use C#, BookSleeve for Redis, ElasticSearch with NEST, and ASP.NET MVC.
There is exactly zero relationship between these two things. I suspect you may have gotten the wrong end of the stick in a previous conversation, where you were wanting to search inside an individual value in Redis for uses of a word (this question: How to search content value in redis by BookSleeve). The point I was trying to make is that this simply isn't a feature of Redis. So you have two options:
write your own word extraction code (stemmer, etc) and build an index manually inside redis
use a tool that is designed to do all of that for you
Tools like ElasticSearch (which sits on top of Lucene) are good at that.
Or to put the question in other terms:
X asks "how do I cut wood in half with a screwdriver"
Y says "use a saw"
X then asks "how do I use a screwdriver with a saw to cut wood in half?"
Answer: you don't. These things are not related.
Actually, Redis and Elasticsearch can be combined in quite a useful way. If you are pushing data into Elasticsearch from a source stream, and that stream suddenly bursts beyond what your Elasticsearch instance can ingest, data will be dropped. If, however, you put a Redis instance in front of Elasticsearch as a buffer, your Elasticsearch instance can survive the burst without losing data, because it is held in Redis until Elasticsearch catches up.
That's just one example, but there are many more. See here for an example of how to cache queries.
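To make the buffering pattern concrete, here is a minimal sketch. It assumes StackExchange.Redis (rather than BookSleeve) on the Redis side and a plain HTTP call to Elasticsearch; the list key, index name, and addresses are made up for the example:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using StackExchange.Redis;

class RedisBufferExample
{
    static readonly ConnectionMultiplexer Redis = ConnectionMultiplexer.Connect("localhost");
    static readonly HttpClient Es = new HttpClient { BaseAddress = new System.Uri("http://localhost:9200") };

    // Producer: push incoming JSON documents onto a Redis list as fast as they arrive.
    public static Task BufferAsync(string jsonDoc) =>
        Redis.GetDatabase().ListRightPushAsync("es:ingest", jsonDoc);

    // Consumer: drain the list at whatever rate Elasticsearch can sustain.
    public static async Task DrainAsync()
    {
        var db = Redis.GetDatabase();
        RedisValue doc;
        while ((doc = await db.ListLeftPopAsync("es:ingest")).HasValue)
        {
            await Es.PostAsync("posts/_doc",
                new StringContent((string)doc, Encoding.UTF8, "application/json"));
        }
    }
}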
I have an MVC site that uses Lucene.net for its searching capabilities. The site has over 100k products, and the indexes are already built. The site, however, also has 2 data feeds that update the database on a regular basis (potentially every 15 minutes), so the data is changing a lot. How should I go about updating the Lucene indexes, or do I not have to at all?
Use a process scheduler (like Quartz.Net) to run a job every so often (potentially every 15 minutes) that fetches the items in the database that aren't indexed yet.
Use a field as an ID to compare against (like a sequence number or a date time). You would fetch the latest added document from the index and the latest from the database and index everything in between. You have to be careful not to index duplicates (or worse, skip over un-indexed documents).
Alternatively, synchronize your indexing with the 2 data feeds and index the documents as they are stored in the database, saving you from the pitfalls above (duplicates/missing). I'm unsure how these feeds are updating your database, but you can intercept them and update the index accordingly.
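As a rough sketch of the marker-comparison approach (index everything newer than the last indexed document, keyed by ID so duplicates are impossible), here is what it could look like with Lucene.Net 4.8; the Product shape, field names, and how you query the database for changed rows are placeholders, not part of your actual code:

using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

class IncrementalIndexer
{
    // Hypothetical record shape; substitute your own product entity and data access.
    public record Product(string Id, string Name, DateTime UpdatedUtc);

    public static void IndexNewOrChanged(string indexPath, IEnumerable<Product> changedSinceLastRun)
    {
        var dir = FSDirectory.Open(indexPath);
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

        using var writer = new IndexWriter(dir, config);
        foreach (var p in changedSinceLastRun)
        {
            var doc = new Document
            {
                new StringField("id", p.Id, Field.Store.YES),
                new TextField("name", p.Name, Field.Store.YES),
                new StringField("updated", p.UpdatedUtc.ToString("o"), Field.Store.YES)
            };

            // UpdateDocument deletes any existing document with the same id term
            // before adding the new one, which avoids the duplicate problem above.
            writer.UpdateDocument(new Term("id", p.Id), doc);
        }
        writer.Commit();
    }
}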
Take a look at this solution; I had the same requirement, used the approach from the link below, and it worked for me. Using a timer, it rebuilds the index every so often, so there won't be any overlap/skipping issue. Give it a try.
Making Lucene.Net thread safe in the code
Thanks.
I have many thousands of files in Google Cloud Storage, and I'm writing a .Net application to process the list of files. I'm using the SharpGs .Net library (https://github.com/acropolium/SharpGs), which seems simple and easy enough to use. However, I only seem to be getting back 1000 objects.
I am using the following code:
var bucket = GoogleStorageClient.GetBucket(rootBucketName);
var objects = bucket.Objects;
There doesn't seem to be any obvious way to obtain the next 1000 objects so I'm a bit stuck at the moment.
Does anyone have any ideas or suggestions?
I am not familiar with this particular library, but 1000 objects is the current limit for a single list call. Beyond that, you'd need to use paging to access the rest of the objects. If this library has support for paging, I'd recommend using that.
If you look at the Bucket class:
https://github.com/acropolium/SharpGs/blob/master/SharpGs/Internal/Bucket.cs#L33
It returns a Query object. The Query object allows you to pass in a Marker parameter:
https://github.com/acropolium/SharpGs/blob/master/SharpGs/Internal/Query.cs#L36
You will have to take the initial Query object, extract its marker, then pass it to a new Query to get the next page of results.
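If SharpGs doesn't expose that cleanly, one alternative (bypassing the library entirely) is to page through Google Cloud Storage's documented JSON API yourself. This sketch assumes you already have a valid OAuth 2.0 access token from elsewhere in your application; the bucket name is a placeholder:

using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

class GcsListing
{
    // Pages through all object names in a bucket using the GCS JSON API (objects.list).
    public static async Task<List<string>> ListAllObjectsAsync(string bucket, string accessToken)
    {
        var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", accessToken);

        var names = new List<string>();
        string pageToken = null;
        do
        {
            // Each call returns at most 1000 items plus a nextPageToken for the next page.
            var url = $"https://storage.googleapis.com/storage/v1/b/{bucket}/o?maxResults=1000"
                      + (pageToken == null ? "" : $"&pageToken={pageToken}");
            using var doc = JsonDocument.Parse(await http.GetStringAsync(url));

            if (doc.RootElement.TryGetProperty("items", out var items))
                foreach (var item in items.EnumerateArray())
                    names.Add(item.GetProperty("name").GetString());

            pageToken = doc.RootElement.TryGetProperty("nextPageToken", out var token)
                ? token.GetString()
                : null;
        } while (pageToken != null);

        return names;
    }
}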
I am using MongoDB in our project and I'm currently learning how things work.
I have created a collection with 5 million records. When I run the query db.ProductDetails.find() in the console, it takes too much time to display all the data.
Also, when I use the following code in C#:
var Products = db.GetCollection("ProductDetails").FindAll().Documents.ToList();
the system throws an OutOfMemoryException after some time.
Is there any other faster or more optimized way to achieve this ?
Never try to fetch all entries at the same time. Use filters or get a few rows at a time.
Read this question: MongoDB - paging
Try to get only the subset that is needed. If you try to fetch all objects, you will certainly need at least as much RAM as the size of your database collection!
Try to fetch the objects which will be used in the application.
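For example, with the current 2.x C# driver (the snippet in the question uses an older API), you can walk the collection in _id-ordered batches instead of materialising everything at once. The connection string, database name, and batch size below are illustrative only:

using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

class BatchedRead
{
    public static async Task ProcessInBatchesAsync()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var coll = client.GetDatabase("shop").GetCollection<BsonDocument>("ProductDetails");

        BsonValue lastId = BsonMinKey.Value;
        while (true)
        {
            // Fetch the next 1000 documents after the last _id we saw.
            // This relies on _id ordering (e.g. ObjectIds), avoiding large skips.
            var batch = await coll.Find(Builders<BsonDocument>.Filter.Gt("_id", lastId))
                                  .Sort(Builders<BsonDocument>.Sort.Ascending("_id"))
                                  .Limit(1000)
                                  .ToListAsync();
            if (batch.Count == 0) break;

            foreach (var doc in batch)
            {
                // ... process a single document here ...
            }
            lastId = batch[^1]["_id"];
        }
    }
}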
I've been tasked with setting up a search service on an eCommerce site.
Currently it uses full-text indexing on SQL Server, which isn't ideal, as it's slow and not all that flexible.
How would you suggest I approach changing this over to Lucene?
By that I mean: how would I initially load all the data into the indexes, and how would they be maintained? In my "insert product" methods, would I also have them insert the product into the index?
Any information is of great help!
I'm currently using Solr, which is built on top of Lucene, as the search engine for one of my e-commerce projects. It works great.
http://lucene.apache.org/solr/
Also as far as keeping the products in sync between the DB and Solr, you can either build your own "sweeper" or implement the DataImportHandler in Solr.
http://wiki.apache.org/solr/DataImportHandler
We built our own sweeper that reads a DB view at some interval and checks whether there are new products or any product data has been updated. It's a brute-force method, and I wish I had known about the DataImportHandler before.
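For what it's worth, the core of a sweeper like that can be quite small: it just posts the changed rows to Solr's update handler as JSON. In this sketch the Solr URL, core name, and document shape are placeholders, and fetching the changed rows from your DB view is assumed to happen elsewhere in your code:

using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class SolrSweeper
{
    static readonly HttpClient Http = new HttpClient();

    // "changedProducts" would come from your DB view of rows added/updated since
    // the last sweep; each dictionary maps Solr field names to values.
    public static async Task PushToSolrAsync(IEnumerable<Dictionary<string, object>> changedProducts)
    {
        var json = JsonSerializer.Serialize(changedProducts);

        // Posting a JSON array of documents to /update adds or replaces them by uniqueKey;
        // commit=true makes them visible to searches immediately.
        var response = await Http.PostAsync(
            "http://localhost:8983/solr/products/update?commit=true",
            new StringContent(json, Encoding.UTF8, "application/json"));

        response.EnsureSuccessStatusCode();
    }
}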
Facets are also a really powerful part of Solr. I highly recommend using them.
If you do decide to use Lucene.NET for your search, you need to do some of the following:
create your initial index by iterating through all your records and writing the data that you want searched into your index (a rough sketch follows this list)
if the number of records and the amount of data you are writing make for large indexes, consider splitting them into multiple indexes (this means you will have to write a more complex search program, as you need to search each index and then merge the results!)
when a product is updated or created, you need to update your index (there is a process here to create additional index parts and then merge the indexes)
if you have a high-traffic site and there is the possibility of multiple searches occurring at the exact same moment, then you need to create a wrapper that can do the search for you across multiple duplicate indexes (or sets of indexes) (think singleton pattern here), as the index can only be accessed (opened) for one search at a time
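For the first point (building the initial index by iterating through all your records), a minimal Lucene.Net 4.8 sketch might look like the following; the product tuple and field names are placeholders for your own entities:

using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

class InitialIndexBuilder
{
    // Builds the index from scratch by iterating over every product record.
    public static void BuildIndex(string indexPath, IEnumerable<(string Id, string Name, string Description)> allProducts)
    {
        using var writer = new IndexWriter(
            FSDirectory.Open(indexPath),
            new IndexWriterConfig(LuceneVersion.LUCENE_48, new StandardAnalyzer(LuceneVersion.LUCENE_48)));

        foreach (var p in allProducts)
        {
            writer.AddDocument(new Document
            {
                new StringField("id", p.Id, Field.Store.YES),            // exact-match key
                new TextField("name", p.Name, Field.Store.YES),          // analysed, searchable
                new TextField("description", p.Description, Field.Store.NO)
            });
        }
        writer.Commit();
    }
}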
This is a great platform. We initially tried to use the free-text search in SQL Server and found it to be a pain to create, update, and manage the indexes. The searches were not that much faster than a standard SQL search. They did provide some flexibility in the search query... but even this pales in comparison to the power of Lucene!