Best strategy for committing and optimizing documents in Lucene.Net? - c#

I have some doubts about what is better in terms of performance and best practices.
My system will do:
1. Per-document insert and update, or
2. Batch document insert
As I figured out (in previous systems), #2 is straightforward:
Bulk delete the old docs and add new ones, 10k docs max with no more than 20 fields per doc
Commit
Optimize
But #1 still puzzles me, as some customers will add docs one by one.
What is the penalty of committing and optimizing on every insert and update? Or can I just ignore it, since it only happens about 20 times per day?
The Java version is 3.5, the .NET version is 3.0.3.
I just saw a blog post and want to know what the community has to say about it.

I see no need to .Optimize() at all. Lucene will handle segment merges automatically, and you can provide your own logic to change how the merges are calculated. You could write something that merged away deleted documents when 10% of your documents are marked for deletion. There's no need to force Lucene to merge away every single deleted document.
Sure, you'll end up with more segment files and they will consume file descriptors, but have you ever run into problems where you had too many files open? I tried googling for the maximum number of open files on a Windows server installation, but the answers vary from several thousand to limited by available memory.
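To make the per-document case (#1) concrete: keep one IndexWriter open, use UpdateDocument keyed on a unique id field, and call Commit per change (at roughly 20 changes a day that cost is negligible) while never calling Optimize. A minimal sketch against Lucene.Net 3.0.3; the field names and index path are made up:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var directory = FSDirectory.Open(new DirectoryInfo(@"C:\indexes\products"));  // hypothetical path
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Per-document "upsert": delete-then-add keyed on a unique id field.
    var doc = new Document();
    doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("title", "Some product", Field.Store.YES, Field.Index.ANALYZED));
    writer.UpdateDocument(new Term("id", "42"), doc);

    // Commit makes the change durable and visible to newly opened readers.
    // No Optimize() call here; segment merges happen automatically.
    writer.Commit();
}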

Related

TFS max check-in files limitation

I'm using TFS API to manage versions of my application's data.
On first use I'm trying to import all of the database data into the TFS workspace, and then the check-in gets stuck for a long time (it can take more than an hour, if it doesn't hang forever). I'm dealing with 100,000-200,000 files to check in.
Is there any limit in TFS on the number of files in a check-in? If not, what could be the bottleneck of this operation?
Would splitting the check-in into smaller batches of files help? If so, is there any recommended batch size?
The number of changes in a changeset is stored as the CLR's int type.
So there's definitely an upper limit of int.MaxValue or 2,147,483,647.
For more details you can refer to the answer from Edward in this question: Is there a limit on the number of files in a changeset in TFS?
In other words, you are far from the check-in limit. A check-in process that hangs or aborts is more likely related to the network connection and the current system load.
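If you do want to try splitting the check-in into smaller batches, as you suggested, the version control client API lets you check in subsets of the pending changes. A rough, untested sketch; the collection URL, workspace path, and batch size are invented:

using System;
using System.Linq;
using Microsoft.TeamFoundation.Client;
using Microsoft.TeamFoundation.VersionControl.Client;

var collection = TfsTeamProjectCollectionFactory.GetTeamProjectCollection(
    new Uri("http://tfsserver:8080/tfs/DefaultCollection"));
var vcs = collection.GetService<VersionControlServer>();
var workspace = vcs.GetWorkspace(@"C:\MyWorkspace");        // hypothetical local path

PendingChange[] pending = workspace.GetPendingChanges();
const int batchSize = 5000;                                 // arbitrary; experiment

for (int i = 0; i < pending.Length; i += batchSize)
{
    // Check in one slice of the pending changes at a time.
    PendingChange[] batch = pending.Skip(i).Take(batchSize).ToArray();
    workspace.CheckIn(batch, "Initial data import, batch " + (i / batchSize + 1));
}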
Moreover, as mentioned in the comments above, it's not recommended to check in and version-control database data files in TFS. I suggest you create scripts instead. Here is also a discussion about it: Do you use source control for your database items?
The databases themselves? No.
The scripts that create them, including static data inserts, stored procedures and the like; of course. They're text files, they are included in the project and are checked in and out like everything else.

Database Relational Records Archive & Restore

Years back, I created a small system against a requirement where an image snapped on Android was uploaded to a server along with its custom data; the image was stored on disk and the custom data describing it was broken up and stored in the database. Each snapped image was part of a campaign. Over time the system kept growing, and there are now over 10,000 campaigns with 500-1,000 images per campaign. The performance is not all that bad yet, but I believe it's just a matter of time. We are now thinking of archiving past campaigns into another database called Archive. Here is what I am planning to do.
The Archive database will have the exact same structure, and the archive functionality may have a search mechanism; however, retrieval speed is not much of a concern here, as this will happen very rarely.
I was thinking of removing records from one database and cloning them in the other; however, the identity column probably will not let me do that very seamlessly (and I may be wrong).
There needs to be a restore option too. (This is probably the most challenging part)
If I just blank out the records (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it will defeat the purpose of the exercise.
Any advice on this? Is there any known strategy, pattern, literature, or even a link that may guide me on this?
Thank you in advance for your help.
I say: as long as you don't run out of space on your server, leave it as it is.
Over time the system kept growing, and there are now over 10,000 campaigns with 500-1,000 images per campaign.
→ That's 5-10 million rows (created over several years).
For SQL Server, that's not that much.
Yes, I know...we're talking about image files stored in the database, not "regular" rows. Still, if your server has reasonably sized hardware, it shouldn't really matter.
I'm talking from experience here - at work, we have a SQL Server database which we use to store PDF files and images.
In our case, we're using a "regular" image column - since you're using SQL Server 2008, you could even use FILESTREAM (maybe you already do, but I don't know - you didn't say anything about how exactly you're storing the images in the database).
We started the project on SQL Server 2005, where FILESTREAM wasn't available yet. In the meantime, we upgraded to SQL Server 2012, but never changed the data type in the table where we're storing the files.
If you still prefer creating a separate archive database and moving old data there, one piece of advice concerning this:
2) I was thinking of removing records from one database and cloning them in the other; however, the identity column probably will not let me do that very seamlessly (and I may be wrong).
[...]
4) If I just blank out the records (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it will defeat the purpose of the exercise.
You don't need to set the column to identity in the archive database as well.
Just leave everything as it is in the main database, but remove the identity setting from the primary key in the archive database.
The archive database doesn't ever need to generate new keys (hence no need for identity), you're just copying rows with already existing keys from the main database.
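To illustrate that, here is a rough sketch that copies old rows, keys included, into the archive with SqlBulkCopy and only then deletes them from the main database. The connection strings, table, and column names are invented, and the archive table is assumed to have the same columns but no IDENTITY on the key:

using System;
using System.Data.SqlClient;

public static void ArchiveOldCampaigns(DateTime cutoff)
{
    // Hypothetical connection strings.
    var mainCs    = "Server=.;Database=Main;Integrated Security=true";
    var archiveCs = "Server=.;Database=Archive;Integrated Security=true";

    using (var source = new SqlConnection(mainCs))
    using (var target = new SqlConnection(archiveCs))
    {
        source.Open();
        target.Open();

        // Read the campaigns to archive, including their existing identity values.
        var select = new SqlCommand(
            "SELECT CampaignId, Name, CreatedOn FROM dbo.Campaign WHERE CreatedOn < @cutoff", source);
        select.Parameters.AddWithValue("@cutoff", cutoff);

        using (var reader = select.ExecuteReader())
        using (var bulk = new SqlBulkCopy(target) { DestinationTableName = "dbo.Campaign" })
        {
            // The archive table has no IDENTITY on CampaignId, so the original
            // key values are written across as plain integers.
            bulk.WriteToServer(reader);
        }

        // Only after the copy succeeds, remove the archived rows from the main database.
        var delete = new SqlCommand(
            "DELETE FROM dbo.Campaign WHERE CreatedOn < @cutoff", source);
        delete.Parameters.AddWithValue("@cutoff", cutoff);
        delete.ExecuteNonQuery();
    }
}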
I think a good solution for your case is SSIS. This technology can load large volumes of data into your Archive system quickly. In addition, you can use table partitioning to improve the performance of working with big data in the Archive system. Also check out columnstore indexes (but that depends on your version of SQL Server).
I created such a solution with the following steps (a rough sketch follows the list):
1) Switch a partition out of the main table t into another table t_1 (the oldest rows in the table) in the production system
2) Load the data from table t_1 into the Archive system
3) Drop or truncate table t_1
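As a rough illustration of steps 1 and 3 driven from C# (the table name, partition number, and connection string are invented; t_1 must have the same schema and sit on the same filegroup as the partition being switched out):

using System.Data.SqlClient;

using (var conn = new SqlConnection("Server=.;Database=Production;Integrated Security=true"))
{
    conn.Open();

    // Step 1: metadata-only switch of the oldest partition out of the main table.
    new SqlCommand("ALTER TABLE dbo.t SWITCH PARTITION 1 TO dbo.t_1", conn).ExecuteNonQuery();

    // Step 2: load dbo.t_1 into the Archive database (SSIS package or SqlBulkCopy).

    // Step 3: once the archive load has succeeded, empty the staging table.
    new SqlCommand("TRUNCATE TABLE dbo.t_1", conn).ExecuteNonQuery();
}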

Update Lucene.net Indexes Regularly

I have an MVC site that uses Lucene.net for its search capabilities. The site has over 100k products, and the indexes for the site are already built. However, the site also has 2 data feeds that update the database on a regular basis (potentially every 15 minutes), so the data changes a lot. How should I go about updating the Lucene indexes, or do I not have to at all?
Use a process scheduler (like Quartz.Net) to run every so often (potentially, every 15 minutes) to fetch the items in the database that aren't indexed.
Use a field as an ID to compare against (like a sequence number or a date time). You would fetch the latest added document from the index and the latest from the database and index everything in between. You have to be careful not to index duplicates (or worse, skip over un-indexed documents).
Alternatively, synchronize your indexing with the 2 data feeds and index the documents as they are stored in the database, saving you from the pitfalls above (duplicates/missing). I'm unsure how these feeds are updating your database, but you can intercept them and update the index accordingly.
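A sketch of the scheduled, high-water-mark approach described above; the Product type, the GetProductsChangedSince helper, and all field names are hypothetical stand-ins for your own data access and schema:

using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

public class Product                     // hypothetical shape of a product row
{
    public int Id;
    public string Name;
    public DateTime UpdatedOn;
}

public static void IndexChangedProducts(IndexWriter writer, IndexSearcher searcher)
{
    // 1. Find the most recent "updated" value already in the index.
    var newestFirst = new Sort(new SortField("updatedTicks", SortField.LONG, true));
    var top = searcher.Search(new MatchAllDocsQuery(), null, 1, newestFirst);
    var lastIndexed = top.ScoreDocs.Length == 0
        ? DateTime.MinValue
        : new DateTime(long.Parse(searcher.Doc(top.ScoreDocs[0].Doc).Get("updatedTicks")));

    // 2. Re-index everything the database has changed since then.
    foreach (Product p in GetProductsChangedSince(lastIndexed))   // hypothetical DB call
    {
        var doc = new Document();
        doc.Add(new Field("id", p.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("name", p.Name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("updatedTicks", p.UpdatedOn.Ticks.ToString(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));

        // UpdateDocument = delete-then-add on the id term, so re-indexing a row
        // that was already indexed does not create a duplicate.
        writer.UpdateDocument(new Term("id", p.Id.ToString()), doc);
    }

    writer.Commit();
}

// Placeholder for your own data access code.
static IEnumerable<Product> GetProductsChangedSince(DateTime since) { yield break; }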
Take a look at this solution; I had the same requirement, used the solution from this link, and it worked for me. Using a timer, it rebuilds the index every so often, so there won't be any overlap/skipping issues. Give it a try.
Making Lucene.Net thread safe in the code
Thanks.

Inserting large csv files into a database

We have a web application that must allow the user to upload files with zip codes; these are .csv files. Any user will be able to upload the file from their computer. The issue is that a file may contain thousands of records. Right now I am receiving the file and making sure it has the right headers, but I am pushing the records into the database one by one.
I am using C# ASP.NET. Is there a better, more efficient way to do this from code? We can't use any external importers or data-import tools like SQL Server Business Intelligence. How can I do this? I was reading something about loading it into memory and then pushing it to the database. Any URLs, examples or suggestions would be much appreciated.
Regards
Firstly, I'm pretty sure that what you are asking is actually "How do you process a large file and insert the processed data into the database?".
Now, assuming I am correct, I would say the question is akin to 'how long is a piece of string?'. The reality is that an implementation for processing large files into a database is highly specific to your requirements.
However, at the simplest end of the spectrum you could simply upload the file straight into a table (or a folder) and create a Windows service that runs every x minutes, traverses the table, picks up each file, and processes your data using bulk inserts and the prepare method (which may give you some performance benefits).
Alternatively you could look at something like MSMQ (Microsoft Message Queuing) and save any uploaded files directly to a queue, which is then completely independent of your application, can be processed at any point in time, and can easily be scaled out.
At the end of the day, though, I honestly don't think anyone here can give you a 'correct' answer to your question, because there really isn't one, and you'll only find improvements to your implementation by experimentation.
If the file contains up to a million records, the best approach is to create a service that manages inserting the records into the database, to avoid timeouts and reduce the stress on the IIS web process.
If you make it a Windows service, you can notify the service to process the uploaded files in the directory where they were placed.
Also, I would suggest using bulk insert for faster database transactions (see the sketch below).
If there is validation to do, you can stage the data into a different database, validate it, and then push it to the final database.
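A rough sketch of the bulk-insert suggestion using SqlBulkCopy; the table and column names are invented, and String.Split is only a stand-in for a proper CSV parser:

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

public static void BulkInsertZipCodes(string csvPath, string connectionString)
{
    // Build an in-memory table matching the destination schema.
    var table = new DataTable();
    table.Columns.Add("ZipCode", typeof(string));
    table.Columns.Add("City", typeof(string));
    table.Columns.Add("State", typeof(string));

    foreach (var line in File.ReadLines(csvPath).Skip(1))      // skip the header row
    {
        var parts = line.Split(',');
        table.Rows.Add(parts[0], parts[1], parts[2]);
    }

    // One bulk operation instead of thousands of single-row INSERTs.
    using (var connection = new SqlConnection(connectionString))
    using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.ZipCode" })
    {
        connection.Open();
        bulk.BatchSize = 5000;                                  // arbitrary; tune for your load
        bulk.WriteToServer(table);
    }
}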
Since these records go into the same table and are not related to each other, Parallel.ForEach may be a valid answer here. Assuming you have a static method (it may not necessarily need to be static) that inserts an individual record into the db, you can run a Parallel.ForEach loop over an array where each element represents a line of the CSV.
This assumes that uploading the large file to the server isn't the initial issue. If that is also part of the issue, I would recommend zipping the file and then using something like SharpZipLib to unzip it once it is uploaded. Since text compresses very well, this may be the biggest boon to performance from the user's perspective.
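A sketch of the Parallel.ForEach idea above (table, column, and parameter names are invented); each iteration opens its own pooled connection, but it still pays one round trip per row, so it's worth measuring against a bulk insert:

using System.Data.SqlClient;
using System.IO;
using System.Threading.Tasks;

public static void ParallelInsert(string csvPath, string connectionString)
{
    string[] lines = File.ReadAllLines(csvPath);

    Parallel.ForEach(lines, line =>
    {
        var parts = line.Split(',');

        // Each iteration uses its own connection; pooling keeps this reasonably cheap.
        using (var connection = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO dbo.ZipCode (ZipCode, City, State) VALUES (@zip, @city, @state)", connection))
        {
            cmd.Parameters.AddWithValue("@zip", parts[0]);
            cmd.Parameters.AddWithValue("@city", parts[1]);
            cmd.Parameters.AddWithValue("@state", parts[2]);
            connection.Open();
            cmd.ExecuteNonQuery();
        }
    });
}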

Full Text Search with constantly updating data

I'm working on a web application in ASP.NET MVC which involves a fairly complex (I think) search situation. Basically, I have a bunch of entries with a title and content. These are the fields that I want to provide full-text search for. The catch is that I also keep track of a rating on these entries (like up-vote/down-vote). I'm using MongoDB as my database, and I have a separate collection for all these votes. I plan on using a map/reduce function to turn all of the documents in the votes collection into a single "score" for the article. When I perform a search, I want the article's score to be influential on the rankings.
I've been looking at many different full-text search services, and it looks like all the cool kids are using Lucene (and in my case, Lucene.NET). The problem is that since the score is not part of the document when I first create the index, I don't know how I would set up Lucene. Each time somebody votes for an article, do I need to update the Lucene index? I'm a little lost here.
I haven't written any of this code yet, so if you have a better way to solve this problem, please share.
The problem is that since the score is not part of the document when I first create the index, I don't know how I would set up Lucene.
What's the problem? Just use a default value for the rating/votes (probably 0), and update it later when people start voting.
Each time somebody votes for an article, do I need to update the Lucene index?
No, this can be expensive and slow. Your app will probably have a huge volume of updates, and Lucene can be slow when you flush to disk often. In general, for almost any full-text search scenario, real-time updates are not as important as the search itself. So I suggest the following strategy:
Solution #1:
1. Create a collection in MongoDB where you will store all pending updates related to Lucene:
{
_id,
title,
content,
rating, //increment it
status(new, updated, delete) // you need this for lucene
}
2. After this, create a tool that processes all these updates in the background (once every 10 minutes, for example). Just keep in mind that you should flush data to disk only after, say, 10,000 Lucene updates/inserts/deletes, to keep the index updates fast.
With the above solution your data can be stale for up to 10 minutes, but inserts will be faster.
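A rough sketch of what that background processor (solution #1) could look like on the Lucene.Net side; the PendingUpdate type, field names, and status values mirror the collection sketched above and are otherwise assumptions, and reading the queued documents out of MongoDB is left to your driver code:

using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class PendingUpdate              // hypothetical shape of one queued change
{
    public string Id;
    public string Title;
    public string Content;
    public int Rating;
    public string Status;               // "new", "updated" or "delete"
}

public static void ApplyPendingUpdates(IndexWriter writer, IEnumerable<PendingUpdate> updates)
{
    foreach (var u in updates)
    {
        if (u.Status == "delete")
        {
            writer.DeleteDocuments(new Term("_id", u.Id));
            continue;
        }

        var doc = new Document();
        doc.Add(new Field("_id", u.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("title", u.Title, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("content", u.Content, Field.Store.NO, Field.Index.ANALYZED));
        doc.Add(new Field("rating", u.Rating.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        // "new" and "updated" both go through UpdateDocument (delete-then-add by _id).
        writer.UpdateDocument(new Term("_id", u.Id), doc);
    }

    // One commit/flush for the whole batch instead of one per vote.
    writer.Commit();
}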
Solution #2:
Send an async message for each Lucene-related update.
Handle these messages and update Lucene each time a message arrives.
Async handling is very important; otherwise it can affect application performance.
I would go with #1, because it should be less expensive for the server.
Choose what you like more.
Go straight to MongoDB (or whatever database you use) and increment and decrement the votes there. In my view you have to be constantly updating the database; there's no need to get complicated. When something is added, add it to the database; update, insert, and delete all the time whenever there is a change on the website. Changes need to be tracked, and the place to track them is MongoDB or the SQL database. For searching fields, use MongoDB's field search parameters, combine all the fields it returns, and rank them yourself.
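For the vote counters themselves, an atomic increment in MongoDB is a one-liner. A sketch with the current C# driver (the connection string, database, collection, and field names are invented, and the 2.x driver API postdates the original question):

using MongoDB.Bson;
using MongoDB.Driver;

// Atomically bump an article's score when a vote comes in.
// delta is +1 for an up-vote, -1 for a down-vote.
public static void RecordVote(string articleId, int delta)
{
    var client = new MongoClient("mongodb://localhost:27017");
    var articles = client.GetDatabase("mysite").GetCollection<BsonDocument>("articles");

    articles.UpdateOne(
        Builders<BsonDocument>.Filter.Eq("_id", new ObjectId(articleId)),
        Builders<BsonDocument>.Update.Inc("score", delta));
}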
