I'm using Lucene.Net, and I'm importing roughly 255k documents with about 6 fields each. I've tried a few things, but the process takes a very long time (~1 day). I'm not using any exotic analyzer, just the standard analyzer, and I'm tokenizing only one of the fields. I tried changing the max merge docs and it made no difference.
Has anyone bumped into this problem?
Thanks and best regards
I took a different approach and decided to post the result, so that anyone facing the same problem may find this alternative useful.
Lucene.Net has an interesting feature that allows you to merge two indexes, so my idea was to index my content into several smaller indexes and then join them using the merge feature.
This has worked for me. I tested this solution by indexing WordNet and running queries against it, and it worked flawlessly.
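Roughly, the merge step looks like this - a sketch against the Lucene.Net 2.9 API, where the index paths and the exact constructor overloads are assumptions that will vary with your setup:

    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    public static class IndexMerger
    {
        // Merge several smaller indexes (built separately) into one final index.
        public static void MergeParts(string finalPath, string[] partPaths)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
            var finalDir = FSDirectory.Open(new DirectoryInfo(finalPath));
            var writer = new IndexWriter(finalDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
            try
            {
                var parts = new Lucene.Net.Store.Directory[partPaths.Length];
                for (int i = 0; i < partPaths.Length; i++)
                    parts[i] = FSDirectory.Open(new DirectoryInfo(partPaths[i]));

                // Appends the segments of the partial indexes without re-analyzing anything,
                // which is usually the cheapest way to combine them.
                writer.AddIndexesNoOptimize(parts);
                writer.Optimize(); // optional: collapse the result into fewer segments
            }
            finally
            {
                writer.Close();
            }
        }
    }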
Assuming you don't have access to a profiler (Redgate ANTS is very good), then:
Work out your bottleneck: is it the Lucene code or your data reader? Comment out the Lucene indexing code, leaving just your data reader. It should be easy to tell on which side your problem lies.
Make sure you're using Lucene.Net as built from SVN. The 2.9.x version from Subversion is much better than earlier releases, especially with regard to indexing speed.
Use the default merge factors etc. Lucene seems to be much better at this than my attempts at tweaking (a minimal indexing loop using the defaults is sketched after this list).
Lastly (and perhaps most importantly!): does it matter that indexing is slow? If you're only ever going to do this once or twice a year, I'd say don't worry about it (unless this is a learning exercise or some such).
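For reference, here is what that minimal bulk-indexing loop might look like, assuming defaults plus a modest RAM buffer; the Record type, field names, and buffer size are placeholders to adapt to your own data reader:

    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    // Hypothetical record type standing in for whatever your data reader returns.
    public class Record { public string Id; public string Title; public string Body; }

    public static class BulkIndexer
    {
        public static void BuildIndex(string indexPath, IEnumerable<Record> records)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
            var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
            var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
            try
            {
                // A somewhat larger RAM buffer usually helps more than tweaking merge factors.
                writer.SetRAMBufferSizeMB(48);

                foreach (var r in records)
                {
                    var doc = new Document();
                    doc.Add(new Field("id", r.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("title", r.Title, Field.Store.YES, Field.Index.NOT_ANALYZED));
                    // The only tokenized field, matching the question's setup.
                    doc.Add(new Field("body", r.Body, Field.Store.NO, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
                // Skip Optimize() during the bulk load; run it once at the end if you need it.
            }
            finally
            {
                writer.Close();
            }
        }
    }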
Hope this helps,
I tried to Google it but didn't find a decent tutorial with sample code.
Has anyone used typed DataSets/DataTables in C#?
Are they available in .NET 3.5 and above?
To answer the second parts of the question (not the "how to..." from the title, but the "does anyone..." and "is it...") - the answer would be a yes, but a yes with a pained expression on my face. For new code, I would strongly recommend looking at a class-based model; pick your poison between the many ORMs, micro-ORMs, and raw ADO.NET. DataTable itself does still have a use, in particular for processing and storing unpredictable data (where you have no idea what the schema is in advance). By the time you are talking about typed data-sets, I would suggest you obviously know enough about the type that this no longer applies, and an object-model is a very valid alternative.
It is still a supported part of the framework, and it is still in use as a technology. It has some nice features like the diff-set. However, most (if not all) of that is also available against an object-based design, with classes and properties (without the added overhead of the DataTable abstraction).
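To illustrate the class-based alternative, a minimal sketch; the Product type, connection string, and query are made up, and any ORM or micro-ORM would remove most of this boilerplate:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // A plain class instead of a typed DataSet/DataTable.
    public class Product
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public decimal Price { get; set; }
    }

    public static class ProductRepository
    {
        // Raw ADO.NET shown for brevity; an ORM or micro-ORM would do this mapping for you.
        public static List<Product> GetAll(string connectionString)
        {
            var products = new List<Product>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT Id, Name, Price FROM Products", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        products.Add(new Product
                        {
                            Id = reader.GetInt32(0),
                            Name = reader.GetString(1),
                            Price = reader.GetDecimal(2)
                        });
                    }
                }
            }
            return products;
        }
    }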
MSDN has guidance. It really hasn't changed since typed datasets were first introduced.
http://msdn.microsoft.com/en-us/library/esbykkzb(v=VS.100).aspx
There are tons of videos available here: http://www.learnvisualstudio.net/series/aspdotnet_2_0_data_access_and_databinding/
And I found one more tutorial here: http://www.15seconds.com/issue/031223.htm
Sparingly... unless you need to know them to maintain legacy software, learn an ORM or two instead, particularly in conjunction with LINQ.
Some of my colleagues use them; the software I work on doesn't use them at all, on account of some big-mouth developer getting his way again...
Does anyone know how to implement the algorithm for this problem using the Knapsack algorithm?
The method I'm using at present makes extensive use of LINQ, collections of collections, and a few dictionaries. For those who don't know what I'm talking about, check out The Cutting Stock Problem.
As mentioned in the link you gave, this problem is in fact an instance of an integer linear program (ILP), which is NP-hard in general.
Directly from Wikipedia, advanced algorithms for solving integer linear programs include:
cutting-plane method
branch and bound
branch and cut
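Since the question asks about the knapsack algorithm itself: below is a minimal 0/1 knapsack dynamic program in C# for reference (the names are illustrative). For cutting stock, a knapsack like this typically appears as the pricing subproblem inside a column-generation scheme rather than as a complete solution.

    public static class Knapsack
    {
        // Classic 0/1 knapsack: maximize total value without exceeding the weight capacity.
        public static int MaxValue(int[] weights, int[] values, int capacity)
        {
            // best[w] = best value achievable with total weight <= w
            var best = new int[capacity + 1];
            for (int i = 0; i < weights.Length; i++)
            {
                // Iterate capacity downwards so each item is used at most once.
                for (int w = capacity; w >= weights[i]; w--)
                {
                    int candidate = best[w - weights[i]] + values[i];
                    if (candidate > best[w])
                        best[w] = candidate;
                }
            }
            return best[capacity];
        }
    }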
I have some simple code, derived from an example, that is meant to do a quick write to the Cassandra db and then loop back and read all current entries; everything worked fine. When 0.6 came out, I upgraded Cassandra and Thrift, which threw errors in my code (www.copypastecode.com/26760/). I was able to iron out the errors by converting the necessary types, but the version that compiles now only seems to read one item back; I'm not sure whether it's not saving db changes or only reading back one entry. The "fixed" code is here: http://www.copypastecode.com/26752/. Any help would be greatly appreciated.
First of all, let me say that you should definitely use TBufferedStream instead of TSocket for the TBinaryProtocol; that will have a huge impact on your application's performance.
According to the Apache Thrift API documentation, the BATCH_INSERT method is deprecated, so it could have a misleading bug where the operation actually only inserts the first column. That said, why don't you try using BATCH_MUTATE instead?
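For reference, a rough sketch of what a BATCH_MUTATE call can look like against the Cassandra 0.6 Thrift-generated C# bindings; the keyspace name, column family, and the exact generated property names are assumptions and may differ slightly in your generated code:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using Apache.Cassandra; // Thrift-generated types; the namespace may differ in your build

    public static class CassandraWrites
    {
        // client is an already-opened Cassandra.Client (TBufferedStream + TBinaryProtocol).
        public static void InsertRow(Cassandra.Client client, string key)
        {
            // Any increasing long works as a timestamp; microseconds since the epoch is the usual convention.
            long timestamp =
                (DateTime.UtcNow - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).Ticks / 10;

            var mutations = new List<Mutation>
            {
                new Mutation
                {
                    Column_or_supercolumn = new ColumnOrSuperColumn
                    {
                        Column = new Column
                        {
                            Name = Encoding.UTF8.GetBytes("name"),
                            Value = Encoding.UTF8.GetBytes("value"),
                            Timestamp = timestamp
                        }
                    }
                }
            };

            // key -> column family -> list of mutations
            var mutationMap = new Dictionary<string, Dictionary<string, List<Mutation>>>
            {
                { key, new Dictionary<string, List<Mutation>> { { "Standard1", mutations } } }
            };

            client.batch_mutate("Keyspace1", mutationMap, ConsistencyLevel.ONE);
        }
    }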
By the way, why are you using Thrift directly? There are some nice C# clients for Cassandra that actually perform really well. You can find the whole list at http://wiki.apache.org/cassandra/ClientOptions.
I'm the author of one of them; it is kept up to date with Apache and is being used by some companies in production environments. Take a look at my homepage.
I am having a problem with one of my team members' output. He always seems to be 'busy', yet I am unable to see exactly what code he has done; he seems to be delivering very little, and it seems to take a long time to do so. I'd like to investigate further using TFS, and was wondering if there is any functionality in TFS that shows what has been written by an individual, or similar?
Just to clarify: I am NOT spying, I am trying to resolve a situation. This is only a starting point. I understand that quantity of code does not equate to being the best programmer.
Thanks for any answers.
Your best programmer may in fact write less code than your worst programmer; really good programmers often write less code. Be careful about using this information to evaluate performance. Since you are using TFS, I assume you are also using work item tracking. That is really a better way to evaluate performance than lines of code. See which check-ins cause the most problems, which fix the most defects, and how many rounds it takes for something to be truly fixed.
For me the simplest thing is to set up email alerts for check-ins. You get the check-in comment, some work item info (assuming they are associating/resolving on check-in), and the list of changed files, as they happen. That lets you see what part of the code that dev is in, and after a while you get a sense when "it's quiet. Too quiet" because someone isn't checking in. It doesn't take the place of forensics on what he did all month, but it keeps me feeling connected. It also gives me intuitive feelings like "he's in the reports, so I'll be able to show those to the user earlier in the cycle", or "jeez, he's doing all the stupid typos in error messages and other no-thinking things, and not tackling his real hard stuff", or even "he's doing his pri 2 stuff while he has a large pile of pri 1". All of these enable a 30-second hallway conversation to deliver a course correction as close in time to the problem as possible.
See the following blog post I put together a while ago:
Getting Started with the TFS Data Warehouse
This one talks you through getting code churn for each area of your codebase, but it would be easy to add team members into that as well to get a breakdown by the team member who did the check-in.
But I agree with the caveat in your question - this is not a good way to check on the productivity of your colleague. Instead, I would talk with them to raise your concerns.
I'm away from TFS right now, but you can view a list of check-ins by user in Team Explorer, and for each of these you can see the files that were changed and look at the diffs.
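If you'd rather pull the same list programmatically, here is a rough sketch against the TFS 2008-era client API; the server URL, server path, and user name are placeholders:

    using System;
    using Microsoft.TeamFoundation.Client;
    using Microsoft.TeamFoundation.VersionControl.Client;

    public class CheckinReport
    {
        public static void Main()
        {
            // Placeholder server and path - adjust for your environment.
            var tfs = TeamFoundationServerFactory.GetServer("http://tfsserver:8080");
            var vcs = (VersionControlServer)tfs.GetService(typeof(VersionControlServer));

            // All check-ins by one user under a given path, with the changed files included.
            var history = vcs.QueryHistory(
                "$/MyProject",        // server path
                VersionSpec.Latest,   // version
                0,                    // deletion id
                RecursionType.Full,   // recurse into subfolders
                @"DOMAIN\someuser",   // filter by committer
                null, null,           // from/to version (null = no limit)
                int.MaxValue,         // max count
                true,                 // include individual file changes
                false);               // slot mode

            foreach (Changeset cs in history)
            {
                Console.WriteLine("{0} {1:d} {2}", cs.ChangesetId, cs.CreationDate, cs.Comment);
                foreach (Change change in cs.Changes)
                    Console.WriteLine("    {0} {1}", change.ChangeType, change.Item.ServerItem);
            }
        }
    }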
You can get this from the TFS cube, if you have it set up. There are a large number of dimensions within Code Churn. Some of this is also available in the TfsWarehouse database.
If you do have the cube set up, just point Excel at it and have some fun playing around. Keep in mind, though, that the numbers can point you in the wrong direction. Use discretion.
Does anyone know of a "similar words or keywords" algorithm available in open source or via an API? I am looking for something sort of like a thesaurus but smarter.
So for example:
intel
returns:
processor,
i7 core chip,
quad core chip,
.. etc
Any ideas or even something to point me in the right direction in C#?
Edit:
I would love to hear your thoughts, but why can't we just use the Google AdWords API to generate keywords relevant to those entered?
Why not send a search query out to Google and parse what it returns?
Also, check out Google Sets.
There is no off-the-shelf algorithm for such a thing. You are going to have to acquire thesaurus data and load it into a data structure; then it is a simple dictionary lookup (you can use the C# Dictionary class for that). Maybe you can look at WordNet or the Moby Thesaurus as a source of data. Another option is using a thesaurus server and getting the information online as needed.
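A minimal sketch of the lookup side, assuming you have already exported the thesaurus data to a simple text file; the file format used here (headword, then its related terms) is made up:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public class RelatedWords
    {
        private readonly Dictionary<string, List<string>> _index =
            new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

        // Each line of the data file: headword|related1,related2,related3
        public RelatedWords(string dataFile)
        {
            foreach (var line in File.ReadAllLines(dataFile))
            {
                var parts = line.Split('|');
                if (parts.Length != 2) continue;
                _index[parts[0].Trim()] = parts[1]
                    .Split(',')
                    .Select(w => w.Trim())
                    .Where(w => w.Length > 0)
                    .ToList();
            }
        }

        public IList<string> Lookup(string word)
        {
            List<string> related;
            return _index.TryGetValue(word, out related) ? related : new List<string>();
        }
    }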
You will need a large database containing this information. The rest is simple - look up the input and see what related words are stored.
The hard part is generating the database. Doing it manually might take years if you want to cover a large number of words and topics.
Generating it is surely non-trivial. You could try to download web pages and analyze which words frequently appear together, but I assume this would still take months to build, tune, and finally gather good-quality data. Extracting links from Wikipedia might be a good source of information because of its semi-structured nature.
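If you do go the route of analyzing which words appear together, the core of it is just co-occurrence counting over a token window; a toy sketch, where the window size and the tokenization are arbitrary choices:

    using System;
    using System.Collections.Generic;

    public static class CoOccurrence
    {
        // Counts how often each pair of words appears within 'window' tokens of each other.
        public static Dictionary<string, Dictionary<string, int>> Count(IEnumerable<string> documents, int window)
        {
            var counts = new Dictionary<string, Dictionary<string, int>>(StringComparer.OrdinalIgnoreCase);
            foreach (var doc in documents)
            {
                var tokens = doc.ToLowerInvariant().Split(
                    new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' },
                    StringSplitOptions.RemoveEmptyEntries);

                for (int i = 0; i < tokens.Length; i++)
                {
                    for (int j = i + 1; j < tokens.Length && j <= i + window; j++)
                    {
                        Add(counts, tokens[i], tokens[j]);
                        Add(counts, tokens[j], tokens[i]);
                    }
                }
            }
            return counts;
        }

        private static void Add(Dictionary<string, Dictionary<string, int>> counts, string a, string b)
        {
            Dictionary<string, int> inner;
            if (!counts.TryGetValue(a, out inner))
                counts[a] = inner = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

            int n;
            inner.TryGetValue(b, out n);
            inner[b] = n + 1;
        }
    }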
I've made the OpenOffice thesaurus functions available for .NET in the NHunspell project. You can use the OpenOffice thesaurus files.
Here is the NHunspell Project