Lucene.net intermittently indexed documents not appearing in the index - c#

I've got an issue which shows up intermittently in my unit tests, and I can't work out why.
The unit test itself is adding multiple documents to an index, then trying to query the index to get the documents back out again.
About 95% of the time it works without any problems; the other 5% of the time it cannot retrieve the documents back out of the index.
My unit test code is as follows:
[Test]
public void InsertMultipleDocuments()
{
    string indexPath = null;
    using (LuceneHelper target = GetLuceneHelper(ref indexPath))
    {
        target.InsertOrUpdate(
            target.MakeDocument(GetDefaultSearchDocument()),
            target.MakeDocument(GetSecondSearchDocument()));

        var doc = target.GetDocument(_documentID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _documentID.ToString());

        doc = target.GetDocument(_document2ID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _document2ID.ToString());
    }
    TidyUpTempFolder(indexPath);
}
I won't post the full code from my LuceneHelper, but the basic idea is that it holds a reference to an IndexSearcher, which is closed every time an item is written to the index (so it can be re-opened again with all of the latest documents).
The unit test will often fail when retrieving the second document. I assumed it was to do with the searcher not being closed and seeing cached data; however, this isn't the case.
Does Lucene have any delay in adding documents to the index? I assumed that once it had added a document to the index, it was available immediately, as long as you closed any old index searchers and opened a new one.
Any ideas?

How do you close the IndexWriter you use for updating the index? The Close method has an overload that takes a single boolean parameter specifying whether or not you want to wait for merges to complete. The default merge scheduler runs the merges in a separate thread, and that might be causing your problem.
Try closing the writer like this:
indexWriter.Close(true);
More information can be found in the Lucene.NET documentation.
Btw, which version of Lucene.NET are you using?
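For reference, here is a minimal sketch of the write-then-reopen pattern being described. It is only an illustration: the _directory, _analyzer and _searcher fields and the InsertOrUpdate shape are assumptions about what a helper like this might look like, not the asker's actual LuceneHelper, and it assumes an older (2.x-style) Lucene.NET API matching the constructor shapes used elsewhere on this page.
// Sketch only: add the documents, close the writer waiting for merges,
// then reopen the searcher so the new documents are visible to queries.
public void InsertOrUpdate(params Document[] docs)
{
    var writer = new IndexWriter(_directory, _analyzer, false); // false = append to the existing index
    foreach (var doc in docs)
    {
        writer.AddDocument(doc);
    }
    writer.Close(true); // true = block until background merges have finished

    // Only now discard the old searcher and open a fresh one.
    if (_searcher != null)
    {
        _searcher.Close();
    }
    _searcher = new IndexSearcher(_directory, true); // read-only searcher over the updated index
}
That is the scenario the answer above suspects: reopening the searcher before Close(true) returns could race with work still happening on the merge thread.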

Related

Prevent insertion of duplicate documents into Lotus notes database

I have a c# web api hosted in iis which has a post method that takes a list of document ids to insert into a lotus notes database.
The post method can be called multiple times and I want to prevent insertion of duplicate documents.
This is the code (in a static class) that is called from the post method:
lock (thisLock)
{
    var id = "some unique id";
    doc = vw.GetDocumentByKey(id, false);
    if (doc == null)
    {
        NotesDocument docNew = db.CreateDocument();
        //some more processing
        docNew.Save(true, false, false);
    }
}
Even with the lock in place, I am running into scenarios where duplicate documents are inserted. Is it because a request can be executed on a new process? What is the best way to prevent this from happening?
Your problem is that GetDocumentByKey depends on the view index being up to date, and on a busy server there is no guarantee of that. You can try calling vw.Update, but unfortunately this does not trigger an update of the view index, so it might have no effect (it just updates the vw object to reflect what has changed in the backend; if the backend did not update, it does nothing).
You could use db.Search('IdField ="' & id & '"', Nothing, 0) instead, as a database search does not rely on the view index being rebuilt. It will be slightly slower, but should be far more accurate.
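In C# interop terms that suggestion might look roughly like the sketch below. The IdField name is the answer's placeholder, and the exact Search signature should be checked against your Domino interop version.
// Rough sketch of the db.Search approach; "IdField" is a placeholder item name.
string formula = "IdField = \"" + id + "\"";
NotesDocumentCollection matches = db.Search(formula, null, 0); // null cutoff date, 0 = no limit on hits
if (matches.Count == 0)
{
    NotesDocument docNew = db.CreateDocument();
    // ...populate items as before...
    docNew.Save(true, false, false);
}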
You might want to store the inserted ids in some singleton object, or even simply in a static list, and lock on that list: whoever obtains the lock verifies that the ids it wants to insert are not already present and then adds them to the list itself.
You only need to keep them for a short time, just long enough that two concurrent posts with the same content don't both insert and the normal view index gets a chance to update. So store a timestamp along with each id, so you can clean out older records if the list grows long.
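A minimal sketch of that idea is below. The names are hypothetical, and a dictionary keyed by id (with the timestamp as the value) is used in place of a plain list purely for convenience.
// Requires System, System.Collections.Generic and System.Linq.
// Hypothetical guard for "ids inserted recently", as described above.
public static class RecentInserts
{
    private static readonly object _lock = new object();
    private static readonly Dictionary<string, DateTime> _seen = new Dictionary<string, DateTime>();
    private static readonly TimeSpan _retention = TimeSpan.FromMinutes(5); // assumption: long enough for the view index to catch up

    // Returns true if the id was not seen recently and has now been claimed by this caller.
    public static bool TryClaim(string id)
    {
        lock (_lock)
        {
            // Drop entries older than the retention window so the collection stays small.
            var expired = _seen.Where(kv => DateTime.UtcNow - kv.Value > _retention)
                               .Select(kv => kv.Key)
                               .ToList();
            foreach (var key in expired)
            {
                _seen.Remove(key);
            }

            if (_seen.ContainsKey(id))
            {
                return false; // another request already handled this id
            }

            _seen[id] = DateTime.UtcNow;
            return true;
        }
    }
}
The post handler would then only call GetDocumentByKey/CreateDocument when TryClaim(id) returns true. Note that a static collection only helps within a single worker process; if the application runs in multiple processes or gets recycled, it will not deduplicate across them, which ties back to the question's point about requests executing on a new process.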

Nested Parallel.For() loops and file creation problems

I've been investigating TPL as means of quickly generating a large volume of files - I have about 10 million rows in a database, events which belong to patients, which I want to output into their own text file, in the location d:\EVENTS\PATIENTID\EVENTID.txt
I'm using two nested Parallel.ForEach loops: the outer one retrieves a list of patients, and the inner one retrieves the events for a patient and writes them to files.
This is the code I'm using; it's pretty rough at the moment, as I'm just trying to get things working.
DataSet1TableAdapters.GetPatientsTableAdapter ta = new DataSet1TableAdapters.GetPatientsTableAdapter();
List<DataSet1.GetPatientsRow> Pats = ta.GetData().ToList();
List<DataSet1.GetPatientEventsRow> events = null;
string patientDir = null;
System.IO.DirectoryInfo di = new DirectoryInfo(txtAllEventsPath.Text);
di.GetDirectories().AsParallel().ForAll((f) => f.Delete(true));
//get at the patients
Parallel.ForEach(Pats
    , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
    , patient =>
    {
        patientDir = "D:\\Events\\" + patient.patientID.ToString();
        //Output directory
        Directory.CreateDirectory(patientDir);
        events = new DataSet1TableAdapters.GetPatientEventsTableAdapter().GetData(patient.patientID).ToList();
        if (Directory.Exists(patientDir))
        {
            Parallel.ForEach(events.AsEnumerable()
                , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
                , ev =>
                {
                    List<DataSet1.GetAllEventRow> anEvent =
                        new DataSet1TableAdapters.GetAllEventTableAdapter();
                    File.WriteAllText(patientDir + "\\" + ev.EventID.ToString() + ".txt", ev.EventData);
                });
        }
    });
The code I have produced works very quickly but produces an error after a few seconds (by which time about 6,000 files have been created). The error is one of two types:
DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.
Whenever this error is produced, the directory structure D:\Events\PATIENTID\ exists, as other files have been created within that directory. An if condition checks for the existence of D:\Events\PATIENTID\ before the second loop is entered.
The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.
When this error occurs, the indicated file sometimes exists and sometimes doesn't.
So, can anyone offer any advice as to why these errors are being produced? I don't understand either of them, and as far as I can see, it should just work (and indeed does, for a short while).
From MSDN:
Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.
Parallel.For can speed up the processing of your rows through multi-threading, but it comes with a caveat: if it is not used correctly, it ends in unexpected program behaviour like the one you are seeing above.
The reason for the following error:
DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.
can be that one thread goes to write while the directory is not there yet, and meanwhile another thread is creating it. When doing parallelism there can be race conditions, because we are multi-threading, and if we don't use proper mechanisms like locks or monitors then we end up with these kinds of issues.
Since you are writing files, multiple threads trying to write to the same file end up with the latter error, i.e.
The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.
because one thread is already writing to the file, so at that moment other threads fail to get write access to it.
I would suggest using a normal loop instead of parallelism here.
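For what it's worth, one concrete piece of shared state in the posted code is that patientDir and events are declared once, outside the loops, and are then overwritten by every thread, which could explain both errors. Whichever approach you choose, keeping that state local to each iteration removes that particular race. A rough sketch, reusing the question's table adapters:
// Sketch only: per-iteration state instead of shared outer variables.
Parallel.ForEach(Pats, new ParallelOptions { MaxDegreeOfParallelism = 8 }, patient =>
{
    // Local to this iteration, so other threads cannot overwrite it mid-write.
    string patientDir = Path.Combine(@"D:\Events", patient.patientID.ToString());
    Directory.CreateDirectory(patientDir);

    var events = new DataSet1TableAdapters.GetPatientEventsTableAdapter()
        .GetData(patient.patientID)
        .ToList();

    // Inner loop kept sequential for simplicity.
    foreach (var ev in events)
    {
        File.WriteAllText(Path.Combine(patientDir, ev.EventID.ToString() + ".txt"), ev.EventData);
    }
});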

RavenDB indices chains

Is it possible to use an output of one index as an input for another?
Something like:
public class ChainedIndex: AbstractIndexCreationTask<InputIndex, InputIndexOutputType, ReduceResult>
{
//blahblahblah
}
Yes. You can now do this.
Enable the Scripted Index Results bundle
Write your first index, for example - a map/reduce index.
Write a script that writes the result back to another document.
Write a new index against those documents.
As changes to the original documents are indexed, the resulting changes get written to new documents, which then get indexed. Repeat if desired, just be careful not to create an endless loop.
This is a new feature for RavenDB 2.5. Oren describes it in this video at 21:36.
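As a very rough sketch of steps 2 and 4, the index shapes might look like the code below. The Order and CustomerTotal document types, the field names, and the output document naming are all invented for illustration, and the bundle's script configuration (step 3) is omitted.
// Requires the RavenDB client (Raven.Client.Indexes) and System.Linq.
public class Order
{
    public string Id { get; set; }
    public string CustomerId { get; set; }
    public decimal Amount { get; set; }
}

// Hypothetical document type that the scripted index results would write out.
public class CustomerTotal
{
    public string Id { get; set; }
    public string CustomerId { get; set; }
    public decimal Total { get; set; }
}

// Step 2: a map/reduce index over the source documents.
public class Orders_TotalsByCustomer : AbstractIndexCreationTask<Order, Orders_TotalsByCustomer.Result>
{
    public class Result
    {
        public string CustomerId { get; set; }
        public decimal Total { get; set; }
    }

    public Orders_TotalsByCustomer()
    {
        Map = orders => from order in orders
                        select new { order.CustomerId, Total = order.Amount };

        Reduce = results => from result in results
                            group result by result.CustomerId into g
                            select new { CustomerId = g.Key, Total = g.Sum(x => x.Total) };
    }
}

// Step 4: an ordinary index over the documents the script wrote (e.g. CustomerTotals/{CustomerId}).
public class CustomerTotals_ByTotal : AbstractIndexCreationTask<CustomerTotal>
{
    public CustomerTotals_ByTotal()
    {
        Map = totals => from total in totals
                        select new { total.Total };
    }
}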

Lucene.NET - Can't delete docs using IndexWriter

I'm taking over a project, so I'm still learning this. The project uses Lucene.NET. I also have no idea if this piece of functionality is correct or not. Anyway, I am instantiating:
var writer = new IndexWriter(directory, analyzer, false);
For specific documents, I'm calling:
writer.DeleteDocuments(new Term(...));
In the end, I'm calling the usual writer.Optimize(), writer.Commit(), and writer.Close().
The field in the Term object is a Guid, converted to a string (.ToString("D")), and is stored in the document, using Field.Store.YES, and Field.Index.NO.
However, with these settings, I cannot seem to delete these documents. The goal is to delete and then add the updated versions, but since the delete isn't working I'm getting duplicates of the same document. I can provide more code/explanation if needed. Any ideas? Thanks.
The field must be indexed. If a field is not indexed, its terms will not show up in enumeration.
I don't think there is anything wrong with how you are handling the writer.
It sounds as if the term you are passing to DeleteDocuments is not returning any documents. Have you tried to do a query using the same term to see if it returns any results?
Also, if your goal is simply to recreate the document, you can call UpdateDocument:
// Updates a document by first deleting the document(s) containing term and
// then adding the new document. The delete and then add are atomic as seen
// by a reader on the same index (flush may happen only after the add). NOTE:
// if this method hits an OutOfMemoryError you should immediately close the
// writer. See above for details.
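To make that concrete, here is a hedged sketch of indexing the key so a Term can match it, and then replacing the document in one call. The "DocumentID" field name, the documentGuid variable and the surrounding writer are assumptions, using a Lucene.NET 2.9+/3.x-style API.
// The key field must be indexed; NOT_ANALYZED keeps the Guid string as a single, exact term.
var doc = new Document();
string id = documentGuid.ToString("D"); // documentGuid is a hypothetical Guid
doc.Add(new Field("DocumentID", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
// ...add the rest of the document's fields here...

// Deletes any document(s) whose DocumentID term matches, then adds this one.
writer.UpdateDocument(new Term("DocumentID", id), doc);
writer.Commit();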
You may also want to check out SimpleLucene (http://simplelucene.codeplex.com) - it makes it a bit easier to do basic Lucene tasks.
[Update]
Not sure how I missed it, but @Shashikant Kore is correct: you need to make sure the field is indexed, otherwise your term query will not return anything.

Why does entity framework + mysql provider enumeration return partial results with no exceptions

I'm trying to make sense of a situation I have using entity framework on .net 3.5 sp1 + MySQL 6.1.2.0 as the provider. It involves the following code:
Response.Write("Products: " + plist.Count() + "<br />");
var total = 0;
foreach (var p in plist)
{
//... some actions
total++;
//... other actions
}
Response.Write("Total Products Checked: " + total + "<br />");
Basically, the total number of products varies on each run, and it doesn't match the full total in plist. It varies widely, from about a fifth to half.
There isn't any control flow code inside the foreach, i.e. no break, continue, try/catch, or conditions around total++; nothing that could affect the count. As confirmation, there are other totals captured inside the loop related to the actions, and those match the lower and higher total runs.
I can't find any explanation for this, other than something in entity framework or the mysql provider that causes it to end the foreach early when retrieving an item.
The body of the foreach can vary quite a bit in execution time, as the actions involve file and network access. My best guess is that when the .net code takes beyond a certain threshold, some kind of timeout occurs in the underlying framework/provider and, instead of causing an exception, it silently reports that there are no more items to enumerate.
Can anyone give some light in the above scenario and/or confirm if the entity framework/mysql provider has the above behavior?
Update 1: I can't reproduce the behavior by using Thread.Sleep in a simple foreach in a test project, not sure where else to look for this weird behavior :(.
Update 2: in the example above, .Count() always returns the same, correct number of items. Using ToList or ToArray as suggested gets around the issue as expected (there are no flow control statements in the foreach body), and both counts match and don't vary between runs.
What I'm interested in is what causes this behavior in entity framework + mysql. I would really prefer not to change the code in every project that uses entity framework + mysql to call .ToArray before enumerating the results, because I don't know when it will swallow some results; or, if I do have to, I'd at least like to know what happened and why.
If the problem is related to the provider or whatever, then you can solve/identify that by realising the enumerable before you iterate over it:
var realisedList = plist.ToArray();
foreach(var p in realisedList)
{
//as per your example
}
If, after doing this, the problem still persists then
a) One of the actions in the enumerator is causing an exception that is getting swallowed somewhere
b) The underlying data really is different every time.
UPDATE: (as per your comment)
[deleted - multiple enumerations stuff as per your comment]
At the end of the day - I'd be putting the ToArray() call in to have the problem fixed in this case (if the Count() method is required to get a total, then just change it to .Length on the array that's constructed).
Perhaps MySql is killing the connection while you're enumerating, and doesn't throw an error to EF when the next MoveNext() is called. EF then just dutifully responds by saying that the enumerable is simply finished. If so, until such a bug in the provider is fixed, the ToArray() is the way forward.
I think actually that you hit on the answer in your question, but it may be the data that is causing the problem not the timeout. Here is the theory:
One (or several) row(s) in the result set has some data that causes an exception / problem, when it hits that row the system thinks that it has reached the last row.
To test this you could try:
Order the data and see if the number returned in the foreach statement is the same each time.
Select only the id column and see if the problem goes away.
Remove all rows from the table, then add them back a few at a time to see if a specific row is causing the problem.
If it is a timeout problem, have you tried changing the timeout in the connection string?
I believe it has to do with the way EF handles lazy loading. You might have to use either Load() or Include(), and also check the IsLoaded property within your processing loop. Check out these two links for more information:
http://www.singingeels.com/Articles/Entity_Framework_and_Lazy_Loading.aspx
http://blogs.msdn.com/jkowalski/archive/2008/05/12/transparent-lazy-loading-for-entity-framework-part-1.aspx
I apologize I don't know more about EF to be more specific. Hopefully the links will provide enough info to get you started and others can chime in with any questions you might have.
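For example, eager loading a related navigation property up front (the Products entity set and Category property are placeholders, using EF 3.5's string-based Include) looks roughly like this:
// Hypothetical sketch: fetch the related data in the initial query
// instead of triggering extra loads while enumerating.
var plist = context.Products.Include("Category").ToList();
foreach (var p in plist)
{
    // p.Category is already populated here; no additional round-trips mid-enumeration.
}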
The issue, cause and workaround are described exactly in this mysql bug.
As suspected, it is a timeout-related error in the provider, but it's not the regular timeout, i.e. net_write_timeout. That's why the simple reproduction in a test project didn't work: the timeout relates to all the cycles of the foreach, not just a particularly long body between the reads of two rows.
As of now, the issue is present in the latest version of the MySql provider and, under normal conditions, only affects scenarios where rows are being read over a connection kept open for a long time (which may or may not involve a slow query). This is great, because it doesn't affect all of the previous projects where I have used MySql, and applying the workaround to the sources also means it doesn't fail silently.
Ps. a couple of what seem to be related mysql bugs: 1, 2
