Lucene.NET - Can't delete docs using IndexWriter - c#

I'm taking over a project, so I'm still learning this. The project uses Lucene.NET, and I have no idea whether this piece of functionality is correct. I am instantiating the writer like this:
var writer = new IndexWriter(directory, analyzer, false);
For specific documents, I'm calling:
writer.DeleteDocuments(new Term(...));
In the end, I'm calling the usual writer.Optimize(), writer.Commit(), and writer.Close().
The field in the Term object is a Guid, converted to a string (.ToString("D")), and is stored in the document, using Field.Store.YES, and Field.Index.NO.
However, with these settings, I cannot seem to delete these documents. The goal is to delete, then add the updated versions, so I'm getting duplicates of the same document. I can provide more code/explanation if needed. Any ideas? Thanks.

The field must be indexed. If a field is not indexed, its terms will not show up in enumeration.
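As a minimal sketch of what that means in practice, assuming the Lucene.NET 3.0.3 API (the question does not state a version) and a hypothetical field name "id": the GUID is indexed with Field.Index.NOT_ANALYZED so the whole string becomes a single term that DeleteDocuments can match.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class DeleteByTermDemo
{
    // Adds two documents, deletes one by its id term, and returns how many remain.
    public static int Run()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        string idToDelete = Guid.NewGuid().ToString("D");

        using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(MakeDoc(idToDelete));
            writer.AddDocument(MakeDoc(Guid.NewGuid().ToString("D")));

            // The term text must equal the indexed value exactly.
            writer.DeleteDocuments(new Term("id", idToDelete));
            writer.Commit();
        }

        // Count live documents with a fresh read-only reader.
        using (var reader = IndexReader.Open(directory, true))
        {
            return reader.NumDocs();
        }
    }

    // "id" is a hypothetical field name used only for this illustration.
    private static Document MakeDoc(string id)
    {
        var doc = new Document();
        // NOT_ANALYZED indexes the value as one term; Field.Index.NO would store it but index nothing.
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}
```

With Field.Index.NO in MakeDoc instead, the delete would be expected to match nothing, which is the duplicate-documents symptom described in the question.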

I don't think there is anything wrong with how you are handling the writer.
It sounds as if the term you are passing to DeleteDocuments is not returning any documents. Have you tried to do a query using the same term to see if it returns any results?
Also, if your goal is simply to recreate the document, you can call UpdateDocument, which is documented as follows:
// Updates a document by first deleting the document(s) containing term and
// then adding the new document. The delete and then add are atomic as seen
// by a reader on the same index (flush may happen only after the add). NOTE:
// if this method hits an OutOfMemoryError you should immediately close the
// writer. See above for details.
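A hedged sketch of that call, again assuming Lucene.NET 3.0.3; the field names "id" and "body" are made up for the example. The key field still has to be indexed for the delete half of the update to find the old document.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class UpdateDocumentDemo
{
    // Adds a document, replaces it through UpdateDocument, and returns the live document count.
    public static int Run()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        string id = Guid.NewGuid().ToString("D");

        using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(MakeDoc(id, "first version"));

            // Deletes every document containing the term, then adds the new one.
            writer.UpdateDocument(new Term("id", id), MakeDoc(id, "second version"));
            writer.Commit();
        }

        using (var reader = IndexReader.Open(directory, true))
        {
            return reader.NumDocs();
        }
    }

    // Hypothetical document layout: an exact-match key plus some analyzed text.
    private static Document MakeDoc(string id, string body)
    {
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}
```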
You may also want to check out SimpleLucene (http://simplelucene.codeplex.com) - it makes it a bit easier to do basic Lucene tasks.
[Update]
Not sure how I missed it, but @Shashikant Kore is correct: you need to make sure the field is indexed, otherwise your term query will not return anything.

Related

Lucene index field not searchable

So I have a field in my Lucene index documents named "Field1" (for all intents and purposes).
When I open Luke, and browse the documents, I see most of the documents have this field. However when I switch to the search tab, and I input Field1:parameterValue I get zero search results.
When doing the indexing, for the document, I have
doc.Add(new Field("Field1", field1, Field.Store.YES, Field.Index.ANALYZED));
Why is my field not able to be searched? As an aside, I can't find any documentation on Luke that explains what the "IdfpTSVopNLB#" column is in the document record either. I'm thinking this information could possibly be useful, so for one of the records that has this field, the column value is IdfpTS---N--- and the "Norm" column is 4.0
The "IdfpTSVopNLB#" field is a collection of flags. You should see a key to it in Luke:
I would guess your searches are failing because you aren't taking your analysis into account. For instance, for your sample query, Field1:parameterValue, if the field is analyzed by StandardAnalyzer (and the query is not analyzed, or is keyword-analyzed), you'll get no results. This is because "parameterValue" would have been lowercased by the analyzer at index time, so the actual searchable term is "parametervalue" instead.
In the search tab, you should see a place to select an analyzer for Luke to use for query parsing. If you use the same analyzer you used to index the data, you may see better results.
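The same idea applies in code. The sketch below assumes Lucene.NET 3.0.3 and parses the query with the analyzer that was used for indexing, so the query term is normalized the same way as the indexed terms. The class and method names are invented for the example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

public static class AnalyzedQueryDemo
{
    // Parses the query text with StandardAnalyzer and returns the resulting query as a string.
    public static string ParseWithStandardAnalyzer(string queryText)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // "Field1" is the default field, used when the query text names no field.
        var parser = new QueryParser(Version.LUCENE_30, "Field1", analyzer);
        Query query = parser.Parse(queryText);
        return query.ToString();
    }
}
```

Calling AnalyzedQueryDemo.ParseWithStandardAnalyzer("Field1:parameterValue") should yield a query on the lowercased term, which is what an index built with StandardAnalyzer actually contains.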
As it turns out, this is the correct way to do this. I just needed to delete the entire index and rebuild it from scratch to get the new values in. It didn't automatically update the existing indexes.

Saving custom settings or attributes in a Word document

I've got a MS Word project where I'm building a number of Panes for users to complete some info which automatically populates text at bookmarks throughout the document. I'm just trying to find the best way of saving these values somehow that I can retrieve them easily when re-opening the document after users have typed in their values.
I could just try to retrieve them from the bookmarks themselves, but in many cases they contain text values, whereas I'd ideally store a primary key somewhere that's not visible to the user. That would also cover the case where they change the text, which would make reverse engineering the values impossible.
I can't seem to find any information on saving custom attributes in a Word document, so would really appreciate some general guidance of how this might be achieved.
Thanks a lot!
I would suggest using custom document properties. There you can store strings as key-value pairs (at least if it works the same way as in Excel).
I found a thread which explains how to do it:
Set custom document properties with Word interop
After playing around with this a fair bit this is my final code in case it helps someone else, I've found this format easier to understand and work with. It's all based on the referenced article by Christian:
using Office = Microsoft.Office.Core;
using Word = Microsoft.Office.Interop.Word;
using System.Linq; // needed for Cast<T>() and Any()
Office.DocumentProperties properties = (Office.DocumentProperties)Globals.ThisDocument.CustomDocumentProperties;
// Check whether the property already exists
if (!properties.Cast<Office.DocumentProperty>().Any(c => c.Name == "nameofproperty"))
{
    // It does not, so add the property and its value
    properties.Add("nameofproperty", false, Office.MsoDocProperties.msoPropertyTypeString, "yourvalue");
}
else
{
    // Otherwise just update the value
    properties["nameofproperty"].Value = "yourvalue";
}
Retrieving the value is just as easy: use the same lines at the top to get the properties object, optionally reuse the check from the if statement to see whether the property exists, and then read it with properties["nameofproperty"].Value

How can you remove a field from a word document?

I'm working on a project where the user can insert data into a document using fields, document properties and variables. The user also needs to be able to remove the data from the document. So far, I've managed to remove the document property and variable, but I'm not sure how I would go about removing the field (that's already inserted into the document). Note that I need to compare the field to a string and, if it matches, delete it from the doc.
I'm assuming you're using .NET Interop with Word. In that case, I believe you're looking for Field.Delete.
This is of course also assuming you know how to get the field you're looking for, which would usually be enumerating through _Document.Fields (or a more finite range if you know one) until you get the right one.
The Field has a Delete method. See the documentation for Field.Delete.
So I think something like this would work:
foreach (Field f in ActiveDocument.Fields)
{
    // TypeYouWantToDelete is a placeholder for the field type you are looking for
    if (f.Type == TypeYouWantToDelete)
    {
        f.Delete();
    }
}

Lucene.net intermittently indexed documents not appearing in the

I've got an issue which shows up intermittently in my unit tests and I can't work out why.
The unit test itself is adding multiple documents to an index, then trying to query the index to get the documents back out again.
So 95% of the time it works without any problems. Then the other 5% of the time it cannot retrieve the documents back out of the index.
My unit test code is as follows:
[Test]
public void InsertMultipleDocuments()
{
    string indexPath = null;
    using (LuceneHelper target = GetLuceneHelper(ref indexPath))
    {
        target.InsertOrUpdate(
            target.MakeDocument(GetDefaultSearchDocument()),
            target.MakeDocument(GetSecondSearchDocument()));
        var doc = target.GetDocument(_documentID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _documentID.ToString());
        doc = target.GetDocument(_document2ID.ToString()).FirstOrDefault();
        Assert.IsNotNull(doc);
        Assert.AreEqual(doc.DocumentID, _document2ID.ToString());
    }
    TidyUpTempFolder(indexPath);
}
I won't post the full code from my LuceneHelper, but the basic idea is that it holds a reference to an IndexSearcher, which is closed every time an item is written to the index (so it can be re-opened with all of the latest documents).
The actual unit test will often fail when gathering the second document. I assumed it was to do with the searcher not being closed and seeing cached data, however this isn't the case.
Does Lucene have any delay in adding documents to the index? I assumed that once it had added the document to the index it was available immediately as long as you closed any old search indexers and opened a new one.
Any ideas?
How do you close the IndexWriter you use for updating the index? The close method has an overload that takes a single boolean parameter specifying whether or not you want to wait for merges to complete. The default merge scheduler runs the merges in a separate thread and that might cause your problems.
Try closing the writer like this:
indexWriter.Close(true);
More information can be found in the Lucene.NET documentation.
Btw, which version of Lucene.NET are you using?

Performance issue with accessing Microsoft.Office.Core.DocumentProperties

I have a Excel COM addin which reads the CustomDocumentProperties section of a workbook.
This is how I access a particular entry from the CustomDocumentProperties section
DocumentProperties docProperties = (DocumentProperties)
xlWorkbook.CustomDocumentProperties;
docProperty = docProperties[propName];
The problem is that when CustomDocumentProperties contains more than 8000 entries, the performance of this code is really bad. I ran a CPU profiler, and it showed that the following line takes more than a minute:
docProperty = docProperties[propName];
Does anyone know how to improve the performance of accessing DocumentProperties?
Thanks!
I doubt that there is anything that you could do to improve the performance of the document properties. I believe that it is implemented as a simple list -- not as a dictionary or hash table. In fact, I don't believe that the list is sorted, so with 8000 entries, on average half of them, or 4000, would have to be accessed in order to find the property that you are looking for.
You might consider not using the CustomDocumentProperties as a dictionary. Instead, you might try putting all 8000 of your entries into a custom dictionary, serializing it, and then adding the entire serialized dictionary to the CustomDocumentProperties as a single entry. So to use it, you would access the CustomDocumentProperties, deserialize the dictionary, and then use it repeatedly. When done, if there were any changes to the dictionary, you would have to re-serialize it and save it back to the CustomDocumentProperties, which you would probably only want to do once -- for example, just before saving your workbook. (You might want to put code to re-serialize and save your custom dictionary to the CustomDocumentProperties within the Workbook.BeforeSave event.)
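The serialization round trip could be sketched as below, using DataContractSerializer from the .NET base class library. The PropertyBag class and its method names are hypothetical, and the sketch deliberately stops short of writing the string into CustomDocumentProperties. As far as I know, string property values are capped in length (commonly cited as 255 characters), so a large serialized blob may need to be split across several properties or kept in another store; check that before relying on a single entry.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using System.Xml;

public static class PropertyBag
{
    // Turns the whole dictionary into one XML string.
    public static string Serialize(Dictionary<string, string> values)
    {
        var serializer = new DataContractSerializer(typeof(Dictionary<string, string>));
        var builder = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(builder))
        {
            serializer.WriteObject(writer, values);
        }
        return builder.ToString();
    }

    // Rebuilds the dictionary from a string produced by Serialize.
    public static Dictionary<string, string> Deserialize(string xml)
    {
        var serializer = new DataContractSerializer(typeof(Dictionary<string, string>));
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            return (Dictionary<string, string>)serializer.ReadObject(reader);
        }
    }
}
```

The add-in would deserialize once when the workbook opens, do all lookups against the in-memory dictionary (hash lookups rather than a linear scan), and serialize again in Workbook.BeforeSave only if something changed.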
