Lucene query documents where name matches against collection - c#

I have Lucene documents with the structure below:
{
name : "A",
id :1
},
{
name : "B",
id :1
},
{
name : "C",
id :3
}
Now I have a collection, a List<string> that contains A and B. I want to select the documents whose name is A or B, so for the documents above I should get documents A and B. I want to fetch both with a single Lucene call instead of one call per document.
I tried a BooleanQuery, adding my search queries in a loop, but the query did not return anything. If I query Lucene for a single name, it works and returns a single document.
Could anyone please suggest how I can retrieve all matching documents with a single query?
I tried something like this:
List<string> terms = new List<string>() { "A", "B" };
var mainQuery = new BooleanQuery();
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "name", analyzer);
foreach (var term in terms)
{
var query = parser.Parse(term);
mainQuery.Add(query, Occur.MUST_NOT);
}
var hits = _searcher.Search(mainQuery, 1000);
The query above did not work and returned 0 results.

I was able to resolve this on my own. It's just a simple OR clause, which is Occur.SHOULD:
var booleanQuery = new BooleanQuery();
foreach (var term in terms)
{
var termQuery = new TermQuery(new Term("name", term ));
booleanQuery.Add(termQuery, Occur.SHOULD);
}

Related

How can I write multiple updates in one BulkAll method in ElasticSearch NEST 7.13.2

Using ElasticSearch NEST .Net package 7.13.2 in Visual Studio 2019
For a list of products I am currently updating existing documents in my product index by using the following code:
var productIndex = "productindex";
foreach (var product in products)
{
var productClassIdScript = $"ctx._source.productClassId = \"{product.ProductClassId}\"; ";
elasticClient.Update<Product, object>(product.Id,
q => q.Index(productIndex).Script(s => s.Source(productClassIdScript).Lang("painless")));
}
I do this for more than 10000 products and it takes about 2 hours.
I know I can insert new documents with the Bulk API.
Can I do the updates with the BulkAll method ?
Something like this:
var bulkAllObservable = elasticClient.BulkAll<Product>(myBulkAllRequest)
.Wait(TimeSpan.FromMinutes(15), next =>
{
// do something e.g. write number of pages to console
});
How should I construct myBulkAllRequest ?
Any help is much appreciated.
Bulk indexing will drastically reduce your indexing and updating time, so this is a good way to go.
You can still use BulkAll for updates: if Elasticsearch already has a document with the provided id, that document is overwritten with the new one.
var bulk = elasticClient.BulkAll<EsDocument>(new List<EsDocument> { new EsDocument { Id = "1", Name = "1" }}, d => d);
using var subscribe = bulk.Subscribe(new BulkAllObserver(onNext: response => Console.WriteLine("inserted")));
bulk.Wait(TimeSpan.FromMinutes(1), response => Console.WriteLine("Bulk insert done"));
var bulk2 = elasticClient.BulkAll<EsDocument>(new List<EsDocument> { new EsDocument { Id = "1", Name = "1_updated" }}, d => d);
using var subscribe2 = bulk2.Subscribe(new BulkAllObserver(onNext: response => Console.WriteLine("inserted")));
bulk2.Wait(TimeSpan.FromMinutes(1), response => Console.WriteLine("Bulk insert done"));
The first BulkAll inserts the document with Id "1"; the second updates the document with Id "1".
(Screenshots: index state after the first bulk call, and after the second one.)

Lucene.net - How do I create a negative query, ie. search for objects NOT containing something

I'm working on an EPiServer website using a Lucene.net based search engine.
I have a query that finds only pages with a certain pageTypeId. Now I want to do the opposite: find only pages that do NOT have a certain pageTypeId. Is that possible?
This is the code for creating a query to search only for pages with pageTypeId 1, 2 or 3:
public BooleanClause GetClause()
{
var booleanQuery = new BooleanQuery();
var typeIds = new List<string>();
typeIds.Add("1");
typeIds.Add("2");
typeIds.Add("3");
foreach (var id in typeIds)
{
var termQuery = new TermQuery(
new Term(IndexFieldNames.PageTypeId, id));
var clause = new BooleanClause(termQuery,
BooleanClause.Occur.SHOULD);
booleanQuery.Add(clause);
}
return new BooleanClause(booleanQuery,
BooleanClause.Occur.MUST);
}
I want instead to create a query where I search for pages that have a pageTypeId that is NOT "4".
I tried simply replacing "SHOULD" and "MUST" with "MUST_NOT", but that didn't work.
Thanks to @goalie7960 for replying so quickly. Here is my revised code for searching for anything except some selected page types. This search includes all documents except those with pageTypeId "1", "2" or "3":
public BooleanClause GetClause()
{
var booleanQuery = new BooleanQuery();
booleanQuery.Add(new MatchAllDocsQuery(),
BooleanClause.Occur.MUST);
var typeIds = new List<string>();
typeIds.Add("1");
typeIds.Add("2");
typeIds.Add("3");
foreach (var typeId in typeIds)
{
booleanQuery.Add(new TermQuery(
new Term(IndexFieldNames.PageTypeId, typeId)),
BooleanClause.Occur.MUST_NOT);
}
return new BooleanClause(booleanQuery,
BooleanClause.Occur.MUST);
}
Assuming all your docs have a pageTypeId you can try using a MatchAllDocsQuery and then a MUST_NOT to remove all the docs you want to skip. Something like this would work I think:
BooleanQuery subQuery = new BooleanQuery();
subQuery.Add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
subQuery.Add(new TermQuery(new Term(IndexFieldNames.PageTypeId, "4")), BooleanClause.Occur.MUST_NOT);
return subQuery;

Why does this Lucene.Net query fail?

I am trying to convert my search functionality to allow for fuzzy searches involving multiple words. My existing search code looks like:
// Split the search into separate queries per word, and combine them into one major query
var finalQuery = new BooleanQuery();
string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
// Setup the fields to search
string[] searchfields = new string[]
{
// Various strings denoting the document fields available
};
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);
}
// Perform the search
var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
var searcher = new IndexSearcher(directory, true);
var hits = searcher.Search(finalQuery, MAX_RESULTS);
This works correctly, and if I have an entity with the name field of "My name is Andrew", and I perform a search for "Andrew Name", Lucene correctly finds the correct document. Now I want to enable fuzzy searching, so that "Anderw Name" is found correctly. I changed my method to use the following code:
const int MAX_RESULTS = 10000;
const float MIN_SIMILARITY = 0.5f;
const int PREFIX_LENGTH = 3;
if (string.IsNullOrWhiteSpace(searchString))
throw new ArgumentException("Provided search string is empty");
// Split the search into separate queries per word, and combine them into one major query
var finalQuery = new BooleanQuery();
string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
// Setup the fields to search
string[] searchfields = new string[]
{
// Strings denoting document field names here
};
// Create a subquery where the term must match at least one of the fields
var subquery = new BooleanQuery();
foreach (string field in searchfields)
{
var queryTerm = new Term(field, term);
var fuzzyQuery = new FuzzyQuery(queryTerm, MIN_SIMILARITY, PREFIX_LENGTH);
subquery.Add(fuzzyQuery, BooleanClause.Occur.SHOULD);
}
// Add the subquery to the final query; every subquery must match (in at least one field)
finalQuery.Add(subquery, BooleanClause.Occur.MUST);
}
// Perform the search
var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
var searcher = new IndexSearcher(directory, true);
var hits = searcher.Search(finalQuery, MAX_RESULTS);
Unfortunately, with this code if I submit the search query "Andrew Name" (same as before) I get zero results back.
The core idea is that all terms must be found in at least one document field, but each term can reside in different fields. Does anyone have any idea why my rewritten query fails?
Final Edit: OK, it turns out I was overcomplicating this by a LOT, and there was no need to change from my first approach. After reverting to the first code snippet, I enabled fuzzy searching by changing
finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);
to
finalQuery.Add(parser.Parse(term.Replace("~", "") + "~"), BooleanClause.Occur.MUST);
Your code works for me if I rewrite the searchString to lower-case. I'm assuming that you're using the StandardAnalyzer when indexing, and it will generate lower-case terms.
You need to 1) pass your tokens through the same analyzer (to enable identical processing), 2) apply the same logic as the analyzer or 3) use an analyzer which matches the processing you do (WhitespaceAnalyzer).
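Option 2 can be sketched as follows. This is a minimal sketch that only mirrors the lower-casing step; StandardAnalyzer also removes stop words and splits on punctuation, which is not reproduced here. The class and method names (QueryTerms, Normalize) are hypothetical and not part of the original code.

```csharp
using System;
using System.Linq;

public static class QueryTerms
{
    // Hypothetical helper: splits the search string on spaces and lower-cases
    // each token so it can match terms that were lower-cased at index time.
    // Assumption: lower-casing is the only analyzer step that matters here.
    public static string[] Normalize(string searchString)
    {
        return searchString
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .ToArray();
    }
}
```

In the loop from the question, `foreach (string term in QueryTerms.Normalize(searchString))` would then feed lower-case tokens into `new Term(field, term)`.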
You want this line:
var queryTerm = new Term(term);
to look like this:
var queryTerm = new Term(field, term);
Right now you're searching a field named after the search term (which probably doesn't exist) for the empty string (which will never be found).

Set operations in RavenDB

I read this article on ravendb set operations, but it didn't show me exactly how to update a set of documents via C#. I would like to update a field on all documents that match a certain criteria. Or to put it another way, I would like to take this C# and make it more efficient:
var session = db.GetSession();
foreach(var data in session.Query<Data>().Where(d => d.Color == "Red"))
{
data.Color = "Green";
session.Store(data);
}
session.SaveChanges();
See http://ravendb.net/docs/2.5/faq/denormalized-updates
First parameter is the name of the index you wish to update.
Second parameter is the index query which lets you specify your where clause. The syntax for the query is the lucene syntax (http://lucene.apache.org/java/2_4_0/queryparsersyntax.html). Third parameter is the update clause. Fourth parameter is if you want stale results.
documentStore.DatabaseCommands.UpdateByIndex("DataByColor",
new IndexQuery
{
Query = "Color:red"
}, new[]
{
new PatchRequest
{
Type = PatchCommandType.Set,
Name = "Color",
Value = "Green"
}
},
allowStale: false);

Refactor linq statement

I have a linq expression that I've been playing with in LINQPad and I would like to refactor the expression to replace all the tests for idx == -1 with a single test. The input data for this is the result of a free text search on a database used for caching Active Directory info. The search returns a list of display names and associated summary data from the matching database rows. I want to extract from that list the display name and the matching Active Directory entry. Sometimes the match will only occur on the display name so there may be no further context. In the example below, the string "Sausage" is intended to be the search term that returned the two items in the matches array. Clearly this wouldn't be the case for a real search because there is no match for Sausage in the second array item.
var matches = new []
{
new { displayName = "Sausage Roll", summary = "|Title: Network Coordinator|Location: Best Avoided|Department: Coordination|Email: Sausage.Roll@somewhere.com|" },
new { displayName = "Hamburger Pattie", summary = "|Title: Network Development Engineer|Location: |Department: Planning|Email: Hamburger.Pattie@somewhere.com|" },
};
var context = (from match in matches
let summary = match.summary
let idx = summary.IndexOf("Sausage")
let start = idx == -1 ? 0 : summary.LastIndexOf('|', idx) + 1
let stop = idx == -1 ? 0 : summary.IndexOf('|', idx)
let ctx = idx == -1 ? "" : string.Format("...{0}...", summary.Substring(start, stop - start))
select new { displayName = match.displayName, summary = ctx, })
.Dump();
I'm trying to create a list of names and some context for the search results if any exists. The output below is indicative of what Dump() displays and is the correct result:
displayName summary
---------------- ------------------------------------------
Sausage Roll ...Email: Sausage.Roll@somewhere.com...
Hamburger Pattie
Edit: Regex version is below, definitely tidier:
Regex reg = new Regex(@"\|((?:[^|]*)Sausage[^|]*)\|");
var context = (from match in matches
let m = reg.Match(match.summary)
let ctx = m.Success ? string.Format("...{0}...", m.Groups[1].Value) : ""
select new { displayName = match.displayName, context = ctx, })
.Dump();
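If the search term is not always the literal "Sausage", the same pattern can be built from a variable. A minimal sketch, assuming a hypothetical helper named ContextRegex.Extract (not part of the original post); Regex.Escape keeps characters such as '.' or '+' in the term from being read as regex syntax:

```csharp
using System.Text.RegularExpressions;

public static class ContextRegex
{
    // Hypothetical helper: returns "...<field>..." for the first pipe-delimited
    // field of summary that contains term, or "" when no field contains it.
    public static string Extract(string summary, string term)
    {
        // Same shape as the hard-coded pattern, with the term escaped.
        var pattern = @"\|([^|]*" + Regex.Escape(term) + @"[^|]*)\|";
        Match m = Regex.Match(summary, pattern);
        return m.Success ? "..." + m.Groups[1].Value + "..." : "";
    }
}
```

Inside the query this would replace the two let clauses: `select new { displayName = match.displayName, context = ContextRegex.Extract(match.summary, "Sausage") }`.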
(I know this doesn't answer your specific question), but here's my contribution anyway:
You haven't really described how your data comes in. As @Joe suggested, you could use a regex or split the fields as I've done below.
Either way, I would suggest refactoring your code to allow unit testing.
Otherwise, if your data is invalid or corrupt, you will get a runtime error in your LINQ query.
[TestMethod]
public void TestMethod1()
{
var matches = new[]
{
new { displayName = "Sausage Roll", summary = "|Title: Network Coordinator|Location: Best Avoided|Department: Coordination|Email: Sausage.Roll@somewhere.com|" },
new { displayName = "Hamburger Pattie", summary = "|Title: Network Development Engineer|Location: |Department: Planning|Email: Hamburger.Pattie@somewhere.com|" },
};
IList<Person> persons = new List<Person>();
foreach (var m in matches)
{
string[] fields = m.summary.Split('|');
persons.Add(new Person { displayName = m.displayName, Title = fields[1], Location = fields[2], Department = fields[3] });
}
Assert.AreEqual(2, persons.Count());
}
public class Person
{
public string displayName { get; set; }
public string Title { get; set; }
public string Location { get; set; }
public string Department { get; set; }
/* etc. */
}
Or something like this:
Regex reg = new Regex(@"^|Email.*|$");
foreach (var match in matches)
{
System.Console.WriteLine(match.displayName + " ..." + reg.Match(match.summary) + "... ");
}
I haven't tested this, probably not even correct syntax but just to give you an idea of how you could do it with regex.
Update
OK, I've seen your answer, and it's good that you posted it, because I think I didn't explain myself clearly.
I expected your answer to look something like this in the end (tested in LINQPad now; I now understand why you use LINQPad, since it runs a full C# program, not just LINQ commands, awesome!). Anyway, this is what it should look like:
foreach (var match in matches)
{
Console.WriteLine(string.Format("{0,-20}...{1}...", match.displayName, Regex.Match(match.summary, @"Email:(.*)[|]").Groups[1]));
}
That's it, the whole thing, take linq out of it, completely!
I hope this clears it up, you do not need linq at all.
Like this?
var context = (from match in matches
let summary = match.summary
let idx = summary.IndexOf("Sausage")
let test=idx == -1
let start =test ? 0 : summary.LastIndexOf('|', idx) + 1
let stop = test ? 0 : summary.IndexOf('|', idx)
let ctx = test ? "" : string.Format("...{0}...", summary.Substring(start, stop - start))
select new { displayName = match.displayName, summary = ctx, })
.Dump();
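Another way to end up with a single idx == -1 test is to move the string handling into a small method, so the query itself carries no conditionals. A minimal sketch, assuming a hypothetical helper named SearchContext.Extract that is not part of the original question:

```csharp
using System;

public static class SearchContext
{
    // Hypothetical helper: returns "...<field>..." for the pipe-delimited field
    // that contains the first occurrence of term, or "" when term is absent.
    public static string Extract(string summary, string term)
    {
        if (string.IsNullOrEmpty(term)) return "";

        int idx = summary.IndexOf(term, StringComparison.Ordinal);
        if (idx < 0) return "";                          // the only "not found" test

        int start = summary.LastIndexOf('|', idx) + 1;   // 0 when no '|' precedes the match
        int stop = summary.IndexOf('|', idx);
        if (stop < 0) stop = summary.Length;             // no closing '|': run to the end
        return "..." + summary.Substring(start, stop - start) + "...";
    }
}
```

The query then reduces to `from match in matches select new { displayName = match.displayName, summary = SearchContext.Extract(match.summary, "Sausage") }`, and the helper can be unit tested on its own.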
