Get query terms from Lucene query for highlighting

Get query terms from Lucene query for highlighting - c#

My Lucene queries will usually exist of a bunch of AND combined fields.
Is it possible to get the queried fields out of the Query object again?

Did you mean extracting the terms or the field names? Since you already know you're handling a BooleanQuery, to extract the fields you can simply iterate the BooleanClause array returned by BooleanQuery.getClauses(), rewrite each clause to its base query (Query.rewrite) and apply recursively until you have a TermQuery on your hands.
If you did mean term extraction, I'm not sure about Lucene.NET, but in Java Lucene you can use org.apache.lucene.search.highlight.QueryTermExtractor; you pass a (rewritten) query to one of its getTerms overloads and get an array of WeightedTerms.
As far as I remember, the downsides to using this technique are:
Since it internally uses a term set it won't handle multiple instances of the same token, e.g. "dream within a dream"
It only supports base query types (TermQuery, BooleanQuery and any other query type which supports Query.extractTerms). I believe we've used it internally for SpanNearQuery and SpanNearOrderedQuery instances, but I may be wrong on this.
Either way I hope this is enough to get you started.

Related

Why would I want to use an ExpressionVisitor?

I know from the MSDN's article about How to: Modify Expression Trees what an ExpressionVisitor is supposed to do. It should modify expressions.
Their example is however pretty unrealistic so I was wondering why would I need it? Could you name some real-world cases where it would make sense to modify an expression tree? Or, why does it have to be modified at all? From what to what?
It has also many overloads for visiting all kinds of expressions. How do I know when I should use any of them and what should they return? I saw people using VisitParameter and returning base.VisitParameter(node) the other on the other hand were returning Expression.Parameter(..).

There was a issue where on the database we had fields which contained 0 or 1 (numeric), and we wanted to use bools on the application.
The solution was to create a "Flag" object, which contained the 0 or 1 and had a conversion to bool. We used it like a bool through all the application, but when we used it in a .Where() clause the EntityFramework complained that it is unable to call the conversion method.
So we used a expression visitor to change all property accesses like .Where(x => x.Property) to .Where(x => x.Property.Value == 1) just before sending the tree to EF.

Could you name some real-world cases where it would make sense to modify an expression tree?
Strictly speaking, we never modify an expression tree, as they are immutable (as seen from the outside, at least, there's no promise that it doesn't internally memoise values or otherwise have mutable private state). It's precisely because they are immutable and hence we can't just change a node that the visitor pattern makes a lot of sense if we want to create a new expression tree that is based on the one we have but different in some particular way (the closest thing we have to modifying an immutable object).
We can find a few within Linq itself.
In many ways the simplest Linq provider is the linq-to-objects provider that works on enumerable objects in memory.
When it receives enumerables directly as IEnumerable<T> objects it's pretty straight-forward in that most programmers could write an unoptimised version of most of the methods pretty quickly. E.g. Where is just:
foreach (T item in source)
if (pred(item))
yield return item;
And so on. But what about EnumerableQueryable implementing the IQueryable<T> versions? Since the EnumerableQueryable wraps an IEnumerable<T> we could do the desired operation on the one or more enumerable objects involved, but we have an expression describing that operation in terms of IQueryable<T> and other expressions for selectors, predicates, etc, where what we need is a description of that operation in terms of IEnumerable<T> and delegates for selectors, predicates, etc.
System.Linq.EnumerableRewriter is an implementation of ExpressionVisitor does exactly such a re-write, and the result can then simply be compiled and executed.
Within System.Linq.Expressions itself there are a few implementations of ExpressionVisitor for different purposes. One example is that the interpreter form of compilation can't handle hoisted variables in quoted expressions directly, so it uses a visitor to rewrite it into working on indices into a a dictionary.
As well as producing another expression, an ExpressionVisitor can produce another result. Again System.Linq.Expressions has internal examples itself, with debug strings and ToString() for many expression types working by visiting the expression in question.
This can (though it doesn't have to be) be the approach used by a database-querying linq provider to turn an expression into a SQL query.
How do I know when I should use any of them and what should they return?
The default implementation of these methods will:
If the expression can have no child expressions (e.g. the result of Expression.Constant()) then it will return the node back again.
Otherwise visit all the child expressions, and then call Update on the expression in question, passing the results back. Update in turn will either return a new node of the same type with the new children, or return the same node back again if the children weren't changed.
As such, if you don't know you need to explicitly operate on a node for whatever your purposes are, then you probably don't need to change it. It also means that Update is a convenient way to get a new version of a node for a partial change. But just what "whatever your purposes are" means of course depends on the use case. The most common cases are probably go to one extreme or the other, with either just one or two expression types needing an override, or all or nearly all needing it.
(One caveat is if you are examining the children of those nodes that have children in a ReadOnlyCollection such as BlockExpression for both its steps and variables or TryExpression for its catch-blocks, and you will only sometimes change those children then if you haven't changed you are best to check for this yourself as a flaw [recently fixed, but not in any released version yet] means that if you pass the same children to Update in a different collection to the original ReadOnlyCollection then a new expression is created needlessly which has effects further up the tree. This is normally harmless, but it wastes time and memory).

The ExpressionVisitor enables the visitor pattern for Expression's.
Conceptually, the problem is that when you navigate an Expression tree, all you know is that any given node is an Expression, but you don't know specifically what kind of Expression. This pattern allows you to know what kind of Expression you're working with and specify type-specific handling for different kinds.
When you have an Expression, you can just call .Modify. The Expression knows its own type, so it'll call back the appropriate override.
Looking at the MSDN example you linked:
public class AndAlsoModifier : ExpressionVisitor
{
public Expression Modify(Expression expression)
{
return Visit(expression);
}
protected override Expression VisitBinary(BinaryExpression b)
{
if (b.NodeType == ExpressionType.AndAlso)
{
Expression left = this.Visit(b.Left);
Expression right = this.Visit(b.Right);
// Make this binary expression an OrElse operation instead of an AndAlso operation.
return Expression.MakeBinary(ExpressionType.OrElse, left, right, b.IsLiftedToNull, b.Method);
}
return base.VisitBinary(b);
}
}
In this example, if the Expression happens to be a BinaryExpression, it'll call back VisitBinary(BinaryExpression b) given in the example. Now, you can deal with that BinaryExpression knowing that it's a BinaryExpression. You could also specify other override methods that handle other kinds of Expression's.
It's worth noting that, since this is an overloaded resolution trick, visited Expression's will call back the best-fitting method. So, if there're different kinds of BinaryExpression's, then you could write an override for one specific subtype; if another subtype calls back, it'll just use the default BinaryExpression handling.
In short, this pattern allows you to navigate an Expression tree knowing what kind of Expression's you're working with.

Specific real world example I have just encountered occurred when shifting to EF Core and migrating from Sql Server (MS Specific) to SqlLite (platform independent).
The existing business logic revolved around a middle tier/ service layer interface that assumed Full Text Search (FTS) happened auto-magically in the background which it does with SQL Server. Search related queries were passed into this tier via Expressions and FTS against an Sql Server store required no additional FTS specific entities.
I didn't want to change any of this but with SqlLite you have to target a specific virtual table for a Full Text Search which would in turn have meant changing all the middle tier calls to re-target the FTS tables/entities and then joining them to the business entity tables to get a similar result set.
But by sub-classing ExpressionVisitor I was able to intercept the calls in the DAL layer and simply rewrite the incoming expression (or more precisely some of the BinaryExpressions within the overall search expression) to specifically handle SqlLites FTS requirements.
This meant that specialization of the datalayer to the data store happened within a single class that was called from a single place within a repository base class. No other aspects of the application needed to be altered in order to support FTS via EFCore and any SqlLite FTS related entities could be contained in a single pluggable assembly.
So ExpressionVisitor is really very useful, especially when combined with the whole notion of being able to pass around expression trees as data via various forms of IPC.

Query ContentItems in Orchard

I have a list of 'Article' content items. On this article is an article part, which has a field I'd like to reference in my query.
_contentManager.Query("Article").Where<ArticlePartRecord>(o => o.MyField == "criteria");
The above would work, but I don't have the strongly typed ArticlePartRecord to pass into the Where.
How else can I achieve this?
What I've Tried
I've tried iterating through the parts and fields within the Where, but this would then be done for every single article, of which there could be thousands. It'd pose a few performance problems.
Must I create the type? Or can I pass a string or work around it somehow? If it's a case of creating the class, what fields should this have?

The long and short is that you cant query fields really. Can you move the field into the Article part?
Do you need to be querying or can you use a Projection? There you can query fields because it indexes them all. I suppose you could try to search that index, though I don't believe it exposes it in a very friendly way, but I'm not 100% on that.

Dynamic Linq on Dapper dynamic collections - possible?

We are investigating using LinQ to query an internal dynamic collection created by Dapper. The question is:
How to execute dynamic LinQ against the collection using Scott Guthrie dynamic linq (or some other technology if possible)? (http://weblogs.asp.net/scottgu/dynamic-linq-part-1-using-the-linq-dynamic-query-library)
This is what we want to do (much simplified):
Use Dapper to return a dynamic collection (here called rows):
rows = conn.Query(“select ACCOUNT, UNIT, AMOUNT from myTable”);
If we use a “static” LinQ query there is no problem. So this works fine:
var result = rows.Where(w => w.AMOUNT > 0);
But we would like to write something similar to this using dynamic Linq:
var result = rows.Where("AMOUNT > 0");
But we can’t get this to work.
The error we get is:
No property or field ‘AMOUNT’ exists in type ‘Object’
(We have tried a lot of other syntax also – but cant get it to work)
Please note: We do not want to use dynamic SQL when Dapper requests data from the database (that is easy). We want to execute many small dynamic Linq statements on the collection that the Dapper query returns.
Can it be that ScottGu dynamic Linq only works with ‘LinQ to SQL’?
Is there some other alternative approach to achieve the same thing?
(Performance is a key issue)
/Erik

conn.Query("...")
returns an IEnumerable<dynamic>. However, "dynamic LINQ" pre-dates dynamic, and presumably nobody has updated it to work with dynamic; it could certainly be done, but it is work (and isn't trivial). Options:
use Query<T> for some T
make the required changes to "dynamic LINQ", and preferably make those changes available to the wider community

I suspect that conn.Query(“select ACCOUNT, UNIT, AMOUNT from myTable”); returns IEnumerable<object>. In order for DLinq to work, you need to have IEnumerable<TheActualType>.
You can try this:
conn.Query<dynamic>("yourQueryString")
.ToList()
//.ToAnonymousList()
.Where("AMOUNT > 0");
If that doesn't work, you could try and use ToAnonymousList, that tries to return the IList<TheActualType.

Common problem for me in C#, is my solution good, stupid, reasonable? (Advanced Beginner)

Ok, understand that I come from Cold Fusion so I tend to think of things in a CF sort of way, and C# and CF are as different as can be in general approach.
So the problem is: I want to pull a "table" (thats how I think of it) of data from a SQL database via LINQ and then I want to do some computations on it in memory. This "table" contains 6 or 7 values of a couple different types.
Right now, my solution is that I do the LINQ query using a Generic List of a custom Type. So my example is the RelevanceTable. I pull some data out that I want to do some evaluation of the data, which first start with .Contains. It appears that .Contains wants to act on the whole list or nothing. So I can use it if I have List<string>, but if I have List<ReferenceTableEntry> where ReferenceTableEntry is my custom type, I would need to override the IEquatable and tell the compiler what exactly "Equals" means.
While this doesn't seem unreasonable, it does seem like a long way to go for a simple problem so I have this sneaking suspicion that my approach is flawed from the get go.
If I want to use LINQ and .Contains, is overriding the Interface the only way? It seems like if there way just a way to say which field to operate on. Is there another collection type besides LIST that maybe has this ability. I have started using List a lot for this and while I have looked and looked, a see some other but not necessarily superior approaches.
I'm not looking for some fine point of performance or compactness or readability, just wondering if I am using a Phillips head screwdriver in a Hex screw. If my approach is a "decent" one, but not the best of course I'd like to know a better, but just knowing that its in the ballpark would give me little "Yeah! I'm not stupid!" and I would finish at least what I am doing completely before switch to another method.
Hope I explained that well enough. Thanks for you help.

What exactly is it you want to do with the table? It isn't clear. However, the standard LINQ (-to-Objects) methods will be available on any typed collection (including List<T>), allowing any range of Where, First, Any, All, etc.
So: what is you are trying to do? If you had the table, what value(s) do you want?
As a guess (based on the Contains stuff) - do you just want:
bool x= table.Any(x=>x.Foo == foo); // or someObj.Foo
?

There are overloads for some of the methods in the List class that takes a delegate (optionally in the form of a lambda expression), that you can use to specify what field to look for.
For example, to look for the item where the Id property is 42:
ReferenceTableEntry found = theList.Find(r => r.Id == 42);
The found variable will have a reference to the first item that matches, or null if no item matched.
There are also some LINQ extensions that takes a delegate or an expression. This will do the same as the Find method:
ReferenceTableEntry found = theList.FirstOrDefault(r => r.Id == 42);

Ok, so if I'm reading this correctly you want to use the contains method. When using this with collections of objects (such as ReferenceTableEntry) you need to be careful because what you're saying is you're checking to see if the collection contains an object that IS the same as the object you're comparing against.
If you use the .Find() or .FindAll() method you can specify the criteria that you want to match on using an anonymous method.
So for example if you want to find all ReferenceTableEntry records in your list that have an Id greater than 1 you could do something like this
List<ReferenceTableEntry> listToSearch = //populate list here
var matches = listToSearch.FindAll(x => x.Id > 1);
matches will be a list of ReferenceTableEntry records that have an ID greater than 1.
having said all that, it's not completely clear that this is what you're trying to do.

Here is the LINQ query involved that creates the object I am talking about, and the problem line is:
.Where (searchWord => queryTerms.Contains(searchWord.Word))
List<queryTerm> queryTerms = MakeQueryTermList();
public static List<RelevanceTableEntry> CreateRelevanceTable(List<queryTerm> queryTerms)
{
SearchDataContext myContext = new SearchDataContext();
var productRelevance = (from pwords in myContext.SearchWordOccuranceProducts
where (myContext.SearchUniqueWords
.Where (searchWord => queryTerms.Contains(searchWord.Word))
.Select (searchWord => searchWord.Id)).Contains(pwords.WordId)
orderby pwords.WordId
select new {pwords.WordId, pwords.Weight, pwords.Position, pwords.ProductId});
}
This query returns a list of WordId's that match the submitted search string (when it was List and it was just the word, that works fine, because as an answerer mentioned before, they were the same type of objects). My custom type here is queryTerms, a List that contains WordId, ProductId, Position, and Weight. From there I go about calculating the relevance by doing various operations on the created object. Sum "Weight" by product, use position matches to bump up Weights, etc. My point for keeping this separate was that the rules for doing those operations will change, but the basic factors involved will not. I would have even rather it be MORE separate (I'm still learning, I don't want to get fancy) but the rules for local and interpreted LINQ queries seems to trip me up when I do.
Since CF has supported queries of queries forever, that's how I tend to lean. Pull the data you need from the db, then do your operations (which includes queries with Aggregate functions) on the in-memory table.
I hope that makes it more clear.

Dynamic "WHERE" like queries on memory objects

What would be the best approach to allow users to define a WHERE-like constraints on objects which are defined like this:
Collection<object[]> data
Collection<string> columnNames
where object[] is a single row.
I was thinking about dynamically creating a strong-typed wrapper and just using Dynamic LINQ but maybe there is a simpler solution?
DataSet's are not really an option since the collections are rather huge (40,000+ records) and I don't want to create DataTable and populate it every time I run a query.

What kind of queries do you need to run? If it's just equality, that's relatively easy:
public static IEnumerable<object[]> WhereEqual(
this IEnumerable<object[]> source,
Collection<string> columnNames,
string column,
object value)
{
int columnIndex = columnNames.IndexOf(column);
if (columnIndex == -1)
{
throw new ArgumentException();
}
return source.Where(row => Object.Equals(row[columnIndex], value);
}
If you need something more complicated, please give us an example of what you'd like to be able to write.

If I get your point : you'd like to support users writting the where clause externally - I mean users are real users and not developers so you seek solution for the uicontrol, code where condition bridge. I just though this because you mentioned dlinq.
So if I'm correct what you want to do is really :
give the user the ability to use column names
give the ability to describe a bool function (which will serve as where criteria)
compose the query dynamically and run
For this task let me propose : Rules from the System.Workflow.Activities.Rules namespace. For rules there're several designers available not to mention the ones shipped with Visual Studio (for the web that's another question, but there're several ones for that too).I'd start with Rules without workflow then examine examples from msdn. It's a very flexible and customizable engine.
One other thing: LINQ has connection to this problem as a function returning IQueryable can defer query execution, you can previously define a query and in another part of the code one can extend the returned queryable based on the user's condition (which then can be sticked with extension methods).

When just using object, LINQ isn't really going to help you very much... is it worth the pain? And Dynamic LINQ is certainly overkill. What is the expected way of using this? I can think of a few ways of adding basic Where operations.... but I'm not sure how helpful it would be.

How about embedding something like IronPython in your project? We use that to allow users to define their own expressions (filters and otherwise) inside a sandbox.

I'm thinking about something like this:
((col1 = "abc") or (col2 = "xyz")) and (col3 = "123")
Ultimately it would be nice to have support for LIKE operator with % wildcard.

Thank you all guys - I've finally found it. It's called NQuery and it's available from CodePlex. In its documentation there is even an example which contains a binding to my very structure - list of column names + list of object[]. Plus fully functional SQL query engine.
Just perfect.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.