Detecting "near duplicates" using a LINQ/C# query - c#

I'm using the following queries to detect duplicates in a database.
Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".
var results = result
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)
My main goals are:
To list the first record with a subset of all duplicates (or "near duplicates")
To have some flexibility over the fields and text comparisons I use for my duplicates.

For your explicit "ignoring spaces" case, you can simply call
var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
public bool Equals(Company x, Company y)
{
return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
}
public int GetHashCode(Company obj)
{
return obj.Name.Replace(" ", "").GetHashCode();
}
}
which you could use as
var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
Just note that your definition of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
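One way to get transitivity for free is to define "similar" as "maps to the same normalized key". The following is a minimal sketch under that assumption; the Company shape, the Normalize rule (drop whitespace, ignore case) and the NearDuplicates helper are all hypothetical names chosen for illustration, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal model; the real Company type is not shown in the question.
public class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
}

// Equality is defined as "same normalized key", so it is automatically
// reflexive, symmetric and transitive.
public class NormalizedNameComparer : IEqualityComparer<Company>
{
    // Assumed normalization rule: drop all whitespace and ignore case.
    public static string Normalize(string name) =>
        new string((name ?? "").Where(ch => !char.IsWhiteSpace(ch)).ToArray())
            .ToUpperInvariant();

    public bool Equals(Company x, Company y) =>
        Normalize(x?.Name) == Normalize(y?.Name);

    // Must agree with Equals: equal companies produce equal hash codes.
    public int GetHashCode(Company obj) => Normalize(obj?.Name).GetHashCode();
}

public static class NearDuplicates
{
    // Returns the first record of each group together with the group size.
    public static List<(Company First, int Qty)> Summarize(IEnumerable<Company> companies) =>
        companies
            .GroupBy(c => c, new NormalizedNameComparer())
            .Select(g => (First: g.First(), Qty: g.Count()))
            .ToList();
}
```

Swapping in a different Normalize rule (for example, also stripping punctuation) changes what counts as a near duplicate without touching the grouping code.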

Okay, so since you're looking for different permutations you could do something like this:
Bear in mind this was written in the answer so it may not fully compile, but you get the idea.
var results = result
.Where(c => CompanyNamePermutations(c.CompanyName).Contains(c.CompanyName))
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
private static List<string> CompanyNamePermutations(string companyName)
{
// build your permutations here
// so to build the one in your example
return new List<string>
{
companyName,
string.Join("", companyName.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
};
}

In this case you need to define where the work is going to take place i.e. fully on the server, in local memory or a mixture of both.
In local memory:
In this case we have two routes: pull back all the data and do the logic in local memory, or stream the data and apply the logic piecewise. To pull all the data, just call ToList() or ToArray() on the base table. To stream the data, I would suggest using ToLookup() with a custom IEqualityComparer, e.g.
public class CustomEqualityComparer: IEqualityComparer<String>
{
public bool Equals(String str1, String str2)
{
//custom logic
}
public int GetHashCode(String str)
{
// custom logic
}
}
//result
var results = result.ToLookup(r => r.Name,
new CustomEqualityComparer())
.Select(r => ....)
Fully on the server:
Depends on your provider and what it can successfully map. E.g. if we define a near duplicate as one with an alternative delimiter one could do something like this:
private char[] delimiters = new char[]{' ','-','*'};
var results = result.GroupBy(r => delimiters.Aggregate(r.Name, (name, d) => name.Replace(d.ToString(), "")))...
Mixture:
In this case we are splitting the work between the two. Unless you come up with a nice scheme this route is most likely to be inefficient. E.g. if we keep the logic on the local side, build groupings as a mapping from a name into a key and just query the resulting groupings we can do something like this:
var groupings = result.Select(r => r.Name)
//pull into local memory
.ToArray()
//do local grouping logic...
//Query results
var results = result.GroupBy(r => groupings[r]).....
Personally I usually go with the first option, pulling all the data for small data sets and streaming large data sets (empirically I found streaming with logic between each pull takes a lot longer than pulling all the data then doing all the logic)
Note: depending on the provider, ToLookup() usually executes immediately and applies its logic piecewise as the lookup is constructed.
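As a concrete illustration of the local-memory route, here is a small sketch of a filled-in string comparer used with ToLookup(). It assumes that "near duplicate" means "equal once the delimiters ' ', '-' and '*' are removed, ignoring case"; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Treats two names as equal when they match after removing the
// assumed delimiters and ignoring case.
public class DelimiterInsensitiveComparer : IEqualityComparer<string>
{
    private static readonly char[] Delimiters = { ' ', '-', '*' };

    private static string Strip(string s) =>
        string.Concat((s ?? "").Split(Delimiters, StringSplitOptions.RemoveEmptyEntries));

    public bool Equals(string a, string b) =>
        string.Equals(Strip(a), Strip(b), StringComparison.OrdinalIgnoreCase);

    // Hash the stripped form so that equal names land in the same bucket.
    public int GetHashCode(string s) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(Strip(s));
}

public static class LookupExample
{
    // Executes immediately and groups the names in local memory.
    public static ILookup<string, string> GroupNames(IEnumerable<string> names) =>
        names.ToLookup(n => n, new DelimiterInsensitiveComparer());
}
```

Each grouping in the resulting lookup holds all spellings of one name, so a group with more than one element is a set of near duplicates.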

Related

Remove all elements that have at least one duplicates completely out of a list in C#

Introduction
I am a Belgian software engineer working at a company that produces press brakes. I have an interesting problem and would like to know the best solution; performance is really important in my working context. I think it might be interesting for other programmers as well.
Data
I have a list with a bunch of objects of class type "CS3DLine".
List <CS3DLine> ListParallelLines = new List<CS3DLine>();
I also have a custom method which takes two of these objects as arguments and returns a boolean telling whether these two objects are equal or not.
public static bool IsSameLineIn3D(CS3DLine povleft, CS3DLine povright)
Wanted
I would like to get a FilteredListParallelLines where the CS3DLines that are equal are completely filtered out of the list.
Remarks
On the Internet I found examples (e.g. on this page on dotNetPerls) with the Distinct method and IEqualityComparer, but in these cases only the duplicates were deleted, not the originals that had the duplicates.
I know I can also try to solve this iteratively, but I am afraid that if the list contains a huge amount of objects, this will result in a bad performance.
If I understand correctly, the following is a set-based approach that might satisfy your requirements. I can't vouch for performance.
Can be simplified if the ordering of the list is not of importance.
In the absence of a definition of CS3DLine, I've provided an example for my own Line class.
As ever, when using set-based methods, it's best that the line class is immutable.
void Main()
{
List<Line> lines = new List<Line>();
var comparer = LineEqualityComparer.Instance;
var filtered = lines
.Select((line, idx) => new { line, idx })
.GroupBy(x => x.line, comparer)
.Where(g => g.Count() == 1)
.SelectMany(g => g)
.OrderBy(x => x.idx)
.Select(x => x.line);
}
class Line
{
public int X1 { get; }
public int Y1 { get; }
public int X2 { get; }
public int Y2 { get; }
}
class LineEqualityComparer : IEqualityComparer<Line>
{
public static IEqualityComparer<Line> Instance { get; } = new LineEqualityComparer();
public bool Equals(Line x, Line y)
{
//fill-in the blanks
}
public int GetHashCode(Line obj)
{
//fill-in the blanks
}
}
On a large dataset, you might be able to get better performance on the query by strategically placing a .AsParallel() somewhere in the chain of linq methods.
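To make the "fill in the blanks" parts concrete, here is one possible completion. It is only a sketch: it assumes two lines are equal when all four integer coordinates match exactly (so a reversed line counts as different), which may not be what IsSameLineIn3D does for CS3DLine. The constructor and the LineFilter helper are additions for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class Line
{
    public Line(int x1, int y1, int x2, int y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }
}

public sealed class LineEqualityComparer : IEqualityComparer<Line>
{
    public static IEqualityComparer<Line> Instance { get; } = new LineEqualityComparer();

    // Assumed definition of equality: all four coordinates match exactly.
    public bool Equals(Line x, Line y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.X1 == y.X1 && x.Y1 == y.Y1 && x.X2 == y.X2 && x.Y2 == y.Y2;
    }

    // Combines the same fields that Equals compares.
    public int GetHashCode(Line obj)
    {
        if (obj is null) return 0;
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + obj.X1;
            hash = hash * 31 + obj.Y1;
            hash = hash * 31 + obj.X2;
            hash = hash * 31 + obj.Y2;
            return hash;
        }
    }
}

public static class LineFilter
{
    // Keeps only lines that have no equal partner, in their original order.
    public static List<Line> OnlyUnique(IEnumerable<Line> lines) =>
        lines
            .Select((line, idx) => new { line, idx })
            .GroupBy(x => x.line, LineEqualityComparer.Instance)
            .Where(g => g.Count() == 1)
            .SelectMany(g => g)
            .OrderBy(x => x.idx)
            .Select(x => x.line)
            .ToList();
}
```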
For complex objects you need to override Equals and GetHashCode; after that you can simply compare them.
http://www.loganfranken.com/blog/687/overriding-equals-in-c-part-1/
In the first step you need to create a class that implements IEqualityComparer<CS3DLine> for your CS3DLine class.
This might look close to this:
public class CS3DComparer : IEqualityComparer<CS3DLine> {
public bool Equals(CS3DLine a, CS3DLine b) {
return IsSameLineIn3D(a, b);
}
public int GetHashCode(CS3DLine line){
// You do not need to use all properties of line to calculate the
// hashCode. If performance is not good enough you can experiment by
// adding and removing properties from the hash code calculation.
var hashCode = line.Property1?.GetHashCode() ?? 0;
hashCode = (hashCode * 397) ^ (line.Property2?.GetHashCode() ?? 0);
hashCode = (hashCode * 397) ^ (line.Property3?.GetHashCode() ?? 0);
return hashCode;
}
}
Next, to get an unsorted list of all elements of your ListParallelLines collection that have no duplicates, you can call this code:
var singles = ListParallelLines
.GroupBy(line => line, new CS3DComparer())
.Where(group => group.Count() == 1)
.Select(group => group.Key)
.ToList();
singles is now a list of all lines that have no duplicates in ListParallelLines.
For a possible speedup through parallelization you can try using PLINQ by starting the LINQ Query with a call to AsParallel().
var singles = ListParallelLines
.AsParallel()
.GroupBy(line => line, new CS3DComparer())
.Where(group => group.Count() == 1)
.Select(group => group.Key)
.ToList();
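The GroupBy approach only works if GetHashCode agrees with Equals, i.e. any two lines that IsSameLineIn3D reports as equal must produce the same hash code. If that is hard to guarantee (for instance if the comparison uses a floating-point tolerance, which is an assumption since its body is not shown), a fallback that needs nothing but the pairwise predicate can be sketched as below. It is quadratic in the list size, so it is a correctness baseline rather than a performance recommendation; the PairwiseFilter name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PairwiseFilter
{
    // Returns the items that are not "the same" as any other item,
    // using only the supplied pairwise predicate. O(n^2) comparisons.
    public static List<T> WithoutAnyDuplicates<T>(IReadOnlyList<T> items, Func<T, T, bool> isSame)
    {
        var hasTwin = new bool[items.Count];
        for (int i = 0; i < items.Count; i++)
        {
            for (int j = i + 1; j < items.Count; j++)
            {
                // Both already known to have a partner: nothing new to learn.
                if (hasTwin[i] && hasTwin[j]) continue;
                if (isSame(items[i], items[j]))
                {
                    hasTwin[i] = true;
                    hasTwin[j] = true;
                }
            }
        }

        var result = new List<T>();
        for (int i = 0; i < items.Count; i++)
        {
            if (!hasTwin[i]) result.Add(items[i]);
        }
        return result;
    }
}
```

With the question's types, the call would look like PairwiseFilter.WithoutAnyDuplicates(ListParallelLines, IsSameLineIn3D).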
Due to your requirement to remove items that have any duplicates completely from the list one approach is to first group your set and then to filter based on any groups that have more than one item.
Performance for this kind of filtering will always be a limiting factor. To save time on grouping and equality comparisons, you could have each object maintain its own precomputed hash to group by. That reduces the load at filter time, but the hash must be kept up to date whenever the instance changes. The trade-off differs depending on whether memory is the constraint (in which case you may not want to store hashes for all items) or speed is the concern. Storing hashes instead of calculating them is not ideal, because some code path could change an object without triggering a hash update, but if performance is a large factor it may help if carefully implemented.
var results = ListParallelLines.GroupBy(x => x.EqualityHash).Where(x => x.Count() == 1);
Given such a hash, this returns one single-item group for each item that has no duplicates whatsoever.
There is a default implementation of GetHashCode(), but it has a pretty high chance of conflicts. I have seen an issue in the past that caused a massive headache because of it, so try to avoid using it.
https://learn.microsoft.com/en-us/dotnet/api/system.object.gethashcode?redirectedfrom=MSDN&view=netframework-4.7.2#remarks

How to build () => new { x.prop} lambda expression dynamically?

How to dynamically create the below linq expression.
IQueryable abc = QueryData.Select(a => new { a, TempData = a.customer.Select(b => b.OtherAddress).ToList()[0] }).OrderBy(a => a.TempData).Select(a => a.a);
public class Orders
{
public long OrderID { get; set; }
public string CustomerID { get; set; }
public int EmployeeID { get; set; }
public double Freight { get; set; }
public string ShipCountry { get; set; }
public string ShipCity { get; set; }
public Customer[] customer {get; set;}
}
public class Customer
{
public string OtherAddress { get; set; }
public int CustNum { get; set; }
}
Actual data:
List<Orders> order = new List<Orders>();
Customer[] cs = { new Customer { CustNum = 5, OtherAddress = "Hello" }, new
Customer { CustNum = 986, OtherAddress = "Other" } };
Customer[] cso = { new Customer { OtherAddress = "T", CustNum = 5 }, new
Customer { CustNum = 777, OtherAddress = "other" } };
order.Add(new Orders(code + 1, "ALFKI", i + 0, 2.3 * i, "Mumbari", "Berlin", cs));
order.Add(new Orders(code + 2, "ANATR", i + 2, 3.3 * i, "Sydney", "Madrid", cso));
order.Add(new Orders(code + 3, "ANTON", i + 1, 4.3 * i, "NY", "Cholchester", cs));
order.Add(new Orders(code + 4, "BLONP", i + 3, 5.3 * i, "LA", "Marseille", cso));
order.Add(new Orders(code + 5, "BOLID", i + 4, 6.3 * i, "Cochin", "Tsawassen", cs));
public Orders(long OrderId, string CustomerId, int EmployeeId, double Freight, string ShipCountry, string ShipCity, Customer[] Customer = null)
{
this.OrderID = OrderId;
this.CustomerID = CustomerId;
this.EmployeeID = EmployeeId;
this.Freight = Freight;
this.ShipCountry = ShipCountry;
this.ShipCity = ShipCity;
this.customer = Customer;
}
If I sort on the OtherAddress field at index 0, only the Customer field is sorted. I need to sort the whole order data based on the OtherAddress field.
I have tried the below way:
private static IQueryable PerformComplexDataOperation<T>(this IQueryable<T> dataSource, string select)
{
string[] selectArr = select.Split('.');
ParameterExpression param = Expression.Parameter(typeof(T), "a");
Expression property = param;
for (int i = 0; i < selectArr.Length; i++)
{
int n;
if (int.TryParse(selectArr[i + 1], out n))
{
int index = Convert.ToInt16(selectArr[i + 1]);
property = Expression.PropertyOrField(Expression.ArrayIndex(Expression.PropertyOrField(property, selectArr[i]), Expression.Constant(index)), selectArr[i + 2]);
i = i + 2;
}
else property = Expression.PropertyOrField(property, selectArr[i]);
}
var TempData = dataSource.Select(Expression.Lambda<Func<T, object>>(property, param));
IQueryable<object> data = dataSource.Select(a => new { a, TempData = property});// Expression.Lambda<Func<T, object>>(property, param) });
return data;
}
Method call : PerformComplexDataOperation(datasource, "customer.0.OtherAddress")
I can get the value from this line : var TempData = dataSource.Select(Expression.Lambda<Func<T, object>>(property, param));
But I can't get the values in dataSource.Select(a => new { a, TempData = property});
It is working when we use the below code :
var TempData = dataSource.Select(Expression.Lambda<Func<T, object>>(property, param)).ToList();
IQueryable<object> data = dataSource.Select((a, i) => new { a, TempData = TempData[i] });
Is this a proper solution?
XY problem?
This feels like it's a case of the XY problem. Your solution is contrived (no offense intended), and the problem you're trying to solve is not apparent by observing your proposed solution.
However, I do think there is technical merit to your question when I read the intention of your code as opposed to your described intention.
Redundant steps
IQueryable abc = QueryData
.Select(a => new {
a,
TempData = a.customer.Select(b => b.OtherAddress).ToList()[0] })
.OrderBy(a => a.TempData)
.Select(a => a.a);
First of all, when you inline this into a single chained command, TempData becomes a redundant step. You could simply shift the first TempData logic (from the first Select) directly into the OrderBy lambda:
IQueryable abc = QueryData
.OrderBy(a => a.customer.Select(b => b.OtherAddress).ToList()[0])
.AsQueryable();
As you can see, this also means that you no longer need the second Select (since it existed only to undo the earlier Select)
Parametrization and method abstraction
You mentioned you're looking for a usage similar to:
PerformComplexDataOperation(datasource, "customer.0.OtherAddress")
However, this doesn't quite make sense, since you've defined an extension method:
private static IQueryable PerformComplexDataOperation<T>(this IQueryable<T> dataSource, string select)
I think you need to reconsider your intended usage, and also the method as it is currently defined.
Minor note, the return type of the method should be IQueryable<T> instead of IQueryable. Otherwise, you lose the generic type definition that LINQ tends to rely on.
Based on the method signature, your expected usage should be myData = myData.PerformComplexDataOperation("customer.0.OtherAddress").
Strings are easy hacks to allow you to circumvent an otherwise strongly typed system. While your string usage is technically functional, it is non-idiomatic and it opens the door to unreadable and/or bad code.
Using strings leads to a contrived string parsing logic. Look at your method definition, and count how many lines are there simply to parse the string and translate that into actual code again.
Strings also mean that you get no Intellisense, which can cause unseen bugs further down the line.
So let's not use strings. Let's look back at how I initially rewrote the OrderBy:
.OrderBy(a => a.customer.Select(b => b.OtherAddress).ToList()[0])
When you consider OrderBy as an ordinary method, no different from any custom method you and I can develop, then you should understand that a => a.customer.Select(b => b.OtherAddress).ToList()[0] is nothing more than a parameter that's being passed.
The type of this parameter is Func<A,B>, where:
A equals the type of your entity. So in this case, A is the same as T in your existing method.
B equals the type of your sorting value.
OrderBy(x => x.MyIntProp) means that B is of type int.
OrderBy(x => x.MyStringProp) means that B is of type string.
OrderBy(x => x.Customer) means that B is of type Customer.
Generally speaking, the type of B doesn't matter for you (since it will only be used by LINQ's internal ordering method).
Let's look at a very simple extension method that uses a parameter for its OrderBy:
public static IQueryable<A> OrderData<A, B>(this IQueryable<A> data, Func<A, B> orderbyClause)
{
return data
.OrderBy(orderbyClause)
.AsQueryable();
}
Using the method looks like:
IQueryable<MyEntity> myData = GetData(); //assume this returns a correct value
myData = myData.OrderData(x => x.MyIntProperty);
Notice how I did not need to specify either of the generic type arguments when calling the method.
A is already known to be MyEntity, because we're calling the method on an object of type IQueryable<MyEntity>.
B is already known to be an int, since the used lambda method returns a value of type int (from MyIntProperty)
As it stands, my example method is just a boring wrapper that does nothing different from the existing OrderBy method. But you can change the method's logic to suit your needs, and actually make it meaningfully different from the existing OrderBy method.
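One detail worth knowing about the wrapper above: because its parameter is a plain Func<A, B>, the call binds to the in-memory Enumerable.OrderBy, so the sort runs on the client even when the source is a database-backed IQueryable. If the ordering should stay part of the provider's query, the parameter can be declared as an expression tree instead. A minimal sketch (the method name OrderDataTranslatable is hypothetical):

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class QueryableOrderingExtensions
{
    // Taking Expression<Func<A, B>> makes the call bind to Queryable.OrderBy,
    // so the ordering is added to the query's expression tree and can be
    // translated by the provider instead of being executed in memory.
    public static IQueryable<A> OrderDataTranslatable<A, B>(
        this IQueryable<A> data, Expression<Func<A, B>> orderByClause)
    {
        return data.OrderBy(orderByClause);
    }
}
```

The call site is unchanged, e.g. myData.OrderDataTranslatable(x => x.MyIntProperty); the compiler converts the lambda to an expression tree automatically.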
Your expectations
Your description of your goals makes me think that you're expecting too much.
I need to sort "customer.0.OtherAddress" nested file compared to whole base data. But it sorted only for that field. For this case, I find that field value and stored it to TempData. Then Sorting the TempData field.
i need to sort the parent nodes not an sibling alone. QueryData.Select(a => new { a, TempData = a.customer.Select(b => b.OtherAddress).ToList()[0] }).OrderBy(a => a.TempData).Select(a => a.a); I sorting a original data based on temp data. Then i split the original data alone.
It's not possible to sort an entire nested data structure based on a single OrderBy call. OrderBy only sorts the collection on which you call OrderBy, nothing else.
If you have a list of Customer entities, who each have a list of Address entities, then you are working with many lists (a list of customers and several lists of addresses). OrderBy will only sort the list that you ask it to sort; it will not look for any nested lists.
You mention that your TempData solution works. I actually wrote an entire answer contesting that notion (it should be functionally similar to my suggested alternatives, and it should always order the original list, not any nested list), until I noticed that you've made it work for a very insidious and non-obvious reason:
.Select(a => new {
a,
TempData = a.customer.Select(b => b.OtherAddress).ToList()[0]
})
You are calling .ToList(), which changes how the code behaves. You started off with an IQueryable<>, which means that LINQ was preparing an SQL command to retrieve the data when you enumerate it.
This is the goal of an IQueryable<>. Instead of pulling all the data into memory and then filtering it according to your specifications, it instead constructs a complex SQL query, and will then only need to execute a single (constructed) query.
The execution of that constructed query occurs when you try to access the data (obviously, the data needs to be fetched if you want to access it). A common method of doing so is by enumerating the IQueryable<> into an IEnumerable<>.
This is what you've done in the Select lambda. Instead of asking LINQ to enumerate your list of orders, you've asked it to enumerate every list of addresses from every customer from every order in the list of orders.
But in order to know which addresses need to be enumerated, LINQ must first know which customers it's supposed to get the addresses from. And to find out which customers it needs, it must first figure out which orders you're working with. The only way it can figure all of that out is by enumerating everything.
My initial suggestion, that you should avoid using the TempData solution, is still valid. It's a redundant step that serves no functional purpose. However, the enumeration that also takes place may actually be of use to you here, because it changes LINQ's behavior slightly. You claim that it fixes your problem, so I'm going to take your statement at face value and assume that the slightly different behavior between LINQ-to-SQL and LINQ-to-Entities solves your problem.
You can keep the enumeration and still omit the TempData workaround:
IQueryable abc = QueryData
.OrderBy(a => a.customer.Select(b => b.OtherAddress).ToList()[0])
.AsEnumerable()
.AsQueryable();
Some footnotes:
You can use ToList() instead of AsEnumerable(), the result is the same.
When you use First() or Single(), enumeration will inherently take place, so you don't need to call AsEnumerable() beforehand.
Notice that I cast the result to an IEnumerable<>, but then I immediately re-cast it to IQueryable<>. Once a collection has been enumerated, any further operation on it will occur in-memory. Casting it back to an IQueryable<> does not change the fact that the collection has already been enumerated.
But does it work?
Now, I think that this still doesn't sort all of your nested lists with a single call. However, you claim it does. If you still believe that it does, then you don't need to read on (because your problem is solved). Otherwise, the following may be useful to you.
SQL, and by extension LINQ, has made it possible to sort a list based on information that is not found in the list. This is essentially what you're doing: you're asking LINQ to sort a list of orders based on a related address (regardless of whether you want the addresses to be retrieved from the database or not!). You're not asking it to sort the customers, or the addresses. You're only asking it to sort the orders.
Your sort logic feels a bit dirty to me. You are supplying an Address entity to your OrderBy method, without specifying any of its (value type) properties. But how are you expecting your addresses to be sorted? By alphabetical street name? By database id? ...
I would expect you to be more explicit about what you want, e.g. OrderBy(x => x.Address.Street).ThenBy(x => x.Address.HouseNumber) (this is a simplified example).
After enumeration, since all the (relevant) data is in-memory, you can start ordering all the nested lists. For example:
foreach(var order in myOrders)
{
order.Customer.Addresses = order.Customer.Addresses.OrderBy(x => x.Street).ToList();
}
This orders all the lists of addresses. It does not change the order of the list of orders.
Do keep in mind that if you want to order data in-memory, that you do in fact need the data to be present in-memory. If you never loaded the customer's addresses, you can't use addresses as a sorting argument.
Ordering the list of orders should be done before enumeration. It's generally faster to have it handled by your SQL database, which is what happens when you're working with LINQ-to-SQL.
Ordering nested lists should be done after enumeration, because the order of these lists is unrelated to the original IQueryable<Order>, which only focused on sorting the orders, not its nested related entities (during enumeration, the included entities such as Customer and Address are retrieved without ordering them).
You can transform your OrderBy so you don't need an anonymous type (though I like the Perl/Lisp Schwartzian Transform) and then it is straightforward to create dynamically (though I am not sure how dynamically you mean).
Using the new expression:
var abc = QueryData.OrderBy(a => a.customer[0].OtherAddress);
Not being sure what you mean by dynamic, you can create the lambda
x => x.OrderBy(a => a.customer[0].OtherAddress)
using Expression as follows:
var parmx = Expression.Parameter(QueryData.GetType(), "x");
var parma = Expression.Parameter(QueryData[0].GetType(), "a");
var abc2 = Expression.Lambda(Expression.Call(MyExtensions.GetMethodInfo((IEnumerable<Orders> x)=>x.OrderBy(a => a.customer[0].OtherAddress)),
new Expression[] { parmx,
Expression.Lambda(Expression.Property(Expression.ArrayIndex(Expression.Property(parma, "customer"), Expression.Constant(0)), "OtherAddress"), parma) }),
parmx);
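If "dynamic" means building the ordering from a path string such as "customer.0.OtherAddress", the key selector can be assembled segment by segment and handed to Queryable.OrderBy through Expression.Call. The following is a sketch under that assumption: OrderByPath, SampleOrder and SampleCustomer are hypothetical names (the sample types only mirror the property names from the question), and numeric segments are assumed to index into arrays.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical stand-ins mirroring the property names used in the question.
public class SampleCustomer
{
    public string OtherAddress { get; set; }
}

public class SampleOrder
{
    public long OrderID { get; set; }
    public SampleCustomer[] customer { get; set; }
}

public static class DynamicOrdering
{
    // Builds a => a.customer[0].OtherAddress (for the path "customer.0.OtherAddress")
    // and applies it as source.OrderBy(...).
    public static IQueryable<T> OrderByPath<T>(this IQueryable<T> source, string path)
    {
        ParameterExpression param = Expression.Parameter(typeof(T), "a");
        Expression body = param;

        foreach (string segment in path.Split('.'))
        {
            if (int.TryParse(segment, out int index))
            {
                // Numeric segment: index into the array produced so far.
                // Throws if the current expression is not an array.
                body = Expression.ArrayIndex(body, Expression.Constant(index));
            }
            else
            {
                // Named segment: property or field access.
                body = Expression.PropertyOrField(body, segment);
            }
        }

        LambdaExpression keySelector = Expression.Lambda(body, param);

        // Equivalent to Queryable.OrderBy<T, TKey>(source, keySelector),
        // with TKey only known at run time.
        MethodCallExpression call = Expression.Call(
            typeof(Queryable), "OrderBy",
            new[] { typeof(T), body.Type },
            source.Expression, Expression.Quote(keySelector));

        return source.Provider.CreateQuery<T>(call);
    }
}
```

Usage would be along the lines of QueryData.OrderByPath("customer.0.OtherAddress"). Whether a given provider can translate the array index is provider-specific; with an in-memory source wrapped by AsQueryable() it simply gets compiled and executed.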

Sort list by string field of class

I have a list of objects, each with time data, id numbers, and a string descriptor in the type field. I wish to pull all the values to the front of the list with a certain string type, while keeping the order of those list elements the same, and the order of the rest of the list elements the same, just attached to the back of those with my desired string.
I've tried, after looking for similar SE questions,
list.OrderBy(x => x.type.Equals("Auto"));
which has no effect, though all other examples I could find sorted by number rather than by a string.
List Objects class definition:
public class WorkLoad
{
public long id;
public DateTime timestamp;
...
public String type;
}
...create various workload objects...
schedule.Add(taskX)
schedule.OrderBy(x => x.type.Equals("Manual"));
//has no effect currently
If you already have a sorted list I think the fastest way to resort it by "having a type of auto or not" without losing the original order (and without having to resort all over again) could be this:
var result = list.Where(x => x.type.Equals("Auto"))
.Concat(list.Where(x => !x.type.Equals("Auto")))
.ToList();
Update:
You commented that "everything else should be sorted by time", so you can simply do this:
var result = list.OrderByDescending(x => x.type.Equals("Auto"))
.ThenBy(x => x.timestamp).ToList();
You can use multiple orderings in a sequence:
list.OrderBy(x => x.type == "Auto" ? 0 : 1).ThenBy(x => x.type);
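As for why the original attempt appeared to have no effect: OrderBy does not sort the list in place. It returns a new ordered sequence, and in the question that result is discarded. In addition, when ordering by a bool key, false sorts before true, so the matching items would end up at the back unless the order is reversed. Since LINQ's OrderBy/OrderByDescending are stable sorts, the relative order inside each group is preserved. A minimal sketch (the WorkLoad class is a cut-down version of the one in the question, and BringTypeToFront is a hypothetical helper name):

```csharp
using System.Collections.Generic;
using System.Linq;

// Cut-down version of the question's class, keeping only what the sort needs.
public class WorkLoad
{
    public long id;
    public string type;
}

public static class ScheduleSorting
{
    // Moves all items of the wanted type to the front. The sort is stable,
    // so both groups keep their original relative order.
    public static List<WorkLoad> BringTypeToFront(IEnumerable<WorkLoad> schedule, string wantedType) =>
        schedule
            .OrderByDescending(w => w.type == wantedType) // true sorts first when descending
            .ToList();
}
```

The important part at the call site is assigning the result, e.g. schedule = ScheduleSorting.BringTypeToFront(schedule, "Auto");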

Sorting a list of objects based on another

public class Product
{
public string Code { get; private set; }
public Product(string code)
{
Code = code;
}
}
List<Product> sourceProductsOrder =
new List<Product>() { new Product("BBB"), new Product("QQQ"),
new Product("FFF"), new Product("HHH"),
new Product("PPP"), new Product("ZZZ")};
List<Product> products =
new List<Product>() { new Product("ZZZ"), new Product("BBB"),
new Product("HHH")};
I have two product lists and I want to reorder the second one with the same order as the first.
How can I reorder the products list so that the result would be : "BBB", "HHH", "ZZZ"?
EDIT: Changed Code property to public as @juharr mentioned
You would use IndexOf:
var sourceCodes = sourceProductsOrder.Select(s => s.Code).ToList();
products = products.OrderBy(p => sourceCodes.IndexOf(p.Code)).ToList();
The only catch to this is if the second list has something not in the first list those will go to the beginning of the second list.
MSDN post on IndexOf can be found here.
You could try something like this
products.OrderBy(p => sourceProductsOrder.IndexOf(p))
if it is the same Product object. Otherwise, you could try something like:
products.OrderBy(p => GetIndex(sourceProductsOrder, p))
and write a small GetIndex helper method. Or create a Index() extension method for List<>, which would yield
products.OrderBy(p => sourceProductsOrder.Index(p))
The GetIndex method is rather simple so I omit it here.
(I have no PC to run the code so please excuse small errors)
Here is an efficient way to do this:
var lookup = sourceProductsOrder.Select((p, i) => new { p.Code, i })
.ToDictionary(x => x.Code, x => x.i);
products = products.OrderBy(p => lookup[p.Code]).ToList();
This should have a running time complexity of O(N log N), whereas an approach using IndexOf() would be O(N^2).
This assumes the following:
there are no duplicate product codes in sourceProductsOrder
sourceProductsOrder contains all of the product codes in products
you make the Code field/property non-private
If needed, you can create a safeguard against the first bullet by replacing the first statement with this:
var lookup = sourceProductsOrder.GroupBy(p => p.Code)
.Select((g, i) => new { g.Key, i })
.ToDictionary(x => x.Key, x => x.i);
You can account for the second bullet by replacing the second statement with this:
products = products.OrderBy(p =>
lookup.ContainsKey(p.Code) ? lookup[p.Code] : Int32.MaxValue).ToList();
And you can use both if you need to. These will slow down the algorithm a bit, but it should continue to have an O(N log N) running time even with these alterations.
I would implement a compare function that does a lookup of the order from sourceProductsOrder using a hash table. The lookup table would look like
(key) : (value)
"BBB" : 1
"QQQ" : 2
"FFF" : 3
"HHH" : 4
"PPP" : 5
"ZZZ" : 6
Your compare could then lookup the order of the two elements and do a simple < (pseudo code):
int compareFunction(Product a, Product b){
return lookupTable[a] < lookupTable[b]
}
Building the hash table would be linear, and the sort would generally be O(n log n).
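The pseudo code above could be turned into real C# roughly as follows. This is a sketch with a few assumptions that go beyond the answer: the table is keyed by Code rather than by the Product object, codes missing from the reference list are sent to the end, and the first occurrence wins if the reference list contains a code twice. The Product class repeats the one from the question, and SortByReference is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

public class Product
{
    public string Code { get; private set; }
    public Product(string code)
    {
        Code = code;
    }
}

public static class ReferenceOrderSort
{
    // Sorts 'products' in place so that it follows the order of 'reference'.
    public static void SortByReference(List<Product> products, IReadOnlyList<Product> reference)
    {
        // Code -> position in the reference list; built in linear time.
        var rank = new Dictionary<string, int>();
        for (int i = 0; i < reference.Count; i++)
        {
            if (!rank.ContainsKey(reference[i].Code)) rank[reference[i].Code] = i;
        }

        // Unknown codes get the largest rank and therefore sort last.
        int RankOf(Product p) => rank.TryGetValue(p.Code, out int r) ? r : int.MaxValue;

        // List<T>.Sort is not a stable sort, so products with the same rank
        // (e.g. several unknown codes) may end up in any relative order.
        products.Sort((a, b) => RankOf(a).CompareTo(RankOf(b)));
    }
}
```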
Easy come easy go:
IEnumerable<Product> result =
products.OrderBy(p => sourceProductsOrder.IndexOf(sourceProductsOrder.FirstOrDefault(p2 => p2.Code == p.Code)));
This will provide the desired result. Objects with ProductCodes not available in the source list will be placed at the beginning of the resultset. This will perform just fine for a couple of hundred of items I suppose.
If you have to deal with thousands of objects, then an answer like @Jon's will likely perform better. There you first create a kind of lookup value / score for each item and then use that for sorting / ordering.
The approach I described is O(n^2).

NHibernate query extremely slow compared to hard coded SQL query

I'm re-writing some of my old NHibernate code to be more database agnostic and use NHibernate queries rather than hard coded SELECT statements or database views. I'm stuck with one that's incredibly slow after being re-written. The SQL query is as such:
SELECT
r.recipeingredientid AS id,
r.ingredientid,
r.recipeid,
r.qty,
r.unit,
i.conversiontype,
i.unitweight,
f.unittype,
f.formamount,
f.formunit
FROM recipeingredients r
INNER JOIN shoppingingredients i USING (ingredientid)
LEFT JOIN ingredientforms f USING (ingredientformid)
So, it's a pretty basic query with a couple JOINs that selects a few columns from each table. This query happens to return about 400,000 rows and has roughly a 5 second execution time. My first attempt to express it as an NHibernate query was as such:
var timer = new System.Diagnostics.Stopwatch();
timer.Start();
var recIngs = session.QueryOver<Models.RecipeIngredients>()
.Fetch(prop => prop.Ingredient).Eager()
.Fetch(prop => prop.IngredientForm).Eager()
.List();
timer.Stop();
This code works and generates the desired SQL, however it takes 120,264ms to run. After that, I loop through recIngs and populate a List<T> collection, which takes under a second. So, something NHibernate is doing is extremely slow! I have a feeling this is simply the overhead of constructing instances of my model classes for each row. However, in my case, I'm only using a couple properties from each table, so maybe I can optimize this.
The first thing I tried was this:
IngredientForms joinForm = null;
Ingredients joinIng = null;
var recIngs = session.QueryOver<Models.RecipeIngredients>()
.JoinAlias(r => r.IngredientForm, () => joinForm)
.JoinAlias(r => r.Ingredient, () => joinIng)
.Select(r => joinForm.FormDisplayName)
.List<String>();
Here, I just grab a single value from one of my JOIN'ed tables. The SQL code is once again correct and this time it only grabs the FormDisplayName column in the select clause. This call takes 2498ms to run. I think we're on to something!!
However, I of course need to return several different columns, not just one. Here's where things get tricky. My first attempt is an anonymous type:
.Select(r => new { DisplayName = joinForm.FormDisplayName, IngName = joinIng.DisplayName })
Ideally, this should return a collection of anonymous types with both a DisplayName and an IngName property. However, this causes an exception in NHibernate:
Object reference not set to an instance of an object.
Plus, .List() is trying to return a list of RecipeIngredients, not anonymous types. I also tried .List<Object>() to no avail. Hmm. Well, perhaps I can create a new type and return a collection of those:
.Select(r => new TestType(r))
The TestType constructor would take a RecipeIngredients object and do whatever. However, when I do this, NHibernate throws the following exception:
An unhandled exception of type 'NHibernate.MappingException' occurred
in NHibernate.dll
Additional information: No persister for: KitchenPC.Modeler.TestType
I guess NHibernate wants to generate a model matching the schema of RecipeIngredients.
How can I do what I'm trying to do? It seems that .Select() can only be used for selecting a list of a single column. Is there a way to use it to select multiple columns?
Perhaps one way would be to create a model with my exact schema, however I think that would end up being just as slow as the original attempt.
Is there any way to return this much data from the server without the massive overhead, without hard coding a SQL string into the program or depending on a VIEW in the database? I'd like to keep my code completely database agnostic. Thanks!
The QueryOver syntax for converting selected columns into an artificial object (a DTO) is a bit different. See 16.6. Projections for more details and a nice example.
A draft could look like this. First, the DTO:
public class TestTypeDTO // the DTO
{
public string PropertyStr1 { get; set; }
...
public int PropertyNum1 { get; set; }
...
}
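And this is an example of its usage:
// aliases for the joined tables, declared as in the question
IngredientForms joinForm = null;
Ingredients joinIng = null;
// DTO marker
TestTypeDTO dto = null;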
// the query you need
var recIngs = session.QueryOver<Models.RecipeIngredients>()
.JoinAlias(r => r.IngredientForm, () => joinForm)
.JoinAlias(r => r.Ingredient, () => joinIng)
// place for projections
.SelectList(list => list
// this set is an example of string and int
.Select(x => joinForm.FormDisplayName)
.WithAlias(() => dto.PropertyStr1) // this WithAlias is essential
.Select(x => joinIng.Weight) // it will help the below transformer
.WithAlias(() => dto.PropertyNum1)) // with conversion
...
.TransformUsing(Transformers.AliasToBean<TestTypeDTO>())
.List<TestTypeDTO>();
So, I came up with my own solution that's a bit of a mix between Radim's solution (using the AliasToBean transformer with a DTO) and Jake's solution (selecting raw properties and converting each row to a list of object[] tuples).
My code is as follows:
var recIngs = session.QueryOver<Models.RecipeIngredients>()
.JoinAlias(r => r.IngredientForm, () => joinForm)
.JoinAlias(r => r.Ingredient, () => joinIng)
.Select(
p => joinIng.IngredientId,
p => p.Recipe.RecipeId,
p => p.Qty,
p => p.Unit,
p => joinIng.ConversionType,
p => joinIng.UnitWeight,
p => joinForm.UnitType,
p => joinForm.FormAmount,
p => joinForm.FormUnit)
.TransformUsing(IngredientGraphTransformer.Create())
.List<IngredientBinding>();
I then implemented a new class called IngredientGraphTransformer which converts each object[] tuple into an IngredientBinding object, which is what I was ultimately doing with this list anyway. This mirrors how AliasToBeanTransformer is implemented, except that AliasToBeanTransformer initializes a DTO based on a list of aliases.
using System;
using System.Collections;
using NHibernate.Transform;
public class IngredientGraphTransformer : IResultTransformer
{
public static IngredientGraphTransformer Create()
{
return new IngredientGraphTransformer();
}
IngredientGraphTransformer()
{
}
public IList TransformList(IList collection)
{
return collection;
}
public object TransformTuple(object[] tuple, string[] aliases)
{
Guid ingId = (Guid)tuple[0];
Guid recipeId = (Guid)tuple[1];
Single? qty = (Single?)tuple[2];
Units usageUnit = (Units)tuple[3];
UnitType convType = (UnitType)tuple[4];
Int32 unitWeight = (int)tuple[5];
Units rawUnit = Unit.GetDefaultUnitType(convType);
// Do a bunch of logic based on the data above
return new IngredientBinding
{
RecipeId = recipeId,
IngredientId = ingId,
Qty = qty,
Unit = rawUnit
};
}
}
Note that this is not as fast as running a raw SQL query and looping through the results with an IDataReader, but it's much faster than joining in all the various models and building the full set of data.
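For comparison, here is a rough sketch of the alias-based idea mentioned above: each alias names the DTO property that receives the value at the same position in the tuple. This is a simplified illustration that runs without NHibernate, not NHibernate's actual AliasToBean code. IngredientRow, AliasMapper and the property names are hypothetical, made up for the example.

```csharp
using System;
using System.Reflection;

// Hypothetical DTO for illustration only; the property names are assumptions.
public class IngredientRow
{
    public string FormDisplayName { get; set; }
    public int UnitWeight { get; set; }
}

public static class AliasMapper
{
    // Creates a T and copies tuple[i] into the property named aliases[i].
    // Simplified stand-in for an alias-based transformer, not the library's implementation.
    public static T Map<T>(object[] tuple, string[] aliases) where T : class, new()
    {
        var result = new T();
        for (int i = 0; i < aliases.Length; i++)
        {
            if (aliases[i] == null)
                continue; // column was selected without an alias, so skip it

            PropertyInfo prop = typeof(T).GetProperty(aliases[i]);
            if (prop == null)
                throw new InvalidOperationException("No property named " + aliases[i]);

            prop.SetValue(result, tuple[i]);
        }
        return result;
    }
}
```

Calling AliasMapper.Map<IngredientRow>(new object[] { "chopped", 120 }, new[] { "FormDisplayName", "UnitWeight" }) would give a row with both properties filled in. The hand-written transformer above avoids the reflection lookup per column by reading the tuple positionally, at the cost of depending on the column order in Select().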
IngredientForms joinForm = null;
Ingredients joinIng = null;
var recIngs = session.QueryOver<Models.RecipeIngredients>()
.JoinAlias(r => r.IngredientForm, () => joinForm)
.JoinAlias(r => r.Ingredient, () => joinIng)
.Select(r => r.column1, r => r.column2)
.List<object[]>();
Would this work?
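If the query does return a list of object[] rows, each row still has to be turned into something typed. A minimal sketch of that step, assuming the first selected column is a Guid and the second a nullable float (RecipeIngredientRow and RowConverter are hypothetical names, not part of the original code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical target type; the property names are assumptions for illustration.
public class RecipeIngredientRow
{
    public Guid IngredientId { get; set; }
    public float? Qty { get; set; }
}

public static class RowConverter
{
    // Each object[] holds the selected columns in the order they were listed in Select().
    public static List<RecipeIngredientRow> FromTuples(IEnumerable<object[]> rows)
    {
        return rows.Select(t => new RecipeIngredientRow
        {
            IngredientId = (Guid)t[0],
            Qty = (float?)t[1] // a null column value becomes a null float?
        }).ToList();
    }
}
```

The casts depend on the column order and on the exact CLR types the driver returns, so a mismatch (for example a double where a float is expected) would throw an InvalidCastException at runtime.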
