Why would LINQ ToLookup perform slowly with object as key - c#

I have the following code:
var data = dc.table.ToLookup(x => new ComboObject()
{
    PartnerId = x.Partner_ID,
    CustomerId = x.Customer_ID_
},
y => y);
This code is creating a lookup where the Key is a "ComboObject" object, and the values are the associated results from "dc.table".
My issue is, this code isn't very performant. When querying production data, that line takes about 10 minutes. After doing some digging today, I found that if I instead made the key a String (concatenating the PartnerId and CustomerId values), this code runs about 99% faster. So, the good news is... I fixed the bug! The bad news is... I don't know why!
So my question is: Why would a String-based key perform so much better than an object-based key?
Oh, and in case you were curious, my ComboObject type looks like this:
public struct ComboObject
{
    public string PartnerId { get; set; }
    public string CustomerId { get; set; }

    public override string ToString()
    {
        return string.Format("{0}/{1}", PartnerId, CustomerId);
    }
}
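For what it's worth, a likely explanation: ToLookup hashes its keys with EqualityComparer<ComboObject>.Default, and for a struct containing reference-type fields the default ValueType.Equals/GetHashCode implementations can fall back to reflection and boxing, which is extremely slow across millions of rows. A string key is fast because string has proper hashing built in. Below is a minimal sketch of explicit equality members that should restore the object key's speed, assuming default struct equality is indeed the culprit:
public struct ComboObject : IEquatable<ComboObject>
{
    public string PartnerId { get; set; }
    public string CustomerId { get; set; }

    // Explicit, allocation-free equality; avoids the reflection-based
    // ValueType defaults (assumption: those defaults are the bottleneck).
    public bool Equals(ComboObject other)
    {
        return PartnerId == other.PartnerId && CustomerId == other.CustomerId;
    }

    public override bool Equals(object obj)
    {
        return obj is ComboObject && Equals((ComboObject)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (PartnerId != null ? PartnerId.GetHashCode() : 0);
            hash = hash * 31 + (CustomerId != null ? CustomerId.GetHashCode() : 0);
            return hash;
        }
    }

    public override string ToString()
    {
        return string.Format("{0}/{1}", PartnerId, CustomerId);
    }
}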

Related

Is it possible to select huge list from db by list of composite primary keys

I have 2 databases.
From my DB I'm taking List<Item> items (I'm getting these by Date; there can be up to 300,000 elements):
public class Item
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }
    public DateTime Date { get; set; }
}
In the other database (which I don't control; I can only read from it, and I can't change anything in it) I need to select a List<OtherDbItem>:
public class OtherDbItem
{
    public string X { get; set; }
    public string Y { get; set; }
    public string Z { get; set; }
    public string FewOtherProperties { get; set; }
}
X, Y, Z form the composite primary key. I need to select all OtherDbItems where Item.A = OtherDbItem.X and Item.B = OtherDbItem.Y and Item.C = OtherDbItem.Z (then map the OtherDbItems to my model and save them in my database).
I am using 2 different EF Core DbContexts to connect to the databases.
I tried:
var otherDbItems = new List<OtherDbItem>();
foreach (var item in Items)
{
    var otherDbItem = await this.context.OtherDbItems.FindAsync(item.A, item.B, item.C);
    if (otherDbItem != null)
    {
        otherDbItems.Add(otherDbItem);
    }
}
return otherDbItems;
But there can be 300,000 Items, so that's 300,000 requests to the database; obviously that's not optimal, and not acceptable.
I also tried:
var ids = items.Select(item => item.A + item.B + item.C).ToList();
var otherDbItems = await this.context.OtherDbItems
    .Where(otherDbItem => ids.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z))
    .ToListAsync();
But this results in a huge SQL query; it's slow and causes a connection timeout.
Is it possible to get the OtherDbItems quickly and reliably?
And do I have to fetch the items in parts, for example .Skip(0).Take(1000) items per call? If so, how big should the parts be?
I can't say for sure that this is the best approach because I'm not an EF expert, but I had a similar scenario recently: a sync from an external JSON export into an EF Core database. Part of that operation was validating that existing EF Core entries, which would grow based on the imported data, were still valid if the export changed. Suffice to say, as the database grew toward a million or so records that had to be validated, we ran into timeouts and expensive queries.
The approach we settled on, which actually improved the speed of even our original process, was to batch the operations. The one thing we did differently from a plain Take()/Skip() approach was to batch on the input side: we took a collection of 1000 ids at a time and used that for the query before moving on to the next batch. With your code/data that might look something like this:
int chunkIndex = 0;
int batch = 1000;
var ids = items.Select(item => item.A + item.B + item.C).ToList();
while (chunkIndex < ids.Count)
{
    // Take up to 'batch' ids, clamping the last chunk to what remains.
    var chunkIDs = ids.GetRange(chunkIndex,
        chunkIndex + batch >= ids.Count ? ids.Count - chunkIndex : batch);
    var otherDbItems = await this.context.OtherDbItems
        .Where(otherDbItem => chunkIDs.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z))
        .ToListAsync();
    // Accumulate or process otherDbItems here before the next chunk.
    chunkIndex += batch;
}
I think this makes your query less expensive, since it isn't having to run the entire thing and then limit the result. Where your situation differs slightly is that your source is also a database, whereas ours was JSON content. You could probably optimize further by using the Skip()/Take() approach on your query of ids from the Items source table. The syntax on this might not be 100% right, but perhaps it gives the idea:
int chunkIndex = 0;
int batch = 1000;
// Update dbItemsContext.Items to your source context and table
int totalRecords = dbItemsContext.Items.Count();
while (chunkIndex < totalRecords)
{
    // Update dbItemsContext.Items to your source context and table.
    // Note: Skip/Take needs a stable ordering to page reliably; adjust the key as needed.
    var chunkIDs = dbItemsContext.Items
        .OrderBy(item => item.A)
        .Select(item => item.A + item.B + item.C)
        .Skip(chunkIndex)
        .Take(batch)
        .ToList();
    var otherDbItems = await this.context.OtherDbItems
        .Where(otherDbItem => chunkIDs.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z))
        .ToListAsync();
    chunkIndex += batch;
}
I hope that helps demonstrate our approach, but note that with this route you'd need to lock the tables to avoid changes until your operations are complete. I welcome any feedback, since it could improve our process as well. I'll also note that our application/context is not set up to run async, so you might need some additional modifications, or you could possibly even run these batches asynchronously for your use case.
A final note regarding batch size: you may need to play with it a bit. Our query was quite a bit more complex, so 1000 seemed to be the sweet spot for us, but you may be able to take quite a bit more at a time. I'm not sure there's any way to determine the best batch size other than testing a few different sizes.
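For what it's worth, on .NET 6 or later the same input-side batching can be written with Enumerable.Chunk. A minimal sketch, assuming .NET 6+ and the same context and property names as above:
var ids = items.Select(item => item.A + item.B + item.C);
foreach (var chunkIDs in ids.Chunk(1000))
{
    // chunkIDs is a string[]; Contains translates to a SQL IN (...) clause.
    var otherDbItems = await this.context.OtherDbItems
        .Where(o => chunkIDs.Contains(o.X + o.Y + o.Z))
        .ToListAsync();
    // Accumulate or process otherDbItems here.
}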
OK, it was much easier than I thought. Both databases are on the same SQL Server, so it was a matter of a simple inner join.
I just added the properties from Item to OtherDbItem:
public class OtherDbItem
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }
    public DateTime Date { get; set; }
    public string X { get; set; }
    public string Y { get; set; }
    public string Z { get; set; }
    public string FewOtherProperties { get; set; }
}
And in OnModelCreating:
protected override void OnModelCreating(ModelBuilder builder)
{
    builder.Entity<OtherDbItem>(
        entity =>
        {
            entity.ToSqlQuery(@"SELECT
                i.A,
                i.B,
                i.C,
                i.Date,
                o.X,
                o.Y,
                o.Z,
                o.FewOtherProperties
                FROM [DB1].[dbo].[Items] i
                INNER JOIN [DB2].[dbo].[OtherDbItem] o ON i.A = o.X AND i.B = o.Y AND i.C = o.Z");
            entity.HasKey(o => new { o.X, o.Y, o.Z });
        });
}
And the last thing to do:
// Hypothetical signature; the original post showed only the method body.
public Task<List<OtherDbItem>> GetOtherDbItemsAsync(DateTime date)
{
    return this.context.OtherDbItems
        .Where(x => x.Date == date)
        .Distinct()
        .ToListAsync();
}

LINQ - AND, ANY and NOT Query

I'm fairly new to LINQ. I can do some basic stuff with it, but I'm in need of an expert.
I am using Entity Framework and I have a table that has 3 columns.
public class Aspect
{
    [Key, Column(Order = 0)]
    public int AspectID { get; set; }

    [Key, Column(Order = 1)]
    public int AspectFieldID { get; set; }

    public string Value { get; set; }
}
I have 3 lists of words from a user's input. One contains phrases or words that must all be in the Value field (AND), another contains phrases or words of which at least one must be in the Value field (ANY), and the last contains phrases or words that cannot appear in the Value field (NOT).
I need to get every record that has all of the ALL words, any of the ANY words, and none of the NOT words.
Here are my objects.
public class SearchAllWord
{
    public string Word { get; set; }
    public bool IncludeSynonyms { get; set; }
}

public class SearchAnyWord
{
    public string Word { get; set; }
    public bool IncludeSynonyms { get; set; }
}

public class SearchNotWord
{
    public string Word { get; set; }
}
What I have so far is this:
// Retrieves a list of AspectFieldIDs that match user input
var aspectFields = getAspectFieldIDs().Where(fieldID => fieldID > 0).ToList();
var result = db.Aspects
    .Where(p => aspectFields.Contains(p.AspectFieldID))
    .ToList();
Any and all help is appreciated.
First let me say: if this is your requirement, your query will read every record in the database. This is going to be a slow operation.
IQueryable<Aspect> query = db.Aspects.AsQueryable();

// Note: if AllWords is empty, the query is not modified.
foreach (SearchAllWord x in AllWords)
{
    // Important: the lambda should capture a local variable instead of the loop variable.
    string word = x.Word;
    query = query.Where(aspect => aspect.Value.Contains(word));
}
foreach (SearchNotWord x in NotWords)
{
    string word = x.Word;
    query = query.Where(aspect => !aspect.Value.Contains(word));
}
if (AnyWords.Any()) //haha!
{
    List<string> words = AnyWords.Select(x => x.Word).ToList();
    query =
        from aspect in query
        from word in words //does this work in EF?
        where aspect.Value.Contains(word)
        group aspect by aspect into g
        select g.Key;
}
If you're sending this query into SQL Server, be aware of the ~2100 parameter limit. Each word is going to be sent as a parameter.
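As an aside: if the double-from form above doesn't translate in your EF version, one alternative known to translate is building an OR predicate with LinqKit's PredicateBuilder. A sketch, assuming the LinqKit NuGet package is referenced:
// Requires LinqKit (PredicateBuilder, AsExpandable).
var anyPredicate = PredicateBuilder.False<Aspect>();
foreach (SearchAnyWord x in AnyWords)
{
    string word = x.Word;
    anyPredicate = anyPredicate.Or(aspect => aspect.Value.Contains(word));
}
query = query.AsExpandable().Where(anyPredicate);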
What you need are the set operators, specifically
Intersect
Any
Bundle up your "all" words into a string array (or some other enumerable), and then you can use Intersect and Count to check they are all present.
Here are two sets
var A = new string[] { "first", "second", "third" };
var B = new string[] { "second", "third" };
A is a superset of B?
var isSuperset = A.Intersect(B).Count() == B.Count();
A is disjoint with B?
var isDisjoint1 = !A.Intersect(B).Any();
var isDisjoint2 = !A.Any(a => B.Any(b => a == b)); //faster
Your objects are not strings so you will want the overload that allows you to supply a comparator function.
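For instance, Intersect has an overload that takes an IEqualityComparer<T>. A sketch comparing the search-word objects from the question by their Word property (the comparer class and the listA/listB variables are illustrative, not from the original post):
class SearchWordComparer : IEqualityComparer<SearchAllWord>
{
    public bool Equals(SearchAllWord a, SearchAllWord b)
    {
        return string.Equals(a?.Word, b?.Word, StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(SearchAllWord x)
    {
        return x?.Word == null ? 0 : x.Word.ToLowerInvariant().GetHashCode();
    }
}

// Usage: are all of listB's words present in listA?
var allPresent = listA.Intersect(listB, new SearchWordComparer()).Count() == listB.Count();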
And now some soapboxing.
Much as I love Linq2sql, it is not available in ASP.NET Core, and the EF team wants to keep it that way, probably because jerks like me keep saying "gross inefficiency X of EF doesn't apply to Linq2Sql".
Core is the future. Small, fast, and cross-platform, it lets you serve a Web API from a Raspberry Pi running Windows IoT or Linux -- or get ridiculously high performance on big hardware.
EF is not and probably never will be a high performance proposition because it takes control away from you while insisting on being platform agnostic, which prevents it from exploiting the platform.
In the absence of Linq2sql, the solution seems to be libraries like Dapper, which handle parameters when sending the query and map results into object graphs when the result arrives, but otherwise don't do much. This makes them more or less platform agnostic but still lets you exploit the platform - apart from parameter substitution your SQL is passthrough.

Slow Query Compilation in C# with EntityFramework 4.4 When Using 100's of Id's

I was hoping I could get some help with a performance problem I'm having in EntityFramework 4.4. I'm working on converting an application that was using EDMX files over to code first and I've run into a problem when running queries with a large number of objects in the "where" clause of the LINQ query.
Here's a short overview of how everything is laid out (Entity doesn't refer to EF, it's the name given to a generic "thing" in our code):
public class ExampleDbContext : DbContext
{
    public DbSet<EntityTag> EntityTags { get; set; }
    public DbSet<Entity> Entities { get; set; }
    public DbSet<Log> Logs { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Fluent mappings added to modelBuilder.Configurations.Add() in here
    }
}
public class EntityTag
{
    public int Id { get; set; }
    public virtual Entity Entity { get; set; }
    public int EntityId { get; set; }
    public virtual Log Deleted { get; set; }
    public int? DeletedId { get; set; }
}

public class Entity
{
    public int Id { get; set; }
    public byte[] CompositeId { get; set; }
}
// Used to log when an event happens
public class Log
{
    public int Id { get; set; }
    public string Username { get; set; }
    public DateTime Timestamp { get; set; }
}
The query I'm running that causes the problem is:
// Creates an IEnumerable<byte[]> with the keys to find
var computedKeys = CreateCompositeIDs(entityKeys);

// Run the query and find any EntityTag that isn't deleted and is in
// the computedKeys list
var result = from et in Context.EntityTags
             where computedKeys.Contains(et.Entity.CompositeId) &&
                   et.Deleted == null
             select et;
var entityTags = result.ToList();
When computedKeys contains only a few Ids (15, for example) the code and query run quickly. When I have a large number of Ids (1600 is normal at this point, and it could get higher) it takes minutes to run that query once it's enumerated with ToList() (that's at 500 Ids; I haven't even tried 1500 yet). I've also removed the computedKeys.Contains() (leaving et.Deleted) from the query with a large number of computedKeys, and the query ends up running quickly.
Through debugging I've determined that creating the list of keys is fast, so that's not the problem. When I hook a profiler up to MSSQL to see the generated query, it looks normal: all of the CompositeIds are included in a WHERE CompositeId IN ( /* list of Ids, could be 1500 of them */ ), and when the query shows up in the profiler it executes in less than a second, so I don't think it's a database optimization issue either. The profiler sits there with nothing showing up for the entire run, aside from the last second or so when it quickly returns a result.
I hooked up dotTrace, and it looks like most of the time is spent within System.Data.Query.PlanCompiler.JoinGraph.GenerateTransitiveEdge(JoinEdge, JoinEdge) (119,640 ms); System.Collections.Generic.List`1+Enumerator.MoveNext (54,270 ms) is called within that method twice, I think, based on the total execution time for each of them.
I just can't seem to figure out why it's taking so long to generate the query. It doesn't seem to be any faster the second time it executes after compiling, either, so it doesn't look like it's being cached.
Thanks in advance for the help!
I was able to figure it out. Once I decided not to be held to the original query and reconsidered the result, I rewrote the query to be:
var computedKeys = CreateCompositeIDs(entityKeys);
var entityTags = (from e in Context.Entities
                  where computedKeys.Contains(e.CompositeId)
                  from et in e.Tags
                  select et).Distinct();
entityTags = from et in entityTags
             where et.Deleted == null
             select et;
return entityTags;
When I started querying the entities directly, took advantage of the relationship to EntityTag (which I forgot to include in the original question...) via Tags, and then filtered to only the existing EntityTags, the query sped up to the point where it all runs in under one second.

IQueryable returns null on invoking Count c#

I have a problem trying to get the count out of the following query:
var usersView = PopulateUsersView(); //usersView is an IQueryable object
var foo = usersView.Where(fields => fields.ConferenceRole.ToLower().Contains("role"));
Here, UsersView is a class populated from an EF entity called users (refer to the first line in the code above).
This is the class definition for the UsersView class:
public class UsersView
{
    public int UserId { get; set; }
    public string Title { get; set; }
    public string Name { get; set; }
    public string Surname { get; set; }
    public string Street1 { get; set; }
    public string Street2 { get; set; }
    public string City { get; set; }
    public string PostCode { get; set; }
    public string CountryName { get; set; }
    public string WorkPlaceName { get; set; }
    public string Gender { get; set; }
    public string EMail { get; set; }
    public string Company { get; set; }
    public string RoleName { get; set; }
    public string ConferenceRole { get; set; }
}
As I said, executing the line foo.Count() throws a null reference exception, and this might be because the ConferenceRole column allows null in the database.
Now what I can't understand is that when I invoke the same query directly on the ObjectQuery, the count of records (i.e. invoking foo2.Count()) is returned without any exceptions.
var foo2 = entities.users.Where(fields => fields.ConferenceRole.ToLower().Contains("role"));
Is it possible to do the same query as above, but using the IQueryable usersView object instead?
(It is crucial for me to use the usersView object rather than directly querying the entities.users entity)
EDIT
Below is the code from the PopulateUsersView method
private IQueryable<UsersView> PopulateUsersView()
{
    using (EBCPRegEntities entities = new EBCPRegEntities())
    {
        var users = entities.users.ToList();
        List<UsersView> userViews = new List<UsersView>();
        foreach (user u in users)
        {
            userViews.Add(new UsersView()
            {
                UserId = u.UserId,
                Title = u.Title,
                Name = u.Name,
                Surname = u.Surname,
                Street1 = u.Street1,
                Street2 = u.Street2,
                City = u.City,
                PostCode = u.Post_Code,
                CountryName = u.country.Name,
                WorkPlaceName = u.workplace.Name,
                Gender = u.Gender,
                EMail = u.E_Mail,
                Company = u.Company,
                RoleName = u.roles.FirstOrDefault().Name,
                ConferenceRole = u.ConferenceRole
            });
        }
        return userViews.AsQueryable();
    }
}
Thanks
UPDATE...
Thanks guys, I finally found a good answer about the difference between IQueryable and ObjectQuery objects.
As a solution, I am checking whether ConferenceRole is null before checking with the Contains method, as many of you have said.
My guess is that your PopulateUsersView() method is actually executing a query and returning a LINQ-to-Objects IQueryable, while the foo2 line executes the query entirely in the SQL layer. If this is the case, then obviously PopulateUsersView() is going to be quite an inefficient way to perform the Count.
To debug this:
can you post some code from PopulateUsersView()?
can you try running both sets of code through the EF tracing provider to see what is executed in SQL? (see http://code.msdn.microsoft.com/EFProviderWrappers)
Update
@Ryan - thanks for posting the code to PopulateUsersView.
Looks like my guess was right - you are doing a query which pulls the whole table back into a List, and it's this list that you then query further using LINQ to Objects.
@ntziolis has provided one solution to your problem - testing for null before doing the ToLower(). However, if your only requirement is to count the non-empty items, then I recommend changing the PopulateUsersView method or your overall design. If all you need is a Count, it is much more efficient to let the database do that work rather than the C# code. This is especially the case if the table has lots of rows - e.g. you definitely don't want to be pulling thousands of rows back into memory from the database.
Update 2
Please do consider optimising this and not just doing a simple != null fix.
Looking at your code, there are several lines which will cause multiple sql calls:
CountryName = u.country.Name
WorkPlaceName = u.workplace.Name
RoleName = u.roles.FirstOrDefault().Name
Since these are called in a foreach loop, then to calculate a count of ~500 users you will probably make somewhere around 1501 SQL calls (although some roles and countries will hopefully be cached), returning perhaps a megabyte of data in total. All this just to calculate a single integer Count?
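For example, a count computed entirely by the database might look like this (a sketch, assuming entities is the same EF context used for foo2 in the question):
// Let SQL compute the count instead of materializing the whole users table.
var count = entities.users.Count(u =>
    u.ConferenceRole != null &&
    u.ConferenceRole.ToLower().Contains("role"));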
Try to check whether ConferenceRole is null before calling a method on it:
var foo = usersView.Where(fields => fields.ConferenceRole != null
&& fields.ConferenceRole.ToLower().Contains("role"));
This will enable you to call the Count method on the user view.
So why does it work against the ObjectQuery?
When executing the query against the ObjectQuery, LINQ to Entities converts your query into proper SQL, which does not have problems with null values; something like this (it's sample mock-up SQL only, the actual query looks much different, and '=' is used rather than checking for Contains):
SELECT COUNT(*) FROM USERS U WHERE LOWER(U.CONFERENCEROLE) = 'role'
The difference from the .NET code is: SQL does not call a method on an object, but merely calls a function and passes in the value, so no NullReferenceException can occur in this case.
In order to confirm this, you can force the .NET runtime to execute the SQL prior to calling the Where method, by simply adding a ToList() before the .Where():
var foo2 = entities.users.ToList()
    .Where(fields => fields.ConferenceRole.ToLower().Contains("role"));
This should result in the exact same error you have seen with the UserView.
And yes this will return the entire user table first, so don't use it in live code ;)
UPDATE
I had to update the answer since I copy-pasted the wrong query in the beginning; the points above still stand, though.

Clarification of Dapper Example Code

I'm trying to grok Dapper and seem to be missing something very fundamental. Can someone explain the following code, taken from the Dapper home page on Google Code, and explain why there is no FROM clause, and why the second param to the Query method (dynamic) is passed an anonymous type? I gather this is somehow setting up a command object, but I would like an explanation in mere-mortal terminology.
Thank you,
Stephen
public class Dog
{
    public int? Age { get; set; }
    public Guid Id { get; set; }
    public string Name { get; set; }
    public float? Weight { get; set; }

    public int IgnoredProperty
    {
        get { return 1; }
    }
}

var guid = Guid.NewGuid();
var dog = connection.Query<Dog>("select Age = @Age, Id = @Id", new { Age = (int?)null, Id = guid });

dog.Count().IsEqualTo(1);
dog.First().Age.IsNull();
dog.First().Id.IsEqualTo(guid);
The first two examples just don't do any "real" data access, probably in order to keep them simple.
Yes, there is a connection used (connection.Query(...)), but only because that's the only way to call Dapper's methods (because they extend the IDbConnection interface).
Something like this is perfectly valid SQL code:
select 'foo', 1
...it just "generates" its result on the fly, without actually selecting anything from a table.
The example with the parameters and the anonymous type:
var dog = connection.Query<Dog>("select Age = @Age, Id = @Id", new { Age = (int?)null, Id = guid });
...just shows Dapper's ability to submit SQL parameters in the form of an anonymous type.
Again, the query does not actually select anything from a table, probably in order to keep it simple.
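For contrast, a query against a real table works exactly the same way, just with a FROM clause. A hypothetical sketch, assuming a Dogs table with matching columns exists:
// Hypothetical table and columns; parameters are still passed via an
// anonymous type, exactly as in the example above.
var dogs = connection.Query<Dog>(
    "select Id, Name, Age, Weight from Dogs where Age > @MinAge",
    new { MinAge = 3 });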
