Using Distinct in LINQ to SQL on the primary key - c#

I have a system that uses tags to categorize content similar to the way Stack Overflow does. I am trying to generate a list of the most recently used tags using LINQ to SQL.
(from x in cm.ContentTags
join t in cm.Tags on x.TagID equals t.TagID
orderby x.ContentItem.Date descending select t)
.Distinct(p => p.TagID) // <-- Ideally I'd be able to do this
The Tags table has a many to many relationship with the ContentItems table. ContentTags joins them, each tuple having a reference to the Tag and ContentItem.
I can't just use distinct because it compares on Tablename.* rather than Tablename.PrimaryKey, and I can't implement an IEqualityComparer since that doesn't translate to SQL, and I don't want to pull potential millions of records from the DB with .ToList(). So, what should I do?

You could write your own query provider, that supports such an overloaded distinct operator. It's not cheap, but it would probably be worthwhile, particularly if you could make it's query generation composable. That would enable a lot of customizations.
Otherwise you could create a stored proc or a view.

Use a subselect:
var result = from t in cm.Tags
where(
from x in cm.ContentTags
where x.TagID == t.TagID
).Contains(t.TagID)
select t;
This will mean only distinct records are returned form cm.Tags, only problem is you will need to find some way to order result

Use:
.GroupBy(x => x.TagID)
And then extract the data in:
Select(x => x.First().Example)

Related

Is it possible to combine 2 LINQ queries, each filtering data, before fetching the results of the queries?

I need to retrieve data from 2 SQL tables, using LINQ. I was hoping to combine them using a Join. I've looked this problem up on Stack Overflow, but all the questions and answers I've seen involve retrieving the data using ToList(), but I need to use lazy loading. The reason for this is there's too much data to fetch it all. Therefore, I've got to apply a filter to both queries before performing a ToList().
One of these queries is easily specified:
var solutions = ctx.Solutions.Where(s => s.SolutionNumber.Substring(0, 2) == yearsToConsider.PreviousYear || s.SolutionNumber.Substring(0, 2) == yearsToConsider.CurrentYear);
It retrieves all the data from the Solution table, where the SolutionNumber starts with either the current or previous year. It returns an IQueryable.
The thing that's tough for me to figure out is how to retrieve a filtered list from another table named Proficiency. At this point all I've got is this:
var profs = ctx.Proficiencies;
The Proficiency table has a column named SolutionID, which is a foreign key to the ID column in the Solution table. If I were doing this in SQL, I'd do a subquery where SolutionID is in a collection of IDs from the Solution table, where those Solution records match the same Where clause I'm using to retrieve the IQueryable for Solutions above. Only when I've specified both IQueryables do I want to then perform a ToList().
But I don't know how to specify the second LINQ query for Proficiency. How do I go about doing what I'm trying to do?
As far as I understand, you are trying to fetch Proficiencies based on some Solutions. This might be achieved in two different ways. I'll try to provide solutions in Linq as it is more readable. However, you can change them in Lambda Expressions later.
Solution 1
var solutions = ctx.Solutions
.Where(s => s.SolutionNumber.Substring(0, 2) == yearsToConsider.PreviousYear || s.SolutionNumber.Substring(0, 2) == yearsToConsider.CurrentYear)
.Select(q => q.SolutionId);
var profs = (from prof in ctx.Proficiencies where (from sol in solutions select sol).Contains(prof.SolutionID) select prof).ToList();
or
Solution 2
var profs = (from prof in ctx.Proficiencies
join sol in ctx.Solutions on prof.SolutionId equals sol.Id
where sol.SolutionNumber.Substring(0, 2) == yearsToConsider.PreviousYear || sol.SolutionNumber.Substring(0, 2) == yearsToConsider.CurrentYear
select prof).Distinct().ToList();
You can trace both queries in SQL Profiler to investigate the generated queries. But I'd go for the first solution as it will generate a subquery that is faster and does not use Distinct function that is not recommended unless you have to.

Linq query timing out, how to streamline query

Our front end UI has a filtering system that, in the back end, operates over millions of rows. It uses a an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that have both of what is selected in common). This is not a problem. However, Dropdown3 has has two types of data in it, and the checked items need to be ORd together, then ANDed with the rest of the query.
Due to the large amount of rows it is operating over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
driverIds = db.CarDriversManyToManyTable.Where(
cd =>
filter.CarIds.Contains(cd.CarId) && //get driver IDs for each car ID listed in filter object
).Select(cd => cd.DriverId).Distinct().ToList();
driverIds = driverIds.Concat(
db.DriverShopManyToManyTable.Where(ds => filter.ShopIds.Contains(ds.ShopId)) //Get driver IDs for each Shop listed in filter object
.Select(ds => ds.DriverId)
.Distinct()).Distinct().ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?
There are several ways to produce a single SQL query. All they require to keep the parts of the query of type IQueryable<T>, i.e. do not use ToList, ToArray, AsEnumerable etc. methods that force them to be executed and evaluated in memory.
One way is to create Union query containing the filtered Ids (which will be unique by definition) and use join operator to apply it on the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
.Where(cd => filter.CarIds.Contains(cd.CarId))
.Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
.Where(ds => filter.ShopIds.Contains(ds.ShopId))
.Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way could be using two OR-ed Any based conditions, which would translate to EXISTS(...) OR EXISTS(...) SQL query filter:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination. .Skip(PageNum * PageSize).Take(PageSize) I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever other smaller number seems reasonable to you.
You've mentioned that you need to use joins to get the data you need. These joins can be done while forming your IQueryable (entity framework), rather than in-memory (linq to objects). Read up on join syntax in linq.
HOWEVER - performing explicit joins in LINQ is not the best practice, especially if you are designing the database yourself. If you are doing database first generation of your entities, consider placing foreign-key constraints on your tables. This will allow database-first entity generation to pick those up and provide you with Navigation Properties which will greatly simplify your code.
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an entity framework linq query that uses explicit joins as a last resort.
To speed such queries up, you will likely need to perform indexing on all of the "key" columns that you are joining on. The best way to figure out what indexes you need to improve performance, take the SQL query generated by your EF linq and bring it on over to SQL Server Management Studio. From there, update the generated SQL to provide some predefined values for your #p parameters just to make an example. Once you've done this, right click on the query and either use display estimated execution plan or include actual execution plan. If indexing can improve your query performance, there is a pretty good chance that this feature will tell you about it and even provide you with scripts to create the indexes you need.
It looks to me that using the instance versions of the LINQ extensions is creating several collections before you're done. using the from statement versions should cut that down quite a bit:
driveIds = (from var record in db.CarDriversManyToManyTable
where filter.CarIds.Contains(record.CarId)
select record.DriverId).Concat
(from var record in db.DriverShopManyToManyTable
where filter.ShopIds.Contains(record.ShopId)
select record.DriverId).Distinct()
Also using the groupby extension would give better performance than querying each driver Id.

Simple table join in nHibernate

I'm new to working with nHibernate and have inherited a project which has implemented it for accessing its database. So far I've been able to write additional single-table query methods using QueryOvery<>, but I'm finding the logic behind table joins perplexing.
I want to implement the following T-SQL query:
SELECT DISTINCT f.FILE_ID, f.COMPANY_ID, f.FILE_META_ID, (etc...)
FROM AUDIT.FILE_INSTANCE f
INNER JOIN AUDIT.FILE_INSTANCE_REPORTING_PERIOD p ON p.FILE_ID = f.FILE_ID
WHERE p.REPORTING_PERIOD_ID BETWEEN 20150101 AND 20151304
AND FILE_STATUS != 'CANCELLED';
In this example, the reporting period ids would be parameters. Please note that they are integers, not dates, and the DISTINCT clause is important. How should I proceed?
I do not know your classes but if i would have written it, it would look like this:
var results = session.QueryOver<FileInstance>()
.Where(fi => fi.Status == FileStatus.Canceled)
.JoinQueryOver(fi => fi.ReportingPeriods)
.WhereRestrictionOn(period => period.Id).Between(20150101, 20151304)
.Transform(Transformers.DistinctRootEntity)
.List();

How to Linq-ify this query?

I currently have this Linq query:
return this.Context.StockTakeFacts
.OrderByDescending(stf => stf.StockTakeId)
.Where(stf => stf.FactKindId == ((int)kind))
.Take(topCount)
.ToList<IStockTakeFact>();
The intent is to return every fact for the topCount of StockTakes but instead I can see that I will only get the topCount number of facts.
How do I Linq-ify this query to achieve my aim?
I could use 2 queries to get the top-topCount StockTakeId and then do a "between" but I wondered what tricks Linq might have.
This is what I'm trying to beat. Note that it's really more about learning that not being able to find a solution. Also concerned about performance not for these queries but in general, I don't want to just to easy stuff and find out it's thrashing behind the scenes. Like what is the penalty of that contains clause in my second query below?
List<long> stids = this.Context.StockTakes
.OrderByDescending(st => st.StockTakeId)
.Take(topCount)
.Select(st => st.StockTakeId)
.ToList<long>();
return this.Context.StockTakeFacts
.Where(stf => (stf.FactKindId == ((int)kind)) && (stids.Contains(stf.StockTakeId)))
.ToList<IStockTakeFact>();
What about this?
return this.Context.StockTakeFacts
.OrderByDescending(stf => stf.StockTakeId)
.Where(stf => stf.FactKindId == ((int)kind))
.Take(topCount)
.Select(stf=>stf.Fact)
.ToList();
If I've understood what you're after correctly, how about:
return this.Context.StockTakes
.OrderByDescending(st => st.StockTakeId)
.Take(topCount)
.Join(
this.Context.StockTakeFacts,
st => st.StockTakeId,
stf => stf.StockTakeId,
(st, stf) => stf)
.OrderByDescending(stf => stf.StockTakeId)
.ToList<IStockTakeFact>();
Here's my attempt using mostly query syntax and using two separate queries:
var stids =
from st in this.Context.StockTakes
orderby st.StockTakeId descending
select st.StockTakeId;
var topFacts =
from stid in stids.Take(topCount)
join stf in this.Context.StockTakeFacts
on stid equals stf.StockTakeId
where stf.FactKindId == (int)kind
select stf;
return topFacts.ToList<IStockTakeFact>();
As others suggested, what you were looking for is a join. Because the join extension has so many parameters they can be a bit confusing - so I prefer query syntax when doing joins - the compiler gives errors if you get the order wrong, for instance. Join is by far preferable to a filter not only because it spells out how the data is joined together, but also for performance reasons because it uses indexes when used in a database and hashes when used in linq to objects.
You should note that I call Take in the second query to limit to the topCount stids used in the second query. Instead of having two queries, I could have used an into (i.e., query continuation) on the select line of the stids query to combine the two queries, but that would have created a mess for limiting it to topCount items. Another option would have been to put the stids query in parentheses and invoked Take on it. Instead, separating it out into two queries seemed the cleanest to me.
I ordinarily avoid specifying generic types whenever I think the compiler can infer the type; however, IStockTakeFact is almost certainly an interface and whatever concrete type implements it is likely contained by this.Context.StockTakeFacts; which creates the need to specify the generic type on the ToList call. Ordinarily I omit the generic type parameter to my ToList calls - that seems to be an element of my personal tastes, yours may differ. If this.Context.StockTakeFacts is already a List<IStockTakeFact> you could safely omit the generic type on the ToList call.

Linq with lambda expression: how do IN statement from sql?

In SQL I can use the IN operator:
select * from project where id IN (select f.id_project from funds)
How can I convert this statement to LINQ?
Controller.Project.Where(p => p.id == id_funds); // I can assign only one id funds
Use the Contains method:
Controller.Project.Where(p => id_funds.Contains(p.id));
Don't think it applies to your question as it appears id_funds links to a db table but...
Note that if id_funds is an object (i.e. a simple array or list etc.), not something in the database and you're not using Linq-To-Objects you may encounter issues with this as i don't think Linq-To-Entities before v 4.0 supports it (see here for a workaround), and Linq-To-SQL has problems if id_funds is a very large list (more than around 2000 items).

Categories

Resources