One of the things that gets beaten into you as a junior developer is that you never, ever do a "SELECT *" on a data set, as it is unreliable for several reasons.
Since moving over to Linq (firstly Linq to SQL and then the Entity Framework), I have wondered if the Linq equivalent is equally frowned upon?
E.g.
var MyResult = from t in DataContext.MyEntity
where t.Country == 7
select t;
Should we be selecting into an anonymous type with just the fields we want explicitly listed, or is the catch-all select now acceptable for LINQ to SQL et al. because of the extra plumbing surrounding the data they provide?
Regards
Moo
It's not frowned upon; it's determined by your use case. If you want to update the result and persist it, then you should select t; however, if you don't want to do that and are just querying for display purposes, you can make it more efficient by selecting only the properties you want:
var MyResult = from t in DataContext.MyEntity
where t.Country == 7
select new { t.Prop1, t.Prop2 };
This is for a few reasons. The population of an anonymous type is slightly faster, but more importantly it disables change tracking: because you can't persist an anonymous type, there's no need to track changes to it.
Here's an excellent rundown of the common performance areas like this that's great when starting out. It includes a more in-depth explanation of the change tracking I just described as well.
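Related to the change tracking point: LINQ to SQL also lets you switch tracking off wholesale for read-only work via the DataContext.ObjectTrackingEnabled flag. A minimal sketch against the question's context (the flag must be set before any query runs):
DataContext.ObjectTrackingEnabled = false;      // read-only: nothing gets change tracked, even full entities
var readOnlyRows = (from t in DataContext.MyEntity
                    where t.Country == 7
                    select t).ToList();          // detached copies; they cannot be submitted back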
select t in this case is selecting all fields from a known type. It is strongly typed and less subject to the same errors found in SQL.
For example, in SQL
INSERT INTO aTable
SELECT * FROM AnotherTable
could fail if AnotherTable changed, whereas in LINQ / .NET this situation doesn't arise.
If you are joining multiple tables, you can't do a select * in LINQ; you would have to create an anonymous type containing all of the types involved.
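For example, a join has to be projected into something; here is a rough sketch, assuming a hypothetical Countries table with Id and Name columns alongside the question's MyEntity:
var joined = from t in DataContext.MyEntity
             join c in DataContext.Countries on t.Country equals c.Id   // Countries, Id and Name are assumed names
             select new { Entity = t, CountryName = c.Name };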
The reason for avoiding SELECT * is that the underlying database might change and therefore column orders might change which could result in bugs in your data access layer.
You are not performing a SELECT * from your database; you are just saying that you want "t" and everything that goes with it. There is nothing wrong with that if that is truly what you need.
I would say what you are doing is the equivalent of a SELECT * statement. It is better to return only the fields you require, e.g.
var myResult = from t in DataContext.MyEntity
               where t.Country == 7
               select new
               {
                   Field1 = t.Field1,
                   Field2 = t.Field2
               };
You should still explicitly state what you want to select. If you select all, you are still pulling down far more data than you need and as new things are added, you will unnecessarily pull those as well. In general, a best practice is to pull only what you need.
Using LINQ will not alleviate the performance hit of getting extra fields.
However, it is impossible to generate a SELECT * FROM ... using LINQ to SQL.
Your code will generate a SELECT statement that explicitly names all of the columns defined in your model; it will ignore any changes to the database.
However, performance is still a concern, so you should use an anonymous type if you're only using some of the columns.
It may still be necessary to select the full entity, as in the example, especially if what needs to be done is a change to the row(s).
In SQL, SELECT * is different from LINQ, because LINQ will always return the same set of columns (as defined in the DBML).
I am having difficulty trying to use LINQ to query a sql database in such a way to group all objects (b) in one table associated with an object (a) in another table into an anonymous type with both (a) and a list of (b)s. Essentially, I have a database with a table of offers, and another table with histories of actions taken related to those offers. What I'd like to be able to do is group them in such a way that I have a list of an anonymous type that contains every offer, and a list of every action taken on that offer, so the signature would be:
List<'a>
where 'a is new { Offer offer, List<OfferHistories> offerHistories}
Here is what I tried initially, which obviously will not work
var query = (from offer in context.Offers
join offerHistory in context.OffersHistories on offer.TransactionId equals offerHistory.TransactionId
group offerHistory by offerHistory.TransactionId into offerHistories
select { offer, offerHistories.ToList() }).ToList();
Normally I wouldn't come to SE with this little information but I have tried many different ways and am at a loss for how to proceed.
Try to avoid .ToList() calls; only use them if really necessary. An important question: do you really need all columns of OffersHistories? Grouping a full object is very expensive, so try grouping only the necessary columns instead. If you really need all offerHistories for one offer, then I suggest writing a sub-select (this also costs more performance):
var query = (from offer in context.Offers
select new { offer, offerHistories = (from offerHistory in context.OffersHistories
where offerHistory.TransactionId == offer.TransactionId
select offerHistory) });
P.S.: it's a good idea to create indexes on foreign key columns and on columns used in WHERE and GROUP BY clauses; they will make the query faster.
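For completeness, a group join (join ... into) expresses the same offer-plus-histories shape in a single statement; whether it performs better than the sub-select above depends on the provider and the data, so treat it as a sketch using the question's own names:
var query = (from offer in context.Offers
             join offerHistory in context.OffersHistories
                 on offer.TransactionId equals offerHistory.TransactionId into offerHistories
             select new { offer, offerHistories }).ToList();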
Our front end UI has a filtering system that, in the back end, operates over millions of rows. It uses an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that have both of what is selected in common). This is not a problem. However, Dropdown3 has two types of data in it, and the checked items need to be ORed together, then ANDed with the rest of the query.
Due to the large number of rows it is operating over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
driverIds = db.CarDriversManyToManyTable
    .Where(cd => filter.CarIds.Contains(cd.CarId)) //get driver IDs for each car ID listed in filter object
    .Select(cd => cd.DriverId)
    .Distinct()
    .ToList();
driverIds = driverIds.Concat(
db.DriverShopManyToManyTable.Where(ds => filter.ShopIds.Contains(ds.ShopId)) //Get driver IDs for each Shop listed in filter object
.Select(ds => ds.DriverId)
.Distinct()).Distinct().ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?
There are several ways to produce a single SQL query. They all require keeping the parts of the query typed as IQueryable<T>, i.e. not using ToList, ToArray, AsEnumerable or other methods that force them to be executed and evaluated in memory.
One way is to create a Union query containing the filtered Ids (which will be unique by definition) and use the join operator to apply it to the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
.Where(cd => filter.CarIds.Contains(cd.CarId))
.Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
.Where(ds => filter.ShopIds.Contains(ds.ShopId))
.Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way could be to use two OR-ed Any-based conditions, which would translate to an EXISTS(...) OR EXISTS(...) SQL query filter:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination, e.g. .Skip(PageNum * PageSize).Take(PageSize). I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever other smaller number seems reasonable to you.
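A rough sketch of that (pageNum and pageSize are assumed to come from the UI, and note that EF requires an explicit OrderBy before Skip/Take):
var page = query.OrderBy(d => d.Id)        // any stable sort key works; Id is assumed here
                .Skip(pageNum * pageSize)
                .Take(pageSize)
                .ToList();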
You've mentioned that you need to use joins to get the data you need. These joins can be done while forming your IQueryable (Entity Framework), rather than in memory (LINQ to Objects). Read up on join syntax in LINQ.
HOWEVER, performing explicit joins in LINQ is not the best practice, especially if you are designing the database yourself. If you are doing database-first generation of your entities, consider placing foreign-key constraints on your tables. This will allow database-first entity generation to pick those up and provide you with Navigation Properties, which will greatly simplify your code.
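As an illustrative sketch only (CarDrivers and DriverShops are hypothetical navigation property names that such constraints might generate), the whole ID round-trip then collapses into:
query = query.Where(d =>
    d.CarDrivers.Any(cd => filter.CarIds.Contains(cd.CarId)) ||
    d.DriverShops.Any(ds => filter.ShopIds.Contains(ds.ShopId)));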
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an entity framework linq query that uses explicit joins as a last resort.
To speed such queries up, you will likely need to add indexes on all of the "key" columns that you are joining on. The best way to figure out which indexes you need is to take the SQL query generated by your EF LINQ and bring it over to SQL Server Management Studio. From there, update the generated SQL to provide some predefined values for your @p parameters, just as an example. Once you've done this, right-click on the query and either use Display Estimated Execution Plan or Include Actual Execution Plan. If indexing can improve your query performance, there is a pretty good chance that this feature will tell you about it and even provide you with scripts to create the indexes you need.
It looks to me like using the instance (method-syntax) versions of the LINQ extensions is creating several collections before you're done. Using the from-statement (query syntax) versions should cut that down quite a bit:
driverIds = (from record in db.CarDriversManyToManyTable
             where filter.CarIds.Contains(record.CarId)
             select record.DriverId)
            .Concat(from record in db.DriverShopManyToManyTable
                    where filter.ShopIds.Contains(record.ShopId)
                    select record.DriverId)
            .Distinct()
            .ToList();
Also, using the GroupBy extension would give better performance than querying each driver Id.
Suppose I have a collection (of arbitrary size) of IQueryable<MyEntity> (all for the same MyEntity type). Each individual query has successfully been dynamically built to encapsulate various pieces of business logic into a form that can be evaluated in a single database trip. Is there any way I can now have all these IQueryables executed in a single round-trip to the database?
For example (simplified; my actual queries are more complex!), if I had
ObjectContext context = ...;
var myQueries = new[] {
context.Widgets.Where(w => w.Price > 500),
context.Widgets.Where(w => w.Colour == 5),
context.Widgets.Where(w => w.Supplier.Name.StartsWith("Foo"))
};
I would like to have EF perform the translation of each query (which it can do individually), then, in one database visit, execute
SELECT * FROM Widget WHERE Price > 500
SELECT * FROM Widget WHERE Colour = 5
SELECT Widget.* FROM Widget
INNER JOIN Supplier ON Widget.SupplierId = Supplier.Id
WHERE Supplier.Name LIKE 'Foo%'
then convert each result set into an IEnumerable<Widget>, updating the ObjectContext in the usual way.
I've seen various posts about dealing with multiple result sets from a stored procedure, but this is slightly different (not least because I don't know at compile time how many result sets there are going to be). Is there an easy way, or do I have to use something along the lines of "Does the Entity Framework support the ability to have a single stored procedure that returns multiple result sets?"?
No. EF doesn't have query batching (future queries). One queryable is one database roundtrip. As a workaround you can try to play with it and, for example, use:
string sql = ((ObjectQuery<Widget>)context.Widgets.Where(...)).ToTraceString();
to get the SQL for the query and build your own custom command from all the SQL statements to be executed. After that you can use a similar approach as with stored procedures to translate the results.
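A rough sketch of that workaround, assuming the queries carry no parameters (real ones would also need their ObjectQuery.Parameters copied onto the command) and using the EF4-era namespaces:
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Data.EntityClient;   // System.Data.Entity.Core.EntityClient in EF6
using System.Data.Objects;        // System.Data.Entity.Core.Objects in EF6
using System.Linq;

// Generate the store SQL for every queryable, batch it into one command,
// then let ObjectContext.Translate materialise each result set.
var sqlParts = myQueries.Select(q => ((ObjectQuery<Widget>)q).ToTraceString());
var resultSets = new List<List<Widget>>();

var storeConnection = ((EntityConnection)context.Connection).StoreConnection;
storeConnection.Open();
using (DbCommand command = storeConnection.CreateCommand())
{
    command.CommandText = string.Join(";" + Environment.NewLine, sqlParts);
    using (DbDataReader reader = command.ExecuteReader())
    {
        do
        {
            // Translate maps the current result set onto Widget instances; the overload
            // that takes an entity set name and a MergeOption can also attach them.
            resultSets.Add(context.Translate<Widget>(reader).ToList());
        } while (reader.NextResult());
    }
}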
Unless you really need to have each query executed separately, you can also union them into a single query:
context.Widgets.Where(...).Union(context.Widgets.Where(...));
This will result in a UNION. If you need just UNION ALL, you can use the Concat method instead.
This might be a late answer, but hopefully it will help someone else with the same issue.
There is the Entity Framework Extended Library on NuGet, which provides the future-queries feature (among others). I played with it a bit and it looks promising.
You can find more information here.
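A small sketch of how the future-queries feature looks in use; the Future() extension method and the EntityFramework.Extensions namespace are taken from that library's documentation, so verify them against the version you install:
using EntityFramework.Extensions;   // extension methods from EntityFramework.Extended

var expensive = context.Widgets.Where(w => w.Price > 500).Future();
var blue = context.Widgets.Where(w => w.Colour == 5).Future();

// Nothing has hit the database yet; enumerating the first future query sends
// every pending future query to the server in a single batched round-trip.
var expensiveList = expensive.ToList();
var blueList = blue.ToList();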
My database structure is this: an OptiUser belongs to multiple UserGroups through the IdentityMap table, which is a matching table (many to many) with some additional properties attached to it. Each UserGroup has multiple OptiDashboards.
I have a GUID string which identifies a particular user (wlid in this code). I want to get an IEnumerable of all of the OptiDashboards for the user identified by wlid.
Which of these two LINQ to Entities queries is more efficient? Do they run the same way on the back end?
Also, can I shorten option 2's Include statements to just .Include("IdentityMaps.UserGroup.OptiDashboards")?
using (OptiEntities db = new OptiEntities())
{
// option 1
IEnumerable<OptiDashboard> dashboards = db.OptiDashboards
.Where(d => d.UserGroups
.Any(u => u.IdentityMaps
.Any(i => i.OptiUser.WinLiveIDToken == wlid)));
// option 2
OptiUser user = db.OptiUsers
.Include("IdentityMaps")
.Include("IdentityMaps.UserGroup")
.Include("IdentityMaps.UserGroup.OptiDashboards")
.Where(r => r.WinLiveIDToken == wlid).FirstOrDefault();
// then I would get the dashboards through user.IdentityMaps.UserGroup.OptiDashboards
// (through foreach loops...)
}
You may be misunderstanding what the Include function actually does. Option 1 is purely query syntax and has no effect on what is returned by the Entity Framework. Option 2, with the Include function, instructs the Entity Framework to eagerly fetch the related rows from the database when returning the results of the query.
So option 1 will result in some joins, but the "select" part of the query will be restricted to the OptiDashboards table.
Option 2 will result in joins as well, but in this case it will be returning the results from all the included tables, which obviously is going to introduce more of a performance hit. But at the same time, the results will include all the related entities you need, avoiding the [possible] need for more round-trips to the database.
I think the Include will render as joins and you will be able to access the data from those tables in your user object (eager loading the properties).
The Any query will render as EXISTS and will not load the user object with info from the other tables.
For the best performance, if you don't need the additional info, use the Any query.
As has already been pointed out, the first option would almost certainly perform better, simply because it would be retrieving less information. Besides that, I wanted to point out that you could also write the query this way:
var dashboards =
from u in db.OptiUsers where u.WinLiveIDToken == wlid
from im in u.IdentityMaps
from d in im.UserGroup.OptiDashboards
select d;
I would expect the above to perform similarly to the first option, but you may (or may not) prefer the above form.
I am very new to LINQ to SQL, so please forgive me if it's a layman sort of question.
I see at many places that we use "select new" keyword in a query.
For e.g.
var orders = from o in db.Orders select new {
o.OrderID,
o.CustomerID,
o.EmployeeID,
o.ShippedDate
};
Why don't we just remove select new and just use "select o"?
var orders = from o in db.Orders select o;
What I can see as a difference is performance in terms of speed, i.e. the second query will take more time to execute than the first one.
Are there any other "differences" or "better to use" considerations between them?
With the new keyword they are building an anonymous object with only those four fields. Perhaps Orders has 1000 fields, and they only need 4 fields.
If you are doing it in LINQ to SQL or Entity Framework (or other similar ORMs), the SELECT it builds and sends to SQL Server will only load those 4 fields (note that NHibernate doesn't exactly support projections at the db level: when you load an entity you have to load it completely). Less data is transmitted over the network, AND there is a small chance that this data is contained in an index (loading data from an index is normally faster than loading from the table, because the table could have 1000 fields while the index could contain exactly those 4 fields).
The operation of selecting only some columns in SQL terminology is called PROJECTION.
A concrete case: let's say you build a file system on top of SQL. The fields are:
filename VARCHAR(100)
data BLOB
Now you want to read the list of the files. That's a simple SELECT filename FROM files in SQL. It would be useless to load the data for each file when you only need the filename. And remember that the data part could "weigh" megabytes, while the filename is at most 100 characters.
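In LINQ terms, assuming a hypothetical Files table mapped with FileName and Data properties, the equivalent projection is simply:
var fileNames = from f in db.Files
                select f.FileName;   // only the filename column travels over the wire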
After reading about how much "fun" using new with anonymous objects is, remember to read what @pleun has written, and remember that ORMs are like icebergs: 7/8 of their workings are hidden below the surface and ready to bite you back.
The answer given is fine, however I would like to add another aspect.
By using select new { }, you disconnect from the DataContext, and that makes you lose the change tracking mechanism of LINQ to SQL.
So for only displaying data it is fine, and it will lead to a performance increase.
BUT if you want to do updates, it is a no go.
In the select new, you're creating a new anonymous type with only the properties you need. The properties get their names and values from the matching Orders. This is helpful when you don't want to pull back all the properties from the source. Some may be large (think varchar(max), binary, or xml datatypes), and you might want to exclude those from the query.
If you were to select o, then you'd be selecting an Order with all its properties and behaviours.