Query 3 separate data sets or 1 joined set? - c#

This question can actually be applied to any language.
It is similar to this one, but not quite the same.
I have a website application that will be displaying data from a database.
Three DB tables:
tblProfessor(Id,FirstName,LastName)
tblStudent(Id,FirstName,LastName)
tblProfessorStudent(Id,StudentId,ProfessorId)
So we have Students and Professors. Students can be taught by multiple professors and professors can teach multiple students.
Two ways of querying data:
1. Return a join of all three tables, in which case we transfer some duplicate data.
2. Return three sets, one for each table. I know multiple sets of data can be returned in one call from my web application. I'm not clear about the mechanics of that call, but I think it will be just one connection to the DB (in contrast to the similar question mentioned above).
The query in the first case:
select
ProfessorId = p.Id
,ProfessorFirstName = p.FirstName
,ProfessorLastName = p.LastName
,StudentId = s.Id
,StudentFirstName = s.FirstName
,StudentLastName = s.LastName
from tblProfessorStudent ps
inner join tblProfessor p
on p.id = ps.ProfessorId
inner join tblStudent s
on s.id = ps.StudentId
The duplication I am talking about is that every row repeats the first and last names of the student and the professor, once for each "student is taught by professor" pairing. This duplication increases the number of kilobytes that need to be transferred from the DB to the app.
The query in the second case will be as simple as this:
select <columns> from tblProfessor
select <columns> from tblStudent
select <columns> from tblProfessorStudent
How should I approach querying data for my app from the performance perspective?

From a pure performance perspective, there's nothing that beats the SQL Server's ability to join data sets in T-SQL. Especially when we are talking about large data sets.
Its sole purpose in life is to manage data and data sets, and it does that where the source of the data is.
Joining "over the wire" on the client introduces a great deal of network overhead and redundant data traffic, and there is little to no chance that a clever client-side algorithm can make up for it.
Of course, and as usual: YMMV, "it depends" is always applicable to my statement.

If you are concerned about performance, then you should not return all rows from your tables. Once the database grows, this will cause the application to slow down. You should filter your data to get only the rows you need to display to the user. You can also consider implementing paging, so that you don't display a lot of rows at once.

I think that what matters most in this case is how you are using the data. If you have the correct indexes implemented, SQL Server will join the tables just fine, don’t worry about it. I’m pretty sure it will be faster than running 3 selects. You said you are worried about duplicate data, but what sort of duplication? If you join the 3 tables you’ll have the real data, I mean, teachers that teach X students and students that are taught by X teachers. No duplications! So again, it depends on how you are using the result sets. Are you simply displaying a list of students and a list of teachers? In this case go with option 2. If you need to show Teacher A has the following students, then go with the join on option 1 because if you choose option 2, you will have to manipulate the ProfessorStudent data sets (which I assume has only IDs) to get the names from the other 2 datasets and this is too much trouble in my opinion.
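To give an idea of the "manipulation" that option 2 needs on the client, here is a minimal sketch. The in-memory dictionaries and the link array are hypothetical stand-ins for the three result sets; the names and values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the three result sets of option 2.
var professors = new Dictionary<int, string> { [1] = "Ada Lovelace", [2] = "Alan Turing" };
var students = new Dictionary<int, string> { [10] = "Grace Hopper", [11] = "Edsger Dijkstra" };
var links = new[]
{
    (ProfessorId: 1, StudentId: 10),
    (ProfessorId: 1, StudentId: 11),
    (ProfessorId: 2, StudentId: 10),
};

// Stitch the link rows back to names on the client: one dictionary lookup per ID.
var studentsByProfessor = links
    .GroupBy(l => l.ProfessorId)
    .ToDictionary(
        g => professors[g.Key],
        g => g.Select(l => students[l.StudentId]).ToList());

foreach (var (professor, taught) in studentsByProfessor)
    Console.WriteLine($"{professor}: {string.Join(", ", taught)}");
```

With the join from option 1 the names arrive already attached to each row, so only the grouping step would remain.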

Related

LINQ Grouping a List of Objects into Anonymous Type

I am having difficulty trying to use LINQ to query a sql database in such a way to group all objects (b) in one table associated with an object (a) in another table into an anonymous type with both (a) and a list of (b)s. Essentially, I have a database with a table of offers, and another table with histories of actions taken related to those offers. What I'd like to be able to do is group them in such a way that I have a list of an anonymous type that contains every offer, and a list of every action taken on that offer, so the signature would be:
List<'a>
where 'a is new { Offer offer, List<OfferHistories> offerHistories}
Here is what I tried initially, which obviously will not work
var query = (from offer in context.Offers
join offerHistory in context.OffersHistories on offer.TransactionId equals offerHistory.TransactionId
group offerHistory by offerHistory.TransactionId into offerHistories
select { offer, offerHistories.ToList() }).ToList();
Normally I wouldn't come to SE with this little information but I have tried many different ways and am at a loss for how to proceed.
Please try to avoid .ToList() calls; use them only when really necessary. I have an important question: do you really need all columns of OffersHistories? Grouping full objects is very expensive, so try grouping only the necessary columns instead. If you really need all offerHistories for each offer, then I suggest writing a sub-select (this also costs performance):
var query = (from offer in context.Offers
select new { offer, offerHistories = (from offerHistory in context.OffersHistories
where offerHistory.TransactionId == offer.TransactionId
select offerHistory) });
P.S.: It's a good idea to create indexes on foreign key columns and on columns used in where and group by clauses; they will make the query faster.
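Another shape worth trying for "each offer with a list of its histories" is a group join (join ... into). The sketch below runs on LINQ to Objects with hypothetical tuple rows in place of context.Offers and context.OffersHistories; how a particular provider translates it to SQL is not verified here.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory rows standing in for context.Offers and context.OffersHistories.
var offers = new[]
{
    (TransactionId: 1, Title: "Offer A"),
    (TransactionId: 2, Title: "Offer B"),
};
var histories = new[]
{
    (TransactionId: 1, Action: "Created"),
    (TransactionId: 1, Action: "Accepted"),
};

// "join ... into" is a group join: each offer is paired with all of its matching
// histories, and an offer without any history still appears, with an empty group.
var query =
    from offer in offers
    join history in histories
        on offer.TransactionId equals history.TransactionId into offerHistories
    select new { offer, offerHistories = offerHistories.ToList() };

foreach (var row in query)
    Console.WriteLine($"{row.offer.Title}: {row.offerHistories.Count} action(s)");
```

The ToList() inside the projection is fine in memory; against a database it is worth checking the generated SQL before relying on it.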

Linq query timing out, how to streamline query

Our front end UI has a filtering system that, in the back end, operates over millions of rows. It uses an IQueryable that is built up over the course of the logic, then executed all at once. Each individual UI component is ANDed together (for example, Dropdown1 and Dropdown2 will only return rows that have both of what is selected in common). This is not a problem. However, Dropdown3 has two types of data in it, and the checked items need to be ORed together, then ANDed with the rest of the query.
Due to the large amount of rows it is operating over, it keeps timing out. Since there are some additional joins that need to happen, it is somewhat tricky. Here is my code, with the table names replaced:
//The end list has driver ids in it--but the data comes from two different places. Build a list of all the driver ids.
driverIds = db.CarDriversManyToManyTable.Where(
cd =>
filter.CarIds.Contains(cd.CarId) //get driver IDs for each car ID listed in filter object
).Select(cd => cd.DriverId).Distinct().ToList();
driverIds = driverIds.Concat(
db.DriverShopManyToManyTable.Where(ds => filter.ShopIds.Contains(ds.ShopId)) //Get driver IDs for each Shop listed in filter object
.Select(ds => ds.DriverId)
.Distinct()).Distinct().ToList();
//Now we have a list solely of driver IDs
//The query operates over the Driver table. The query is built up like this for each item in the UI. Changing from Linq is not an option.
query = query.Where(d => driverIds.Contains(d.Id));
How can I streamline this query so that I don't have to retrieve thousands and thousands of IDs into memory, then feed them back into SQL?
There are several ways to produce a single SQL query. All of them require keeping the parts of the query as IQueryable<T>, i.e. not using ToList, ToArray, AsEnumerable, or similar methods that force them to be executed and evaluated in memory.
One way is to create a Union query containing the filtered IDs (which will be unique by definition) and use the join operator to apply it to the main query:
var driverIdFilter1 = db.CarDriversManyToManyTable
.Where(cd => filter.CarIds.Contains(cd.CarId))
.Select(cd => cd.DriverId);
var driverIdFilter2 = db.DriverShopManyToManyTable
.Where(ds => filter.ShopIds.Contains(ds.ShopId))
.Select(ds => ds.DriverId);
var driverIdFilter = driverIdFilter1.Union(driverIdFilter2);
query = query.Join(driverIdFilter, d => d.Id, id => id, (d, id) => d);
Another way could be using two OR-ed Any based conditions, which would translate to EXISTS(...) OR EXISTS(...) SQL query filter:
query = query.Where(d =>
db.CarDriversManyToManyTable.Any(cd => d.Id == cd.DriverId && filter.CarIds.Contains(cd.CarId))
||
db.DriverShopManyToManyTable.Any(ds => d.Id == ds.DriverId && filter.ShopIds.Contains(ds.ShopId))
);
You could try and see which one performs better.
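As a quick sanity check that the two shapes select the same drivers, here is a small sketch over in-memory collections. All data and names are hypothetical, and it says nothing about which version is faster on SQL Server; that still has to be measured against the real tables.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory data standing in for the two many-to-many tables.
var carDrivers = new[] { (CarId: 1, DriverId: 100), (CarId: 2, DriverId: 101), (CarId: 1, DriverId: 102) };
var driverShops = new[] { (ShopId: 7, DriverId: 102), (ShopId: 8, DriverId: 103) };
var drivers = new[] { 100, 101, 102, 103, 104 };

var carIds = new List<int> { 1 };
var shopIds = new List<int> { 8 };

// Approach 1: union of the two ID filters, then join with the drivers.
var idFilter = carDrivers.Where(cd => carIds.Contains(cd.CarId)).Select(cd => cd.DriverId)
    .Union(driverShops.Where(ds => shopIds.Contains(ds.ShopId)).Select(ds => ds.DriverId));
var viaUnion = drivers.Join(idFilter, d => d, id => id, (d, id) => d).OrderBy(d => d).ToList();

// Approach 2: two OR-ed Any conditions.
var viaAny = drivers.Where(d =>
        carDrivers.Any(cd => cd.DriverId == d && carIds.Contains(cd.CarId)) ||
        driverShops.Any(ds => ds.DriverId == d && shopIds.Contains(ds.ShopId)))
    .OrderBy(d => d).ToList();

Console.WriteLine(string.Join(", ", viaUnion));     // 100, 102, 103
Console.WriteLine(viaUnion.SequenceEqual(viaAny));  // True
```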
The answer to this question is complex and has many facets that, individually, may or may not help in your particular case.
First of all, consider using pagination: .Skip(PageNum * PageSize).Take(PageSize). I doubt your user needs to see millions of rows at once in the front end. Show them only 100, or whatever other smaller number seems reasonable to you.
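A minimal sketch of that paging expression, assuming a zero-based page number and using a hypothetical in-memory range in place of the real query. Against a database an explicit ordering matters: without one, the rows on a page are not well defined, and Entity Framework generally expects an OrderBy before Skip.

```csharp
using System;
using System.Linq;

// Hypothetical data: 250 row IDs standing in for the filtered query.
var rows = Enumerable.Range(1, 250);

const int pageSize = 100;

// pageNum is zero-based here, so page 0 skips nothing.
int[] GetPage(int pageNum) =>
    rows.OrderBy(id => id)          // a stable order is needed for paging to be meaningful
        .Skip(pageNum * pageSize)
        .Take(pageSize)
        .ToArray();

Console.WriteLine(GetPage(0).Length); // 100
Console.WriteLine(GetPage(2).Length); // 50
Console.WriteLine(GetPage(2)[0]);     // 201
```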
You've mentioned that you need to use joins to get the data you need. These joins can be done while forming your IQueryable (entity framework), rather than in-memory (linq to objects). Read up on join syntax in linq.
HOWEVER - performing explicit joins in LINQ is not the best practice, especially if you are designing the database yourself. If you are doing database first generation of your entities, consider placing foreign-key constraints on your tables. This will allow database-first entity generation to pick those up and provide you with Navigation Properties which will greatly simplify your code.
If you do not have any control or influence over the database design, however, then I recommend you construct your query in SQL first to see how it performs. Optimize it there until you get the desired performance, and then translate it into an entity framework linq query that uses explicit joins as a last resort.
To speed such queries up, you will likely need to index all of the "key" columns that you are joining on. The best way to figure out which indexes you need is to take the SQL query generated by your EF LINQ and bring it over to SQL Server Management Studio. From there, update the generated SQL to provide some predefined values for your @p parameters, just to make an example. Once you've done this, right click on the query and use either "Display Estimated Execution Plan" or "Include Actual Execution Plan". If indexing can improve your query performance, there is a pretty good chance that this feature will tell you about it and even provide you with scripts to create the indexes you need.
It looks to me that using the instance versions of the LINQ extensions is creating several collections before you're done. Using the from statement versions should cut that down quite a bit:
driverIds = (from record in db.CarDriversManyToManyTable
where filter.CarIds.Contains(record.CarId)
select record.DriverId).Concat(
from record in db.DriverShopManyToManyTable
where filter.ShopIds.Contains(record.ShopId)
select record.DriverId).Distinct();
Also, using the GroupBy extension would give better performance than querying each driver ID.

One DataAdapter, two different tables

I find it really hard to explain what is going on, I'll try my best.
We need to make a really simple browser with really simple Favorites + History capabilities, and to do that we need to import them both in a DataSet and use them from there.
The knowledge that I have should be enough, but I would really like to do it more efficiently and cleanly. In the database I have 3 tables, one for users, one for favorites and one for history; these are linked with FKs etc. I want a query that returns every Fav + History url that a user has saved. This is what I have now:
SELECT u_id, u_user, h_url, f_url FROM Users, Favorites, History WHERE u_id = h_id AND u_id = f_id AND u_id = 1
This isn't the result I'm looking for. I want to fill 2 comboboxes, one with favorites and one with history, both belonging to the person who is logged in at that moment.
I know this should work with a join, but inner and outer joins both give too few or too many results, and left and right joins don't seem to work for me either, but I can't explain why. :p I'm semi-new to joins btw.
It sounds like you actually want two simple queries, one for favorites and one for history. Both queries just need to have a where clause limiting the user for whom the results are returned.
SELECT f.url
FROM Favorites f
WHERE f.userID = 1
SELECT h.url
FROM History h
WHERE h.userID = 1
If you want to populate two comboboxes, I would suggest that you retrieve two separate lists, one for each combobox. Then you can bind each list separately to its combobox.
While DataSets may not be the best choice, I think they will work for you in this situation. Your objective will be to write two SQL statements, one for each list. Then, you put the query results into DataTables which are contained in one DataSet. A DataTable can be bound to a combobox.
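A rough sketch of the "two DataTables in one DataSet" layout. The rows are added by hand here and the URLs are placeholders; in the real application each table would be filled from its own query, for example through a data adapter.

```csharp
using System;
using System.Data;

// Hypothetical rows; in the real application each table would be filled
// from its own SELECT statement instead of being populated by hand.
var dataSet = new DataSet("Browser");

var favorites = dataSet.Tables.Add("Favorites");
favorites.Columns.Add("url", typeof(string));
favorites.Rows.Add("https://example.com/docs");

var history = dataSet.Tables.Add("History");
history.Columns.Add("url", typeof(string));
history.Rows.Add("https://example.com/");
history.Rows.Add("https://example.com/docs");

// Each table can now serve as the source for its own combobox.
foreach (DataTable table in dataSet.Tables)
    Console.WriteLine($"{table.TableName}: {table.Rows.Count} row(s)");
```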

Why use "select new " in LINQ

I am very new to LINQ to SQL, so please forgive me if it's a layman sort of question.
I see at many places that we use "select new" keyword in a query.
For e.g.
var orders = from o in db.Orders select new {
o.OrderID,
o.CustomerID,
o.EmployeeID,
o.ShippedDate
}
Why don't we just remove select new and just use "select o"
var orders = from o in db.Orders select o;
The difference I can see is performance in terms of speed, i.e. the second query will take more time to execute than the first one.
Are there any other "differences" or "better to use" concepts between them ?
With the new keyword they are building an anonymous object with only those four fields. Perhaps Orders has 1000 fields, and they only need 4 fields.
If you are doing it in LINQ-to-SQL or Entity Framework (or other similar ORMs) the SELECT it'll build and send to the SQL Server will only load those 4 fields (note that NHibernate doesn't exactly support projections at the db level. When you load an entity you have to load it completely). Less data transmitted on the network AND there is a small chance that this data is contained in an index (loading data from an index is normally faster than loading from the table, because the table could have 1000 fields while the index could contain EXACTLY those 4 fields).
The operation of selecting only some columns in SQL terminology is called PROJECTION.
A concrete case: let's say you build a file system on top of SQL. The fields are:
filename VARCHAR(100)
data BLOB
Now you want to read the list of the files. A simple SELECT filename FROM files in SQL. It would be useless to load the data for each file while you only need the filename. And remember that the data part could "weight" megabytes, while the filename part is up to 100 characters.
After reading how much "fun" using new with anonymous objects is, remember to read what @pleun has written, and remember: ORMs are like icebergs: 7/8 of their workings are hidden below the surface and ready to bite you back.
The answer given is fine, however I would like to add another aspect.
By using select new { }, you disconnect from the DataContext, and that makes you lose the change tracking mechanism of LINQ to SQL.
So for only displaying data, it is fine and will lead to performance increase.
BUT if you want to do updates, it is a no go.
In the select new, we're creating a new anonymous type with only the properties you need. They'll all get the property names and values from the matching Orders. This is helpful when you don't want to pull back all the properties from the source. Some may be large (think varchar(max), binary, or xml datatypes), and we might want to exclude those from our query.
If you were to select o, then you'd be selecting an Order with all its properties and behaviours.
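To make the projection point concrete, here is a small LINQ to Objects sketch. The tuple rows and the Notes column are hypothetical; against a real DataContext the same projection is what lets the generated SELECT list only the named columns.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory orders; Notes stands in for a large column we do not need.
var orders = new[]
{
    (OrderID: 1, CustomerID: "ALFKI", Notes: new string('x', 10_000)),
    (OrderID: 2, CustomerID: "BONAP", Notes: new string('y', 10_000)),
};

// Projection: the anonymous type carries only the two properties named here.
var slim = orders
    .Select(o => new { OrderID = o.OrderID, CustomerID = o.CustomerID })
    .ToList();

var propertyNames = slim[0].GetType().GetProperties().Select(p => p.Name).ToArray();
Console.WriteLine(string.Join(", ", propertyNames));

// Anonymous-type properties are read-only, so this line would not compile:
// slim[0].CustomerID = "OTHER";
```

The read-only properties are one visible sign of the trade-off mentioned above: a projected object is a detached copy for display, not an entity you can modify and save.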

How To Join Tables from Two Different Contexts with LINQ2SQL?

I have 2 data contexts in my application (different databases) and need to be able to query a table in context A with a right join on a table in context B. How do I go about doing this in LINQ2SQL?
Why?: We are using a SaaS product for tracking our time, projects, etc. and would like to send new service requests to this product to prevent our team from duplicating data entry.
Context A: This db stores service request information. It is a third party DB and we are not able to make changes to the structure of this DB as it could have unintended non-supportable consequences downstream.
Context B: This data stores the "log" data of service requests that have been processed. My team and I have full control over this DB's structure, etc. Unprocessed service requests should find their way into this DB and another process will identify it as not being processed and send the record to the SaaS product.
This is the query that I am looking to modify. I was able to do a !list.Contains(c.swHDCaseId) initially, but this cannot handle more than 2100 items. Is there a way to add a join to the other context?
var query = (from c in contextA.Cases
where monitoredInboxList.Contains(c.INBOXES.inboxName)
//right join d in contextB.CaseLog on d.ID = c.ID....
select new
{
//setup fields here...
});
You could try using a GetTable command. I think this loads all of contextB.TableB's data first, though I'm not 100% sure of that. I don't have an environment set up to play around in or test this out, so let me know if it works =)
from a in contextA.TableA
join b in contextB.GetTable<TableB>() on a.id equals b.id
select new { a, b }
Your best bet, outside of database solutions, is to join using LINQ (to objects) after execution.
I realize this isn't the solution you were hoping for. At least at this level, you won't have to worry about the IN list limitation (.Contains)
Edit:
outside of database solutions above really points to linked server solutions where you allow the table/view from context A to exist in the database from context B.
If you cannot extract the 2 tables into List objects and then join them, then you will probably have to do something on the database side. I would recommend creating a linked server and a view on the DB server you have control of. You can then do the join in the view, and you would have a very simple LINQ query to just retrieve the view. I am not sure how LINQ to SQL could ever do a join between 2 data contexts pointing to 2 different servers.
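For the in-memory route, a sketch of the "not yet logged" filter could look like the following. The case rows and logged IDs are hypothetical stand-ins for the two contexts; in the real application both queries would have to be materialized first (for example with ToList), and that is only practical if the row counts are manageable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins: cases from context A and already-logged case IDs from context B.
var cases = new[]
{
    (Id: 1, Subject: "Printer offline"),
    (Id: 2, Subject: "Password reset"),
    (Id: 3, Subject: "New laptop"),
};
var loggedCaseIds = new[] { 1, 3 };

// Materialize the ID column from context B once; HashSet lookups are O(1) on average.
var processed = new HashSet<int>(loggedCaseIds);

// The filter runs in memory, so no IN list is sent to SQL Server
// and its parameter limit does not come into play.
var unprocessed = cases.Where(c => !processed.Contains(c.Id)).ToList();

foreach (var c in unprocessed)
    Console.WriteLine($"{c.Id}: {c.Subject}"); // 2: Password reset
```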
