Fresh data vs. locally cached data in LINQ to SQL - c#

Okay, so I have a LINQ to SQL system set up on a WCF service. My application contains a reference to this service which it uses to collect data from an SQL database. I use a DataContext object which was generated by SQLMetal.exe.
I have two entity collections in my DataContext object, Clients and Groups. Each Client contains a field which says how many Groups it belongs to (a comma separated list of Group IDs).
In the application, I have a table of Clients. If I select one and click a button, a second table shows details of the Groups that Client is a part of.
Here's the question: When I click this button, do I go to the database for the Groups each time, or should I load the Groups when the application starts and sift through those? The latter would be quicker, but I want a solution that handles concurrent changes to the data.
The second question (I know there shouldn't be two really, but I just realised I might be confused on this issue): when I run a LINQ query on a collection in my DataContext object, am I getting the latest database data?
Thanks.

For your second question - yes, each query against LINQ to SQL results in a SQL statement issued to the backing database. To clarify further: this happens each time you attempt to enumerate a LINQ query. I don't mean to imply that every LINQ statement is sent to the database immediately, which of course it isn't.
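As a minimal sketch of that deferred execution (the context, table, and property names here are hypothetical, not from the question):

    using System;
    using System.Linq;

    using (var db = new MyDataContext())                  // hypothetical SQLMetal-generated context
    {
        // No SQL has been sent yet; this only builds the query expression.
        var query = db.Clients.Where(c => c.GroupIds.Contains("42"));

        // The SELECT is issued here, when the query is enumerated.
        foreach (var client in query)
        {
            Console.WriteLine(client.Name);
        }

        // Enumerating again (or calling Count) sends another statement to the database,
        // so you get the latest data each time.
        int total = query.Count();
    }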
The first part, on caching vs. querying each time, depends on other factors. Is it necessary? Meaning, is there a performance hit you're trying to correct? Also, how "stale" can the data afford to be before it becomes a concern to your users? Those are questions you'd need to take back to the application owners to decide.

As with most real world performance questions... it depends. The best way to tell is to write your application with what 'feels right' and then if performance is a concern, measure and change accordingly.
Yes, you'll be getting the latest database information.

Unless loading the groups takes a significant amount of time, don't cache.
Premature optimisation is never a good idea.
Just bear in mind that you might want to later: if you keep your collection of groups nicely decoupled, optimising later (if you have to) will be relatively simple.

Related

LinqToSQL Design Query / Worry

I wonder if somebody could point me in the right direction. I've recently started playing with LinqToSQL and love the strongly typed data objects etc.
I'm just struggling to understand the impact on database performance etc. For example, say I was developing a simple user profile page. The page shows basic information about the user, some information on their recent activity, and a list of unread notifications.
If I was developing a stored procedure for this page, I could create a single SP which returns multiple datatables covering all of the required information - resulting in a single db call.
However, using LinqToSQL, this could result in many calls - one for user info, at least one for activity, at least one for notifications; and if I then want further info on the notifications, this may result in yet more calls - multiple db calls.
Should I be worried about the number of db calls happening as a result of using this design pattern? I.e., are the multiple db handshakes etc. going to degrade my db?
I'd appreciate your thoughts on this!
Thanks
David
LINQ to SQL can consume multiple results from a stored proc if you need to go that route. Unfortunately the designer has problems mapping them correctly, so you will probably need to create your mapping manually. See http://www.thinqlinq.com/Default/Using-LINQ-to-SQL-to-return-Multiple-Results.aspx.
You can configure LINQ to SQL to eagerly load the child records if you know that you're going to need them for every parent record. Use the DataLoadOptions and .LoadWith to configure it.
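For instance, a rough sketch of eager loading (the entity and property names are assumptions, not from the question):

    using System.Data.Linq;
    using System.Linq;

    var db = new UserProfileDataContext();                 // hypothetical generated context
    var options = new DataLoadOptions();
    options.LoadWith<User>(u => u.Notifications);          // fetch notifications with each user
    options.LoadWith<User>(u => u.RecentActivity);         // and the recent activity records
    db.LoadOptions = options;

    // Child collections are loaded alongside the parent instead of lazily, one query at a time.
    var user = db.Users.Single(u => u.UserId == userId);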
You can also project an object graph with multiple child collections in the Select clause of a LINQ query to reduce the number of DB hits that you make.
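Something along these lines, as a sketch (the shape of the entities is assumed):

    var profile =
        (from u in db.Users
         where u.UserId == userId
         select new
         {
             u.UserName,
             u.JoinedOn,
             RecentActivity = u.Activities
                               .OrderByDescending(a => a.OccurredOn)
                               .Take(10),
             UnreadNotifications = u.Notifications.Where(n => !n.IsRead)
         }).Single();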
Ultimately, you need to check a number of options to determine which route is the best performance for your situation. It's not a one size fits all scenario.
Is it worse from a performance standpoint? Yes, it usually is. Multiple roundtrips are usually worse than a single one.
The real question is, do you mind? Is your application going to receive enough visits to warrant the added complexity of a stored procedure? Or do you value the simplicity of future modifications over raw performance?
In any case, if you need the performance, you can create a stored procedure and map it on your context. This will give you one single call but still return the data as objects.
Here is an article explaining a bit about that option:
linq-to-sql-returning-multiple-result-sets

Joins are for lazy people?

I recently had a discussion with another developer who claimed to me that JOINs (SQL) are useless. It's technically true that you can get by without them, but he added that using joins is less efficient than making several requests and linking the tables in the code (C# or Java).
For him, joins are for lazy people who don't care about performance. Is this true? Should we avoid using joins?
No, we should avoid developers who hold such incredibly wrong opinions.
In many cases, a database join is several orders of magnitude faster than anything done via the client, because it avoids DB roundtrips, and the DB can use indexes to perform the join.
Off the top of my head, I can't even imagine a single scenario where a correctly used join would be slower than the equivalent client-side operation.
Edit: There are some rare cases where custom client code can do things more efficiently than a straightforward DB join (see comment by meriton). But this is very much the exception.
It sounds to me like your colleague would do well with a NoSQL document database or key-value store. Those are themselves very good tools and a good fit for many problems.
However, a relational database is heavily optimised for working with sets. There are many, many ways of querying the data based on joins that are vastly more efficient than lots of round trips. This is where the versatility of an RDBMS comes from. You can achieve the same in a NoSQL store too, but you often end up building a separate structure suited to each different nature of query.
In short: I disagree. In an RDBMS, joins are fundamental. If you aren't using them, you aren't using it as an RDBMS.
Well, he is wrong in the general case.
Databases are able to optimize using a variety of methods, helped by optimizer hints, table indexes, foreign key relationships and possibly other database vendor specific information.
No, you shouldn't.
Databases are specifically designed to manipulate sets of data (obviously....). Therefore they are incredibly efficient at doing this. By doing what is essentially a manual join in his own code, he is attempting to take over the role of something specifically designed for the job. The chances of his code ever being as efficient as that in the database are very remote.
As an aside, without joins, what's the point in using a database? He may as well just use text files.
If "lazy" is defined as people who want to write less code, then I agree. If "lazy" is defined as people who want to have tools do what they are good at doing, I agree. So if he is merely agreeing with Larry Wall (regarding the attributes of good programmers), then I agree with him.
Ummm, joins are how relational databases relate tables to each other. I'm not sure what he's getting at.
How can making several calls to the database be more efficient than one call? Plus sql engines are optimized at doing this sort of thing.
Maybe your coworker is too lazy to learn SQL.
"This is technicaly true" - similarly, a SQL database is useless: what's the point in using one when you can get the same result by using a bunch of CSV files, and correlating them in code? Heck, any abstraction is for lazy people, let's go back to programming in machine code right on the hardware! ;)
Also, his assertion is untrue in all but the most convoluted cases: RDBMSs are heavily optimized to make JOINs fast. Relational database management systems, right?
Yes, You should.
And you should use C++ instead of C# because of performance. C# is for lazy people.
No, no, no. You should use C instead of C++ because of performance. C++ is for lazy people.
No, no, no. You should use assembly instead of C because of performance. C is for lazy people.
Yes, I am joking. You can make faster programs without joins, and you can make programs that use less memory without joins. BUT in many cases, your development time is more important than CPU time and memory. Give up a little performance and enjoy your life. Don't waste your time chasing tiny performance gains. And tell him, "Why don't you build a straight highway from your place to your office?"
The last company I worked for didn't use SQL joins either. Instead they moved this work to the application layer, which was designed to scale horizontally. The rationale for this design is to avoid work at the database layer; it is usually the database that becomes the bottleneck, and it's easier to replicate the application layer than the database. There could be other reasons, but this is the one I can recall now.
Yes, I agree that joins done at the application layer are inefficient compared to joins done by the database. There is more network communication as well.
Please note that I'm not taking a hard stand on avoiding SQL joins.
Without joins how are you going to relate order items to orders?
That is the entire point of a relational database management system.
Without joins there is no relational data and you might as well use text files to process data.
Sounds like he doesn't understand the concept, so he's trying to make it seem they are useless. He's the same type of person who thinks Excel is a database application.
Slap him silly and tell him to read more about databases. Making multiple connections and pulling data and merging the data via C# is the wrong way to do things.
I don't understand the logic of the statement "joins in SQL are useless".
Is it useful to filter and limit the data before working on it? As your other respondents have stated, this is what database engines do; it should be what they are good at.
Perhaps a lazy programmer would stick to technologies with which they were familiar and eschew other possibilities for non technical reasons.
I leave it to you to decide.
Let's consider an example: a table with invoice records, and a related table with invoice line item records. Consider the client pseudo code:
foreach (var invoice in invoices)
{
    var invoiceLines = FindLinesFor(invoice);   // one lookup per invoice
    // ...
}
If you have 100,000 invoices with 10 lines each, this code will look up 10 invoice lines from a table of 1 million, and it will do that 100,000 times. As the table size increases, the number of select operations increases, and the cost of each select operation increases.
Because computers are fast, you may not notice a performance difference between the two approaches if you have several thousand records or fewer. Because the cost increase is more than linear, as the number of records increases (into the millions, say), you'll begin to notice a difference, and the difference will become less tolerable as the size of the data set grows.
The join, however, will use the tables' indexes and merge the two data sets. This means that you're effectively scanning the second table once rather than randomly accessing it N times. If there's a foreign key defined, the database already has the links between the related records stored internally.
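As a rough illustration in LINQ to SQL (the table and column names are assumed), the set-based equivalent of the loop above is a single query:

    var linesByInvoice =
        from invoice in db.Invoices
        join line in db.InvoiceLines on invoice.InvoiceId equals line.InvoiceId
        select new { invoice.InvoiceNumber, line.Description, line.Amount };

    // One round trip; the database performs the join using its indexes.
    foreach (var row in linesByInvoice)
    {
        // ...
    }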
Imagine doing this yourself. You have an alphabetical list of students and a notebook with all the students' grade reports (one page per class). The notebook is sorted in order by the students' names, in the same order as the list. How would you prefer to proceed?
Read a name from the list.
Open the notebook.
Find the student's name.
Read the student's grades, turning pages until you reach the next student or the last page.
Close the notebook.
Repeat.
Or:
Open the notebook to the first page.
Read a name from the list.
Read any grades for that name from the notebook.
Repeat steps 2-3 until you get to the end
Close the notebook.
Sounds like a classic case of "I can write it better." In other words, he's seeing something that he sees as kind of a pain in the neck (writing a bunch of joins in SQL) and saying "I'm sure I can write that better and get better performance." You should ask him if he is a) smarter and b) more educated than the typical person that's knee deep in the Oracle or SQL Server optimization code. Odds are he isn't.
He is most certainly wrong. While there are definite pros to data manipulation within languages like C# or Java, joins are fastest in the database due to the nature of SQL itself.
The database keeps detailed statistics about the data and, if you have created your indexes correctly, can very quickly find one record in a couple of million. Besides, why would you want to drag all your data into C# to do a join when you can just do it right at the database level?
The pros for using C# come into play when you need to do something iteratively. If you need to do some function for each row, it's likely faster to do so within C#, otherwise, joining data is optimized in the DB.
I will say that I have run into a case where it was faster breaking the query down and doing the joins in code. That being said, it was only with one particular version of MySQL that I had to do that. Everything else, the database is probably going to be faster (note that you may have to optimize the queries, but it will still be faster).
I suspect he has a limited view of what databases should be used for. One approach to maximise performance is to read the entire database into memory. In that situation you may get better performance, and you may want to perform the joins in memory for efficiency. However, that is not really using the database as a database, IMHO.
No. Not only are joins better optimized in database code than ad-hoc C#/Java, but usually several filtering techniques can be applied as well, which yields even better performance.
He is wrong; joins are what competent programmers use. There may be a few limited cases where his proposed method is more efficient (and in those I would probably be using a document database), but I can't see it if you have any decent amount of data. For example, take this query:
select t1.field1
from table1 t1
join table2 t2
on t1.id = t2.id
where t1.field2 = 'test'
Assume you have 10 million records in table1 and 1 million records in table2. Assume 9 million of the records in table1 meet the where clause. Assume only 15 of them are in table2 as well. You can run this SQL statement which, if properly indexed, will take milliseconds and return 15 records across the network with only one column of data. Or you can send 10 million records with two columns of data and separately send another 1 million records with one column of data across the network and combine them on the web server.
Or of course you could keep the entire contents of the database on the web server at all times which is just plain silly if you have more than a trivial amount of data and data that is continually changing. If you don't need the qualities of a relational database then don't use one. But if you do, then use it correctly.
I've heard this argument quite often during my career as a software developer. Almost every time it has been stated, the guy making the claim didn't have much knowledge about relational database systems, the way they work, and the way such systems should be used.
Yes, when used incorrectly, joins seem to be useless or even dangerous. But when used in the correct way, there is a lot of potential for database implementation to perform optimizations and to "help" the developer retrieving the correct result most efficiently.
Don't forget that by using a JOIN you tell the database how you expect the pieces of data to relate to each other, giving it more information about what you are trying to do and therefore making it better able to fit your needs.
So the answer is definitely: No, JOINs aren't useless at all!
This is "technically true" only in one case which is not used often in applications (when all the rows of all the tables in the join(s) are returned by the query). In most queries only a fraction of the rows of each table is returned. The database engine often uses indexes to eliminate the unwanted rows, sometimes even without reading the actual row as it can use the values stored in indexes. The database engine is itself written in C, C++, etc. and is at least as efficient as code written by a developer.
Unless I've seriously misunderstood, the logic in the question is very flawed.
If there are 20 rows in B for each A, 1000 rows in A implies 20k rows in B.
There can't be just 100 rows in B unless there is a many-to-many table "AB" with 20k rows containing the mapping.
So to get all the information about which 20 of the 100 B rows map to each A row, you need table AB too. So this would be either:
3 result sets of 100, 1000, and 20k rows and a client JOIN
a single JOINed A-AB-B result set with 20k rows
So "JOIN" in the client does add any value when you examine the data. Not that it isn't a bad idea. If I was retrieving one object from the database than maybe it makes more sense to break it down into separate results sets. For a report type call, I'd flatten it out into one almost always.
In any case, I'd say there is almost no use for a cross join of this magnitude. It's a poor example.
You have to JOIN somewhere, and that's what RDBMS are good at. I'd not like to work with any client code monkey who thinks they can do better.
Afterthought:
To join in the client requires persistent objects such as DataTables (in .NET). If you have one flattened result set, it can be consumed via something lighter like a DataReader. High volume = a lot of client resources used to avoid a database JOIN.
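For illustration, a minimal sketch (the connection string, table and column names are assumptions) of consuming one flattened, already-joined result set with a forward-only reader:

    using System.Data.SqlClient;

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT i.InvoiceNumber, l.Description, l.Amount " +
        "FROM Invoices i JOIN InvoiceLines l ON l.InvoiceId = i.InvoiceId", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                string invoiceNumber = reader.GetString(0);
                string description = reader.GetString(1);
                decimal amount = reader.GetDecimal(2);
                // consume the row without materialising a DataTable
            }
        }
    }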

database, requests, performance, cache

I need some input on how to design a database layer.
In my application I have a List of T. The information in T comes from multiple database tables.
There are of course multiple ways to do this.
Two ways that I think of is :
chatty database layer and cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...) {
    list.Add(new SomeX() {
        prop1 = dataRow["someId1"],
        prop2 = GetSomeValueFromCacheOrDb(dataRow["someId2"])
    });
}
The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that.
Another problem is that the users could have been deleted after we got the list from the database but before we try to get them from the cache/db, which means we will have null problems, which we have to handle manually.
The good thing is that it's highly cacheable.
non chatty but not cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...) {
    list.Add(new SomeX() {
        prop1 = dataRow["someId1"],
        prop2 = dataRow["someValue"]
    });
}
The problem that I see with the above is that it's hard to cache, since potentially all users have unique lists. The other problem is that it will involve a lot of joins, which could result in a lot of reads against the database.
The good thing is that we know for sure that all information exists after the query is run (inner join etc)
not so chatty, but still cacheable
A third option could be to first loop through the data rows, and collect all necessary someId2 and then make one more database request to get all the SomeId2 values.
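As a hedged sketch of that third option (the db.SomeLookup table, the column types, and the casts are assumptions about the schema):

    using System.Collections.Generic;
    using System.Data;
    using System.Linq;

    // Collect the distinct ids from the rows we already have.
    List<int> ids = dataRows.Cast<DataRow>()
                            .Select(r => (int)r["someId2"])
                            .Distinct()
                            .ToList();

    // One extra round trip for all referenced values; LINQ to SQL translates Contains into an IN clause.
    var values = db.SomeLookup
                   .Where(x => ids.Contains(x.Id))
                   .ToDictionary(x => x.Id, x => x.Value);

    var list = new List<SomeX>();
    foreach (DataRow dataRow in dataRows)
    {
        list.Add(new SomeX {
            prop1 = dataRow["someId1"],
            prop2 = values[(int)dataRow["someId2"]]
        });
    }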
"The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that."
True. Could also create unnecessary contention and consume server resources maintaining locks as you iterate over a query.
"Another problem is that the users could have been deleted after we got the list from the database but before we are trying to get it from cache/db, which means that we will have null-problems."
If I take that quote, then this quote:
"The good thing is that it's highly cacheable."
Is not true, because you've cached stale data. So strike off the only advantage so far.
But to directly answer your question, the most efficient design, which seems to be what you are asking, is to use the database for what it is good for, enforcing ACID compliance and various constraints, most notably pk's and fk's, but also for returning aggregated answers to cut down on round trips and wasted cycles on the app side.
This means you either put SQL into your app code, which has been ruled to be Infinite Bad Taste by the Code Thought Police, or go to sprocs. Either one works. Putting the code into the App makes it more maintainable, but you'll never be invited to any more elegant OOP parties.
Some suggestions:
SQL is a set-based language, so don't design things around iterating over loops. Even with stored procedures, I still see cursors now and then when a set-based query would solve the issue. So, always try to get the information with one query. Sometimes this isn't possible, but in the majority of cases it will be. You can also design views to make your querying easier if you have a schema with many tables, so the needed information can be pulled with one statement.
Use proxies. Let's say I have an object with 50 properties. At first you display a list of objects to the user. In this case, I would create a proxy of the most important properties and display that to the user - maybe two or three important ones like name, ID, etc. This cuts down on the amount of information sent initially. When the user actually wants to edit or change the object, then make a second query to get the "full" object. Only get what you need. This is especially important over the web, when serializing XML between the layers.
Come up with a paging strategy. Most systems work fine until they get a lot of data, and then the query comes to a halt because it is returning thousands of rows/records. Page early and often. If you are doing a web application, paging directly in the database will probably be the most performant because only the paged data is sent between the layers.
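A minimal paging sketch with LINQ to SQL (the page size, entity, and ordering are assumptions); only the requested page crosses the wire:

    int pageSize = 50;
    int pageIndex = 2;                                     // zero-based page number

    var page = db.Orders
                 .OrderByDescending(o => o.CreatedOn)      // paging needs a stable order
                 .Skip(pageIndex * pageSize)
                 .Take(pageSize)
                 .ToList();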
Data caching depends on the data. For highly volatile data (changing all the time) caching isn't worth it. But for semi-volatile or non-volatile data, caching can be worth it, though you have to manage the cache either directly or indirectly if you are using a built-in framework.
A good place to use a cache is, say, a zip codes table. Certainly those don't change that often, and you could cache them to boost performance if you had a zip code drop-down in your application. This is just an example, but caching IMO depends on the type of data.
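As a rough sketch of that zip code example in an ASP.NET application (the cache key, expiration, and the LookupDataContext/ZipCode types are assumptions):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.Caching;

    public static List<ZipCode> GetZipCodes()
    {
        var cached = HttpRuntime.Cache["ZipCodes"] as List<ZipCode>;
        if (cached != null)
            return cached;                                  // cache hit, no database trip

        using (var db = new LookupDataContext())            // hypothetical generated context
        {
            var zipCodes = db.ZipCodes.ToList();
            HttpRuntime.Cache.Insert(
                "ZipCodes", zipCodes, null,
                DateTime.UtcNow.AddHours(12),               // semi-volatile: refresh twice a day
                Cache.NoSlidingExpiration);
            return zipCodes;
        }
    }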

Can I improve performance by refactoring SQL commands into C# classes?

Currently, my entire website does updating from SQL parameterized queries. It works, we've had no problems with it, but it can occasionally be very slow.
I was wondering if it makes sense to refactor some of these SQL commands into classes so that we would not have to hit the database so often. I understand hitting the database is generally the slowest part of any web application. For example, say we have a class structure like this:
Project (comprised of) Tasks (comprised of) Assignments
Where Project, Task, and Assignment are classes.
At certain points in the site you are only working on one project at a time, and so creating a Project class and passing it among pages (using Session, Profile, something else) might make sense. I imagine this class would have a Save() method to save value changes.
Does it make sense to invest the time into doing this? Under what conditions might it be worth it?
If your site is slow, you need to figure out what the bottleneck is before you randomly start optimizing things.
Caching is certainly a good idea, but you shouldn't assume that this will solve the problem.
Caching is almost always underutilized in ASP .NET applications. Any time you hit your database, you should look for ways to cache the results.
Serializing objects into the session can be costly in itself, but most likely faster than just hitting the database every single time. You are benefiting now from execution plan caching in SQL Server, so it's very likely that you're getting optimal performance out of your stored procedures.
One option you might consider to increase performance is to abstract your data into objects via LINQ to SQL (against your sprocs) and then use AppFabric to cache the objects.
http://msdn.microsoft.com/en-us/windowsserver/ee695849.aspx
As for your updates, you should do those directly against the sprocs, but you will also need to clear out the cache in AppFabric for objects that are affected by the Insert/Update/Delete.
You could also do the same thing simply using the standard Cache as well, but AppFabric has some added benefits.
Use the SQL Profiler to identify your slowest queries, and see if you can improve them with some simple index changes (removing unused indexes, adding missing indexes).
You could very easily improve your application performance by an order of magnitude without changing your front-end app at all.
See http://sqlserverpedia.com/wiki/Find_Missing_Indexes
If you have lookup data only, you can store it in the Cache object. This will avoid the hits to the DB. Only data that can be used globally should be stored in the Cache.
If this data requires filtering, you can retrieve it from the Cache and filter it before rendering.
Session can be used to store user-specific data, but take care: too many session variables can easily cause performance problems.
This book may be helpful.
http://www.amazon.com/Ultra-Fast-ASP-NET-Build-Ultra-Scalable-Server/dp/1430223839

Storing search result for paging and sorting

I've been implementing MS Search Server 2010 and so far it's really good. I'm doing the search queries via their web service, but due to the inconsistent results, I'm thinking about caching the results instead.
The site is a small intranet (500 employees), so it shouldn't be any problem, but I'm curious what approach you would take if it was a bigger site.
I've googled a bit, but haven't really come across anything specific. So, a few questions:
What other approaches are there? And why are they better?
How much does it cost to store a dataview of 400-500 rows? What sizes are feasible?
Other points you should take into consideration.
Any input is welcome :)
You need to employ many techniques to pull this off successfully.
First, you need some sort of persistence layer. If you are using a plain old website, then the user's session would be the most logical layer to use. If you are using web services (meaning session-less) and just making calls through a client, well then you still need some sort of application layer (sort of a shared session) for your services. Why? This layer will be home to your database result cache.
Second, you need a way of caching your results in whatever container you are using (session or the application layer of web services). You can do this a couple of ways... If the query is something that any user can do, then a simple hash of the query will work, and you can share this stored result among other users. You probably still want some sort of GUID for the result, so that you can pass this around in your client application, but having a hash lookup from the queries to the results will be useful. If these queries are unique then you can just use the unique GUID for the query result and pass this along to the client application. This is so you can perform your caching functionality...
The caching mechanism can incorporate some sort of fixed length buffer or queue... so that old results will automatically get cleaned out/removed as new ones are added. Then, if a query comes in that is a cache miss, it will get executed normally and added to the cache.
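As an illustrative sketch only (SearchHit and the class shape are made up for the example; this isn't part of Search Server's API), a bounded cache keyed by a hash of the query text could look like this:

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    public class SearchResultCache
    {
        // Locking is omitted for brevity; a shared cache would need it.
        private readonly int capacity;
        private readonly Dictionary<string, List<SearchHit>> results = new Dictionary<string, List<SearchHit>>();
        private readonly Queue<string> order = new Queue<string>();     // oldest key at the front

        public SearchResultCache(int capacity)
        {
            this.capacity = capacity;
        }

        public static string KeyFor(string query)
        {
            using (var sha = SHA1.Create())
                return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(query)));
        }

        public bool TryGet(string query, out List<SearchHit> hits)
        {
            return results.TryGetValue(KeyFor(query), out hits);
        }

        public void Add(string query, List<SearchHit> hits)
        {
            string key = KeyFor(query);
            if (results.ContainsKey(key))
                return;
            if (order.Count >= capacity)
                results.Remove(order.Dequeue());                        // evict the oldest entry
            order.Enqueue(key);
            results[key] = hits;
        }
    }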
Third, you are going to want some way to page your result object... the Iterator pattern works well here, though probably something simpler might work... like fetch X amount of results starting at point Y. However the Iterator pattern would be better as you could then remove your caching mechanism later and page directly from the database if you so desired.
Fourth, you need some sort of pre-fetch mechanism (as others suggested). You should launch a thread that will do the full search, and in your main thread just do a quick search with the top X number of items. Hopefully by the time the user tries paging, the second thread will be finished and your full result will now be in the cache. If the result isn't ready, you can just incorporate some simple loading screen logic.
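Sketching the pre-fetch idea on top of that cache (SearchQuick/SearchFull are assumed helpers wrapping the search web service; Task.Run needs .NET 4.5, otherwise ThreadPool.QueueUserWorkItem does the same job):

    using System.Collections.Generic;
    using System.Threading.Tasks;

    public List<SearchHit> Search(string query, SearchResultCache cache, int firstPageSize)
    {
        List<SearchHit> cached;
        if (cache.TryGet(query, out cached))
            return cached;                          // full result already cached

        // Kick off the slow, complete search in the background.
        Task.Run(() =>
        {
            List<SearchHit> full = SearchFull(query);
            cache.Add(query, full);
        });

        // Return a quick top-N result so the UI has something to show immediately.
        return SearchQuick(query, firstPageSize);
    }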
This should get you some of the way... let me know if you want clarification/more details about any particular part.
I'll leave you with some more tips...
You don't want to be sending the entire result to the client app (if you are using Ajax or something like an iPhone app). Why? Well, because that is a huge waste. The user likely isn't going to page through all of the results... now you've just sent 2MB of result fields for nothing.
Javascript is an awesome language but remember it is still a client-side scripting language... you don't want to slow the user experience down too much by sending massive amounts of data for your Ajax client to handle. Just send the prefetched result to your client, and additional page results as the user pages.
Abstraction, abstraction, abstraction... you want to abstract away the cache, the querying, the paging, the prefetching... as much of it as you can. Why? Well, let's say you want to switch databases, or you want to page directly from the database instead of using a result object in the cache... if you do it right, this is much easier to change later on. Also, if you're using web services, many other applications can make use of this logic later on.
Now, I probably suggested an over-engineered solution for what you need :). But, if you can pull this off using all the right techniques, you will learn a ton and have a very good base in case you want to extend functionality or reuse this code.
Let me know if you have questions.
It sounds like the slow part of the search is the full-text searching, not the result retrieval. How about caching the resulting resource record IDs? Also, since it might be true that search queries are often duplicated, store a hash of the search query, the query, and the matching resources. Then you can retrieve the next page of results by ID. Works with AJAX too.
Since it's an intranet and you may control the searched resources, you could even pre-compute a new or updated resource's match to popular queries during idle time.
I have to admit that I am not terribly familiar with MS Search Server so this may not apply. I have often had situations where an application had to search through hundreds of millions of records for result sets that needed to be sorted, paginated and sub-searched in a SQL Server though. Generally what I do is take a two step approach. First I grab the first "x" results which need to be displayed and send them to the browser for a quick display. Second, on another thread, I finish the full query and move the results to a temp table where they can be stored and retrieved quicker. Any given query may have thousands or tens of thousands of results but in comparison to the hundreds of millions or even billions of total records, this smaller subset can be manipulated very easily from the temp table. It also puts less stress on the other tables as queries happen. If the user needs a second page of records, or needs to sort them, or just wants a subset of the original query, this is all pulled from the temp table.
Logic then needs to be put into place to check for outdated temp tables and remove them. This is simple enough and I let SQL Server handle that functionality. Finally, logic has to be put in place for when the original query changes (significant parameter changes) so that a new data set can be pulled and placed into a new temp table for further querying. All of this is relatively simple.
Users are so used to split-second return times from places like Google, and this model gives me enough flexibility to actually achieve that without needing the specialized software and hardware that they use.
Hope this helps a little.
Tim's answer is a great way to handle things if you have the ability to run the initial query in a second thread and the logic (paging/sorting/filtering) to be applied to the results requires action on the server... otherwise...
If you can use AJAX, a 500 row result set could be called into the page and paged or sorted on the client. This can lead to some really interesting features .... check out the datagrid solutions from jQueryUI and Dojo for inspiration!
And for really intensive features like arbitrary regex filters and drag-and-drop column re-ordering you can totally free the server.
Loading the data to the browser all at once also lets you call in supporting data (page previews etc) as the user "requests" them ....
The main issue is limiting the data you return per result to what you'll actually use for your sorts and filters.
The possibilities are endless :)
