I'm going through old projects at work, trying to make them faster, and I'm currently looking at some web APIs. One API is running particularly slowly, and the problem is in the data service it calls. Specifically, it's in a lambda that maps a stored procedure result to a domain model. A simplified version of the code:
public IEnumerable<DomainModelResult> GetData()
{
    return this.EntityFrameworkDB.GetDataSproc().ToList()
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
This is a simplified version, but after profiling it I found the major hang-up is in the lambda. I'm assuming this is because the EF context is still open and some goofy Entity Framework stuff is happening.
The problem is I'm relatively new to Entity Framework (I'm an intern) and pretty ignorant of its inner workings. Could someone explain why this is so slow? I feel it should be very fast: DomainModelResult is a POCO and only property setters are used in ToDomainModelResult.
Edit:
I thought ToList() would do that, but I started to doubt myself because I couldn't think of another explanation. All the ToDomainModelResult() stuff is extremely simple. Something like:
public static DomainModelResult ToDomainModelResult(this SprocResult source)
{
    return new DomainModelResult
    {
        FirstName = source.description,
        MiddleName = source._middlename,
        LastName = source.lastname,
        UserName = source.expr2,
        Address = source.uglyName
    };
}
It's just a bunch of simple property assignments; I think the model causing problems has 17 properties. The reason this is being done is that the project is old database-first code and the stored procedures have ugly names that aren't descriptive at all. It also makes swapping stored procedures in the data services easy without breaking the rest of the project.
Edit 2: For some reason, using ToArray and breaking apart the LINQ statements makes the assignment from procedure result to domain model result extremely fast. Now the whole data service method is faster, which is odd; I don't know where the rest of the time went.
This might be a more esoteric question than I originally thought. My question hasn't been answered, but the problem is no longer there. Thanks to all who replied. I'm keeping this as unanswered for now.
Edit 3: Please flag this question for removal; I can't remove it. I found the problem, but it is totally unrelated to my original question; I misunderstood the problem when I asked it. The increase in speed I'm chalking up to compiler optimization and to running the code under the profiler. The real issue wasn't in my lambda but in a dynamic lambda called by Entity Framework when the context is closed or an object is accessed: it was doing data validation, and GetString, GetInt32, and IsDBNull were eating up the most time. So I'm assuming Microsoft has already optimized those methods, and the only way to speed this up is possibly making some columns not nullable in the procedure. This question is misleading and so esoteric I don't think it belongs here and will just confuse people. Sorry.
You should split the code and check which part is taking the time:
public IEnumerable<DomainModelResult> GetData()
{
    var lst = this.EntityFrameworkDB.GetDataSproc().ToList();
    return lst
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
I am pretty sure the GetDataSproc call is taking most of your time; you need to optimize the stored procedure code.
Update
If possible, it is better to do more work on the SQL side instead of retrieving 60,000 rows into memory. A few possible solutions:
If you need to display this information, do paging (top and skip)
If you are doing any filtering, calculating, or grouping after you retrieve the rows into memory, do it in your stored proc instead
On the .NET side, since you are returning IEnumerable, you may be able to use yield for the second part, depending on your architecture (see the sketch below)
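For illustration, here is a minimal sketch of the yield idea combined with paging. GetDataSprocPaged and the page size are assumptions made up for this example, not part of the original code:
// Streams mapped results one page at a time instead of materializing all 60,000 rows at once.
// GetDataSprocPaged(skip, take) is a hypothetical paged version of the stored procedure call.
public IEnumerable<DomainModelResult> GetDataStreamed(int pageSize = 500)
{
    int skip = 0;
    while (true)
    {
        var page = this.EntityFrameworkDB.GetDataSprocPaged(skip, pageSize).ToList();
        if (page.Count == 0)
            yield break;

        foreach (var row in page)
            yield return row.ToDomainModelResult();

        skip += pageSize;
    }
}
Because of yield, each page is only fetched when the caller actually enumerates that far.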
Related
A programming pattern like this comes up every so often:
int staleCount = 0;
fileUpdatesGridView.DataSource = MultiMerger.TargetIds
    .Select(id =>
    {
        FileDatabaseMerger merger = MultiMerger.GetMerger(id);
        if (merger.TargetIsStale)
            staleCount++;
        return new
        {
            Id = id,
            IsStale = merger.TargetIsStale,
            // ...
        };
    })
    .ToList();
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = staleCount > 0;
I'm not sure there is a more succinct way to code this; is there?
Even if so, is it bad practice to do this?
No, it is not strictly "bad practice" (like constructing SQL queries with string concatenation of user input or using goto).
Sometimes such code is more readable than several queries/foreach loops or a side-effect-free Aggregate call. It is also a good idea to at least try to write the foreach and side-effect-free versions to see which one is more readable and easier to prove correct; one side-effect-free sketch follows the list below.
Please note that:
it is frequently very hard to reason about what will happen, and when, with such code. E.g. your sample hacks around the fact that LINQ queries execute lazily by calling .ToList(); otherwise staleCount would never be computed.
pure functions can be run in parallel; ones with side effects need a lot of care to do so
if you ever need to convert LINQ-to-Objects to LINQ-to-SQL, you will have to rewrite such queries
generally, LINQ queries favor a functional programming style without side effects (and hence, by convention, readers will not expect side effects in the code).
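For comparison, a side-effect-free version of the stale count using Aggregate might look like the sketch below; whether it reads better than the original is exactly the judgment call described above:
// Build the rows without mutating any outer state.
var rows = MultiMerger.TargetIds
    .Select(id =>
    {
        FileDatabaseMerger merger = MultiMerger.GetMerger(id);
        return new
        {
            Id = id,
            IsStale = merger.TargetIsStale,
            // ...
        };
    })
    .ToList();

// Fold over the list instead of incrementing a captured variable inside the lambda.
int staleCount = rows.Aggregate(0, (count, row) => row.IsStale ? count + 1 : count);

fileUpdatesGridView.DataSource = rows;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = staleCount > 0;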
Why not just code it like this:
var result = MultiMerger.TargetIds
    .Select(id =>
    {
        FileDatabaseMerger merger = MultiMerger.GetMerger(id);
        return new
        {
            Id = id,
            IsStale = merger.TargetIsStale,
            // ...
        };
    })
    .ToList();
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r => r.IsStale);
I would consider this a bad practice. You are assuming that the lambda expression is forced to execute because you called ToList; that's an implementation detail of the current version of ToList. What if ToList in .NET 7.x were changed to return an object that semi-lazily converts the IQueryable? What if it were changed to run the lambda in parallel? All of a sudden you would have concurrency issues on your staleCount. As far as I know, both of those are possibilities that would break your code because of the bad assumptions it makes.
Now, as far as repeatedly calling MultiMerger.GetMerger with a single id goes, that really should be reworked into a join, since the logic for doing a join would (or could) be much more efficient than what you have coded there and would scale a lot better, especially if the implementation of MultiMerger is actually pulling data from a database (or might be changed to do so).
As far as calling ToList() before passing it to the DataSource goes: if the DataSource doesn't use all the fields in your new object, it would be (much) faster and use less memory to skip the ToList and let the data source pull only the fields it needs. What you've done is tightly couple the data to the exact requirements of the view, which should be avoided where possible. For example, what if you suddenly need to display a field that exists in FileDatabaseMerger but isn't in your current anonymous object? Now you have to change both the controller and the view to add it, whereas if you had just passed in an IQueryable, you would only have to change the view. Again: faster, less memory, more flexible, and more maintainable.
Hope this helps. And this question really should be posted on Code Review, not Stack Overflow.
Update: on further review, the following code would be much better:
var result = MultiMerger.GetMergersByIds(MultiMerger.TargetIds);
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r => r.TargetIsStale);
or
var result = MultiMerger.GetMergers().Where(m => MultiMerger.TargetIds.Contains(m.Id));
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r => r.TargetIsStale);
I have two scenarios (examples below); both are perfectly legitimate methods of making a database request, but I'm not really sure which is best.
Example One - This is the method we generally use when building new applications.
private readonly IInterfaceName _repositoryInterface;

public ControllerName()
{
    _repositoryInterface = new Repository(Context);
}

public JsonResult MethodName(string someParameter)
{
    var data = _repositoryInterface.ReturnData(someParameter);
    return Json(data, JsonRequestBehavior.AllowGet);
}

protected override void Dispose(bool disposing)
{
    Context.Dispose();
    base.Dispose(disposing);
}

// In the repository:
public IEnumerable<ModelName> ReturnData(string filter)
{
    Expression<Func<ModelName, bool>> query = q => q.ParameterName.ToUpper().Contains(filter);
    return Get(query);
}
Example Two - I've recently started seeing this more frequently
using (SqlConnection connection = new SqlConnection(
    ConfigurationManager.ConnectionStrings["ConnectionName"].ToString()))
{
    connection.Open();
    var storedProcedureName = GetStoredProcedureName();
    using (SqlCommand command = new SqlCommand(storedProcedureName, connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.Add("@Start", SqlDbType.Int).Value = start;
        using (SqlDataReader reader = command.ExecuteReader())
        {
            // DATA IS READ AND PARSED
        }
    }
}
Both examples use Entity Framework in some form (the first more so than the second); there are Model and Mapping files for every table that could be interrogated. The main thing the second example does over the first (regarding EF) is utilise Migrations as part of the stored procedure code generation. In addition, both implement the Repository pattern, similar to the one in the second link below.
Code First - MSDN
Contoso University - Tutorial
My understanding of Example One is that the repository and context are instantiated when the controller is called. When the call to the repository is made, it returns the data but leaves the context intact until it is disposed of at the end of the method. Example Two, on the other hand, will call Dispose as soon as the database call is finished (unless the results are forced into memory, e.g. by using .ToList() on an IEnumerable). If my understanding is not correct, please correct me where appropriate.
So my main question is: what are the advantages and disadvantages of using one over the other? For example, is there a larger performance overhead in going with Example Two compared to Example One?
FYI: I've tried to search for an answer to this but have been unsuccessful, so if you are aware of a similar question please feel free to point me in that direction.
You seem to be making a comparison like this:
Is it better to build a house or to install plumbing in the bathroom?
You can have both. You could have a repository (house) that uses data connections (plumbing) so it's not an "OR" situation.
There is no reason why the call to ReturnData doesn't use a SqlCommand under the hood.
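For instance, a repository method can open and dispose its own connection internally on every call, something like the sketch below (the stored procedure name, parameter, and row mapping are assumptions for illustration, not taken from the question):
// A repository method that uses ADO.NET under the hood and holds the connection
// only for the duration of this single call.
public IEnumerable<ModelName> ReturnData(string filter)
{
    var results = new List<ModelName>();

    using (var connection = new SqlConnection(
        ConfigurationManager.ConnectionStrings["ConnectionName"].ToString()))
    using (var command = new SqlCommand("dbo.GetDataByFilter", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.Add("@Filter", SqlDbType.NVarChar, 100).Value = filter;

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // Map each row to the domain model; the column index is illustrative.
                results.Add(new ModelName { ParameterName = reader.GetString(0) });
            }
        }
    }

    return results;
}
The caller never sees the connection; it only exists for the lifetime of the call.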
Now, the really important difference worth considering is whether the repository holds a resource (memory, connection, pipe, file, etc.) open for its lifetime, or only per data call.
The advantage of a using block is that resources are only open for the duration of the call. This helps immensely with the scalability of the app.
On the other hand there's an overhead to opening connections, so it's better - particularly for single threaded apps - to open a connection, do several tasks, and then close it.
So it really boils down to what type of app you're writing as to which approach you use.
Your second example isn't using Entity Framework. It seems you may have two different approaches to data access here, although it is hard to tell from the repository snippet, as it quite rightly hides the data access implementation. The second example is correctly using a "using" statement, as you should on any object that implements IDisposable; it means you don't have to worry about calling Dispose. This is plain ADO.NET, which is what Entity Framework uses under the hood.
If the first example is using Entity Framework, you most likely have lazy loading in play, in which case you need the DbContext to remain alive until the query has been executed. Entity Framework is an ORM tool. It too uses ADO.NET under the hood to connect to the database, but it also offers you a lot more on top. A good book on both subjects should help you.
I found that learning ADO.NET first helps a lot in understanding how Entity Framework retrieves information from the database.
The "using" statement is good practice wherever you find an object that implements IDisposable. You can read more about that here: IDisposable the right way
In response to the change to the question: the answer still, on the whole, remains the same. In terms of performance, how fast are the queries returned? Does one perform better than the other? Only your current system and setup can tell you that. Both approaches seem to be doing things the correct way.
I haven't worked with Migrations, so I'm not sure why you are getting ADO.NET-style queries integrating with your EF models, but I wouldn't be surprised by this functionality. Entity Framework, as I have experienced it, creates the queries for you and then executes them using the ADO.NET objects from your second example. The key point is that you want the "using" blocks for the SqlConnection and SqlCommand objects (although I don't think you need to nest them; everything inside the outer "using" block will be disposed).
There is nothing stopping you putting a "using" block in your repository around the context, but when it comes to lazily loading the related entities you will get an error, as the context will have been disposed. If you need to make this change, you can include the relevant elements in your query and do away with the lazy loading approach (see the sketch below). There are performance gains in certain situations for doing this, but again you need to balance this against how your system is performing.
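A rough sketch of that idea, assuming an EF context class named Context with a ModelNames set and a RelatedItems navigation property (both names are made up for illustration, and the lambda Include overload needs using System.Data.Entity):
public IEnumerable<ModelName> ReturnData(string filter)
{
    using (var context = new Context())
    {
        return context.ModelNames
            .Include(m => m.RelatedItems)                              // eager load instead of lazy load
            .Where(m => m.ParameterName.ToUpper().Contains(filter))
            .ToList();                                                 // materialize before the context is disposed
    }
}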
This question already has answers here:
What is the "N+1 selects problem" in ORM (Object-Relational Mapping)?
(19 answers)
Closed 8 years ago.
I have never heard of it, but people are referring to an issue in an application as an "N+1 problem". They are doing a LINQ to SQL based project, and a performance problem has been identified by someone. I don't quite understand it, but hopefully someone can steer me.
It seems that they are trying to get a list of objects, and then the foreach after that is causing too many database hits.
From what I understand, the second part of the data is only being loaded in the foreach.
So, list of items loaded:
var program = programRepository.SingleOrDefault(r => r.ProgramDetailId == programDetailId);
And then later, we make use of this list:
foreach (var phase in program.Program_Phases)
{
    phase.Program_Stages.AddRange(stages.Where(s => s.PhaseId == phase.PhaseId));
    phase.Program_Stages.ForEach(s =>
    {
        s.Program_Modules.AddRange(modules.Where(m => m.StageId == s.StageId));
    });
    phase.Program_Modules.AddRange(modules.Where(m => m.PhaseId == phase.PhaseId));
}
It seems the problem identified is that they expected 'program' to contain its children. But when we refer to the child in the query, it reloads the program:
program.Program_Phases
They're expecting program to be fully loaded and in memory, and the profiler seems to indicate that the program table, with all the joins, is being queried on each iteration of the 'foreach'.
Does this make sense?
(EDIT: I found this link:
Does linq to sql automatically lazy load associated entities?
This might answer my question, but they're using that nicer (where person in ...) notation, as opposed to this strange (x => x...) one. So if this link is the answer, i.e. we need to 'join' in the query, can that be done?)
In ORM terminology, the 'N+1 select problem' typically occurs when you have an entity that has nested collection properties. It refers to the number of queries that are needed to completely load the entity data into memory when using lazy-loading. In general, the more queries, the more round-trips from client to server and the more work the server has to do to process the queries, and this can have a huge impact on performance.
There are various techniques for avoiding this problem. I am not familiar with LINQ to SQL, but NHibernate supports eager fetching, which helps in some cases. If you do not need to load the entire entity instance, you could also consider doing a projection query (see the sketch below). Another possibility is to change your entity model to avoid having nested collections.
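As an illustration of the projection idea in LINQ to SQL terms (a sketch only; context.Programs and the navigation properties are assumptions based loosely on the question's entity names), a single query can pull just the fields you need instead of lazy-loading each child collection:
// One round-trip that flattens the graph into just the ids that are needed,
// rather than loading Program and then triggering a query per child collection.
var stageIdsByPhase =
    (from program in context.Programs
     where program.ProgramDetailId == programDetailId
     from phase in program.Program_Phases
     from stage in phase.Program_Stages
     select new { phase.PhaseId, stage.StageId })
    .ToList();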
For performant LINQ, first work out exactly which properties you actually care about. The one advantage LINQ has performance-wise is that you can easily leave out retrieval of data you won't use (you can always hand-code something that does better than LINQ does, but LINQ makes it easy to do this without creating a library full of hundreds of classes for slight variants of what you leave out each time).
You say "list" a few times. Don't obtain lists if you don't need to; only do so if you'll reuse the same list more than once. Otherwise, working one item at a time will perform better 99% of the time.
The "nice" and "strange" syntaxes, as you put it, are different ways of saying the same thing. There are some things that can only be expressed with methods and lambdas, but the other form can always be expressed as such; indeed, it is after compilation. The likes of from b in someSource where b.id == 21 select b.name is compiled to someSource.Where(b => b.id == 21).Select(b => b.name) anyway.
You can set DataContext.LoadOptions to define exactly which related entities you want loaded with which object (and, best of all, set it differently in different uses quite easily); see the sketch below. This can solve your N+1 issue.
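A minimal sketch of the LoadOptions approach, using the entity names from the question (the exact entity class names and associations are assumptions about the model):
// Tell LINQ to SQL to fetch the child collections together with the parent, so
// program.Program_Phases and phase.Program_Stages do not trigger extra queries later.
var loadOptions = new System.Data.Linq.DataLoadOptions();
loadOptions.LoadWith<Program>(p => p.Program_Phases);
loadOptions.LoadWith<Program_Phase>(ph => ph.Program_Stages);

// LoadOptions must be assigned before the first query is executed on this DataContext.
context.LoadOptions = loadOptions;

var program = context.Programs.SingleOrDefault(r => r.ProgramDetailId == programDetailId);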
Alternatively, you might find it easier to update existing entities or insert new ones by setting the appropriate id yourself.
Even if you don't find it easier, it can be faster in some cases. Normally I'd say go with whatever is easier and forget the performance as far as this choice goes, but if performance is a concern here (and you say it is), then it is worthwhile profiling to see which of the two works better.
Anyone who has been using LINQ to SQL for any length of time will know that using a global static data context can present problems with synchronisation with the database (especially if it is being used by many concurrent users). For the sake of simplicity, I like to work with objects directly in memory, manipulate them there, then push a context.SubmitChanges() when the updates and inserts to that object and its linked counterparts are complete. I am aware this is not recommended, but it also has advantages. The problem here is that any attached linked objects are not refreshed by this, and it is also not possible to refresh the collection with context.Refresh(linqobject.linkedcollection), as this does not take into account newly added and removed objects.
My question is: have I missed something obvious? It seems to be madness that there is no simple way to refresh these collections without writing specific logic.
I would also like to offer a workaround which I have discovered, but I am interested to know if there are drawbacks to this approach (I have not profiled the output, and am concerned that it may be generating unintended insert and delete statements).
foreach (OObject O in Program.BaseObject.OObjects.OrderBy(o => o.ID))
{
    Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, O);
    Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, O.LinksTable);
    O.LinksTable.Assign(Program.DB.LinksTable.Where(q => q.OObject == O));
}
It also seems to be possible to do things like Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, Program.DB.OObjects);
however, this appears to return the entire table, which is often highly inefficient. Any ideas?
Can someone help me change the following to select unique Model values from the Product table?
var query = from Product in ObjectContext.Products.Where(p => p.BrandId == BrandId & p.ProdDelOn == null)
orderby Product.Model
select Product;
I'm guessing that you still want to filter based on your existing Where() clause. I think this should take care of it for you (and will include the ordering as well):
var query = ObjectContext.Products
.Where(p => p.BrandId == BrandId && p.ProdDelOn == null)
.Select(p => p.Model)
.Distinct()
.OrderBy(m => m);
But, depending on how you read the post, it could also be taken to mean you're trying to get a single unique Model out of the results (which is a different query):
var model = ObjectContext.Products
.Where(p => p.BrandId == BrandId && p.ProdDelOn == null)
.Select(p => p.Model)
.First();
Change the & to && and add the following line:
query = query.Distinct();
I'm afraid I can't answer the question - but I want to comment on it nonetheless.
IMHO, this is an excellent example of what's wrong with the direction the .NET Framework has been going in over the last few years. I cannot stand LINQ, nor do I feel too warmly about extension methods, anonymous methods, lambda expressions, and so on.
Here's why: I have yet to see a situation where any of these things actually contributes anything to solving real-world programming problems. LINQ is certainly no replacement for SQL, so you (or at least the project) still need to master that. Writing the LINQ statements is not any simpler than writing the SQL, but it does add run-time processing to build an expression tree and "compile" it into an SQL statement. Now, if you could solve complex problems more easily with LINQ than with SQL directly, or if it meant you didn't also need to know SQL, and if you could trust LINQ to produce good-enough SQL all the time, it might still have been worth using. But NONE of these preconditions are met, so I'm left wondering what the benefit is supposed to be.
Of course, in good old-fashioned SQL the statement would be
SELECT DISTINCT [Model]
FROM [Product]
WHERE [BrandID] = @brandID AND [ProdDelOn] IS NULL
ORDER BY [Model]
In many cases the statements can be easily generated with dev tools and encapsulated by stored procedures. This would perform better, but I'll grant that for many things the performance difference between LINQ and the more straightforward stored procs would be totally irrelevant. (On the other hand, performance problems do have a tendency to sneak in, as we devs often work with totally unrealistic amounts of data and in environments that have little in common with those hosting our software in real production systems.) But the advantages of just not using LINQ are HUGE:
1) Fewer skills required (since you must use SQL anyway)
2) All data access can be performed in one way (since you need SQL anyway)
3) Some control over HOW to get data and not just what
4) Less chance of being rightfully accused of writing bloatware (more efficient)
Similar things can be said about many of the new language features introduced since C# 2.0, though I do appreciate and use some of them. The "var" keyword with type inference is great for initializing locals: it's not much use stating the same type information twice on the same line. But let's not pretend this somehow helps one bit if you have a problem to solve. Same for anonymous types: nested private types served the same purpose with hardly any more code, and I've found NO use for this feature since trying it out when it was new and shiny. Extension methods ARE in fact just plain old utility methods, and I have yet to hear a good explanation of why one should use the SAME syntax for instance methods and for static methods invoked on another class! This actually means that adding a method to a class, with no build warnings or errors, can break an application. (In case you doubt it: if you had an extension method Bar() for your Foo type, Foo.Bar() invokes a completely different implementation, which may or may not do something similar to what your extension method Bar() did, the day you introduce an instance method with the same signature. It'll build and crash at runtime; a sketch of this follows below.)
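A small sketch of that shadowing scenario, using the hypothetical Foo and Bar names from the paragraph above:
using System;

public class Foo
{
    // Added later: this instance method now wins over the extension method below
    // for every existing foo.Bar() call, with no compiler warning or error.
    public void Bar() { Console.WriteLine("instance Bar"); }
}

public static class FooExtensions
{
    // The original extension method; callers wrote foo.Bar() expecting this one.
    public static void Bar(this Foo foo) { Console.WriteLine("extension Bar"); }
}

public static class Demo
{
    public static void Main()
    {
        var foo = new Foo();
        foo.Bar(); // resolves to the instance method once it exists
    }
}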
Sorry to rant like this; maybe there is a better place to post this than in response to a question. But I really think anyone starting out with LINQ is wasting their time, unless it's in preparation for an MS certification exam, which AFAIU is also something a bit removed from reality.