I'm working on a movie database app using Entity Framework 4 (database first) and it is taking 30 seconds to load about 8,200 rows into a List. There are three tables involved, and when I use .Include(), performance degrades even more -- almost three minutes to load 8,200 rows. This is painful. Given I'm learning a lot of technologies at once, I'm hoping there is a simple fix. Here are the details:
Table 1 - Videos
This is a large table with 31 columns and about 7,800 rows of videos. It uses a Guid as its primary key.
Table 2 - ActorsVideos (a junction table)
This table has two columns: (1) a VideoID column, and (2) an ActorID column. Both columns are Guids and are foreign keys into the Video and Actor tables, respectively. This table uses a composite primary key where both columns together act as the primary key. EF4 does not model this table; however, it creates a navigation property. This table allows the user to assign any number of actors to a movie.
Table 3 - Actors
Has 16 columns and about 400 rows. Again, the primary key is a Guid.
In the code, I'm reading about 10 columns in the Videos table and then I read columns from the associated Actors table.
The C# code looks something like this:
var videos = context.Videos;
foreach (var video in videos)
{
// retrieve 10 or so properties from 'video'
if (video.Actors.Count > 0)
{
foreach (var actor in video.Actors)
{
// retrieve some properties on the actor
}
}
}
I've tried adding .Include("Actors") after context.Videos and, as stated above, performance went from terrible to horrific.
I've looked at the SQL that is generated with the Include and it's about 2K of text given the number of columns that are in the video table.
Do I have to split the video table using a Master/Detail pattern? My next step is to cache the actors table and avoid the navigation/association property altogether. Any other suggestions to make this faster? It should run in less than 5-6 seconds in my opinion.
EDIT: The database is SQL Server CE 3.5.
You're asking Entity Framework to load a video along with all of its actors, then you're doing the filtering in application code. Generally, you're pulling way more data than you need to. I would have SQL Server (or whatever DB you're using) pre-filter for you:
var videos = context.Videos;
var results = from video in videos
where video.Actors.Count > 10
select new
{
video.VideoID,
video.Actors
};
foreach (var result in results)
{
foreach (var actor in result.Actors)
{
// do stuff
}
}
Loading ~8,200 rows, along with their associated rows in the Actors table, should be extremely fast. I did some development at my job where I had to deal with a 70+ million row test data table with a 5-table join. This ran in something like half a minute.
But, the reason why it ran that much faster than what you're doing is because I was filtering inside of SQL Server. The equivalent "procedural" program using EF took several minutes because I was doing filtering AFTER I pulled the rows from the database.
Think of it this way: you're not only asking for every row in your database, you're also pulling in data that you don't even need, multiple times.
Try using eager loading:
var videos = context.Videos.Include(v=>v.Actors);
foreach (var video in videos)
{
foreach (var actor in video.Actors)
{
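// actor's data is already in memory thanks to the Include; no lazy-load query fires here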
}
}
NOTE: Beware that queries use deferred execution, which means that iterating multiple times executes the query multiple times as well. If you will be iterating more than once, materialize the results with .ToList() (or .ToArray()) and assign them to a local variable prior to iterating.
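For example, a minimal sketch using the same query as above:
// Executes the query a single time; iterating videoList afterwards stays in memory
var videoList = context.Videos.Include(v => v.Actors).ToList();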
Also, profiling the database to see which queries are being executed will help you determine what else you need to eagerly fetch. You may be inadvertently forcing EF to load entities you don't need.
If that's the case, then your query should only project (by using Select(x=> new {...stuff you need... })) the necessary data.
EDIT: According to Microsoft, Include() can't be combined with projections, so in that case you'd need to write the query differently.
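For instance, a rough sketch of such a projection, pulling only the columns the loops actually use (Title and Name here are placeholder property names, not ones from your model):
var slimVideos = context.Videos
    .Select(v => new
    {
        v.VideoID,
        v.Title,                                                  // placeholder for the ~10 video columns you read
        Actors = v.Actors.Select(a => new { a.ActorID, a.Name })  // placeholder actor columns
    })
    .ToList();                                                    // one SQL query, no Include needed

foreach (var video in slimVideos)
{
    foreach (var actor in video.Actors)
    {
        // work with actor.ActorID / actor.Name
    }
}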
This past week I was tasked with moving a PHP based database to a new SQL database. There are a handful of requirements, but one of those was using ASP.Net MVC to connect to the SQL database...and I have never used ASP.Net or MVC.
I have successfully moved the database to SQL and have the foundation of the ASP site set up (after spending many hours poring over tutorials). The issue I am having now is that one of the pages is meant to display a handful of fields (User_Name, Work_Date, Work_Description, Work_Location, etc.), but the only way of grabbing all of those fields is by combining two of the tables. Furthermore, I am required to allow the user to search the combined table for any matching rows within a user-inputted date range.
I have tried having a basic table set up that displays the correct fields and have implemented a search bar...but that only allows me to search by a single date, not a range. I have also tried to use GridView with its Query Builder feature to grab the data fields I needed (which worked really well), but I can't figure out how to attach textboxes/buttons to the newly made GridView. Using a single table with GridView works perfectly and using textboxes/buttons is very intuitive. I just can't seem to make the same connection with a joined view.
So I suppose my question is this: what is the best way for me to combine these two tables while also still having the ability to perform searches on the displayed data? If I could build this database from scratch I would have just made a table with the relevant data attached to it, but because this is derived from a previously made database it has 12+ years of information that I need to dump into it.
Any help would be greatly appreciated. I am kind of dead in the water here. My inexperience with these systems is getting the better of me. I could post the code that I have, but I am mainly interested in my options and then I can do the research on my own.
Thanks!
It's difficult to offer definitive answers to your questions due to the need for guesswork.
But here are some hints.
You can say WHERE datestamp >= '2017-01-01' AND datestamp < '2018-01-01' to filter all the rows in calendar year 2017. Many variations on this sort of date range filter are available.
Your first table probably has some kind of ID number on each row. Let's call it first.first_id. Your second table probably has its own id, let's call it second.second_id. And, it probably has another id that identifies a row in your first table, let's call it second.first_id. That second.first_id is called a foreign key in the second table to the first table. There can be any number of rows in your second table corresponding to your first table via this foreign key.
If this is the case you can do something like this:
SELECT first.datestamp, first.val1, first.val2, second.val1, second.val2
FROM first
JOIN second ON first.first_id = second.first_id
WHERE first.datestamp >= '2018-06-01' AND first.datestamp < '2018-07-01'
AND (first.val1 = 'some search term' OR second.val1 = 'some search term')
ORDER BY first.datestamp
This makes a virtual table by joining together your two physical tables (FROM...JOIN...).
Then it filters the rows you want from that virtual table (WHERE ...).
Then it puts them in the order you want (ORDER BY...).
Finally, it chooses the columns from the virtual table you want in your result set (SELECT ...).
SQL database servers (MySQL, SQL Server, PostgreSQL, Oracle and the rest) are very smart about doing this sort of thing efficiently.
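Since the rest of your stack is ASP.NET MVC, the same query can also be expressed with Entity Framework/LINQ in a controller action. A hedged sketch follows; the entity sets Users and WorkEntries and the UserId key are assumptions, while the selected fields come from your question:
// startDate / endDate would come from the two search textboxes (via model binding)
var results = (from w in db.WorkEntries
               join u in db.Users on w.UserId equals u.UserId
               where w.Work_Date >= startDate && w.Work_Date < endDate
               select new
               {
                   u.User_Name,
                   w.Work_Date,
                   w.Work_Description,
                   w.Work_Location
               }).ToList();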
I have 5 tables in a database: Table1, Table2 and so on [all tables have the same column names / table definition]. I am using Entity Framework in an MVC application with C#.
First I create a db object for the database context.
Then I get a table's data with db.Table1.ToList().
I want to do something like this:
List<string> TableNames = new List<string>();
db.TableNames[1].ToList();
Now I know this won't work, but is there any way I can get the data without hard-coding the table names? My project will deal with hundreds of tables with the same column names but different data.
This is a project for a hospital which will receive data from different locations. Let's say for location A I am expecting 100 cases a day, and right now I have 10 locations. So if I combine all this data into one table, that means about 1,000 records each day; over time, searching through this table will become performance sensitive.
I am writing this for those who might run into this same dilemma...
I referenced a table through EF, so the classes got generated into the Model.
Let's say I have 3 tables with the same schema: tbl_Loc1, tbl_Loc2 and tbl_Loc3.
public IEnumerable<tbl_Loc1> getDataFromTable(string TableName)
{
    using (var ctx = new DBEntities())
    {
        // NOTE: only pass trusted/whitelisted table names; string concatenation is open to SQL injection
        string query = "Select * from " + TableName;
        // ToList() materializes the rows before the context is disposed
        return ctx.tbl_Loc1.SqlQuery(query).ToList();
    }
}
DBEntities is the EF context class; its name matches the connection string in the config.
In ctx.tbl_Loc1.SqlQuery(query), tbl_Loc1 has a class in the model which returns the data in the same shape [as all tables have the same table definition].
There is a model for tbl_Loc1 in EF, whereas tbl_Loc2 and tbl_Loc3 exist only in the database.
The data is returned as an IEnumerable list (materialized with ToList() before the context is disposed).
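Usage might then look like this (assuming the helper above returns IEnumerable<tbl_Loc1>):
foreach (var table in new[] { "tbl_Loc1", "tbl_Loc2", "tbl_Loc3" })
{
    // every table shares tbl_Loc1's schema, so the rows map onto the tbl_Loc1 class
    var rows = getDataFromTable(table);
}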
http://www.entityframeworktutorial.net/Querying-with-EDM.aspx
I echo other commenters' thoughts that you probably can handle this all in one table with a distinguishing column (and some proper indexes on the table). What you've mentioned so far only amounts to hundreds of thousands of records, something that should still perform very well.
However, in order to do what you want the way you state it, you can use reflection to examine the properties of your db object. Any property on it that is a DbSet (an ObjectSet in an EF4 database-first model) represents a table, so you can get a list of those properties and their names (perhaps with a few tweaks regarding pluralization), which will give you your table names.
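A minimal sketch of that reflection idea, assuming a DbContext-based model whose class is named DBEntities (the class name is an assumption):
// requires System.Data.Entity and System.Linq
// Lists the DbSet<T> properties on the context; each one corresponds to a table/entity set
var tableNames = typeof(DBEntities)
    .GetProperties()
    .Where(p => p.PropertyType.IsGenericType &&
                p.PropertyType.GetGenericTypeDefinition() == typeof(DbSet<>))
    .Select(p => p.Name)
    .ToList();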
For a more sophisticated use of metadata within EF, take a look at How I can read EF DbContext metadata programmatically?.
Also, you may find that SMO is a helpful approach to this kind of thing (nothing preventing you from using it and EF).
I have a problem concerning application performance: I have many tables, each having millions of records. I am performing select statements over them using joins, where clauses and order by on different criteria (specified by the user at runtime). I want to get my records paged, but no matter what I do with my SQL statements I cannot reach the performance of getting my pages directly from memory. Basically the problem comes when I have to filter my records using some dynamic criteria specified at runtime. I tried everything, such as using the ROW_NUMBER() function combined with a "where RowNo between" clause, CTEs, temp tables, etc. Those SQL solutions perform well only if I don't include filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that I have in my app several lists that virtually present pages over millions of records, and those records are constructed with very complex SQL statements).
All my tables have a primary key of type INT.
So I came up with an idea: why not create a "server" only for select statements? The server first loads all records from all tables and stores them in some HashSets, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented such that two records are "equal" only if their Ids are equal (don't scream, you will see later why I am not using all the record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution where I load only differential changes. So I invented a changelog table for each table that I want to cache. In this changelog I only perform inserts that mark dirty rows (updates or deletes) and record newly inserted ids, all of this implemented using triggers. So whenever an in-memory select comes in, I first check whether I must sync something (by interrogating the changelog). If something must be applied, I load the changelog, apply those changes in memory and finally clear that changelog (or maybe remember the highest changelog id that I've applied...).
In order to apply the changelog in O(N), where N is the changelog size, I am using the following algorithm (a rough sketch in code follows the list):
for each log entry:
identify the in-memory Dictionary<int, T> where the key is the primary key.
if it's a delete log, call dictionary.Remove(id) (O(1)).
if it's an update log, also call dictionary.Remove(id) (O(1)) and move this id into a "to be inserted" collection.
if it's an insert log, move this id into the "to be inserted" collection.
finally, refresh the cache by selecting all data from the corresponding table where Id in ("to be inserted").
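The rough sketch of that loop (ChangeLogEntry fields, ChangeType, cache and LoadByIds are illustrative names, not my real code):
var toReload = new List<int>();
foreach (var log in changeLog)                      // single pass: O(N) over the changelog
{
    switch (log.Type)
    {
        case ChangeType.Delete:
            cache.Remove(log.RecordId);             // Dictionary<int, T>.Remove is O(1)
            break;
        case ChangeType.Update:
            cache.Remove(log.RecordId);             // drop the stale copy...
            toReload.Add(log.RecordId);             // ...and reload it below
            break;
        case ChangeType.Insert:
            toReload.Add(log.RecordId);
            break;
    }
}
// one round trip: SELECT ... WHERE Id IN (toReload), then put the fresh rows in the cache
foreach (var entity in LoadByIds(toReload))         // LoadByIds is a hypothetical data-access helper
    cache[entity.Id] = entity;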
For filtering, I am compiling some expression trees into Func<T, List<FilterCriterias>, bool> functors. Using this mechanism I am performing way faster than SQL.
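A minimal sketch of the compiled-filter idea (the property name and the commented usage are illustrative; the real code builds the expression from the user's criteria):
// requires System.Linq.Expressions
// Builds x => x.<propertyName> == value once and compiles it to a reusable delegate
static Func<T, bool> BuildEqualsFilter<T>(string propertyName, object value)
{
    var parameter = Expression.Parameter(typeof(T), "x");
    var body = Expression.Equal(
        Expression.Property(parameter, propertyName),
        Expression.Constant(value));
    return Expression.Lambda<Func<T, bool>>(body, parameter).Compile();
}

// var filter = BuildEqualsFilter<Order>("City", "London");   // illustrative entity/criteria
// var page = cache.Values.Where(filter).Skip(pageIndex * pageSize).Take(pageSize).ToList();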
I know that SQL Server 2012 has caching support and the upcoming SQL Server version will support even more, but my client has SQL Server 2005, so... I can't benefit from this stuff.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job. I think it is practically impossible to out-smart it with tricks like this.
Unless your data has some kind of implicit structure which might help to speed things up and which the optimizer cannot be aware of, such "I do my own speedy trick" approaches won't help - normally...
Performance problems are always best solved where they occur:
the table structures and relations
indexes and statistics
the quality of the SQL statements
Even many millions of rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested list with recursive reads, XML...) I'd go the Data-Warehouse-Path and write some denormalized tables for quick selects. Of course you will have to deal with the fact, that you are reading "old" data. If your data does not change much, you could trigger all changes to a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your underperforming queries together with the relevant structure details and ask for review. There are dedicated Stack Exchange sites for this, such as Code Review. If it's not too big, you might try it here as well...
This question already has answers here:
Fastest Way of Inserting in Entity Framework
(32 answers)
Closed 7 years ago.
I have a table with 100,000+ records. Customer has asked that we encrypt the username field, copy the encrypted value to a new field, and clear the original username field. The encryption can only be performed in the application, not the database.
The existing codebase used the Entity Framework in the past for these tasks, but never for a table of this size. Legacy code looked something like:
foreach (var phone in db.Phones)
{
    phone.Enc_Serial = Encrypt(phone.Serial);
    phone.Serial = "";
}
db.SaveChanges();
Given this is a bulk update, would there be any advantage to doing this with a raw SQL command? I'm thinking that at least we wouldn't have a ton of tracked objects sitting in the DbContext consuming memory.
var idsAndSerials = db.Phones.Select(p => new { id = p.Id, serial = p.Serial });
foreach (var item in idsAndSerials)
{
    // Parameterized to avoid quoting/SQL-injection problems with the encrypted value
    db.Database.ExecuteSqlCommand(
        "Update phone set Enc_Serial = @p0 where phoneId = @p1",
        Encrypt(item.serial), item.id);
}
In the example you've provided, no way. You're still iterating through each record and calling UPDATE. At least in the first example I believe those statements would get executed as a batch, and would be transactional so that all updates succeed or none of them do.
Since this is a significant update, I'd suggest creating a tracking table (on the SQL side) in which you sequentially number each of the rows to be updated (and also store the row's PK value). Also include a column in the tracking table that lets you mark the row as done (say 0 or 1). Set a foreign key into the original table via the PK value.
Update your data model on the EF side to include the new tracking table. Now you have a new table that will easily let you retrieve, say, 1K record batches to work on at a time. This won't have excessive memory consumption. The application logic can do the encrypting. As you update those records, mark the ones that are updated as "done" in your tracking table.
Get the next 1K of not done records via tracking table (use navigation properties to get the real records). Repeat.
It'll be done very fast and with no undue load. 100000+ records is not really a lot, especially if you use a divide and conquer approach (100+ batches).
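A hedged sketch of that batch loop, assuming a tracking entity/DbSet named PhoneUpdateTracking with a SequenceNo column, a Done flag and a Phone navigation property (all names, including MyDbContext, are illustrative):
const int batchSize = 1000;
while (true)
{
    using (var db = new MyDbContext())              // fresh context per batch keeps tracked objects low
    {
        var batch = db.PhoneUpdateTrackings
                      .Include(t => t.Phone)        // pull the real phone rows via the navigation property
                      .Where(t => !t.Done)
                      .OrderBy(t => t.SequenceNo)
                      .Take(batchSize)
                      .ToList();
        if (batch.Count == 0) break;

        foreach (var item in batch)
        {
            item.Phone.Enc_Serial = Encrypt(item.Phone.Serial);
            item.Phone.Serial = "";
            item.Done = true;                       // mark the row as done in the tracking table
        }
        db.SaveChanges();                           // one unit of work per 1K batch
    }
}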
The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, and this is also where my preference lies. To keep Entity Framework's performance up, I have created a construction where 300 items are taken out of the vacations collection, parsed and inserted by Entity Framework, and its context is disposed thereafter. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and their subobjects), it's a matter of hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
var set = vacationsObjects.Skip(i).Take(300);
int enumerator = 0;
using (var database = InitializeContext())
{
foreach (VacationModel vacationData in set)
{
enumerator++;
Vacations vacation = new Vacations
{
ProductId = vacationData.ExternalId,
Name = vacationData.Name,
Description = vacationData.Description,
Price = vacationData.Price,
Url = vacationData.Url,
};
foreach (string category in vacationData.Categories)
{
var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == category);
if (existingCategory != null)
vacation.Categories.Add(existingCategory);
else
{
vacation.Categories.Add(new Category
{
CategoryName = category
});
}
}
database.Vacations.Add(vacation);
}
database.SaveChanges();
}
}
The downside (and possibly dealbreaker) with this method is figuring out the relationships. As you can see when adding a Category I check if it's already been created in the local context, and then use that. But what if it has been added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice-versa) and insert them, including their respective relations into the database.
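A hedged sketch of that dictionary idea, reusing the names from the code above (the string loop variable is renamed to categoryName; it assumes Category has an integer identity key named Id, and Attach tells EF the category already exists so it is not inserted again):
var categoryCache = new Dictionary<string, Category>();   // lives outside the batch loop

// at the start of each "using (var database = InitializeContext())" block:
foreach (var cached in categoryCache.Values)
{
    if (cached.Id != 0)                         // saved by an earlier batch, so it has a real key
        database.Categories.Attach(cached);     // attach as Unchanged instead of re-inserting
}

// when building a vacation, instead of checking database.Categories.Local:
Category existing;
if (!categoryCache.TryGetValue(categoryName, out existing))
{
    existing = new Category { CategoryName = categoryName };
    categoryCache[categoryName] = existing;
}
vacation.Categories.Add(existing);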
Possible alternatives
Segregate the context and the transaction -
Purely theoretical; I do not know if I'm making any sense here. Maybe I could have EF's context keep track of all objects and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes, to no avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting parsed data directly into the RDBMS, I could export it into one or more CSV files and make use of importing tools such as MySQL's LOAD DATA INFILE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDBMS due to compatibility with my skillset and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is the SQL queries between client and server. Each query takes time, so try to avoid using many of them. For huge amounts of data it is much better to process them locally. The best solution is to use a special import tool: in MySQL you can use LOAD DATA, in MSSQL there is BULK INSERT. To import your data, you need a .csv file.
To handle foreign keys correctly, you must populate the tables manually before inserting. If the destination tables are empty, you can simply create a .csv file with predefined primary and foreign keys. Otherwise you can import the existing records from the server, update them with your data, then export them back.
Time
Since you can afford to make only INSERTs, one suggestion is to try the Entity Framework Bulk Insert extension. I have used it to save up to 200K records and it works fine. Just include it in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or at least greatly improve over the plain EF version) the time dimension of your problem.
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect those 300K parent records to generate at least 3M records overall), so I would try the following approach:
1) insert your entities using bulk insert.
2) call a stored procedure to check data integrity
If the insertion is quite long and the chance of failure is relatively big, you can check what has already been loaded and have the process skip it:
1) make smaller bulk inserts for a batch of vacation records and all their child records. Ensure that each batch runs in a transaction. A single BULK INSERT runs atomically (no transaction needed); for several it seems tricky.
2) if the process fails, you have complete vacation data in your database (no partially imported vacation)
3) resume the process, but first load the existing vacation records (parents only). Using EF, a faster way is to use AsNoTracking to spare the tracking overhead (which is great for large lists):
var existingVacations = context.Vacation.AsNoTracking().Select(v => v.VacationSourceIdentifier);
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO Version), which allows you to use BulkSaveChanges and Bulk Operations (Insert, Update, Delete and Merge).
It supports both of your providers: MySQL and SQL Server.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions