I'm building an app using Xamarin.Forms, and I'm running into a really slow query in the data that I need to optimize if possible.
In order to understand the question I'm trying to frame, I need to do a good job of explaining the database relationships. The business software I'm trying to build allows the user to schedule employees onto crews and crews onto job operations. For the purposes of this explanation, we can ignore jobs, even though the database objects include 'job' in their name.
For scheduling employees day by day, I created a kanban that allows the user to drag employee names to the crew that they want to schedule them onto for the day, and when they are finished editing they use toolbar buttons to navigate to the next date they want to schedule. In the background, the code is creating database objects that create the link between an employee, a date, and a crew.
For scheduling operations day by day, I created a side-scrolling gantt-style scheduler that allows the user to drag operation blocks onto a crew for the day. In the background, the code is creating database objects that create the link between an operation, a date and a crew.
Here's a simple version of what the database objects look like.
public interface IEmployee
{
long? Id { get; set; }
string Name { get; }
string Password { get; set; }
}
public interface ICrewMember
{
long? Id { get; set; }
IEmployee Employee { get; set; }
bool IsLeader { get; set; }
ICrew Crew { get; set; }
DateTime Date { get; set; }
}
public interface IJobSchedule
{
long? Id { get; set; }
IOperation Operation { get; set; }
DateTime Date { get; set; }
ICrew Crew { get; set; }
}
public interface IOperation
{
long? Id { get; set; }
int Priority { get; set; }
}
So the complexity of the scenario comes when I want to find which operations an employee has been scheduled for. I have to first query to find the employee's schedule objects, build a list of crews/dates that they've been scheduled for, then get a list of the job schedules that match the date/crew list, and then boil it down to a distinct list of operations (since an operation could get scheduled across multiple days). Here's my current code to do this:
public async Task<List<IOperation>> GetOperationsByEmployee(IDataService<IJobSchedule> JobScheduleRepository)
{
JobScheduleRepository = JobScheduleRepository ??
throw new ArgumentNullException(nameof(JobScheduleRepository));
var result = new List<IOperation>();
var empSchedMatches = await GetEmployeeSchedules().ConfigureAwait(false);
var jobSchedules = await GetJobSchedules(JobScheduleRepository, empSchedMatches).ConfigureAwait(false);
result = jobSchedules.Select(x => x.Operation).Distinct().ToList();
return result;
}
private async Task<IEnumerable<ICrewMember>> GetEmployeeSchedules()
{
//Get complete list of employee schedules to sort through
var allEmpSched = await CrewMemberRepository.GetItemsAsync().ConfigureAwait(false);
//Get schedules with date greater than or equal to Date for this employee
var empSchedMatches = allEmpSched.Where(x => x.Date >= Date && x.Employee == Employee);
return empSchedMatches;
}
private async Task<IEnumerable<IJobSchedule>> GetJobSchedules(IDataService<IJobSchedule> JobScheduleRepository, IEnumerable<ICrewMember> employeeSchedules)
{
//Get complete list of job schedules to sort through
var allJobSched = await JobScheduleRepository.GetItemsAsync().ConfigureAwait(false);
allJobSched = allJobSched.Where(x => x.Date >= Date && x.Crew != null && x.Operation != null);
int count = allJobSched.Count();
var result = new List<IJobSchedule>();
foreach (var empSched in employeeSchedules)
{
//For each employee schedule, there should be 1 matching job schedule
//if the crew was assigned a job for that day
var matches = allJobSched.Where(x => x.Date == empSched.Date && x.Crew == empSched.Crew);
result.AddRange(matches);
string message = $"GetJobSchedules() comparing ({count}) Job Schedules " +
$"to empSched.{empSched.Id} crew.{empSched.Crew.Id} date.{empSched.Date:M/d}";
System.Diagnostics.Debug.WriteLine(message);
}
return result;
}
To try to observe the process, I added a number of bits of code that print the steps to the debugger, including a stopwatch. Here's the debugger output:
[0:] Method Called: GetOperationsByEmployee
[0:] GetOperationsByEmployee() executing query...
[0:] Method Called: GetEmployeeSchedules
[0:] Method Called: GetJobSchedules
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.17196 crew.3 date.2/6
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.18096 crew.3 date.2/4
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.18221 crew.3 date.2/3
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.18902 crew.3 date.2/7
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21243 crew.3 date.1/27
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21321 crew.3 date.1/28
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21360 crew.3 date.1/29
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21399 crew.3 date.1/30
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21438 crew.3 date.1/31
[0:] GetJobSchedules() comparing (51) Job Schedules to empSched.21528 crew.3 date.2/5
[0:] Data loaded 6391 ms
So when I'm running the app using mock data, with like 10 objects in memory, it runs in ~100 ms. When I've got 50,000 objects in the real database, with 30 employees, 10 crews, 500 jobs, and 1500 operations to sort through, it takes ~7,000 ms. That comparison was probably obvious, but the point is that I need to find some way, if possible, to optimize the query. I'd like to get it closer to 1 second load time if it can be done.
As always, thanks for any help!
Edit
I'm afraid I'm not getting the answers I was hoping for, because I'm not really looking for advice on the data access side of the question; I'm looking for advice on the LINQ side. I don't know if it helps to understand the data access scenario, but I'll explain briefly.
I'm coding in Xamarin.Forms, using Autofac as my dependency injector. I'm trying to use interfaces to allow the calls to the data service to be abstracted from the data service.
The data is being stored in SQL Server on a server here at the office. The app is using an API for SQL to SQLite called Zumero. Zumero syncs the requested tables off the SQL Server and deposits them into a local file on the mobile device.
I'm using Entity Framework Core to serve the data to the program, again by using interfaces and field mapping to try to abstract the calls for the database objects apart from the database objects themselves.
Edit 2
I'm going to try to re-ask the question here so that it becomes more clear what I'm looking for:
I have a SQLite file that has employees, operations, daily employee schedules, and daily operation schedules. What are some ways that I could write a query to get the list of operations an employee has been scheduled for?
Imaginary Question:
What are Bob's currently scheduled operations?
Imaginary rows in the data tables:
Employees
Bob
Jim
Larry
Employee Schedules
Bob | 1/1/2020 | Concrete Crew 1
Bob | 1/2/2020 | Concrete Crew 1
Bob | 1/3/2020 | Mill Crew 2
Operation Schedules
Crew 1 | 1/1/2020 | Concrete Operation 1
Crew 2 | 1/1/2020 | Mill Operation 1
Crew 1 | 1/2/2020 | Concrete Operation 1
Crew 1 | 1/3/2020 | Concrete Operation 3
Operations
Concrete Operation 1
Mill Operation 1
Concrete Operation 3
Desired Result:
Bob currently has the following operations on the schedule: Concrete Operation 1
It's a relational database question, of sorts, because I'm asking about the best way to figure out the link from employees through employee schedules through operation schedules to operations.
Thanks for any help!
The answer is simple: you are recreating entire tables at runtime and matching them in C#, and that is wrong.
This is what the database is made for, and you should use it.
You have many options (queries, views, stored procedures), but pulling the entire database into memory and performing the matching in code is certainly the wrong way.
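For instance, here is a sketch of pushing the whole chain to the database with EF Core. The DbSets, concrete entity classes, and FK columns here are assumptions, since EF cannot map the interfaces directly:
// Let the server join crew assignments to job schedules and return
// only the distinct operations for one employee.
var operations = await (
        from cm in db.CrewMembers
        where cm.EmployeeId == employeeId && cm.Date >= fromDate
        join js in db.JobSchedules
            on new { cm.Date, cm.CrewId }
            equals new { js.Date, js.CrewId }
        select js.Operation)
    .Distinct()
    .ToListAsync();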
There are a number of things that you could do to speed up your query. There is one obvious change that I would implement (if I understand the flow correctly):
//change to something where you pass parameters, so you don't need to load all of your jobs every time
var allJobSched = await JobScheduleRepository.GetItemsAsync().ConfigureAwait(false);
//change to
var matchingJobSched = await JobScheduleRepository.FindMatchingJobSchedules(date, crewId).ConfigureAwait(false);
You seem to be doing this for each object, so this refactor should be applied across all of your code.
Something else you could try is to write a stored procedure for this action and leave the ORM out of it entirely.
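If you stay with the repository approach rather than a stored procedure, the new method could be implemented roughly like this with EF Core (a sketch; the DbSet name, the scalar CrewId column, and a concrete JobSchedule class are assumptions):
public async Task<List<IJobSchedule>> FindMatchingJobSchedules(DateTime date, long crewId)
{
    // The filter is translated to SQL, so only matching rows cross the wire.
    var matches = await _context.JobSchedules
        .Where(x => x.Date == date && x.CrewId == crewId)
        .ToListAsync()
        .ConfigureAwait(false);
    // JobSchedule implements IJobSchedule, so the cast is safe.
    return matches.Cast<IJobSchedule>().ToList();
}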
At least from the way I am reading this, you are doing a lot of queries where you're essentially fetching the whole set of data from your database before querying it in memory. This works fine when you have a few records, as you've discovered; however, if you're front-loading whole tables on every request, you're going to find yourself completely bottlenecked, to the point where your application will eventually cease to function entirely.
There's a lot of code here but just as an example.
var allJobSched = await JobScheduleRepository.GetItemsAsync().ConfigureAwait(false);
allJobSched = allJobSched.Where(x => x.Date >= Date && x.Crew != null && x.Operation != null);
int count = allJobSched.Count();
In the above snippet, you first pull all of the job schedules from the database, which will invariably involve a large over-the-wire data pull (not good).
You'd be much better off writing something similar to the following:
var jobs = await JobScheduleRepository.GetValidJobsAfterDate(Date);
The query itself should be sent by your repository to the database and handled there; you should never be pulling back huge collections of data to work on in memory (in most cases).
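Such a repository method might look something like this (again a sketch; the DbContext and the nullable FK column names are assumptions):
public async Task<IEnumerable<IJobSchedule>> GetValidJobsAfterDate(DateTime date)
{
    // Translated to SQL: only schedules on/after the date that have a crew
    // and an operation assigned are pulled back over the wire.
    var rows = await _context.JobSchedules
        .Where(x => x.Date >= date && x.CrewId != null && x.OperationId != null)
        .ToListAsync()
        .ConfigureAwait(false);
    // IEnumerable<T> is covariant: a List<JobSchedule> can be returned
    // as IEnumerable<IJobSchedule> when JobSchedule implements the interface.
    return rows;
}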
IF, as you insist, this must be done 100% in LINQ (which I do not recommend, since you should be pre-filtering the data coming in), then your issue is that you are performing an exhaustive search for every row in the table. Instead, I recommend creating a dictionary and loading your rows into it first.
Instead of:
private async Task<IEnumerable<ICrewMember>> GetEmployeeSchedules()
{
//Get complete list of employee schedules to sort through
var allEmpSched = await CrewMemberRepository.GetItemsAsync().ConfigureAwait(false);
//Get schedules with date greater than or equal to Date for this employee
var empSchedMatches = allEmpSched.Where(x => x.Date >= Date && x.Employee == Employee);
return empSchedMatches;
}
Do this:
var dict = new Dictionary<IEmployee, List<ICrewMember>>();
foreach (var item in allEmpSched) {
    if (dict.TryGetValue(item.Employee, out var schedules)) {
        schedules.Add(item);
    } else {
        dict[item.Employee] = new List<ICrewMember>() { item };
    }
}
Then, using that dictionary, look up your employee. That's essentially what the database will do for you automatically if you have a properly configured index along with an appropriate SELECT clause, but there you go.
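The same idea pays off even more on the job-schedule side: index the job schedules by (date, crew) once, and each employee-schedule row becomes a hash lookup instead of a scan. A sketch built on the collections from the question:
// Build the (date, crew) index once, instead of scanning allJobSched
// for every employee schedule. Tuple keys compare Date by value and
// Crew by reference, matching the == comparisons in the original code.
var jobsByDateCrew = allJobSched
    .GroupBy(x => (x.Date, x.Crew))
    .ToDictionary(g => g.Key, g => g.ToList());

var result = new List<IJobSchedule>();
foreach (var empSched in employeeSchedules)
{
    // Hash lookup replaces the Where() scan over all job schedules.
    if (jobsByDateCrew.TryGetValue((empSched.Date, empSched.Crew), out var matches))
        result.AddRange(matches);
}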
Edit: I see that you're hoping to maybe relocate this to the ORM, such that it will write the correct SQL for you. I don't know how to do that, because I don't use ORMs.
Why don't I use ORMs? There are three reasons. First, data access is not something that should be given over to a machine to build into your application. Data access is often the most-used code you'll have in the software package. Pay attention to it and design it well. Computers cannot provide an effective substitute for proper design.
Second, the SQL language itself is an abstraction upon the physical means to access the underlying data. When a SQL query is executed, the first thing that the database engine does is to come up with a plan to execute it. In essence, the SQL is being interpreted and compiled down to generated code. If you throw another code generator on top of it (the ORM), the results are naturally going to vary, and rarely does an ORM produce good results without spending quite a bit of time tweaking. Spend your time writing good SQL queries.
Finally, ORMs don't really eliminate the problem of impedance mismatch between strong object-oriented models and a relational database. You have to start by addressing the problem in the data model itself, and write your code to deal with relational objects, not deeply-nested objects. Once you do this, you'll find that writing your own queries isn't really that hard.
There are two parts to the question asked.
Question 1: How do I write the code to get from one to many to many to one?
I didn't get any answers here. I'll re-post the code I'm using.
public async Task<List<IOperation>> GetOperationsByEmployee(
IDataService<IJobSchedule> JobScheduleRepository,
DateTime Date,
IEmployee Employee)
{
//allEmployeeSchedules and allJobSched are assumed to be pre-loaded from the repositories, as shown earlier
var empSchedMatches = allEmployeeSchedules.Where(x => x.Date >= Date && x.Employee == Employee);
var jobSchedules = new List<IJobSchedule>();
foreach (var empSched in empSchedMatches)
{
var matches = allJobSched.Where(x => x.Date == empSched.Date && x.Crew == empSched.Crew);
jobSchedules.AddRange(matches);
}
var result = jobSchedules.Select(x => x.Operation).Distinct().ToList();
return result;
}
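For completeness, the loop can also be collapsed into a single LINQ join over the same in-memory collections; Enumerable.Join builds a hash table on the (Date, Crew) key internally, which avoids the nested scans that made the original slow:
// Same semantics as the loop above, but Join hashes on (Date, Crew).
var result = (from empSched in allEmployeeSchedules
              where empSched.Date >= Date && empSched.Employee == Employee
              join jobSched in allJobSched
                  on new { empSched.Date, empSched.Crew }
                  equals new { jobSched.Date, jobSched.Crew }
              select jobSched.Operation)
             .Distinct()
             .ToList();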
Question 2: How do I speed up this query?
I got a bit of haranguing over ORM choice here, but no concrete implementation steps to improve the code. Then I had a conversation with a programmer who explained that he has a ton of experience using Entity Framework with SQLite. He gave me a couple of concrete pointers and a code sample for how to improve the database loading speed. He explained that Entity Framework tracks the objects that it loads, but if an operation only reads data that doesn't need to be tracked, the loading speed can be improved by turning the tracking off. He gave me the following code sample.
internal class ReadOnlyEFDatabase : EFDatabase
{
public ReadOnlyEFDatabase(string dbPath, DbContextOptions options) : base(dbPath, options)
{
this.ChangeTracker.AutoDetectChangesEnabled = false;
this.ChangeTracker.QueryTrackingBehavior = QueryTrackingBehavior.NoTracking;
}
public override int SaveChanges()
{
throw new Exception("Attempting to save changes from a read-only connection to the database.");
}
}
Thanks to Jeremy Sheeley from Zumero for this help!
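For what it's worth, EF Core can also opt out of tracking per query with AsNoTracking(), if a dedicated read-only context class is overkill (a sketch; JobSchedules is an assumed DbSet name):
// No-tracking read: EF skips snapshot creation and change tracking
// for these rows, which speeds up large read-only loads.
var schedules = await context.JobSchedules
    .AsNoTracking()
    .Where(x => x.Date >= date)
    .ToListAsync()
    .ConfigureAwait(false);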
Related
I have two datetime pickers on my form. I want a function that will return all datetimes from a specific table (which are values of a specific column) between those two dates.
My method looks like this:
public DateTime[] GetAllArchiveDates(string username = null)
{
var result = new DateTime[0];
if (username != null)
{
result = this._context.archive.OrderBy(s => s.IssuingDate).Where(s => s.insertedBy == username).Select(s => s.issuing_date).Distinct().ToArray();
}
else
{
result = this._context.archive.OrderBy(s => s.IssuingDate).Select(s => s.issuing_date).Distinct().ToArray();
}
return result;
}
But I am getting this error:
System.NotSupportedException: 'The specified type member 'IssuingDate' is not supported in LINQ to Entities. Only initializers, entity members, and entity navigation properties are supported.'
How to do this?
The cause of your error message
You should be aware about the differences between IEnumerable and IQueryable.
An object of a class that implements IEnumerable holds everything to enumerate over the sequence of items it represents. You can ask for the first item of the sequence, and once you've got one, you can ask for the next item, until there are no more items.
On the other hand, an object of a class that implements IQueryable holds everything to ask another process to provide data to create an IEnumerable sequence. To do this, it holds an Expression and a Provider.
The Expression is a generic representation of what kind of IEnumerable must be created once you start enumerating the IQueryable.
The Provider knows who must execute the query, and it knows how to translate the Expression into a format that the executor understands, for instance SQL.
There are two kinds of LINQ statements: those that use deferred execution, and those that don't. The deferred functions can be recognized because they return IQueryable<TResult> (or IEnumerable<TResult>). Examples are Where, Select, GroupBy, etc.
The non-deferred functions return a TResult: ToList, ToDictionary, FirstOrDefault, Max.
As long as you concatenate deferred LINQ functions, the query is not executed, only the Expression is changed. Once you start enumerating, either explicitly using GetEnumerator and MoveNext, or implicitly using foreach, ToList, Max, etc, the Expression is sent to the Provider who will translate it to SQL and execute the query. The result is represented as an IEnumerable, on which the GetEnumerator is performed.
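A minimal illustration, using the archive set from the question:
// Deferred: nothing is executed here; only the Expression grows.
IQueryable<DateTime> query = _context.archive
    .Where(s => s.insertedBy == userName)
    .Select(s => s.issuing_date);

// Non-deferred: the Provider now translates the Expression to SQL,
// the database executes it, and the results are materialized locally.
DateTime[] dates = query.Distinct().ToArray();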
What has this to do with my question?
Because the Expression must be translated into SQL, it can't hold anything that you invented. After all, SQL doesn't know your functions. In fact, there are a lot of standard functions that can't be used in an IQueryable. See Supported and unsupported LINQ functions
Alas, you forgot to give us the archive class definition, but I think it is not a POCO: it contains functions or properties that do more than just get/set. In particular, I suspect IssuingDate is not a simple { get; set; } property.
For IQueryables you should keep your classes simple: use only { get; set; } properties in anything the query touches, nothing more. Other functions can be called after you've materialized the IQueryable into an IEnumerable, which executes within your local process.
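As a guess at what is going on, here is a sketch of the kind of class that produces exactly this error (the body is invented for illustration):
class archive
{
    public DateTime issuing_date { get; set; }   // mapped column: safe inside a query
    public string insertedBy { get; set; }

    // Computed, not mapped to a column: EF cannot translate this member
    // to SQL, which yields the NotSupportedException from the question.
    public DateTime IssuingDate => issuing_date.Date;
}

// Fails: _context.archive.OrderBy(s => s.IssuingDate)...
// Works: order by the mapped column, or materialize first:
var local = _context.archive.AsEnumerable()
    .OrderBy(s => s.IssuingDate)
    .ToArray();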
Back to your question
So you have a database with a table Archive with at least columns IssuingDate and InsertedBy. It seems that InsertedBy is just a string. It could be a foreign key to a table with users. This won't influence the answer very much.
Following the entity framework code first conventions this leads to the following classes
class Archive
{
public int Id {get; set;}
public DateTime IssuingDate {get; set;}
public string InsertedBy {get; set;}
...
}
public class MyDbContext : DbContext
{
public DbSet<Archive> Archives {get; set;}
}
By the way, is there a proper reason you deviate so often from Microsoft standards about naming identifiers, especially pluralization and camel casing?
Anyway, your requirement
I have two datetime pickers on my form. I want a function that will return all datetimes from a specific table (which are values of a specific column) between those two dates.
Your code seems to do a lot more, but let's first write an extension function that meets your requirement. I'll write it as an extension method of your archive class. This will keep your archive class simple (only {get; set;}), yet it adds functionality to the class. Writing it as an extension function also enables you to use these functions as if they were any other LINQ function. See Extension methods demystified
public static IQueryable<Archive> BetweenDates(this IQueryable<Archive> archives,
DateTime startDate,
DateTime endDate)
{
return archives.Where(archive => startDate <= archive.IssuingDate
&& archive.IssuingDate <= endDate);
}
If I look at your code, though, you don't just select archives between dates. You do something with a userName, ordering, selecting distinct... It is a bit strange to first order all your million archives and only then decide to keep the ten that belong to userName, removing duplicate issuing dates along the way. Wouldn't it be more efficient to limit the number of issuing dates first, before you start ordering them?
public static IQueryable<DateTime> ToIssuingDatesOfUser(this IQueryable<Archive> archives,
    string userName)
{
    // first limit the number of archives, depending on userName,
    // then select the IssuingDate, remove duplicates, and finally order
    var archivesOfUser = (userName == null) ? archives :
        archives.Where(archive => archive.InsertedBy == userName);
    return archivesOfUser.Select(archive => archive.IssuingDate)
                         .Distinct()
                         .OrderBy(issuingDate => issuingDate);
}
Note: until now, I only created IQueryables. So only the Expression is changed, which is fairly efficient. The database is not communicated yet.
Example of usage:
Requirement: given a userName, a startDate and an endDate, give me the unique issuingDates of all archives that are issued by this user, in ascending order
public ICollection<DateTime> GetIssuingDatesOfUserBetweenDates(string userName,
DateTime startDate,
DateTime endDate)
{
using (var dbContext = new MyDbContext(...))
{
return dbContext.Archives
.BetweenDates(startDate, endDate)
.ToIssuingDatesOfUser(userName)
.ToList();
}
}
In a previous question I presented these models:
public class Calendar
{
public int ID { get; set; }
public ICollection<Day> Days { get; set; }
}
public class Day
{
public int ID { get; set; }
public DateTime Date { get; set; }
public int CalendarID { get; set; }
}
There is a uniqueness constraint so that you can't have more than one Day with the same Date and CalendarID.
My question now is: what if I want to move all days one day into the future (or whatever)? The easiest code is just a loop like
foreach (Day day in days) {
day.Date = day.Date.AddDays(1);
db.Entry(day).State = EntityState.Modified;
}
await db.SaveChangesAsync();
This will fail, however, because it thinks you are creating duplicates of some of the dates, even though once they're all updated it will work out.
Calling SaveChangesAsync after each day (assuming you process the days in the correct order) would work, but seems massively inefficient for this basic task.
An alternative to updating the Date would be to transfer all the other data of each day to the next one, but this also seems inefficient and could in some cases be undesirable because it means that data is dissociated from the Day's primary key value.
Is there a way to update all the dates while keeping the uniqueness constraint?
The number of SQL UPDATE statements won't change if you call SaveChanges() for each record instead of calling it only once, but at least you'll get the correct order. There's some overhead from state cleaning and connection management, but it's not massively inefficient.
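In code, the ordered per-row variant looks something like this (a sketch; for a shift forward, descending date order keeps every intermediate state unique):
// Update the latest date first so no row ever collides with a
// not-yet-updated neighbour under the (Date, CalendarID) constraint.
foreach (var day in days.OrderByDescending(d => d.Date))
{
    day.Date = day.Date.AddDays(1);
    db.Entry(day).State = EntityState.Modified;
    await db.SaveChangesAsync();   // one UPDATE per row, in a safe order
}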
If date shifting is an isolated business transaction, you could use a simpler solution instead of fighting with the ORM: call a stored procedure or execute SQL directly with something similar to:
var sql = "UPDATE d SET Date = DATEADD(d, 1, Date) FROM (SELECT TOP 100 PERCENT * FROM Day WHERE CalendarID=@calendarId ORDER BY Date DESC) d";
var updateCnt = db.Database.ExecuteSqlCommand(sql, new SqlParameter("@calendarId", calendar.Id));
if (updateCnt != days.Count)
{
//oops
}
One of the many possible solutions is removing all the records before you do the update.
You can first get your days, store them in memory.
var days = db.Day.ToList();
Truncate the table, so they won't collide with the new list coming:
db.Database.ExecuteSqlCommand("TRUNCATE TABLE Day");
Do your stuff:
foreach(var day in days)
{
day.Date=day.Date.AddDays(1);
}
Insert the modified days back as new rows, then save:
db.Day.AddRange(days);
db.SaveChanges();
This should be efficient enough since the quickest way to wipe data is to truncate, and your day objects are child objects.
HOWEVER
If a property changes a lot, it's probably not a good idea to make it part of a primary key.
If you find yourself in conflict with fundamentals, it's quite possible that you've made an architectural mistake.
I strongly recommend changing your primary key to something else; you could even roll a uniqueidentifier column to store the Id.
I'm wondering if anyone else has experienced this before?
I should start by saying I'm not using ModelsBuilder for this project. I was having too many problems with it, so I abandoned that route.
I am, however, converting IPublishedContent items into DTOs within my app, using a converter class that basically maps the values. The problem I'm finding is that this causes a massive slowdown in my code execution, especially in comparison to just getting the raw IPublishedContent collection.
To give you an example, I have a 'Job' document type. Jobs can be assigned to workers. In one of my services I need to get a collection of all jobs assigned to a worker:
public IEnumerable<IPublishedContent> GetJobsForWorker(int workerId)
{
var jobs = Umbraco.TypedContent(1234);
return jobs.Descendants("job").Where(j => j.GetPropertyValue<int>("assignedWorker") == workerId).ToList();
}
This function returns a collection of IPublishedContent, and it returns lightning fast, as I'd expect.
However, if I try to convert the results to my job DTO class, it goes from taking 0 seconds to around 7, and that's just returning a collection of ~7 out of ~20 or so records:
public IEnumerable<Job> GetJobsCompletedByWorker(int workerId)
{
var jobs = Umbraco.TypedContent(1234);
return jobs.Descendants("job").Where(j => j.GetPropertyValue<int>("assignedWorker") == workerId).Select(node => _jobConverter.ConvertToModel(_umbracoHelper, node)).ToList();
}
Now I'm not doing any complex processing in this converter, it's just mapping the values as such:
public class JobConverter
{
public Job ConvertToModel(UmbracoHelper umbracoHelper, IPublishedContent node)
{
if (node != null)
{
var job = new Job
{
Id = node.Id,
Property1 = node.GetPropertyValue<string>("property1"),
Property2 = node.GetPropertyValue<string>("property2")
... more properties
};
return job;
}
return null;
}
}
I'm not really sure what best practice is here. Is there something I'm missing that's causing this slowdown? I only ask because I've used ModelsBuilder before, which essentially does the same thing, i.e. maps Umbraco fields to properties, and yet there's nowhere near the same delay.
Ultimately I could just use IPublishedContent, but it makes for messy code and it's far more difficult to understand.
I just wonder if anyone's been in this situation before and how they handled it?
Thanks
It turns out I actually had a helper method running on one of my properties that was querying member data, which made a database call... hence the slowdown!
I was using ServiceStack.Redis recently, and I need to query through IRedisTypedClient. I know all the data is in memory, but I still want to know: is there a speed difference between GetAll().Where() and GetByIds()?
GetAll() and GetByIds() are two methods provided by ServiceStack.Redis.
With GetAll() I can keep searching within the result (with a lambda), which means I can apply custom conditions, but I don't know whether that loads all the data from Redis memory and then searches within an IEnumerable<T>, and whether that search is slower than GetByIds().
I just did an experiment: I stored 1 million objects (note: there is a ServiceStack bug that only allows storing about half a million objects at once) and queried with both methods.
DateTime beginDate = DateTime.Now;
Debug.WriteLine("Query 1 started");
Website site = WebsiteRedis.GetByCondition(w => w.Name == "Site2336677").First();
double time = (DateTime.Now - beginDate).TotalMilliseconds;
Debug.WriteLine("Elapsed: " + time + "ms");
DateTime beginDate2 = DateTime.Now;
Debug.WriteLine("Query 2 started");
Website site2 = WebsiteRedis.GetByID(new Guid("29284415-5de0-4781-bea4-5e01332814b2"));
double time2 = (DateTime.Now - beginDate2).TotalMilliseconds;
Debug.WriteLine("Elapsed: " + time2 + "ms");
The results:
GetAll().Where() takes 19 seconds.
GetById() takes 190 ms.
I guess this is because ServiceStack uses the object id as the Redis key, so GetAll().Where() should never be used as a query; every object should be associated with an id and fetched via GetById(). GetAll() should only be used on types with few records.
You can have a look at the implementations of GetAll and GetByIds to see how they work.
GetByIds just converts all ids to the fully qualified key each entry is stored under, then calls GetValues(), which creates a single MGET request to fetch all the values:
public IList<T> GetByIds(IEnumerable ids)
{
if (ids != null)
{
var urnKeys = ids.Map(x => client.UrnKey<T>(x));
if (urnKeys.Count != 0)
return GetValues(urnKeys);
}
return new List<T>();
}
public IList<T> GetAll()
{
var allKeys = client.GetAllItemsFromSet(this.TypeIdsSetKey);
return this.GetByIds(allKeys.ToArray());
}
GetAll fetches all the Ids from the TypeIdsSetKey (i.e. Redis SET containing all ids for that Type) then calls GetByIds().
So GetByIds is faster because it makes one less call to Redis, but together they only make 2 Redis operations.
Note that they both return an in-memory .NET List<T>, so you can use LINQ to filter the returned results further; but they return all results for that Type, and the filtering is performed on the client, so this isn't efficient for large datasets. Instead, you should look at creating manual indexes using Redis SETs for common queries.
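A sketch of such a manual index (the idx:... key naming is invented; the ServiceStack.Redis calls themselves are standard):
using (var redis = redisManager.GetClient())
{
    var redisSites = redis.As<Website>();

    // On write: store the entry and also index its id under a custom SET key.
    redisSites.Store(site);
    redis.AddItemToSet("idx:website:name:" + site.Name, site.Id.ToString());

    // On read: fetch the small id SET, then MGET only those entries.
    var ids = redis.GetAllItemsFromSet("idx:website:name:" + name);
    var sites = redisSites.GetByIds(ids);
}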
I am writing a fairly large service centered around Stanford's Folding@home project. This portion of the project is a WCF service hosted inside a Windows Service. With proper database indices and a dual-core Core2Duo/7200rpm platter, I am able to process approximately 1500 rows per second (SQL 2012 Datacenter instance). Each hour when I run this update, it takes a considerable amount of time to iterate through all 1.5 million users and add updates where necessary.
Looking at the performance profiler in SQL Server Management Studio 2012, I see that every user is being loaded via individual queries. Is there a way with EF to eagerly load a set of a given size of users, update them in memory, then save the updated users - using queries more elegant than single-select, single-update? I am currently using EF5, but if I need to move to 6 for improved performance, I will. The main source of delay on this process is waiting for database results.
Also, if there is anything I should change about the ForAll or pre-processing, feel free to mention it. The group pre-processing is very quick and dramatically increases the speed of the update by controlling each EF context's size - but if I can pre-process more and improve the overall time, I am more than willing to look into it!
private void DoUpdate(IEnumerable<Update> table)
{
var t = table.ToList();
var numberOfRowsInGroups = t.Count() / (Properties.Settings.Default.UpdatesPerContext); //Control each local context size. 120 works well on most systems I have.
//Split work groups out of the table of updates.
var groups = t.AsParallel()
.Select((update, index) => new {Value = update, Index = index})
.GroupBy(a => a.Index % numberOfRowsInGroups)
.ToList();
groups.AsParallel().ForAll(group =>
{
var ents = new FoldingDataEntities();
ents.Configuration.AutoDetectChangesEnabled = false;
ents.Configuration.LazyLoadingEnabled = true;
ents.Database.Connection.Open();
var count = 0;
foreach (var a in group)
{
var update = a.Value;
var data = UserData.GetUserData(update.Name, update.Team, ents); //(Name,Team) is a superkey; passing ents allows external context control
if (data.TotalPoints < update.NewCredit)
{
data.addUpdate(update.NewCredit, update.Sum); //basic arithmetic, very quick - may attach a row to the UserData.Updates collection. (does not SaveChanges here)
}
}
ents.ChangeTracker.DetectChanges();
ents.SaveChanges();
});
}
//from the UserData class which wraps the EF code.
public static UserData GetUserData(string name, long team, FoldingDataEntities ents)
{
return ents.Users.Local.FirstOrDefault(u => (u.Team == team && u.Name == name))
?? ents.Users.FirstOrDefault(u => (u.Team == team && u.Name == name))
?? ents.Users.Add(new User { Name = name, Team = team, StartDate = DateTime.Now, LastUpdate = DateTime.Now });
}
internal struct Update
{
public string Name;
public long NewCredit;
public long Sum;
public long Team;
}
EF is not the solution for raw performance... It's the "easy way" to build a Data Access Layer (DAL), but it comes with a fair bit of overhead. I'd highly recommend using Dapper or raw ADO.NET to do a bulk update; it would be a lot faster.
http://www.ormbattle.net/
Now, to answer your question: to do a batch update in EF, you'll need to download extensions and third-party plugins that enable such abilities. See: Batch update/delete EF5
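For comparison, here is a rough sketch of the Dapper route (the table and column names are assumptions; Dapper re-executes the statement once per element of the sequence it is given):
using Dapper;
using System.Data.SqlClient;
using System.Linq;

// One parameterized UPDATE per entry, with no change-tracking overhead.
// Projecting to an anonymous type because Dapper binds properties, not fields.
using (var conn = new SqlConnection(connectionString))
{
    conn.Execute(
        @"UPDATE Users SET TotalPoints = @NewCredit
          WHERE Name = @Name AND Team = @Team AND TotalPoints < @NewCredit",
        updates.Select(u => new { u.Name, u.Team, u.NewCredit }));
}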