Algorithm to calculate frequency and recency of an entity? - c#

I have a list of entities opened by various users.
I keep track of each access to an entity by storing access dates and times as follows:
public class Entity
{
    public int Id { get; set; }
    public virtual ICollection<AccessInfo> Accesses { get; set; }
        = new HashSet<AccessInfo>();
}

public class AccessInfo
{
    public int Id { get; set; }
    public AccessInfoType Type { get; set; }
    public User User { get; set; }
    public DateTime DateTime { get; set; }
}

public enum AccessInfoType
{
    Create,
    Read,
    Update,
    Delete,
}
Now I'm trying to write an algorithm that surfaces the most wanted contacts based on both factors: recency and frequency.
I want a contact that was accessed 5 times yesterday to be prioritized over a contact that was accessed 30 times a week ago. On the other hand, a contact that was accessed only once today is less important.
Is there an official name for this? I'm sure people have worked on a frequency calculation like this before, and I'd like to read about it before I spend time coding.
I thought about calculating the sum of the access dates in the recent month and sorting accordingly, but I'm still not sure it's the right way; I'd love to learn from the experts.
return Entities
    .OrderBy(c =>
        c.Accesses
            .Where(a => a.User.UserName == UserName)
            .Where(a => a.DateTime > lastMonth)
            .Select(a => a.DateTime.Ticks)
            .Sum());

Exponential decay is what you're looking for. See this link:
http://www.evanmiller.org/rank-hotness-with-newtons-law-of-cooling.html
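For example, a decayed score can be computed directly from the stored access timestamps. The sketch below is only an illustration of the idea: it assumes the Entity and AccessInfo classes above plus a UserName property on User (as used in the question's query), an arbitrary half-life of 7 days, and it runs in memory rather than being translated to SQL.

using System;
using System.Linq;

public static class Hotness
{
    // Each access contributes 0.5 ^ (ageInDays / halfLifeDays), so an access made
    // halfLifeDays ago counts half as much as one made right now.
    public static double Score(Entity entity, string userName, DateTime now, double halfLifeDays = 7.0)
    {
        return entity.Accesses
            .Where(a => a.User.UserName == userName)
            .Sum(a => Math.Pow(0.5, (now - a.DateTime).TotalDays / halfLifeDays));
    }
}

// Usage: rank entities by descending score.
// var ranked = entities.OrderByDescending(e => Hotness.Score(e, userName, DateTime.UtcNow)).ToList();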

I would use a heuristic that assigns points to Entities for access and uses some kind of decay on those points.
For example, you could give an entity 1 point every time it is accessed, and once every day multiply all the points by a factor of 0.8
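A rough sketch of that heuristic (the EntityScore type and the method names are illustrative, not part of the model above; the score would be an extra persisted value):

using System.Collections.Generic;

public class EntityScore
{
    public int EntityId { get; set; }
    public double Score { get; set; }
}

public static class ScoreDecay
{
    // Call whenever the entity is accessed.
    public static void RecordAccess(EntityScore score)
    {
        score.Score += 1.0;
    }

    // Call once per day, e.g. from a scheduled job: older activity keeps 80% of its weight per day.
    public static void ApplyDailyDecay(IEnumerable<EntityScore> scores)
    {
        foreach (var s in scores)
            s.Score *= 0.8;
    }
}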

Related

Calculated fields that improve performance but need to be maintained (EF)

I have this "1 to N" model:
class Reception
{
    public int ReceptionId { get; set; }
    public string Code { get; set; }
    public virtual List<Item> Items { get; set; }
}

class Item
{
    public int ItemId { get; set; }
    public string Code { get; set; }
    public int Quantity { get; set; }
    public int ReceptionId { get; set; }
    public virtual Reception Reception { get; set; }
}
And this action, api/receptions/list
public JsonResult List()
{
    return Json(dbContext.Receptions
        .Select(e => new
        {
            code = e.Code,
            itemsCount = e.Items.Count,
            quantity = e.Items.Sum(i => i.Quantity)
        })
        .ToList());
}
which returns a list of receptions, with their item counts and total quantities:
[
{code:"1231",itemsCount:10,quantity:30},
{code:"1232",itemsCount:5,quantity:70},
{code:"1234",itemsCount:30,quantity:600},
...
]
This was working fine, but now I have too many Receptions and Items, so the query is taking too long.
So I want to speed it up by adding some persisted fields to Reception:
class Reception
{
    public int ReceptionId { get; set; }
    public string Code { get; set; }
    public virtual List<Item> Items { get; set; }
    public int ItemsCount { get; set; } // Persisted
    public int Quantity { get; set; }   // Persisted
}
With this change, the query ends up being this:
public JsonResult List()
{
    return Json(dbContext.Receptions
        .Select(e => new
        {
            code = e.Code,
            itemsCount = e.ItemsCount,
            quantity = e.Quantity
        })
        .ToList());
}
My question is:
What's the best way to maintain these two fields?
I will gain in performance, but now I will need to be more careful with the creation of Items.
Today an Item can be created, edited and deleted:
api/items/create?receptionId=...
api/items/edit?itemId=...
api/items/delete?itemId=...
I also have a tool for importing receptions via Excel:
api/items/createBulk?...
Maybe tomorrow I will have more ways of creating Items, so the question is: how do I make sure that these two new fields, ItemsCount and Quantity, are always up to date?
Should I create a method within Reception like this?
class Reception
{
    ...
    public void UpdateMaintainedFields()
    {
        this.Quantity = this.Items.Sum(e => e.Quantity);
        this.ItemsCount = this.Items.Count();
    }
}
And then REMEMBER to call it from all the previous endpoints (items/create, items/edit, ...)?
Or should I have a stored procedure in the database instead?
What is the common practice? I know there are computed columns, but those can only refer to other columns of the same table. There are also indexed views, but I'm not sure whether they apply well to scenarios like this.
From your code it seems that you do not have a layer for business logic and that everything is implemented in the controllers. The problem this causes is that when you add a different way of creating items (and it seems you mean a different controller), you have to implement this logic again; it is easy to forget, and even if you don't forget, it is easy to stop maintaining it later.
So I would recommend having a business logic layer (for example, for adding new items) and using it from every controller that creates items.
I would also recommend writing the UpdateMaintainedFields method as you suggested, but calling it in the business logic layer after adding the items, not in the controllers!
You could also put the logic in the database (as a trigger) if you can accept that you can't unit test it.
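A rough sketch of what such a layer could look like (ReceptionService and MyDbContext are illustrative names, not from the original post; the point is that every controller, including the bulk import, goes through the same method):

using System.Collections.Generic;
using System.Data.Entity; // for Include()
using System.Linq;

public class ReceptionService
{
    private readonly MyDbContext dbContext;

    public ReceptionService(MyDbContext dbContext)
    {
        this.dbContext = dbContext;
    }

    // The single place that adds items, so the maintained fields can never be forgotten.
    public void AddItems(int receptionId, IEnumerable<Item> newItems)
    {
        var reception = dbContext.Receptions
            .Include(r => r.Items)
            .Single(r => r.ReceptionId == receptionId);

        reception.Items.AddRange(newItems);
        reception.UpdateMaintainedFields(); // recalculates ItemsCount and Quantity

        dbContext.SaveChanges();
    }
}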
Assuming the original query cannot be improved with the correct execution plan in SQL Server, the way to update these fields is via a trigger in the DB. When an insert occurs against that table (or an update, if the data your persisted fields depend on can change), the trigger runs and is responsible for updating the affected rows with the new values.
Obviously your insert performance will drop, but your query performance becomes that of a simple index lookup and a single-row read. Obviously you wouldn't be able to use this trick if you were returning a subset of the items, as the stored quantities are totals over the whole reception.
An alternative is to hold the count and quantity sums in a separate table, or in a dummy row that holds the summed quantities as its entry for quantity. YMMV.
PS: I hate how what is really a SQL question has been turned into one about C# code! Learn SQL and run the queries you need directly in the DB; that will show you much more about the performance and structure of what you're looking for than getting EF involved. /rant :)
You want to store the same information redundantly, which can lead to inconsistencies. As an inspiration, indexes also duplicate data. How do you update them? You don't; it is all fully transparent. And I would recommend the same approach here.
Make a sum table, maintained by triggers. The table would not be included in any data context schema; the only way to read it would be through non-updatable views or stored procedures. Its name should make clear that nobody should ever touch this table directly.
You can then access your data from various frameworks and not worry about updating anything. The database will ensure the precalculated sums are always correct, as long as you do not write to the sum table yourself. In fact, you can add or remove this table at any time and no application would even notice.
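As an illustration of how the application might read such a DB-maintained summary without mapping it (a sketch only; the view name, DTO, and MyDbContext are made up, and Database.SqlQuery is the EF 6 raw-query API):

using System.Collections.Generic;
using System.Linq;

// Not part of the EF model; just a shape for the query result.
public class ReceptionSummary
{
    public int ReceptionId { get; set; }
    public int ItemsCount { get; set; }
    public int Quantity { get; set; }
}

public class ReceptionSummaryQueries
{
    private readonly MyDbContext dbContext;

    public ReceptionSummaryQueries(MyDbContext dbContext)
    {
        this.dbContext = dbContext;
    }

    public List<ReceptionSummary> GetAll()
    {
        // Read-only access through a view that sits on top of the trigger-maintained sum table.
        return dbContext.Database
            .SqlQuery<ReceptionSummary>("SELECT ReceptionId, ItemsCount, Quantity FROM dbo.ReceptionSummaryView")
            .ToList();
    }
}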

Remove all but 1 object in list based on grouping

I have a list of objects with multiple properties in it. Here is the object.
public class DataPoint
{
    private readonly string uniqueId;

    public DataPoint(string uid)
    {
        this.uniqueId = uid;
    }

    public string UniqueId
    {
        get
        {
            return this.uniqueId;
        }
    }

    public string ScannerID { get; set; }
    public DateTime ScanDate { get; set; }
}
Now in my code, I have a giant list of these: hundreds, maybe a few thousand.
Each data point object belongs to some type of scanner, and has a scan date. I want to remove any data points that were scanned on the same day except for the last one for a given machine.
I tried using LINQ as follows but this did not work. I still have many duplicate data points.
this.allData = this.allData.GroupBy(g => g.ScannerID)
    .Select(s => s.OrderByDescending(o => o.ScanDate))
    .First()
    .ToList();
I need to group the data points by scanner ID, because there could be data points scanned on the same day but on a different machine. I only need the last data point for a day if there are multiple.
Edit for clarification - By last data point I mean the last scanned data point for a given scan date for a given machine. I hope that helps. So when grouping by scanner ID, I then tried to order by scan date and then only keep the last scan date for days with multiple scans.
Here is some test data for 2 machines:
Unique ID Scanner ID Scan Date
A1JN221169H07 49374 2003-02-21 15:12:53.000
A1JN22116BK08 49374 2003-02-21 15:14:08.000
A1JN22116DN09 49374 2003-02-21 15:15:23.000
A1JN22116FP0A 49374 2003-02-21 15:16:37.000
A1JOA050U900J 80354 2004-10-05 10:53:24.000
A1JOA050UB30K 80354 2004-10-05 10:54:39.000
A1JOA050UD60L 80354 2004-10-05 10:55:54.000
A1JOA050UF80M 80354 2004-10-05 10:57:08.000
A1JOA0600O202 80354 2004-10-06 08:38:26.000
I want to remove any data points that were scanned on the same day except for the last one for a given machine.
So I assume you want to group by both ScanDate and ScannerID. Here is the code:
var result = dataPoints.GroupBy(i => new { i.ScanDate.Date, i.ScannerID })
    .OrderByDescending(i => i.Key.Date)
    .Select(i => i.First())
    .ToList();
If I understand you correctly this is what you want.
var result = dataPoints.GroupBy(i => new { i.ScanDate.Date, i.ScannerID })
    .Select(i => i.OrderBy(x => x.ScanDate).Last())
    .ToList();
This groups by the scanner ID and the day (ScanDate.Date zeroes out the time portion), then for each grouping it orders by ScanDate (since each group covers a single day, this orders on the time) and takes the last element. So for each day you will get one result per scanner, holding the latest ScanDate for that particular day.
Just as an aside, the class could be defined as:

public class DataPoint
{
    public DataPoint(string uid)
    {
        UniqueId = uid;
    }

    public string UniqueId { get; private set; }
    public string ScannerID { get; set; }
    public DateTime ScanDate { get; set; }
}

Should I introduce redundancy into model design

I am trying to design a new system for tracking sales. A simplistic version of my data models are:
public class Sale
{
    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public virtual List<SaleItem> SaleItems { get; set; }

    public decimal Total
    {
        get
        {
            return SaleItems.Sum(i => i.Price);
        }
    }
}

public class SaleItem
{
    public int SaleItemId { get; set; }
    public decimal Price { get; set; }
    public int SaleId { get; set; }
    public virtual Sale Sale { get; set; }
}
I am now writing some reports which total the sales value for a specified period. I have the following code to do that:
List<Sale> dailySales = db.Sales
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) >= fromParam)
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) <= toParam)
    .ToList();

decimal total = dailySales.Sum(x => x.Total);
This is working OK and giving me the expected result. I feel like this might give me problems further down the line once large datasets get involved, though. I assume having to load all the Sales into a list will become resource intensive; plus, my actual implementation has tax, costs, etc. associated with each SaleItem, so it gets more complex again.
The following would allow me to do all the processing on the database, however it is not possible to do this as the DB does not have a representation for Total, so EF throws an error:
Decimal total = db.Sales.Sum(x=>x.Total);
Which leads me to my question. I could set my model up as follows and, each time I add a SaleItem, make sure I update the Total:
public class Sale
{
    ...
    public decimal Total { get; set; }
}
This would then allow me to query the database as required, and I assume it will be less resource intensive. The flip side, though, is that I have introduced redundancy into the database. Is the latter the better way of dealing with this, or is there an alternative I haven't even considered?
It depends on many factors. For instance, how often will you need the Total amount, and how many SaleItems are there usually in a Sale?
If we're talking about, say, a supermarket kind of sale with at most a couple of hundred items, it's quite OK to just calculate it on the fly. Then again, once this gets mapped to an RDBMS with all the SaleItems in one single table, an index on the foreign key (which links each individual SaleItem to its Sale) is a must, otherwise performance will take a huge hit once you have millions of transactions to sift through.
Answering the second half of your question: having redundancy is not always a bad thing; you just need to make sure that whenever a Sale's item list is modified, the Total is recalculated at the end. It's slightly dangerous (redundancy always carries this burden), but you just need to ensure that whatever can change the Sale does so in a way (maybe even with a trigger in the RDBMS) that automatically recalculates the Total.
Hope it helps!
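One way to do that on the application side (a sketch only, not the poster's code, and it assumes the persisted-Total variant of the model) is to funnel every change through the Sale itself so the stored Total can never drift:

using System;
using System.Collections.Generic;

public class Sale
{
    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public virtual List<SaleItem> SaleItems { get; set; } = new List<SaleItem>();

    // Persisted column; only updated through AddItem/RemoveItem below.
    public decimal Total { get; private set; }

    public void AddItem(SaleItem item)
    {
        SaleItems.Add(item);
        Total += item.Price;
    }

    public void RemoveItem(SaleItem item)
    {
        if (SaleItems.Remove(item))
            Total -= item.Price;
    }
}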
You're right that it's much more effective to calculate totals on the DB side instead of loading the whole list and calculating them in the application.
I think you're missing that you can write a LINQ query that gets the SUM of the related child entities.
using (var ctx = new MyDbContext())
{
    var totalSales = ctx.Sales
        .Select(s => s.SaleItems.Sum(si => si.Price)) // total of each Sale
        .Sum(tsi => tsi); // sum of the per-sale totals
}
You can of course shape the query to bring back additional information, projecting the result into an anonymous class or into a class created ad hoc for this purpose.
Of course, this EF query is translated into a SQL query and executed on the server side.
When you start using LINQ to Entities it's not always obvious how to get what you want, but on most occasions you can do it.
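For instance, the report from the question can be shaped the same way, keeping the filtering and summing on the server (a sketch; fromParam, toParam, and MyDbContext are taken from the question and the snippet above):

using (var ctx = new MyDbContext())
{
    var saleTotals = ctx.Sales
        .Where(s => s.CompletedDateTime >= fromParam && s.CompletedDateTime <= toParam)
        .Select(s => new
        {
            s.SaleId,
            s.CompletedDateTime,
            // Cast to decimal? so a sale with no items sums to null (then 0) instead of throwing.
            Total = s.SaleItems.Sum(si => (decimal?)si.Price) ?? 0m
        })
        .ToList();

    decimal grandTotal = saleTotals.Sum(x => x.Total);
}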

Slow Query Compilation in C# with EntityFramework 4.4 When Using 100's of Id's

I was hoping I could get some help with a performance problem I'm having in EntityFramework 4.4. I'm working on converting an application that was using EDMX files over to code first and I've run into a problem when running queries with a large number of objects in the "where" clause of the LINQ query.
Here's a short overview of how everything is laid out (Entity doesn't refer to EF, it's the name given to a generic "thing" in our code):
public class ExampleDbContext : DbContext
{
    public DbSet<EntityTag> EntityTags { get; set; }
    public DbSet<Entity> Entities { get; set; }
    public DbSet<Log> Logs { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Fluent mappings added to modelBuilder.Configurations.Add() in here
    }
}

public class EntityTag
{
    public int Id { get; set; }
    public virtual Entity Entity { get; set; }
    public int EntityId { get; set; }
    public virtual Log Deleted { get; set; }
    public int? DeletedId { get; set; }
}

public class Entity
{
    public int Id { get; set; }
    public byte[] CompositeId { get; set; }
}

// Used to log when an event happens
public class Log
{
    public int Id { get; set; }
    public string Username { get; set; }
    public DateTime Timestamp { get; set; }
}
The query I'm running that causes the problem is:
// Creates an IEnumerable<byte[]> with the keys to find
var computedKeys = CreateCompositeIDs(entityKeys);

// Run the query and find any EntityTag that isn't deleted and is in
// the computedKeys list
var result = from et in Context.EntityTags
             where computedKeys.Contains(et.Entity.CompositeId) &&
                   et.Deleted == null
             select et;
var entityTags = result.ToList();
When computedKeys contains only a few IDs (15, for example) the code and query run quickly. When I have a large number of IDs (1,600 is normal at this point and it could get higher), it takes minutes to run the query once it's enumerated with ToList() (that's at 500; I haven't even tried 1,500 yet). I've also removed the computedKeys.Contains() clause (leaving et.Deleted) from a query with a large number of computedKeys, and it ends up running quickly.
Through debugging I've determined that creating the list of keys is fast, so that's not the problem. When I hook a profiler up to MSSQL to see the query that's generated, it looks normal in that all of the CompositeIds are included in a WHERE CompositeId IN ( /* list of IDs, could be 1,500 of them */ ), and when the query shows up in the profiler it executes in less than a second, so I don't think it's a database optimization issue either. The profiler sits there with nothing showing up for the entire time the code is running, aside from the last second or so when it quickly returns a result.
I hooked up dotTrace and it looks like a lot of the time is spent within System.Data.Query.PlanCompiler.JoinGraph.GenerateTransitiveEdge(JoinEdge, JoinEdge) (119,640 ms), and System.Collections.Generic.List`1+Enumerator.MoveNext (54,270 ms) is called within that method twice, I think, based on the total execution time for each of them.
I just can't seem to figure out why it's taking so long to generate the query. It doesn't seem to be any faster the second time it executes after compiling, either, so it doesn't look like it's being cached.
Thanks in advance for the help!
I was able to figure it out. Once I decided not to be held to the original query and reconsidered the result, I rewrote the query to be:
var computedKeys = CreateCompositeIDs(entityKeys);

var entityTags = (from e in Context.Entities
                  where computedKeys.Contains(e.CompositeId)
                  from et in e.Tags
                  select et).Distinct();

entityTags = from et in entityTags
             where et.Deleted == null
             select et;

return entityTags;
When I started querying the entities directly and took advantage of the relationship to EntityTag (which I forgot to include in the original question...) via Tags, and then filtered to only the existing EntityTags, the query sped up to the point where it all runs in under one second.

Entity Framework Performance Issue

I am running into an interesting performance issue with Entity Framework. I am using Code First.
Here is the structure of my entities:
A Book can have many Reviews.
A Review is associated with a single Book.
A Review can have one or many Comments.
A Comment is associated with one Review.
public class Book
{
    public int BookId { get; set; }
    // ...
    public ICollection<Review> Reviews { get; set; }
}

public class Review
{
    public int ReviewId { get; set; }
    public int BookId { get; set; }
    public Book Book { get; set; }
    public ICollection<Comment> Comments { get; set; }
}

public class Comment
{
    public int CommentId { get; set; }
    public int ReviewId { get; set; }
    public Review Review { get; set; }
}
I populated my database with a lot of data and added the proper indexes. I am trying to retrieve a single book that has 10,000 reviews on it using this query:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
    .Include(b => b.Reviews)
    .FirstOrDefault();
This particular book has 10,000 reviews. The performance of this query is around 4 seconds. Running the exact same query (via SQL Profiler) actually returns in no time at all. I used the same query and a SqlDataAdapter and custom objects to retrieve the data and it happens in under 500 milliseconds.
Using ANTS Performance Profiler, it looks like the bulk of the time is being spent doing a few different things:
The Equals method is being called 50 million times.
Does anyone know why it would need to call this 50 million times and how I could increase the performance for this?
Why is Equals called 50M times?
It sounds quite suspicious. You have 10,000 reviews and 50,000,000 calls to Equals. Suppose this is caused by the identity map that EF implements internally. The identity map ensures that each entity with a unique key is tracked by the context only once, so if the context already has an instance with the same key as a record loaded from the database, it will not materialize a new instance but will use the existing one instead. Now, how can this produce those numbers? My terrifying guess:
1st record read      | 0 comparisons
2nd record read      | 1 comparison
3rd record read      | 2 comparisons
...
10,000th record read | 9,999 comparisons
That means that each new record is compared with every existing record in the identity map. To get the total number of comparisons we can sum an arithmetic sequence:

a(n) = a(n-1) + 1
Sum(n) = (n / 2) * (a(1) + a(n))
Sum(10,000) = 5,000 * (0 + 9,999) ≈ 5,000 * 10,000 = 50,000,000

I hope I didn't make a mistake in my assumptions or calculation. Wait! I hope I did make a mistake, because this doesn't look good.
Try turning off change tracking, which should hopefully also turn off the identity map checking. It can be tricky. Start with:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
    .Include(b => b.Reviews)
    .AsNoTracking()
    .FirstOrDefault();
But there is a big chance that your navigation property will not be populated (because it is handled by change tracking). In such case use this approach:
var book = db.Books.Where(b => b.BookId == id).AsNoTracking().FirstOrDefault();
book.Reviews = db.Reviews.Where(r => r.BookId == id).AsNoTracking().ToList();
Anyway can you see what object type is passed to Equals? I think it should compare only primary keys and even 50M integer comparisons should not be such a problem.
As a side note, EF is slow; it is a well-known fact. It also uses reflection internally when materializing entities, so simply materializing 10,000 records can take "some time". Unless you have already done so, you can also turn off dynamic proxy creation (db.Configuration.ProxyCreationEnabled).
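For example (a sketch, assuming the code-first DbContext API; BookContext is an illustrative name), proxy creation and lazy loading can be switched off in the context constructor for read-heavy scenarios:

using System.Data.Entity;

public class BookContext : DbContext
{
    public DbSet<Book> Books { get; set; }
    public DbSet<Review> Reviews { get; set; }

    public BookContext()
    {
        Configuration.ProxyCreationEnabled = false; // entities are materialized as plain POCOs
        Configuration.LazyLoadingEnabled = false;   // rely on explicit Include() instead
    }
}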
I know this sounds lame, but have you tried the other way around, e.g.:
var reviewsAndBooks = db.Reviews.Where(r => r.Book.BookId == id)
    .Include(r => r.Book);
I have noticed sometimes better performance from EF when you approach your queries this way (but I haven't had the time to figure out why).
