Should I introduce redundancy into model design - C#

I am trying to design a new system for tracking sales. A simplistic version of my data models are:
public class Sale
{
    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public virtual List<SaleItem> SaleItems { get; set; }
    public decimal Total
    {
        get
        {
            return SaleItems.Sum(i => i.Price);
        }
    }
}

public class SaleItem
{
    public int SaleItemId { get; set; }
    public decimal Price { get; set; }
    public int SaleId { get; set; }
    public virtual Sale Sale { get; set; }
}
I am now writing some reports which total the sales value for a specified period. I have the following code to do that:
List<Sale> dailySales = db.Sales
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) >= fromParam)
    .Where(x => DbFunctions.TruncateTime(x.CompletedDateTime) <= toParam)
    .ToList();
decimal total = dailySales.Sum(x => x.Total);
This is working OK and giving me the expected result, but I feel it might give me problems further down the line once large datasets get involved. I assume having to load all the Sales into a list will become resource intensive; plus my actual implementation has tax, costs, etc. associated with each SaleItem, so it gets more complex again.
The following would allow me to do all the processing on the database, however it is not possible to do this as the DB does not have a representation for Total, so EF throws an error:
Decimal total = db.Sales.Sum(x=>x.Total);
Which leads me to my question. I could set up my model as follows and, each time I add a SaleItem, make sure I update the Total:
public class Sale
{
    ...
    public decimal Total { get; set; }
}
This would then allow me to query the database as required and, I assume, would be less resource intensive. The flip side, though, is that I have introduced redundancy into the database. Is the latter the better way of dealing with this, or is there an alternative method I haven't even considered that is better?

It depends on many factors. For instance, how often will you require the "Total" amount to be available? And how many SaleItems are there usually in a Sale?
If we're talking about, say, a supermarket kind of sale with at most a couple of hundred items, it's quite okay to just calculate the total on the fly. Then again, if this ever gets mapped to an RDBMS and you keep all the SaleItems in one single table, an index on the foreign key (which links each individual SaleItem to its Sale) is a must; otherwise performance will take a huge hit once you have millions of transactions to sift through.
Answering the second half of your question: having redundancy is not always a bad thing. You just need to make sure that whenever a Sale's item list is modified, the Total is recalculated at the end of it. It's slightly dangerous (redundancy always carries this burden), but you need to ensure that whatever has the potential to change a Sale does so in a way (maybe even with a trigger in the RDBMS) that recalculates the total automatically, as sketched below.
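A minimal sketch of that idea in C#, assuming items are always added through a single method on the Sale (the AddItem method and the stored Total setter are illustrative, not part of the original model; using System.Linq is assumed):
public class Sale
{
    public int SaleId { get; set; }
    public DateTime CompletedDateTime { get; set; }
    public virtual List<SaleItem> SaleItems { get; set; } = new List<SaleItem>();

    // Stored (redundant) total, kept in sync whenever the item list changes.
    public decimal Total { get; private set; }

    public void AddItem(SaleItem item)
    {
        SaleItems.Add(item);
        Total = SaleItems.Sum(i => i.Price); // recalculate the redundant value
    }
}
Any code path that bypasses AddItem (bulk imports, direct SQL) would still need its own recalculation, which is exactly the burden mentioned above.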
Hope it helps!

You're right that it's much more effective to calculate totals on the DB side instead of loading the whole list and calculating them in the application.
I think you're missing that you can make a LINQ query that gets the SUM of related children entities.
using (var ctx = new MyDbContext())
{
    var totalSales = ctx.Sales
        .Select(s => s.SaleItems.Sum(si => si.Price)) // Total of each Sale
        .Sum(tsi => tsi); // Sum of the total of each sale
}
You can of course shape the query to bring additional information, projecting the result into an anonymous class or into a class created ad hoc for this purpose.
Of course, this EF query will be translated into a SQL query and executed on the server side.
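For instance, a sketch of such a projection (the properties chosen for the anonymous type are just an illustration):
using (var ctx = new MyDbContext())
{
    var salesSummary = ctx.Sales
        .Select(s => new
        {
            s.SaleId,
            s.CompletedDateTime,
            // The cast to decimal? avoids an exception for sales with no items,
            // because SQL SUM over an empty set returns NULL.
            Total = s.SaleItems.Sum(si => (decimal?)si.Price) ?? 0m
        })
        .ToList();
}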
When you start using LINQ to EF it's not very obvious how to get what you want, but on most occasions you can do it.

Related

How to populate objects with relationship from datatable?

I am having trouble designing an approach for taking data from a CSV into business objects. I'm starting by parsing the CSV and getting each row into a DataTable and that is where my mental block starts.
I've got the following classes where APDistribution is considered a child of Voucher with a 1:Many relationship:
public class Voucher
{
    public string GPVoucherNumber { get; set; }
    public string VendorID { get; set; }
    public string TransactionDescription { get; set; }
    public string Title { get; set; }
    public string DocNumber { get; set; }
    public DateTime DocDate { get; set; }
    public decimal PurchaseAmount { get; set; }
    public IEnumerable<APDistribution> Distributions { get; set; }
}

public class APDistribution
{
    public string AccountNumber { get; set; }
    public decimal Debit { get; set; }
    public decimal Credit { get; set; }
    public string DistributionReference { get; set; }
}
My CSV looks like this: several fields repeat to represent the Voucher transaction (Vendor, Title, Invoice Number, Invoice Amount, etc.), and some fields are the Distribution detail (Journal Account Code, Journal Amount).
I began by thinking I could use Linq to project onto my business objects but I'm failing to see how I can structure the query to do that in one pass. I find myself wondering if I can do one query to project into a Voucher collection, one to project into an APDistribution collection, and then some sort of code to properly associate them.
I started with the following where I am grouping by the fields that should uniquely define a Voucher, but that doesn't work because the projection is dealing with an anonymous type instead of the DataRow.
var vouchers =
    from row in invoicesTable.AsEnumerable()
    group row by new { vendor = row.Field<string>("Vendor Code"), invoice = row.Field<string>("Vendor Invoice Number") } into rowGroup
    select new Voucher
    {
        VendorID = rowGroup.Field<string>("Vendor Code")
    };
Is this achievable without introducing complex Linq that a future developer (myself included) could have difficulty understanding/maintaining? Is there a simpler approach without Linq that I'm overlooking?
The general idea is:
invoicesTable
    .AsEnumerable()
    .GroupBy(row => new
    {
        Vendor = row.Field<string>("Vendor Code"),
        Invoice = row.Field<string>("Vendor Invoice Number")
    })
    .Select(grouping => new Voucher
    {
        VendorID = grouping.First().Field<string>("Vendor Code"), /* and so on */
        Distributions = grouping.Select(somerow => new APDistribution
        {
            AccountNumber = somerow.Field<string>("Journal Account Code") /* and so on */
        }).ToList()
    });
But this is not the most elegant way.
You are looking for a LINQ join. See the LINQ join documentation for greater depth.
Where you appear to be running into trouble, however, is that your two objects need something for the query to compare against, like maybe adding public string VendorID { get; set; } to the APDistribution class, if possible. I would assume the CSV files have something that ties an APDistribution back to a Voucher, so whatever it is, make sure it's in both classes so you can relate one to the other. The name doesn't need to be the same in both classes, but it should be. More importantly, you now have something that an equality comparer can use for the join operation.
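A sketch of what that amended class could look like (the added VendorID key is an assumption about what your CSV provides, not part of the original model):
public class APDistribution
{
    // Added so each distribution can be related back to its Voucher in the join.
    public string VendorID { get; set; }

    public string AccountNumber { get; set; }
    public decimal Debit { get; set; }
    public decimal Credit { get; set; }
    public string DistributionReference { get; set; }
}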
Now personally, I don't like big gnarly queries if I can break them apart and make things easier. Too much to reason about all at once, and you've indicated that you agree. So my approach is to divide and conquer as follows.
First, run queries to project the CSV data into discrete objects, like so:
var voucherRows =
    from row in invoicesTable.AsEnumerable()
    select new Voucher
    {
        VendorID = row.Field<string>("Vendor Code")
        // other properties to populate
    };
and
var distributionRows =
    from row in distributionsTable.AsEnumerable()
    select new APDistribution
    {
        VendorID = row.Field<string>("Vendor Code")
        // other properties to populate
    };
At this point you have 2 data sets that are related in domain terms but not yet associated in code. Now you can compose the queries together in the Join query and the join starts to look a lot easier, maybe something like:
var vouchers =
    from row in voucherRows
    join dist in distributionRows
        on row.VendorID equals dist.VendorID
        into distGroup
    select new Voucher
    {
        VendorID = row.VendorID,
        // other properties to populate
        Distributions = distGroup.ToList()
    };
You'll have to modify the queries to your needs, but this breaks them down into 3 distinct operations that are all designed to do 1 thing, thus easier to read, reason about, debug, and modify later. If you need to group the vouchers you can at this point, but this should get you moving. Also, if needed, you can add a validation step or other processing in between the initial CSV queries and the join and you don't have to rewrite your queries, with the exception of changing some input variable names on the join.
Also, disclaimer that I did NOT build these queries in an IDE before posting so you may have some typos or missed symbols to deal with, but I'm pretty sure I have it right. Sorry in advance if you find anything aggravating.
While Linq can be cool and add efficiencies, it doesn't add value if you can't be sure the code is correct today, and can't understand it tomorrow. Maybe using Linq in this case is Premature Optimization.
Start with a non-Linq solution that is verifiably accurate without being needlessly inefficient, and then optimize later if performance becomes a problem.
Here's how I might tackle this:
var vouchers = new Dictionary<string, Voucher>();
// Assumes Voucher.Distributions is declared as List<APDistribution> (or IList<APDistribution>)
// so that Add is available.
foreach (DataRow row in invoicesTable.Rows)
{
    string vendor = row.Field<string>("Vendor Code");
    string invoice = row.Field<string>("Vendor Invoice Number");
    string voucherKey = vendor + "|" + invoice;
    if (!vouchers.ContainsKey(voucherKey))
    {
        vouchers.Add(voucherKey, new Voucher
        {
            VendorID = vendor,
            DocNumber = invoice,
            Distributions = new List<APDistribution>()
        });
    }
    vouchers[voucherKey].Distributions.Add(new APDistribution
    {
        AccountNumber = row.Field<string>("Journal Account Code")
    });
}
If this will be processing a large number of rows, you can tune it a bit by preallocating the Dictionary to an estimate of the number of unique vouchers:
var vouchers = new Dictionary<string, Voucher>((int)(invoicesTable.Rows.Count * 0.8));

Calculated fields that improve performance but need to be maintained (EF)

I have this "1 to N" model:
class Reception
{
    public int ReceptionId { get; set; }
    public string Code { get; set; }
    public virtual List<Item> Items { get; set; }
}

class Item
{
    public int ItemId { get; set; }
    public string Code { get; set; }
    public int Quantity { get; set; }
    public int ReceptionId { get; set; }
    public virtual Reception Reception { get; set; }
}
And this action, api/receptions/list
public JsonResult List()
{
    return Json(dbContext.Receptions
        .Select(e => new
        {
            code = e.Code,
            itemsCount = e.Items.Count,
            quantity = e.Items.Sum(i => i.Quantity)
        })
        .ToList());
}
which returns a list of receptions, with their number of items:
[
{code:"1231",itemsCount:10,quantity:30},
{code:"1232",itemsCount:5,quantity:70},
{code:"1234",itemsCount:30,quantity:600},
...
]
This was working fine, but now I have too many Receptions and Items, so the query is taking too long...
So I want to speed up by adding some persisted fields to Reception:
class Reception
{
    public int ReceptionId { get; set; }
    public string Code { get; set; }
    public virtual List<Item> Items { get; set; }

    public int ItemsCount { get; set; } // Persisted
    public int Quantity { get; set; }   // Persisted
}
With this change, the query ends up being this:
public JsonResult List()
{
    return Json(dbContext.Receptions
        .Select(e => new
        {
            code = e.Code,
            itemsCount = e.ItemsCount,
            quantity = e.Quantity
        })
        .ToList());
}
My question is:
What's the best way to maintain these two fields?
I will gain performance, but now I will need to be more careful with the creation of Items.
Today an Item can be created, edited and deleted:
api/items/create?receptionId=...
api/items/edit?itemId=...
api/items/delete?itemId=...
I also have a tool for importing receptions via Excel:
api/items/createBulk?...
Maybe tomorrow I will have more ways of creating Items, so the question is: how do I make sure that these two new fields, ItemsCount and Quantity, will always be up to date?
Should I create a method within Reception like this?
class Reception
{
    ...
    public void UpdateMaintainedFields()
    {
        this.Quantity = this.Items.Sum(e => e.Quantity);
        this.ItemsCount = this.Items.Count();
    }
}
And then REMEMBER to call it from all the previous URLs (items/create, items/edit, ...)?
Or maybe should I have a stored procedure in the database?
What is the common practice? I know there are computed columns, but those generally refer only to other columns of the same row. There are also indexed views, but I'm not sure whether they apply well to scenarios like this.
From your code it seems to me that you do not have a layer for business logic; everything is implemented in the controllers. This causes a problem: when you add a different way of creating Items (and it seems you mean a different controller), you have to implement this logic again, which is easy to forget, and even if you remember it now, it is easy to forget to maintain later.
So I would recommend having a business logic layer (for operations like adding new items) and using it from every controller that needs to create items.
I would also recommend writing the UpdateMaintainedFields method as you suggested, but calling it in the business logic layer after adding the items, not in the controllers!
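A minimal sketch of such a business logic layer (the ItemService class, its method, and the MyDbContext name are illustrative, not an existing API):
public class ItemService
{
    private readonly MyDbContext dbContext;

    public ItemService(MyDbContext dbContext)
    {
        this.dbContext = dbContext;
    }

    // Every entry point (create, edit, bulk import, ...) goes through this method,
    // so the maintained fields are updated in exactly one place.
    public void AddItem(int receptionId, Item item)
    {
        var reception = dbContext.Receptions.Find(receptionId);
        reception.Items.Add(item);
        reception.UpdateMaintainedFields();
        dbContext.SaveChanges();
    }
}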
You could also implement the logic in the database (as a trigger) if you can accept that you can't unit test it.
Assuming the original query cannot be improved with the correct execution plan in SQL Server, the way to update these fields is via a trigger in the DB. When an insert occurs on that table (or possibly an update, if your persisted fields change according to the data), the trigger runs and is responsible for updating the affected rows with the new values.
Obviously your insert performance would drop, but your query performance would be that of a simple index lookup reading a single row. Note that you wouldn't be able to use this trick if you were to return only a subset of the table, as the precalculated quantities would be fixed.
An alternative is to hold the count and quantity sums in a separate table, or in a dummy row that holds the summed quantities as its entry for quantity. YMMV.
PS: I hate how what is really a SQL question has been turned into one about C# code! Learn SQL and run the queries you need directly in the DB; that will show you much more about the performance and structure of what you're looking for than getting EF involved. /rant :)
You want to store the same information redundantly, which can lead to inconsistencies. As an inspiration: indexes also duplicate data. How do you update them? You don't; it is all fully transparent. I would recommend the same approach here.
Make a summary table, maintained by triggers. The table would not be included in any data context schema; the only way to read it would be through non-updatable views or stored procedures. Its name should make it obvious that nobody should ever touch this table directly.
You can then access your data from various frameworks without worrying about updating anything. The database ensures the precalculated sums are always correct, as long as you do not write to the summary table yourself. In fact, you can add or remove this table at any time and no application would even notice.
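If you go that route, the application could read the precomputed sums through a view or stored procedure rather than a mapped entity; a rough EF6-style sketch (the view name and the ReceptionSummary result class are illustrative):
// Illustrative shape of the precomputed sums.
public class ReceptionSummary
{
    public int ReceptionId { get; set; }
    public int ItemsCount { get; set; }
    public int Quantity { get; set; }
}

// Read-only access to the trigger-maintained data via a view.
var summaries = dbContext.Database
    .SqlQuery<ReceptionSummary>(
        "SELECT ReceptionId, ItemsCount, Quantity FROM dbo.vReceptionSummary")
    .ToList();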

Algorithm to calculate frequency and recency of an entity?

I have a list of entities opened by various users.
I keep track of each access to any entity by storing access dates and times as follows:
public class Entity
{
    public int Id { get; set; }
    public virtual ICollection<AccessInfo> Accesses { get; set; }
        = new HashSet<AccessInfo>();
}

public class AccessInfo
{
    public int Id { get; set; }
    public AccessInfoType Type { get; set; }
    public User User { get; set; }
    public DateTime DateTime { get; set; }
}

public enum AccessInfoType
{
    Create,
    Read,
    Update,
    Delete,
}
Now I'm trying to make an algorithm that filters the most wanted contacts based on both factors: recency and frequency.
I want contacts that were accessed 5 times yesterday to be prioritized over a contact that was accessed 30 times a week ago. On the other hand, a contact that was accessed only once today is less important.
Is there an official name for this? I'm sure people have worked on a frequency calculation like this one before, and I'd like to read about this before I spend some time coding.
I thought about calculating the sum of the access dates over the recent month and sorting accordingly, but I'm still not sure it's the right way; I'd love to learn from the experts.
return Entities
    .OrderBy(c =>
        c.Accesses
            .Where(a => a.Employee.UserName == UserName)
            .Where(a => a.DateTime > lastMonth)
            .Select(a => a.DateTime.Ticks)
            .Sum());
Exponential decay is what you're looking for. See this link:
http://www.evanmiller.org/rank-hotness-with-newtons-law-of-cooling.html
I would use a heuristic that assigns points to Entities for access and uses some kind of decay on those points.
For example, you could give an entity 1 point every time it is accessed, and once every day multiply all the points by a factor of 0.8; a sketch of an equivalent continuous decay follows below.
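A minimal sketch of such a decayed score, computed in memory from the stored access times (the seven-day time constant, the UserName property on User, and the helper class are all assumptions made to illustrate the idea):
static class EntityRanking
{
    // With tau = 7 days, an access from a week ago counts about 37% of a fresh one.
    const double TauDays = 7.0;

    // Sum of exp(-age / tau) over the user's accesses: frequent AND recent wins.
    public static double Score(Entity entity, string userName, DateTime now)
    {
        return entity.Accesses
            .Where(a => a.User.UserName == userName)
            .Sum(a => Math.Exp(-(now - a.DateTime).TotalDays / TauDays));
    }
}

// Usage: Math.Exp won't translate to SQL, so materialize first and score in memory.
var ranked = Entities
    .AsEnumerable()
    .OrderByDescending(e => EntityRanking.Score(e, UserName, DateTime.UtcNow))
    .ToList();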

Calculated Property in Entity Framework

I'm stuck on a problem I have been looking for a solution to, but without much luck.
Using Entity Framework Code First, I need the ability to create a calculated property that does not rely on loading all of the objects before calculating.
// Pseudo code for what I need
public class GoodInventoryChange
{
    public int GoodID { get; set; }
    public double Amount { get; set; }                    // Amount of change
    public DateTime OccurredAt { get; set; }              // Timestamp of the change
    public double RunningTotal { get { /* CODE TBD */ } } // Prior record's total plus amount
}
All of the suggestions I have found on how to do this require calling .ToList() or similar, which may load many thousands of records in order to find a single entry.
In the end, I need the ability to query for:
// Pseudo code
int goodID = 123;
var lowestRunningTotal = (from item in Context.GoodInventoryChanges
                          where item.GoodID == goodID && DateTime.Now <= item.OccurredAt
                          orderby item.RunningTotal
                          select item).FirstOrDefault();
I am using RunningTotal as an example here, but I have about 15-20 fields that need to be calculated in a similar fashion.
Does anyone have any advice or direction to point me in? I know I can brute force it, but I am hoping to do it via the SQL layer of Entity Framework.
I am OK creating calculated fields in the DB if there is a nice way to map them to Entity Framework classes as well.
You can use computed columns in the database and decorate your entity with the DatabaseGenerated attribute to prevent EF from trying to write the value back to the table. EF will read the value on load, and read it back after you insert or update:
[DatabaseGenerated(DatabaseGeneratedOption.Computed)]
public string YourComputedProperty { get; set; }
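The computed column itself still has to be created in the database; with EF6-style code-first migrations that is typically done with raw SQL. A rough sketch (the table name, column, and formula are illustrative; note that a per-row computed column cannot express a true running total, which would instead need a view with SUM(...) OVER (...) or a trigger):
public partial class AddComputedColumns : DbMigration
{
    public override void Up()
    {
        // Simple same-row computed column as an example.
        Sql(@"ALTER TABLE dbo.GoodInventoryChanges
              ADD AmountWithTax AS (Amount * 1.08) PERSISTED;");
    }

    public override void Down()
    {
        DropColumn("dbo.GoodInventoryChanges", "AmountWithTax");
    }
}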

Entity Framework Performance Issue

I am running into an interesting performance issue with Entity Framework. I am using Code First.
Here is the structure of my entities:
A Book can have many Reviews.
A Review is associated with a single Book.
A Review can have one or many Comments.
A Comment is associated with one Review.
public class Book
{
    public int BookId { get; set; }
    // ...
    public ICollection<Review> Reviews { get; set; }
}

public class Review
{
    public int ReviewId { get; set; }
    public int BookId { get; set; }
    public Book Book { get; set; }
    public ICollection<Comment> Comments { get; set; }
}

public class Comment
{
    public int CommentId { get; set; }
    public int ReviewId { get; set; }
    public Review Review { get; set; }
}
I populated my database with a lot of data and added the proper indexes. I am trying to retrieve a single book that has 10,000 reviews on it using this query:
var bookAndReviews = db.Books
    .Where(b => b.BookId == id)
    .Include(b => b.Reviews)
    .FirstOrDefault();
This particular book has 10,000 reviews. This query takes around 4 seconds. Running the exact same SQL (captured via SQL Profiler) directly against the database returns in no time at all. I used the same query with a SqlDataAdapter and custom objects to retrieve the data, and it completes in under 500 milliseconds.
Using ANTS Performance Profiler, it looks like the bulk of the time is being spent in a few places; most notably, the Equals method is being called 50 million times.
Does anyone know why it would need to call this 50 million times and how I could increase the performance for this?
Why is Equals called 50M times?
It sounds quite suspicious. You have 10,000 reviews and 50,000,000 calls to Equals. Suppose this is caused by the identity map internally implemented by EF. The identity map ensures that each entity with a unique key is tracked by the context only once, so if the context already has an instance with the same key as a record loaded from the database, it will not materialize a new instance and instead uses the existing one. Now, how can this coincide with those numbers? My terrifying guess:
1st record read      |     0 comparisons
2nd record read      |     1 comparison
3rd record read      |     2 comparisons
...
10,000th record read | 9,999 comparisons
That means each new record is compared with every existing record in the identity map. Summing all those comparisons is just the sum of an arithmetic sequence:
a(n) = a(n-1) + 1
Sum(n) = (n / 2) * (a(1) + a(n))
Sum(10,000) = 5,000 * (0 + 9,999) = 49,995,000 ≈ 50,000,000
I hope I didn't make a mistake in my assumptions or calculation. Wait! I hope I did make a mistake, because this doesn't look good.
Try turning off change tracking, which should hopefully also turn off the identity map checking.
It can be tricky. Start with:
var bookAndReviews = db.Books
    .Where(b => b.BookId == id)
    .Include(b => b.Reviews)
    .AsNoTracking()
    .FirstOrDefault();
But there is a big chance that your navigation property will not be populated (because it is handled by change tracking). In such case use this approach:
var book = db.Books.Where(b => b.BookId == id).AsNoTracking().FirstOrDefault();
book.Reviews = db.Reviews.Where(r => r.BookId == id).AsNoTracking().ToList();
Anyway, can you see what object type is passed to Equals? I think it should compare only primary keys, and even 50M integer comparisons should not be such a problem.
As a side note, EF is slow; that is a well-known fact. It also uses reflection internally when materializing entities, so simply materializing 10,000 records can take "some time". Unless you have already done so, you can also turn off dynamic proxy creation (db.Configuration.ProxyCreationEnabled).
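A small sketch of that read-only configuration (the BookContext name is illustrative; the Configuration flags shown are the EF6 ones):
using (var db = new BookContext()) // illustrative context name
{
    // Read-path tweaks: no dynamic proxies, no automatic change detection.
    db.Configuration.ProxyCreationEnabled = false;
    db.Configuration.AutoDetectChangesEnabled = false;

    var book = db.Books
        .Where(b => b.BookId == id)
        .Include(b => b.Reviews)
        .AsNoTracking()
        .FirstOrDefault();
}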
I know this sounds lame, but have you tried the other way around, e.g.:
var reviewsAndBooks = db.Reviews
    .Where(r => r.Book.BookId == id)
    .Include(r => r.Book);
I have sometimes noticed better performance from EF when you approach your queries this way (but I haven't had the time to figure out why).
