This is more of a technical "how-to" or "best approach" question.
We have a current requirement to retrieve records from the database, place them into an in-memory list, and then perform a series of calculations on the data: maximum values, averages, and some more specific custom statistics as well.
Getting the data into an in-memory list is not a problem, as we use NHibernate as our ORM and it does an excellent job of retrieving data from the database. The advice I am seeking is how we should best perform calculations on the resulting list of data.
Ideally I would like to create a method for each statistic: MaximumValue(), AverageValueUnder100(), MoreComplicatedStatistic(), and so on, passing the required variables to each method and having it return the result. This approach would also make unit testing a breeze and provide us with excellent coverage.
Would there be a performance hit if we perform a LINQ query for each calculation, or should we consolidate as many statistic calls into as few LINQ queries as possible? For example, it doesn't make much sense to pass the list of data to a method called AverageValueBelow100 and then pass the entire list again to AverageValueBelow50, when both could effectively be computed in one LINQ query.
How can we achieve a high level of granularity and separation without sacrificing performance?
Any advice ... is the question clear enough?
Depending on the complexity of the calculation, it may be best to do it in the database. If it is complex enough that you need to bring the data in as objects and incur that overhead, you may want to avoid multiple iterations over your result set. You may want to consider using Aggregate. See http://geekswithblogs.net/malisancube/archive/2009/12/09/demystifying-linq-aggregates.aspx for a discussion of it. You would be able to unit test each aggregate separately, but then (potentially) project multiple aggregates within a single iteration.
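To make the single-iteration idea concrete, here is a minimal sketch, assuming a hypothetical Reading entity with a Value property: several aggregates are accumulated in one pass over the list, and each statistic method could still wrap its own piece of the result for unit testing.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity standing in for whatever NHibernate loads.
public class Reading
{
    public decimal Value { get; set; }
}

public static class SinglePassStats
{
    public static void Compute(IList<Reading> data)
    {
        // Project several aggregates in one iteration over the list.
        var acc = data.Aggregate(
            new { Count = 0, Sum = 0m, Max = decimal.MinValue },
            (a, r) => new
            {
                Count = a.Count + 1,
                Sum = a.Sum + r.Value,
                Max = Math.Max(a.Max, r.Value)
            });

        decimal average = acc.Count > 0 ? acc.Sum / acc.Count : 0m;
        decimal maximum = acc.Max;
    }
}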
I don't agree that it is best "to do it all in the database".
Well-written LINQ queries will result in good SQL queries being executed against the database, which should be good enough performance-wise (unless you are doing data-warehouse-style workloads). This assumes you are using the LINQ provider for NHibernate and not LINQ to Objects.
This approach reads well, is easy to change, and keeps your business logic in one place.
If it is too slow for your needs, you can check the generated SQL and tweak your LINQ queries, or try to precompile them; in the end you can still go back to writing the beloved stored procedures, and start spreading your business logic all over the place.
Will there be a performance hit? Yes, you might lose a few milliseconds, but isn't keeping your logic in one place worth that price?
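As a rough sketch of what this looks like with NHibernate's LINQ provider (Reading is a hypothetical mapped entity; Query<T> comes from NHibernate.Linq): each statistic stays its own small, testable method, but the work is translated to SQL, so only a scalar crosses the wire.

using System.Linq;
using NHibernate;
using NHibernate.Linq; // provides session.Query<T>()

public static class StatisticsQueries
{
    public static decimal MaximumValue(ISession session)
    {
        return session.Query<Reading>().Max(r => r.Value);
    }

    public static decimal AverageValueUnder100(ISession session)
    {
        return session.Query<Reading>()
                      .Where(r => r.Value < 100)
                      .Average(r => r.Value);
    }
}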
To answer the "I would like to create a method for each statistic" concern, I would suggest building a kind of Statistician class. Here is some pseudocode to express the idea:
class Statistician
{
    public bool MustCalculateFIRSTSTATISTIC { get; set; }  // Please rename me!
    public bool MustCalculateSECONDSTATISTIC { get; set; } // Please rename me!

    public void ProcessObject(object Object) // Replace object and rename
    {
        if (MustCalculateFIRSTSTATISTIC)
            CalculateFIRSTSTATISTIC(Object);
        if (MustCalculateSECONDSTATISTIC) // was testing FIRSTSTATISTIC twice
            CalculateSECONDSTATISTIC(Object);
    }

    public object GetFIRSTSTATISTIC() // Replace object, rename
    { /* ... */ }

    public object GetSECONDSTATISTIC() // Replace object, rename
    { /* ... */ }

    private void CalculateFIRSTSTATISTIC(object Object) // Replace object
    { /* ... */ }

    private void CalculateSECONDSTATISTIC(object Object) // Replace object
    { /* ... */ }
}
If I had to do this, I would probably try to make it generic and use collections of delegates instead of methods, but since I don't know your context, I'll leave it at that; a rough sketch of the delegate idea follows. Also note that I only used parameters of type object, but that's only because I'm not suggesting you use DataRows, entities, or what not; I'll leave that to the other folks who know more than I do on the subject!
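Here is that sketch; all names are illustrative. Each statistic registers an accumulator delegate, and the whole set runs in a single pass over the data, so adding a statistic never adds an iteration.

using System;
using System.Collections.Generic;

public class Statistician<T>
{
    private readonly Dictionary<string, Action<T>> accumulators =
        new Dictionary<string, Action<T>>();

    public void Register(string name, Action<T> accumulate)
    {
        accumulators[name] = accumulate;
    }

    public void Process(IEnumerable<T> items)
    {
        foreach (T item in items)                  // one iteration only
            foreach (Action<T> accumulate in accumulators.Values)
                accumulate(item);
    }
}

Usage would be something like stats.Register("Max", r => max = Math.Max(max, r.Value)) for each statistic, followed by a single Process(data) call.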
I have a fairly complex use case that requires performing the same SQL tasks conditionally in various parts of the code. I wanted to duplicate as little code as possible, so I built a few static helper methods that allow me to add some JOIN statements when needed.
I know this probably could've been done a bit more cleanly with extensions, but for now my code looks something like this:
static class Foo
{
// Actually adds some filters which need additional JOINs
public static IQueryOver<Transaction, Transaction> FromRetailer(Retailer retailer, IQueryOver<Transaction, Transaction> baseQuery = null)
{
RetailLocation retailLocation = null;
return ForRetailerBase(baseQuery)
.Where(t => retailLocation.Retailer == retailer);
}
// Auxiliary method which only adds some JOINs needed in various places
public static IQueryOver<Transaction, Transaction> ForRetailerBase(IQueryOver<Transaction, Transaction> baseQuery = null)
{
if (baseQuery == null)
baseQuery = QueryOver(); // Custom method that creates a vanilla IQueryOver instance
// Add all sorts of JOINs needed to query the retailer
return baseQuery
.JoinAlias(...)
.Left.JoinAlias(...)
// and so on
;
}
}
In the business logic, I either need to actually filter by retailer (in which case I call FromRetailer(), which calls ForRetailerBase() for me), or I don't need to filter by retailer – but I still need the JOINs added by ForRetailerBase() later on for grouping. Calling ForRetailerBase() unconditionally obviously breaks things when FromRetailer() is also called.
I'm currently solving this in a very clumsy fashion, by using a boolean in the business logic in order to execute ForRetailerBase() conditionally, only if FromRetailer() isn't executed.
I realize this could be fixed on two levels: either use a more adequate pattern altogether, or add those JOINs conditionally in ForRetailerBase(), by interrogating the baseQuery object to determine whether it already has the necessary JOINs. I'd rather go with the first approach, if one is available (this part of the code is still relatively young, and I can easily refactor it) – but I'll settle for the second approach as well. Problem is, I don't know how to advance in either direction.
I also realize the superficial solution is to remove the call to ForRetailerBase() from FromRetailer() and call it unconditionally from the business logic, but that's just as bad as my current solution, because it requires my business logic to know how those methods work internally.
ForRetailerBase and FromRetailer look to me like something the business logic should not know about at all. They look like query helpers, which should be handled by a query repository.
Such a repository would expose querying methods to the business layer, methods which would internally call your ForRetailerBase or FromRetailer as required.
This way, your business layer needs no knowledge of how to build the queries, and your querying logic is still factored out in one place, inside the repository.
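A rough sketch of that repository, reusing the types from your question (the repository and method names are hypothetical):

public class TransactionQueryRepository
{
    public IQueryOver<Transaction, Transaction> TransactionsForRetailer(Retailer retailer)
    {
        return Foo.FromRetailer(retailer);   // filter plus the JOINs it needs
    }

    public IQueryOver<Transaction, Transaction> TransactionsGroupedByRetailer()
    {
        return Foo.ForRetailerBase();        // JOINs only, no retailer filter
    }
}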
Side note: your question does not really look bound to the specific technologies you are using. It looks to me more like a code design question. Maybe you should ask it on https://softwareengineering.stackexchange.com/ instead, which is meant for such questions (see its on-topic page).
I'm developing an app for Windows Phone with SQLite and have a lot of custom SQL queries, like:
string query = @"SELECT distinct(destinations.name) as Destinations
                 FROM destinations, flights
                 WHERE destinations.d_ID = flights.d_ID
                 AND flights.Date = #" + date.ToShortDateString() + "#";
then run:
var result = (Application.Current as App).db.Query(query);
For working with SQLite I'm using http://dotnetslackers.com/articles/silverlight/Windows-Phone-7-Native-Database-Programming-via-Sqlite-Client-for-Windows-Phone.aspx#s2-introduction-to-sqlite-client-for-windows-phone and their DBHelper.
I want all the queries to be in one place so that I can change them quickly. I wanted to ask how to do this correctly:
create one static class
create Enum or Dictionary with queries collection
create some XML or similar file with collection
Thanks for any advice.
I don't think any of these approaches is a good fit, for the following reasons:
Create one static class
This is a God object and is considered an anti-pattern; best to stay away from it. It's just going to be a nightmare to maintain.
create Enum or Dictionary with queries collection
Instead of having a God object now, you have a God collection, and are really just implementing the same anti-pattern in a different way.
Additionally, you'll have string keys (or enum keys) and there's not a strong link between the two (what if the dictionary doesn't populate for some reason?).
create some XML or similar file with collection
It could be argued that you're doing the same thing you would be doing with a dictionary; you'd have to key the query somehow and then look it up. It's a very brittle approach.
Possible solution
I recommend that you first abstract your data layer into logical units: create one class per group of related data operations.
For example, if you have a few queries and operations that are related to destinations, create an interface that exposes those operations:
public interface IDestinationDataOperations
{
// Get destinations by date.
IEnumerable<string> GetDestinationsByDate(DateTime asOf);
}
Then, create a class that implements this interface and is specific to SQLite; at the call sites, declare the variable as the interface type. A sketch of what the SQLite-backed class might look like:
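This is only a hedged outline: the DBHelper from the linked article is not shown here, so ExecuteStrings() is a hypothetical stand-in for whatever row-mapping call it actually provides.

using System;
using System.Collections.Generic;

public class SqliteDestinationDataOperations : IDestinationDataOperations
{
    public IEnumerable<string> GetDestinationsByDate(DateTime asOf)
    {
        string query = "SELECT DISTINCT(destinations.name) AS Destinations " +
                       "FROM destinations " +
                       "JOIN flights ON destinations.d_ID = flights.d_ID " +
                       "WHERE flights.Date = ?";
        return ExecuteStrings(query, asOf.ToShortDateString());
    }

    private IEnumerable<string> ExecuteStrings(string sql, params object[] args)
    {
        // Delegate to the SQLite client / DBHelper of your choice.
        throw new NotImplementedException("wire this up to your data access helper");
    }
}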
The benefits of this are:
If you change the implementation from SQLite to some other underlying data store (a web service call, a JSON REST call, whatever), you only have to change where you populate the interface variable (this is where dependency injection begins to be of use), as all of your calls are made against the abstraction.
The interface is more easily testable:
You can test the direct implementation against any test data you want
For items that rely on the interface, you can mock the interface any way you like and not have an actual database underneath for testing.
Then, for other data operations, you can wash, rinse, and repeat.
For bonus points, you can separate the interface into a unit of work for writes and a repository for reads, depending on what best suits your needs.
First I want to explain my situation in a few sentences:
- I have a number of what I call lookup tables. This means I want all values in them to be unique, for example a lookup table for CPU models with a name and a GUID.
public partial class CPUModel : EntityObject
{
    public Guid Id { get; set; }
    public String Name { get; set; }
}
The entities are saved in a SQL CE database with the help of Entity Framework and C#.
I have made a CRUDManager which helps me to Select, Insert, Update and Delete Entities.
All these operations work with a ReaderWriterLockSlim to guard against multithreading problems.
Now there should be some sort of GetOrCreate method where I can say GetOrCreate(cpuModelName) and get back a saved CPUModel, either already existing or newly created. This method should also work with the ReaderWriterLockSlim.
So I would want to implement this method on CRUDManager.
Do you think I'm on the right track, or would you place this directly on the CPUModel (or even somewhere else)?
Thank you very much :)
CRUDManager is a good place for it. I would just look the value up (ideally via a Lookup() type function), and if it exists, select it. If not, call Create() or whatever function you have defined to create it.
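A minimal sketch of what that could look like on the CRUDManager, assuming the class already holds a ReaderWriterLockSlim (_lock) and an EF object context (_context) with a CPUModels set; all member names are assumptions.

using System;
using System.Linq;
using System.Threading;

public partial class CRUDManager
{
    public CPUModel GetOrCreate(string cpuModelName)
    {
        _lock.EnterUpgradeableReadLock();   // read first, upgrade only if needed
        try
        {
            CPUModel existing = _context.CPUModels
                                        .FirstOrDefault(m => m.Name == cpuModelName);
            if (existing != null)
                return existing;

            _lock.EnterWriteLock();
            try
            {
                var created = new CPUModel { Id = Guid.NewGuid(), Name = cpuModelName };
                _context.CPUModels.AddObject(created);
                _context.SaveChanges();
                return created;
            }
            finally
            {
                _lock.ExitWriteLock();
            }
        }
        finally
        {
            _lock.ExitUpgradeableReadLock();
        }
    }
}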
It sounds like you want a dictionary. I recommend reading about it here.
Currently, I am working on a system that does quite a bit of reporting-style work, consuming many different data points and transforming them into larger, sometimes flattened outputs. Most of my app is built upon a variation of the repository pattern, so I have a suite of mock repositories that I use for testing scenarios. The problem I am running into is that the interaction between these data points is so complex that it is quickly becoming a maintenance nightmare to keep the mock data consistent. Here is a mock example:
public class SomeReportingEntity
{
    private IProductRepo ProductRepo;
    private IManagerRepo ManagerRepo;
    private ILocationRepo LocationRepo;
    private IOrdersService OrdersService;
    private IEmployeeRepo EmployeeRepo;

    public SomeReportingEntity(IProductRepo ipr, IManagerRepo imr, ILocationRepo ilr,
                               IOrdersService ios, IEmployeeRepo ier)
    {
        // Load these to private vars...
    }

    // This is the function that I want to test...
    public List<SomeReportingEntity> GetManagerSalesByRegionReport()
    {
        // Make a complex join on all sub collections. These
        // sub collections are all under test individually.
        var managerSalesByRegionItems = from x in ProductRepo.CurrentProducts()
                                        join y in OrdersService.FutureOrders() on /* ... */
                                        join z in EmployeeRepo.ActiveEmployees() on /* ... */
                                        join a in LocationRepo.GetAllRegions() on /* ... */
                                        join b in ManagerRepo.GetActiveManagers() on /* ... */
                                        select new SomeReportingEntity { /* ... */ };
        return managerSalesByRegionItems.ToList();
    }
}
Admittedly, this is a very contrived example, but the basic idea I want to emphasize is that I am joining several repositories and need many tests to ensure this complex query does what is expected. Because the joins are so complex, the mock data is very difficult to keep in line, especially as I add more associations and test additional points. In addition, I need to be able to put specific record states into the mocks (such as an employee lacking an assigned manager) to verify that the query handles those situations appropriately.
So here are my questions:
What is the best way to "mock" this data so that it is not such a maintenance nightmare? Many people have suggested building an in-memory database to support this.
Am I really suffering from an architecture issue here? In reporting scenarios, I find myself in this pattern quite a bit where I take many disassociated data points and merge them into a new, hybrid entity. With the onset of Linq, it is very easy to do and has high clarity of intent, but sometimes it feels like I am cheating a little.
The first thing you want to do is make a centralized object that knows how to retrieve the data from the different repositories. Since this is reporting only, it's easier because you don't have to worry about change tracking.
From a logistical standpoint, one thing I would consider is making a local database to hold the remote data (updated periodically using agents). This would remove some of the issues of calling remote services and aggregating their data on the fly. You would also be able to pre-process some of the data at the start.
When I use the repository pattern, I couple it with the Unit Of Work pattern. The Unit of Work is the guy that does all the legwork for you. Theoretically, your UoW could bring in the data from the multiple services and present it to the repositories based on configuration.
For testing, you can use an InMemoryUnitOfWork to provide all the data in one single place.
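A hedged sketch of the in-memory unit-of-work idea (names are illustrative): tests seed plain lists, and repositories query those lists instead of a database.

using System;
using System.Collections.Generic;
using System.Linq;

public class InMemoryUnitOfWork
{
    private readonly Dictionary<Type, object> sets = new Dictionary<Type, object>();

    public void Seed<T>(params T[] items)
    {
        sets[typeof(T)] = new List<T>(items);
    }

    public IQueryable<T> Query<T>()
    {
        object set;
        return sets.TryGetValue(typeof(T), out set)
            ? ((List<T>)set).AsQueryable()
            : Enumerable.Empty<T>().AsQueryable();
    }
}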
I've been working on a data-heavy project myself. What has worked for us is to use the repository itself to hydrate objects and then serialize them to XML. We pull the XML file into our test project and use it as the starting point for our automated tests. It's nice because it ensures that your mock data looks like real data.
Our tests tend to look like this...
var object1 = XmlUtil.LoadObject1("filename1");
var object2 = XmlUtil.LoadObject2("filename2");
var result = SomeConverter.Convert(object1, object2);
Assert.AreEqual("somevalue", result.Property1);
If you need to do inline lookups, you can add a mock repository that would provide the same level of dependency injection.
The downside of this approach is schema change: a test can become obsolete when the data schema changes. If your schema is still under a lot of flux, keep your automated tests small until the schema settles down, and focus on unit tests until you know the schema is relatively stable.
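XmlUtil above is our own helper; as a sketch, a generic equivalent might look like this, assuming the test data files were written with XmlSerializer:

using System.IO;
using System.Xml.Serialization;

public static class XmlUtil
{
    public static T Load<T>(string fileName)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (FileStream stream = File.OpenRead(fileName))
        {
            return (T)serializer.Deserialize(stream);
        }
    }
}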
You have to decide exactly what you want to test.
One way to do this might be to pretend you're using TDD. Pretend that your GetManagerSalesByRegionReport method does not exist (or actually delete it). You'll have to:
Write a failing unit test. The simplest thing for it to test is that you can call the method and that it doesn't throw an exception when there's nothing wrong with the data (a sketch of such a first test appears after this list).
You'll need to create the method, empty. It should return void since your test doesn't need it to return anything.
Your test should now pass.
Add a test to ensure that a List of the appropriate type is returned, even if none of the sub-repositories have data.
You'll have to change the method to return your list type; have it return null at first. Your test will still fail, so change it to return an empty List and it will pass.
What's left? Those are INNER joins, so you won't get any data back unless all the repositories contain at least one row. So, test for that: create a test where each repo contains one row and ensure the returned list contains the appropriate number of rows. Then, test for the appropriate properties per returned row. Then test that no data is returned if any of the repos contain no rows.
Then, maybe test what happens if some of the repos contain more than one row.
Then, I don't know what would be left to test.
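Here is the sketch promised above: the first test in the sequence, assuming NUnit and hand-rolled stub repositories that return empty collections. The Stub* types are hypothetical and would implement the repository interfaces from the question.

using NUnit.Framework;

[TestFixture]
public class ManagerSalesByRegionReportTests
{
    [Test]
    public void Report_WithEmptyRepositories_ReturnsEmptyList()
    {
        var entity = new SomeReportingEntity(
            new StubProductRepo(), new StubManagerRepo(), new StubLocationRepo(),
            new StubOrdersService(), new StubEmployeeRepo());

        var result = entity.GetManagerSalesByRegionReport();

        Assert.IsNotNull(result);   // passes once the method returns a list
        Assert.IsEmpty(result);     // inner joins over empty repos yield no rows
    }
}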
One of my fellow developers has code similar to the following snippet:
class Data
{
public string Prop1
{
get
{
// return the value stored in the database via a query
}
set
{
// Save the data to local variable
}
}
public void SaveData()
{
// Write all the properties to a file
}
}
class Program
{
public void SaveData()
{
Data d = new Data();
// Fetch the information from database and fill the local variable
d.Prop1 = d.Prop1;
d.SaveData();
}
}
Here the Data class properties fetch their information from the DB dynamically. When there is a need to save the Data to a file, the developer creates an instance and fills the properties using self-assignment, then finally calls save. I tried arguing that this usage of properties is not correct, but he is not convinced.
These are his points:
There are nearly 20 such properties.
Fetching all the information is not required except for saving.
Instead of self-assignment, writing a utility method to fetch everything would duplicate the code already in the properties.
Is this usage correct?
I don't think that another developer who will work with the same code will be happy to see:
d.Prop1 = d.Prop1;
Personally I would never do that.
Also, it is not the best idea to use a property to load data from the DB.
I would have a method that loads the data from the DB into a local variable, and then you can get that data using the property. A get/set pair should logically work with the same data; it is strange to use get for fetching data from the DB but set to work with a local variable.
Properties should really be as lightweight as possible.
When other developers are using properties, they expect them to be intrinsic parts of the object (that is, already loaded and in memory).
The real issue here is that of symmetry - the property get and set should mirror each other, and they don't. This is against what most developers would normally expect.
Having the property load up from the database is not recommended; normally one would populate the class via a specific method, as sketched below.
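A minimal sketch of that alternative: one explicit, visible load that fills every property in a single trip, instead of a hidden database call inside each getter (data access details omitted).

class Data
{
    public string Prop1 { get; set; }

    // One deliberate round trip that fills every property at once.
    public void LoadFromDatabase()
    {
        // e.g. run a single SELECT and assign: Prop1 = row["Prop1"]; ...
    }

    public void SaveData()
    {
        // Write all the properties to a file, as in the original example.
    }
}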
This is pretty terrible, imo.
Properties are supposed to be quick / easy to access; if there's really heavy stuff going on behind a property it should probably be a method instead.
Having two utterly different things going on behind the same property's getter and setter is very confusing. d.Prop1 = d.Prop1 looks like a meaningless self-assignment, not a "Load data from DB" call.
Even if you do have to load twenty different things from a database, doing it this way forces it to be twenty different DB trips; are you sure multiple properties can't be fetched in a single call? That would likely be much better, performance-wise.
"Correct" is often in the eye of the beholder. It also depends how far or how brilliant you want your design to be. I'd never go for the design you describe, it'll become a maintenance nightmare to have the CRUD actions on the POCOs.
Your main issue is the absense of separations of concerns. I.e., The data-object is also responsible for storing and retrieving (actions that need to be defined only once in the whole system). As a result, you end up with duplicated, bloated and unmaintainable code that may quickly become real slow (try a LINQ query with a join on the gettor).
A common scenario with databases is to use small entity classes that only contain the properties, nothing more. A DAO layer takes care of retrieving and filling these POCOs with data from the database and defined the CRUD actions only ones (through some generics). I'd suggest NHibernate for the ORM mapping. The basic principle explained here works with other ORM mappers too and is explained here.
The reasons, esp. nr 1, should be a main candidate for refactoring this into something more maintainable. Duplicated code and logic, when encountered, should be reconsidered strongly. If the gettor above is really getting the database data (I hope I misunderstand that), get rid of it as quickly as you can.
Overly simplified example of separation of concerns:
class Data
{
    public string Prop1 { get; set; }
    public string Prop2 { get; set; }
}

class Dao<T>
{
    public void SaveEntity(T data)
    {
        // use reflection to save the properties (this is what any ORM does for you)
    }

    public IList<T> GetAll()
    {
        // use reflection to retrieve all data of this type (again, the ORM does this for you)
    }
}

// usage:
Dao<Data> myDao = new Dao<Data>();
IList<Data> allData = myDao.GetAll();
// modify, query etc. using the Dao; lazy evaluation and caching are done by the ORM
// for performance, but more importantly, this design keeps your code clean,
// readable and maintainable.
EDIT:
One question you should ask your co-worker: what happens when you have many Data objects (rows in the database), or when a property is the result of a joined query (a foreign-key table)? Have a look at Fluent NHibernate if you want a smooth transition from one situation (unmaintainable) to another (maintainable) that's easy enough for anybody to understand.
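For flavor, a hedged sketch of a Fluent NHibernate mapping for the Data POCO above; it assumes an Id property is added to Data, and the table and column names are guesses.

using FluentNHibernate.Mapping;

public class DataMap : ClassMap<Data>
{
    public DataMap()
    {
        Table("Data");
        Id(x => x.Id);       // assumes an Id property on Data
        Map(x => x.Prop1);
        Map(x => x.Prop2);
    }
}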
If I were you I would write a serialize / deserialize function, then provide properties as lightweight wrappers around the in-memory results.
Take a look at the ISerializable interface: http://msdn.microsoft.com/en-us/library/system.runtime.serialization.iserializable.aspx
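A minimal sketch of ISerializable on the Data class from the question; only Prop1 is shown, and the persistence details are illustrative.

using System;
using System.Runtime.Serialization;

[Serializable]
class Data : ISerializable
{
    public string Prop1 { get; set; }

    public Data() { }

    // Deserialization constructor required by ISerializable.
    protected Data(SerializationInfo info, StreamingContext context)
    {
        Prop1 = info.GetString("Prop1");
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Prop1", Prop1);
    }
}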
This would be very hard to work with.
If you set Prop1 and then get Prop1, you could end up with different results, e.g.:
//set Prop1 to "abc"
d.Prop1 = "abc";
//if the data source holds "xyz" for Prop1
string myString = d.Prop1;
//myString will equal "xyz"
Reading the code without the comments, you would expect myString to equal "abc", not "xyz"; this could be confusing.
It would also make working with the properties very difficult, since you would need a save every time you change a property for things to behave consistently.
As well as agreeing with what everyone else has said about this example: what happens if there are other fields in the Data class, e.g. Prop2, Prop3, etc.? Do they all go back to the database each time they are accessed, in order to "return the value stored in the database via a query"? Ten properties would mean ten database hits; setting ten properties, ten writes to the database. That's not going to scale.
In my opinion, that's an awful design. Using a property getter to do some "magic" stuff makes the system awkward to maintain. If I joined your team, how would I know about the magic behind those properties?
Create a separate method whose name says what it actually does.