Best Design Pattern for Large Data processing methods

I have an application that I am refactoring while trying to follow some of the "Clean Code" principles. It reads data from multiple data sources, manipulates/formats that data, and inserts it into another database. I have a data layer with the associated DTOs, repositories, interfaces, and helpers for each data source, as well as a business layer with the matching entities, repositories and interfaces.
My question comes down to the Import method. I basically have one method that systematically calls each business-logic method to read, process and save the data. There are a lot of calls that need to be made, and even though the Import method itself does not manipulate the data at all, it is still extremely large. Is there a better way to process this data?
ICustomer<Customer> sourceCustomerList = new CustomerRepository();
foreach (Customer customer in sourceCustomerList.GetAllCustomers())
{
    // Read Some Data
    DataObject object1 = iSourceDataType1.GetDataByCustomerID(customer.ID);
    // Format and save the Data
    iTargetDataType1.InsertDataType1(object1);
    // Read Some Data
    // Format the Data
    // Save the Data
    //...Rinse and repeat
}

You should look into the Task Parallel Library (TPL) and its Dataflow blocks.
ICustomer<Customer> sourceCustomerList = new CustomerRepository();
var customersBuffer = new BufferBlock<Customer>();
var transformBlock = new TransformBlock<Customer, DataObject>(
    customer => iSourceDataType1.GetDataByCustomerID(customer.ID)
);
// Build your pipeline with TransformBlock, ActionBlock, and many more...
// PropagateCompletion forwards Complete() to the linked block
customersBuffer.LinkTo(transformBlock, new DataflowLinkOptions { PropagateCompletion = true });
// Add all the blocks you need here....
// Then feed the first block or use a custom source
foreach (var c in sourceCustomerList.GetAllCustomers())
    customersBuffer.Post(c);
customersBuffer.Complete();
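To make the idea concrete, here is a minimal end-to-end sketch of a read-then-save pipeline. The Customer and DataObject records, the ImportPipeline class, and the in-memory "save" are stand-ins I made up for illustration; the real read and save steps would call your own repositories. It assumes .NET 5 or later (on .NET Framework the blocks come from the System.Threading.Tasks.Dataflow NuGet package).

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ImportPipeline
{
    // Hypothetical stand-ins for the question's Customer and DataObject types.
    public record Customer(int ID);
    public record DataObject(int CustomerID, string Payload);

    public static async Task<int> RunAsync(int customerCount)
    {
        var saved = new ConcurrentBag<DataObject>();

        // Read step: stands in for iSourceDataType1.GetDataByCustomerID.
        var read = new TransformBlock<Customer, DataObject>(
            c => new DataObject(c.ID, $"data for customer {c.ID}"),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Save step: stands in for iTargetDataType1.InsertDataType1.
        var save = new ActionBlock<DataObject>(d => saved.Add(d));

        // PropagateCompletion lets Complete() on the first block reach the last one.
        read.LinkTo(save, new DataflowLinkOptions { PropagateCompletion = true });

        // Feed the pipeline, then signal that no more input is coming.
        foreach (var c in Enumerable.Range(1, customerCount).Select(i => new Customer(i)))
            read.Post(c);
        read.Complete();

        // Wait until every item has passed through the save step.
        await save.Completion;
        return saved.Count;
    }

    public static async Task Main()
    {
        Console.WriteLine($"Saved {await RunAsync(10)} objects");
    }
}
```

Each read/format/save stage becomes its own small block, so the Import method shrinks to wiring blocks together instead of listing every call inline.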

Your performance will be IO-bound, especially with the many accesses to the database(s) in each iteration. Therefore, you need to revise your architecture to minimise IO.
Is it possible to move all the records closer together (maybe in a temporary database) as a first pass, then do the record matching and formatting within the database as a second pass, before reading them out and saving them where they need to be?
(As a side note, sometimes we get carried away with DDD and OO, where everything "needs" to be an object. But that is not always the best approach.)

Related

Make sure that the API call and my own save both succeed

Hello, I ran into a problem using ASP.NET MVC with EF while calling a Web API on another website (which also uses Entity Framework).
The problem is this:
I want to make sure that the MVC SaveChanges() and the Web API SaveChanges() either both succeed or both fail.
Here's my dream pseudo code:
public ActionResult Operation()
{
    // Insert, update, delete code...
    bool testMvcSaveSuccess = db.TempSaveChanges(); // this command does not exist
    if (testMvcSaveSuccess)
    {
        bool isApiSuccess = CallApi(); // insert data into the other web app
        if (isApiSuccess)
        {
            db.SaveChanges(); // real save
        }
    }
}
Without something like db.TempSaveChanges() in the code above, the Web API call could succeed while the MVC SaveChanges() then fails.

There is nothing like TempSaveChanges because there is something even better: transactions.
A transaction is IDisposable (so it can be used in a using block) and has methods like Commit and Rollback.
Small example:
private void TestTransaction()
{
    var context = new MyContext(connectionString);
    using (var transaction = context.Database.BeginTransaction())
    {
        // do CRUD stuff here
        // here is your 'TempSaveChanges' execution
        int changesCount = context.SaveChanges();
        if (changesCount > 0)
        {
            // changes were made
            // this will do the real db changes
            transaction.Commit();
        }
        else
        {
            // no changes detected -> so do nothing
            // could use 'transaction.Rollback();' but since there are no changes, this should not be necessary
            // using block will dispose transaction and with it all changes as well
        }
    }
}
I have extracted this example from my GitHub Exercise.EntityFramework repository. Feel free to Star/Clone/Fork...

Yes, you can.
You need to override SaveChanges in the context class so that your check runs first and the regular SaveChanges is called afterwards.
Or create your own TempSaveChanges() in the context class, call it, and if it succeeds, call SaveChanges from it.

What you are referring to is known as atomicity: you want several operations to either all succeed, or none of them. In the context of a database you obtain this via transactions (if the database supports it). In your case however, you need a transaction which spans across two disjoint systems. A general-purpose (some special cases have simpler solutions) robust implementation of such a transaction would have certain requirements on the two systems, and also require additional persistence.
Basically, you need to be able to gracefully recover from a sudden stop at any point during the sequence. Each of the databases you are using is most likely ACID compliant, so you can count on each DB transaction to fulfill the atomicity requirement (it either succeeds or fails). Therefore, all you need to worry about is the sequence of the two DB transactions. Your requirement on the two systems is a way to determine a posteriori whether or not some operation was performed.
Example process flow:
1. Operation begins
2. Generate unique transaction ID and persist (with request data)
3. Make changes to local DB and commit
4. Call external Web API
5. Flag transaction as completed (or delete it)
6. Operation ends
Recovery:
1. Get all pending (not completed) transactions from store
2. Check if expected change to local DB was made
3. Ask Web API if expected change was made
4. If none of the changes were made or both of the changes were made then the transaction is done: delete/flag it.
5. If one of the changes was made but not the other, then either revert the change that was made (revert transaction), or perform the change that was not (resume transaction) => then delete/flag it.
Now, as you can see, it quickly gets complicated, especially if "determining if changes were made" is a non-trivial operation. A common solution is to use that unique transaction ID as a means of determining which data needs attention. But at this point it gets very application-specific and depends entirely on what the specific operations are. For certain applications, you can just re-run the entire operation in the recovery step (since you have the entire request data stored in the transaction). Some special cases do not need to persist the transaction at all, since there are other ways of achieving the same thing.
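As a rough illustration of the flow and the recovery pass described above, here is a sketch that simulates everything in memory. The TwoSystemCoordinator and PendingTransaction classes are invented for this example, the two HashSets merely stand in for the local DB and the remote Web API, and recovery always chooses "resume" rather than "revert". A real implementation would persist the log and query the real systems. It assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PendingTransaction
{
    public Guid Id { get; init; }
    public string RequestData { get; init; } = "";
    public bool Completed { get; set; }
}

public class TwoSystemCoordinator
{
    // In-memory stand-ins (assumptions): transaction store, local DB, remote API.
    public readonly List<PendingTransaction> Log = new();
    public readonly HashSet<Guid> LocalDb = new();
    public readonly HashSet<Guid> RemoteApi = new();

    public void Execute(string requestData, bool crashBeforeApiCall = false)
    {
        // Persist a unique transaction ID together with the request data.
        var tx = new PendingTransaction { Id = Guid.NewGuid(), RequestData = requestData };
        Log.Add(tx);

        // Change the local DB and commit.
        LocalDb.Add(tx.Id);

        // Simulate a sudden stop between the two systems.
        if (crashBeforeApiCall) return;

        // Call the external Web API, then flag the transaction as completed.
        RemoteApi.Add(tx.Id);
        tx.Completed = true;
    }

    public void Recover()
    {
        foreach (var tx in Log.Where(t => !t.Completed).ToList())
        {
            bool local = LocalDb.Contains(tx.Id);
            bool remote = RemoteApi.Contains(tx.Id);

            // One side is missing: resume by performing the missing change.
            if (local && !remote) RemoteApi.Add(tx.Id);
            else if (!local && remote) LocalDb.Add(tx.Id);

            // Both or neither: nothing to repair. Either way the entry is done.
            tx.Completed = true;
        }
    }
}

public static class RecoveryDemo
{
    public static void Main()
    {
        var coordinator = new TwoSystemCoordinator();
        coordinator.Execute("order 42", crashBeforeApiCall: true);
        coordinator.Recover();
        Console.WriteLine($"Remote has change: {coordinator.RemoteApi.Count == 1}");
    }
}
```

The transaction ID doubles as the key both systems are asked about, which is what makes the "was the change made?" check cheap in this sketch.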

OK, so let's clarify things a bit.
You have an MVC app, A1, with its own database D1.
You then have an API, let's call it A2, with its own database D2.
You want some code in A1 which does a temp save in D1, then fires a call to A2, and if the response is successful, saves the temp data from D1 in the right place this time.
Based on your pseudo code, I would suggest you create a second table in D1 where you save your "temporary" data. So your database has an extra table and the flow is like this:
First you save your A1 data in that table. You then call A2, and the data gets saved in D2. A1 receives the confirmation and calls a method which moves the data from the second table to where it should be.
Scenarios to consider:
Saving the temp data in D1 works, but the call to A2 fails. You now clear the orphaned data with a batch job, or simply call something that deletes it when the call to A2 fails.
The call to A2 succeeds and the call to D1 fails, so now you have temp data in D1 which has failed to move to the right table. You could add a flag against each row of the second table, indicating that the call to A2 succeeded and that this data needs to move to the right place when possible. You can have a service here which runs periodically, and if it finds any data with the flag set to true, it moves the data to the right place.
There are other ways to deal with scenarios like this. You could use a queue system to manage it. Each row of data becomes a message and you assign it a unique ID, a GUID, which is basically a CorrelationID and is the same in both systems. Even if one system goes down, the data will be saved when it comes back up, and because of the common ID you can always link it up properly.

Querying the write model for duplicated aggregate root property

I'm implementing the CQRS pattern with event sourcing, using NServiceBus, NEventStore and NES (which connects NSB and NEventStore).
My application checks a web service regularly for any file to be downloaded and processed. When a file is found, a command (DownloadFile) is sent to the bus and received by FileCommandHandler, which creates a new aggregate root (File) and handles the message.
Now, inside the File aggregate root, I have to check that the content of the file doesn't match any other file's content (the web service only guarantees that the file name is unique; the same content may appear under a different name), by hashing it and comparing it with the list of hashed contents.
The question is: where should I save the list of hash codes? Is it allowed to query the read model?
public class File : AggregateBase
{
    public File(DownloadFile cmd, IFileService fileDownloadService, IClaimSerializerService serializerService, IBus bus)
        : this()
    {
        // code to download the file content, deserialize it, and publish an event.
    }
}
public class FileCommandHandler : IHandleMessages<DownloadFile>, IHandleMessages<ExtractFile>
{
    public void Handle(DownloadFile command)
    {
        // for example, is it possible to do this? (honestly, I feel it is not, since the read model should always be considered stale!)
        var file = readModelContext.GetFileByHashCode(Hash(command.FileContent));
        if (file != null)
            throw new Exception("File content matched with another already downloaded file");
        // Since there is no way to query the event source for file content like:
        // eventSourceRepository.Find<File>(c=>c.HashCode == Hash(command.FileContent));
    }
}

Seems like you're looking for deduplication.
Your command side is where you want things to be consistent. Queries will always leave you open to race conditions. So, instead of running a query, I'd reverse the logic and actually write the hash into a database table (any db with ACID guarantees). If this write is successful, process the file. If the write of the hash fails, skip processing.
There's no point putting this logic into a handler, because retrying the message in case of failure (i.e. storing the hash multiple times) will not make it succeed. You'd also end up with messages for duplicate files in the error queue.
A good place for the deduplication logic is likely inside your web service client. Some pseudo logic:
1. Get file
2. Open transaction
3. Insert hash into database & catch failure (not any failure, only failure to insert)
4. Bus.Send message to process file if # of records inserted in step 3 is not zero
5. Commit transaction
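The "write first, then decide" idea can be sketched like this. It is only an illustration: the FileDeduplicator class is invented, the HashSet stands in for a database table with a unique constraint on the hash column, and the SentMessages list stands in for Bus.Send. It assumes .NET 5 or later for SHA256.HashData and Convert.ToHexString.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public class FileDeduplicator
{
    // Stand-in (assumption) for a table with a unique constraint on the hash.
    private readonly HashSet<string> _hashes = new();
    private readonly object _gate = new();

    // Stand-in (assumption) for messages handed to the bus.
    public List<string> SentMessages { get; } = new();

    public static string Hash(byte[] content)
        => Convert.ToHexString(SHA256.HashData(content));

    // Returns true when the content was new and a message was sent.
    public bool TryProcess(string fileName, byte[] content)
    {
        bool inserted;
        lock (_gate)
        {
            // The insert itself is the uniqueness check: no prior query needed.
            inserted = _hashes.Add(Hash(content));
        }

        // Only send the processing message when the hash was actually inserted.
        if (inserted) SentMessages.Add(fileName);
        return inserted;
    }
}

public static class DedupDemo
{
    public static void Main()
    {
        var dedup = new FileDeduplicator();
        byte[] content = Encoding.UTF8.GetBytes("same content");
        Console.WriteLine(dedup.TryProcess("a.txt", content));
        Console.WriteLine(dedup.TryProcess("b.txt", content));
    }
}
```

With a real database, the lock and HashSet would be replaced by the insert inside the transaction, and a unique-constraint violation would play the role of Add returning false.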
Some example deduplication code in NServiceBus gateway here
Edit:
Looking at their code, I actually think the session.Get<DeduplicationMessage> is unnecessary. session.Save(gatewayMessage); should be enough and is the consistency boundary.
Doing a query would make sense only if the rate of failure is high, meaning you have a lot of duplicate content files. If 99%+ of inserts succeed, the duplicates can indeed be treated as exceptions.

This depends on a lot of things ... throughput being one of them. But since you're approaching this problem in a "pull based" fashion anyway (you're querying a webservice to poll for work (downloading and analysing a file)), you could make this whole process serial without having to worry about collisions. Now that might not give the desired rate at which you want to be handling "the work", but more importantly ... have you measured? Let's sidestep that for a minute and assume that serial isn't going to work. How many files are we talking about? A few 100, 1000, ... millions? Depending on that hashes might fit into memory and could be rebuilt if/when the process should come down. There might also be an opportunity to partition your problem along the axis of time or context. Every file since the beginning of dawn or just today, or maybe this month's worth of files? Really, I think you should dig deeper in your problem space. Apart from that, this feels like an awkward problem to solve using event sourcing, but YMMV.

When you have a true uniqueness-constraint in your domain, you can make the uniqueness-tester a domain service, whose implementation is part of the infrastructure -- similar to a repository, whose interface is part of the domain and whose implementation is part of the infrastructure. For the implementation, you can then use an in-memory hash or a database that is updated/queried as needed.

How to break down large 'macro' classes?

One application I work on does only one thing, seen from the outside world: it takes a file as input and after ~5 minutes spits out another file.
What happens inside is actually a sequential series of actions. The application is, in our opinion, structured well because each action is like a small box, without too many dependencies.
Usually some later actions use information from previous ones, and just a few could be executed in parallel; for the sake of simplicity we prefer to keep the execution sequential.
Now the problem is that the function that executes all these actions is like a batch file: a long list of calls to different functions with different arguments. So, in the code it looks like:
main
{
    try
    {
        result1 = Action1(inputFile);
        result2 = Action2(inputFile);
        result3 = Action3(result2.value);
        result4 = Action4(result1.value, inputFile);
        ... //You get the idea. There is no pattern in the passed parameters
        resultN = ActionN(parameters);
        write output
    }
    catch
    {
        something went wrong, display the error
    }
}
How would you model the main function of this application so it is not just a long list of commands?

Not everything needs to fit to a clever pattern. There are few more elegant ways to express a long series of imperative statements than as, well, a long series of imperative statements.
If there are certain kinds of flexibility you feel you are currently lacking, express them, and we can try to propose solutions.
If there are certain clusters of actions and results that are re-used often, you could pull them out into new functions and build "aggregate" actions from them.
You could look into dataflow languages and libraries, but I expect the gain to be small.

Not sure if it's the best approach, but you could have an object that would store all the results and you would give it to each method in turn. Every method would read the parameters it needs and write its result there. You could then have a collection of actions (either as delegates or objects implementing an interface) and call them in a loop.
class Results
{
    public int Result1 { get; set; }
    public string Result2 { get; set; }
    …
}
var actions = new Action<Results>[] { Action1, Action2, … };
Results results = new Results();
foreach (var action in actions)
    action(results);
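For a version you can actually run, here is the same shape with two made-up actions (CountLines and Summarize) and made-up properties on Results. Nothing here comes from the original application; it only shows the mechanics of passing one shared object through a list of delegates.

```csharp
using System;

public class Results
{
    public string Input { get; set; } = "";
    public int LineCount { get; set; }
    public string Summary { get; set; } = "";
}

public static class Workflow
{
    // Each action reads what it needs from Results and writes its output back.
    static void CountLines(Results r) => r.LineCount = r.Input.Split('\n').Length;
    static void Summarize(Results r) => r.Summary = $"{r.LineCount} line(s)";

    public static Results Run(string input)
    {
        // Order matters: Summarize depends on what CountLines wrote.
        var actions = new Action<Results>[] { CountLines, Summarize };
        var results = new Results { Input = input };
        foreach (var action in actions)
            action(results);
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(Run("a\nb\nc").Summary);
    }
}
```

The trade-off is that dependencies between actions become implicit: the compiler no longer tells you that Summarize needs CountLines to have run first.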

You could consider implementing a Sequential Workflow with Windows Workflow Foundation.

First of all, this solution is far from bad. If the actions are disjoint, meaning there are no global parameters or other hidden dependencies between different actions or between actions and the environment, it's a good solution. It is easy to maintain and read; when you need to expand the functionality, you just add new actions, and when the "quantity" changes, you just add or remove lines from the macro sequence. If the process chain doesn't need to change frequently: don't move!
If it's a system where the implementation of the actions doesn't change often, but their order and parameters do, you may design a simple script language and transform the macro class into that script. This script should be maintained by someone other than you, someone who is familiar with the problem domain at the level of your "actions". That way, he/she can assemble the application using the script language without your assistance.
One nice approach for that kind of problem splitting is dataflow programming (a.k.a. flow-based programming). In dataflow programming, there are pre-written components. Components are black boxes (from the view of the application developer); they have consumer (input) and producer (output) ports, which can be connected to form a processing network, which is then the application. If there is a good set of components for a domain, many applications can be created without programming new components. Also, components can be built from other components (these are called composite components).
Wikipedia (good starting point):
http://en.wikipedia.org/wiki/Dataflow_programming
http://en.wikipedia.org/wiki/Flow-based_programming
JPM's site (book, wiki, everything):
http://jpaulmorrison.com/fbp/
I think, bigger systems must have that split point you describe as "macro". Even games have that point, e.g. FPS games have a 3D engine and a game logic script, or there's SCUMM VM, which is the same.

How to avoid geometric slowdown with large Linq transactions?

I've written some really nice, funky libraries for use in LinqToSql. (Some day when I have time to think about it I might make it open source... :) )
Anyway, I'm not sure if this is related to my libraries or not, but I've discovered that when I have a large number of changed objects in one transaction, and then call DataContext.GetChangeSet(), things start getting reaalllly slooowwwww. When I break into the code, I find that my program is spinning its wheels doing an awful lot of Equals() comparisons between the objects in the change set. I can't guarantee this is true, but I suspect that if there are n objects in the change set, then the call to GetChangeSet() is causing every object to be compared to every other object for equivalence, i.e. at best (n^2-n)/2 calls to Equals()...
Yes, of course I could commit each object separately, but that kinda defeats the purpose of transactions. And in the program I'm writing, I could have a batch job containing 100,000 separate items, that all need to be committed together. Around 5 billion comparisons there.
So the question is: (1) is my assessment of the situation correct? Do you get this behavior in pure, textbook LinqToSql, or is this something my libraries are doing? And (2) is there a standard/reasonable workaround so that I can create my batch without making the program geometrically slower with every extra object in the change set?
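For reference, the "around 5 billion" figure follows directly from the all-pairs assumption above. This only checks the arithmetic; it says nothing about whether GetChangeSet() really compares every pair. The ComparisonEstimate class is just a name for this example.

```csharp
using System;

public static class ComparisonEstimate
{
    // Number of unordered pairs among n objects: (n^2 - n) / 2.
    public static long PairwiseComparisons(long n) => n * (n - 1) / 2;

    public static void Main()
    {
        Console.WriteLine(PairwiseComparisons(100_000));
    }
}
```

Doubling the batch size roughly quadruples the count, which is why the slowdown feels so sharp as the change set grows.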

In the end I decided to rewrite the batches so that each individual item is saved independently, all within one big transaction. In other words, instead of:
var b = new Batch { ... };
while (addNewItems) {
    ...
    var i = new BatchItem { ... };
    b.BatchItems.Add(i);
}
b.Insert(); // that's a function in my library that calls SubmitChanges()
...you have to do something like this:
context.BeginTransaction(); // another one of my library functions
try {
    var b = new Batch { ... };
    b.Insert(); // save the batch record immediately
    while (addNewItems) {
        ...
        var i = new BatchItem { ... };
        b.BatchItems.Add(i);
        i.Insert(); // send the SQL on each iteration
    }
    context.CommitTransaction(); // and only commit the transaction when everything is done.
} catch {
    context.RollbackTransaction();
    throw;
}
You can see why the first code block is just cleaner and more natural to use, and it's a pity I got forced into using the second structure...

Transactions across several DAL methods from the one method in the BLL

How would you go about calling several methods in the data access layer from one method in the business logic layer so that all of the SQL commands lived in one SQL transaction?
Each one of the DAL methods may be called individually from other places in the BLL, so there is no guarantee that the data layer methods are always part of a transaction. We need this functionality so if the database goes offline in the middle of a long running process, there's no commit. The business layer is orchestrating different data layer method calls based on the results of each of the previous calls. We only want to commit (from the business layer) at the very end of the entire process.

Well, firstly, you'll have to adhere to an atomic Unit of Work that you specify as a single method in your BLL. This would (for example) create the customer, the order and the order items. You'd then wrap this all up neatly inside a TransactionScope using statement. TransactionScope is the secret weapon here. Below is some code that, luckily enough, I'm working on right now :):
public static int InsertArtist(Artist artist)
{
    if (artist == null)
        throw new ArgumentNullException("artist");
    int artistid = 0;
    using (TransactionScope scope = new TransactionScope())
    {
        // insert the master Artist
        /*
           we plug the artistid variable into
           any child instance where ArtistID is required
        */
        artistid = SiteProvider.Artist.InsertArtist(new ArtistDetails(
            0,
            artist.BandName,
            artist.DateAdded));
        // insert the child ArtistArtistGenre
        artist.ArtistArtistGenres.ForEach(item =>
        {
            var artistartistgenre = new ArtistArtistGenreDetails(
                0,
                artistid,
                item.ArtistGenreID);
            SiteProvider.Artist.InsertArtistArtistGenre(artistartistgenre);
        });
        // insert the child ArtistLink
        artist.ArtistLinks.ForEach(item =>
        {
            var artistlink = new ArtistLinkDetails(
                0,
                artistid,
                item.LinkURL);
            SiteProvider.Artist.InsertArtistLink(artistlink);
        });
        // insert the child ArtistProfile
        artist.ArtistProfiles.ForEach(item =>
        {
            var artistprofile = new ArtistProfileDetails(
                0,
                artistid,
                item.Profile);
            SiteProvider.Artist.InsertArtistProfile(artistprofile);
        });
        // insert the child FestivalArtist
        artist.FestivalArtists.ForEach(item =>
        {
            var festivalartist = new FestivalArtistDetails(
                0,
                item.FestivalID,
                artistid,
                item.AvailableFromDate,
                item.AvailableToDate,
                item.DateAdded);
            SiteProvider.Festival.InsertFestivalArtist(festivalartist);
        });
        BizObject.PurgeCacheItems(String.Format(ARTISTARTISTGENRE_ALL_KEY, String.Empty, String.Empty));
        BizObject.PurgeCacheItems(String.Format(ARTISTLINK_ALL_KEY, String.Empty, String.Empty));
        BizObject.PurgeCacheItems(String.Format(ARTISTPROFILE_ALL_KEY, String.Empty, String.Empty));
        BizObject.PurgeCacheItems(String.Format(FESTIVALARTIST_ALL_KEY, String.Empty, String.Empty));
        BizObject.PurgeCacheItems(String.Format(ARTIST_ALL_KEY, String.Empty, String.Empty));
        // commit the entire transaction - all or nothing
        scope.Complete();
    }
    return artistid;
}
Hopefully, you'll get the gist. Basically, it's an all-succeed-or-all-fail job, irrespective of any disparate databases (i.e. in the above example, artist and artistartistgenre could be hosted in two separate db stores, but TransactionScope couldn't care less about that; it works at COM+ level and manages the atomicity of the scope that it can 'see').
Hope this helps.
EDIT: you'll possibly find that the initial invocation of TransactionScope (on app start-up) may be slightly noticeable (i.e. in the example above, if called for the first time, it can take 2-3 seconds to complete); however, subsequent calls are almost instantaneous (i.e. typically 250-750 ms). The trade-off between a simple single-point-of-contact transaction and the (unwieldy) alternatives mitigates (for me and my clients) that initial 'loading' latency.
I just wanted to demonstrate that ease doesn't come without compromise (albeit in the initial stages).
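The reason this works for DAL methods that are also called on their own is that TransactionScope sets an ambient transaction (Transaction.Current) which code inside the scope can pick up, while the same code called outside a scope sees none. A small sketch without any database, where AmbientDemo and DalMethodSeesTransaction are made-up names standing in for a real DAL method:

```csharp
using System;
using System.Transactions;

public static class AmbientDemo
{
    // Hypothetical DAL method: it never commits by itself, it only reports
    // whether an ambient transaction is available to enlist in.
    public static bool DalMethodSeesTransaction() => Transaction.Current != null;

    public static (bool inside, bool outside) Run()
    {
        bool inside;
        using (var scope = new TransactionScope())
        {
            // Called from the BLL inside a scope: joins the ambient transaction.
            inside = DalMethodSeesTransaction();

            // Mark the unit of work as successful; without this it rolls back.
            scope.Complete();
        }

        // Called on its own: no ambient transaction, so it runs standalone.
        bool outside = DalMethodSeesTransaction();
        return (inside, outside);
    }

    public static void Main()
    {
        var (inside, outside) = Run();
        Console.WriteLine($"inside scope: {inside}, outside scope: {outside}");
    }
}
```

Whether a given data provider enlists in the ambient transaction automatically depends on that provider, so treat this as a sketch of the mechanism rather than a guarantee for your DAL.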

What you describe is the very 'definition' of a long transaction.
Each DAL method could simply provide operations (without any specific commits). Your BLL (which is in effect where you are coordinating any calls to the DAL anyway) is where you can choose to either commit, or execute a 'savepoint'. A savepoint is an optional item which you can employ to allow 'rollbacks' within a long running transaction.
So for example, if my DAL has methods DAL1, DAL2 and DAL3 that are all mutative, they would simply 'execute' data change operations (i.e. some type of Create, Update, Delete). From my BLL, let's assume I have BL1 and BL2 methods (BL1 is long running). BL1 invokes all the aforementioned DAL methods (i.e. DAL1...DAL3), while BL2 only invokes DAL3.
Therefore, on execution of each business logic method you might have the following:
BL1 (long-transaction) -> {savepoint} DAL1 -> {savepoint} DAL2 -> DAL3 {commit/end}
BL2 -> DAL3 {commit/end}
The idea behind the 'savepoint' is that it allows BL1 to roll back at any point if there are issues in the data operations. The long transaction is ONLY committed if all three operations complete successfully. BL2 can still call any method in the DAL, and it is responsible for controlling commits. NOTE: you could use 'savepoints' in short/regular transactions as well.

Good question. This gets to the heart of the impedance mismatch.
This is one of the strongest arguments for using stored procedures. Reason: they are designed to encapsulate multiple SQL statements in a transaction.
The same can be done procedurally in the DAL, but it results in code with less clarity, while usually resulting in moving the coupling/cohesion balance in the wrong direction.
For this reason, I implement the DAL at a higher level of abstraction than simply encapsulating tables.

Just in case my comment on the original article didn't 'stick', here's what I'd added as additional info:
Coincidentally, I just noticed another similar reference to this, posted a few hours after your request. It uses a similar strategy and might be worth looking at as well:
http://stackoverflow.com/questions/494550/how-does-transactionscope-roll-back-transactions
