Querying the write model for a duplicated aggregate root property - C#

I'm implementing the CQRS pattern with event sourcing, using NServiceBus, NEventStore, and NES (which connects NSB and NEventStore).
My application regularly checks a web service for any file to be downloaded and processed. When a file is found, a command (DownloadFile) is sent to the bus and received by FileCommandHandler, which creates a new aggregate root (File) and handles the message.
Now, inside the File aggregate root, I have to check that the content of the file doesn't match the content of any other file (the web service only guarantees that file names are unique; the same content may arrive under a different name), by hashing it and comparing it against the list of hashed contents.
The question is: where should I save the list of hash codes? Is it allowed to query the read model?
public class File : AggregateBase
{
    public File(DownloadFile cmd, IFileService fileDownloadService, IClaimSerializerService serializerService, IBus bus)
        : this()
    {
        // code to download the file content, deserialize it, and publish an event
    }
}
public class FileCommandHandler : IHandleMessages<DownloadFile>, IHandleMessages<ExtractFile>
{
    public void Handle(DownloadFile command)
    {
        // For example, is it possible to do this? (Honestly, I feel it is not,
        // since the read model should always be considered stale!)
        var file = readModelContext.GetFileByHashCode(Hash(command.FileContent));
        if (file != null)
            throw new Exception("File content matched another already downloaded file");

        // There is no way to query the event source for file content like:
        // eventSourceRepository.Find<File>(c => c.HashCode == Hash(command.FileContent));
    }
}

Seems like you're looking for deduplication.
Your command side is where you want things to be consistent. Queries will always leave you open to race conditions. So, instead of running a query, I'd reverse the logic and actually write the hash into a database table (any db with ACID guarantees). If this write is successful, process the file. If the write of the hash fails, skip processing.
There's no point putting this logic into a handler, because retrying the message in case of failure (i.e. storing the hash multiple times) will not make it succeed. You'd also end up with messages for duplicate files in the error queue.
A good place for the deduplication logic is likely inside your web service client. Some pseudo-logic:
1. Get the file
2. Open a transaction
3. Insert the hash into the database & catch failure (not any failure, only failure to insert)
4. Bus.Send a message to process the file if the number of records inserted in step 3 is not zero
5. Commit the transaction
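Here's a rough sketch of those steps in C#. This is only illustrative: the FileHashes table, the ProcessFile message, and the SHA-256 hashing are assumptions, not part of your setup.

// Illustrative sketch only: assumes a FileHashes table with a unique key on the Hash column.
// Requires System, System.Data.SqlClient and System.Security.Cryptography.
public void HandleDownloadedFile(string fileName, byte[] content)
{
    string hash;
    using (var sha = SHA256.Create())
        hash = Convert.ToBase64String(sha.ComputeHash(content));

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        using (var command = connection.CreateCommand())
        {
            command.Transaction = transaction;
            command.CommandText =
                "INSERT INTO FileHashes (Hash, FileName) VALUES (@hash, @name)";
            command.Parameters.AddWithValue("@hash", hash);
            command.Parameters.AddWithValue("@name", fileName);

            try
            {
                command.ExecuteNonQuery();                          // step 3: fails on a duplicate hash
                bus.Send(new ProcessFile { FileName = fileName });  // step 4 (hypothetical message type)
                transaction.Commit();                               // step 5
            }
            catch (SqlException ex) when (ex.Number == 2627)        // unique key violation only
            {
                transaction.Rollback();                             // duplicate content: skip processing
            }
        }
    }
}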
There's some example deduplication code in the NServiceBus gateway here.
Edit:
Looking at their code, I actually think the session.Get<DeduplicationMessage> is unnecessary. session.Save(gatewayMessage); should be enough and is the consistency boundary.
Doing a query would make sense only if the rate of failure is high, meaning you have a lot of duplicate content files. If 99%+ of inserts succeed, the duplicates can indeed be treated as exceptions.

This depends on a lot of things, throughput being one of them. But since you're approaching this problem in a "pull-based" fashion anyway (you're polling a web service for work, i.e. downloading and analysing a file), you could make this whole process serial without having to worry about collisions. That might not give the desired rate at which you want to be handling "the work", but more importantly: have you measured?

Let's sidestep that for a minute and assume that serial isn't going to work. How many files are we talking about? A few hundred, a few thousand, millions? Depending on that, the hashes might fit into memory and could be rebuilt if/when the process goes down. There might also be an opportunity to partition your problem along the axis of time or context: every file since the dawn of time, or just today's, or maybe this month's worth of files?

Really, I think you should dig deeper into your problem space. Apart from that, this feels like an awkward problem to solve using event sourcing, but YMMV.

When you have a true uniqueness-constraint in your domain, you can make the uniqueness-tester a domain service, whose implementation is part of the infrastructure -- similar to a repository, whose interface is part of the domain and whose implementation is part of the infrastructure. For the implementation, you can then use an in-memory hash or a database that is updated/queried as needed.
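As a rough illustration (the interface and class names here are made up), the domain sees only the interface, while the infrastructure supplies the implementation:

// Sketch only; requires System.Collections.Concurrent.
// Domain-side contract:
public interface IFileContentUniquenessChecker
{
    // Returns true if this hash has not been seen before, recording it atomically.
    bool TryRegister(string contentHash);
}

// Infrastructure-side implementation. An in-memory variant is shown here; a database-backed
// one (e.g. a unique index on the hash column) would look the same from the domain's point of view.
public class InMemoryFileContentUniquenessChecker : IFileContentUniquenessChecker
{
    private readonly ConcurrentDictionary<string, byte> seen =
        new ConcurrentDictionary<string, byte>();

    public bool TryRegister(string contentHash)
    {
        return seen.TryAdd(contentHash, 0);
    }
}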

Related

Make sure that the API call succeeds and my own save succeeds

Hello, I found a problem when I use ASP.NET MVC with EF and call a Web API from another website (which also uses Entity Framework).
The problem is that I want to make sure that both the MVC SaveChanges() and the Web API SaveChanges() succeed together.
Here's my dream pseudocode:
public ActionResult Operation()
{
    // code: insert, update, delete...
    bool testMvcSaveSuccess = db.TempSaveChanges(); // it does not have this command
    if (testMvcSaveSuccess == true)
    {
        bool isApiSuccess = CallApi(); // insert data into the other web app
        if (isApiSuccess == true)
        {
            db.SaveChanges(); // real save
        }
    }
}
In the code above, if there is no db.TempSaveChanges(), the Web API call might succeed while MVC's SaveChanges() fails.
There is nothing like TempSaveChanges(), because there is something even better: transactions.
A transaction is IDisposable (it can be used in a using block) and has methods like Commit and Rollback.
Small example:
private void TestTransaction()
{
    using (var context = new MyContext(connectionString))
    using (var transaction = context.Database.BeginTransaction())
    {
        // do CRUD stuff here

        // here is your 'TempSaveChanges' execution
        int changesCount = context.SaveChanges();

        if (changesCount > 0)
        {
            // changes were made: this commits the real db changes
            transaction.Commit();
        }
        else
        {
            // no changes detected -> do nothing;
            // 'transaction.Rollback();' should not be necessary since there are no changes,
            // and the using block will dispose the transaction (and with it any changes) anyway
        }
    }
}
I have extracted this example from my GitHub Exercise.EntityFramework repository. Feel free to Star/Clone/Fork...
Yes, you can.
You need to override SaveChanges in the context class so that your check is performed first, and then the regular SaveChanges is called afterwards.
Or create your own TempSaveChanges() in the context class, call it, and if it succeeds, call SaveChanges from it.
What you are referring to is known as atomicity: you want several operations to either all succeed, or none of them. In the context of a database you obtain this via transactions (if the database supports them). In your case, however, you need a transaction which spans two disjoint systems. A general-purpose, robust implementation of such a transaction (some special cases have simpler solutions) places certain requirements on the two systems, and also requires additional persistence.
Basically, you need to be able to gracefully recover from a sudden stop at any point during the sequence. Each of the databases you are using is most likely ACID-compliant, so you can count on each DB transaction to fulfill the atomicity requirement (it either succeeds or fails). Therefore, all you need to worry about is the sequence of the two DB transactions. Your requirement on the two systems is a way to determine, a posteriori, whether or not some operation was performed.
Example process flow:
Operation begins
Generate unique transaction ID and persist (with request data)
Make changes to local DB and commit
Call external Web API
Flag transaction as completed (or delete it)
Operation ends
Recovery:
Get all pending (not completed) transactions from store
Check if expected change to local DB was made
Ask Web API if expected change was made
If none of the changes were made or both of the changes were made then the transaction is done: delete/flag it.
If one of the changes was made but not the other, then either revert the change that was made (revert transaction), or perform the change that was not (resume transaction) => then delete/flag it.
Now, as you can see, it quickly gets complicated, especially if "determining whether the changes were made" is a non-trivial operation. A common solution is to use that unique transaction ID as a means of determining which data needs attention. But at this point it gets very application-specific and depends entirely on what the specific operations are. For certain applications, you can just re-run the entire operation in the recovery step (since you have the entire request data stored in the transaction). Some special cases do not need to persist the transaction at all, since there are other ways of achieving the same thing, etc.
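As a sketch of the process flow above (the PendingTransaction store, the repository objects, and the API client are hypothetical names used for illustration; JSON serialization assumes Newtonsoft.Json):

// Sketch only: persist a pending-transaction record first, then do the two writes,
// then mark it completed. The recovery job described above scans for uncompleted records.
public async Task OperationAsync(RequestData request)
{
    // 1. Generate a unique transaction ID and persist it together with the request data.
    var pending = new PendingTransaction
    {
        Id = Guid.NewGuid(),
        RequestJson = JsonConvert.SerializeObject(request),
        Completed = false
    };
    transactionStore.Add(pending);
    await transactionStore.SaveChangesAsync();

    // 2. Make the changes to the local DB and commit them.
    ApplyLocalChanges(request);
    await localDb.SaveChangesAsync();

    // 3. Call the external Web API (passing the ID makes the call easier to check or replay on recovery).
    await apiClient.SendAsync(request, pending.Id);

    // 4. Flag the transaction as completed (or delete it).
    pending.Completed = true;
    await transactionStore.SaveChangesAsync();
}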
OK, so let's clarify things a bit.
You have an MVC app, A1, with its own database, D1.
You then have an API, let's call it A2, with its own database, D2.
You want some code in A1 which does a temp save in D1, then fires a call to A2, and if the response is successful it saves the temp data from D1 in the right place this time.
Based on your pseudocode, I would suggest you create a second table in D1 where you save your "temporary" data. So your database has an extra table and the flow is like this:
First you save your A1 data in that table, then you call A2, the data gets saved in D2, A1 receives the confirmation and calls a method which moves the data from the second table to where it should be.
Scenarios to consider:
Saving the temp data in D1 works, but the call to A2 fails. You now clear the orphaned data with a batch job, or simply call something that deletes it when the call to A2 fails.
The call to A2 succeeds but the subsequent call to D1 fails, so now you have temp data in D1 which has failed to move to the right table. You could add a flag to each row of the second table indicating that the call to A2 succeeded, so this data needs to be moved to the right place when possible. You can have a service which runs periodically and, if it finds any data with the flag set to true, moves that data to the right place.
There are other ways to deal with scenarios like this. You could use a queue system to manage it: each row of data becomes a message, and you assign it a unique ID, a GUID, which is basically a CorrelationId and is the same in both systems. Even if one system goes down, when it comes back up the data will be saved, and because of the common ID you can always link it up properly.
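To make the staging-table idea concrete, here's a minimal sketch. The PendingOperations table, the ApiConfirmed flag, and the API client are assumed names, not from the question:

// Sketch only: stage the data, call A2, then move the row to its final table.
public async Task SaveWithApiAsync(MyData data)
{
    // 1. Save the data into the staging table in D1.
    var staged = new PendingOperation { Data = data, ApiConfirmed = false };
    db.PendingOperations.Add(staged);
    await db.SaveChangesAsync();

    // 2. Call A2; if this fails, the orphaned staging row is cleaned up later by a batch job.
    await apiClient.SaveAsync(data);

    // 3. Mark the row as confirmed and move it to the real table.
    staged.ApiConfirmed = true;
    db.RealRecords.Add(MapToRealRecord(staged));
    db.PendingOperations.Remove(staged);
    await db.SaveChangesAsync();
}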

What happens if a Web API controller is hit in rapid succession?

I have a Web API 2 (C#) controller method running an asynchronous save (a RESTful POST method). Our QA tester mashed the save button (a client-side issue that has since been fixed, irrelevant to the question). Without a duplicate check, this naturally saved duplicate entries to the db.
I implemented a duplicate check (duplicateCount just selects the number of items that are exactly like the item passed to post, should be 0, and it works, details irrelevant):
var duplicateCount = (fooCollection.CountAsync(aggregateFilter)).Result;
if (duplicateCount > 0) { return BadRequest(); }
This check works...except on the first button mash - two duplicate entries get saved, each from an individual controller hit.
So, it seems to me that the second controller hit happens before the first controller hit manages to save the item to the db, so the duplicate check passes. Is this possible?
I am more interested in the theory than the particular answer. Also, I know I can check for duplicates in the db as well, this is more of a conceptual question. The MongoDb part is really there just for completeness, I imagine it would be similar if I was doing an async save to SQL.
edit: Someone asked how I am making the call in comments. It's through RestAngular, but imo it's irrelevant because I know the controller is getting hit as many times as I hit the button. I also know for a fact that it does not create duplicates on a single hit.
The quick answer is "yes": with the ability of a controller to be instantiated on any number of threads simultaneously, it acts as though it's multi-threaded. Your code is not "thread safe" in that its business operation requires an exclusive lock to be placed on some element of shared information (in this case the state of the database).
You could (I would not recommend) open a Mutex or a database transaction to force single thread behaviour, but your throughput would tank.
I personally don't face this very often because of my (possibly bad) insistence on all my entities having Guid primary keys and use of the SQL Merge command to either insert or update. This may be a helpful pattern (it doesn't matter how many times you send the same "message" to the controller - it will never save a duplicate).
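For illustration, a minimal version of that pattern might look like this. The table and column names are placeholders; the point is that the caller supplies the Guid, so replaying the same request updates rather than inserting a duplicate:

// Sketch of an idempotent insert-or-update using a client-supplied Guid key and SQL MERGE.
// Requires System.Data.SqlClient; dbo.Foo and its columns are illustrative.
public void UpsertFoo(Guid id, string name)
{
    const string sql = @"
        MERGE dbo.Foo AS target
        USING (SELECT @Id AS Id, @Name AS Name) AS source
            ON target.Id = source.Id
        WHEN MATCHED THEN
            UPDATE SET Name = source.Name
        WHEN NOT MATCHED THEN
            INSERT (Id, Name) VALUES (source.Id, source.Name);";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@Id", id);
        command.Parameters.AddWithValue("@Name", name);
        connection.Open();
        command.ExecuteNonQuery();   // same Guid sent twice -> second call is an update, not a duplicate
    }
}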

Does the need to make the code simpler justify the use of wrong abstractions?

Suppose we have a CommandRunner class that runs Commands. When a Command is created, it's kept in the processingQueue for processing. If the execution of the Command finishes with errors, the Command is moved to the faultedQueue for later processing, but when everything is OK the Command is moved to the archiveQueue. The archiveQueue is not going to be processed in any way.
The CommandRunner is something like this:
class CommandRunner
{
    public CommandRunner(IQueue<Command> processingQueue,
                         IQueue<Command> faultedQueue,
                         IQueue<Command> archiveQueue)
    {
        this.processingQueue = processingQueue;
        this.faultedQueue = faultedQueue;
        this.archiveQueue = archiveQueue;
    }

    public void RunCommands()
    {
        while (processingQueue.HasItems)
        {
            var current = processingQueue.Dequeue();
            var result = current.Run();

            if (result.HasError)
                current.MoveTo(faultedQueue);
            else
                current.MoveTo(archiveQueue);
            ...
        }
    }
}
The CommandRunner receives the three dependencies as PersistentQueues. The PersistentQueue is responsible for the long-term storage of the Commands, so we free the CommandRunner from handling this.
The only purpose of the archiveQueue is to keep the design homogeneous: to keep the CommandRunner persistence-ignorant and with few dependencies.
For example, we can imagine a property like this:
IEnumerable<Command> AllCommands
{
    get
    {
        return Enumerate(archiveQueue).Union(processingQueue).Union(faultedQueue);
    }
}
Many portions of the class need to do so (handle the archive as a Queue to make the code simpler, as shown above).
Does it make sense to use a Queue even if it's not the best abstraction, or do I have to use another abstraction for the archive concept?
What are other alternatives to meet these requirements?
Keep in mind that code, especially running code, usually gets tangled and messy as time passes. To combat this, good names, good design, and meaningful comments come into play.
If you aren't going to process the archiveQueue, and it's just storage for messages that have been successfully processed, you can always store it as a different type (list, collection, set, whatever suits your needs), and then choose one of the following two options:
Keep the name archiveQueue and change the underlying type. I would leave a comment where it's defined (or injected) saying: Notice that this might not be an actual queue. The name is for consistency reasons only.
Change the name to archiveRepository or something similar, while keeping the queue type. Obviously, since it's still a queue, you'll leave a comment saying: Notice, this is actually a queue.
Another thing to keep in mind is that if you have n people working on your code base, you'll probably get n+1 different preferences about which way it should be done :)
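As a rough sketch of what swapping the archive's abstraction might look like (illustrative only; names adapted from the question, and the archive dependency becomes a plain collection while the runner stays persistence-ignorant):

// Sketch: the archive no longer pretends to be a queue.
class CommandRunner
{
    private readonly IQueue<Command> processingQueue;
    private readonly IQueue<Command> faultedQueue;
    private readonly ICollection<Command> archive;

    public CommandRunner(IQueue<Command> processingQueue,
                         IQueue<Command> faultedQueue,
                         ICollection<Command> archive)
    {
        this.processingQueue = processingQueue;
        this.faultedQueue = faultedQueue;
        this.archive = archive;
    }

    public void RunCommands()
    {
        while (processingQueue.HasItems)
        {
            var current = processingQueue.Dequeue();
            var result = current.Run();

            if (result.HasError)
                current.MoveTo(faultedQueue);
            else
                archive.Add(current);   // a simple Add, no queue semantics implied
        }
    }
}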
A queue is a useful structure when you need to care about the order of the items inside it. If, in your command post-processing, you need to care about the order in which commands ran, then a queue can be a good choice.
If you don't need information about the order of commands, maybe you can use a List (in the System.Collections namespace).
I think your choice is good; in the same case, I'd use queues. We have a good example in OS design principles: inside the OS (in the kernel), processes are queued for execution. Clearly the OS queues are more complicated, because they take other variables into account, like priority and CPU utilization, but we can learn from the use of queues as data structures in process management.

Prevent the DbContext from repeatedly trying to save bad data

I have a process that is importing an Excel spreadsheet and parsing the data into my data objects. The source of this data is very questionable, as we're moving our customer from spreadsheet-based data management to a managed database system with checks for valid data.
During my import process, I do some basic sanity checks of the data to accommodate just how bad the data could be that we're importing, but I have my overall validation being done in the DbContext.
Part of what I'm trying to do is provide the row # in the spreadsheet where the data is bad, so they can easily determine what they need to fix to get the file to import.
Once I have the data from the spreadsheet (model), and the Opportunity they're working with from the database (opp), here's the pseudocode of my process:
foreach (var model in Spreadsheet.Rows) { // Again, pseudocode
    if (opp != null && ValidateModel(model, opp, row)) {
        // Copy properties to the database object.
        // This is in a Repository-layer method, not directly in my import process;
        // it's just written here for clarity instead of several nested method calls.
        context.SaveChanges();
    }
}
I can provide more of the code here if needed, but the problem comes in my DbContext's ValidateEntity() method (override of DbContext).
Again, there's nothing wrong with the code that I've written, as far as I'm aware, but if an Opportunity fails this level of validation, it stays as part of the unsaved objects in the context, meaning it gets validated again every time ValidateEntity() is called. This leads to a repeat of the same validation error message for every row after the initial problem occurs.
Is there a way to get the context to stop trying to validate the object after it fails validation once? I know I could wait until the end and call context.SaveChanges() once to get around this, but I would like to be able to match this with which row it is in the database.
For reference, I am using Entity Framework 6.1 with a Code First approach.
EDIT Attempting to clarify further for Marc L. (including an update to the code block above)
Right now, my process will iterate through as many rows as there are in the Spreadsheet. The reason why I'm calling my Repository layer with each object to save, instead of working with an approach that only calls context.SaveChanges() once is to allow myself the ability to determine which row is the one that is causing a validation error.
I'm glad that my DbContext's custom ValidateEntity() method is catching the validation errors, but the problem resides in the fact that it keeps throwing the DbEntityValidationException for the same entity multiple times.
I would like it so that if the object fails validation once, the context no longer tries to save the object, regardless of how many times context.SaveChanges() is called.
Your question is not a dupe (this is about saving, not loaded entities) but you could follow Jimmy's advice above. That is, once an entity is added to the context it is tracked in the "added" state and the only way to stop it from re-validating is by detaching it. It's an SO-internal link, but I'll reproduce the code snippet:
dbContext.Entry(entity).State = EntityState.Detached;
However, I don't think that's the way you want to go, because you're using exceptions to manage state unnecessarily (exceptions are notoriously expensive).
Working from the information given, I'd use a more set-based solution:
modify your model class so that it contains a RowID that records the original spreadsheet row (there's probably other good reasons to have this, too)
turn off entity-tracking for the context (turning off change detection allows each Add() to be O(1))
add all the entities
call context.GetValidationErrors() and get all your errors at once, using the aforementioned RowID to identify the invalid rows.
You didn't indicate whether your process should save the good rows or reject the file as a whole, but this will accommodate either--that is, if you need to save the good rows, detach all the invalid rows using the code above and then SaveChanges().
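A rough sketch of that set-based approach with EF6 (RowId is an assumed property added to the entity to carry the spreadsheet row number; the context, entity, and helper names are illustrative):

// Sketch only: add everything, validate once, detach and report the bad rows, save the rest.
using (var context = new ImportContext())
{
    context.Configuration.AutoDetectChangesEnabled = false;   // keeps each Add() cheap

    foreach (var model in spreadsheet.Rows)
    {
        var entity = MapToEntity(model);        // copy spreadsheet values onto the entity
        entity.RowId = model.RowNumber;         // remember which spreadsheet row it came from
        context.Opportunities.Add(entity);
    }

    // GetValidationErrors() validates all tracked entities and returns results for the invalid ones.
    foreach (var error in context.GetValidationErrors())
    {
        var rowId = ((Opportunity)error.Entry.Entity).RowId;
        ReportInvalidRow(rowId, error.ValidationErrors);

        // Detach so the context stops trying to save and re-validate the bad row.
        context.Entry(error.Entry.Entity).State = EntityState.Detached;
    }

    context.SaveChanges();   // persists only the rows that passed validation
}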
Finally, if you do want to save the good rows and you're uncomfortable with the set-based method, it would be better to use a new DbContext for every single row, or at least create a new DbContext after each error. The ADO.NET team insists that context-creation is "relatively cheap" (sorry I don't have a cite or stats at hand for this) so this shouldn't damage your throughput too much. Even so, it will at least remain O(n). I wouldn't blame you, managing a large context can open you up to other issues as well.

Stack Overflow, Redis, and Cache invalidation

Now that Stack Overflow uses redis, do they handle cache invalidation the same way? i.e. a list of identities hashed to a query string + name (I guess the name is some kind of purpose or object type name).
Perhaps they then retrieve individual items that are missing from the cache directly by id (which bypasses a bunch of database indexes and uses the more efficient clustered index instead perhaps). That'd be smart (the rehydration that Jeff mentions?).
Right now, I'm struggling to find a way to pivot all of this in a succinct way. Are there any examples of this kind of thing that I could use to help clarify my thinking prior to doing a first cut myself?
Also, I'm wondering where the cutoff is between using a .net cache (System.Runtime.Caching or System.Web.Caching) and going out and using redis. Or is Redis just hands down faster?
Here's the original SO question from 2009:
https://meta.stackexchange.com/questions/6435/how-does-stackoverflow-handle-cache-invalidation
A couple of other links:
https://meta.stackexchange.com/questions/69164/does-stackoverflow-use-caching-and-if-so-how/69172#69172
https://meta.stackexchange.com/questions/110320/stack-overflow-db-performance-and-redis-cache
I honestly can't decide if this is a SO question or a MSO question, but:
Going off to another system is never faster than querying local memory (as long as it is keyed); simple answer: we use both! So we use:
local memory
else check redis, and update local memory
else fetch from source, and update redis and local memory
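In code, that lookup order is roughly as follows. This is just a sketch, not Stack Overflow's actual implementation; the redisCache wrapper and its Get/Set methods are hypothetical, and the local cache is a plain in-process dictionary (requires System.Collections.Concurrent):

private readonly ConcurrentDictionary<string, object> localCache =
    new ConcurrentDictionary<string, object>();

public T Get<T>(string key, Func<T> fetchFromSource) where T : class
{
    // 1. local memory
    if (localCache.TryGetValue(key, out var local))
        return (T)local;

    // 2. redis (hypothetical typed wrapper over the redis client)
    var fromRedis = redisCache.Get<T>(key);
    if (fromRedis != null)
    {
        localCache[key] = fromRedis;
        return fromRedis;
    }

    // 3. source of truth; update redis and local memory on the way back
    var value = fetchFromSource();
    redisCache.Set(key, value);
    localCache[key] = value;
    return value;
}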
This then, as you say, causes an issue of cache invalidation - although actually that isn't critical in most places. But for this - redis events (pub/sub) allow an easy way to broadcast keys that are changing to all nodes, so they can drop their local copy - meaning: next time it is needed we'll pick up the new copy from redis. Hence we broadcast the key-names that are changing against a single event channel name.
Tools: redis on ubuntu server; BookSleeve as a redis wrapper; protobuf-net and GZipStream (enabled / disabled automatically depending on size) for packaging data.
So: the redis pub/sub events are used to invalidate the cache for a given key from one node (the one that knows the state has changed) immediately (pretty much) to all nodes.
Regarding distinct processes (from comments, "do you use any kind of shared memory model for multiple distinct processes feeding off the same data?"): no, we don't do that. Each web-tier box is only really hosting one process (of any given tier), with multi-tenancy within that, so inside the same process we might have 70 sites. For legacy reasons (i.e. "it works and doesn't need fixing") we primarily use the http cache with the site-identity as part of the key.
For the few massively data-intensive parts of the system, we have mechanisms to persist to disk so that the in-memory model can be passed between successive app-domains as the web naturally recycles (or is re-deployed), but that is unrelated to redis.
Here's a related example that shows the broad flavour only of how this might work - spin up a number of instances of the following, and then type some key names in:
static class Program
{
    static void Main()
    {
        const string channelInvalidate = "cache/invalidate";
        using (var pub = new RedisConnection("127.0.0.1"))
        using (var sub = new RedisSubscriberConnection("127.0.0.1"))
        {
            pub.Open();
            sub.Open();

            sub.Subscribe(channelInvalidate, (channel, data) =>
            {
                string key = Encoding.UTF8.GetString(data);
                Console.WriteLine("Invalidated {0}", key);
            });

            Console.WriteLine(
                "Enter a key to invalidate, or an empty line to exit");
            string line;
            do
            {
                line = Console.ReadLine();
                if (!string.IsNullOrEmpty(line))
                {
                    pub.Publish(channelInvalidate, line);
                }
            } while (!string.IsNullOrEmpty(line));
        }
    }
}
What you should see is that when you type a key-name, it is shown immediately in all the running instances, which would then dump their local copy of that key. Obviously in real use the two connections would need to be put somewhere and kept open, so would not be in using statements. We use an almost-a-singleton for this.
