Avoid two-phase commits in an event-sourced application saving BLOB data

Let's assume we have an Aggregate User which has a UserPortraitImage and a Contract as a PDF file. I want to store files in a dedicated document-based store and just hold process-relevant data in the event (with a link to the BLOB data).
But how do I avoid a two-phase commit when I have to store the files and store the new event?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback.
But could there be an additional problem?
The next question is how to design the aggregate and the event. If the aggregate only holds a reference to the BLOB storage, what is the process after a SignUp command got called?
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Is there a better design which unburdens the aggregate of knowing that BLOB data is saved in another store? And who is responsible for storing BLOB data and forwarding the reference to the aggregate?

Sounds like you are working with something analogous to an AtomPub media-entry/media-link-entry pair. The blob goes into your data store, and the metadata gets copied into the aggregate history.
But how do I avoid a two-phase commit when I have to store the files and store the new event?
In practice, you probably don't.
That is to say, if the blob store and the aggregate store happen to be the same database, then you can update both in the same transaction. That couples the two stores, and adds some pretty strong constraints to your choice of storage, but it is doable.
Another possibility is that you accept that the two changes that you are making are isolated from one another, and therefore that for some period of time the two stores are not consistent with each other.
In this second case, the saga pattern is what you are looking for, and it is exactly what you describe; you pair the first action with a compensating action to take if the second action fails. So "manual" rollback.
Or not - in a sense, the git object database uses a two phase commit; an object gets copied into the object store, and then the trees get updated, and then the commit... garbage collection comes along later to discard the objects that you don't need.
who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Well, ultimately it is an infrastructure concern; does your model actually need to interact with the document, or is it just carrying a claim check that can be redeemed later?
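To make the ordering concrete, here is a minimal Python sketch of the "store the blobs first, then append the event, compensate on failure" flow. Everything in it (InMemoryBlobStore, InMemoryEventStore, UserSignedUp, handle_sign_up) is a hypothetical stand-in invented for illustration, not an API from any framework. The event carries only claim checks (blob ids), never the binary data.

```python
import uuid
from dataclasses import dataclass


class InMemoryBlobStore:
    """Hypothetical stand-in for the dedicated document store."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        blob_id = str(uuid.uuid4())
        self.blobs[blob_id] = data
        return blob_id

    def delete(self, blob_id: str) -> None:
        self.blobs.pop(blob_id, None)


class InMemoryEventStore:
    """Hypothetical stand-in for the event store; can be told to fail."""

    def __init__(self, fail: bool = False):
        self.events = []
        self.fail = fail

    def append(self, event) -> None:
        if self.fail:
            raise RuntimeError("event store unavailable")
        self.events.append(event)


@dataclass(frozen=True)
class UserSignedUp:
    user_id: str
    portrait_ref: str  # claim check for the portrait image
    contract_ref: str  # claim check for the contract PDF


def handle_sign_up(user_id, portrait, contract, blob_store, event_store):
    stored = []
    try:
        # First action: store the blobs and collect their references.
        for data in (portrait, contract):
            stored.append(blob_store.put(data))
        # Second action: append the event that carries only the references.
        event = UserSignedUp(user_id, stored[0], stored[1])
        event_store.append(event)
        return event
    except Exception:
        # Compensating action: remove whatever was already stored.
        for blob_id in stored:
            blob_store.delete(blob_id)
        raise
```

If the compensating delete itself fails, dead files remain in the blob store, which is why a later cleanup pass is still worth having.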

At first I'd store the documents and then the event; if the first
transaction fails it doesn't matter, the command failed. If the second
transaction fails it also doesn't matter even if we generated some
dead files in the store, the command fails; we could even apply a
rollback. But could there be an additional problem?
Not that I can think of, aside from wasted disk space. That's what I typically do when I want to avoid distributed transactions or when they're not available across the two types of data stores. Oftentimes, one of the two operations is less important and you can afford to let it complete even if the master operation fails later.
Cleaning up botched attempts can be done during exception handling, as an out-of-band process, or as part of a saga as @VoiceOfUnreason explained.
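As a rough illustration of the out-of-band variant, the sketch below finds blobs that no event references. The data shapes are assumptions made up for this example: a dict of blob id to creation time, and events as dicts with a "blob_refs" list. The grace period (also an assumption) keeps the sweep from deleting a blob whose event simply has not been appended yet.

```python
from datetime import datetime, timedelta


def find_orphaned_blobs(blob_created_at, events, now, grace=timedelta(hours=1)):
    """Return ids of blobs that no event references and that are older than `grace`.

    blob_created_at: dict of blob id -> creation time (hypothetical store listing)
    events: iterable of dicts, each with a "blob_refs" list (hypothetical shape)
    """
    referenced = {ref for event in events for ref in event.get("blob_refs", [])}
    return sorted(
        blob_id
        for blob_id, created in blob_created_at.items()
        if blob_id not in referenced and now - created > grace
    )
```

A scheduled job could call this and delete the returned ids from the blob store.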
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==>
Create new User aggregate with the given BLOB storage references and
store it?
Yes. Usually the Application layer component (the Command handler in your case) acts as a coordinator between the different data stores and gets back all it needs to know from one store before talking to the other or to the Domain.

Related

Fix inconsistent state right away or lazily when data is requested

Our users go through several steps of a workflow; the further they go, the more objects we create. We also allow users to go back to Step#1 and change one of the existing objects, which may cause inconsistencies, so we must update/delete some of the objects at Step#2. I see 2 options:
1. Update/delete objects from Step#2 right away. This leads to:
   - Operation that's supposed to be a simple PATCH of an entity field becomes complicated. And it's a shared object between multiple workflows - so we'll have to add if-statements and do different things depending on the workflow.
   - Circular dependencies. Operations on Step#1 have to know about objects/operations on Step#2.
   - On each request in Step#1 we'd have to load data for Step#2 in order to determine whether Step#2 really needs to be updated. Which slows down operations on Step#1. So to change 1 record in DB we'll have to load hundreds (or even thousands) records for Step#2.
   - Many actions on Step#1 may need fixing state at Step#2. So we have to ensure we don't forget anything today and in the future.
2. Fix Step#2 lazily - when user goes there (our current approach). Step#2 will recognize that objects are inconsistent and fix them. Which leads to just 1 place where we need to care, but:
   - Until user opens Step#2 - DB will contain inconsistent objects. This hasn't resulted in any problems so far. But I can imagine it may complicate future SQL migrations.
   - We update DB state on GET request. This one doesn't seem like that big of a deal since GET stays idempotent anyway. But still it feels awkward.
Does anyone know better approaches? Or maybe improvements to these two?
Update
I haven't found a perfect solution, but eventually we implemented an improved version of #1. When updating state on Step#1 we also set a flag "need to rebuild Step#2". When the UI opens Step#2, it first checks this flag and issues a PUT to rebuild the state, and only then does it GET Step#2.
This still means that DB state is inconsistent for some period of time. But at least we'll know this for sure from the flag in DB. And if needed - we could write migrations taking this flag into account. This also allows (if needed in the future) to create an async job to fix the state.
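A minimal sketch of that flag-based flow, with in-memory state standing in for the DB. The Workflow class, its method names, and the way Step#2 objects are derived are all hypothetical, chosen only to show which request does what:

```python
class Workflow:
    """Hypothetical in-memory model; attributes stand in for DB columns."""

    def __init__(self, step1_value):
        self.step1_value = step1_value
        self.step2_objects = self._derive_step2(step1_value)
        self.step2_needs_rebuild = False

    @staticmethod
    def _derive_step2(step1_value):
        # Placeholder for whatever Step#2 really builds from Step#1 data.
        return [f"{step1_value}-item-{i}" for i in range(2)]

    def patch_step1(self, new_value):
        """PATCH on Step#1: stays simple, it only raises the flag."""
        self.step1_value = new_value
        self.step2_needs_rebuild = True

    def rebuild_step2(self):
        """PUT issued by the UI before opening Step#2; safe to repeat."""
        if self.step2_needs_rebuild:
            self.step2_objects = self._derive_step2(self.step1_value)
            self.step2_needs_rebuild = False

    def get_step2(self):
        """GET on Step#2: reads only, never writes."""
        return list(self.step2_objects)
```

An async job could call rebuild_step2 for every workflow whose flag is set, which is the future option mentioned above.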
I think it is more flexible to separate the state and the context where the objects are stored. Any creation of a new object at any step is accompanied by the preservation of the invariant and consistency of context.
There are separate rules for states (the rules for transitioning from one state to another and the objects available for creation) and separate rules for the context (the rules for its consistency, which is enforced every time it changes).
What about dirty data asynchronous cleanup?
1) Whenever the user goes back to Step #1 and changes something, mark all related data as "dirty" (e.g. add links to it in a "DirtyData" table) and be done for now.
2) Have a DataCleanup worker (e.g. a separate thread or something similar) that constantly looks for data to be cleaned up.
3) Before editing data for Step #2, check that the data is not dirty.
Depending on your logic, 3) might result in user error (e.g. user would need to repeat Step #2). If DataCleanup worker has enough resources (i.e. it processes DirtyData table almost instantaneously), that should happen only on very rare occasions. If that is not OK, you could opt for checking for dirty data on each fetch, but that could be expensive.
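Roughly, the three steps could look like the sketch below. DirtyDataQueue and edit_step2 are hypothetical names, the queue lives in memory instead of a "DirtyData" table, and the worker is reduced to a drain method that a background thread or job would call in a loop.

```python
from collections import deque


class DirtyDataQueue:
    """In-memory stand-in for the "DirtyData" table."""

    def __init__(self):
        self._pending = deque()
        self._dirty = set()

    def mark_dirty(self, object_id):
        # Step 1: called when Step #1 changes; cheap, no cleanup yet.
        if object_id not in self._dirty:
            self._dirty.add(object_id)
            self._pending.append(object_id)

    def is_dirty(self, object_id):
        return object_id in self._dirty

    def drain(self, cleanup):
        # Step 2: one pass of the DataCleanup worker.
        cleaned = 0
        while self._pending:
            object_id = self._pending.popleft()
            cleanup(object_id)
            self._dirty.discard(object_id)
            cleaned += 1
        return cleaned


def edit_step2(queue, object_id, apply_edit):
    # Step 3: refuse to edit data that is still waiting for cleanup.
    if queue.is_dirty(object_id):
        raise RuntimeError(f"{object_id} is dirty, retry after cleanup")
    apply_edit(object_id)
```

The RuntimeError is where the user-visible retry would surface if the worker has not caught up yet.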
It sounds like you're familiar with the HTTP spec regarding GET requests, but for future readers:
Why shouldn't a GET request change data on the server?
Why is using a HTTP GET to update state on the server in a RESTful call incorrect?
For the other bullet under 2, we probably don't need a specification to agree that persisting valid data is preferable to persisting invalid data.
So what can we do for the bullets under 1 to avoid complex branching logic in a particular step and also circular dependencies? My suggestion is an event-driven design. When step #2 changes it should fire a change event. In this scenario, step #2 has no knowledge of the concrete listener(s) who may receive its events, so it remains decoupled from any complex handling logic.
There's probably no way to guarantee you don't forget anything in the future; but if every step in the workflow is defined as a listener, it forces you to consider change events to some extent every time you implement a new step.
One side note on granularity: if a step has many changes, it can batch up its events rather than fire each one individually. You can adjust the size for efficiency.
In summary, I would strongly consider the Observer design pattern.
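Here is a small sketch of what that could look like. EventBus, DependentStep, and the "step_changed" event name are made up for illustration; the point is only that the publisher never references the listener, and that a batch of changes can travel in one event.

```python
from collections import defaultdict


class EventBus:
    """Minimal observer registry: listeners subscribe by event type."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        self._listeners[event_type].append(listener)

    def publish(self, event_type, changes):
        # `changes` is a list, so a step can batch several changes per event.
        for listener in list(self._listeners[event_type]):
            listener(changes)


class DependentStep:
    """A workflow step that reacts to change events from another step."""

    def __init__(self):
        self.needs_rebuild = False
        self.received = 0

    def on_change(self, changes):
        self.received += len(changes)
        self.needs_rebuild = True
```

Wiring is then a single call such as bus.subscribe("step_changed", step.on_change), made wherever the workflow is assembled rather than inside either step.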

How to add read-after-update consistency to AWS S3

I'm trying to use S3 as a datastore for an application (for reasons that are not relevant to this question, I can't use a proper DBMS).
My issue is that I want read-after-update consistency: if I try to read an object immediately after updating it, I want to be guaranteed to get the updated object.
S3 guarantees read-after-write consistency (i.e., if you try to read an object immediately after creating it, you're guaranteed to succeed).
However, it doesn't guarantee read-after-update consistency (i.e., if you try to read an object immediately after updating it, you may get a previous version of the object).
Additionally it has some more caveats.
What can I add to my solution so that it would have read-after-update consistency?
I've thought of a couple of options:
A write-through cache: each write updates a cache first (thinking of using MongoDB as I'm most familiar with it, and it has fast reads for recent objects), then S3. Each read tries the cache first, and falls back to S3 if not found.
However, this solution doubles the chances of failure (i.e., if MongoDB is down, or S3 is down, then my whole database is down). Ideally I'd like to have only S3 as a point of failure.
Always write new objects, never update: I thought I could work around this by utilizing the read-after-write guarantee; instead of creating object /my-bucket/obj-id.json, I'll create /my-bucket/obj-id/v1.json. When I update I'll just create V2 etc.
When reading, I'll need to list the keys with the obj-id path, and choose the latest version.
However, one of the above caveats is:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
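For reference, the key scheme from the second option could be sketched like this. version_key and latest_key are hypothetical helpers, and the zero-padding is my own assumption (the plain v1.json, v2.json names would sort v10 before v2 as strings). The sketch works on a plain list of key names, so it shows only the naming and selection logic; it does nothing about the listing caveat quoted above.

```python
def version_key(obj_id: str, version: int) -> str:
    """Build a key whose lexicographic order matches its numeric version."""
    return f"{obj_id}/v{version:010d}.json"


def latest_key(keys, obj_id):
    """Pick the newest version of `obj_id` from an already-fetched key listing."""
    prefix = f"{obj_id}/v"
    candidates = [key for key in keys if key.startswith(prefix)]
    return max(candidates) if candidates else None
```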
Is there any way to "cheat" S3's eventual consistency? Or am I stuck with a cache component?
Or maybe there's a third, better, idea?
thanks!

CQRS Command how to store and query entities that are not persisted to data store immediately

In CQRS, we separate Commands and Queries. As I understand it, Commands raise Domain Events that may modify Entity states, while Queries return View-specific DTOs directly from a data store. According to this article, the UI makes commands through a Command Bus, which creates Commands that are handled by their respective CommandHandlers, which then orchestrate the Domain Logic to determine the occurrence of Domain Events and persist/publish any state changes to a Repository (optionally using Event Sourcing). After being persisted, state changes are available through Queries.
Now, what if a Command creates an Entity that is not persisted/published immediately? Firstly, where is that not-yet-persisted Entity held? Is it in the Command Bus, the Command Handler, the Repository, or should a new thin application layer hold it? How should a Query gain access to it?
The problem here is that it seems like any Queries for unpersisted Entities differ significantly from those of persisted Entities, unless CQRS demands that ALL Entities be persisted upon creation, which IMO is not necessarily compatible with all Domains.
Specifically, I'm trying to build software to record training information for various Training Sessions. However, I would like it if Training Sessions were persisted manually by a Save Session button as opposed to always upon creation. I don't know where a StartNewTrainingSessionCommand would store the new Training Session so that it can be Queried, if not in the data store.
I think you understood things a bit wrong: A command is sent via a service bus to a command handler which uses the business objects to do the work. Domain events should be generated by the business (domain) objects, but sometimes the command handler does that too.
I don't see a reason for a created entity not to be saved. In your particular case, if the domain allows it, you can have a default, empty TrainingSession saved automatically and then updated when the user presses the Save button.
If this approach is not feasible, then simply store the input data (pretty much the view models) in a temporary place (session, DB) and issue the command only when the user clicks the button.
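A rough sketch of that second option follows. DraftStore, save_session, and the "SaveTrainingSession" command shape are hypothetical, and the command bus is reduced to a plain callable. The unsaved session lives only in the draft store, so the UI reads it from there rather than through a Query on the data store.

```python
import uuid


class DraftStore:
    """Temporary home for input data (e.g. a web session or a scratch table)."""

    def __init__(self):
        self._drafts = {}

    def start(self):
        draft_id = str(uuid.uuid4())
        self._drafts[draft_id] = {"exercises": []}
        return draft_id

    def add_exercise(self, draft_id, exercise):
        self._drafts[draft_id]["exercises"].append(exercise)

    def get(self, draft_id):
        # The UI reads the unsaved session from here, not via a Query.
        return self._drafts[draft_id]

    def pop(self, draft_id):
        return self._drafts.pop(draft_id)


def save_session(draft_id, drafts, send_command):
    """Called by the Save Session button: only now is a command issued."""
    draft = drafts.pop(draft_id)
    send_command({"type": "SaveTrainingSession", "exercises": draft["exercises"]})
```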

Mixing eventual consistency systems and legacy ACID systems

Are there any patterns for mixing eventual consistency systems with legacy ACID-systems?
I want to store data in some (at least two) legacy systems on the mainframe that need ACID-like transactions. Those mainframe databases (let us call them OldWorld) are running under the same transaction manager in the same process, so consistency among the mainframe systems is no problem.
I have a transaction manager that can handle XA-Transactions with the mainframe-tm and the ACID-able relational database in the non-mainframe environment (let us call this NewWorld).
But I do not want to use XA transactions because they often cause trouble with long-running locks on the mainframe side, and in many cases I do not need all ACID features for both worlds. I always want a consistent mainframe (all data in the OldWorld is consistent inside the OldWorld). The NewWorld system can handle inconsistent data (inconsistency between New and Old) when it reads data from the mainframe side. The operations that are used to store data in the OldWorld are simple and safe “add-only operations” that cannot fail functionally (they can fail technically, but this should always be a temporary failure).
My idea to work around the need for a distributed transaction is to update the data in the OldWorld asynchronously and use an event-sourcing data layer (in the NewWorld) to store the information about what needs to be done in the OldWorld, using “soft-transaction-ids” to prevent double-submitting to the OldWorld. These “soft-transaction-ids” will be generated while storing the data in the event-sourcing data layer for a transaction that needs to be done in the OldWorld.
I don't have the chance to add my “soft-transaction-ids” to the OldWorld databases, but I can add a new database that stores a “Done” state beside the “soft-transaction-id” and make the update of this database part of the OldWorld transactions. Then another async process can read the state information without any locking and update the NewWorld (e.g., update the relational model with data from the event-sourcing store and mark the soft-transaction-id as done (“globally consistent”)). The update of the OldWorld will always check first whether the soft-transaction-id has already been committed.
As I read through what I have written, I get the feeling that it is like a global transaction, just with less locking. The knowledge that my update to the OldWorld will functionally succeed is essential; without it you need a manual merge process that can handle the functional conflicts. The NewWorld systems need the functionality to handle inconsistent global state. This can be done by reading the relational database and mimicking the OldSystem data requests by analysing the events in the event store that are not yet committed to the OldWorld database. For all other transactions I need to use distributed transactions with their locking behavior.
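To illustrate the “Done” table idea, here is a small sketch in which an in-memory SQLite database stands in for the OldWorld. The table names ledger and done, and the functions make_old_world and apply_to_old_world, are hypothetical. The point is that the soft-transaction-id and the add-only write are committed in the same local transaction, so replaying an event from the NewWorld event store cannot apply it twice.

```python
import sqlite3


def make_old_world():
    """In-memory SQLite database standing in for the OldWorld."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ledger (entry TEXT NOT NULL)")
    db.execute("CREATE TABLE done (soft_tx_id TEXT PRIMARY KEY)")
    db.commit()
    return db


def apply_to_old_world(db, soft_tx_id, entry):
    """Apply one add-only operation at most once.

    Returns True if the entry was applied now, False if the
    soft-transaction-id had already been committed earlier.
    """
    try:
        with db:  # one local transaction: commit on success, roll back on error
            db.execute("INSERT INTO done (soft_tx_id) VALUES (?)", (soft_tx_id,))
            db.execute("INSERT INTO ledger (entry) VALUES (?)", (entry,))
        return True
    except sqlite3.IntegrityError:
        # Duplicate id: an earlier attempt already committed this transaction.
        return False
```

The async process that updates the NewWorld can then read the done table to learn which soft-transaction-ids are safe to mark as globally consistent.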

Core Data: how to undo operations once managed objects are saved with context

I am trying to implement downloading of bulk data from several tables on the server.
In my case there are 16 tables. For all these tables I will be firing 10 requests to the server. I have grouped related tables logically, but nearly all of the tables are related to each other through one relationship or another.
I need to consider three cases while doing downloading:
Saving data to each table locally.
Managing relationships between inserted objects.
Handling situation when one of the requests fails during download, say 8th request failed.
I will be following this approach for each response:
Inserting data in managed object context.
Managing relationships by firing NSPredicate and associating the related objects.
Saving the context.
In case of a response failure, I have two options:
Next time continue from the failed response.
Revert all saved data to its previous state.
The 1st approach may lead to some data inconsistency, so I am going with the 2nd approach.
I know that if a managed object context is not saved, we can revert the changes, but is it possible to revert the changes, if the managed object context is saved?
I require some useful answers from the community.
Please suggest.
Is it possible to revert the changes, if the managed object context is saved?
After saving? Maybe, but it could be tricky. If you set up a separate managed object context for your network operations, and give it an NSUndoManager, you could later on tell the undo manager to roll everything back to the previous state.
It would be simpler to just not save changes until you're finished, though. Using an undo manager doesn't really help much: the memory needed to store up all the undo actions will at least match the memory use from keeping all of the unsaved changes around until you're finished. If you're working on a separate managed object context (whether a child context or a completely separate context), handling the error case is as simple as letting the MOC get deallocated without saving changes first.