Microservices + CQRS implementation - amazon-s3

I am working on implementing a microservice architecture using the CQRS pattern. I have a working implementation using API Gateway, Lambda and DynamoDB with one exception - the event sourcing.
Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume. This notification represents an event that took place as part of the originating HTTP request. For instance, if the user makes an HTTP POST with a complete "check patient into hospital" model, then the Lambda will break that apart and publish multiple events in sequential order:
Patient Checked in (includes Patient Id, hospital id + visit id)
Room Assigned (includes room number, + visit id)
Patient tested (includes tested + visit id)
Patient checked-out (visit id)
The intent for this pattern is to provide an audit trail of all events that took place while the patient was in the hospital. This example (not what I'm actually building) would be stored in an event source that can be replayed at any time. If the VisitId was deleted across all services, we could just replay the events one at a time, in order, and reproduce an exact copy of the original record. You consider all records immutable to achieve this. Each POST would push into the event source and then land in the database that serves the data during an HTTP GET request. It would also have subscribers that would take pieces of this data and do other things, such as a "Visit Survey" service that would listen to the Patient Checked Out event and prep a post-op survey.
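To make the replay idea concrete, here is a minimal sketch in Python. The event names mirror the list above; the `sequence` field, the `data` keys, and the `apply`/`replay` helpers are assumptions made for illustration, not part of any AWS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int   # position in the stream (assumed to exist on every event)
    visit_id: str
    kind: str
    data: dict


def apply(record, event):
    """Return a new record with one event folded in; the input is not mutated."""
    updated = dict(record)
    if event.kind == "PatientCheckedIn":
        updated.update(event.data)
        updated["status"] = "checked_in"
    elif event.kind == "RoomAssigned":
        updated["room_number"] = event.data["room_number"]
    elif event.kind == "PatientTested":
        updated["tests"] = updated.get("tests", []) + [event.data["test"]]
    elif event.kind == "PatientCheckedOut":
        updated["status"] = "checked_out"
    return updated


def replay(events, visit_id):
    """Rebuild the record for one visit by applying its events in order."""
    record = {"visit_id": visit_id}
    relevant = (e for e in events if e.visit_id == visit_id)
    for event in sorted(relevant, key=lambda e: e.sequence):
        record = apply(record, event)
    return record
```

Because `apply` never mutates its input and the stored events never change, replaying the same stream always produces the same record.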
I've looked at several AWS services to provide this. I know about Kinesis Data Streams but I don't like the pricing structure nor do I want to deal with shards (no autoscaling). Since my entire platform is built on consumption based pricing (Dynamo, Lambda etc) I want to keep my event source the same way. This makes it easier for me to estimate a per-user cost as I just do math based on estimated requests per month, per user.
I've been using SNS for the stream itself, delivering the notifications, and it's been great. It is super fast and I have not had any major issues while developing with it. The issue, though, is that it is not suitable as a replay store, only for delivery of the event messages. For a replay store I thought Kinesis Firehose made a lot of sense: send it to S3 + SNS at the same time. It turns out SNS isn't an available delivery destination. I can Put to S3 myself and then publish to SNS, but that seems like duplicate work in the code base when I can set up an S3 trigger to fire a Lambda, and just have another small Lambda that reacts to the event landing in S3 and does the insert into DynamoDB. I've seen that this can be much slower than just publishing through SNS, though. I'm also not sure about retry policies on the Put event. It does simplify retries, though, as I can just re-use the code in the triggered Lambda to replay all events in a bucket path.
I could just PutObject and then Publish to SNS within the same HTTP POST Lambda. If the SNS Publish fails, though, I now have an object in S3 that was never published. I'd have to write a different Lambda to handle the fixing and publishing. Not the end of the world, since either way I have two Lambdas to deploy. I'm just not sure which way makes more sense in this pattern with AWS services.
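The "store first, publish second, repair later" option can be sketched without any AWS calls. Everything below is hypothetical: `EventLog` is an in-memory stand-in for the bucket, and `publish` is whatever callable would wrap the SNS publish in a real deployment.

```python
class EventLog:
    """In-memory stand-in for the replay store: event_id -> payload + published flag."""

    def __init__(self):
        self._entries = {}

    def put(self, event_id, payload):
        # Storing the same event twice keeps the first copy (records are immutable).
        self._entries.setdefault(event_id, {"payload": payload, "published": False})

    def mark_published(self, event_id):
        self._entries[event_id]["published"] = True

    def unpublished(self):
        return [(eid, e["payload"]) for eid, e in self._entries.items()
                if not e["published"]]


def store_and_publish(log, publish, event_id, payload):
    """Write to the durable log first, then try to publish."""
    log.put(event_id, payload)
    try:
        publish(event_id, payload)
    except Exception:
        return False  # stored but not published; the repair pass picks it up
    log.mark_published(event_id)
    return True


def republish_pending(log, publish):
    """Repair pass: publish everything that was stored but never delivered."""
    repaired = 0
    for event_id, payload in log.unpublished():
        try:
            publish(event_id, payload)
        except Exception:
            continue
        log.mark_published(event_id)
        repaired += 1
    return repaired
```

Note that a crash between a successful publish and `mark_published` would cause the repair pass to publish the event a second time, so subscribers would need to tolerate duplicates.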
Has anyone done something similar and have any recommendations? Am I working my way into a technical hole that will be difficult to manage later? I'm open to other paths as well if I can keep it to a consumption based pricing model. Thanks!

Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume.
You'll want to be a little bit careful here -- there are at least two different definitions of "event sourcing" running around.
If you care about event sourcing, in the sense usually coupled with CQRS (Greg Young, et al), then your events are your book of record. The important complication this introduces is that your service needs to be able to lock the "event stream" when making changes to it (without that lock, you run into "lost edit" scenarios and have to clean up the mess).
So the "pointer to your current changes" needs to live in something that has transactions. DynamoDB should be fine for this (based on my memory of the event sourcing break out room at re:Invent 2017). In theory, you could have the lock in dynamo, which contains a pointer to an immutable document stored in S3. I haven't been able to persuade myself that the trade offs justify the complexity, but as best I can tell there's nothing in that architecture that violates physics and causality.
If your operations team isn't happy with Dynamo, another reasonable option is RDS; choose your preferred relational data engine, deploy an event storage schema to it, and off you go.
As for the pub sub part, I believe you to be on the right track with SNS. It's the right choice for "fanning out" messages from a publisher to multiple consumers. Yes, it doesn't support replay, but that's fine -- replay can happen by pulling events from the book of record. See the later parts of Greg Young's Polyglot Data talk. Yes, sometimes you will get messages on both the push channel and the pull channel, but that's fine; you already signed up for idempotent message handling when you decided a distributed architecture was a good idea.
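As a rough illustration of what idempotent handling means when the same event can arrive on both the push and the pull channel, here is a small Python sketch. The `event_id` argument and the in-memory `_seen` set are assumptions; a real consumer would need to keep the set of processed ids somewhere durable.

```python
class IdempotentHandler:
    """Wraps a handler so that each event id is applied at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory only, for illustration

    def handle(self, event_id, payload):
        if event_id in self._seen:
            return False  # duplicate delivery, already applied
        self._handler(payload)
        self._seen.add(event_id)
        return True
```

Usage: `handler.handle("evt-1", payload)` returns `True` the first time and `False` for any later delivery of `"evt-1"`, whichever channel it came from.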
Why the need to store a pointer in DynamoDB?
Because S3 doesn't offer you any locking; which means that on the unhappy path, where two copies of your logic are trying to write different versions of your data, you end up victim to the lost edit problem.
You could manage the situation with optimistic locking - something analogous to HTTP's conditional PUT; but S3 (last time I checked) doesn't support conditional modification.
You could use S3 as an object store for immutable documents, but now you need some mechanism to determine which document in S3 is the "current" one. If you try to implement that in S3, you run into the same lost edit problem all over again.
So you need a different tool to handle that part of the problem; some tool that is suitable for "state succession". So DynamoDB fits there.
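The lost edit problem and the optimistic-locking fix can be shown with a tiny in-memory sketch. `EventStream` and `ConcurrencyError` are hypothetical names; in a real store the version check and the write would have to happen atomically, for example as a conditional write in DynamoDB.

```python
class ConcurrencyError(Exception):
    """Raised when the stream changed since the writer last read it."""


class EventStream:
    def __init__(self):
        self._events = []

    @property
    def version(self):
        return len(self._events)

    @property
    def events(self):
        return list(self._events)

    def append(self, expected_version, new_events):
        # The writer states which version its decision was based on.
        # If someone else appended in the meantime, refuse instead of
        # silently overwriting their edit.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self._events.extend(new_events)
        return self.version
```

A writer that gets `ConcurrencyError` reloads the stream, re-runs its logic against the new history, and tries again.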
If you are using DynamoDB for locking, can you also use it for event storage? I don't have enough laps to feel confident that I know the answer there. For small problems, I'm mostly confident that the answer is yes. For large problems...?
Possibly useful discussions:
Rich Hickey; The Language of the System
Kenneth Truyers; Git as a NoSql Database


Need suggestions: Send multiple images to backend, perform upload operation in backend, send response

I need some best practice guidelines for a backend service in a scenario like this one:
UI sends multiple images for uploading to the backend service
Backend service receives all of the images and processes upload to storage one by one
There can be failure in 1 or multiple image upload
My question is how do I send the response towards UI if my backend service is unable to upload 1 or more file(s).
One way can be to send failed and successful image link together in a JSON response body. So the UI knows about the failure and handles it in its own way.
Another way can be to send only the successfully uploaded images' link which is the best case scenario.
Any suggestions will be welcomed with some reference links.
Use an Orchestrator - something specific that can coordinate multiple actions and provide a meaningful result back to the caller.
This might be as simple as a component sitting in the UI that orchestrates calls to the backend. The UI component and the backend service might be designed as parts of a cohesive solution, or the UI component might simply act as a type of client/proxy/facade to some random backend service.
UI calls the orchestrator with references to all the images it needs uploading.
The orchestrator works through the items, uploading each as you prefer (sequentially or in parallel, etc). For each file, handle errors however you prefer - e.g. try once and die gracefully on failure; put errors into a queue or some other mechanism for retry (how many times is up to you); etc.
Based on rules internal to the orchestrator, return status to the caller.
For potentially long-running processes (like file uploads) make sure the call to the orchestrator is asynchronous.
Rather than only returning a "complete" result at the end, the orchestrator might provide a simple status back, allowing callers to get some idea of where processing is at. For example, you might have a call-back (from the orchestrator to its caller) that simply emits very simple statuses like: processing, failed and complete. A more complex solution would be for the orchestrator to return more specific info like %complete and detailed error info.
Have a look at how the big cloud providers do complex file uploads by reading their documentation and studying their APIs.
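A minimal sketch of such an orchestrator in Python, under some assumptions: `upload` is a hypothetical callable that stores one image and returns its link (raising on failure), `on_status` is the optional call-back, and the rule "any failure means the batch status is failed" is just one possible internal rule.

```python
def upload_batch(images, upload, on_status=None):
    """Upload each image in turn and report per-file results.

    images: dict mapping file name -> bytes
    upload: callable(name, data) -> link; raises on failure
    on_status: optional callable receiving "processing", "failed" or "complete"
    """
    notify = on_status or (lambda status: None)
    notify("processing")
    result = {"uploaded": [], "failed": []}
    for name, data in images.items():
        try:
            link = upload(name, data)
        except Exception as exc:
            # One bad file does not abort the batch; record it and carry on.
            result["failed"].append({"name": name, "error": str(exc)})
        else:
            result["uploaded"].append({"name": name, "link": link})
    notify("failed" if result["failed"] else "complete")
    return result
```

The returned dictionary carries both the successful links and the failures, so the UI can decide how to present or retry the failed files.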
I need some best practice guidelines for a backend service
In no particular order:
Keep it as simple as possible - generally, the fewer moving parts the better. E.g. pay attention to the Single Responsibility Principle (SRP).
Clean up after yourself. If the upload service generates any data - make sure you have a clean-up process so you don't end up with mountains of un-needed data lying around, especially stuff like image files. If you design an upload solution that maintains state (which is independent of what happens to the images once they are uploaded) then you'll be storing data which probably won't be needed once the images are all processed.
Think about support - not just developer debugging but also operational support. Getting your solution into production is not the end result, it's just the beginning.
If designing this solution across teams (e.g. frontend and backend teams) make sure both teams are involved in the design. If the backend team can't provide a solution that works for the frontend team then it's not going to end well.
Think about the likely error scenarios and how can you handle them.
This isn't really just a question of best practice, as there are multiple ways you could implement it, more than one of which could be valid. This is actually an architecture and design question, with more than one valid answer, hence I don't think it fits as a Stack Overflow question and you will not get references to any one correct approach.
That said, by way of an answer I will outline what I think you need. At a very high level, and not necessarily in this order but taking these factors into account, I would:
Design the UI process flow. For example, you may decide that the user process will have several stages:
User selects first image for upload;
User selects each subsequent image for upload;
User presses some kind of "Go" button after selecting all images;
System now uploads the batch, and user receives a response confirming success or otherwise;
User has option to click through to detailed success/error details.
Design the required success/error reports
Design the data needed to support the overall functionality
Provide one or more APIs giving the upload function and the report function(s) the CRUD access they need to this data
If you hit any specific technical issues at any stage, then please post new questions accordingly as you go.
As to the point you mentioned, how to send the UI response, there is more than one valid way, but I would return a basic success/failure response initially, containing only minimal details such as the number of successes, and return more details in further messages in response to user actions (such as clicking through to detailed success/error details), at which point I would retrieve the requested error details from the database.
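As a rough sketch of that two-step response, the following Python keeps the per-file outcomes in a store and hands back only counts at first. `UploadReports` and its in-memory dictionary are hypothetical stand-ins for the database and the report API.

```python
class UploadReports:
    """Stores per-file outcomes; returns a summary now and details on request."""

    def __init__(self):
        self._details = {}  # batch_id -> list of (file name, error or None)

    def save(self, batch_id, outcomes):
        """Persist the outcomes and return the minimal initial response."""
        self._details[batch_id] = list(outcomes)
        succeeded = sum(1 for _, error in outcomes if error is None)
        return {
            "batch_id": batch_id,
            "succeeded": succeeded,
            "failed": len(outcomes) - succeeded,
        }

    def errors(self, batch_id):
        """Detailed error report, fetched only when the user clicks through."""
        return [
            {"name": name, "error": error}
            for name, error in self._details[batch_id]
            if error is not None
        ]
```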
As I said at the start of my answer, I don't think your question can be answered just in terms of best practices, as it's a whole architecture and design question, but I hope my answer helps you along this path.

Maintain Consistency in Microservices [duplicate]

What is the best way to achieve DB consistency in microservice-based systems?
At the GOTO in Berlin, Martin Fowler was talking about microservices and one "rule" he mentioned was to keep "per-service" databases, which means that services cannot directly connect to a DB "owned" by another service.
This is super-nice and elegant but in practice it becomes a bit tricky. Suppose that you have a few services:
a frontend
an order-management service
a loyalty-program service
Now, a customer makes a purchase on your frontend, which will call the order-management service, which will save everything in the DB -- no problem. At this point, there will also be a call to the loyalty-program service so that it credits / debits points from your account.
Now, when everything is on the same DB / DB server it all becomes easy since you can run everything in one transaction: if the loyalty program service fails to write to the DB we can roll the whole thing back.
When we do DB operations throughout multiple services this isn't possible, as we don't rely on one connection / take advantage of running a single transaction.
What are the best patterns to keep things consistent and live a happy life?
I'm quite eager to hear your suggestions!..and thanks in advance!
This is super-nice and elegant but in practice it becomes a bit tricky
What it means "in practice" is that you need to design your microservices in such a way that the necessary business consistency is fulfilled when following the rule:
that services cannot directly connect to a DB "owned" by another service.
In other words - don't make any assumptions about their responsibilities and change the boundaries as needed until you can find a way to make that work.
Now, to your question:
What are the best patterns to keep things consistent and live a happy life?
For things that don't require immediate consistency, and updating loyalty points seems to fall in that category, you could use a reliable pub/sub pattern to dispatch events from one microservice to be processed by others. The reliable bit is that you'd want good retries, rollback, and idempotence (or transactionality) for the event processing stuff.
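A minimal Python sketch of the retry and idempotence parts, assuming a hypothetical `OrderPlaced` event with `order_id` and `points` fields. This is not how NServiceBus or MassTransit work internally; it only illustrates why redelivery is harmless when the handler is idempotent.

```python
def deliver(event, handler, max_attempts=3):
    """Call the handler until it succeeds or the attempts run out."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            continue  # a real system would back off and log here
    return False  # caller would move the event to an error/dead-letter queue


class LoyaltyLedger:
    """Credits points per order; keyed by order id so redelivery cannot double-credit."""

    def __init__(self):
        self._credits = {}  # order_id -> points

    def on_order_placed(self, event):
        self._credits[event["order_id"]] = event["points"]

    def balance(self):
        return sum(self._credits.values())
```

Delivering the same event twice leaves the balance unchanged, which is what makes aggressive retries safe.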
If you're running on .NET some examples of infrastructure that support this kind of reliability include NServiceBus and MassTransit. Full disclosure - I'm the founder of NServiceBus.
Update: Following comments regarding concerns about the loyalty points: "if balance updates are processed with delay, a customer may actually be able to order more items than they have points for".
Many people struggle with these kinds of requirements for strong consistency. The thing is that these kinds of scenarios can usually be dealt with by introducing additional rules, like if a user ends up with negative loyalty points notify them. If T goes by without the loyalty points being sorted out, notify the user that they will be charged M based on some conversion rate. This policy should be visible to customers when they use points to purchase stuff.
I don’t usually deal with microservices, and this might not be a good way of doing things, but here’s an idea:
To restate the problem, the system consists of three independent-but-communicating parts: the frontend, the order-management backend, and the loyalty-program backend. The frontend wants to make sure some state is saved in both the order-management backend and the loyalty-program backend.
One possible solution would be to implement some type of two-phase commit:
First, the frontend places a record in its own database with all the data. Call this the frontend record.
The frontend asks the order-management backend for a transaction ID, and passes it whatever data it would need to complete the action. The order-management backend stores this data in a staging area, associating with it a fresh transaction ID and returning that to the frontend.
The order-management transaction ID is stored as part of the frontend record.
The frontend asks the loyalty-program backend for a transaction ID, and passes it whatever data it would need to complete the action. The loyalty-program backend stores this data in a staging area, associating with it a fresh transaction ID and returning that to the frontend.
The loyalty-program transaction ID is stored as part of the frontend record.
The frontend tells the order-management backend to finalize the transaction associated with the transaction ID the frontend stored.
The frontend tells the loyalty-program backend to finalize the transaction associated with the transaction ID the frontend stored.
The frontend deletes its frontend record.
If this is implemented, the changes will not necessarily be atomic, but it will be eventually consistent. Let’s think of the places it could fail:
If it fails in the first step, no data will change.
If it fails in the second, third, fourth, or fifth, when the system comes back online it can scan through all frontend records, looking for records without an associated transaction ID (of either type). If it comes across any such record, it can replay beginning at step 2. (If there is a failure in step 3 or 5, there will be some abandoned records left in the backends, but it is never moved out of the staging area so it is OK.)
If it fails in the sixth, seventh, or eighth step, when the system comes back online it can look for all frontend records with both transaction IDs filled in. It can then query the backends to see the state of these transactions—committed or uncommitted. Depending on which have been committed, it can resume from the appropriate step.
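The staged flow above could be sketched roughly as follows. `Backend`, `resume`, and the dictionary-shaped frontend record are all hypothetical; the sketch assumes that finalizing is idempotent, which the recovery steps rely on.

```python
import itertools


class Backend:
    """Stand-in for the order-management or loyalty-program backend."""

    def __init__(self):
        self.staged = {}
        self.committed = {}
        self._ids = itertools.count(1)

    def stage(self, data):
        tx_id = next(self._ids)
        self.staged[tx_id] = data
        return tx_id

    def finalize(self, tx_id):
        # Idempotent: finalizing an already committed transaction is a no-op.
        if tx_id in self.staged:
            self.committed[tx_id] = self.staged.pop(tx_id)


def resume(record, orders, loyalty):
    """Drive one frontend record forward from wherever it stopped."""
    if record.get("order_tx") is None or record.get("loyalty_tx") is None:
        # Steps 2-5: stage again with both backends. Anything staged by an
        # earlier, interrupted attempt is abandoned and never committed.
        record["order_tx"] = orders.stage(record["data"])
        record["loyalty_tx"] = loyalty.stage(record["data"])
    # Steps 6-7: finalize on both sides.
    orders.finalize(record["order_tx"])
    loyalty.finalize(record["loyalty_tx"])
    return True  # step 8: the caller may now delete the frontend record
```

On restart, the frontend would call `resume` for every frontend record it still holds, then delete each record once `resume` returns.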
I agree with what @Udi Dahan said. Just want to add to his answer.
I think you need to persist the request to the loyalty program so that if it fails it can be done at some other point. There are various ways to word/do this.
1) Make the loyalty program API failure recoverable. That is to say it can persist requests so that they do not get lost and can be recovered (re-executed) at some later point.
2) Execute the loyalty program requests asynchronously. That is to say, persist the request somewhere first then allow the service to read it from this persisted store. Only remove from the persisted store when successfully executed.
3) Do what Udi said, and place it on a good queue (pub/sub pattern to be exact). This usually requires that the subscriber do one of two things... either persist the request before removing from the queue (goto 1) --OR-- first borrow the request from the queue, then after successfully processing the request, have the request removed from the queue (this is my preference).
All three accomplish the same thing. They move the request to a persisted place where it can be worked on till successful completion. The request is never lost, and retried if necessary till a satisfactory state is reached.
I like to use the example of a relay race. Each service or piece of code must take hold and ownership of the request before allowing the previous piece of code to let go of it. Once it's handed off, the current owner must not lose the request till it gets processed or handed off to some other piece of code.
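The "borrow, process, then remove" hand-off can be sketched with an in-memory queue. `BorrowQueue` and `process_next` are hypothetical names used only for illustration; real brokers typically offer the same idea through acknowledgements or visibility timeouts.

```python
from collections import deque


class BorrowQueue:
    """Requests stay owned by the queue until the consumer acknowledges them."""

    def __init__(self, items=()):
        self._ready = deque(items)
        self._borrowed = {}
        self._next_receipt = 0

    def borrow(self):
        if not self._ready:
            return None
        self._next_receipt += 1
        self._borrowed[self._next_receipt] = self._ready.popleft()
        return self._next_receipt, self._borrowed[self._next_receipt]

    def ack(self, receipt):
        del self._borrowed[receipt]  # processed: now it may be forgotten

    def release(self, receipt):
        self._ready.append(self._borrowed.pop(receipt))  # failed: try again later

    def __len__(self):
        return len(self._ready) + len(self._borrowed)


def process_next(queue, process):
    """Borrow one request; remove it only after it was processed successfully."""
    borrowed = queue.borrow()
    if borrowed is None:
        return None
    receipt, request = borrowed
    try:
        process(request)
    except Exception:
        queue.release(receipt)
        return False
    queue.ack(receipt)
    return True
```

At no point is the request held only by code that could lose it: it is either in the queue or has been fully processed.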
Even with distributed transactions you can end up with a "transaction in doubt" status if one of the participants crashes in the midst of the transaction. If you design the services as idempotent operations, life becomes a bit easier. One can write programs to fulfill business conditions without XA. Pat Helland has written an excellent paper on this called "Life Beyond XA". Basically, the approach is to make as few assumptions about remote entities as possible. He also illustrated an approach called Open Nested Transactions (http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper142.pdf) to model business processes. In this specific case, the purchase transaction would be the top-level flow, and loyalty and order management would be the next-level flows. The trick is to create granular services as idempotent services with compensation logic, so that if anything fails anywhere in the flow, individual services can compensate for it. For example, if the order fails for some reason, loyalty can deduct the points accrued for that purchase.
Another approach is to model the system for eventual consistency using CALM or CRDTs. I've written a blog post on using CALM in real life - http://shripad-agashe.github.io/2015/08/Art-Of-Disorderly-Programming Maybe it will help you.
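A bare-bones sketch of the compensation idea in Python. `run_with_compensation` is a hypothetical helper, and each step is assumed to come with an undo action that is safe to call once the step has completed.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, the compensations of the steps that already
    completed are run in reverse order and False is returned.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For the purchase flow, the steps would be something like "accrue points" (compensated by "deduct points") followed by "place order" (compensated by "cancel order").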

Message types : how much information should messages contain?

We are currently starting to broadcast events from one central application to other possibly interested consumer applications, and we have different opinions among members of our team about how much we should put in our published messages.
The general idea/architecture is the following :
In the producer application :
the user interacts with some entities (Aggregate Roots in the DDD sense) that can be created/modified/deleted
Based on what is happening, Domain Events are raised (ex : EntityXCreated, EntityYDeleted, EntityZTransferred etc ... i.e. not only CRUD, but mostly )
Raised events are translated/converted into messages that we send to a RabbitMQ Exchange
in RabbitMQ (we are using RabbitMQ but I believe the question is actually technology-independent):
we define a queue for each consuming application
bindings connect the exchange to the consumer queues (possibly with message filtering)
In the consuming application(s)
application consumes and process messages from its queue
Based on Enterprise Integration Patterns we are trying to define the Canonical format for our published messages, and are hesitating between 2 approaches :
Minimalist messages / event-store-ish : for each event published by the Domain Model, generate a message that contains only the parts of the Aggregate Root that are relevant (for instance, when an update is done, only publish information about the updated section of the aggregate root, more or less matching the process the end-user goes through when using our application)
small message size
very specialized message types
close to the "Domain Events"
problematic if delivery order is not guaranteed (i.e. what if Update message is received before Create message ? )
consumers need to know which message types to subscribe to (possibly a big list / domain knowledge is needed)
what if consumer state and producer state get out of sync ?
how to handle new consumer that registers in the future, but does not have knowledge of all the past events
Fully-contained idempotent-ish messages : for each event published by the Domain Model, generate a message that contains a full snapshot of the Aggregate Root at that point in time, hence handling in reality only 2 kind of messages "Create or Update" and "Delete" (+metadata with more specific info if necessary)
idempotent (declarative messages stating "this is what the truth is like, synchronize yourself however you can")
lower number of message formats to maintain/handle
allow to progressively correct synchronization errors of consumers
consumer automagically handle new Domain Events as long as the resulting message follows canonical data model
bigger message payload
less pure
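To illustrate why the snapshot style tolerates reordering and redelivery, here is a small consumer sketch in Python. The message shape (`type`, `id`, `version`, `snapshot`) is an assumption; in particular, the `version` number would have to be part of the metadata the producer attaches.

```python
class Replica:
    """Consumer-side copy that is kept in sync from full-snapshot messages."""

    def __init__(self):
        self.rows = {}  # id -> (version, snapshot or None when deleted)

    def on_message(self, message):
        current = self.rows.get(message["id"])
        if current is not None and current[0] >= message["version"]:
            return False  # stale or duplicate message, safe to ignore
        if message["type"] == "Delete":
            snapshot = None
        else:  # "CreateOrUpdate"
            snapshot = message["snapshot"]
        self.rows[message["id"]] = (message["version"], snapshot)
        return True
```

With delta-style messages the same consumer would have to buffer or reject an update that arrives before its create; with snapshots it only needs to keep the newest version it has seen.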
Would you recommend an approach over the other ?
Is there another approach we should consider ?
Is there another approach we should consider ?
You might also consider not leaking information out of the service acting as the technical authority for that part of the business
Which roughly means that your events carry identifiers, so that interested parties can know that an entity of interest has changed, and can query the authority for updates to the state.
for each event published by the Domain Model, generate a message that contains a full snapshot of the Aggregate Root at that point in time
This also has the additional Con that any change to the representation of the aggregate also implies a change to the message schema, which is part of the API. So internal changes to aggregates start rippling out across your service boundaries. If the aggregates you are implementing represent a competitive advantage to your business, you are likely to want to be able to adapt quickly; the ripples add friction that will slow your ability to change.
what if consumer state and producer state get out of sync ?
As best I can tell, this problem indicates a design error. If a consumer needs state, which is to say a view built from the history of an aggregate, then it should be fetching that view from the producer, rather than trying to assemble it from a collection of observed messages.
That is to say, if you need state, you need history (complete, ordered). All a single event really tells you is that the history has changed, and you can evict your previously cached history.
Again, responsiveness to change: if you change the implementation of the producer, and consumers are also trying to cobble together their own copy of the history, then your changes are rippling across the service boundaries.
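The identifier-only approach from this answer can be sketched as follows. `Authority` and `CachingConsumer` are hypothetical names; the direct method calls stand in for the message broker and for whatever query API the producer exposes.

```python
class Authority:
    """The service that owns the entity; events carry only the identifier."""

    def __init__(self):
        self._state = {}
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def update(self, entity_id, state):
        self._state[entity_id] = state
        for notify in self._subscribers:
            notify({"type": "EntityChanged", "entity_id": entity_id})

    def get(self, entity_id):
        return self._state[entity_id]


class CachingConsumer:
    """Treats each event as a signal to evict, then re-fetches on demand."""

    def __init__(self, authority):
        self._authority = authority
        self._cache = {}
        authority.subscribe(self.on_changed)

    def on_changed(self, event):
        self._cache.pop(event["entity_id"], None)

    def view(self, entity_id):
        if entity_id not in self._cache:
            self._cache[entity_id] = self._authority.get(entity_id)
        return self._cache[entity_id]
```

The consumer never interprets the contents of a change, so the producer can alter its internal representation without touching the event schema.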

Raise an event or send a command?

We've created a web application that is an e-book reader. So one thing to keep in mind is that the domain is not exactly that of reading a physical book. We are now trying to gather users' reading behavior by storing information about e-book pages accessed by our users. Since this information goes to a data warehouse, we thought raising an event from the bookcontroller is the right way to do it.
But we are not sure if it should be a publish or a send since there is really only one consumer to this event and that is our business intelligence team. We've also read that it is not advisable to publish from the web app (http://www.make-awesome.com/2010/10/why-not-publish-nservicebus-messages-from-a-web-application/). So now the alternative is to use bus.Send(RecordPageAccessedCommand)
But the above command does not change our application state in any way. So is it truly a command? I have a feeling that the mistake we are making is using NServiceBus's features (Publish, Send) and trying to equate them with what a command or event is.
Please let me know what the solution to this is.
Based on the information you provided, I would recommend "sending" to your endpoint.
Sending a command implies that the endpoint handling the message should do something. In your case, recording that the page was accessed is the thing the endpoint should do.
Publishing an event implies that you are notifying 0..n subscribers that something occurred. You could publish an event from your command handler if some other service in your system was interested in the fact that a page was accessed. The key point here is that it's not a "fact" until you've recorded it.
I've found that consumers tend to grow once data is available. Having the ability to publish an event from your command handler will make it trivial to notify new consumers without changing/risking your existing code base.
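In outline, a handler that records first and publishes second might look like the sketch below. It is written in plain Python rather than against the NServiceBus API; `PageAccessHandler`, the `publish` callable, and the field names are all made up for illustration.

```python
class PageAccessHandler:
    """Handles the command; publishes an event only after the fact is recorded."""

    def __init__(self, publish):
        self.recorded = []       # stand-in for the handler's own storage
        self._publish = publish  # stand-in for the bus's publish operation

    def handle(self, command):
        # 1. Do what the command asks: record the page access.
        self.recorded.append(dict(command))
        # 2. Now it is a fact, so other services may be told about it.
        self._publish({"event": "PageAccessed", **command})
```

The web application only ever sends the command; adding a new subscriber later does not require any change to the controller.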
The RecordPageAccessedCommand is a command as it is commanding the system to do something, in this case, record that a page has been accessed.
If I've understood your scenario correctly, a message should be sent from your controller to the "Business intelligence Team Service" telling the system to record that a page has been accessed. This service would store this information and would be the owner/technical authority of this information.
No other services should store or require this information in its pure form. They can, however, subscribe to events from this service. In a highly contrived scenario, for example, when a user reads 1000 pages, the "Business intelligence Team Service" can publish an event saying that 1000 pages have been read (i.e. Bus.Publish()), which may be handled by a billing service that gives the user a discount on their next purchase.
The data warehouse can have access to this information stored in your "Business intelligence Team Service" as it would fall under IT/OPS.

Is Message Queuing the right strategy for a high-bandwidth data feed?

I have a huge network of data-collection servers which generate a large volume of real-time data.
In the past I've provided partners with the ability to get this data in near-real-time using HTTP GETs. But for many reasons I'm eager to ditch this.
So yeah... I'm eager to build out a new distribution system and I was thinking that a Message Queuing System was the way to go.
I need to be able to distribute data from my sources to a number of different partners. Some partners receive all of it, others just get a portion. And, if a partner gets disconnected, they need to be able to reconnect and not miss any data. (Although, for the sake of disk and memory, I'd like their queued messages to expire after an hour or so.)
Lastly, I need the system to be able to handle tens of thousands of enqueues per minute.
Do you think Message Queuing is an appropriate scheme?
I was looking at using RabbitMQ. Is it difficult to maintain?
Thanks Very Much!
I cannot tell you if it is the right strategy in your specific case, but message products are indeed used in high message rate systems every day.
Much of the investment world uses various products, both commercial (Tibco) and Open source (ZeroMQ) to name just two, to handle market data from exchanges and other sources. These are likely at least as active as your data sensors are.
The publish/subscribe model, where some receivers want some messages and some receivers want all, along with late-join or other so-called guaranteed messaging are indeed standard features on most of these products.
So do go ahead and investigate products. I have not used RabbitMQ myself, so I cannot comment on it specifically. However, with a minimal abstraction layer you should be able to insulate yourself from most platform-specific calls, which allows you to swap message-bus implementations if the need arises. (You may even want to build such a shim as part of a proof of concept to test out more than one product for your specific purpose. You get experience with multiple products, flesh out the facade layer, and get up to speed on the products.)
Good Luck
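For what it's worth, the abstraction layer mentioned above could start out as small as this Python sketch. `MessageBus`, `InMemoryBus`, and `forward_readings` are hypothetical names; a RabbitMQ-backed or ZeroMQ-backed class would implement the same two methods.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class MessageBus(ABC):
    """The only messaging surface the application code is allowed to see."""

    @abstractmethod
    def publish(self, topic, message):
        ...

    @abstractmethod
    def subscribe(self, topic, handler):
        ...


class InMemoryBus(MessageBus):
    """Trivial implementation, handy for tests and for the proof of concept."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def publish(self, topic, message):
        for handler in self._handlers[topic]:
            handler(message)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)


def forward_readings(bus, readings):
    """Application code depends on MessageBus, not on a specific product."""
    for reading in readings:
        bus.publish("readings", reading)
```

Swapping products then means writing one new `MessageBus` subclass rather than touching every call site.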