Resource locking and business logic

Consider the following situation:
There is an update request on entity A to create a sub-entity A.B. There may be many B's on A, and each B has a unique email address.
The entity A is a shared entity, and the same request can arrive on multiple servers in parallel (scalable microservice).
In order to create A.B, we have to verify that B does not already exist as a sub-entity on A (according to B's email address).
The service which handles this update request should lock A (by its unique id) in order to make the update safe.
My questions are more conceptual than technical:
Is locking the resource A in this case part of the business logic of this update task?
Would you consider putting the resource lock in a middleware separate from the one which handles the verify-and-update procedure?
(The other option is to treat the lock as part of the business logic and put it directly in the middleware responsible for the business logic.)

The technical implementation of the chosen solution to contention problems is obviously not business logic, but choosing the right solution requires business knowledge.
What I mean by this is that you must understand how the business works in order to determine the right approach to protect the integrity of the data in concurrency scenarios. How often will concurrency conflicts occur? Can conflicts be resolved automatically? What should count as a conflict? Beyond that, the business may very well accept eventual consistency over strong consistency.
In short, the mechanisms put in place to protect data integrity in concurrency scenarios shouldn't be part of the domain. They would probably go either in the application service layer or in the infrastructure layer, but the business experts must be involved in the discussions about how concurrency conflicts should be resolved and how they affect the business.
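To make the layering concrete, here is a minimal sketch of the scenario from the question, assuming a single process. The names EntityA, AddBService and the dict-like repository are hypothetical, and the per-id threading.Lock is only a stand-in for whatever distributed lock or database mechanism a real multi-server deployment would use. The point is the placement: the domain object enforces the uniqueness rule, while the application service owns the locking.

```python
import threading
from collections import defaultdict


class EntityA:
    """Domain object: knows the uniqueness rule, knows nothing about locks."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.emails = set()

    def add_b(self, email):
        if email in self.emails:
            raise ValueError(f"B with email {email!r} already exists on A")
        self.emails.add(email)


class AddBService:
    """Application service: owns the concurrency mechanism."""

    def __init__(self, repository):
        self._repository = repository  # hypothetical dict-like store of EntityA
        # In-process stand-in for a distributed lock keyed by A's id.
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, entity_id):
        # Guard the lock table itself so two threads never create two locks.
        with self._locks_guard:
            return self._locks[entity_id]

    def add_b(self, entity_id, email):
        with self._lock_for(entity_id):
            entity = self._repository[entity_id]   # load
            entity.add_b(email)                    # verify and update (domain)
            self._repository[entity_id] = entity   # save
```

Swapping the lock for another mechanism later (optimistic versioning, a database constraint) would then touch only the service, not EntityA.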

Locking is not a business-related issue (unless your business is building distributed databases), and so should never be considered part of the business logic.
Further, you should not be implementing distributed locking yourself, but should be relying on a packaged solution, preferably one that is part of your data persistence solution.
Here's an article on how to do this with Redis discussing an algorithm called Redlock. Here's a blog post linking to articles about building consensus in Cassandra. And, here's a link about concurrency in Mongo. As you'll see from these articles, distributed locking is a big and complex issue that you probably don't want to tackle yourself.
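For the specific case in the question (no two B's with the same email on one A), one way to lean on the persistence layer instead of a hand-written lock is a uniqueness constraint. This is a different technique from the distributed locks discussed in those articles, and the sketch below is only an illustration: it uses SQLite from the Python standard library, and the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE b ("
    " a_id TEXT NOT NULL,"
    " email TEXT NOT NULL,"
    " UNIQUE (a_id, email))"  # the database, not our code, arbitrates races
)


def add_b(conn, a_id, email):
    """Return True if B was created, False if it already existed on A."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO b (a_id, email) VALUES (?, ?)", (a_id, email)
            )
        return True
    except sqlite3.IntegrityError:
        # Another request (or an earlier one) already created this B.
        return False
```

Whether a constraint is enough, or a real lock is needed, depends on what else the update has to check while A is being modified.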

Related

Are 2 different, non-shared databases for the same microservice a good approach?

Context:
Microservices architecture, DDD, CQRS, Event driven.
SQL database.
I have a use case where I have to store a record whenever an entity's state is updated. I'm afraid that the quantity of records could be huge, and I was thinking that maybe an SQL table is not the right place to store them. Also, these records are used only every now and then, and probably not by the service domain.
Would it be good practice to store them in another database (Firestore, Mongo, Cassandra...) so that they don't affect the performance and the scope of this service?
Thanks!
Part of the benefit of using microservices is that you are hiding implementation details. As such, you are able to store/process data by whatever means is required or available without the need to broadcast that implementation to external services.
That said, from a technical standpoint, it is worth considering transaction boundaries. When writing to a single database, it is possible to commit transactions easily. Once you are writing to different databases within the same transaction, you can run into situations where one write might succeed while another one might fail.
My recommendation is to make sure you write to only one of those databases at a time. Use a two-phase commit to ensure that the second database is written to. In this way, you can avoid lost data and get the benefit of using a more efficient data store.
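One way to write to only one database at a time is a transactional outbox. To be clear, this is a different technique from the two-phase commit mentioned above, and it is shown here only because it fits in a few lines. In this sketch the service's SQL database is an in-memory SQLite database, the second store is a plain Python list, and all table and function names are hypothetical.

```python
import json
import sqlite3

primary = sqlite3.connect(":memory:")  # stand-in for the service's SQL database
primary.executescript(
    """
    CREATE TABLE entity (id TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT NOT NULL);
    """
)
history_store = []  # stand-in for the second database (Mongo, Cassandra, ...)


def update_state(entity_id, new_state):
    """Change the entity and queue the history record in ONE local transaction."""
    with primary:
        primary.execute(
            "INSERT OR REPLACE INTO entity (id, state) VALUES (?, ?)",
            (entity_id, new_state),
        )
        primary.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"id": entity_id, "state": new_state}),),
        )


def relay_outbox():
    """Copy queued records to the second store, then remove them from the queue."""
    rows = primary.execute(
        "SELECT seq, payload FROM outbox ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        history_store.append(json.loads(payload))  # write to the second store
        with primary:
            primary.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes between the append and the delete, the record is delivered again on the next run, so the second store has to tolerate duplicates.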

Is it okay to have more than one repository for an aggregate in DDD?

I've read this question about something similar but it didn't quite solve my problem.
I have an application where I'm required to use data from an API. Problem is there are performance and technical limitations to doing this. The performance limitations are obvious. The technical limitations lie in the fact that the API does not support some of the more granular queries I need to make.
I decided to use MySQL as a queryable cache.
Since the data I needed to retrieve from the API did not change very often, I settled on refreshing the cache once a day, so I didn't need any complicated mapper that checked if we had the data in the cache and if not fell back to the API. That was my first design, but I realized that wasn't very practical when the API couldn't support most of the queries I needed to make anyway.
Now I have a set of two mappers for every aggregate. One for MySQL and one for the API.
My problem now is how to hide the complexities of persistence from the domain, given that I seem to need multiple repositories.
Ideally I would have an interface that both mappers adhered to, but as previously described that's not possible.
Is it okay to have multiple repositories, one for each mapper?
Is it okay to have more than one repository for an aggregate in DDD?
Short answer: yes.
Longer answer: you won't find any suggestion of multiple repositories in the original book by Evans. As he described things, the domain model would have one representation of the aggregate, and the repository abstraction provided consumers with the illusion that the aggregate was stored in an in-memory collection.
Largely, this makes sense -- you are trying to ensure that writes to data within the aggregate boundary are consistent, so you need a single authority for change.
But... there's no particular reason that reads need to travel through the same code path as writes. Welcome to the world of CQRS. What that gives you immediately is the idea that the in-memory representation for reads might need to be optimized differently from the in-memory representation used for writes.
In its more general form, you get the idea that the concept that you are modeling might have different representations for each use case.
For your case, where it is sometimes appropriate to read from the RDBMS, sometimes from the API, sometimes both, this isn't quite an exact match -- the repository interface hides the implementation details from the consumer, but you still have to bother with the implementation.
One thing you might look at is your requirements; how fresh does the data need to be in each use case? A constraint that is often relaxed in the CQRS pattern is the idea that the effects of writes are immediately available for reading. The important question to ask would be, if the data hasn't been cached yet, can you simply report "data not available" without hitting the API?
If so, then use cases that access the cached data need only a single repository implementation.
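As a rough illustration of that last point, here is a sketch of a read-side repository that only ever looks at the cache and reports absence instead of falling back to the API. Product, ProductRepository and the dict standing in for the MySQL cache table are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Product:  # hypothetical read model
    product_id: str
    name: str


class ProductRepository(Protocol):
    """What the consumer sees; it says nothing about MySQL or the API."""

    def find(self, product_id: str) -> Optional[Product]: ...


class CachedProductRepository:
    """Reads only from the local cache; never falls back to the API."""

    def __init__(self, rows: dict):
        self._rows = rows  # stand-in for the MySQL cache table

    def find(self, product_id: str) -> Optional[Product]:
        row = self._rows.get(product_id)
        return Product(product_id, row["name"]) if row else None


def describe(repo: ProductRepository, product_id: str) -> str:
    product = repo.find(product_id)
    # A cache miss is reported as-is rather than triggering an API call.
    return product.name if product else "data not available"
```

The daily refresh job would be the only code that talks to the API, which keeps the two mappers from leaking into the read use cases.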
If you are using an external API to read and modify data, you can cache the data locally to make reads faster, but I would avoid having a domain repository.
From the domain perspective, it seems that you need to query for some data. You can do that with a service (or just a Query in a CQRS implementation) that internally calls some remote API or reads from a local cache (MySQL, whatever).
When you read your local cache, you can develop a repository to decouple your logic from the db implementation, but this is a different concept from a domain repository. It is just a detail of your technical implementation that has nothing to do with your domain.
If the remote service starts offering the query you need, you will change the implementation of how your query is executed, calling the remote API instead of the db, but your domain model should not change.
A domain repository is used to load and persist your aggregates, whereas if you are working with external aggregates (in a different context or subdomain) you need to interact with them using services.

DDD: Repository to read and Unit Of Work to write?

After going through multiple Stack Overflow posts and blog articles, I came to the conclusion that we need the Unit of Work design pattern to maintain transactional integrity while writing the domain objects to their respective repositories.
However, we do not need such integrity while reading/searching the repository. Given that, is it a good design to separate the purposes of repositories and units of work, with the former used only for reading domain objects and the latter used only to create/write/refresh/delete domain objects?
Eric Evans, Domain Driven Design:
Implementation (of a repository) will vary greatly, depending on the technology being used for persistence and the infrastructure you have. The ideal is to hide all the inner workings from the client (although not from the developer of the client), so that the client code will be the same whether the data is stored in an object database, a relational database, or simply held in memory....
The possibilities of implementation are so diverse that I can only list some concerns to keep in mind....
Leave transaction control to the client. Although the REPOSITORY will insert and delete from the database, it will ordinarily not commit anything. It is tempting to commit after saving, for example, but the client presumably has the context to correctly initiate and commit units of work. Transaction management will be simpler if the REPOSITORY keeps its hands off.
That said, I call your attention in particular to an important phrase in the above discussion: the database. The underlying assumption here is that all of the aggregates being modified are stored in such a way that the unit of work can be committed atomically.
When that's not the case -- for example, if you are storing aggregates in a document store that doesn't promise atomic updates of multiple documents -- you may want to consider making this separation explicit in your model, rather than trying to disguise the fact that you are trying to coordinate multiple commits.
It is entirely reasonable to use one set of repositories for your read use cases, which are distinct from those used in your write use cases. In other words, when we have different semantics, then we should have a different interface, the implementations of which can be tuned as necessary.
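Evans's advice to leave transaction control to the client can be sketched as follows, assuming a single relational database (SQLite here, so the unit of work can be committed atomically). AccountRepository and transfer are hypothetical names, and the connection's own transaction plays the role of the unit of work.

```python
import sqlite3


class AccountRepository:
    """Inserts and updates rows but never commits; that is the client's job."""

    def __init__(self, conn):
        self._conn = conn

    def save(self, account_id, balance):
        self._conn.execute(
            "INSERT OR REPLACE INTO account (id, balance) VALUES (?, ?)",
            (account_id, balance),
        )

    def balance_of(self, account_id):
        row = self._conn.execute(
            "SELECT balance FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        return row[0] if row else None


def transfer(conn, repo, source, target, amount):
    """Client code: one unit of work spanning two saves."""
    with conn:  # commit if the block succeeds, roll back if it raises
        repo.save(source, repo.balance_of(source) - amount)
        if repo.balance_of(target) is None:
            raise LookupError(f"unknown account {target!r}")
        repo.save(target, repo.balance_of(target) + amount)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
repo = AccountRepository(conn)
with conn:  # initial data, committed by the client as well
    repo.save("a", 100)
    repo.save("b", 0)
```

Because the repository keeps its hands off commit, a failed transfer leaves no half-applied debit behind; the read method needs no transaction at all, which is where a separate read-side interface becomes attractive.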

Domain Driven Design - Creating general purpose entities vs. Context specific Entities

Situation
Suppose you have Orders and Clients as entities in your application. In one aggregate, the Order entity is considered to be the root, but you also want to make use of the Client entity for simple things. In another, the Client is the root entity and the Order entity is touched ever so lightly.
An example:
Let's say that in the Order aggregate I use the Client only to read details like name and address and to build the order history, not to make the client do client-specific business logic (like persistence, password resets and back flips).
On the other hand, in the Client aggregate I use the Order entity to report on the client's buying habits, order totals and order counts, without requiring advanced order functionality like order processing, updating, status changes, etc.
Possible solution
I believe the better solution is to create the entities for each aggregate specific to the aggregate context, because making them full featured (general purpose) and ready for any situation and usage seems like overkill and could potentially become a maintenance nightmare. (and potentially memory intensive)
Question
What is the DDD recommended way of handling this situation?
What is your take on the matter?
The basic driver for these decisions should be the ubiquitous language, and consequently the real-world domain you're modeling. If both approaches work in a specific domain, I'd favor separation over god-classes for maintainability reasons.
Apart from separating behavior into different aggregates, you should also take care that you don't mix different bounded contexts. Depending on the requirements of your domain, it could make sense to separate the Purchase Context from the Reporting Context (to extend on your example).
To decide on a context design, context maps are a helpful tool.
You are on the right track. In DDD, entities are not merely containers encapsulating all attributes related to a "subject" (for example: a customer, or an order). This is a very important concept that eludes a lot of people. An entity in DDD represents an operation boundary, thus only the data necessary to perform the operation is considered to be a part of the entity. Exactly which data to include in an entity can be difficult to decide, because some data is relevant in several different use cases. Here are some tips when analyzing data:
Analyze invariants: things that must be considered when applying validation rules, and that cannot be out of sync, should be in the same aggregate.
Drop the database thinking; normalization is not a concern of DDD.
Just because things look the same, it doesn't mean that they are. For example: the current shipping address registered on a customer is different from the shipping address which a specific order was shipped to.
Don't look at reads. Reading, like creating a report or populating a viewmodel/DTO/whatever, has nothing to do with operation boundaries and can typically be a 360-degree view of the data. In fact, don't even use your domain model when returning reads; use a different architectural stack.
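The shipping-address tip is easy to show in code. The sketch below is only an illustration with made-up class names: the customer side owns the current address, while the order keeps its own copy taken at order time, so the two look alike but change independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    street: str
    city: str


@dataclass
class Customer:
    """Customer side: owns the *current* shipping address."""

    customer_id: str
    shipping_address: Address

    def move_to(self, new_address: Address) -> None:
        self.shipping_address = new_address


@dataclass
class Order:
    """Order side: keeps only the customer data it needs, as a snapshot."""

    order_id: str
    customer_id: str
    shipped_to: Address
    status: str = "new"

    def ship(self) -> None:
        if self.status != "new":
            raise ValueError("only new orders can be shipped")
        self.status = "shipped"


def place_order(order_id: str, customer: Customer) -> Order:
    # Copy the address at order time; a later move must not rewrite history.
    return Order(order_id, customer.customer_id, customer.shipping_address)
```

Neither class carries the other's behavior: Order cannot reset passwords, and Customer cannot change an order's status.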

How to create a multi-tenant database with shared table structures?

Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with an additional identifying column on each record)
Multi-Schema is my favourite (considering costs). However, creating a new account and doing migrations seem to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10,000+ tenants).
However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
A common misconception holds that only physical isolation can provide an appropriate level of security. In fact, data stored using a shared approach can also provide strong data safety, but requires the use of more sophisticated design patterns.
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the tenants you expect to serve all affect your data architecture decision in different ways. Some of the following questions may bias you toward a more isolated approach, while others may bias you toward a more shared approach.
How many prospective tenants do you expect to target? You may be nowhere near being able to estimate prospective use with authority, but think in terms of orders of magnitude: are you building an application for hundreds of tenants? Thousands? Tens of thousands? More? The larger you expect your tenant base to be, the more likely you will want to consider a more shared approach.
How much storage space do you expect the average tenant's data to occupy? If you expect some or all tenants to store very large amounts of data, the separate-database approach is probably best. (Indeed, data storage requirements may force you to adopt a separate-database model anyway. If so, it will be much easier to design the application that way from the beginning than to move to a separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support? The larger the number, the more appropriate a more isolated approach will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such as per-tenant backup and restore capability? Such services are easier to offer through a more isolated approach.
UPDATE: Further to the update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach for most, if not all, scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing just about 50 MB per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
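To give a feel for the second item, here is an application-level sketch in the spirit of a tenant filter. It is not the article's implementation (as far as I recall, that one relies on database views); it only shows the idea that tenant code never gets to write its own WHERE tenant_id clause. The invoice table and the TenantInvoices class are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice ("
    " tenant_id TEXT NOT NULL, number TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [("acme", "A-1", 100), ("acme", "A-2", 250), ("globex", "G-1", 999)],
)
conn.commit()


class TenantInvoices:
    """Every query issued through this object is scoped to one tenant."""

    def __init__(self, conn, tenant_id):
        self._conn = conn
        self._tenant_id = tenant_id  # fixed at construction, e.g. from the session

    def all(self):
        rows = self._conn.execute(
            "SELECT number, amount FROM invoice"
            " WHERE tenant_id = ? ORDER BY number",
            (self._tenant_id,),
        )
        return rows.fetchall()

    def add(self, number, amount):
        with self._conn:  # commit on success
            self._conn.execute(
                "INSERT INTO invoice VALUES (?, ?, ?)",
                (self._tenant_id, number, amount),
            )
```

Request handlers would receive only a TenantInvoices instance, never the raw connection, so a forgotten filter cannot leak another tenant's rows.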
When you are confident in your application's data safety measures, you will be able to offer your clients a Service Level Agreement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures that you would be taking to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft guys moved / made a new article regarding this subject, the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have one huge table with 500 string columns (Value0, Value1, ... Value500). Dates and numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are metadata tables that define the shape of the data model, which can be unique per tenant. There are additional tables for indexing, relationships, unique values etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (alter table etc). This is definitely the hard way to do something like this but is very flexible.
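A toy version of the idea, to show how metadata gives generic string slots their meaning. This is only a sketch under my own assumptions, not Salesforce's implementation: the METADATA layout, the four-slot row and the conversion in application code (rather than at the database level) are all invented for illustration, and numbers are limited to integers.

```python
from datetime import date

# Hypothetical per-tenant metadata: field name -> (slot index, type name).
METADATA = {
    "acme": {"name": (0, "text"), "signed_on": (1, "date"), "seats": (2, "number")},
    "globex": {"region": (0, "text"), "revenue": (1, "number")},
}

# Everything is stored as a string and converted back using the metadata.
ENCODERS = {"text": str, "date": lambda d: d.isoformat(), "number": str}
DECODERS = {"text": str, "date": date.fromisoformat, "number": int}

SLOTS = 4  # kept tiny so the rows stay readable


def to_row(tenant, record):
    """Flatten a tenant-specific record into [tenant, Value0, Value1, ...]."""
    row = [None] * SLOTS
    for field_name, value in record.items():
        slot, type_name = METADATA[tenant][field_name]
        row[slot] = ENCODERS[type_name](value)
    return [tenant] + row


def from_row(row):
    """Rebuild the typed record from a generic row using the tenant's metadata."""
    tenant, values = row[0], row[1:]
    return {
        field_name: DECODERS[type_name](values[slot])
        for field_name, (slot, type_name) in METADATA[tenant].items()
        if values[slot] is not None
    }
```

Adding a field for one tenant is then a metadata change only; the shared table itself is never altered.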
My experience (albeit with SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no MySQL or Ruby on Rails experience, I'm hoping my input might add some value.
The reasons include:
data security/disaster recovery. Each company's data is stored entirely separately from the others, which reduces the risk of data being compromised (think of a code bug that mistakenly looks at another client's data when it shouldn't) and limits the potential loss to one client if one particular database gets corrupted. The perceived security benefits to the client are even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on to different disks, and you could bring multiple database servers online and move databases around more easily to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise more easily for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's now clear that we're talking 10,000+ tenants. My experience is in hundreds of large scale databases - I don't think 10,000 separate databases is going to be too manageable for your scenario, so I'm now not favouring the multi-db approach for your scenario. Especially as it's now clear you're talking small data volumes for each tenant!
Keeping my answer here anyway, as it may have some use for other people in a similar boat (with fewer tenants).
As you mention, one database per tenant is an option, and it comes with some larger trade-offs. It can work well at smaller scale, such as single digits or low tens of tenants, but beyond that it becomes harder to manage, both in running the migrations and in just keeping the databases up and running.
The per-schema model isn't only useful when each tenant has a unique schema, though running migrations across all tenants still becomes difficult, and at thousands of schemas Postgres can start to have trouble.
A more scalable approach is to have tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language, there are a number of libraries that can help with this. If you're using Rails, there is a library to enforce tenancy, acts_as_tenant, which helps ensure your tenant queries only pull back that tenant's data. There's also a gem, apartment; though it uses the schema model, it does help with the migrations across all schemas. If you're using Django there are a number, but one of the more popular ones seems to be across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.
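The "randomly distributed across logical shards" part can be sketched with a stable hash. This is a generic illustration, not how any of the libraries above work internally; SHARD_COUNT and the events_NN table naming are assumptions made up for the example.

```python
import hashlib

SHARD_COUNT = 32  # hypothetical number of logical shards


def shard_for(tenant_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a tenant to a logical shard, identically on every server and restart."""
    # A cryptographic hash is used for its even spread, not for security.
    # Python's built-in hash() is salted per process, so it is unsuitable here.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def table_for(tenant_id: str) -> str:
    """Name of the logical shard (here: a table) holding this tenant's rows."""
    return f"events_{shard_for(tenant_id):02d}"
```

With a fixed shard count, moving a shard to another server later only changes the shard-to-server mapping, not the tenant-to-shard one.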