How to query multiple aggregates efficiently with DDD?

When I need to invoke some business method, I need to get all the aggregate roots related to the operation, even if the operation is as primitive as the one given below (just adding an item to a collection). What am I missing? Or is a CRUD-based approach, where you run one single query with the table joins, selects and the insert at the end, and the database engine does all the work for you, actually better in terms of performance?
In the code below I need to query a separate aggregate root (which creates another database connection and sends another select query). In real-world applications I have been querying a lot more than one single aggregate, up to 8 for a single business action. How can I improve performance / reduce the query overhead?
Domain aggregate roots:
class Device
{
    Set<ParameterId> parameters;

    void AddParameter(Parameter parameter)
    {
        parameters.Add(parameter.Id);
    }
}

class Parameter
{
    ParameterId Id { get; }
}
Application layer:
class DeviceApplication
{
    private DeviceRepository _deviceRepo;
    private ParameterRepository _parameterRepo;

    void AddParameterToDevice(string deviceId, string parameterId)
    {
        var aParameterId = new ParameterId(parameterId);
        var aDeviceId = new DeviceId(deviceId);

        var parameter = _parameterRepo.FindById(aParameterId);
        if (parameter == null) throw new InvalidOperationException("Parameter not found");

        var device = _deviceRepo.FindById(aDeviceId);
        if (device == null) throw new InvalidOperationException("Device not found");

        device.AddParameter(parameter);
        _deviceRepo.Save(device);
    }
}
Possible solution
I've been told that you can pass just an Id of another aggregate like this:
class Device
{
    void AddParameter(ParameterId parameterId)
    {
        parameters.Add(parameterId);
    }
}
But IMO it breaks encapsulation (by explicitly pushing the term ID into the business logic), and it doesn't prevent someone from passing a wrong or otherwise invalid identity (created by the user).
Also, Vaughn Vernon gives examples of application services that use the first approach (passing a whole aggregate instance).

The short answer is - don't query your aggregates at all.
An aggregate is a model that exposes behaviour, not data. Generally, it is considered a code smell to have getters on aggregates (ID is the exception). This makes querying a little tricky.
Broadly speaking, there are two related ways to solve this. There are probably more, but at least these two don't break encapsulation.
Option 1: Use domain events -
By getting your domain (aggregate roots) to emit events which illustrate the changes to internal state you can build up tables in your database specifically designed for querying. Done right you will have highly performant, denormalised queryable data, which can be linearly scaled if necessary. This makes for very simple queries. I have an example of this on this blog post: How to Build a Master-Details View when using CQRS and Event Sourcing
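As a rough illustration of the idea (the event, table, and handler names below are assumptions for the sketch, not taken from the linked post), a projection could look something like this:
using System;
using System.Data;

// Hypothetical domain event emitted by the Device aggregate.
public class ParameterAddedToDevice
{
    public Guid DeviceId { get; set; }
    public Guid ParameterId { get; set; }
    public DateTime OccurredOn { get; set; }
}

// Hypothetical projection: listens for the event and keeps a flat,
// query-friendly table up to date.
public class DeviceParametersProjection
{
    private readonly IDbConnection _connection;

    public DeviceParametersProjection(IDbConnection connection)
    {
        _connection = connection;
    }

    // Called by whatever event dispatcher you use.
    public void Handle(ParameterAddedToDevice e)
    {
        // Denormalised insert into a table built purely for querying.
        using var command = _connection.CreateCommand();
        command.CommandText =
            "INSERT INTO DeviceParameterList (DeviceId, ParameterId, AddedOn) " +
            "VALUES (@deviceId, @parameterId, @addedOn)";
        AddParameter(command, "@deviceId", e.DeviceId);
        AddParameter(command, "@parameterId", e.ParameterId);
        AddParameter(command, "@addedOn", e.OccurredOn);
        command.ExecuteNonQuery();
    }

    private static void AddParameter(IDbCommand command, string name, object value)
    {
        var p = command.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        command.Parameters.Add(p);
    }
}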
Option 2: Infer query tables -
I'm a huge fan of option 1, but if you don't have an event-sourced approach you will still need to persist the state of your aggregates at some point. There are all sorts of ways to do this, but you could plug a process into the persistence pipeline for your aggregates that extracts queryable data into a read model for use with your queries.
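If you go that route, one minimal sketch is a decorator around the aggregate repository that refreshes a read model on every save. All interface and member names below are assumptions, including the idea that the aggregate exposes its id and the parameter ids it references:
using System.Collections.Generic;
using System.Threading.Tasks;

// Minimal stand-in contracts for the sketch; in a real codebase these already exist.
public interface IDeviceRepository
{
    Task<Device> FindByIdAsync(DeviceId id);
    Task SaveAsync(Device device);
}

public interface IDeviceReadModelWriter
{
    Task UpsertAsync(DeviceId id, IReadOnlyCollection<ParameterId> parameterIds);
}

// Hypothetical decorator: every save of the aggregate also refreshes a flat
// read-model row used for queries.
public class DeviceRepositoryWithReadModel : IDeviceRepository
{
    private readonly IDeviceRepository _inner;
    private readonly IDeviceReadModelWriter _readModel;

    public DeviceRepositoryWithReadModel(IDeviceRepository inner, IDeviceReadModelWriter readModel)
    {
        _inner = inner;
        _readModel = readModel;
    }

    public Task<Device> FindByIdAsync(DeviceId id) => _inner.FindByIdAsync(id);

    public async Task SaveAsync(Device device)
    {
        await _inner.SaveAsync(device);
        // Extract whatever flat, queryable data the UI needs from the
        // aggregate's persisted state and upsert it into the read model.
        await _readModel.UpsertAsync(device.Id, device.ParameterIds);
    }
}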
I hope that makes sense.

If you figured out that an RDBMS query with joins would work in this case, you probably have the wrong aggregate boundaries.
For example, why would you need to load the Parameter in order to add it to the Device? You already have the identity of this Parameter; all you need to do is add this id to the list of referenced Parameters in the Device. If you do it in order to satisfy your ORM, you're most probably doing something wrong.
Also remember that your aggregate is the transactional boundary. You really want to complete all database operations inside one transaction and one connection.
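To make that concrete, the application service from the question could look roughly like this when the Device only needs the ParameterId; the Exists check on the parameter repository is an assumed lightweight alternative to loading the whole Parameter aggregate:
class DeviceApplication
{
    private DeviceRepository _deviceRepo;
    private ParameterRepository _parameterRepo;

    void AddParameterToDevice(string deviceId, string parameterId)
    {
        var aParameterId = new ParameterId(parameterId);
        var aDeviceId = new DeviceId(deviceId);

        // Lightweight existence check (assumed method) instead of loading
        // the whole Parameter aggregate just to read its id back.
        if (!_parameterRepo.Exists(aParameterId))
            throw new InvalidOperationException("Parameter not found");

        var device = _deviceRepo.FindById(aDeviceId);
        if (device == null)
            throw new InvalidOperationException("Device not found");

        // Only the Device aggregate is loaded and saved - one transaction,
        // one connection.
        device.AddParameter(aParameterId);
        _deviceRepo.Save(device);
    }
}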

Related

DDD Aggregate boundaries one to some and not one to many between entities in one aggregate

I've watched a tutorial about DDD which says that if I have an aggregate root SnackMachine that has more than 30 child elements, the child elements should be in a separate aggregate. For example, SnackMachine has lots of PurchaseLogs (more than 30), and it is better for PurchaseLog to be in a separate aggregate. Why is that?
The reason for limiting the overall size of an aggregate is because you always load the full aggregate into memory and you always store the full aggregate transactionally. A very large aggregate would cause technical problems.
That said, there is no such "30 child elements" rule in aggregate design, and it sounds arbitrary as a rule. For example, having fewer but very large child elements could be technically worse than having 30 very light child elements. A good way of storing aggregates is as JSON documents, given that you'll always read and write the documents as atomic operations. If you think of it this way, you'll realise that an aggregate design that implies a very large or even ever-growing child collection will eventually cause problems. A PurchaseLog sounds like an ever-growing collection.
The second part of the rule that says "put it in a separate aggregate" is also not correct. You don't create aggregates because you need to store some data and it doesn't fit into an existing aggregate. You create aggregates because you need to implement some business logic and this business logic will need some data, so you put both things together in an aggregate.
So, although what you explain in your question are things to take into consideration when designing aggregates to avoid having technological problems, I'd suggest you put your attention to the actual responsibilities of the aggregate.
In your example, what are the responsibilities of the SnackMachine? Does it really need the (full) list of PurchaseLogs? What operations will the SnackMachine expose? Let's say that it exposes PurchaseProduct(productId) and LoadProduct(productId, quantity). To execute its business logic, this aggregate would need a list of products and keep count of their available quantity, but it wouldn't need to store the purchase log. Instead, at every Purchase, it could publish an event ProductPurchased(SnackMachineId, ProductId, Date, AvailableQuantity). Then external systems could subscribe to this event. One subscriber could register the PurchaseLog for reporting purposes and another subscriber could send someone to reload the machine when the stock was lower than X.
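A rough sketch of what that could look like (the ProductId and SnackMachineId types, the event record, and the pending-events list are assumptions for illustration, not a prescribed design):
using System;
using System.Collections.Generic;

// Assumed event shape, matching the description above.
public record ProductPurchased(SnackMachineId SnackMachineId, ProductId ProductId, DateTime Date, int AvailableQuantity);

class SnackMachine
{
    public SnackMachineId Id { get; }

    private readonly Dictionary<ProductId, int> _availableQuantity = new();
    private readonly List<object> _pendingEvents = new();

    public void LoadProduct(ProductId productId, int quantity)
    {
        _availableQuantity.TryGetValue(productId, out var current);
        _availableQuantity[productId] = current + quantity;
    }

    public void PurchaseProduct(ProductId productId)
    {
        if (!_availableQuantity.TryGetValue(productId, out var current) || current == 0)
            throw new InvalidOperationException("Product is out of stock");

        _availableQuantity[productId] = current - 1;

        // No PurchaseLog child collection here; reporting and restocking
        // subscribers build their own data from this event.
        _pendingEvents.Add(new ProductPurchased(Id, productId, DateTime.UtcNow, current - 1));
    }
}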
If PurchaseLog is not its own aggregate then it implies that it can only be retrieved or added as part of the child collection of SnackMachine.
Therefore, each time you want to add a PurchaseLog, you'd retrieve the SnackMachine with its child PurchaseLogs, add the new PurchaseLog to its collection, and then save changes on your unit of work.
Did you really need to retrieve the 30+ purchase logs, which are redundant for the purpose of creating a new purchase log?
Application Layer - Option 1 (PurchaseLog is an owned entity of SnackMachine)
// Retrieve the snack machine from repo, along with child purchase logs
// Assuming 30 logs, this would retrieve 31 entities from the database that
// your unit of work will start tracking.
SnackMachine snackMachine = await _snackMachineRepository.GetByIdAsync(snackMachineId);
// Ask snack machine to add a new purchase log to its collection
snackMachine.AddPurchaseLog(date, quantity);
// Update
await _unitOfWork.SaveChangesAsync();
Application Layer - Option 2 (PurchaseLog is an aggregate root)
// Get a snackmachine from the repo to make sure that one exists
// for the provided id. (Only 1 entity retrieved);
SnackMachine snackMachine = await _snackMachineRepository.GetByIdAsync(snackMachineId);
// Create purchase log
PurchaseLog purchaseLog = new(
    snackMachine,
    date,
    quantity);
await _purchaseLogRepository.AddAsync(purchaseLog);
await _unitOfWork.SaveChangesAsync();
PurchaseLog - option 2
class PurchaseLog
{
    int _snackMachineId;
    DateTime _date;
    int _quantity;

    public PurchaseLog(
        SnackMachine snackMachine,
        DateTime date,
        int quantity)
    {
        _snackMachineId = snackMachine?.Id ?? throw new ArgumentNullException(nameof(snackMachine));
        _date = date;
        _quantity = quantity;
    }
}
The second option follows the contours of your use case more accurately and also results in a lot less I/O against the database.

Select random N records from GraphQL query

I am building a simple quiz app that will allow a user to choose various categories and generate a 5-question quiz to test their knowledge. I have a long list of questions set up in AppSync, accessible via GraphQL. However, as that list keeps growing, it doesn't make sense for me to pull all of them to the client and randomly select there.
Does GraphQL support choosing random 5 from a query? Such that, serverside, I can select just 5 records at random?
query listAll {
  listQuestions(filter: {
    topic: {
      contains: "chocolate"
    }
  }) {
    items {
      question
      answer
    }
  }
}
I have thought about other approaches such as randomly assigning each record a number and filtering on this, but this would not be random each time.
Any ideas?
Does GraphQL support choosing random 5 from a query?
Not directly, no. Most of the more "interesting" things you might imagine doing in an SQL query, even simpler things like "return only the first 10 records" or "has a family name of 'Jones'", aren't directly supported in GraphQL. You have to build this sort of thing out of the primitives it gives you.
Such that, serverside, I can select just 5 records at random?
Most GraphQL server implementations support resolver functions which are arbitrary code called when a field value is requested. You could write a schema like
type Query {
  listQuestions(filter: QuestionFilter, random: Int): [Question!]!
}
and get access to the arguments in the resolver function.
It looks like AppSync has its own resolver system. It's not obvious to me from paging through the documentation that it supports a "pick n at random" method; it seems to be mostly designed as a facade around database storage, and most databases aren't optimized for this kind of query.
David is right that writing this logic inside a resolver is the GraphQL way to do it.
If you are using AWS AppSync, you can use a Lambda resolver and attach it to the query, writing the logic to pick random values inside the Lambda so that the random selection becomes part of the GraphQL response. This is one way of doing it.
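For what it's worth, the core of such a "pick N at random" step could be as small as this sketch (written in C# here; it assumes the Lambda has already fetched the filtered question list from its data source):
using System;
using System.Collections.Generic;
using System.Linq;

public static class RandomSampler
{
    private static readonly Random Rng = new Random();

    // Returns up to 'count' items chosen uniformly at random from 'source'.
    public static List<T> Pick<T>(IReadOnlyList<T> source, int count)
    {
        // Partial Fisher-Yates shuffle over a copy, then take the first 'count'.
        var items = source.ToList();
        var n = Math.Min(count, items.Count);
        for (var i = 0; i < n; i++)
        {
            var j = Rng.Next(i, items.Count);
            (items[i], items[j]) = (items[j], items[i]);
        }
        return items.Take(n).ToList();
    }
}

// Usage inside the Lambda handler (sketch):
// var five = RandomSampler.Pick(allMatchingQuestions, 5);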

Efficient Querying Data With Shared Conditions

I have multiple sets of data which are sourced from an Entity Framework code-first context (SQL CE). There's a GUI which displays the number of records in each query set, and upon changing some set condition (e.g. Date), the sets all need to recalculate their "count" value.
While every set's query is slightly different, most of them share common conditions in some way. A simple example:
RelevantCustomers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Customer");
RelevantSuppliers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Supplier");
So the thing is, there's enough of these demanding queries, that each time the user changes some condition (e.g. SelectedDate), it takes a really long time to recalculate the number of records in each set.
I realise that part of the reason for this is the need to query through, for example, the transactions each time to check what is really the same condition for both RelevantCustomers and RelevantSuppliers.
So my question is: given that these sets share common "base conditions" which depend on the same sets of data, is there some more efficient way I could be calculating them?
I was thinking something with custom generic classes like this:
QueryGroup<People>(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0)
{
    new Query<People>("Customers", P => P.Type == "Customer"),
    new Query<People>("Suppliers", P => P.Type == "Supplier")
}
I can structure this just fine, but what I'm finding is that it makes basically no difference to the efficiency as it still needs to repeat the "shared condition" for each set.
I've also tried pulling the base condition data out as a static "ToList()" first, but this causes issues when running into navigation entities (i.e. People.Addresses don't get loaded).
Is there some method I'm not aware of here in terms of efficiency?
Thanks in advance!
Give something like this a try: combine "similar" values into fewer queries, then separate the results afterwards. Also, use Any() rather than Count() for an existence check. Your updated attempt goes part-way, but will still result in two hits to the database. Also, when querying, it helps to ensure that you are querying against indexed fields, and those indexes will be more efficient with numeric IDs rather than strings (i.e. a TypeID of 1 vs. 2 for "Customer" vs. "Supplier"). Normalized values are better for indexing and lead to smaller records, at the cost of more verbose queries.
var types = new string[] { "Customer", "Supplier" };
var people = People.Where(p => types.Contains(p.Type)
    && p.Transactions.Any(t => t.Date > selectedDate)).ToList();
var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();
This results in just one hit to the database, and the Any should be more performant than fetching an entire count. We split the customers and suppliers after the fact from the in-memory set. The caveat here is that any attempt to access details such as transactions on the customers and suppliers would result in lazy-load hits, since we didn't eager-load them. If you need entire entity graphs then be sure to .Include() the relevant details, or be more selective about the data extracted in the first query, i.e. select anonymous types with the applicable details rather than just the entity, as in the sketch below.
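A sketch of that more selective variant (the projected property names such as Id and Name are assumptions about the People entity):
// Same single round trip, but projecting only the fields the screen needs,
// so there is nothing left to lazy-load afterwards.
var people = People
    .Where(p => types.Contains(p.Type)
        && p.Transactions.Any(t => t.Date > selectedDate))
    .Select(p => new
    {
        p.Id,
        p.Name,
        p.Type,
        RecentTransactionCount = p.Transactions.Count(t => t.Date > selectedDate)
    })
    .ToList();

var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();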

NHibernate Filtering data best practices

I have the following situation:
A user logs in and opens an overview of all products, but can only see the products that match a condition, and this condition is variable. Example: WHERE category IN ('catA', 'CatB')
An administrator logs in and opens an overview of all products; he can see all products, with no filter applied.
I need to make this as dynamic as possible. My data access classes use generics most of the time.
I've looked at filters, but my conditions are highly variable, so I don't see them as scalable enough.
We use NH filters for something similar, and it works fine. If no filter needs to be applied, you can simply omit setting any value for the filter. We use these filters for more basic stuff, filters that are applied nearly 100% of the time, e.g. deleted-object filters, client data segregation, etc. I'm not sure what scalability aspect you're looking for.
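For reference, enabling such a filter per session looks roughly like the sketch below. It assumes a filter named ActiveCategories with a categories parameter has been declared in the entity mapping, and sessionFactory, currentUser and Product are stand-ins; for an administrator you simply never enable the filter:
// Sketch: enable the mapped filter only for restricted users.
using (var session = sessionFactory.OpenSession())
{
    if (!currentUser.IsAdministrator)
    {
        session.EnableFilter("ActiveCategories")
               .SetParameterList("categories", currentUser.AllowedCategories);
    }

    var products = session.QueryOver<Product>().List();
}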
For more high level and complex filtering, we use a custom class that manipulates a repository root. Something like the following:
public IQueryOver<TIn, TOut> Apply(IQueryOver<TIn, TOut> query)
{
    return query.Where(x => ...);
}
If you have an IoC container integrated with your NH usage, something like this can easily be generalized and plugged into your stack. We have these repository manipulators that do simple where clauses, others that generate complex where clauses referencing domain logic, and others that join a second table and filter on that. A rough sketch of how they could be composed follows.
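This is only an illustration of the composition idea; the interface, class and property names (IQueryManipulator, CategoryRestriction, Product.Category) are assumptions, not our actual code:
using System.Collections.Generic;
using NHibernate;

public interface IQueryManipulator<TIn, TOut>
{
    IQueryOver<TIn, TOut> Apply(IQueryOver<TIn, TOut> query);
}

// One concrete manipulator: restricts products to the user's allowed categories.
public class CategoryRestriction : IQueryManipulator<Product, Product>
{
    private readonly List<string> _allowedCategories;

    public CategoryRestriction(List<string> allowedCategories)
    {
        _allowedCategories = allowedCategories;
    }

    public IQueryOver<Product, Product> Apply(IQueryOver<Product, Product> query)
    {
        return query.WhereRestrictionOn(x => x.Category).IsIn(_allowedCategories);
    }
}

// In the repository root, apply every manipulator the container resolved:
// var query = session.QueryOver<Product>();
// foreach (var manipulator in _manipulators)
//     query = manipulator.Apply(query);
// return query.List();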
You could save all allowed categories in a category list and pass this list to the query. If the list is not null and contains elements, you can work with the following:
List<string> allowedCategoriesList = new List<string>();
allowedCategoriesList.Add(...);
...
.WhereRestrictionOn(x => x.category).IsIn(allowedCategoriesList)
It's only important to skip this restriction when there are no filters at all (i.e. the user should see all entries without filtering), as otherwise you would not see a single result.
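A minimal sketch of that guard, assuming a QueryOver-based repository method where session and Product stand in for your own types:
// Only add the restriction when there is actually something to filter on;
// applying IsIn with an empty list would match no rows at all.
var query = session.QueryOver<Product>();

if (allowedCategoriesList != null && allowedCategoriesList.Count > 0)
{
    query = query.WhereRestrictionOn(x => x.category).IsIn(allowedCategoriesList);
}

var products = query.List();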

Single Instance VS PerCall in WCF

There are a lot of posts saying that SingleInstance is a bad design. But I think it is the best choice in my situation.
In my service I have to return a list of currently logged-in users (with additional data). This list is identical for all clients. I want to retrieve this list from the database every 5 seconds (for example) and return a copy of it to the client when needed.
If I use PerCall instancing mode, I will retrieve this list from the database every single time. The list is supposed to contain ~200-500 records, but can grow up to 10,000 in the future. Every record is complex and contains about 10 fields.
So what about performance? Is it better to use the "bad design" and get the list once, or to use the "good approach" and get the list from the database on every call?
So what about performance? Is it better to use the "bad design" and get the list once, or to use the "good approach" and get the list from the database on every call?
Performance and good design are NOT mutually exclusive. The problem with using a single instance is that it can only service a single request at a time, so all other requests are waiting on it to finish doing its thing.
Alternatively you could just leverage a caching layer to hold the results of your query instead of coupling that to your service.
Then your code might look something like this:
public IEnumerable<BigDataRecord> GetBigExpensiveQuery()
{
    // Double-checked locking is necessary to prevent
    // filling the cache multiple times in a multi-threaded
    // environment.
    if (Cache["BigQuery"] == null)
    {
        lock (_bigQueryLock)
        {
            if (Cache["BigQuery"] == null)
            {
                var data = DoBigQuery();
                Cache.AddCacheItem("BigQuery", data, TimeSpan.FromSeconds(5));
            }
        }
    }

    return (IEnumerable<BigDataRecord>)Cache["BigQuery"];
}
Now you can have as many instances as you want all accessing the same Cache.
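The Cache object above is pseudocode; a concrete version of the same idea, sketched here with System.Runtime.Caching.MemoryCache (BigDataRecord and DoBigQuery are stand-ins for the answer's types), could look like this:
using System;
using System.Collections.Generic;
using System.Runtime.Caching;

// Stand-in for whatever the expensive query actually returns.
public class BigDataRecord { }

public class BigQueryCache
{
    private static readonly object BigQueryLock = new object();

    public IEnumerable<BigDataRecord> GetBigExpensiveQuery()
    {
        var cached = MemoryCache.Default.Get("BigQuery") as IEnumerable<BigDataRecord>;
        if (cached != null)
            return cached;

        lock (BigQueryLock)
        {
            // Re-check inside the lock so only one thread fills the cache.
            cached = MemoryCache.Default.Get("BigQuery") as IEnumerable<BigDataRecord>;
            if (cached != null)
                return cached;

            var data = DoBigQuery();
            MemoryCache.Default.Set("BigQuery", data, DateTimeOffset.UtcNow.AddSeconds(5));
            return data;
        }
    }

    private IEnumerable<BigDataRecord> DoBigQuery()
    {
        // Placeholder for the expensive database query.
        return new List<BigDataRecord>();
    }
}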