Say I have two aggregates A and B. I'm using a factory method on A to create B. I also have a requirement that A cannot produce more than x instances of B.
It looks natural to have the following implementation:
// factory method on aggregate A
public B createB() {
    if (total >= x)                      // A must not produce more than x instances of B
        throw new IllegalStateException();
    total++;
    return new B();
}
But that would violate the rule against modifying two aggregates in one transaction: it both creates B and modifies A.
If I try to comply with this rule, I would:
1. Create B in A and raise an event like BCreated.
2. Update A's total count in the next transaction by handling the BCreated event.
To me, in this particular example, this looks like a weird workaround, since after calling the createB() method on A, I leave A in an inconsistent state.
Am I missing something?
For the sake of simplicity I would treat those aggregate changes as a single unit of work. The only caveat is that you have to deal with race conditions.
Now, if you want an eventually consistent solution as an example, you will have to model an aggregate that represents the transaction of the A and B changes: BCreation.
A.requestBCreation changes A's state and emits an event (BCreationAllowed) that BCreation reacts to; BCreation then dispatches a command to create B and handles the resulting domain event, say BCreated, or BCreationRejected. The BCreation aggregate listens for any of these events, and so on. It might be a complicated and overdesigned solution. You will also have to deal with race conditions and so 'synchronize' the process aggregate.
Everything would be so much easier if you use the actor model.
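To make the shape of the BCreation flow concrete, here is a rough sketch of such a process aggregate in Java; every type and method name is an assumption drawn from the description above, not an existing API, and the message-bus wiring is omitted:
// Illustrative event and command types (assumed names, plain records).
record BCreationAllowed(String aId) {}
record BCreated(String bId) {}
record BCreationRejected(String reason) {}
record CreateBCommand(String aId) {}

// The process aggregate that coordinates the A and B changes.
class BCreationProcess {
    enum State { REQUESTED, CREATING, DONE, REJECTED }
    private State state = State.REQUESTED;

    // A.requestBCreation emitted BCreationAllowed; react by dispatching the command that creates B.
    CreateBCommand on(BCreationAllowed event) {
        state = State.CREATING;
        return new CreateBCommand(event.aId());
    }

    void on(BCreated event)          { state = State.DONE; }      // B exists; the process is complete
    void on(BCreationRejected event) { state = State.REJECTED; }  // creation failed; A may need a compensating update
}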
Am I missing something?
You aren't missing anything easy, no.
The general term for the problem you face is set validation -- if you are going to maintain an invariant over a set of data, then all operations that modify that data must pass through the same single lock.
When all the world is a single relational database, that lock may be implicit -- the database itself is processing all transactions in some (logically) serialized order, so with some care you can be certain that the invariant is maintained because, down at the storage level, each transaction is all or nothing.
But if you distribute that data among two databases, all bets are off.
Another way of thinking about it: if your transaction can only work when all of the different "aggregates" are stored in the same database, that's an indication that what you really have is a larger aggregate, implicit and hidden in your implementation details -- and it is going to be more expensive to scale.
Commonly, we can instead relax the invariant somewhat -- make a best effort at maintaining the invariant, but also detecting violations and defining a protocol to compensate.
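As a loose illustration of that "best effort plus compensation" idea, here is a sketch that assumes hypothetical repository types and simply detects and compensates any B created beyond the limit after the fact:
import java.util.List;

// Hypothetical repository abstraction, for illustration only.
interface BRepository {
    List<String> findBIdsOwnedBy(String aId);  // all B instances created for a given A
    void markForCompensation(String bId);      // e.g. flag the surplus B for cancellation or refund
}

class SetInvariantChecker {
    private final BRepository bRepository;
    private final int maxBPerA;

    SetInvariantChecker(BRepository bRepository, int maxBPerA) {
        this.bRepository = bRepository;
        this.maxBPerA = maxBPerA;
    }

    // Run periodically, or in reaction to BCreated events: detect violations after the fact
    // and trigger a compensating action instead of locking both aggregates up front.
    void check(String aId) {
        List<String> bIds = bRepository.findBIdsOwnedBy(aId);
        for (int i = maxBPerA; i < bIds.size(); i++) {
            bRepository.markForCompensation(bIds.get(i));
        }
    }
}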
Although the "one aggregate per transaction" is a rule it probably will not kill you to be pragmatic and ignore it in certain situations. In fact, I would argue that there are going to be cases where it just isn't practical or even possible to get by any other way.
That being said you definitely should do your utmost to stick to that guideline. Your case is not uncommon. Stock levels and airline tickets (also stock levels, really) come to mind, for instance.
The only way to split the operations into two distinct steps would be to track the process. For this you need a process manager and you may even need some messaging but that is all plumbing.
To get past the issue you would need to "reserve" the creation in the first step using, say, some correlation identifier. That could then be saved in transaction A:
// begin tx (application layer)
if (A.Reserve(id))
{
// we're good
bus.Send(new RegisterBCommand
{
Id = id,
TheIdForA = theId
// other properties
}); // perhaps using Shuttle.Esb
}
// commit tx (application layer)
The next step would register the B entity and perhaps publish a BRegisteredEvent that could continue the process.
Just another point: you typically would only have A.CreateB() if both A and B live in the same bounded context. Another way to achieve something slightly similar would be to use an integration bounded context (say, your orchestration BC) and have CreateB() as an extension method on A, where A and B are in separate BCs but the orchestration layer makes use of both domains. The other route is a plain factory, or just new-ing it up in your application/domain service.
So my map reduce operation sums up a list of micro payments into a lump sum that I owe a particular user. The user_id ends up being the _id. I also store an array of ids of the micro payments that need to be paid. The output goes into a permanent collection called payments.
The output looks like this for one document
{ "_id" : ObjectId("4f48855606164f4765000004"), "value" : { "payment" : "5.0", "conversions" : [ ObjectId("4f5bd23baa113e964700000e") ] } }
I'd kind of like to track these payments so I was thinking about just building a mongoid document around the payments collection. I kind of know it can be done but I haven't really seen anyone doing it so it makes me think there must be a better way.
Also one problem with this approach is I'm making the payments every month so the _id being the user_id is going to conflict. Additionally, I think there is a possible transaction problem, because I need to update the micro payments to a different state so I know never to pay them again, and what happens if one of the payments fails? These state changes happen via state_machine, if that makes any difference.
Should I be accessing Map Reduce output as a Mongoid Document?
Sure you can definitely do this. That's kind of the reason the M/R is output to a collection rather than just "some file".
Also one problem with this approach is I'm making the payments every month so the _id being the user_id is going to conflict.
So clearly, the output of your M/R is important data. Do not leave this data in a collection that could be "hammered" by a future M/R. Instead, rename the collection you have created, or run a for loop that manually appends the data to a "keeper" collection.
In the "keeper" collection change the _id to something like _id: { uid: ObjectId('...'), month: "201203" }. You may also want to "fan out" the values field into several fields. And you will need to add a field for transaction ID.
Also remember that MongoDB uses "fire & forget" writes by default. These are low safety. You have financial data, so ensure that you are following all of the best practices for availability and data safety:
Journaling On
Replica Sets (with secondary data center)
Ensure that all writes to this collection/db are done with w: majority and journal: true; a minimal sketch follows this list. This will slow down DB throughput on this operation, as these writes can take a few hundred milliseconds.
Database passwords
Non-standard MongoDB port, IP white-listing (usual DB security)
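Here is the minimal sketch referred to above for the w: majority / journal: true point, again with the Java driver and assumed names:
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SafePaymentWrites {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Require a journaled, majority-acknowledged write for anything touching payments.
            MongoCollection<Document> payments = client.getDatabase("billing")        // assumed names
                    .getCollection("payments_keeper")
                    .withWriteConcern(WriteConcern.MAJORITY.withJournal(true));

            payments.insertOne(new Document("example", true)); // returns only after that acknowledgement
        }
    }
}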
what happens if one of the payments fails?
This is a non-trivial problem and way too complicated to explain here. Instead, see this document on the two-phase commit with MongoDB.
Note that two-phase commit requires MongoDB's findAndModify command. You will have to learn how to handle this with Mongoid.
We have:
users, each of which has events, each of which has several properties (time, type etc.). Our basic use case is to fetch all events of a given user in a given time-span.
We've been considering the following alternatives in Cassandra for the Events column-family. All alternatives share: key=user_id (UUID), column_name = event_time
1. column_value = serialized object of event properties. Will need to read/write all the properties every time (not a problem), but might also be difficult to debug (can't use the Cassandra command-line client easily).
2. The column is actually a super column, and the sub-columns are the separate properties. Means reading all events(?) every time (possible, though sub-optimal). Any other cons?
3. column_value is a row-key to another CF, where the event properties are stored. Means maintaining two tables -> complicates calls + reads/writes are slower(?).
Anything we're missing? Any standard best-practice here?
Alternative 1: Why go to Cassandra if you are going to store a serialized object? MongoDB or a similar product would perform better for this task, if I get it right (I've never actually tried a document-based NoSQL store, so correct me if I'm wrong on this one). Anyway, I tried this alternative once in MySQL 6 years ago and it is still painful to maintain today.
Alternative 2: Sorry, I haven't had to play with super columns yet. I would use this only if I frequently had to show a lot of information on many users (i.e. much more than just their username and a few qualifiers) and their respective events in one query. It could also make a query over a given time span a bit tricky if there are conditions on the user itself too, since a user's row is likely to have event columns that fit in the span and other columns that don't.
Alternative 3: This would definitely be my choice in most cases. You are not likely to write events and create a user in the same transaction, so there is no worry about consistency. Use the username itself as a standard event column (don't forget to index it) so your calls will be pretty fast. More on this type of data model at http://www.datastax.com/docs/0.8/ddl/index.
Yes, it's a two-call read, but it is two different families of data anyway.
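In today's CQL terms (the question predates CQL), alternative 3 roughly becomes two tables and a two-step read; a hedged sketch with the DataStax Java driver, where the keyspace, table, and column names are all assumptions:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.time.Instant;
import java.util.UUID;

public class EventsByUser {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("events_ks").build()) {
            PreparedStatement byUser = session.prepare(
                    "SELECT event_id FROM user_events WHERE user_id = ? AND event_time >= ? AND event_time < ?");
            PreparedStatement byEvent = session.prepare(
                    "SELECT type, payload FROM event_details WHERE event_id = ?");

            UUID userId = UUID.fromString(args[0]);
            Instant from = Instant.parse(args[1]);
            Instant to = Instant.parse(args[2]);

            // Call 1: the per-user timeline (the first column family in alternative 3).
            for (Row ref : session.execute(byUser.bind(userId, from, to))) {
                // Call 2: the event properties stored in the second column family.
                Row event = session.execute(byEvent.bind(ref.getUuid("event_id"))).one();
                System.out.println(event.getString("type") + " " + event.getString("payload"));
            }
        }
    }
}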
As for best practices, the field is kind of new; I'm not sure there are any widely accepted ones yet.
I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
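Spelled out as code, with purely hypothetical inGroup/hasAttribute helpers, that one rule reads roughly like this:
// Hypothetical person abstraction, only to make the rule readable; not the real schema or API.
interface Person {
    boolean inGroup(String group);
    boolean hasAttribute(String attribute);
}

class StatusRules {
    static boolean hasStatusA(Person p) {
        boolean attributeOk = p.hasAttribute("O")
                ? (p.hasAttribute("P") || p.hasAttribute("Q"))    // has O: either P or Q is enough
                : (p.hasAttribute("P") && !p.hasAttribute("Q"));  // lacks O: P but not Q
        return p.inGroup("X")
                && attributeOk
                && !p.hasAttribute("R")
                && (!p.inGroup("Y") || p.inGroup("Z"));           // not in Y, unless also in Z
    }
}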
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer.
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?"
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration...and so on.
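If you do reach the background-task iteration, the skeleton can stay simple; in the sketch below, recomputeAllFlags() is a hypothetical method that re-runs the same rules the view used and rewrites the flag table:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FlagRecomputeJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Recompute every five minutes; in practice you might instead react to inserts/updates.
        scheduler.scheduleAtFixedRate(FlagRecomputeJob::recomputeAllFlags, 0, 5, TimeUnit.MINUTES);
    }

    static void recomputeAllFlags() {
        // Hypothetical: re-run the rule logic (e.g. the same SQL the view used) and rewrite the flag table.
    }
}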
Share and enjoy.
I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with ID and Value columns.
I would have two methods for updating statuses. The first is a stored procedure that either contains all the logic or calls separate stored procs to figure out each status; you could make this dynamic by having a function for each status evaluation, which the one stored proc then calls. The second is to have whatever stored proc(s) update user info call a stored proc that updates all of that user's statuses based on the current data. Together these give you real-time updates when the underlying data changes and, if you add a new status, a way to recompute all statuses with the new logic by calling the first method.
Hopefully you have a single point of update for the user data, such as a user-update stored proc, and you can put the status-update stored proc call in that procedure. This would also save you from having to schedule a task every n seconds to update statuses.
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.
I have a situation where I want to do some DB-related operations in a Java application (e.g. on Eclipse). I use MySQL as a RDBMS and Hibernate as an ORM provider.
I retrieve all records using embedded SQL in Java:
// Define connections, etc.
ResultSet result = myStmt.executeQuery("SELECT * FROM employees");
// iterator
I retrieve all records using Hibernate ORM / JPQL:
// Connections,Entity Manager....etc
List result = em.createQuery("SELECT emp FROM Employees emp").getResultList();
// iterator
I know that the RDBMS data lives in secondary memory (disk). The question is: when I get both results back, where are the employees actually? In secondary memory (SM) or in main memory (MM)?
In the end I want to have two object populations for further testing, one operating on SM and one on MM. How is this possible?
Thanks
Frank
Your Java objects are real Java objects; they are in (to use your term) MM, at least for a while. The beauty of the Hibernate/JPA programming model is that while they are in MM you can pretty much treat them as any other Java objects, make a few changes to them, etc. Then, at some agreed time, Hibernate's persistence mechanism gets them back to SM (disk).
You will need to read up on the implications of Sessions and Transactions in order to understand when the transitions between MM and SM occur, and also very importantly, what happens if two users want to work with the same data at the same time.
Maybe start here
It is also possible to create objects in MM that are (at least for now) not related to any data on disk - these are "transient" objects - and also to "disconnect" data in memory from what's on disk.
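As a minimal sketch of those transitions (the persistence unit name, the entity, and its fields are all assumptions):
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

// Assumed mapped entity, just for the sketch.
@Entity
class Employee {
    @Id Long id;
    int salary;
}

public class MemoryVsDiskDemo {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("demoUnit"); // assumed unit name
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();
        Employee emp = em.find(Employee.class, 42L); // read from disk (SM); now a managed object in memory (MM)
        emp.salary += 100;                           // the change exists only in memory for now
        em.getTransaction().commit();                // flush on commit: Hibernate writes the change back to disk

        Employee draft = new Employee();             // transient: in memory only, not tied to any row yet
        em.detach(emp);                              // detached: still in memory, no longer tracked by the session
        em.close();
        emf.close();
    }
}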
My bottom line here is that Hibernate/JPA does remove much of the grunt work from persistence coding, but it cannot hide the complexity of scale. As your data volumes increase, your data model's complexity grows, and your users' actions contend for the same data, you need to do some serious thinking. Hibernate allows you to achieve good things, but it can't do that thinking for you; you have to make careful choices as your problem domain gets more complex.
We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic using a #define value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating what approach to store switches in our new relational (MS-SQL) database:
Put each switch in its own field
Pros: fast and easy read/write/access - 1 row per client
Cons: seems kludgey, need to change schema every time we add a switch
Create a row per switch per client
Pros: unlimited switches, no schema changes necessary w/ new switches
Cons: slightly more arduous to pull data, lose intellisense w/o extra work
Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use wrong 'switch bank' field for comparison
I'm leaning towards #1 ... any thoughts?
It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third the space over solution number two, but the real advantage is over solutions one and three: indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can do them efficiently with a single index on a tag table, but there's no practical way to do them with the other solutions.
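For the kind of query described above, here is a hedged sketch over an assumed client_tag(client_id, tag_id) table, issued through plain JDBC:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TagQuery {
    // Clients that have tags 7, 45 and 130 set but neither 86 nor 14; only the tag table is touched.
    static void printMatchingClients(Connection conn) throws SQLException {
        String sql =
            "SELECT ct.client_id " +
            "FROM client_tag ct " +
            "WHERE ct.tag_id IN (7, 45, 130) " +
            "  AND ct.client_id NOT IN (SELECT client_id FROM client_tag WHERE tag_id IN (86, 14)) " +
            "GROUP BY ct.client_id " +
            "HAVING COUNT(DISTINCT ct.tag_id) = 3";   // all three required tags are present
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getLong("client_id"));
            }
        }
    }
}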
You could think about using database views to give you the best of each solution.
For example store the data as one row per switch, but use a view that pivots the switches (rows) into columns where this is more convenient.
I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and probably would, still have a generic table with one switch per row for the ones that don't fit into the groups.
Remember, too, that if you change the method, you may have a lot of application code to change that relies on the existing way of storing the data. Whether you should change the method may depend on exactly how hard that will be and how many hours it will take to change everything associated. I agree with Markus' solution, but you do need to consider exactly how hard the refactoring is going to be and whether your project can afford the time.
The refactoring book I've been reading would suggest that you maintain both for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then on a set date you drop the original (and the triggers) from the database. This lets you use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed. It requires discipline, however, as it is easy to never get rid of the legacy code and columns because everything is working and you are afraid to touch them. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything at once.
I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4. Store the switches in a dictionary of name/value pairs and serialize it to the database.
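A minimal sketch of option 4, assuming the dictionary is serialized to JSON with Jackson and stored in a single column (all names and the library choice are illustrative):
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.Map;

public class SwitchBag {
    public static void main(String[] args) throws Exception {
        Map<String, Boolean> switches = new LinkedHashMap<>();
        switches.put("SHOW_SALARY_ON_CHECKS", true);   // illustrative switch names
        switches.put("REQUIRE_APPROVAL", false);

        ObjectMapper mapper = new ObjectMapper();
        String column = mapper.writeValueAsString(switches);    // store this string in one column per client
        Map<String, Boolean> loaded = mapper.readValue(column,  // deserialize when the client is loaded
                mapper.getTypeFactory().constructMapType(Map.class, String.class, Boolean.class));
        System.out.println(loaded.get("SHOW_SALARY_ON_CHECKS"));
    }
}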
I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client S1 S2 S3 S4 S5 ...
A X X
B X X X
But there are usually methods for doing that in a database-specific way.