DDD - Persistence Model and Domain Model - oop

I am trying to learn domain-driven design (DDD), and I think I got the basic idea. But there is something confusing me.
In DDD, are the persistence model and domain model different things? I mean, we design our domain and classes with only domain concerns in mind; that's okay. But after that, when we are building our repositories or any other data persistence system, should we create another representation of our model to use in the persistence layer?
I was thinking our domain model is used in persistence too, meaning our repositories return our domain objects from queries. But today, I read this post, and I'm a little confused:
Just Stop It! The Domain Model Is Not The Persistence Model
If that's true, what would be the advantage of having persistence objects separate from domain objects?

Just think of it this way, the domain model should be dependent upon nothing and have no infrastructure code within it. The domain model should not be serializable or inherit from some ORM objects or even share them. These are all infrastructure concerns and should be defined separate from the domain model.
But that is only if you're going for pure DDD and your project values scalability and performance over the speed of initial development. Many times, mixing infrastructure concerns with your "domain model" can help you achieve great strides in speed at the cost of scalability. The point is, you need to ask yourself, "Are the benefits of pure DDD worth the cost in speed of development?". If your answer is yes, then here is the answer to your question.
Let's start with an example where your application begins with a domain model and it just so happens that the tables in the database match your domain model exactly. Now your application grows by leaps and bounds, and you begin to experience performance issues when querying the database. You have applied a few well-thought-out indexes, but your tables are growing so rapidly that it looks like you may need to de-normalize your database just to keep up. So, with the help of a DBA, you come up with a new database design that handles your performance needs, but now the tables are vastly different from the way they were before, and chunks of your domain entities are spread across multiple tables rather than one table per entity.
This is just one example, but it demonstrates why your domain model should be separate from your persistence model. In this example, you don't want to break out the classes of your domain model to match the changes you made to the persistence model design and essentially change the meaning of your domain model. Instead, you want to change the mapping between your new persistence model and the domain model.
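As a rough sketch of that separation (all names here are made up for illustration, not taken from the question), the domain class, the persistence-shaped record, and the mapping between them might look like this:

```java
// Domain model: pure business behaviour, no persistence concerns.
record CustomerId(String value) {}

class Customer {
    private final CustomerId id;
    private String name;

    Customer(CustomerId id, String name) {
        this.id = id;
        this.name = name;
    }

    void rename(String newName) {
        if (newName == null || newName.isBlank()) {
            throw new IllegalArgumentException("Name must not be empty");
        }
        this.name = newName;
    }

    CustomerId id() { return id; }
    String name()   { return name; }
}

// Persistence model: shaped after the (possibly denormalized) tables.
class CustomerRecord {
    String id;
    String name;
}

// The mapping lives in the infrastructure layer, not in the domain.
class CustomerMapper {
    CustomerRecord toRecord(Customer customer) {
        CustomerRecord record = new CustomerRecord();
        record.id = customer.id().value();
        record.name = customer.name();
        return record;
    }

    Customer toDomain(CustomerRecord record) {
        return new Customer(new CustomerId(record.id), record.name);
    }
}
```

When the DBA splits or merges tables, only CustomerRecord and CustomerMapper change; Customer keeps its meaning.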
There are several benefits to keeping these designs separate such as scalability, performance, and reaction time to emergency db changes, but you should weigh them against the cost and speed of initial development. Generally, the projects that will gain the most benefit from this level of separation are large-scale enterprise applications.
UPDATE FOR COMMENTATORS
In the world of software development, there are any number of possible solutions to a given problem. Because of this, there is an inverse relationship between flexibility and initial speed of development. As a simple example, I could hard-code logic into a class, or I could write a class that allows dynamic logic rules to be passed into it. The former option would have a higher speed of development, but at the price of a lower degree of flexibility. The latter option would have a higher degree of flexibility, but at the cost of a lower speed of development. This holds true within every coding language because there is always a range of possible solutions.
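To make that example concrete, here is a small sketch of both options (the discount rule itself is invented purely for illustration):

```java
import java.util.List;
import java.util.function.Predicate;

// Option 1: hard-coded logic. Fastest to write, least flexible.
class HardCodedDiscountPolicy {
    double discountFor(double orderTotal) {
        return orderTotal > 100 ? 0.10 : 0.0;   // the rule is baked into the code
    }
}

// Option 2: rules passed in at runtime. More flexible, slower to build.
class ConfigurableDiscountPolicy {
    record DiscountRule(Predicate<Double> applies, double discount) {}

    private final List<DiscountRule> rules;

    ConfigurableDiscountPolicy(List<DiscountRule> rules) {
        this.rules = rules;
    }

    double discountFor(double orderTotal) {
        // pick the best discount among all rules whose condition matches
        return rules.stream()
                .filter(rule -> rule.applies().test(orderTotal))
                .mapToDouble(DiscountRule::discount)
                .max()
                .orElse(0.0);
    }
}
```

Changing the hard-coded rule means editing and redeploying the class; changing the configurable one means passing in a different list of rules.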
Many tools are available that help you increase your initial development speed and flexibility. For example, an ORM tool may increase the speed of development for your database access code while also giving you the flexibility to choose whatever specific database implementations the ORM supports. From your perspective, this is a net gain in both time and flexibility minus the cost of the tool (some of which are free) which may or may not be worth it to you based on the cost of development time relative to the value of the business need.
But, for this conversation in coding styles, which is essentially what Domain Driven Design is, you have to account for the time it took to write that tool you're using. If you were to write that ORM tool or even write your database access logic in such a way that it supports all of the implementations that tool gives you, it would take much longer than if you were to just hard-code the specific implementation you plan on using.
In summary, tools can help you offset your own time to production and the price of flexibility, often by distributing the cost of that time to everyone who purchases the tool. But any code, including the code that utilizes a tool, will remain affected by the speed/flexibility relationship. In this way, Domain Driven Design allows for greater flexibility than if you were to entangle your business logic, database access, service access, and UI code all together, but at the cost of time to production. Domain Driven Design serves enterprise-level applications better than small applications because enterprise-level applications tend to have a greater cost for the initial development time in relation to business value, and because they are more complex, they are also more subject to change, requiring greater flexibility at a reduced cost in time.

In DDD, are persistence model and domain model different things?
In DDD you have the domain model and the repository. That's it! Whether, inside the repository, you persist the domain model directly or convert it to a persistence model before persisting it is up to you! It's a matter of design, your design.
The domain doesn't care about how models are saved. It's an implementation detail of the repository and it doesn't matter to the domain. That's the entire purpose of repositories: encapsulate persistence logic and details inside them.
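As a hedged sketch of that encapsulation (Order, OrderId, OrderEntity and OrderMapper are placeholder types, and JPA is just one assumed persistence mechanism), the domain only ever sees the interface, while the implementation decides whether and how to convert:

```java
import jakarta.persistence.EntityManager;  // javax.persistence on older stacks

// Domain layer: the repository contract speaks only in domain terms.
interface OrderRepository {
    Order findById(OrderId id);
    void save(Order order);
}

// Infrastructure layer: converts to a persistence model before handing it
// to the ORM. Swapping this class out does not touch the domain.
class JpaOrderRepository implements OrderRepository {
    private final EntityManager entityManager;

    JpaOrderRepository(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    @Override
    public Order findById(OrderId id) {
        OrderEntity entity = entityManager.find(OrderEntity.class, id.value());
        return OrderMapper.toDomain(entity);   // persistence detail stays here
    }

    @Override
    public void save(Order order) {
        entityManager.merge(OrderMapper.toEntity(order));
    }
}
```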
But as developers we know it's not always possible to build a domain 100% immune from persistence interference, even though they are different things. Here in this post I detail some pros and cons of having the domain model completely free and isolated from the persistence model.

In DDD, are persistence model and domain model different things?
Yes, but that does not necessarily imply a different set of classes to explicitly represent the persistence model.
If using a relational database for persistence, an ORM such as NHibernate can take care of representing the persistence model through mappings to domain classes. In this case there are no explicit persistence model classes. The success of this approach depends on the mapping capabilities of the ORM. NHibernate, for example, can support an intermediate mapping class through component mappings. This allows the use of an explicit persistence model class when the need arises.
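For a rough Java analogue (NHibernate itself is .NET), JPA can play the same role; the class below is a sketch with made-up names. Annotation-based mapping like this puts the mapping metadata on the domain class itself, while JPA's external orm.xml mapping would keep the class entirely free of it, closer to the spirit of NHibernate's XML or component mappings:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// The ORM maps the domain class straight onto the table; there is no
// separate persistence model class unless the mapping becomes too complex.
@Entity
@Table(name = "customers")
class Customer {

    @Id
    private String id;

    @Column(name = "full_name")
    private String name;

    protected Customer() {}                 // required by JPA

    Customer(String id, String name) {
        this.id = id;
        this.name = name;
    }
}
```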
If using a document database for persistence, there is usually even less need for a persistence model since the domain model only needs to be serializable in order to be persisted.
Therefore, use an explicit persistence model class when there is a complex mapping that cannot be attained with ORM mappings to the domain model. The difference between the domain model and the persistence model remains regardless of implementation.

Related

Data model design guidelines with GEODE

We are soon going to start something with GEODE regarding reference data. I would like to get some guidelines for the same.
As you know, in the financial reference data world there exist complex relationships between various reference data entities like Instrument, Account, Client, etc., which might be available in the database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what's the best way to deal with this in an in-memory grid?
Case 1:
Separate regions for all tables in your database, and then do a similar join using OQL as you do in the database?
Even if you do so, you will have to take great care that related entities are always co-located within the same partition.
Modeling 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query with equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee and Department, using emp.deptId = dept.deptId [OK, fantastic: one region with such a view model exists]
(2) I have another join query requiring Employee, Department, Salary, and Address joins to address a different requirement
So again I have to create a view model to address (2), which will contain similar Employee and Department data to (1). This may soon hit the memory threshold.
Changes in the database can still be managed by event listeners, but what are the recommendations for that?
Thanks,
Dharam
I think your general question is pretty broad and there isn't just one recommended approach covering all use cases (primarily all the analytical views/models of your data required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
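As a sketch of what that co-location can look like in code (the key format, region layout, and class name are assumptions, not a recommendation), a PartitionResolver can route Employee entries by department so they land in the same buckets as the matching Department entries:

```java
import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionResolver;

// Assumes Employee keys of the form "<deptId>|<employeeId>", so the routing
// object can be derived from the key alone (values may be null on reads).
class DeptIdPartitionResolver implements PartitionResolver<String, Object> {

    @Override
    public Object getRoutingObject(EntryOperation<String, Object> opDetails) {
        String key = opDetails.getKey();
        // Entries that return the same routing object end up in the same bucket.
        return key.substring(0, key.indexOf('|'));
    }

    @Override
    public String getName() {
        return "DeptIdPartitionResolver";
    }

    @Override
    public void close() {
        // no resources to release
    }
}
```

The Employee region would additionally be declared colocated-with the Department region (whose keys are the department ids), so that OQL equi-joins on deptId only ever touch data held together on the same member.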
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
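To make the AEQ idea concrete, here is a hedged sketch of an AsyncEventListener that keeps a denormalized "analytical" Region up to date from writes to a "transactional" Region (the view Region, its String value type, and the trivial transformation are assumptions):

```java
import java.util.List;

import org.apache.geode.cache.Region;
import org.apache.geode.cache.asyncqueue.AsyncEvent;
import org.apache.geode.cache.asyncqueue.AsyncEventListener;

// Attached to an AsyncEventQueue on the transactional Region; each batch of
// events is replayed into a read-optimized view Region.
class AnalyticalViewUpdater implements AsyncEventListener {

    private final Region<String, String> viewRegion;

    AnalyticalViewUpdater(Region<String, String> viewRegion) {
        this.viewRegion = viewRegion;
    }

    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        for (AsyncEvent event : events) {
            // Aggregate/enrich here; this sketch just stores a flattened string.
            viewRegion.put(String.valueOf(event.getKey()),
                           String.valueOf(event.getDeserializedValue()));
        }
        return true;   // true acknowledges the batch so it is removed from the queue
    }

    @Override
    public void close() {
        // nothing to release
    }
}
```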
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite optimal compared with Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
Also, you might consider how expensive it is to join the data on-the-fly. Meaning, if your are able to aggregate, transform and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).
You should know, and I am sure you are aware, GemFire is a Key-Value store, therefore mapping a complex object graph into separate Regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to eagerly vs. lazily load them is an overloaded problem, especially in a distributed, replicated data store such as GemFire where consistency and availability tradeoffs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
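For example, a Spring Data GemFire repository might look roughly like this (a sketch only; the entity, region name, and exact annotation package vary by version):

```java
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.annotation.Region;  // older versions: ...gemfire.mapping.Region
import org.springframework.data.repository.CrudRepository;

@Region("Employee")
class Employee {
    @Id
    private String id;
    private String name;
    private String deptId;
    // getters/setters omitted for brevity
}

// Spring Data derives the query from the method name at runtime.
interface EmployeeRepository extends CrudRepository<Employee, String> {
    List<Employee> findByDeptId(String deptId);
}
```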
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
Any design choice you make will undoubtedly involve a hybrid approach. Often the path is not clear, since it really "depends" (i.e. depends on the application and data access patterns, load, all that).
But one thing is for certain: make sure you have a good cursory knowledge and understanding of the underlying data store and its data management capabilities, particularly as they pertain to consistency and availability, beginning with this.
Note, there is also a GemFire Slack channel as well as an Apache DEV mailing list you can use to reach out to the GemFire experts and community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

Are there any REAL advantages to NoSQL over RDBMS for structured data on one machine?

So I've been trying hard to figure out if NoSQL is really bringing that much value outside of auto-sharding and handling UNSTRUCTURED data.
Assuming I can fit my STRUCTURED data on a single machine OR have an effective 'auto-sharding' feature for SQL, what advantages do any NoSQL options offer? I've determined the following:
Document-based (MongoDB, Couchbase, etc.) - Outside of its 'auto-sharding' capabilities, I'm having a hard time understanding where the benefit is. Linked objects are quite similar to SQL joins, while embedded objects significantly bloat document size and cause a challenge regarding duplication (a comment could belong to both a post AND a user, and therefore the data would be redundant). Also, loss of ACID and transactions is a big disadvantage.
Key-value based (Redis, Memcached, etc) - Serves a different use case, ideal for caching but not complex queries
Columnar (Cassandra, HBase, etc ) - Seems that the big advantage here is more how the data is stored on disk, and mostly useful for aggregations rather than general use
Graph (Neo4j, OrientDB, etc) - The most intriguing, the use of both edges and nodes makes for an interesting value-proposition, but mostly useful for highly complex relational data rather than general use.
I can see the advantages of key-value, columnar and graph DBs for specific use cases (caching, social network relationship mapping, aggregations), but can't see any reason to use something like MongoDB for STRUCTURED data outside of its 'auto-sharding' capabilities.
If SQL has a similar 'auto-sharding' ability, would SQL be a no-brainer for structured data? Seems to me it would be, but I would like the community's opinion...
NOTE: This is in regards to a typical CRUD application like a Social Network, E-Commerce site, CMS etc.
If you're starting off on a single server, then many advantages of NoSQL go out the window. The biggest advantages of the most popular NoSQL databases are high availability and less downtime. Eventual consistency requirements can lead to performance improvements as well. It really depends on your needs.
Document-based - If your data fits well into a handful of small buckets of data, then a document-oriented database is a good fit. For example, on a classifieds site we have Users, Accounts and Listings as the core data. The bulk of search and display operations are against the Listings alone. With the legacy database we have to do nearly 40 join operations to get the data for a single listing. With NoSQL it's a single query. With NoSQL we can also create indexes against nested data, again with results queried without joins. In this case, we're actually mirroring data from SQL to MongoDB for purposes of search and display (there are other reasons), with a longer-term migration strategy being worked on now. ElasticSearch, RethinkDB and others are great databases as well. RethinkDB actually takes a very conservative approach to the data, and ElasticSearch's out-of-the-box indexing is second to none.
Key-value store - Caching is an excellent use case here: when you are running a medium- to high-volume website where data is mostly read, a good caching strategy alone can get you 4-5 times the users handled by a single server. Key-value stores (RocksDB, LevelDB, Redis, etc.) are also very good options for graph data, as individual mappings can be held as subject-predicate-target values, which can be very fast for graphing options over the top.
Columnar - Cassandra in particular can be used to distribute significant amounts of load for even single-value lookups. Cassandra's scaling is very linear to the number of servers in use. Great for heavy read and write scenarios. I find this less valuable for live searches, but very good when you have a VERY high load and need to distribute. It takes a lot more planning, and may well not fit your needs. You can tweak settings to suit your CAP needs, and even handle distribution to multiple data centers in the box. NOTE: Most applications emphatically do NOT need this level of use. ElasticSearch may be a better fit in most scenarios you would consider HBase/Hadoop or Cassandra for.
Graph - I'm not as familiar with graph databases, so can't comment here (beyond using a key-value store as underlying option).
Given that you then comment on MongoDB specifically vs SQL... even if both auto-shard: PostgreSQL in particular has made a lot of strides in terms of making unstructured data usable (JSON/JSONB types), not to mention the power you can get from something like PLV8; it's probably the most suited to handling the types of loads you might throw at a document store, with the advantages of NoSQL. Where it happens to fall down is that replication, sharding and failover are bolted-on solutions, not really in the box.
For small to medium loads, sharding really isn't the best approach. Most scenarios are mostly read, so having a replica set with additional read nodes is usually better when you have 3-5 servers. MongoDB is great in this scenario: the master node is automagically elected, and failover is pretty fast. The only weirdness I've seen is when Azure went down in late 2014 and only one of the servers came up at first; the other two came up almost 40 minutes later. With replication any given read request can be handled in whole by a single server. Your data structures become simpler, and your chances of data loss are reduced.
Again, in my own example above, for a medium-sized classifieds site, the vast majority of data belongs to a single collection... it is searched against, and displayed from, that collection. With this use case a document store works much better than structured/normalized data. The way the objects are stored is much closer to their representation in the application. There's less of a cognitive disconnect and it simply works.
The fact is that SQL JOIN operations kill performance, especially when aggregating data across those joins. For a single query for a single user it's fine, even with a dozen of them. When you get to dozens of joins with thousands of simultaneous users, it starts to fall apart. At this point you have several choices...
Caching - caching is always a great approach, and the less often your data changes, the better the approach. This can be anything from a set of memcache/redis instances to using something like MongoDB, RethinkDB or ElasticSearch to hold composite records. The challenge here comes down to updating or invalidating your cached data.
Migrating - migrating your data to a data store that better represents your needs can be a good idea as well. If you need to handle massive writes or very massive read scenarios, no SQL database can keep up. You could NEVER handle the likes of Facebook or Twitter on SQL.
Something in between - As you need to scale it depends on what you are doing and where your pain points are as to what will be the best solution for a given situation. Many developers and administrators fear having data broken up into multiple places, but this is often the best answer. Does your analytical data really need to be in the same place as your core operational data? For that matter do your logins need to be tightly coupled? Are you doing a lot of correlated queries? It really depends.
Personal Opinions Ahead
For me, I like the safety net that SQL provides. Having it as the central store for core data is my first choice. I tend to treat RDBMSs as dumb storage; I don't like being tied to a given platform. I feel that many people try to over-normalize their data. Often I will add an XML or JSON field to a table so additional pieces of data can be stored without bloating the schema, specifically if it's unlikely to ever be queried... I'll then have properties in my objects in the application code that store in those fields. A good example may be a payment... if you are currently using one system, or multiple systems (one for CC along with PayPal, Google, Amazon, etc.), then the details of the transaction really don't affect your records, so why create 5+ tables to store this detailed data? You can even use JSON for primary storage and have computed columns derived and persisted from that JSON for broader query capability and indexing where needed. Databases like PostgreSQL and MySQL (IIRC) offer direct indexing against JSON data as well.
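To make the "JSON column plus derived, indexed column" idea concrete, here is a hedged JDBC sketch assuming PostgreSQL 12+ (for generated columns); the table, column names, and connection details are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PaymentJsonSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/shop", "app", "secret");
             Statement st = conn.createStatement()) {

            // The payment details live in a single JSONB document.
            st.execute("CREATE TABLE IF NOT EXISTS payments (" +
                       " id bigserial PRIMARY KEY," +
                       " details jsonb NOT NULL)");

            // Derived, persisted column for the one field we expect to query on.
            st.execute("ALTER TABLE payments ADD COLUMN IF NOT EXISTS provider text " +
                       "GENERATED ALWAYS AS (details->>'provider') STORED");

            // GIN index for ad-hoc containment queries against the document...
            st.execute("CREATE INDEX IF NOT EXISTS payments_details_gin " +
                       "ON payments USING GIN (details)");
            // ...and a plain b-tree index on the generated column.
            st.execute("CREATE INDEX IF NOT EXISTS payments_provider_idx " +
                       "ON payments (provider)");
        }
    }
}
```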
When data is a natural fit for a document store, I say go for it... if the vast majority of your queries are for something that fits better to a single record or collection, denormalize away. Having this as a mirror to your primary data is great.
For write-heavy data you want multiple systems in play... It depends heavily on your needs here... Do you need fast hot-query performance? Go with ElasticSearch. Do you need absolutely massive horizontal scale? HBase or Cassandra.
The key takeaway here is not to be afraid to mix it up... there really isn't a one-size-fits-all. As an aside, I feel that if PostgreSQL comes up with a good in-the-box solution (for the open-source version) for even just replication and automated failover, they'll be in a much better position than most at that point.
I didn't really get into it, but I feel I should mention that there are a number of SaaS solutions and other providers that offer hybrid SQL systems. You can develop against MySQL/MariaDB locally and deploy to a system with SQL on top of a distributed storage cluster. I still feel that HBase or ElasticSearch are better for logging and analytical data, but the SQL-on-top solutions are also compelling.
More: http://www.mongodb.com/nosql-explained
Schema-less storage (or schema-free). Ability to modify the storage (basically add new fields to records) without having to modify the storage 'declared' schema. RDBMSs require the explicit declaration of said 'fields' and require explicit modifications to the schema before a new 'field' is saved. A schema-free storage engine allows for fast application changes, just modify the app code to save the extra fields, or rename the fields, or drop fields and be done.
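As a small illustration of that workflow with the MongoDB Java driver (database, collection, and field names are made up):

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class SchemaFreeExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> listings =
                    client.getDatabase("classifieds").getCollection("listings");

            // Older documents were written without a "promoted" field...
            listings.insertOne(new Document("title", "Bike").append("price", 120));

            // ...newer code simply starts writing it. No DDL, no migration;
            // readers just have to tolerate its absence on old records.
            listings.insertOne(new Document("title", "Car")
                    .append("price", 4500)
                    .append("promoted", true));
        }
    }
}
```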
Traditional RDBMS folk consider schema-free storage a disadvantage because they argue that in the long run one needs to query the storage, and heterogeneous records (some have some fields, some have other fields) are difficult to handle. But for a start-up, schema-free storage is overwhelmingly alluring, as fast iteration and time-to-market are all that matter (and often rightly so).
You asked us to assume that either the data can fit on a single machine, OR your database has an effective auto-sharding feature.
Going with the assumption that your SQL data has an auto-sharding feature, that means you're talking about running a cluster. Any time you're running a cluster of machines you have to worry about fault-tolerance.
For example, let's say you're using the simplest approach of sharding your data by application function, and are storing all of your user account data on server A and your product catalog on server B.
Is it acceptable to your business if server A goes down and none of your users can login?
Is it acceptable to your business if server B goes down and no one can buy things?
If not, you need to worry about setting up data replication and high-availability failover. Doable, but not pleasant or easy for SQL databases. Other types of sharding strategies (key, lookup service, etc) have the same challenges.
Many NoSQL databases will automatically handle replication and failovers. Some will do it out of the box, with very little configuration. That's a huge benefit from an operational point of view.
Full disclosure: I'm an engineer at FoundationDB, a NoSQL database that automatically handles sharding, replication, and failover with very little configuration. It also has a SQL layer so you don't have to give up structured data.

Representing a DAG (directed acyclic graph)

I need to store dependencies in a DAG. (We're mapping a new school curriculum at a very fine-grained level.)
We're using Rails 3.
Considerations
Wider than it is deep
Very large
I estimate 5-10 links per node. As the system grows this will increase.
Many reads, few writes
most common are lookups:
dependencies of first and second degree
searching/verifying dependencies
I know SQL, I'll consider NoSQL.
Looking for pointers to good comparisons of implementation options.
Also interested in what we can start with fast, but will be less painful to transition to something more robust/scalable later.
I found this example of modeling a directed acyclic graph in SQL:
http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183
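For illustration only, here is a minimal in-memory sketch (with made-up node ids) of the read patterns described above, i.e. first- and second-degree dependency lookups; any of the stores discussed in the answers below (an SQL edge table, a graph database) would persist the same edge relation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CurriculumDag {
    // node -> set of nodes it directly depends on
    private final Map<String, Set<String>> dependsOn = new HashMap<>();

    void addDependency(String node, String dependency) {
        dependsOn.computeIfAbsent(node, k -> new HashSet<>()).add(dependency);
    }

    Set<String> firstDegree(String node) {
        return dependsOn.getOrDefault(node, Set.of());
    }

    Set<String> secondDegree(String node) {
        Set<String> result = new HashSet<>();
        for (String dep : firstDegree(node)) {
            result.addAll(firstDegree(dep));
        }
        result.removeAll(firstDegree(node));   // keep strictly second-degree nodes
        result.remove(node);
        return result;
    }

    public static void main(String[] args) {
        CurriculumDag dag = new CurriculumDag();
        dag.addDependency("algebra-2", "algebra-1");
        dag.addDependency("algebra-1", "arithmetic");
        System.out.println(dag.secondDegree("algebra-2"));   // [arithmetic]
    }
}
```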
I think the upcoming version (beta at the moment) of the Ruby bindings for the graph database Neo4j should be a good fit. It's for use with Rails 3. The underlying data model uses nodes and directed relationships/edges with key/value style attributes on both. To scale read-mostly architectures Neo4j uses a master/slave replication setup.
You could use OrientDB as a graph database. It's highly optimized for relationships, since they are stored as links and not JOINs. Loading a bidirectional graph with 1,000 vertices takes just a few milliseconds.
The language binding for Rails is not yet available, but you can use it with HTTP RESTful calls.
You might want to take a look at the acts-as-dag gem.
https://github.com/resgraph/acts-as-dag
Also, some good writing on DAGs with SQL for people who might need some background on this.
http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o

Distributed Database Computing - Is it really possible within the RDBMS paradigm?

I am asking this in the context of NoSQL - which achieves scalability and performance without being expensive.
So, if I needed to achieve massively parallel distributed computing across databases ...
What are the various methodologies available today (within the RDBMS paradigm) to achieve distributed computing with high-scalability?
Does database clustering & mirroring contribute in any way towards distributed computing?
I guess you are asking about the scalability of RDBMS databases. NoSQL databases based on Amazon Dynamo or BigTable (I am talking about HBase, Cassandra, etc.) are a whole other topic. There are also commercial products like Oracle Coherence that are more like a distributed cache and key-value store, to put it crudely.
Going back to RDBMS:
Sharding
To scale an RDBMS one can do custom sharding. Sharding is a technique where you have multiple tables on possibly multiple hosts, and then you decide in a certain fashion to assign certain rows to certain tables. For example, you can say that rows 1-1M go to table1, rows 1M-2M go to table2, etc. But this is a difficult process from an administration point of view. A lot of large-scale websites scale by relying on sharding. Other techniques worth mentioning are partitioning, MySQL federation, and MySQL Cluster.
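A toy sketch of that range-based assignment (shard URLs and range size are made up; real sharding also has to handle rebalancing, cross-shard queries, and so on):

```java
import java.util.List;

class RangeShardRouter {
    private final List<String> shardJdbcUrls;
    private final long rowsPerShard;

    RangeShardRouter(List<String> shardJdbcUrls, long rowsPerShard) {
        this.shardJdbcUrls = shardJdbcUrls;
        this.rowsPerShard = rowsPerShard;
    }

    // e.g. ids 1..1M -> shard 0, 1M+1..2M -> shard 1, ...
    String shardFor(long rowId) {
        int index = (int) ((rowId - 1) / rowsPerShard);
        return shardJdbcUrls.get(index % shardJdbcUrls.size());
    }

    public static void main(String[] args) {
        RangeShardRouter router = new RangeShardRouter(
                List.of("jdbc:mysql://db1/app", "jdbc:mysql://db2/app"), 1_000_000L);
        System.out.println(router.shardFor(1_500_000L));   // jdbc:mysql://db2/app
    }
}
```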
MPP databases
Then there are databases that are themselves RDBMSs and do the distribution and scaling for you. Teradata is the most successful of these companies. I believe they used Postgres core code at some point. A significant number of Fortune 500 companies and a lot of the airlines use Teradata. But it's ridiculously expensive. There are newer companies like Greenplum, Vertica, and Netezza.
Unless you're a very big company with extreme scalability requirements, you can scale your DB horizontally with ACID guarantees by building a cluster of identical RDBMS instances and synchronizing them with JTA transactions.
Take a look at this Java/JDBC-based article; the JEPLayer framework is used, but you can use straight JDBC and JTA code.
Within the RDBMS paradigm: Sharding.
Outside the RDBMS paradigm: Key-value stores.
My pick (I come from an RDBMS background): key-value stores of the tabular type - HBase.
Within the RDBMS paradigm, sharding will not get you far.
Use the RDBMS paradigm to design your model, to get your project up and running.
Use tabular key-value stores to SCALE OUT.
Sharding:
A good way to think about sharding is to see it as user-account-oriented DB design. All the schema entities touched by a user account are kept on one host. The assignment of user to host happens when the user creates an account: the least loaded host gets that user. When that user signs on after account creation, he gets connected to the host that has his data. Each host has a set of user accounts.
The problem with this approach is that if the host gets hosed, a fraction of users will be blacked out. The solution to this is to have a replicated standby host that becomes the primary when the primary host encounters problems.
Also, it's a fairly rigid setup for processes where the design does not change dramatically. From the user standpoint, I've noticed that web sites with a sharded DB backend are not as quick to "turn on a dime" to create different business models on their platform.
Contrast this with web sites that have truly distributed key-value stores. These businesses can host any range of services. Their platform is just that - a platform. It's not relational and it does have an API interface, but it just seems to work.

Any good literature on join performance vs systematic denormalization?

As a corollary to this question, I was wondering if there are good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization versus systematically denormalizing in order to always access a single table at a time.
Specifically I want information about :
Performance of normalisation versus denormalisation.
Scalability of normalized vs denormalized system.
Maintainability issues of denormalization.
Model consistency issues with denormalisation.
A bit of history to see where I am going here: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table. As such, all complex objects have to be instantiated using multiple queries on each of the related tables. To make sure the system always uses a single table, heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have worked around them by carefully crafting the data model to avoid such relations and always fall back on 1-n or n-1.
The end result is a convoluted, overly complex system where customers often complain about performance. When analyzing such bottlenecks, they never question the basic premises on which the system is based and always look for other solutions.
Did I miss something? I think the whole idea is wrong, but I somehow lack the irrefutable evidence to prove (or disprove) it. This is where I am turning to your collective wisdom to point me towards good, well-accepted literature that can convince the other fellows on my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).
My next step is building my own test bench and gather results, since I hate reinventing the wheel I want to know what there is on the subject already.
---- EDIT
Notes: the system was first built with flat files without a database system... only later was it ported to a database because a client insisted on the system using Oracle. They did not refactor but simply added support for relational databases to the existing system. Flat-file support was later dropped, but we are still awaiting refactoring to take advantage of the database.
A thought: you have a clear impedance mismatch, a data access layer that allows access to only one table? Stop right there, this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. To have no option other than returning a single table, and presumably doing any joining in the business layer, just doesn't make sense.
For justification of normalisation, and the potential consistency costs you can refer to all the material from Codd onwards, see the Wikipedia article.
I predict that benchmarking this kind of stuff will be a never-ending activity; special cases will abound. I claim that normalisation is "normal": people get good enough performance from a clean database design. Perhaps an approach might be a survey: "How normalised is your data? Scale 0 to 4."
As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.
DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.
But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.
Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency. I.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.
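To illustrate that consistency cost, here is a hedged JDBC sketch (made-up tables and connection details) of what a simple product rename turns into once product_name has been copied into order_items for faster reads:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RenameProduct {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
            conn.setAutoCommit(false);
            try (PreparedStatement updateProduct = conn.prepareStatement(
                         "UPDATE products SET name = ? WHERE id = ?");
                 PreparedStatement updateCopies = conn.prepareStatement(
                         "UPDATE order_items SET product_name = ? WHERE product_id = ?")) {

                updateProduct.setString(1, "New name");
                updateProduct.setLong(2, 42L);
                updateProduct.executeUpdate();

                // Every denormalized copy must be kept in sync by hand,
                // inside the same transaction, or readers see stale names.
                updateCopies.setString(1, "New name");
                updateCopies.setLong(2, 42L);
                updateCopies.executeUpdate();

                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```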