Any good literature on join performance vs systematic denormalization?

As a corollary to this question, I was wondering whether there are good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization versus systematically denormalizing in order to always access a single table at a time.
Specifically I want information about :
Performance of normalisation versus denormalisation.
Scalability of normalized vs denormalized systems.
Maintainability issues of denormalization.
Model consistency issues with denormalization.
A bit of history to show where I am going with this: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table at a time. As such, all complex objects have to be instantiated using multiple queries, one on each of the related tables. To make sure the system can always read from a single table, heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have worked around them by carefully crafting the data model to avoid such relations and always fall back on 1-n or n-1.
The end result is a convoluted, overly complex system whose customers often complain about performance. When analyzing such bottlenecks, the team never questions the basic premises on which the system is based and always looks for other solutions.
Did I miss something? I think the whole idea is wrong, but I lack the irrefutable evidence to prove (or disprove) it. This is where I turn to your collective wisdom to point me towards good, well-accepted literature that can convince my teammates that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).
My next step is building my own test bench and gathering results. Since I hate reinventing the wheel, I want to know what already exists on the subject.
---- EDIT
Note: the system was first built on flat files, without a database system. Only later was it ported to a database, because a client insisted on the system using Oracle. They did not refactor but simply added support for relational databases to the existing system. Flat-file support was later dropped, but we are still awaiting the refactoring needed to take advantage of the database.

A thought: you have a clear impedance mismatch. A data access layer that allows access to only one table? Stop right there, this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. To have no option other than to return a single table, and presumably do any joining in the business layer, just doesn't make sense.
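To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer/invoice schema is invented for illustration and is not the asker's system; it only shows that the per-table loop and the single join return the same rows, while the join leaves the planning to the database.

```python
import sqlite3

# Hypothetical two-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoice  (id INTEGER PRIMARY KEY,
                           customer_id INTEGER NOT NULL REFERENCES customer(id),
                           total REAL NOT NULL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice  VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0);
""")

# Single-table access layer: one query per table, stitched together in the application.
stitched = []
for cust_id, name in conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall():
    for (total,) in conn.execute(
        "SELECT total FROM invoice WHERE customer_id = ? ORDER BY id", (cust_id,)
    ):
        stitched.append((name, total))

# Relational approach: one statement, and the database plans and performs the join.
joined = conn.execute("""
    SELECT c.name, i.total
    FROM customer AS c
    JOIN invoice  AS i ON i.customer_id = c.id
    ORDER BY c.id, i.id
""").fetchall()
```

The first variant issues one query per customer on top of the initial one, which is the pattern that tends to hurt once the tables grow.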
For justification of normalisation, and the potential consistency costs you can refer to all the material from Codd onwards, see the Wikipedia article.
I predict that benchmarking this kind of stuff will be a never-ending activity; special cases will abound. I claim that normalisation is "normal": people get good enough performance from a clean database design. Perhaps an approach might be a survey: "How normalised is your data? Scale 0 to 4."

As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.
DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.
But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.
Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency, i.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.
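As a rough sketch of the normalized side of that trade-off (table and index names are made up for the example), a single junction table can serve both query directions, and the composite primary key rules out accidental duplicates without any manual checks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders  (id INTEGER PRIMARY KEY);
    -- Junction table: one row per (order, product) pair.
    CREATE TABLE order_item (
        order_id   INTEGER NOT NULL REFERENCES orders(id),
        product_id INTEGER NOT NULL REFERENCES product(id),
        PRIMARY KEY (order_id, product_id)
    );
    -- Second index so the product -> orders direction is also an index lookup.
    CREATE INDEX order_item_by_product ON order_item (product_id, order_id);
    INSERT INTO product VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO orders  VALUES (100), (101);
    INSERT INTO order_item VALUES (100, 1), (100, 2), (101, 1);
""")

# Products in a given order.
products_in_order = [r[0] for r in conn.execute(
    "SELECT p.name FROM order_item oi JOIN product p ON p.id = oi.product_id "
    "WHERE oi.order_id = ? ORDER BY p.name", (100,))]

# Orders for a given product.
orders_for_product = [r[0] for r in conn.execute(
    "SELECT order_id FROM order_item WHERE product_id = ? ORDER BY order_id", (1,))]
```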

Related

How is a graph database different to a graph represented in a relational database?

I can represent a graph trivially in a relational database with two tables: vertex and edge. Richer structure like "properties" and "labels" (in Neo4j terminology) can be represented as more tables. Have I misunderstood, or does a graph database like Neo4j allow me to represent anything that is not easily representable relationally?
I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. Have I misunderstood, or does a graph query language like Cypher provide greater expressivity than SQL?
The relational model of a graph is stored and queried efficiently, AFAIK. Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?
My relational database provides ACID guarantees, and allows me to write fairly expressive constraints on my graph data (and even more constraints if I break out the single vertex table into a properly normalized schema). Have I misunderstood, or does a graph database provide some guarantees or verify some kind of correctness properties that are not available in my relational database?
I am struggling to see how a graph database such as Neo4j is anything but a subset of the relational model. (Apologies for using Neo4j as representative of all graph databases here; it's the only one I've looked at.)
In short: Is graph database ⊆ relational database?
Is One a Subset of the Other?
Definitely no; both are eventually modeled on the mathematical concepts of relations or graphs. Both models being super-general, there is basically no information content that you can't represent using either one. This means that while they might differ in many syntactic sugar ways, and in the way they encourage you to model/think of data (just like programming languages differ) they both have the same "expressive power".
What you describe in your question is one way of modeling a graph (vertex and edge tables). That implementation of a graph is a subset of what relational can express. Similarly, I could mock up tables and rows using a graph database, but I would have chosen a particular implementation - this wouldn't demonstrate that relational data is a subset of graph data.
So the first insight is that they have roughly equal expressive power. You can model anything in either. So the real question you should be asking is why would you choose one over the other?
Why Would you Choose One Over The Other?
All databases exist to facilitate data access. Simply put, you store it so that you can get at the data. But exactly how do you need to get at the data? There are many different access patterns. The design space for databases in general is enormous. Any time a database makes a certain decision, that tends to automatically make it better at some things, worse at others. For example, when you create an index in a relational database, you've just sped up reads -- but you've degraded the performance of writes, because the index has to be maintained.
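The index trade-off can be observed directly. The following is a small sketch with SQLite (the person table and index name are assumptions for the example) that compares the query plan before and after creating an index; the exact wording of the plan text varies between SQLite versions, so it is only inspected loosely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO person (name, city) VALUES (?, ?)",
                 [(f"user{i}", "Oslo") for i in range(1000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable detail string.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT city FROM person WHERE name = 'user42'"
plan_before = plan(query)  # describes a full scan of person
conn.execute("CREATE INDEX person_by_name ON person (name)")
plan_after = plan(query)   # now mentions person_by_name
```

From this point on, every insert, update, or delete on person also has to maintain person_by_name, which is the write-side cost mentioned above.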
So, when approaching the question, "Graph or Relational?" - you should first figure out what does your data look like, and what do your data access patterns look like. If you knew what those things were, then you could evaluate a bunch of databases, see the choices they've made, and pick the one that's a good fit for what you need. And then if a DBMS made a choice that would make certain access patterns difficult, buggy, or slow -- you could avoid that DBMS for that data set.
It's (Partly) About Data Access Patterns
Graph databases tend to be better than relational when the data being stored is a graph, when the data access pattern involves a lot of graph traversal, or both. (See this other answer I wrote for a more in-depth discussion of why this is). That link there also provides the answer to your specific question: "Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?"
You say: I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. -- So technically this is true, but let's take an example to see why relational might not be good enough. Say I have a graph (in RDBMS, a table of nodes, a table of edges, with a join key between them). Let's say I pick out one node, and I want to identify everything that is between 6 and 8 hops away from that node. Here's the cypher to do that:
match (myChosenNode {id: 'foo'})-[r:relationshipType*6..8]->(y) return y;
I really want to see you write that up as SQL. It's possible, but it's hard and complicated. And it will also perform like a dog, because of the sheer quantity of joining you'll be doing on non-trivial quantities of data.
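For reference, one possible SQL formulation uses a recursive common table expression. This is only a sketch on a toy chain graph in SQLite, with a single hypothetical edge table; it is not an exact equivalent of the Cypher above, since it does not enforce Cypher's rule that a relationship is used at most once per path, and it says nothing about how either query performs on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
conn.execute("CREATE INDEX edge_by_src ON edge (src)")
# Toy data: a simple chain 0 -> 1 -> ... -> 10.
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(i, i + 1) for i in range(10)])

reachable = [row[0] for row in conn.execute("""
    WITH RECURSIVE walk(node, depth) AS (
        SELECT ?, 0                          -- start node
        UNION ALL
        SELECT e.dst, w.depth + 1
        FROM walk AS w
        JOIN edge AS e ON e.src = w.node
        WHERE w.depth < 8                    -- upper hop bound stops the recursion
    )
    SELECT DISTINCT node FROM walk
    WHERE depth BETWEEN 6 AND 8              -- keep only nodes 6 to 8 hops away
    ORDER BY node
""", (0,))]
# reachable == [6, 7, 8]
```

On a densely connected graph the intermediate walk table can grow very quickly, which is where the join cost described above comes from.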
ACID
OK now on the ACID guarantees, Neo4J provides transactions with ACID guarantees. The answer will be different for different graph databases though, particularly the ones implemented on top of Hadoop/HBase. YMMV there, so check the fine print with each database.
It is true that there are a number of RDBMS features that you typically won't find in graph databases, examples being triggers and certain kinds of constraints. As a long-time RDBMS nerd myself, I'm not so happy about those things being missing; I think they are valuable.
Summary
What this mostly boils down to for me, and many other engineers I work with is:
What is your data?
What are your access patterns?
If your data is a graph, or your access patterns involve a lot of graph traversal, you should probably use a graph DB. If your data is more tabular, or your access patterns are more oriented around bulk scans, then you should use an RDBMS. At the end of the day, they're two different tools with different niches. If you use them in their area of strength, you'll be happy. If you use an RDBMS to model a graph just "because you can", you'll suffer. If you use a graph database to do a lot of bulk scans of every node in every graph, you'll suffer. Like most of tech, it's just about using the right tool for the job.

access sql database as nosql (couchbase)

I would like to access a SQL database the way one accesses a NoSQL store, as key-value pairs or documents.
This is to prepare for a future upgrade: if the number of users grows a lot,
I could migrate from SQL to NoSQL immediately without changing any application code.
Of course I could write the API myself, but I wonder whether anyone has already done this and published a solution.
Comments are welcome.
While I agree with everything scalabilitysolved has said, there is an interesting feature in the offing for Postgres, scheduled for the 9.4 Postgres release, namely, jsonb: http://www.postgresql.org/docs/devel/static/datatype-json.html with some interesting indexing and query possibilities. I mention this as you tagged Mongodb and Couchbase, both of which use JSON (well, technically BSON in Mongodb's case).
Of course, querying, sharding, replication, ACID guarantees, etc will still be totally different between Postgres (or any other traditional RDBMS) and any document-based NoSQL solution and migrations between any two RDBMS tends to be quite painful, let alone between an RDBMS and a NoSQL data store. However, jsonb looks quite promising as a potential half-way house between two of the major paradigms of data storage.
On the other hand, every release of MongoDB brings enhancements to the aggregation pipeline, which is certainly something that seems to appeal to people used to the flexibility that SQL offers and less "alien" than distributed map/reduce jobs. So, it seems reasonable to conclude that there will continue to be cross pollination.
See Explanation of JSONB introduced by PostgreSQL for some further insights into jsonb.
No no no, do not consider this, it's a really bad idea. Pick either an RDBMS or a NoSQL solution based upon how your data is modelled and your usage patterns. Converting from one to the other is going to be painful, especially if your 'user amount increases a lot'.
Let's face it, either approach would deal with a large increase in usage, and both would benefit more from specific optimizations to their database than from simply swapping because one 'scales more'.
If your data model fits an RDBMS and it needs to perform better, then analyze your queries, check that your indexes are optimized, and look into caching and better data access patterns.
If your data model fits a NoSQL database, then as your dataset grows you can add additional nodes (Couchbase), cache expensive map/reduce jobs, and again optimize your data access patterns.
In summary, pick either SQL or NoSQL dependent on your data needs, don't just assume that NoSQL is a magic bullet as with easier scaling comes a much less flexible querying model.

Comparison of Relational Databases and Graph Databases

Can someone explain to me the advantages and disadvantages of a relational database such as MySQL compared to a graph database such as Neo4j?
In SQL you have multiple tables with various ids linking them. Then you have to join to connect the tables. From the perspective of a newbie why would you design the database to require a join rather than having the connections explicit as edges from the start as with a graph database. Conceptually it would make no sense to a newbie. Presumably there is a very technical but non-conceptual reason for this?
There actually is conceptual reasoning behind both styles. Wikipedia on the relational model and graph databases gives good overviews of this.
The primary difference is that in a graph database, the relationships are stored at the individual record level, while in a relational database, the structure is defined at a higher level (the table definitions).
This has important ramifications:
A relational database is much faster when operating on huge numbers of records. In a graph database, each record has to be examined individually during a query in order to determine the structure of the data, while this is known ahead of time in a relational database.
Relational databases use less storage space, because they don't have to store all of those relationships.
Storing all of the relationships at the individual-record level only makes sense if there is going to be a lot of variation in the relationships; otherwise you are just duplicating the same things over and over. This means that graph databases are well-suited to irregular, complex structures. But in the real world, most databases require regular, relatively simple structures. This is why relational databases predominate.
The key difference between a graph and relational database is that relational databases work with sets while graph databases work with paths.
This manifests itself in unexpected and unhelpful ways for a RDBMS user. For example when trying to emulate path operations (e.g. friends of friends) by recursively joining in a relational database, query latency grows unpredictably and massively as does memory usage, not to mention that it tortures SQL to express those kinds of operations. More data means slower in a set-based database, even if you can delay the pain through judicious indexing.
As Dan1111 hinted at, most graph databases don't suffer this kind of join pain because they express relationships at a fundamental level. That is, relationships physically exist on disk and they are named, directed, and can be themselves decorated with properties (this is called the property graph model, see: https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model). This means if you chose to, you could look at the relationships on disk and see how they "join" entities. Relationships are therefore first-class entities in a graph database and are semantically far stronger than those implied relationships reified at runtime in a relational store.
So why should you care? For two reasons:
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb.
Graph databases make modelling and querying much more pleasant meaning faster development and fewer WTF moments. For example expressing friend-of-friend for a typical social network in Neo4j's Cypher query language is just MATCH (me)-[:FRIEND]->()-[:FRIEND]->(foaf) RETURN foaf.
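For comparison, here is a rough sketch of the same friend-of-friend lookup as a relational self-join, run through Python's sqlite3. The person and friend tables are assumptions made for the example; at a fixed depth of two the SQL is still manageable, and it is the deeper or variable-depth traversals where the difference in effort shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friend (src INTEGER NOT NULL REFERENCES person(id),
                         dst INTEGER NOT NULL REFERENCES person(id),
                         PRIMARY KEY (src, dst));
    INSERT INTO person VALUES (1, 'me'), (2, 'ann'), (3, 'bob'), (4, 'cy');
    INSERT INTO friend VALUES (1, 2), (2, 3), (2, 4);
""")

# Two hops along the directed friend edges, starting from person 1.
foaf = [r[0] for r in conn.execute("""
    SELECT DISTINCT p.name
    FROM friend AS f1
    JOIN friend AS f2 ON f2.src = f1.dst
    JOIN person AS p  ON p.id  = f2.dst
    WHERE f1.src = ?
    ORDER BY p.name
""", (1,))]
# foaf == ['bob', 'cy']
```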
Dan1111 has already given an answer flagged as correct. A couple of additional points are worth noting in passing.
First, in almost every implementation of graph databases, the records are "pinned" because there are an unknown number of pointers pointing at the record in its current location. This means that a record cannot be shuffled to a new location without either leaving a forwarding address at the old location or breaking an unknown number of pointers.
Theoretically, one could shuffle all the records at once and figure out a way to locate and repair all the pointers. In practice this is an operation that could take weeks on a large graph database, during which time the database would have to be off the air. It's just not feasible.
By contrast, in a relational database, records can be reshuffled on a fairly large scale, and the only thing that has to be done is to rebuild any indexes that have been affected. This is a fairly large operation, but nowhere near as large as the equivalent for a graph database.
The second point worth noting in passing is that the world wide web can be seen as a gigantic graph database. Web pages contain hyperlinks, and hyperlinks reference, among other things, other web pages. The reference is via URLs, which function like pointers.
When a web page is moved to a different URL without leaving a forwarding address at the old URL, an unknown number of hyperlinks will become broken. These broken links then give rise to the dreaded, "Error 404: page not found" message that interrupts the pleasure of so many surfers.
With a relational database we can model and query a graph by using foreign keys and self-joins. Just because RDBMSs contain the word relational does not mean that they are good at handling relationships. The word relational in RDBMS stems from relational algebra and not from relationship. In an RDBMS, the relationship itself does not exist as an object in its own right. It either needs to be represented explicitly as a foreign key or implicitly as a value in a link table (when using a generic/universal modelling approach). Links between data sets are stored in the data itself.
The more we increase the search depth in a relational database the more self-joins we need to perform and the more our query performance suffers. The deeper we go in our hierarchy the more tables we need to join and the slower our query gets. Mathematically the cost grows exponentially in a relational database. In other words the more complex our queries and relationships get the more we benefit from a graph versus a relational database. We don’t have performance problems in a graph database when navigating the graph. This is because a graph database stores the relationships as separate objects. However, the superior read performance comes at the cost of slower writes.
In certain situations it is easier to change the data model in a graph database than it is in an RDBMS, e.g. in an RDBMS if I change a table relationship from 1:n to m:n I need to apply DDL with potential downtime.
RDBMS has on the other hand advantages in other areas, e.g. aggregating data or doing timestamped version control on data.
I discuss some of the other pros and cons in my blog post on graph databases for data warehousing
While the relational model can easily represent the data that is contained in a graph model, we face two significant problems in practice:
SQL lacks the syntax to easily perform graph traversal, especially traversals where the depth is unknown or unbounded. For instance, using SQL to determine friends of your friends is easy enough, but it is hard to solve the “degrees of separation” problem.
Performance degrades quickly as we traverse the graph. Each level of traversal adds significantly to query response time.
Reference: Next Generation Databases
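To show what the "degrees of separation" problem looks like in SQL, here is a small sketch with a recursive CTE in SQLite. The knows table, the MAX_DEPTH cap, and the degrees helper are all assumptions for the example; the cap is needed because the toy graph contains a cycle and the recursion would otherwise never terminate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# Toy directed graph with a cycle (1 -> 2 -> 3 -> 1) plus a tail 3 -> 4 -> 5.
conn.executemany("INSERT INTO knows VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])

MAX_DEPTH = 6  # arbitrary cap; without it the cycle would recurse forever

def degrees(start, target):
    row = conn.execute("""
        WITH RECURSIVE walk(node, depth) AS (
            SELECT ?, 0
            UNION
            SELECT k.b, w.depth + 1
            FROM walk AS w JOIN knows AS k ON k.a = w.node
            WHERE w.depth < ?
        )
        SELECT MIN(depth) FROM walk WHERE node = ?
    """, (start, MAX_DEPTH, target)).fetchone()
    return row[0]  # None when the target is not reached within MAX_DEPTH

separation = degrees(1, 5)   # 1 -> 2 -> 3 -> 4 -> 5, i.e. 4
unreachable = degrees(5, 1)  # node 5 has no outgoing edges, so None
```

Having to pick a depth cap by hand is a fair illustration of the syntax problem described in the first point.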
Graph databases are worth investigating for the use cases that they excel in, but I have had some reason to question some assertions in the responses above. In particular:
A relational database is much faster when operating on huge numbers of records (dan1111's first bullet point)
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb. (Jim Webber's first bullet point)
In other words the more complex our queries and relationships get the more we benefit from a graph versus a relational database. (Uli Bethke's 2nd paragraph)
While these assertions may well have merit, I have yet to find a way to get my specific use case to align with them.
Reference: Graph Database or Relational Database Common Table Extensions: Comparing acyclic graph query performance
Relational Databases are much more efficient in storing tabular data. Despite the word “relational” in their name, relational databases are much less effective at storing or expressing relationships between stored data elements.
The term 'relational' in relational databases relates more to relating columns within a table than to relating information in different tables. Relationships between columns exist to support set operations. So as a database grows to millions or billions of records, it becomes extremely slow to retrieve data from a relational database.
Unlike a relational database, a graph database is structured entirely around data relationships. Graph databases treat relationships not as a schema structure but as data, like other values.
It is very fast to retrieve data from graph databases.
From a relational database standpoint, you could think of this as pre-materializing JOINs once at insertion time instead of computing them for every query. Because the data is structured entirely around data relationships, real-time query performance can be achieved no matter how large or connected the dataset gets.
Graph databases take more storage space than relational databases.

What is the resource impact from normalizing a database?

When taking a database from a relatively un-normalized form and normalizing it, what, if any, changes in resource utilization might one expect?
For example, normalization often means more tables get created from fewer which means the database now has a higher number of tables, but many of them are quite small, allowing the often used ones to fit into memory better.
The higher number of tables also means that more joins are needed (potentially) to get at the data that was abstracted out, so one would expect some sort of impact from the higher number of joins the system needs to do.
So, what impact on resource usage (ie. what will change) does normalizing an un-normalized database have?
Edit:
To add a bit of context, I have an existing (ie. legacy) database with over 300 horrible tables. About 1/2 of the data is TEXT and the other half is either char fields or integers. There are no constraints of any kind. The reason I ask is primarily to get more information for convincing others that things need to change and that there won't be a decrease in performance or maintainability. Unfortunately, those I have to convince know just enough about the performance benefits of a de-normalized database to want to avoid normalization as much as possible.
This can not really be answered in a general manner, as the impact will vary heavily depending on the specifics of the database in question and the apps using it.
So you basically stated the general expectations concerning the impact:
Overall memory demands for storage should go down, as redundant data gets removed
CPU needs might go up, as queries might get more expensive (Note that in many cases queries on a normalized database will actually be faster, even if they are more complex, as there are more optimization options for the query engine)
Development resource needs might go up, as developers might need to construct more elaborate queries (But on the other hand, you need less development effort to maintain data integrity)
So the only real answer is the usual: it depends ;)
Note: This assumes that we are talking about cautious and intentional denormalization. If you are referring to the 'just throw some tables together as data comes along' approach that is all too common with inexperienced developers, I'd risk the statement that normalization will reduce resource needs on all levels ;)
Edit: Concerning the specific context added by cdeszaq, I'd say 'Good luck getting your point through' ;)
Obviously, with over 300 tables and no constraints (!), the answer to your question is definitely 'normalizing will reduce resource needs on all levels' (and probably very substantially), but:
Refactoring such a mess will be a major undertaking. If there is only one app using this database, it is already dreadful - if there are many, it might become a nightmare!
So even if normalizing would substantially reduce resource needs in the long run, it might not be worth the trouble, depending on circumstances. The main questions here are about long term scope - how important is this database, how long will it be used, will there be more apps using it in the future, is the current maintenance effort constant or increasing, etc. ...
Don't ignore that it is a running system - even if it's ugly and horrible, according to your description it is not (yet) broken ;-)
"Normalization" applies only and exclusively to the logical design of a database.
The logical design of a database and the physical design of a database are two completely distinct things. Database theory has always intended for things to be this way. The fact that the developers who overlook/disregard this distinction (out of ignorance or out of carelessness or out of laziness or out of whatever other so-called-but-invalid "reason") are the vast majority, does not make them right.
A logical design can be said to be normalized or not, but a logical design does not inherently carry any "performance characteristic" whatsoever. Just like 'c:=c+1;' does not inherently carry any performance characteristic.
A physical design does determine "performance characteristics", but then again a physical design simply does not have the quality of being "normalized or not".
This flawed perception of "normalization hurting performance" is really nothing else than concrete proof that all the DBMS engines that exist today are just seriously lacking in physical design options.
There's a very simple answer to your question: it depends.
Firstly, I'd re-phrase your question as 'what is the benefit of denormalization', because normalization is something that should be done by default (as the result of a pure logical model), and then denormalization can be applied to very specific tables where performance is critical. The main problem of denormalization is that it can complicate data integrity management, but the benefits in some cases outweigh the risks.
My advice for denormalization: do it only when it really hurts, and make sure you have all scenarios covered when it comes to maintaining data integrity after any inserts, updates or deletes.
To underscore some points made by prior posters: Is your current schema really denormalized? The proper way (imho) to design a database is to:
Understand as best you can the system/information to be modeled
Build a fully normalized model
Then, if and as you find it necessary, denormalize in a controlled fashion to enhance performance
(There may be other reasons to denormalize, but the only ones I can think of off-hand are political ones--have to match the existing code, the developers/managers don't like it, etc.)
My point is, if you never fully normalized, you don't have a denormalized database, you've got an unnormalized one. And I think you can think of more descriptive if less polite terms for those databases.
I've found that normalization, in some cases, will improve performance.
Small tables read more quickly. A badly denormalized database will often have (a) longer rows and (b) more rows than a normalized design.
Reading fewer shorter rows means less physical I/O.
For one thing, you'll end up having to do resultset calculations. For example, if you have a Blog, with a number of Posts, you could either do:
select count(*) from Post where BlogID = #BlogID
which is more expensive than
select PostCount from Blog where ID = #BlogID
and can lead to the SELECT N+1 problem, if you're not careful.
Of course with the second option you have to deal with keeping the data integrity, but if the first option is painful enough, then you make it work.
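One way to keep such a stored counter honest is to let the database maintain it. Below is a minimal sketch with SQLite triggers, reusing the Blog/Post names from the queries above; the trigger names are made up, and the sketch does not cover every case (for example a Post being moved to another Blog by an UPDATE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Blog (ID INTEGER PRIMARY KEY, PostCount INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE Post (ID INTEGER PRIMARY KEY,
                       BlogID INTEGER NOT NULL REFERENCES Blog(ID));

    -- Keep the redundant counter in step with the Post table.
    CREATE TRIGGER post_added AFTER INSERT ON Post BEGIN
        UPDATE Blog SET PostCount = PostCount + 1 WHERE ID = NEW.BlogID;
    END;
    CREATE TRIGGER post_removed AFTER DELETE ON Post BEGIN
        UPDATE Blog SET PostCount = PostCount - 1 WHERE ID = OLD.BlogID;
    END;

    INSERT INTO Blog (ID) VALUES (1);
    INSERT INTO Post (BlogID) VALUES (1), (1), (1);
    DELETE FROM Post WHERE ID = 2;
""")

# The computed count and the stored count should agree (both 2 here).
counted = conn.execute("SELECT count(*) FROM Post WHERE BlogID = ?", (1,)).fetchone()[0]
stored = conn.execute("SELECT PostCount FROM Blog WHERE ID = ?", (1,)).fetchone()[0]
```

The triggers add work to every insert and delete on Post, which is the integrity cost being traded for the cheaper read.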
Be careful you don't fall foul of premature optimisation. Do it in the normalised fashion, then measure performance against requirements, and only if it falls short should you look to denormalise.
Normalized schemas tend to perform better for INSERT/UPDATE/DELETE because there are no "update anomalies" and the actual changes that need to be made are more localized.
SELECTs are mixed. Denormalization is essentially materializing a join. There's no doubt that materializing a join sometimes helps; however, materialization is often very pessimistic (probably more often than not), so don't assume that denormalization will help you. Also, normalized schemas are generally smaller and therefore might require less I/O. A join is not necessarily expensive, so don't automatically assume that it will be.
I wanted to elaborate on Henrik Opel's #3 bullet point. Development costs might go up, but they don't have to. In fact, normalization of a database should simplify or enable the use of tools like ORMs, Code Generators, Report Writers, etc. These tools can significantly reduce the time spent on the data access layer of your applications and move development on through to adding business value.
You can find a good StackOverflow discussion here about the development aspect of normalized databases. There were many good answers, comments and things to think about.

How far to take normalization?

I have these tables:
Projects(projectID, CreatedByID)
Employees(empID,depID)
Departments(depID,OfficeID)
Offices(officeID)
CreatedByID is a foreign key for Employees. I have a query that runs for almost every page load.
Is it bad practice to just add a redundant OfficeID column to Projects to eliminate the three joins? Or should I do the following:
SELECT *
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
JOIN Departments D ON E.DepID = D.DepID
JOIN Offices O ON D.officeID = O.officeID
WHERE O.officeID = @SomeOfficeID
In application programming I "Write with best practices first and optimize afterwards", but database administrators are always warning about the cost of joins.
Normalize till it hurts, then denormalize till it works
Denormalization has the advantage of fast SELECTs on large queries.
Disadvantages are:
It takes more coding and time to ensure integrity (which is most important in your case)
It's slower on DML (INSERT/UPDATE/DELETE)
It takes more space
As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).
Optimizing for faster querying often implies duplicating data, be it denormalization, indices, extra tables or whatever.
In case of indices, the RDBMS does it for you, but in case of denormalization, you'll need to code it yourself. What if Department moves to another Office? You'll need to fix it in three tables instead of one.
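The "Department moves to another Office" case can be sketched as follows, using Python's built-in sqlite3 module and the table names from the question. The redundant OfficeID column on Projects and all the sample IDs are assumptions made for the illustration. Only the Departments row is updated, which is exactly the mistake that the extra coding is supposed to prevent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Offices     (officeID INTEGER PRIMARY KEY);
    CREATE TABLE Departments (depID INTEGER PRIMARY KEY,
                              OfficeID INTEGER REFERENCES Offices(officeID));
    CREATE TABLE Employees   (empID INTEGER PRIMARY KEY,
                              depID INTEGER REFERENCES Departments(depID));
    -- OfficeID on Projects is the redundant (denormalized) column.
    CREATE TABLE Projects    (projectID INTEGER PRIMARY KEY,
                              CreatedByID INTEGER REFERENCES Employees(empID),
                              OfficeID INTEGER);

    -- Sample data (made up): one department in office 1, one project.
    INSERT INTO Offices VALUES (1), (2);
    INSERT INTO Departments VALUES (10, 1);
    INSERT INTO Employees VALUES (100, 10);
    INSERT INTO Projects VALUES (1000, 100, 1);
""")

# The department moves to office 2; only the normalized fact is updated.
conn.execute("UPDATE Departments SET OfficeID = 2 WHERE depID = 10")

# Office derived through the joins: follows the move automatically.
via_joins = conn.execute("""
    SELECT D.OfficeID
    FROM Projects P
    JOIN Employees E ON P.CreatedByID = E.empID
    JOIN Departments D ON E.depID = D.depID
    WHERE P.projectID = 1000
""").fetchone()[0]

# Office read from the redundant column: still the old value.
via_copy = conn.execute(
    "SELECT OfficeID FROM Projects WHERE projectID = 1000"
).fetchone()[0]

print(via_joins, via_copy)
```

Here the joins report office 2 while the redundant column still says office 1, so the database now contradicts itself until someone writes the code that keeps the copy in sync.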
Judging by the names of your tables, there won't be millions of records in them. So you'd better normalize your data; it will be simpler to manage.
Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).
Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.
Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.
Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.
The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).
I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.
That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.
It is better to keep that schema in Third Normal Form and let your DBA complain about the cost of joins.
DBAs should be concerned if your db is not properly normalized to begin with. After you have carefully measured performance and determined that you have bottlenecks, you may start denormalizing, but I would be extremely cautious.
I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.
You shouldn't look at denormalizing before you've tried everything else.
Is the performance of this really an issue?
Does your database have any features you can use to speed things up without compromising integrity?
Can you increase your performance by caching?
Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.
In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.
What if the department occupies two offices?
What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?
Don't denormalize.
Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.
One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possibility of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.
There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.
One set of design principles that leads to a radically different design from normalization is star schema design. Star schema is very fast for queries. Even large-scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers update anomalies. You have to program around these anomalies when you keep the database up to date. You will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.
Using data stored in a star schema is dramatically easy. It's so easy that using some kind of OLAP and reporting engine, you can get all the information needed without writing any code, and without sacrificing performance too much.
It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.
It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and omissions in data analysis may result in unfortunate choices in dimensions and granularity. This will make ETL almost impossible to build, and/or make the information carrying capacity of the star inadequate for the emerging needs.
Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.
This response expands on your original question, but I think it's relevant for the would be database designer.
Normalization is a quality decision.
Denormalization is a performance decision.
That's why -
Normalize till it hurts; De-normalize till it works.
Quality decisions tell which is the least Normal Form that you can live with:
How much non-redundancy is important for your tables?
How fast data management do you want?
How clear do you want the relation between your tables?
Performance decisions tell what is the highest Normal Form acceptable:
Is my database's response fast enough?
Are too many joins causing a slowdown?
When you have fixed the least and the highest Normal Form acceptable in your case, pick the Normal Form anywhere in between.
If you're using Integers (or BIGINT) as the IDs and they are the clustered primary key, you should be fine.
It may seem that finding the office directly from a project would always be faster, but every join here looks up a primary key, so indexes on the foreign keys make the difference minimal, as those indexes will cover the primary keys too.
If you ever find a need later on to denormalise the data, you can create a cache table on a schedule or trigger.
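A trigger-maintained cache table could look roughly like the sketch below, written with Python's built-in sqlite3 module. The ProjectOfficeCache table, the trigger names and the sample IDs are all hypothetical, and trigger syntax differs between database systems, so treat this as an outline and not something to paste into Oracle or SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (depID INTEGER PRIMARY KEY, OfficeID INTEGER);
    CREATE TABLE Employees   (empID INTEGER PRIMARY KEY, depID INTEGER);
    CREATE TABLE Projects    (projectID INTEGER PRIMARY KEY, CreatedByID INTEGER);

    -- Hypothetical cache table: one row per project with its resolved office.
    CREATE TABLE ProjectOfficeCache (projectID INTEGER PRIMARY KEY,
                                     OfficeID  INTEGER);

    -- Fill the cache whenever a project is created.
    CREATE TRIGGER cache_on_project_insert AFTER INSERT ON Projects
    BEGIN
        INSERT INTO ProjectOfficeCache (projectID, OfficeID)
        SELECT NEW.projectID, D.OfficeID
        FROM Employees E JOIN Departments D ON D.depID = E.depID
        WHERE E.empID = NEW.CreatedByID;
    END;

    -- Refresh the cache whenever a department changes office.
    CREATE TRIGGER cache_on_department_move
    AFTER UPDATE OF OfficeID ON Departments
    BEGIN
        UPDATE ProjectOfficeCache SET OfficeID = NEW.OfficeID
        WHERE projectID IN (
            SELECT P.projectID
            FROM Projects P JOIN Employees E ON E.empID = P.CreatedByID
            WHERE E.depID = NEW.depID);
    END;

    -- Sample data (made up).
    INSERT INTO Departments VALUES (10, 1);
    INSERT INTO Employees VALUES (100, 10);
    INSERT INTO Projects VALUES (1000, 100);
""")

lookup = "SELECT OfficeID FROM ProjectOfficeCache WHERE projectID = 1000"
before = conn.execute(lookup).fetchone()[0]

# Move the department; the trigger keeps the cache in step.
conn.execute("UPDATE Departments SET OfficeID = 2 WHERE depID = 10")
after = conn.execute(lookup).fetchone()[0]
```

The normalized tables stay the source of truth and the cache is derived from them. Note that this sketch only covers two events; an employee changing department or a project being reassigned would need further triggers, which is the maintenance cost to weigh before going down this road.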
In the example given, properly set up indexes on the tables should allow the joins to run extremely fast and scale well to hundreds of thousands of rows. This is usually the approach that I take to get around the issue.
There are times, though, when the data is written once and then only selected for the rest of its life, and it really doesn't make sense to do a dozen joins each time.
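One way to check that the joins really use the indexes is to ask the database for its query plan. Below is a small sketch with Python's built-in sqlite3 module; the index names are made up, and SQLite's EXPLAIN QUERY PLAN stands in for whatever plan tool your own RDBMS provides (the output format differs between systems and versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (depID INTEGER PRIMARY KEY, OfficeID INTEGER);
    CREATE TABLE Employees   (empID INTEGER PRIMARY KEY, depID INTEGER);
    CREATE TABLE Projects    (projectID INTEGER PRIMARY KEY, CreatedByID INTEGER);

    -- Indexes on the foreign-key columns used by the joins (names are made up).
    CREATE INDEX idx_departments_office ON Departments(OfficeID);
    CREATE INDEX idx_employees_dep      ON Employees(depID);
    CREATE INDEX idx_projects_creator   ON Projects(CreatedByID);
""")

# Ask SQLite how it would execute the query, without actually running it.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT P.projectID
    FROM Projects P
    JOIN Employees E ON P.CreatedByID = E.empID
    JOIN Departments D ON E.depID = D.depID
    WHERE D.OfficeID = ?
""", (1,)).fetchall()

# The last column of each row is a human-readable description of one plan step.
steps = [row[-1] for row in plan]
for step in steps:
    print(step)
```

Steps that read SEARCH ... USING an index or primary key are keyed lookups, while a plain SCAN of a large table is the thing to look out for. The exact wording depends on the SQLite version, so read the output and don't rely on parsing it.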