How far to take normalization? [closed]

I have these tables:
Projects(projectID, CreatedByID)
Employees(empID,depID)
Departments(depID,OfficeID)
Offices(officeID)
CreatedByID is a foreign key to Employees. I have a query that runs on almost every page load.
Is it bad practice to just add a redundant OfficeID column to Projects to eliminate the three joins? Or should I do the following:
SELECT *
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
JOIN Departments D ON E.DepID = D.DepID
JOIN Offices O ON D.officeID = O.officeID
WHERE O.officeID = #SomeOfficeID
In application programming I "Write with best practices first and optimize afterwards", but database administrators are always warning about the cost of joins.
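For concreteness, the denormalized alternative I have in mind would look roughly like this (a T-SQL-style sketch; the backfill step and the column name are only illustrative):

-- Redundant copy of the office on each project
ALTER TABLE Projects ADD OfficeID int NULL;

-- One-off backfill from the normalized tables
UPDATE P
SET P.OfficeID = D.OfficeID
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
JOIN Departments D ON E.DepID = D.DepID;

-- The per-page query then collapses to a single-table filter
SELECT *
FROM Projects P
WHERE P.OfficeID = #SomeOfficeID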

Normalize till it hurts, then denormalize till it works

Denormalization has the advantage of fast SELECTs on large queries.
Disadvantages are:
It takes more coding and time to ensure integrity (which is most important in your case)
It's slower on DML (INSERT/UPDATE/DELETE)
It takes more space
As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).
Optimizing for faster querying often implies duplicating data, be it denormalization, indices, extra tables, or whatever.
In case of indices, the RDBMS does it for you, but in case of denormalization, you'll need to code it yourself. What if Department moves to another Office? You'll need to fix it in three tables instead of one.
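For instance, with a redundant Projects.OfficeID column, moving a department is no longer a one-row change. A rough T-SQL-style sketch (the @DepID and @NewOfficeID parameters are just placeholders):

-- The normalized change: a single row
UPDATE Departments
SET OfficeID = @NewOfficeID
WHERE DepID = @DepID;

-- The extra, easy-to-forget change needed to keep the redundant copy consistent
UPDATE P
SET P.OfficeID = @NewOfficeID
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
WHERE E.DepID = @DepID;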
As I can see from the names of your tables, there won't be millions of records in them, so you'd better keep your data normalized: it will be simpler to manage.

Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).
Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.
Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.
Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.
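A minimal sketch of that approach, assuming the schema from the question (the table and column names are only examples): the normalized tables remain the source of truth, and a separate reporting table is rebuilt from them on whatever schedule the reports need.

-- Rebuilt on a schedule; the raw, normalized tables are never touched
CREATE TABLE ProjectOfficeReport (
    ProjectID   int NOT NULL,
    CreatedByID int NOT NULL,
    DepID       int NOT NULL,
    OfficeID    int NOT NULL
);

INSERT INTO ProjectOfficeReport (ProjectID, CreatedByID, DepID, OfficeID)
SELECT P.ProjectID, P.CreatedByID, E.DepID, D.OfficeID
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
JOIN Departments D ON E.DepID = D.DepID;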

The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).
I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.
That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.

It is better to keep that schema in Third Normal Form and let your DBA complain about the cost of joins.

DBAs should be concerned if your DB is not properly normalized to begin with. After you have carefully measured performance and determined that you have bottlenecks, you may start denormalizing, but I would be extremely cautious.

I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.

You shouldn't look at denormalizing before you've tried everything else.
Is the performance of this really an issue?
Does your database have any features you can use to speed things up without compromising integrity (see the sketch below for one example)?
Can you increase your performance by caching?
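One example of such a feature: some engines can maintain a pre-joined result for you (indexed views in SQL Server, materialized views in Oracle or PostgreSQL), giving denormalized read speed without hand-written sync code. A rough SQL Server-style sketch, assuming the schema from the original question (the restrictions on indexed views vary and are worth checking first):

CREATE VIEW dbo.ProjectOffices
WITH SCHEMABINDING
AS
SELECT P.ProjectID, D.OfficeID
FROM dbo.Projects P
JOIN dbo.Employees E ON P.CreatedByID = E.EmpID
JOIN dbo.Departments D ON E.DepID = D.DepID;
GO

-- Materializes the view; the engine keeps it in sync on every write
CREATE UNIQUE CLUSTERED INDEX IX_ProjectOffices ON dbo.ProjectOffices (ProjectID);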

Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.
In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.
What if the department occupies two offices?
What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?
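If either of those situations applies, the fix belongs in the schema, not in denormalization. A hypothetical variant (names invented for illustration) that lets a department span offices and records where an employee actually works:

-- One row per (department, office) pair replaces Departments.OfficeID
CREATE TABLE DepartmentOffices (
    DepID    int NOT NULL REFERENCES Departments (DepID),
    OfficeID int NOT NULL REFERENCES Offices (OfficeID),
    PRIMARY KEY (DepID, OfficeID)
);

-- The office an employee actually works out of, independent of the department's offices
ALTER TABLE Employees ADD OfficeID int NULL REFERENCES Offices (OfficeID);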

Don't denormalize.
Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.
One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possibility of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.
There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.
One set of design principles that leads to a radically different design from normalization is star schema design. A star schema is very fast for queries. Even large scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers update anomalies. You have to program around these anomalies when you keep the database up to date. You will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.
Using data stored in a star schema is dramatically easy. It's so easy that, using some kind of OLAP and reporting engine, you can get all the information you need without writing any code, and without sacrificing much performance.
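To make the contrast concrete, a minimal, hypothetical star for this domain might look like the sketch below (the grain and all names are invented for illustration):

-- Dimensions: wide, descriptive, deliberately denormalized
CREATE TABLE DimDate     (DateKey int PRIMARY KEY, CalendarDate date, CalendarYear int, CalendarMonth int);
CREATE TABLE DimEmployee (EmployeeKey int PRIMARY KEY, EmployeeName varchar(100), DepartmentName varchar(100));
CREATE TABLE DimOffice   (OfficeKey int PRIMARY KEY, OfficeName varchar(100), Region varchar(50));

-- Fact table: one row per project created, keyed by the dimensions
CREATE TABLE FactProjectCreated (
    DateKey      int NOT NULL REFERENCES DimDate (DateKey),
    EmployeeKey  int NOT NULL REFERENCES DimEmployee (EmployeeKey),
    OfficeKey    int NOT NULL REFERENCES DimOffice (OfficeKey),
    ProjectCount int NOT NULL DEFAULT 1
);

-- Typical query: simple fact-to-dimension joins plus aggregation
SELECT o.Region, d.CalendarYear, SUM(f.ProjectCount) AS ProjectsCreated
FROM FactProjectCreated f
JOIN DimOffice o ON f.OfficeKey = o.OfficeKey
JOIN DimDate d   ON f.DateKey = d.DateKey
GROUP BY o.Region, d.CalendarYear;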
It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.
It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and omissions in data analysis may result in unfortunate choices of dimensions and granularity. This will make the ETL almost impossible to build, and/or make the information carrying capacity of the star inadequate for the emerging needs.
Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.
This response expands on your original question, but I think it's relevant for the would-be database designer.

Normalization is a quality decision.
Denormalization is a performance decision.
That's why -
Normalize till it hurts; De-normalize till it works.
Quality decisions tell which is the least Normal Form that you can live with:
How much non-redundancy is important for your tables?
How fast do you want data management to be?
How clear do you want the relation between your tables?
Performance decisions tell what is the highest Normal Form acceptable:
Is my database's response fast enough?
Are too many joins causing a slowdown?
When you have fixed the least and the highest Normal Form acceptable in your case, pick the Normal Form anywhere in between.

If you're using integers (or BIGINTs) as the IDs and they are the clustered primary key, you should be fine.
Although it seems like it would always be faster to find an office from a project because you are always looking up primary keys, indexes on the foreign keys will make the difference minimal, since those indexes cover the primary keys too.
If you ever find a need later on to denormalise the data, you can create a cache table on a schedule or trigger.
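A rough sketch of the trigger route (SQL Server-style syntax; the cache table and trigger names are invented, and similar handling would be needed on Employees/Departments if those relationships can change):

CREATE TABLE ProjectOfficeCache (
    ProjectID int NOT NULL PRIMARY KEY,
    OfficeID  int NOT NULL
);
GO

CREATE TRIGGER trg_Projects_CacheOffice
ON Projects
AFTER INSERT
AS
BEGIN
    -- Resolve the office once, at write time, so reads never need the three joins
    INSERT INTO ProjectOfficeCache (ProjectID, OfficeID)
    SELECT i.ProjectID, D.OfficeID
    FROM inserted i
    JOIN Employees E ON i.CreatedByID = E.EmpID
    JOIN Departments D ON E.DepID = D.DepID;
END;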

In the example given, indexes set up properly on the tables should allow the joins to occur extremely fast and to scale well to hundreds of thousands of rows (see the sketch below). This is usually the approach I take to get around the issue.
There are times, though, when the data is written once and then selected for the rest of its life, where it really doesn't make sense to do a dozen joins each time.
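For reference, the index setup that usually makes the joins in the original query cheap looks something like this (the index names are only examples; the primary keys are assumed to be indexed already):

-- Support the join/filter path Projects -> Employees -> Departments -> Offices
CREATE INDEX IX_Projects_CreatedByID ON Projects (CreatedByID);
CREATE INDEX IX_Employees_DepID ON Employees (DepID);
CREATE INDEX IX_Departments_OfficeID ON Departments (OfficeID);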

Related

To normalize or not normalize? What performs better? [duplicate]

Would multiple, joined, normalized tables return queries faster than one denormalized table? I'm interested in the performance of read (SELECT) statements, not insert, delete, or update.
I believe the normalized, joined tables return SELECT queries faster, but I've also heard that since all of the data is in one row with a denormalized table, denormalized tables return queries faster.
I'm trying to find this out, so I can improve visualization rendering on Tableau, so I'm concerned with the read operations of the table, not write.
Any clearing up on this confusion would be appreciated.
If you are dealing with a static data warehouse, sometimes it IS better to deal with denormalized data, especially for any type of aggregation / roll-up values you may be interested in within the data. Having pre-summarized tables on very large datasets is good, but without knowing more of the context of your data, that is the best I can offer as an answer.
To clarify from your comment...
Let's say you are dealing with (for example, something I worked with in the past) government contract and grant data for the years 2010-2012. The data itself is not going to change... who was awarded, gov't sector, small/large business classification, amount awarded, etc. These values won't really change, so if you want to know which companies were awarded how much per congressional district, per state, per industry, etc., having pre-aggregated totals would save time.
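For instance, a pre-aggregated table for that kind of static data might be built once along these lines (the ContractAwards source table and its columns are hypothetical):

-- Built once; safe to keep denormalized because the 2010-2012 detail never changes
CREATE TABLE AwardTotalsByDistrict (
    CongressionalDistrict varchar(10)   NOT NULL,
    StateCode             char(2)       NOT NULL,
    Industry              varchar(100)  NOT NULL,
    AwardCount            int           NOT NULL,
    TotalAwarded          decimal(18,2) NOT NULL
);

INSERT INTO AwardTotalsByDistrict
    (CongressionalDistrict, StateCode, Industry, AwardCount, TotalAwarded)
SELECT CongressionalDistrict, StateCode, Industry, COUNT(*), SUM(AmountAwarded)
FROM ContractAwards
GROUP BY CongressionalDistrict, StateCode, Industry;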
If you have a read-only display system (querying only) fed from another system that performs the data entry (such as sales activity that DOES the inserts/updates/deletes), you should obviously stay in a normalized mode, as the underlying data IS changing... again, even though you are only giving read-only inquiry access to it.
It should be pretty obvious that the fastest way to get a query result is if it has already been pre-built and is sitting ready for retrieval in a single table.
However, from a maintenance perspective that is not practical.
It is generally good advice to keep most data in normalized tables, but see DRapp's answer for scenarios where denormalization is sometimes used.
That's very dependent on the situation, as others have pointed out. The best thing you can do if you need top-notch performance is generate some tests to see how things work out and then implement the fastest solution. Create one set of denormalized tables, one set of normalized, and run some queries and see how fast they execute. Go from there.
However, unless you have TONS of data, speed is probably not your biggest concern. Modern RDBMSs are extremely efficient, especially with the appropriate indexes etc. in place. You might be better off asking whether normalized or denormalized tables make more logical sense for the work you are doing. You might also consider that one of the biggest arguments for normalized tables is that they help prevent data errors. Consider doing some background reading on normalization for an explanation of this. If you want to make sure your data is as clean as possible, you may want to normalize, even if you take a small performance hit.

When to use a query or code [closed]

I am asking for a concrete case for Java + JPA / Hibernate + Mysql, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name): would you rather do a query returning this exact set of employees, or would you prefer to fetch all the employees and then use the programming language to pick out the ones you are interested in? Why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not, you have to copy more data over to the client, and databases are written to filter data efficiently, almost certainly more efficiently than your code.
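As a trivial illustration (assuming an Employees table with a FirstName column), the database-side version ships only the matching rows and can use an index:

-- Only the matching rows cross the network; an index lets the database skip the rest
CREATE INDEX IX_Employees_FirstName ON Employees (FirstName);

SELECT *
FROM Employees
WHERE FirstName = 'John';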
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases I have dealt with, the database server has had more CPU power than the clients, so unless it is overloaded it will just run the query more quickly for the same amount of code.
Also, you have to write less code to do the query on the database using Hibernate's query language than you would to manipulate the data on the client. Hibernate queries will also make use of any client caching in the configuration without you having to write more code.
There is a general trick often used in programming - paying with memory for operation speedup. If you have lots of employees, and you are going to query a significant portion of them, one by one (say, 75% will be queried at one time or the other), then query everything, cache it (very important!), and complete the lookup in memory. The next time you query, skip the trip to RDBMS, go straight to the cache, and do a fast look-up: a roundtrip to a database is very expensive, compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general it's better to use SQL to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory. Additionally, you then have to search through all of them. Why do that when you can leverage your RDBMS and get exactly the results you want? In other words, why load a large dataset that could use too much memory, then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too large, you can load it into memory and then query it -- this has the advantage that you don't need to go to the RDBMS, which may or may not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What may be a small data set today could turn into a huge data set over the years. We had an issue with a programmer who coded the application to query an entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Inserting even a date filter to query only the last 365 days could help your application scale better.
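Something as simple as this (MySQL syntax, since that is the database in the question; the table and column names are made up) keeps the working set bounded as the table grows:

-- Pull only the last 365 days instead of the whole table
SELECT *
FROM sales_activity
WHERE activity_date >= CURDATE() - INTERVAL 365 DAY;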
-- if you are looking for an answer specific to Hibernate, check @Mark's answer
Given the Employee example - assuming the number of employees can grow over time, it is better to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly are low, it can be useful to query all of them and keep them in memory - this way you don't have to reach out to the external resource (the database) every time, which could be costly.
So the general parameters are these,
scaling of data
criticality to business
volume of data
frequency of usage
To put this into practice: when the data is not going to grow frequently, the data is not mission critical, the volume of data is manageable in memory on the application server, and it is used frequently - bring it all in and filter it programmatically, if needed.
Otherwise, fetch only the specific data.
What is better: to store a lot of food at home or to buy it little by little? When you travel a lot? Just when hosting a party? It depends, doesn't it? Similarly, the best approach is a matter of performance optimization, and that involves a lot of variables. The art is to avoid painting yourself into a corner when designing your solution and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One thing could be more or less universally helpful: encapsulate your data access well.

Database Normalisation and Searching it Quickly

I'm working on the technical architecture for a content solution integration. The data from the solution provider runs to millions of rows and is normalised to 3NF. It is updated on a regular schedule (daily, most likely) and its data is split down to a very granular level of atomicity.
I need to search and query this data, and my current inclination is to leave the normalised data alone and create a denormalised database from it (OLTP to OLAP). The 'transfer' can be a custom-built program that contains the necessary business logic in addition to the raw copying power, and can be run on a set schedule as required. The denormalised database would then reduce the atomicity and allow the keyword searches and queries to run efficiently. I was looking at using Lucene.NET for the keyword work on the denormalised database.
So before I sing loudly from the hills that this is the way forward, I wanted some expert opinion on this and what is the perceived "best practise". Is the method I have suggested the best way forward considering the data I will be provided? It was suggested that perhaps I could use a 'search engine' to search the normalised data. This scared the hell out of me, but raised the question; what search engine and how?
Opinions, flames, bad language and help appreciated :)
I have built reporting databases and data warehouses based on data stored in normalized form. There is quite a bit of work involved in the transfer program (ETL). Given your description of the data feed, maybe some of that work has been done for you by the feeder.
Millions of rows isn't a lot, these days. You may be able to get away with report oriented views into the existing database. Try it and see.
The biggest benefit to building an OLAP oriented database is not speed. It's flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of manufacturing category? Bam! Done!" And so on.
A reasonably normalized model (3NF/BCNF) provides the best average performance and the least amount of modification anomalies for the largest number of scenarios. That's big, so I would start from there. As your requirements are fuzzy, it seems like the most sensible option.
Actually, the most sensible thing would be to go over the requirements until they are a bit more "crisp" ;)
Also, if you could get your hands on a few early extracts from your data provider, you could experiment with them and get a feeling for the data distributions (not all people live in one country, some countries hold more people than others, not all people have children, and the number of children per person varies vastly by country). This is a major point, and it is crucial that the optimizer can make good decisions.
Other than that, I agree with everything Walter said and also gave him my vote.

What is the resource impact from normalizing a database?

When taking a database from a relatively un-normalized form and normalizing it, what, if any, changes in resource utilization might one expect?
For example, normalization often means more tables get created from fewer which means the database now has a higher number of tables, but many of them are quite small, allowing the often used ones to fit into memory better.
The higher number of tables also means that more joins are needed (potentially) to get at the data that was abstracted out, so one would expect some sort of impact from the higher number of joins the system needs to do.
So, what impact on resource usage (i.e. what will change) does normalizing an un-normalized database have?
Edit:
To add a bit of context, I have an existing (i.e. legacy) database with over 300 horrible tables. About half of the data is TEXT and the other half is either char fields or integers. There are no constraints of any kind. The reason I ask is primarily to get more information to convince others that things need to change, and that there won't be a decrease in performance or maintainability. Unfortunately, those I have to convince know just enough about the performance benefits of a denormalized database to want to avoid normalization as much as possible.
This cannot really be answered in a general manner, as the impact will vary heavily depending on the specifics of the database in question and the apps using it.
So you basically stated the general expectations concerning the impact:
Overall memory demands for storage should go down, as redundant data gets removed
CPU needs might go up, as queries might get more expensive (Note that in many cases queries on a normalized database will actually be faster, even if they are more complex, as there are more optimization options for the query engine)
Development resource needs might go up, as developers might need to construct more elaborate queries (But on the other hand, you need less development effort to maintain data integrity)
So the only real answer is the usual: it depends ;)
Note: This assumes that we are talking about cautious and intentional denormalization. If you are referring to the 'just throw some tables together as data comes along' approach, way too common with inexperienced developers, I'd risk the statement that normalization will reduce resource needs on all levels ;)
Edit: Concerning the specific context added by cdeszaq, I'd say 'Good luck getting your point through' ;)
Obviously, with over 300 tables and no constraints (!), the answer to your question is definitely 'normalizing will reduce resource needs on all levels' (and probably very substantially), but:
Refactoring such a mess will be a major undertaking. If there is only one app using this database, it is already dreadful - if there are many, it might become a nightmare!
So even if normalizing would substantially reduce resource needs in the long run, it might not be worth the trouble, depending on circumstances. The main questions here are about long term scope - how important is this database, how long will it be used, will there be more apps using it in the future, is the current maintenance effort constant or increasing, etc. ...
Don't ignore that it is a running system - even if it's ugly and horrible, according to your description it is not (yet) broken ;-)
"Normalization" applies only and exclusively to the logical design of a database.
The logical design of a database and the physical design of a database are two completely distinct things. Database theory has always intended for things to be this way. The fact that the developers who overlook/disregard this distinction (out of ignorance or out of carelessness or out of laziness or out of whatever other so-called-but-invalid "reason") are the vast majority, does not make them right.
A logical design can be said to be normalized or not, but a logical design does not inherently carry any "performance characteristic" whatsoever. Just like 'c:=c+1;' does not inherently carry any performance characteristic.
A physical design does determine "performance characteristics", but then again a physical design simply does not have the quality of being "normalized or not".
This flawed perception of "normalization hurting performance" is really nothing else than concrete proof that all the DBMS engines that exist today are just seriously lacking in physical design options.
There's a very simple answer to your question: it depends.
Firstly, I'd rephrase your question as 'what is the benefit of denormalization', because normalization is something that should be done by default (as the result of a pure logical model), and denormalization can then be applied for very specific tables where performance is critical. The main problem with denormalization is that it can complicate data integrity management, but the benefits in some cases outweigh the risks.
My advice on denormalization: do it only when it really hurts, and make sure you have all scenarios covered when it comes to maintaining data integrity after any insert, update or delete.
To underscore some points made by prior posters: Is your current schema really denormalized? The proper way (IMHO) to design a database is to:
Understand as best you can the system/information to be modeled
Build a fully normalized model
Then, if and as you find it necessary, denormalize in a controlled fashion to enhance performance
(There may be other reasons to denormalize, but the only ones I can think of off-hand are political ones -- having to match the existing code, the developers/managers don't like it, etc.)
My point is, if you never fully normalized, you don't have a denormalized database - you've got an unnormalized one. And I think you can come up with more descriptive, if less polite, terms for those databases.
I've found that normalization, in some cases, will improve performance.
Small tables read more quickly. A badly denormalized database will often have (a) longer rows and (b) more rows than a normalized design.
Reading fewer shorter rows means less physical I/O.
For one thing, you'll end up having to do resultset calculations. For example, if you have a Blog, with a number of Posts, you could either do:
select count(*) from Post where BlogID = #BlogID
which is more expensive than
select PostCount from Blog where ID = #BlogID
and can lead to the SELECT N+1 problem, if you're not careful.
Of course with the second option you have to deal with keeping the data integrity, but if the first option is painful enough, then you make it work.
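Keeping that stored PostCount correct is the part you then have to own. One common sketch (SQL Server-style trigger syntax; deletes would need symmetrical handling):

CREATE TRIGGER trg_Post_AfterInsert
ON Post
AFTER INSERT
AS
BEGIN
    -- Bump each affected blog's counter by the number of posts just inserted
    UPDATE B
    SET B.PostCount = B.PostCount + i.NewPosts
    FROM Blog B
    JOIN (SELECT BlogID, COUNT(*) AS NewPosts
          FROM inserted
          GROUP BY BlogID) i ON i.BlogID = B.ID;
END;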
Be careful you don't fall foul of premature optimisation. Do it in the normalised fashion, then measure performance against requirements, and only if it falls short should you look to denormalise.
Normalized schemas tend to perform better for INSERT/UPDATE/DELETE because there are no "update anomalies" and the actual changes that need to be made are more localized.
SELECTs are mixed. Denormalization is essentially materializing a join. There's no doubt that materializing a join sometimes helps; however, materialization is often very pessimistic (probably more often than not), so don't assume that denormalization will help you. Also, normalized schemas are generally smaller and therefore might require less I/O. A join is not necessarily expensive, so don't automatically assume that it will be.
I wanted to elaborate on Henrik Opel's #3 bullet point. Development costs might go up, but they don't have to. In fact, normalization of a database should simplify or enable the use of tools like ORMs, Code Generators, Report Writers, etc. These tools can significantly reduce the time spent on the data access layer of your applications and move development on through to adding business value.
You can find a good StackOverflow discussion here about the development aspect of normalized databases. There were many good answers, comments and things to think about.

Any good literature on join performance vs systematic denormalization?

As a corollary to this question, I was wondering if there were good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization versus systematically denormalizing in order to always access a single table at a time.
Specifically I want information about :
Performance of normalisation versus denormalisation.
Scalability of normalized vs denormalized system.
Maintainability issues of denormalization.
Model consistency issues with denormalization.
A bit of history to see where I am going with this: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table. As such, all complex objects have to be instantiated using multiple queries on each of the related tables. To make sure the system always uses a single table, heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have been worked around by carefully crafting the data model to avoid such relations, always falling back on 1-n or n-1.
The end result is a convoluted, overly complex system where customers often complain about performance. When analyzing such bottlenecks, they never question the basic premises on which the system is based and always look for other solutions.
Did I miss something? I think the whole idea is wrong, but I somehow lack the irrefutable evidence to prove (or disprove) it. This is where I am turning to your collective wisdom: to point me towards good, well-accepted literature that can convince the other fellows on my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).
My next step is building my own test bench and gathering results; since I hate reinventing the wheel, I want to know what is already out there on the subject.
---- EDIT
Notes: the system was first built with flat files, without a database system... only later was it ported to a database because a client insisted on the system using Oracle. They did not refactor, but simply added support for relational databases to the existing system. Flat-file support was later dropped, but we are still awaiting refactoring to take advantage of the database.
A thought: you have a clear impedance mismatch - a data access layer that allows access to only one table? Stop right there: this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. Having no option other than returning a single table, and presumably doing any joining in the business layer, just doesn't make sense.
For justification of normalisation, and the potential consistency costs you can refer to all the material from Codd onwards, see the Wikipedia article.
I predict that benchmarking this kind of stuff will be a never-ending activity; special cases will abound. I claim that normalisation is 'normal': people get good enough performance from a clean database design. Perhaps an approach might be a survey: 'How normalised is your data? Scale 0 to 4.'
As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.
DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.
But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as it would be in a denormalized design that favored one type of query.
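In the normalized design, both access paths are served by the same junction table. A sketch with assumed names and placeholder parameters (@OrderID, @ProductID):

CREATE TABLE OrderProducts (
    OrderID   int NOT NULL REFERENCES Orders (OrderID),
    ProductID int NOT NULL REFERENCES Products (ProductID),
    Quantity  int NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
);

-- A secondary index supports the Product -> Orders direction
CREATE INDEX IX_OrderProducts_ProductID ON OrderProducts (ProductID);

-- Products in a given order
SELECT P.*
FROM OrderProducts OP
JOIN Products P ON P.ProductID = OP.ProductID
WHERE OP.OrderID = @OrderID;

-- Orders containing a given product
SELECT O.*
FROM OrderProducts OP
JOIN Orders O ON O.OrderID = OP.OrderID
WHERE OP.ProductID = @ProductID;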
Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency. I.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.