When taking a database from a relatively un-normalized form and normalizing it, what, if any, changes in resource utilization might one expect?
For example, normalization often means more tables get created from fewer which means the database now has a higher number of tables, but many of them are quite small, allowing the often used ones to fit into memory better.
The higher number of tables also means that more joins are needed (potentially) to get at the data that was abstracted out, so one would expect some sort of impact from the higher number of joins the system needs to do.
So, what impact on resource usage (ie. what will change) does normalizing an un-normalized database have?
Edit:
To add a bit of context, I have an existing (ie. legacy) database with over 300 horrible tables. About 1/2 of the data is TEXT and the other half is either char fields or integers. There are no constraints of any kind. The reason I ask is primarily to get more information for convincing others that things need to change and that there won't be a decrease in performance or maintainability. Unfortunately, those I have to convince know just enough about the performance benefits of a de-normalized database to want to avoid normalization as much as possible.
This can not really be answered in a general manner, as the impact will vary heavily depending on the specifics of the database in question and the apps using it.
So you basically stated the general expectations concerning the impact:
Overall memory demands for storage should go down, as redundant data gets removed
CPU needs might go up, as queries might get more expensive (Note that in many cases queries on a normalized database will actually be faster, even if they are more complex, as there are more optimization options for the query engine)
Development resource needs might go up, as developers might need to construct more elaborate queries (But on the other hand, you need less development effort to maintain data integrity)
So the only real answer is the usual: it depends ;)
Note: This assumes that we are talking about cautious and intentional denormalization. If you are referring to the 'just throw some tables together as data comes along' approach way to common with inexperienced developers, I'd risk the statement that normalization will reduce resource needs on all levels ;)
Edit: Concerning the specific context added by cdeszaq, I'd say 'Good luck getting your point through' ;)
Oviously, with over 300 Tables and no constraints (!), the answer to your question is definitely 'normalizing will reduce resource needs on all levels' (and probably very substantially), but:
Refactoring such a mess will be a major undertaking. If there is only one app using this database, it is already dreadful - if there are many, it might become a nightmare!
So even if normalizing would substantially reduce resource needs in the long run, it might not be worth the trouble, depending on circumstances. The main questions here are about long term scope - how important is this database, how long will it be used, will there be more apps using it in the future, is the current maintenance effort constant or increasing, etc. ...
Don't ignore that it is a running system - even if it's ugly and horrible, according to your description it is not (yet) broken ;-)
"Normalization" applies only and exclusively to the logical design of a database.
The logical design of a database and the physical design of a database are two completely distinct things. Database theory has always intended for things to be this way. The fact that the developers who overlook/disregard this distinction (out of ignorance or out of carelessness or out of laziness or out of whatever other so-called-but-invalid "reason") are the vast majority, does not make them right.
A logical design can be said to be normalized or not, but a logical design does not inherently carry any "performance characteristic" whatsoever. Just like 'c:=c+1;' does not inherently carry any performance characteristic.
A physical design does determine "performance characteristics", but then again a physical design simply does not have the quality of being "normalized or not".
This flawed perception of "normalization hurting performance" is really nothing else than concrete proof that all the DBMS engines that exist today are just seriously lacking in physical design options.
There's a very simple answer to your question: it depends.
Firstly, I'd re-phrase your question as 'what is the benefit of denormalization', because normalization is the something that should be done as a default (as the result of a pure logical model) and then denormalization can be applied for very specific tables where performance is critical. The main problem of denormalization is that it can complicate data integrity management, but the benefits in some cases outweigh the risks.
My advice for denormalization: do it only when it really hurts and make sure you got all scenarios covered when it comes to maintaining data integrity after any inserts, updates or deleted.
To underscore some points made by prior posters: Is you current schema really denormalized? The proper way (imho) to design a database is to:
Understand as best you can the system/information to be modeled
Build a fully normalized model
Then, if and as you find it necessary, denormalize in a controlled fashion to enhance performance
(There may be other reasons to denormalize, but the only ones I can think of off-hand are political ones--have to match the existing code, the developers/managers don't like it, etc.)
My point is, if you never fully normalized, you don't have a denormalized database, you've got an unnormalized one. And I think you can think of more descriptive if less polite terms for those databases.
I've found that normalization, in some cases, will improve performance.
Small tables read more quickly. A badly denormalized database will often have (a) longer rows and (b) more rows than a normalized design.
Reading fewer shorter rows means less physical I/O.
For one thing, you'll end up having to do resultset calculations. For example, if you have a Blog, with a number of Posts, you could either do:
select count(*) from Post where BlogID = #BlogID
which is more expensive than
select PostCount from Blog where ID = #BlogID
and can lead to the SELECT N+1 problem, if you're not careful.
Of course with the second option you have to deal with keeping the data integrity, but if the first option is painful enough, then you make it work.
Be careful you don't fall foul of premature optimisation. Do it in the normalised fashion, then measure performance against requirements, and only if it falls short should you look to denormalise.
Normalized schemas tend to perform better for INSERT/UPDATE/DELETE because there are no "update anomalies" and the actual changes that need to be made are more localized.
SELECTs are mixed. Denormalization is esentially materializing a join. There's no doubt that materializing a join sometimes helps, however, materialization is often very pessimistic (probably more often than not), so don't assume that denormalization will help you. Also, normalized schemas are generally smaller and therefore might require less I/O. A join is not necessarily expensive, so don't automatically assume that it will be.
I wanted to elaborate on Henrik Opel's #3 bullet point. Development costs might go up, but they don't have to. In fact, normalization of a database should simplify or enable the use of tools like ORMs, Code Generators, Report Writers, etc. These tools can significantly reduce the time spent on the data access layer of your applications and move development on through to adding business value.
You can find a good StackOverflow discussion here about the development aspect of normalized databases. There were many good answers, comments and things to think about.
Related
As a corollary to this question I was wondering if there was good comparative studies I could consult and pass along about the advantages of using the RDMBS do the join optimization vs systematically denormalizing in order to always access a single table at a time.
Specifically I want information about :
Performance or normalisation versus denormalisation.
Scalability of normalized vs denormalized system.
Maintainability issues of denormalization.
model consistency issues with denormalization.
A bit of history to see where I am going here : Our system uses an in-house database abstraction layer but it is very old and cannot handle more than one table. As such all complex objects have to be instantiated using multiple queries on each of the related tables. Now to make sure the system always uses a single table heavy systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationship they seemed to have worked around it by carefully crafting their data model to avoid such relations and always fall back on 1-n or n-1.
End result is a convoluted overly complex system where customer often complain about performance. When analyzing such bottle neck never they question these basic premises on which the system is based and always look for other solution.
Did I miss something ? I think the whole idea is wrong but somehow lack the irrefutable evidence to prove (or disprove) it, this is where I am turning to your collective wisdom to point me towards good, well accepted, literature that can convince other fellow in my team this approach is wrong (of convince me that I am just too paranoid and dogmatic about consistent data models).
My next step is building my own test bench and gather results, since I hate reinventing the wheel I want to know what there is on the subject already.
---- EDIT
Notes : the system was first built with flat files without a database system... only later was it ported to a database because a client insisted on the system using Oracle. They did not refactor but simply added support for relational databases to existing system. Flat files support was later dropped but we are still awaiting refactors to take advantages of database.
a thought: you have a clear impedence mis-match, a data access layer that allows access to only one table? Stop right there, this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. To have no option other than return a single table, and presumably do any joining in the bausiness layer, just doesn't make sense.
For justification of normalisation, and the potential consistency costs you can refer to all the material from Codd onwards, see the Wikipedia article.
I predict that benchmarking this kind of stuff will be a never ending activity, special cases will abound. I claim that normalisation is "normal", people get good enough performance fro a clean database deisgn. Perhaps an approach might be a survey: "How normalised is your data? Scale 0 to 4."
As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.
DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
There's no doubt that denormalization improves performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need it to be fastest to query the Products in a given Order, then you can store data in a denormalized way to support that, and gain some benefit.
But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.
Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency. I.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.
I got into a bit of a debate yesterday with my boss about the proper role of optimization when building software. Essentially, his position was that optimization needs to be a primary concern during the entire process of development.
My opinion is that you need to make the right algorithmic decisions during development, but you should never be counting cycles during development. In fact, I feel so strongly about this I had to walk away from the conversation. I've seen too many bad programming decisions in the name of "optimization", and too much bad code defended with the excuse "this way is faster".
What does the StackOverflow.com community think?
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
- Donald Knuth
I think the premature optimization quote is used by too many to avoid thinking about the hard stuff concerning how well the application will run. I guarantee the users want you to think about how to design it so it will run as fast as possible.
This is not to say you should be timing everything, but the design phase is the easiest place to optimize and not cost lots of time later.
There are often several ways to do anything, you should pick in the design phase the one which is most likely to perform the best (if it turns out to be one of the times when it isn't the best, then optimize later). This should trump the need to have easy to read code.
If you aren't considering performance in the design phase, you aren't going to have a well designed system. That doesn't mean it should be the only concern (although in a database I'd rate it as 3rd in importance, right after data integrity and security), but trying to fix a system where poorly performing techniques were used throughout because the developers thought they were easier to understand is a nightmare. Being a user of such a system where you have to wait for minutes everytime you want to move from one screen to another is a nightmare (developers reallly should spend all day everyday for at least a week, using their systems!) for everyone who is stuck with the badly designed system. It costs less to design properly than to fix later and considering performance is critical to designing properly.
I work somewhere where the orginal developers drank the koolaid about premature optimization and did everything the way they thought was simplest (but which in almost every case was the wrong choice from a performance perspective). Now we are at 10 times the size we were three years ago and every screen on every website takes 30 seconds or so to load (or worse times out) and we are losing customers because of it. But changing it will be too hard because at the base they designed the database without considering how it would perform and redesigning a database with many many gigabytes of data into a new structure is way too time consuming and costly. If it had been designed to perform from the start it would be both easier to maintain and faster for the clients. We aren't talking about the need to performance tune the top 10 slowest queries here, we are talking about the fact that the overall structure requires a drastic change (one that would affect virtually every query against the system) to perform well.
Yes don't do micro optimization until you nmeed to, but please do the macro stuff. Consider if is this the best way before you commit to the path. Don't write cursors to hit tables with millions of records when a set-based statement will do. Don't try to have as few tables as possible becasue that seems to be a more elegant solution when the tables are storing disparate items (such as people, places, and vehicles) causing every single query to hit the same table and causing every delete to check all sorts of foreign key tables that will not ever have a record for that type of entity (it takes minutes to delete one record from the main table in our database, it's a real joy when something goes wrong in an import (bad data from a client usually) and we have to delete 200,000 let me tell you).
Optimization is almost tautologically a tradeoff: you gain runtime efficiency at the cost of other things (readability, maintainability, flexibility, compile times, etc.). As such, it's really never a good idea to do unless you know what the tradeoff is and why it's worthwhile.
Even worse, thinking about "how do I do X fast" can be very distracting. To much energy in that direction can easily lead to you misisng out on method Y which is much better (and often faster --- this is how "optimization" can make your code slower). Particularly if you do too much of this on a big project from the beginning, it represents a lot of momentum. If you can't afford to overcome that momentum, you can easily become locked into a bad design because you can't afford the time to restructure it. This way lie dragons
What your boss may be thinking of is more of an issue of writing bad code via inappropriate representations and algorithms. It's not really the same thing as optimizing, but an approach where you pay no attention whatsoever to appropriate data structures etc. can result in a codebase that is slow everywhere, and (much like the above "lock in") requires heroic effort to fix.
In general though, premature optimization really honestly is a terrible idea. Particularly when you end up with a complex, finely tuned, well documented (because that's the only way you can understand it) piece of code you end up not using anyway. And that's not even getting into the issue of subtle bugs that are often introduced when "optimizing"
[edit: pshaw, of course a Knuth quote encapsulates this well. That's what I get for typing too much]
Engineer throughout, optimize at the end.
Since with going with pithy, I'll say that optimization is as important as the impact of not doing it.
I think the "premature optimization is root of all evil" has to be understood literally - it does not say when is premature, and does not say you should optimize only at the end. Just not too early. Also, the "use the right algorithm, O(n^2) vs O(N)" is a bit dangerous if taken literally - because for many problems, the N is actually small, etc...
I think it depends a lot of the type of software you are doing: some software are such as every part is very independent, and can be optimized separately. But that's not always the case. For many (most ?) applications, speed just does not matter at all, the brute force but obviously correct way is the best one. But for projects where speed matters, it often has to be taken into account early - maybe that's another possible interpretation of Knuth's saying: many applications don't need to be optimized at all, just know which ones need and plan ahead.
Optimization is a primary concern through development when you have a good reason to expect that performance will be unfixably bad if optimization is a secondary concern.
This depends a lot what kind of code you're writing, but there are often better reasons to believe that your code will be unfixably difficult to use; or maintain; or full of bugs; or late; if all those things become secondary to tweaking performance.
Bad managers say, "all of those things are our primary concerns". Good managers work to find out which are dangers for this project.
Of course, good design does have to consider all these things, and the earlier you have a back-of-the-envelope estimate of any of them, the better. If all your manager is saying, is that if you never think about how fast your code will run then too much of your code will be dog-slow, then he's right. I just wouldn't say that makes optimization your "primary" concern.
If the USP of your software is that it's faster than your competitors', then optimization is a primary concern. With experience, you can often predict what sorts of operations will be the bottlenecks, design those with optimization in mind right from the start, and more-or-less ignore optimization elsewhere. A lot of projects won't even need this: they'll be fast enough without much effort, provided you use sensible algorithms and don't do anything stupid. "Don't do anything stupid" is always a primary concern, with no need to mention performance in particular.
I think that code needs to be, first and foremost, readable and understandable. So, optimisations that are done, should not be at the expense of readability. However, optimisation is often a trade-off.
Whether or not you should optimise your code depends on your application domain. If you are working on an embedded processor with only 8Mb of memory, then optimisation is probably something that every team member needs to keep in mind, when writing code - optimising for space vs speed.
However, pre-mature optimisation is not useful unless your system has been clearly spec'ed and understood. This is because most programmers do not make good optimisation decisions unless they can factor in the influence of the overall system, including processor architectural factors such as cache memory, hardware threads, pipelines, etc.
From 2 years of building highly optimized Java code (and that needed to be optimized that way) I would say that there is a time-spent rule that governs optimization:
optimizing on the spot: 5%-10% of your development time, because you have to do it countless times (every single time you have to amend your design)
optimizing just when you have had it working: 2% of your development time (you do it only once)
going back to it and optimizing when it's too slow: 30% of your development time, because you have to plunge yourself back into the system
SO I would come to the conclusion that there is a right time and a right way to optimize: do it entity by entity (class by class, if you have classes that have a single, well defined job to do, and can be tested and optimized), test well, make sure the logic is working, optimize just afterward, and forget about that entity's implementation details forever.
When developing, just keep it simple. IMHO, most performance problems are caused by over-engineering - making mountains out of molehills, often because of wanting "the right algorithm".
Periodically, stress test with a big data set, profiling or (my favorite technique) manual random sampling. You find a problem, you fix it. You find another, you fix it.
That way you avoid creating slugs (slowness bugs), and when they do arise, you kill them.
Added: If I can just elaborate on point 1. OO is seemingly the law of the land, and it certainly has good reasons behind it. Unfortunately it causes many young programmers to feel that programming is all about having lots of data structure, with layers upon layers of abstraction. Not that those are inherently bad, but combine that with the natural tendency to assume that the time something takes is roughly proportional to the number of characters you have to type to invoke it, and that this tendency multiplies over the layers (and besides, the machines are really fast), it's easy to create a perfect storm of cycle-waste.
Quote from a friend: "It's easier to make a working system efficient than to make an efficient system work".
I think it is important to use smart practices and patterns from the start, but get the system actually running for small test cases then do performance analysis. Frequently the areas of code that have poor performance aren't anticipated at the beginning, so get some real data and then optimize the bottlenecking 3% (or 20%, or whatever it is).
I think your boss is a lot more right than you are.
Allt too often the user experience is lost only to be "rediscovered" at the last possible moments when performance activites are prohibitively costly and inefficient. Or when it is discovered that the batch program that will process today's transactions requires forty hours to run.
Things such as database organization, when and when not to do which SELECTs are examples of design decisions that can make or break an application. Still you run the risk of a single programmer deciding to to otherwise, misinterpret or simply not understand what to do. Following-up on performance during an entire project decreases the risk that such things will happen. It also allows design decisions to be changed when there is a need for that without puting the entire poroject at risk.
"You need to make the right algorithmic decisions during development" is certainly true yet how many mainstream programmers are able to do that? Browsing the net for information does not guarantee finding a high quality solution. "Right" could be interpreted as meaning that it is ok to choose a poor algorithm because it is easy to understand and implement (= less development time, lower cost) than a more complicated one (= more development time, higher cost).
The pendulum of quantity vs quality is almost always on the quantity side because more code per hour or faster development time means money in the short term. The quality side means money in the long term.
EDIT
This article discusses performance and optimization quite thoroughly.
Performance preacher Rico Mariana sums it up in the short statement "never give up your performance accidentally."
Premature optimization is the root of all evil...There is a fine balance between, but I would say 95% of the time you need to optimize at the end; however, there are decisions you can make early on to help prevent issues. For example assume we are talking about an e-commerce web site. You have a requirement to display the catalog. Now you can grab all 100,000 items and display 50 of them, or you can grab just 50 from the database. These type of decisions should be made up front.
Cycle counting should only be done when a problem has been identified.
Your boss is partly right, optimisation does need to be considered throughout the development lifecycle but it is rarely the primary concern. Also, the term 'optimisation' is vague - it's an adjective, 'optimise for ...' which could be 'memory', 'speed', 'usability', 'maintainability' and so on.
However, the OP is right that counting cycles is pointless for many projects. For most PC applications the CPU is never the bottleneck. Also, the IA32 is not consistent - what worked well on one architecture, performs poorly on another. Cycle counting should only ever be done when it will actually make a difference - usually in CPU limited code or code with very specific timing needs.
Optimisation, of any kind, must always be driven by hard evidence. Never assume anything about the system or how the code is behaving. In an ideal world, application performance / constraints will be specified in the initial product design and tools to monitor the application's performance during development will be added early on in the development phase to guide the programmers as the product is made.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I have these tables:
Projects(projectID, CreatedByID)
Employees(empID,depID)
Departments(depID,OfficeID)
Offices(officeID)
CreatedByID is a foreign key for Employees. I have a query that runs for almost every page load.
Is it bad practice to just add a redundant OfficeID column to Projects to eliminate the three joins? Or should I do the following:
SELECT *
FROM Projects P
JOIN Employees E ON P.CreatedBY = E.EmpID
JOIN Departments D ON E.DepID = D.DepID
JOIN Offices O ON D.officeID = O.officeID
WHERE O.officeID = #SomeOfficeID
In application programming I "Write with best practices first and optimize afterwards", but database administrators are always warning about the cost of joins.
Normalize till it hurts, then denormalize till it works
Denormalization has the advantage of fast SELECTs on large queries.
Disadvantages are:
It takes more coding and time to ensure integrity (which is most important in your case)
It's slower on DML (INSERT/UPDATE/DELETE)
It takes more space
As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).
Optimizing for faster querying often implies duplicating data, be it denormalization, indices, extra tables of whatever.
In case of indices, the RDBMS does it for you, but in case of denormalization, you'll need to code it yourself. What if Department moves to another Office? You'll need to fix it in three tables instead of one.
So, as I can see from the names of your tables, there won't be millions records there. So you'd better normalize your data, it will be simplier to manage.
Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).
Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.
Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.
Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.
The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).
I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.
That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.
It is better to keep that schema in Third Normal Form and let your DBA to complain about joins cost.
DBA's should be concerned if your db is not properly normalized to begin with. After you carefully measured performance and determined you have bottlenecks you may start denormalizing, but I would be extremely cautious.
I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.
You shouldn't look at denormalizing before you've tried everything else.
Is the performance of this really an issue?
Do your database have any features you can use to speed things up without compromising integrity?
Can you increase your performance by caching?
Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.
In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.
What if the department occupies two offices?
What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?
Don't denormalize.
Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.
One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possiblity of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.
There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.
One set of design principles that leads to a radically different design from normalization is star schema design. Star schema is very fast for queries. Even large scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers update anomalies. You have to program around these anomalies when you keep the database up to date. You will will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.
Using data stored in a star schema is dramatically easy. It's so easy that using some kind of OLAP and reporting engine, you can get all the information needed without writing any code, and without sacrificing performance too much.
It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.
It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and ommissions in data analysis may result in unfortunate choices in dimensions and granularity. This will make ETL almost impossible to build, and/or make the information carrying capacity of the star inadequate for the emerging needs.
Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.
This response expands on your original question, but I think it's relevant for the would be database designer.
Normalization is a quality decision.
Denormalization is a performance decision.
That's why -
Normalize till it hurts; De-normalize till it works.
Quality decisions tell which is the least Normal Form that you can live with:
How much non-redundancy is important for your tables?
How fast data management do you want?
How clear do you want the relation between your tables?
Performance decisions tell what is the highest Normal Form acceptable:
Is my database's response fast enough?
Are too many joins causing a slowdown?
When you have fixed the least and the highest Normal Form acceptable in your case, pick the Normal Form anywhere in between.
If you're using Integers (or BIGINT) as the ID's and they are the clustered primary key you should be fine.
Although it seems like it would always be faster to find an office from a project as you are always looking up primary keys the use of indexes on the foreign keys will make the difference minimal as the indexes will cover the primary keys too.
If you ever find a need later on to denormalise the data, you can create a cache table on a schedule or trigger.
In the example given indexes set up properly on the tables should allow the joins to occur extremely fast and will scale well to the 100,000s of rows. This is usually the approach that I take to get around the issue.
There are times though that the data is written once and the selected for the rest of its life where it really didn't make sense to do a dozen joins each time.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I really like ORM as compared to store procedure, but one thing that I afraid is that ORM could be slow, because of layers and layers of abstraction. Will using ORM slow down my application? Or does it matter?
Yes, it matters. It is using more CPU cycles and consequently slowing your application down. Hear me out though...
But, consider this: what is more expensive? Server hardware or another programmer? Server hardware, generally, is cheaper than hiring another team of programmers. So, while ORM may be costing you CPU cycles, you need one less programmer to manage your SQL queries, often resulting in a lower net cost.
To determine if it's worth it for you, calculate or determine how many hours you saved by using an ORM. Then, figure out how much money you spent on the server to support ORM. Multiply the hours you saved by your hourly rate and compare to the server cost.
Of course, whether an ORM actually saves you time is a whole another debate...
Is ORM slow?
Not inherently. Some heavyweight ORMs can add a general drag to things but we're not talking orders of magnitude slowdown.
What does make ORM slow is naïve usage. If you're using an ORM because it looks easy and you don't know how the underlying relational data model works, you can easily write code that seems reasonable to an OO programmer, but will murder performance.
ORM is a handy tool, but you need the lower-level understanding (that usually comes from writing SQL queries) to go with it.
Does it matter?
If you end up performing a looped query for each of thousands of entities at once, instead of a single fast join, then certainly it can.
ORM's are slower and do add overhead to applications (unless you specifically know how to get around these problems, which is not very common). The database is the most critical element and Web applications should be designed around it.
Many OOP frameworks using Active Record or ORMs, developers in general - treat the database as an unimportant afterthought and tend to look at it as something they don't really need to learn. But performance and scalability usually suffer as the db is heavily taxed!
Many large scale web apps have fallen flat, wasting millions and months to years of time because they didn't recognize the importance of the database. Hundreds of concurrent users and tables with millions of records require database tuning and optimization. But I believe the problem is noticeable with a few users and less data.
Why are developers so afraid to learn proper SQL and tuning measures when it's the key to performance?
In a Windows Mobile 5 project against using SqlCe, I went from using hand-coded objects to code generated (CodeSmith) objects using an ORM template. In the process all my data access used CSLA as a base layer.
The straight conversion improved my performance by 32% in local testing, almost all of it a result of better access methods.
After that change, we adjusted the templates (after seeing some SqlCe performance stuff at PDC by Steve Lasker) and in less then 20 minutes, our entire data layer was greatly improved, our average 'slow' calls went from 460ms to ~20ms. The cool part about the ORM stuff is that we only had to implement (and unit test) these changes once and all the data access code got changed. It was an amazing time saver, we maybe saved 40 hours or more.
The above being said, we did lose some time by taking out a bunch of 'waiting' and 'progress' dialogs that were no longer needed.
I have used a few of the ORM tools, and I can recommend two of them:
.NET Tiers
CSLA codegen templates
Both of them have performed quite nicely and any performance loss has not been noticeable.
I've always found it doesn't matter. You should use whatever will make you the most productive, responsive to changes, and whatever is easiest to debug and maintain.
Most applications never need enough load for the difference between ORM and SPs to noticeable. And there are optimizations to make ORM faster.
Finally, a well-written app will have its data access seperated from everything else so that in the future switching from ORM to whatever would be possible.
Is ORM slow?
Yes ( compared with stored procedures )
Does it matter?
No ( except your concern is speed )
I think the problem is many people think of ORM as a object "trick" to databases, to code less or simplify SQL usage, while in reality is .. well an Object - To Relational ( DB ) - Mapping.
ORM is used to persist your objects to a relational database manager system, and not ( just ) to substitute or make SQL easier ( although it make a good job at that too )
If you don't have a good object model, or you're using to make reports, or even if you're just trying to get some information, ORM is not worth it.
If in the other hand you have a complex system modeled through objects were each one have different rules and they interact dynamically and you your concern is persist that information into the database rather than substitute some existing SQL scripts then go for ORM.
Yes, ORM will slow down your application. By how much depends on how far the abstraction goes, how well your object model maps to the database, and other factors. The question should be, are you willing to spend more developer time and use straight data access or trade less dev time for slower runtime performance.
Overall, the good ORMs have little overhead and, by and large, are considered well worth the trade off.
Yes, ORMs affect performance, whether that matters ultimately depends on the specifics of your project.
Programmers often love ORM because they like the nice front-end cding environments like Visual Studio and dislike coding raw SQL with no intellisense, etc.
ORMs have other limitations besides a performance hit--they also often do not do what you need 100% of the time, add the complexity of an additional abstraction layer that must be maintained and re-established every time chhnges are made, there are also caching issues to be dealt with.
Just a thought -- if the database vendors would make the SQL programming environment as nice as Visual Studio, and provide a more natural linkage between the db code and front-end code, we wouldn't need the ORMs...I guess things may go in that direction eventually.
Obvious answer: It depends
ORM does a good job of insulating a programmer from SQL. This in effect substitutes mediocre, computer generated queries for the catastrophically bad queries a programmer might give.
Even in the best case, an ORM is going to do some extra work, loading fields it doesn't need to, explicitly checking constraints, and so forth.
When these become a bottle-neck, most ORM's let you side-step them and inject raw SQL.
If your application fits well with objects, but not quite so easily with relations, then this can still be a win. If instead your app fits nicely around a relational model, then the ORM represents a coding bottleneck on top of a possible performance bottleneck.
One thing I've found to be particularly offensive about most ORM's is their handling of primary keys. Most ORM's require pk's for everything they touch, even if there is no concievable use for them. Example: Authors should have pk's, Blog posts SHOULD have pk's, but the links (join table) between authors and posts not.
I have found that the difference between "too slow" and "not too much slower" depends on if you have your ORM's 2nd level (SessionFactory) cache enabled. With it off it handles fine under development load, but will crush your system under mild production load. After turning on the 2nd Level cache the server handled the expected load and scaled nicely.
ORM can get an order of magnitude slower, not just on the grount=s of wasting a lot of CPU cycles on it's own but also using much more memeory which then has to be GC-d.
Much worse that that however is that the is no standard for ORM (unlike SQL) and that my and large ORM-s use SQL vary inefficiently so at the end of the day you still have to dig into SQL to fix per issues and every time an ORM makes a mess and you have to debug it. Meaning that you haven't gained anything at all.
It's terribly immature technology for real production-level applications. Very problematic things are handling indexes, foreign keys, tweaking tables to fit object hierarchies and terribly long transactions, which means much more deadlocks and repeats - if an ORM knows hows to handle that at all.
It actually makes servers less scalable which multiplies costs but these costs don't get mentioned at the begining - a little inconvenient truth :-) When something uses transactions 10-100 times bigger than optimal it becomes impossible to scale SQL side at all. Talking about serious systems again not home/toy/academic stuff.
An ORM will always add some overhead because of the layers of abstraction but unless it is a poorly designed ORM that should be minimal. The time to actually query the database will be many times more than the additional overhead of the ORM infrastructure if you are doing it correctly, for example not loading the full object graph when not required. A good ORM (nHibernate) will also give you many options for the queries run against the database so you can optimise as required as well.
Using an ORM is generally slower. But the boost in productivity you get will get your application up and running much faster. And the time you save can later be spent finding the portions of your application that are causing the biggest slow down - you can then spend time optimizing the areas where you get the best return on your development effort. Just because you've decided to use an ORM doesn't mean you can't use other techniques in the sections of code that can really benefit from it.
An ORM can be slower, but this is offset by their ability to cache data, therefore however fast the alternative, you can't get much faster than reading from memory.
I never really understood why people think that this is slower or that is slower... get a real machine I say. I have had mixed results... I've seen where execution time for a stored procedure is much slower than ORM and vise versa.. But in both cases the performance was due to difference in hardware.
Transactional programming is, in this day and age, a staple in modern development. Concurrency and fault-tolerance are critical to an applications longevity and, rightly so, transactional logic has become easy to implement. As applications grow though, it seems that transactional code tends to become more and more burdensome on the scalability of the application, and when you bridge into distributed transactions and mirrored data sets the issues start to become very complicated. I'm curious what seems to be the point, in data size or application complexity, that transactions frequently start becoming the source of issues (causing timeouts, deadlocks, performance issues in mission critical code, etc) which are more bothersome to fix, troubleshoot or workaround than designing a data model that is more fault-tolerant in itself, or using other means to ensure data integrity. Also, what design patterns serve to minimize these impacts or make standard transactional logic obsolete or a non-issue?
--
EDIT: We've got some answers of reasonable quality so far, but I think I'll post an answer myself to bring up some of the things I've heard about to try to inspire some additional creativity; most of the responses I'm getting are pessimistic views of the problem.
Another important note is that not all dead-locks are a result of poorly coded procedures; sometimes there are mission critical operations that depend on similar resources in different orders, or complex joins in different queries that step on each other; this is an issue that can sometimes seem unavoidable, but I've been a part of reworking workflows to facilitate an execution order that is less likely to cause one.
I think no design pattern can solve this issue in itself. Good database design, good store procedure programming and especially learning how to keep your transactions short will ease most of the problems.
There is no 100% guaranteed method of not having problems though.
In basically every case I've seen in my career though, deadlocks and slowdowns were solved by fixing the stored procedures:
making sure all tables are accessed in order prevents deadlocks
fixing indexes and statistics makes everything faster (hence diminishes the chance of deadlock)
sometimes there was no real need of transactions, it just "looked" like it
sometimes transactions could be eliminated by making multiple statement stored procedures in single statement ones.
The use of shared resources is wrong in the long run. Because by reusing an existing environment you are creating more and more possibilities. Just review the busy beavers :) The way Erlang goes is the right way to produce fault-tolerant and easily verifiable systems.
But transactional memory is essential for many applications in widespread use. If you consult a bank with its millions of customers for example you can't just copy the data for the sake of efficiency.
I think monads are a cool concept to handle the difficult concept of changing state.
One approach I've heard of is to make a versioned insert only model where no updates ever occur. During selects the version is used to select only the latest rows. One downside I know of with this approach is that the database can get rather large very quickly.
I also know that some solutions, such as FogBugz, don't use enforced foreign keys, which I believe would also help mitigate some of these problems because the SQL query plan can lock linked tables during selects or updates even if no data is changing in them, and if it's a highly contended table that gets locked it can increase the chance of DeadLock or Timeout.
I don't know much about these approaches though since I've never used them, so I assume there are pros and cons to each that I'm not aware of, as well as some other techniques I've never heard about.
I've also been looking into some of the material from Carlo Pescio's recent post, which I've not had enough time to do it justice unfortunately, but the material seems very interesting.
If you are talking 'cloud computing' here, the answer would be to localize each transaction to the place where it happens in the cloud.
There is no need for the entire cloud to be consistent, as that would kill performance (as you noted). Simply, keep track of what is changed and where and handle multiple small transactions as changes propagate through the system.
The situation where user A updates record R and user B at the other end of cloud does not see it (yet) is the same as the one when user A didn't do the change yet in the current strict-transactional environment. This could lead to a discrepancy in an update-heavy system, so systems should be architectured to work with updates as less as possible - moving things to aggregation of data and pulling out the aggregates once the exact figure is critical (i.e. moving requirement for consistency from write-time to critical-read-time).
Well, just my POV. It's hard to conceive a system that is application agnostic in this case.
Try to make changes at the database level in the least number of possible instructions.
The general rule is to lock a resource the lest possible time. Using T-SQL, PLSQL, Java on Oracle or any similar way you can reduce the time that each transaction locks a shared resource. I fact transactions in the database are optimized with row-level locks, multi-version, and other kinds of intelligent techniques. If you can make the transaction at the database you save the network latency. Apart from other layers like ODBC/JDBC/OLEBD.
Sometimes the programmer tries to obtain the good things of a database ( It is transactional, parallel, distributed, ) but keep a caché of the data. Then they need to add manually some of the database features.