I don't have much experience with databases, so I don't know which is better for long-term performance, better practice, etc.
Here is my (hypothetical) case: imagine you have a database of customer information and history of purchase orders for each. You want to keep track of how much each customer is buying. I can think of two ways of calculating this:
1) Just do a SUM() every time it is needed. This is an easy solution, but the concern is that this database could be 20 years old with tens of thousands of rows for each customer. As more customer purchases are added to the database, the SUM() operation will take longer to calculate.
2) Keep the sum cached in the customer information table, and every time a purchase is added, updated, or deleted, update this cache. That way, no matter how many purchase orders there are, the calculation time won't increase. The downsides are that this is a less flexible solution (it gives only the sum over all rows; what about per-month sums or other intervals?) and that the cached value could get out of sync with the actual total somehow (technically it shouldn't happen, but it might).
So what should I be doing for this? I know I shouldn't be storing anything I can calculate from what's already in the database, but part of me doesn't like the fact that this type of calculation will get worse in time, and that there is some kind of elegance to option 2.

In database terms, there is no elegance in option 2 - it would be considered a hack that you might use as a last resort if your database gets to be truly humongous - not likely to happen for a novice setting it up for the first time (but possible).
There would be a lot of work maintaining the totals; and you'd be forever dealing with the question: "Why doesn't the detail add up to the total?"
Go for option 1 until you've proven you can't, which in most cases will be a long time.

What you're describing in option #2 is a case of premature optimization. Using the SUM() of all purchases is going to work for a really long time (years). When (if) you start to see this function degrade, you can add indexes or a totals table to your database to speed things up. Don't complicate things when a simple solution exists.
Of course, the real solution is to try both solutions with 20 years of made-up data and see if there's any real difference. I suspect there isn't.
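To make the "try it with made-up data" suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (purchases, customer_id, amount_cents) and the data volumes are assumptions for illustration, not something from the question, and timings will differ on other engines and hardware.

```python
import random
import sqlite3
import time

# Hypothetical schema: one row per purchase, keyed by customer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, amount_cents INTEGER)"
)
# An index on customer_id lets the engine read only one customer's rows.
conn.execute("CREATE INDEX idx_purchases_customer ON purchases (customer_id)")

random.seed(0)
rows = [
    (customer_id, random.randint(100, 100_000))
    for customer_id in range(1, 21)   # 20 made-up customers
    for _ in range(10_000)            # 10,000 made-up purchases each
]
conn.executemany(
    "INSERT INTO purchases (customer_id, amount_cents) VALUES (?, ?)", rows
)


def total_for(customer_id):
    """Option 1: compute the total on demand with SUM()."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM purchases "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total


start = time.perf_counter()
total = total_for(7)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"customer 7 total: {total} cents, computed in {elapsed_ms:.2f} ms")
```

Scaling the two range() values up to whatever "20 years of data" means for you gives a rough feel for whether option 1 is ever likely to become a problem.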

Almost always 1.
How often will you be querying for the total for 20 years' worth of history? If the answer is often, and the performance is poor, then you can think about optimisation or OLAP.
I suspect you're worrying about optimisation too early. This is the sort of thing databases are designed for - let them worry about caching.

I'll just add that another possibility is to create roll-up tables. For instance, when tracking page hits, it's not very useful necessarily to know that IP such-and-such accessed page1.php at 14:42:04 on 11/19/2008; but you might want to track daily stats for page1.php. In this case, at the end of each day you might run a procedure to sum up the hits for each page and create a record in a summary table, which can in turn be heavily indexed. Your reporting then can run against that table. In addition to speeding up reporting, it can also speed up writing the original records, as you don't need to worry about table locking or building indexes.
That said, good indexes can go a long way with reporting; and like others here have admonished, it's best to go with the easier, even if less optimal, solution until (if ever) it becomes a problem.
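A roll-up like the one described could be sketched roughly as follows, again with Python's sqlite3 module. The table names (page_hits, daily_page_stats) and the sample rows are made up for illustration; a real job would typically be scheduled to run once per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_hits (page TEXT, ip TEXT, hit_at TEXT);
    CREATE TABLE daily_page_stats (
        day TEXT, page TEXT, hits INTEGER,
        PRIMARY KEY (day, page)
    );
""")
conn.executemany(
    "INSERT INTO page_hits VALUES (?, ?, ?)",
    [
        ("page1.php", "10.0.0.1", "2008-11-19 14:42:04"),
        ("page1.php", "10.0.0.2", "2008-11-19 15:01:10"),
        ("page2.php", "10.0.0.1", "2008-11-19 16:30:00"),
        ("page1.php", "10.0.0.3", "2008-11-20 09:00:00"),
    ],
)


def roll_up(day):
    """Summarise one day's raw hits into daily_page_stats (safe to re-run)."""
    with conn:  # the delete and the insert are committed together
        conn.execute("DELETE FROM daily_page_stats WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_page_stats (day, page, hits)
            SELECT date(hit_at), page, COUNT(*)
            FROM page_hits
            WHERE date(hit_at) = ?
            GROUP BY date(hit_at), page
            """,
            (day,),
        )


roll_up("2008-11-19")
stats = conn.execute(
    "SELECT page, hits FROM daily_page_stats WHERE day = ? ORDER BY page",
    ("2008-11-19",),
).fetchall()
print(stats)  # [('page1.php', 2), ('page2.php', 1)]
```

Reports then read from daily_page_stats, and the raw page_hits table can stay lightly indexed.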

Kudos for thinking ahead, but the question that begs asking is: will your sales data stay in the transactional database for 20 years?
There comes a point where it's going to be a lot easier to move that data into a data warehouse, and just keep your frontline database current.
If this is a new project, be more concerned with making it work, and getting people to use it. Worry about scalability when you've crossed those bridges.

Use option 1. Later, if performance becomes poor, you can identify specific bottlenecks and resolve them with options like #2, or materialized views, or several other possibilities.


PostgreSQL - "Ten most frequent entries"

We've got a table with two columns: USER and MESSAGE
A USER can have more than one message.
The table is frequently updated with more USER-MESSAGE pairs.
I want to frequently retrieve the top X users that sent the most messages. What would be the optimal (DX- and performance-wise) solution for it?
The solutions I see myself:
I could GROUP BY and COUNT, however it doesn't seem like the most performant nor clean solution.
I could keep an additional table that'd keep count of every user's messages. On every message insertion into the main table, I could also update the relevant row here. Could the update be done automatically? Perhaps I could write a procedure for it?
For the main table, I could create a VIEW that'd have an additional "calculated" column - it'd GROUP BY and COUNT, but again, it's probably not the most performant solution. I'd query the view instead.
Please tell me whatever you think might be the best solution.
Some databases have incrementally updated views, where you create a view like in your example 3, and it automatically keeps it updated like in your example 2. PostgreSQL does not have this feature.
For your option 1, it seems pretty darn clean to me. Hard to get much simpler than that. Yes, it could have performance problems, but how fast do you really need it to be? You should make sure you actually have a problem before worrying about solving it.
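For reference, the plain GROUP BY version can be sketched like this. It runs against SQLite through Python's sqlite3 module rather than PostgreSQL, and the names messages, user_name and top_senders are assumptions for illustration; the query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; "user_name" stands in for the USER column.
conn.execute("CREATE TABLE messages (user_name TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [
        ("alice", "hi"), ("bob", "hello"), ("alice", "again"),
        ("carol", "hey"), ("alice", "third"), ("bob", "second"),
    ],
)


def top_senders(limit):
    """Return the `limit` users with the most messages, busiest first."""
    return conn.execute(
        """
        SELECT user_name, COUNT(*) AS message_count
        FROM messages
        GROUP BY user_name
        ORDER BY message_count DESC, user_name
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


print(top_senders(2))  # [('alice', 3), ('bob', 2)]
```

The secondary sort on user_name only makes ties deterministic; drop it if you don't care which of two equally busy users comes first.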
For your option 2, what you are looking for is a trigger. For each insertion, it would increment a count in the user table. If you ever delete, you would also need to decrease the count. Also, if you ever update an existing entry to change its user, the trigger would need to decrease the count of the old user and increase that of the new user. This will decrease concurrency, because if two processes try to insert messages from the same user at the same time, one will block until the other finishes. This may not matter much to you. Also, the mere existence of triggers imposes some CPU overhead, plus whatever the trigger itself actually does. But unless your server is already overloaded, this might not matter.
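As a rough sketch of the three triggers described above, here is a version that runs on SQLite via Python's sqlite3 module. Note the assumption: this is SQLite trigger syntax, and in PostgreSQL the same logic would live in a trigger function instead. The table, column and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        name TEXT PRIMARY KEY,
        message_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE messages (user_name TEXT REFERENCES users (name), message TEXT);

    -- Insert: bump the sender's count.
    CREATE TRIGGER messages_after_insert AFTER INSERT ON messages
    BEGIN
        UPDATE users SET message_count = message_count + 1
        WHERE name = NEW.user_name;
    END;

    -- Delete: lower the sender's count.
    CREATE TRIGGER messages_after_delete AFTER DELETE ON messages
    BEGIN
        UPDATE users SET message_count = message_count - 1
        WHERE name = OLD.user_name;
    END;

    -- Reassigning a message: move one count from the old user to the new one.
    CREATE TRIGGER messages_after_update AFTER UPDATE OF user_name ON messages
    BEGIN
        UPDATE users SET message_count = message_count - 1
        WHERE name = OLD.user_name;
        UPDATE users SET message_count = message_count + 1
        WHERE name = NEW.user_name;
    END;
""")

conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
conn.executemany(
    "INSERT INTO messages (user_name, message) VALUES (?, ?)",
    [("alice", "one"), ("alice", "two"), ("bob", "three")],
)
conn.execute("UPDATE messages SET user_name = 'bob' WHERE message = 'two'")
conn.execute("DELETE FROM messages WHERE message = 'three'")

counts = dict(conn.execute("SELECT name, message_count FROM users"))
print(counts)  # {'alice': 1, 'bob': 1}
```

The "top X" query then becomes a simple ORDER BY message_count DESC LIMIT X on users, at the cost of the extra write on every change to messages.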
Your option 3 doesn't make much sense to me, at least not in PostgreSQL. There is no performance benefit, and it would act to obscure rather than clarify what is going on. Anyone who can't understand a GROUP BY is probably going to have even more problems understanding a view which exists only to do a GROUP BY.
Another option is a materialized view. But you will see stale data from them between refreshes. For some uses that is acceptable, for some it is not.
The first and third solutions are essentially the same, since a view is nothing but a “crystallized” query.
The second solution would definitely make for faster queries, but at the price of storing redundant data. The disadvantages of such an approach are:
You run the risk of inconsistent data. You can reduce that risk somewhat by using triggers that automatically keep the data synchronized.
The performance of modifications of message will be worse, because the trigger will have to be executed, and each modification will also modify users (that is the natural place to keep such a count).
The decision should be based on the question whether the GROUP BY query will be fast enough for your purposes. If yes, use it and avoid the above disadvantages. If not, consider storing the extra count.

Database normalization - who's right?

My professor (who claimed to have a firm understanding about systems development for many years) and I are arguing about the design of our database.
As an example:
My professor insists this design is right:
(list of columns)
Notice the total units column. He said that this column must be included.
I tried to explain that it is unnecessary, because if you want it, then just make a query by simply adding the two.
I showed him an example I found in a book, but he insists that I don't have to rely on books too much in making our system.
The same thing applies to similar cases as in this one:
He wanted me to include the average! Everywhere I look, I find articles that convince me that this is a violation of normalization. If I needed the average, I could easily compute it from the three grades. He enumerated some scenarios, including: 'Hey! What if the query has been accidentally deleted? What will you do? That is why you need to include it in your table!'
Do I need to reconstruct my database (which consists of more than 40 tables) to comply with what he wants? Am I wrong, and have I just overlooked these things?
Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary. (Just compute the unit price of the product and the quantity.) He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, that it is needed for balancing transaction. Please tell me what you think.
You are absolutely correct! One of the rules of normalization is to remove attributes which can easily be deduced from other attributes' values, i.e., by performing some mathematical calculation. In your case, the total units column can be obtained by simply adding.
Tell your professor that having that particular column shows clear signs of transitive dependency, and according to the 3rd normalization rule, it's recommended to remove those.
You are right when you say your solution is more normalized.
However, there is a thing called denormalization (google for it), which is about deliberately violating normalization rules to increase query performance.
For instance, you want to retrieve the first five subjects (whatever the thing would be) ordered by decreasing number of total units.
Your solution would require a full scan on two tables (subject and unit), joining the resultsets and sorting the output.
Your professor's solution would require just taking first five records from an index on total_units.
This of course comes at the price of increased maintenance cost (both in terms of computational resources and development).
I can't tell you who is "right" here: we know nothing about the project itself, data volumes, queries to be made etc. This is a decision which needs to be made for every project (and for some projects it may be a core decision).
The thing is that the professor does have a rationale for this requirement which may or may not be just.
Why he hasn't explained everything above to you himself, is another question.
In addition to redskins80's great answer I want to point out why this is a bad idea: Every time you need to update one of the source columns you need to update the calculated column as well. This is more work that can contain bugs easily (maybe 1 year later when a different programmer is altering the system).
Maybe you can use a computed column instead? That would be a workable middle-ground.
Edit: Denormalization has its place, but it is the last measure to take. It is like chemotherapy: The doctor injects you poison only to cure an even greater threat to your health. It is the last possible step.
I think it is important to add this, because in my opinion the existing answers are not complete. The original question has been answered well, but there is a gap. So I take into account only the added question quoted below:
Another thing is that he wanted to include the total amount in the
payments table, which I believe is unnecessary(Just compute the unit
price of the product and the quantity.). He pointed out that we need
that column for computing debits and/or credits that are critical for
the overall system management, that it is needed for balancing
transaction. Please tell me what you think.
This edit is interesting. Given that this is a transactional system handling money, it has to be accountable. I'll use some basic terms: transaction, product, price, amount.
In that sense it is very common or even required to denormalize. Why? Because you need it to be accountable. So when the transaction is registered that's it, it may never ever be modified. If you need to correct it then you make another transaction.
Now yes you can calculate for example product price * amount * taxes etc. That makes sense in normalization sense. But then you will need a complete lockdown of all related records. So take for example the products table: If you change the price before the transaction it should be taken into account when the transaction happens. But if the price changes afterwards it does not affect the transaction.
So it is not acceptable to just join since that product might change. Example:
2012-01-01 price = 10
2012-01-05 price = 20
Transaction happens here, we sell 10 items so 10 * 20 = 200
2012-01-06 price = 22
Now we lookup the transaction at 2012-01-10, so we do:
SELECT transactions.amount * products.price AS totalAmount
FROM transactions
INNER JOIN products ON ...
That would give 10 * 22 = 220 so it is not correct.
So you have 2 options:
Do not allow updates on the products table. So you make that table versioned, so for every record you add a new INSERT instead of update. So the transaction keeps pointing at the right version of the product.
Or you just add the fields to the transactions table. So add totalAmount to the transactions table and calculate it (in a database transaction) when the transaction is inserted and save it.
Yes, it is denormalized but it has a good reason, it makes it accountable. You just know and it's verified with transactions, locks etc. that the moment that transaction happened it related to the described product with the price = 20 etc.
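The price example above can be replayed as a small runnable sketch with Python's sqlite3 module. The product_id column, the join condition and the helper names set_price and sell are assumptions for illustration; only amount, price and totalAmount come from the answer itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES products (id),
        amount INTEGER,       -- number of items sold
        totalAmount INTEGER   -- snapshot of amount * price at sale time
    );
    INSERT INTO products (id, price) VALUES (1, 10);  -- 2012-01-01
""")


def set_price(product_id, price):
    with conn:
        conn.execute(
            "UPDATE products SET price = ? WHERE id = ?", (price, product_id)
        )


def sell(product_id, amount):
    """Record a sale; one INSERT ... SELECT reads the price and stores the total."""
    with conn:
        conn.execute(
            """
            INSERT INTO transactions (product_id, amount, totalAmount)
            SELECT id, ?, ? * price FROM products WHERE id = ?
            """,
            (amount, amount, product_id),
        )


set_price(1, 20)  # 2012-01-05
sell(1, 10)       # 10 * 20 = 200 is stored with the transaction
set_price(1, 22)  # 2012-01-06

stored, recomputed = conn.execute(
    """
    SELECT t.totalAmount, t.amount * p.price
    FROM transactions AS t
    INNER JOIN products AS p ON p.id = t.product_id
    """
).fetchone()
print(stored, recomputed)  # 200 220
```

The stored value keeps what was actually charged, while the join silently picks up the later price.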
Next to that, and that is just a nice thing of denormalization when you have to do that anyway, it is very easy to run reports. Total transaction amount of the month, year etc. It is all very easy to calculate.
Normalization has good things, for example no double storage, single point of edit etc. But in this case you just don't want that concept since that is not allowed and not preferred for a transactions log database.
See a transaction as a registration of something happened in real world. It happened, you wrote it down. Now you cannot change history, it was written as it was. Future won't change it, it happened.
If you want to implement the good, old, classic relational model, I think what you're doing is right.
In general, it's actually a matter of philosophy. Some systems, Oracle being an example, even allow you to give up the traditional, relational model in favor of objects, which (by being complex structures kept in tables) violate the 1st NF but give you the power of object-oriented model (you can use inheritance, override methods, etc.), which is pretty damn awesome in some cases. The language used is still SQL, only extended.
I know my answer drifts away from the subject (as we take into consideration a whole new database type) but I thought it's an interesting thing to share on the occasion of a pretty general question.
Database design for actual applications is hardly the question of what tables to make. Currently, there are countless possibilities when it comes to keeping and processing your data. There are relational systems we all know and love, object databases (like db4o), object-relational databases (not to be confused with object relational mapping, what I mean is tools like Oracle 11g with its objects), xml databases (take eXist), stream databases (like Esper) and the currently thriving noSQL databases (some insist they shouldn't be called databases) like MongoDB, Cassandra, CouchDB or Oracle NoSQL
In case of some of these, normalization loses its sense. Each model serves a completely different purpose. I think the term "database" has a much wider meaning than it used to.
When it comes to relational databases, I agree with you and not the professor (although I'm not sure if it's a good idea to oppose him too strongly).
Now, to the point. I think you might win him over by showing that you are open-minded and that you understand that there are many options to take into consideration (including his views) but that the situation requires you to normalize the data.
I know my answer is quite a stream of consciousness for a stackoverflow post but I hope it's not received as lunatic babbling.
Good luck in the relational tug of war
You are talking about historical and financial data here. It is common to store some computations that will never change because that is the cost that was charged at the time. If you do the calc from product * price and the price changed 6 months after the transaction, then you have the incorrect value. Your professor is smart, listen to him. Further, if you do a lot of reporting off the database, you don't want to often calculate values that are not allowed to be changed without another record of data entry. Why perform calculations many times over the history of the application when you only need to do it once? That is wasteful of precious server resources.
The purpose of normalization is to eliminate redundancies so as to eliminate update anomalies, predominantly in transactional systems. Relational is still the best solution by far for transaction processing, DW, master data and many BI solutions. Most NOSQLs have low-integrity requirements. So you lose my tweet - annoying but not catastrophic. But to lose my million dollar stock trade is a big problem. The choice is not NOSQL vs. relational. NOSQL does certain things very well. But Relational is not going anywhere. It is still the best choice for transactional, update oriented solutions. The requirements for normalization can be loosened when the data is read-only or read-mostly. That's why redundancy is not such a huge problem in DW; there are no updates.

Is it bad to not use normalised tables in this database?

I recently learned about normalisation in my informatics class and I'm developing a multiplayer game using SQLite as backend database at the moment.
Some information on it:
The simplified structure looks a bit like the following:
player_id | level | exp | money | inventory
1 | 3 | 120 | 400 | {item a; item b; item c}
Okay. As you can see, I'm storing a table/array in string form in the column "inventory". This is against normalization.
But the thing is: Making an extra table for the inventory of players brings only disadvantages for me!
The only points where I access the database is:
When a player joins the game and his profile is loaded
When a player's profile is saved
When a player joins, I load his data from the DB and store it in memory. I only write to the DB like every five minutes when the player is saved. So there are actually very few SQL queries in my script.
If I used an extra table for the inventory I would have to, upon loading:
Perform a more performance- and probably more data-intensive query to fetch all items from the inventory table which belong to player X
Walk through the results and convert them into a table for storage in memory
And upon saving:
Delete all items from the inventory table which belong to player X (player might have dropped/sold some items?)
Walk through the table and perform a query for each item the player owns
If I kept all the player data in one table:
I'd only have one query for saving and loading
Everything would be in one place
I would only have to (de)serialize the tables upon loading and saving, in my script
What should I do now?
Do my arguments and situation justify working against normalisation?
Are you saying that you think parsing a string out of "inventory" doesn't take any time or effort? Because everything you need to do to store/retrieve inventory items from a sub table is something you'd need to do with this string, and with the string you don't have any database tools to help you do it.
Also, if you had a separate subtable for inventory items, you could add and remove items in real time, meaning that if the app crashes or the user disconnects, they don't lose anything.
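For comparison, loading and saving through a separate inventory table might look roughly like this with Python's sqlite3 module. The table layout, the quantity column and the function names are assumptions for illustration; the point is only that each direction is a handful of statements, with the save wrapped in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (
        player_id INTEGER PRIMARY KEY, level INTEGER, exp INTEGER, money INTEGER
    );
    CREATE TABLE inventory (
        player_id INTEGER REFERENCES players (player_id),
        item TEXT,
        quantity INTEGER,
        PRIMARY KEY (player_id, item)
    );
    INSERT INTO players VALUES (1, 3, 120, 400);
""")


def save_inventory(player_id, items):
    """Replace the stored inventory with the in-memory dict {item: quantity}."""
    with conn:  # the delete and the inserts commit (or roll back) as one unit
        conn.execute("DELETE FROM inventory WHERE player_id = ?", (player_id,))
        conn.executemany(
            "INSERT INTO inventory (player_id, item, quantity) VALUES (?, ?, ?)",
            [(player_id, item, qty) for item, qty in items.items()],
        )


def load_inventory(player_id):
    """One query; the rows map straight onto a dict for use in memory."""
    return dict(
        conn.execute(
            "SELECT item, quantity FROM inventory WHERE player_id = ?",
            (player_id,),
        )
    )


save_inventory(1, {"item a": 1, "item b": 2, "item c": 1})
save_inventory(1, {"item a": 1, "item c": 3})  # the player dropped item b
print(load_inventory(1))  # {'item a': 1, 'item c': 3}
```

Nothing stops you from also writing single rows as items are picked up or dropped, which is what makes the real-time saving mentioned above possible.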
There are a lot of possible answers, but the one that works for you is the one to choose. Keep in mind, your choice may need to change over time.
If the amount of data you need to persist is small (ie: fits into a single table row) and you only need to update that data infrequently, and you don't have any reason to care about subsets of that data, then your approach makes sense. As time goes on and your players gain more items and you add more personalization to the game, you may begin to push up against the limits of SQLite, and you'll need to evolve your design. If you discover that you need to be able to query the item list to determine which players have what items, you'll need to evolve your design.
It's generally considered a good idea to get your data architecture right early, but there's no point in sitting in meetings today trying to guess how you'll use your software in 5-10 years. Better to get a design that meets this year's needs, and then plan to re-evaluate the design again after a year.
What's going to happen when you have one hundred thousand items in your inventory and you only want to bring back two?
If this is something that you're throwing together for a one off class and that you won't ever use again, then yes, the quick and dirty route might be a quicker option for you.
However if this is something you're going to be working on for a few months, then you're going to run into long-term issues with that design decision.
No, your arguments aren't valid. They basically boil down to "I want to do all of this processing in my client code instead of in SQL and then just write it all to a single field" because you are still doing all of the exact same processing to generate the string. By doing this you are removing the ability to easily load a small portion of the list and losing relationships to the actual item table which could contain more information about the items (I assume you're hard coding it all based on names instead of using internal item IDs which is a really bad idea, imo).
Don't do it. Long term the approach you are wanting to take will generate a lot more work for you as your needs evolve.
Another case of premature optimization.
You are trying to optimize something for which you don't have any performance metrics. What is the target platform? Even the crappiest computers nowadays could run at least hundreds of your read operations per second. Then you can add better hardware for more users, then you can go to the cloud, and when you get into the problem space that Google, Twitter and Facebook are dealing with, you can consider denormalizing. Even then, the best solution is some sort of key-value database.
Maybe you should check the Wikipedia article on Database Normalization to remind yourself why a normalized database is a good thing.
You should also think about the items. Are the items unique for every user, or could user1 have item1 and user2 have item1 too? If you now want to change item1, you have to go through your whole table and check which users have this item. If you normalized your table, this would be much easier.
But in the end, I think the answer is: it depends.
Do my arguments and situation justify
working against normalisation?
Not based on what I've seen so far.
Normalized database designs (appropriately indexed and with efficient usage of the database with UPSERTS, transactions, etc) in general-purpose engines will generally outperform code except where code is very tightly optimized. Typically in such code, some feature of the general purpose RDBMS engine is abandoned, such as one of the ACID properties or referential integrity.
If you want to have very simple data access (you tout one table, one query as a benefit), perhaps you should look at a document centric database like mongodb or couchdb.
The reason that you use any technology is to leverage the technology's advantages. SQL has many advantages that you seem to not want to use, and that's fine, if you don't need them. In Neal Stephenson's Zodiac, the main character mentions that few things bought from a hardware store are used for their intended purpose. Software's like that, too. What counts is that it works, and it works nearly 100% of the time, and it works fast enough.
And yet, I can't help but think that someday you're going to have some overpowered item released into the wild, and you're going to want to deal with this problem at the database layer. Say you accidentally gave out some superinstakillmegadeathsword inventory items that kill everything within 50 meters on use (wielder included), and you want to remove those things from play. As an apology to the people who lose their superinstakillmegadeathsword items, you want to give them 100 money for each superinstakillmegadeathsword you take away.
With a properly normalized database structure, that's a trivial task. With a denormalized structure, it's quite a bit harder and slower. A normalized database is also going to be easier to expand on the design in the future.
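To show how small that task can be, here is a sketch of the sword recall against a hypothetical normalized layout (players plus an inventory table with one row per player and item), using Python's sqlite3 module. The schema and the sample players are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, money INTEGER);
    CREATE TABLE inventory (
        player_id INTEGER REFERENCES players (player_id),
        item TEXT,
        quantity INTEGER,
        PRIMARY KEY (player_id, item)
    );
    INSERT INTO players VALUES (1, 400), (2, 50), (3, 0);
    INSERT INTO inventory VALUES
        (1, 'superinstakillmegadeathsword', 2),
        (1, 'item a', 1),
        (2, 'superinstakillmegadeathsword', 1),
        (3, 'item b', 5);
""")

BANNED = "superinstakillmegadeathsword"

with conn:  # compensation and removal succeed or fail together
    # Pay 100 money per sword to every player who owns at least one.
    conn.execute(
        """
        UPDATE players
        SET money = money + 100 * (
            SELECT quantity FROM inventory
            WHERE inventory.player_id = players.player_id AND item = ?
        )
        WHERE player_id IN (SELECT player_id FROM inventory WHERE item = ?)
        """,
        (BANNED, BANNED),
    )
    # Then take the swords out of play.
    conn.execute("DELETE FROM inventory WHERE item = ?", (BANNED,))

balances = conn.execute(
    "SELECT player_id, money FROM players ORDER BY player_id"
).fetchall()
print(balances)  # [(1, 600), (2, 150), (3, 0)]
```

With the inventory serialized into a string column, the same fix means loading, parsing, editing and rewriting every player's row in application code.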
So are you sure you don't want to normalize your database?

Where's the tradeoff between normalization (SQL View) and performance/reliability (SQL Table)?

I've got quite a long business process which eventually results in financial operations.
What matters in the end is almost exclusively these final operations, although I've got to keep a log of everything which led to them.
Since all the information contained in the final operations is available in other tables (used during the business process), it makes sense to use a view, but the view logic would be quite complicated (there are dozens of tables involved), and I'm concerned that:
even with appropriate indexes, a table will probably be way faster (my table will eventually contain millions of items, and should be fully searchable on almost all its columns)
the view logic would be complicated, so I'm afraid it may complicate things in a few years if I want to evolve my business logic.
Because of those two reasons, I'm a bit tempted to write the data in a table at the end of my business process instead of relying on a view, but duplicating the data doesn't smell right (and it also looks a bit like premature optimization, but since it's such a central point in my design, I'd like to address the issue ASAP).
Have you ever faced such a choice? What did you decide?
Edit : creating a table would clearly lead to duplication in my situation, ie. the data written in the table exists somewhere else in the database and could be retrieved using only joins without any calculations.
I think you answered your own question by writing it down, Brann.
This problem can be seen this way: on the one hand you have "real-time data". You have fresh data, and it's nice to create a view that shows "real-time data" from it too.
But as time goes on, there is more data and the logic changes. So it's good to have written-down summaries of the data you had some time ago. It's very pragmatic: you are not duplicating data, because you recalculate it and save a summary of it into a new table.
So when you think of it this way, it's obvious that in this example a new table will be better. As you write:
Faster access
Can have more complicated logic
Have archive data unchanged when logic changes
So when this (or part of this) list matches your requirements, then it's not a choice: you go with tables.
I would use a view only when showing fresh data derived from other fresh data, in very, very simple problems. When it gets more complicated, you always switch to a new table.
So do not be afraid to go for it. Having one summary table with faster access is a very pretty solution, and it's a sign of a well-formed database.
Take care with the design of this table, so that when the business logic changes you do not need to change everything in this table from the ground up. And then everything will be OK!
I'm for the new table in this situation. The view has many disadvantages - performance clearly, complexity, and logic lock in. However, IMHO the over-arching reason is that as the underlying data changes, so the value in your view will change also. In most instances this is a good thing; however, with financial operations, isn't it better to have a fixed record of what occurred?
I always decide to have better normalization. In your case, though the view may be complicated, it's better to have that than a new table which has to be kept in sync with every data-changing operation. Plus, the view would always be current, while your end-of-business-day table population would only be current for a few hours a day.
Also, you have a bigger problem if the data in this table goes out of sync for whatever reason.
As MrTelly alluded to, are you sure that your end result table really is a duplication of the view data? Or, is it actually a record of the final action taken as a result of the items in the view data.
For a clearer example, let's say that every time my gas tank gets to half-empty I buy $10 of gas. I write this down in a log. One day I buy my gas and write it in my log then later find out that my fuel gauge was broken and I really had 3/4 a tank of gas. Should I now erase the $10 purchase from my log because the underlying data (the level of gas in my tank) has changed? Ok, maybe that's not a clearer example, but hopefully it gets the point across. Recording the results is a different thing from recording the events that led up to the result. This is especially true in financial application. Therefore, I don't know that you're breaking normalization at all with storing the final outcome in its own table.
An indexed view is the way. There are quite a few limitations to this approach, and it has some overhead issues if implemented incorrectly, but it's generally favorable. With this approach you won't need to keep track of the changes that take place in your base tables and the data would accumulate itself nicely in that indexed view of yours. In theory.
Improving Performance with SQL Server 2005 Indexed Views
Oracle: Materialized View Concepts and Architecture

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it was <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. (Premature optimization is the root of all evil, as Knuth once said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
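Since the question is engine-agnostic, here is one way to look at a query plan, sketched for SQLite through Python's sqlite3 module. The table, index and column names are made up for illustration, and other engines have their own equivalents (for example EXPLAIN in PostgreSQL and MySQL) with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO items (category, name) VALUES (?, ?)",
    [(f"cat{i % 50}", f"item{i}") for i in range(10_000)],
)

query = "SELECT id, name FROM items WHERE category = ?"


def plan(sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


before = plan(query, ("cat7",))  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_items_category ON items (category)")
after = plan(query, ("cat7",))   # the plan now mentions idx_items_category

print(before)
print(after)

rows = conn.execute(query, ("cat7",)).fetchall()
print(len(rows))  # 200
```

The exact wording of the plan text varies between SQLite versions, so treat it as something to read rather than something to parse.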
First of all, let us dismiss #2. Searching tables is a data server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is done. If it's in the application code, as in #2, then you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
if you do this:
SELECT * FROM users;
MySQL has to do two things: look up which fields the table has, and then bring back the data you asked for.
SELECT id, email, password FROM users;
Here MySQL only has to fetch the data, since the fields are explicit.
About limits: it is always best to query only the number of rows you need, no more, no less. More data means more time to transfer it.