Any SQL database: When is it better to fetch a whole table instead of querying for particular rows? - sql

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.

with 10K to 100K rows, number 1 is the clear winner to me. If it was <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.

Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.

I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look on how you could optimize it using caching. (Pre optimization is the root of all evil, Dijkstra once said).
Also, remember that if you would choose option 3, you'll be sending the complete table-contents over the network as well. This also has an impact on performance .

In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.

First of all, let us dismiss #2. Searching tables is data servers reason for existence, and they will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say 'filter the output as needed" without saying where that filter is been done. If it's in the application code as in #2, than, as with #2, than you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.

The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).

Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.

Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?

if you do this:
SELECT * FROM users;
mysql should perform two queries: one to know fields in the table and another to bring back the data you asked for.
doing
SELECT id, email, password FROM users;
mysql only reach the data since fields are explicit.
about limits: always ss best query the quantity of rows you will need, no more no less. more data means more time to drive it

Related

PostgreSQL - "Ten most frequent entries"

We've got a table with two colums: USER and MESSAGE
An USER can have more than one message.
The table is frequently updated with more USER-MESSAGE pairs.
I want to frequently retrieve the top X users that sent the most messages. What would be the optimal (DX and performnce wise) solution for it?
The solutions I see myself:
I could GROUP BY and COUNT, however it doesn't seem like the most performant nor clean solution.
I could keep an additional table that'd keep count of every user's messages. On every message insertion into the main table, I could also update the relevant row here. Could the update be done automaticaly? Perhaps I could write a procedure for it?
For the main table, I could create a VIEW that'd have an additional "calculated" column - it'd GROUP BY and COUNT, but again, it's probably not the most performant solution. I'd query the view instead.
Please tell me whatever you think might be the best solution.
Some databases have incrementally updated views, where you create a view like in your example 3, and it automatically keeps it updated like in your example 2. PostgreSQL does not have this feature.
For your option 1, it seems pretty darn clean to me. Hard to get much simpler than that. Yes, it could have performance problems, but how fast do you really need it to be? You should make sure you actually have a problem before worrying about solving it.
For your option 2, what you are looking for is a trigger. For each insertion, it would increment a count in the user table. If you ever delete, you would also need to decrease the count. Also, if ever update to change the user of an existing entry, the trigger would need to decrease the count of the old user and increase it of the new user. This will decrease the concurrency, as if two processes try to insert messages from the same user at the same time, one will block until the other finishes. This may not matter much to you. Also, the mere existence of triggers imposes some CPU overhead, plus whatever the trigger itself actually does. But unless our server is already overloaded, this might not matter.
Your option 3 doesn't make much sense to me, at least not in PostgreSQL. There is no performance benefit, and it would act to obscure rather than clarify what is going on. Anyone who can't understand a GROUP BY is probably going to have even more problems understanding a view which exists only to do a GROUP BY.
Another option is a materialized view. But you will see stale data from them between refreshes. For some uses that is acceptable, for some it is not.
The first and third solutions are essentially the same, since a view is nothing but a “crystallized” query.
The second solution would definitely make for faster queries, but at the price of storing redundant data. The disadvantages of such an approach are:
You are running danger of inconsistent data. You can reduce that danger somewhat by using triggers that automatically keep the data synchronized.
The performance of modifications of message will be worse, because the trigger will have to be executed, and each modification will also modify users (that is the natural place to keep such a count).
The decision should be based on the question whether the GROUP BY query will be fast enough for your purposes. If yes, use it and avoid the above disadvantages. If not, consider storing the extra count.

are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. however, many of these posts are very old - 5 to 10 years! it seems, from many of these posts, that the performance concerns may have actually been quite small, even in their time, and as to the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.
and so: are there concerns with SELECT * that are still relevant today?
greater context: I've started working at a place with a lot of old MS code (ASP scripts, and the like), and I've been helping to modernize a lot of it, however: most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. ALSO: my co-workers are a little older than I am, and have some concerns that - to me - seem "older". ("nullable fields are slow! avoid them!") but again: in this particular field, they definitely have more experience than I do.
for this reason, I'd also like to ask: whether SELECT * with modern ORMs is or isn't safe and sane to do today, are there recent online resources which indicate such?
thanks! :)
I will not touch maintainability in this answer, only performance part.
Performance in this context has little to do with ORMs.
It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.
It is still a bad idea to select columns that you don't need.
It doesn't really matter from the performance point of view whether the query looks like:
SELECT * FROM Table
or all columns are listed there explicitly, like:
SELECT Col1, Col2, Col3 FROM Table
If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.
Why selecting unnecessary columns is a bad idea:
extra bytes to read from disk
extra bytes to transfer over the network
extra bytes to parse on the client
But, the most important reason is that optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index, but if you request more columns, it would do extra lookups or use some other index, or just scan the whole table. The final impact can vary from negligible to seconds vs hours of run time. The larger and more complicated the database, the more likely you see the noticeable difference.
There is a detailed article on this topic Myth: Select * is bad on the Use the index, Luke web-site.
Now that we have established a common understanding of why selecting
everything is bad for performance, you may ask why it is listed as a
myth? It's because many people think the star is the bad thing.
Further they believe they are not committing this crime because their
ORM lists all columns by name anyway. In fact, the crime is to select
all columns without thinking about it—and most ORMs readily commit
this crime on behalf of their users.
I'll add answers to your comments here.
I have no idea how to approach an ORM that doesn't give me an option which fields to select. I personally would try not to use it. In general, ORM adds a layer of abstraction that leaks badly. https://en.wikipedia.org/wiki/Leaky_abstraction
It means that you still need to know how to write SQL code and how DBMS runs this code, but also need to know how ORM works and generates this code. If you choose not to know what's going on behind ORM you'll have unexplainable performance problems when your system grows beyond trivial.
You said that at your previous job you used ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and the nature of the system allowed to hide performance questions behind the cache (it is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, usually it will be reasonably fast in any case. It begins to matter only when you cross the certain threshold. After which suddenly everything becomes slow and it is hard to fix it.
It is common for a business/project manager to ignore the possible future problems which may never happen. Business always has more pressing urgent issues to deal with. If business/system grows enough when performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or if the system happens to be really critical to the business, just fail and give a chance to another company to overtake it.
Answering your question "whether to use ORMs in applications where performance is a large concern". Of course you can use ORM. But, you may find it more difficult than not using it. With ORM and performance in mind you have to inspect manually the SQL code that ORM generates and make sure that it is a good code from performance point of view. So, you still need to know SQL and specific DBMS that you use very well and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?
You may think that this situation with ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it is not. Modern C++ compiler will indeed in most cases generate code that is better than what you can write manually in assembler. But, compiler knows processor very well and the nature of the optimisation task is much simpler than what you have in the database. ORM has no idea about the volume of your data, it knows nothing about your data distribution.
The simple classic example of top-n-per-group can be done in two ways and the best method depends on the data distribution that only the developer knows. If performance is important, even when you write SQL code by hand you have to know how DBMS works and interprets this SQL code and lay out your code in such a way that DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). DBMS has some statistics and its optimiser tries to use it, but it is often not enough.
And now on top of this you add another layer of ORM abstraction.
Having said all this, "performance" is a vague term. All these concerns become important after a certain threshold. Since modern hardware is pretty good, this threshold had been pushed rather far to allow a lot of projects to ignore all these concerns.
Example. An optimal query over a table with million rows returns in 10 milliseconds. A non-optimal query returns in 1 second. 100 times slower. Would the end-user notice? Maybe, but likely not critical. Grow the table to billion rows or instead of one user have 1000 concurrent users. 1 second vs 100 seconds. The end-user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as data grows, because various caches would become less and less useful.
From a SQL-Server-Performance-Point-of-view, you should NEVER EVER use select *, because this means to sqlserver to read the complete row from disk or ram. Even if you need all fields, i would suggest to not do select *, because you do not know, who is appending any data to the table that your application does NOT need. For Details see answer of #sandip-patel
From a DBA-perspective: If you give exactly those columnnames you need the dbadmin can better analyse and optimize his databases.
From a ORM-Point-Of-View with changing column-names i would suggest to NOT use select *. You WANT to know, if the table changes. How do you want to give a guarantee for your application to run and give correct results if you do not get errors if the underlying tables change??
Personal Opinion: I really do not work with ORM in Applications needing to perform well...
This question is out some time now, and noone seems to be able to find, what Ben is looking for...
I think this is, because the answer is "it depends".
There just NOT IS THE ONE answer to this.
Examples
As i pointed out before, if a database is not yours, and it may be altered often, you cannot guarantee performance, because with select * the amount of data per row may explode
If you write an application using ITS OWN database, noone alters your DB (hopefully) and you need your columns, so whats wrong with select *
If you build some kind of lazy loading with "main properties" beeing loaded instantly and others beeing loaded later (of same entity), you cannot go with select * because you get all
If you use select * other developers will every time think about "did he think about select *" as they will try to optimize. So you should add enough comments...
If you build 3-Tier-Application building large caches in the middle-Tier and performance is a theme beeing done by cache, you may use select *
Expanding 3Tier: If you have many many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle-Tier with every byte beeing wasted (as someone pointed out in the comments before)
If you build a small app for 3 users and some thousands of records, the budget may not give time to optimize speed/db-layout/something
Speak to your dba... HE will advice you WHICH statement has to be changed/optimized/stripped down/...
I could go on. There just is not ONE answer. It just depends on to many factors.
It is generally a better idea to select the column names explicitly. Should a table receive an extra column it would be loaded with a select * call, where the extra column is not needed.
This can have several implications:
More network traffic
More I/O (got to read more data from disk)
Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)
Possibly even more CPU (a covering index cannot be used so data needs sorting)
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2 Where column = t1.colA)
More Details -1
More Details -2
More Details -3
Maintainability point.
If you do a "Select * from Table"
Then I alter the Table and add a column.
Your old code will likely crash as it now has an additional column in it.
This creates a night mare for future revisions because you have to identify all the locations for the select *.
The speed differences is so minimal I would not be concerned about it. There is a speed difference in using Varchar vs Char, Char is faster. But the speed difference is so minimal it is just about not worth talking about.
Select *'s biggest issue is with changes (additions) to the table structure.
Maintainability nightmare. Sign of a Junior programmer, and poor project code. That being said I still use select * but intend to remove it before I go to production with my code.

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports ( 150 - 200 ) columns of data per report.
Each item is a sub-query or an element from a view. The data has to be real time, so DW style batch processing is not an option. We also don't use any BI tools , just a java app that generates Excel ( its a requirement to output data in Excel)
The query also contains unions as feeds from other systems.
The queries result in very large SQL ( about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient , its mostly width of the query , managing 200 columns is a challenge in itself.
I deal with queries this length daily and here is some of what helps me out in manitaining them:
First alias every single one of the those columns. When you are building it you may know where each one came from but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put columns on a separate line for each one and do the same for where clauses. At times you may have to troublshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plently of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require push back on the requirements); should you have 3 lines (one for each child record), or should you put that data in a comma delimited list or should you pick only one of the many records and have one line using aggregation. If the latter, what is the criteria for choosing the record you want to keep?
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance on the other hand you may want to consider creating temp tables or even materialized views (which are fixed views) to break up the heavier parts of your process.
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.

Refactoring "extreme" SQL queries

I have a business user who tried his hand at writing his own SQL query for a report of project statistics (e.g. number of tasks, milestones, etc.). The query starts off declaring a temp table of 80+ columns. There are then almost 70 UPDATE statements to the temp table over almost 500 lines of code that each contain their own little set of business rules. It finishes with a SELECT * from the temp table.
Due to time constraints and 'other factors', this was rushed into production and now my team is stuck with supporting it. Performance is appalling, although thanks to some tidy up it's fairly easy to read and understand (although the code smell is nasty).
What are some key areas we should be looking at to make this faster and follow good practice?
First off, if this is not causing a business problem, then leave it until it becomes a problem. Wait until it becomes a problem, then fix everything.
When you do decide to fix it, check if there is one statement causing most of your speed issues ... issolate and fix it.
If the speed issue is over all the statements, and you can combine it all into a single SELECT, this will probably save you time. I once converted a proc like this (not as many updates) to a SELECT and the time to run it went from over 3 minutes to under 3 seconds (no shit ... I couldn't believe it). By the way, don't attempt this if some of the data is coming from a linked server.
If you don't want to or can't do that for whatever reason, then you might want to adjust the existing proc. Here are some of the things I would look at:
If you are creating indexes on the temp table, wait until after your initial INSERT to populate it.
Adjust your initial INSERT to insert as many of the columns as possible. There are probably some update's you can eliminate by doing this.
Index the temp table before running your updates. Do not create indexes on any of the columns targetted by the update statements until after their updated.
Group your updates if your table(s) and groupings allow for it. 70 updates is quite a few for only 80 columns, and sounds like there may be an opportunity to do this.
Good luck
First thing I would do is check to make sure there is an active index maintenance job being run periodically. If not, get all existing indexes rebuilt or if not possible at least get statistics updated.
Second thing I would do is set up a trace (as described here) and find out which statements are causing the highest number of reads.
Then I would run in SSMS with 'show actual execution plan' and tally the results with the trace. From this you should be able to work out whether there are missing indexes that could improve performance.
EDIT: If you are going to downvote, please leave a comment as to why.
Just like any refactoring, make sure you have an automated way to verify your refactorings after each change (you can write this yourself using queries which check the development output against a known good baseline). That way, you are always matching the known good data. This will give you a high degree of confidence in the correctness of your approach when you enter the phase where you are deciding whether to switch over to your new version of the process and want to run side by side for a few iterations to ensure correctness.
I also like to log all the test batches and the run times of the processes within the batch, so I can tell if some particular process within the batch was adversely affected at some point in time. I can get average times for processes and see trends of improvement or spot potential problems. This also lets me identify the low-hanging fruit within the batch where I can make the most improvement.
There are then almost 70 UPDATE
statements to the temp table over
almost 500 lines of code that each
contain their own little set of
business rules. It finishes with a
SELECT * from the temp table.
Actually this sounds like it can be followed and understood quite well, each update statement does one thing to the table with a specific purpose and set of business rules. I think that maintaining procedures of 500 lines of code with one or a couple of select statements that does "everything", built with 15 or so joins, and case statements etc scattered all over the place, is a lot harder to maintain. Although it would make for better performance..
It's a bit of a dilemma with SQL, that writing clear and concise code (using multiple updates, creating functions etc) always seems to have a big negative impact on performance. Trying to do everything at once, which is considered bad practice in other programming languages, seems to be the very core of set oriented languages.
If this is a report generating stored procedure, how often is it being run? If it's only necessary to run it once a day and is run during the night how much of an issue is the performance?
If it's not I'd recommend being careful in your choice to re-write it because there is a chance that you could muck up your figures.
Also it sounds like the sort of thing that should be pulled out into an SSIS package building up a new permanent table with the results so it only has to be run once.
Hope this makes sense
One thing you could try is to replace the temp table with a table variable. There are times when this is faster and times when it is not, you will have to just try it and see.
Look at the 70 update statements. It is possible to combine any of them? If the person writing did not use CASE statments, it might be possible to do fewer statements.
Other obvious things to look at - eliminate any cursors, change any subqueries to joins to tables or derived tables.
Rewrite perhaps. One hardware solution would be to make sure your database temp table goes on a 'fast' drive, perhaps a solid state disk (SSD), or can be managed all in memory.
My guess is this 'solution' was developed by someone with a grasp of and a dependency upon spreadsheets, someone who may not be very savvy on 'normalized' databases--how to construct and populate tables to retain data for reporting purposes, something which perhaps BI Business Intelligence software can be utilized with sophistication and yet be adaptable.
You didn't say 'where' the update process is being run. Is the update process being run as a SQL script from a separate computer (desktop) against the server where the data is? There can be significant bottlenecks and overhead created by that approach. If so, consider running the entire update process directly on the server as a local job, as a compiled stored procedure, bypassing the network and (multiple) cursor management overhead. It could have a scheduled time to run and a controlled priority, completing in off peak business data usage hours.
Evaluate how often 'commit' statements are really needed for the sequence of update statements...saving on a bunch of commit lines could notably improve the overall update time. There may be a couple of settings in the database client driver software which can make a notable difference.
Can the queries used for update conditions be factored out as static 'views' which in turn can be shared across multiple update statements? Views can keep in memory data/query rows frequently accessed. There may be performance tuning in determining how much update data can be pended before a commit is optimal.
It might be worth evaluating whether Triggers could be used to replace the batch job update sequence. You don't say from how many tables the data used comes from...that might help with decision making. I don't know if you have the option of adding triggers to the database tables from which the data is gathered. If so, adding a few triggers to a number of tables wouldn't really degrade overall system performance much, but might save a big wad of time on that update process. You could try replacing the update statements one at a time with triggers and see if the results are the same as before. Create a similar temp table, based on the same update process, then carefully test whether triggers feeding updates to the temp table could replace individual update statements. Perhaps you may have a sort of 'Data Warehouse' application. It might be useful to review how to set up a 'star' schema of tables to retain summarized business data for reporting.
Creating a comprehensive and cached 'view' which updates via the queries once per day, reflecting the updates might be another approach to explore.
Well, since the only thing you've told us about this stored procedure is that it has a 80+ column temp table, the only thing I can recommend is to remove that table, and rewrite the rest to remove the need for it.
You should get a tool that allows you to get an explain plan of all queries your app will run. It is the best bang for the buck on an SQL heavy app for performace increases. If you read and react to what the Explain Plan is telling you. If you are on Oracle what we used to use was TOAD by Qwest(?) I think. It was a great tool.
I would recommend looking at the tables involved, the end result, and starting from scratch to see if the query can be done in a more efficient manner. Keep the query to verify that the new one is working exactly the same as the old one, but try to forget all methods used to obtain the end result.
I would rewrite it from scratch.
You say that you understand what it supposed to do so it should not be that difficult. And I bet that the requirements for that piece of code will keep changing so if you do not rewrite it now you may end up maintaining some ugly monster

use SUM() or caching

I don't have much experience with databases, so I don't know which is better for long-term performance, better practice, etc.
Here is my (hypothetical) case: imagine you have a database of customer information and history of purchase orders for each. You want to keep track of how much each customer is buying. I can think of two ways of calculating this:
1) Just do a SUM() every time it is needed. This is an easy solution, but the concern is that this database could be 20 years old with tens of thousands of rows for each customer. As more customers purchases are added to the database, the SUM() operation will take longer to calculate.
2) Keep the sum cached in the customer information table, and every time a new purchase is made (updated, deleted, etc), update this cache. That way no matter how many purchase orders there are, there won't be an increase in the calculation time. The downside is that, this is a less flexible solution (only the sum over all rows, what about per month sum? other intervals? etc); this cached value could get out of sync with the actual total somehow (technically shouldn't happen, but it might)
So what should I be doing for this? I know I shouldn't be storing anything I can calculate from what's already in the database, but part of me doesn't like the fact that this type of calculation will get worse in time, and that there is some kind of elegance to option 2.
In database terms, there is no elegance in option 2 - it would be considered a hack that you might use as a last resort if your database gets to be truly humongous - not likely to happen for a novice setting it up for the first time (but possible).
There would be a lot of work maintaining the totals; and you'd be forever dealing with the question: "Why doesn't the detail add up to the total?"
Go for option 1 until you've proven you can't. Which will in most cases be a long time.
What you're describing in option #2 is a case of premature optimization. Using the SUM() of all purchases is going to work for a really long time (years). When (if) you start to see this function degrade, you can add indexes or a totals table to your database to speed things up. Don't complicate things when a simple solution exists.
Of course, the real solution is to try both solutions with 20 years of made-up data and see if there's any real difference. I suspect there isn't.
Almost always 1.
How often will you be querying for the total for 20 year's worth of history? If the answer is often, and the performance is poor, then you can think about optimisation or OLAP.
I suspect you're worrying about optimisation too early. This is the sort of thing databases are designed for - let them worry about caching.
I'll just add that another possibility is to create roll-up tables. For instance, when tracking page hits, it's not very useful necessarily to know that IP such-and-such accessed page1.php at 14:42:04 on 11/19/2008; but you might want to track daily stats for page1.php. In this case, at the end of each day you might run a procedure to sum up the hits for each page and create a record in a summary table, which can in turn be heavily indexed. Your reporting then can run against that table. In addition to speeding up reporting, it can also speed up writing the original records, as you don't need to worry about table locking or building indexes.
That said, good indexes can go a long way with reporting; and like others here have admonished, it's best to go with the easier, even if less optimal, solution until (if ever) it becomes a problem.
Kudos for thinking ahead, but the question that begs asking, is your sales data going to stay in the transactional database for 20 years?
There comes a point where it's going to be a lot easier to move that data into a data warehouse, and just keep your frontline database current.
If this is a new project, be more concerned with making it work, and getting people to use it. Worry about scalability when you've crossed those bridges.
Use option 1. Later, if performance becomes poor, you can identify specific bottlenecks and resolve them with options like #2, or materialized views, or several other possibilities.