Where's the tradeoff between normalization (SQL View) and performance/fiability (SQL Table) - sql

I've got quite a long business process which eventually results into financial operations.
What matters in the end is quite exclusively these final operations, although I've got to keep a log of everything which led to it.
Since all the information contained into the final operations is available in other tables (used during the business process), it makes sense to use a view, but the view logic would be quite complicated (there are dozens of tables implicated), and I'm concerned that :
even with appropriate indexes, a table will probably be way faster (my table will eventually contain millions if items, and should be fully searchable on almost all its columns)
the view logic would be complicated, so I'm afraid it may complicate things in a few years if I want to evolve my business logic.
Because of those two reasons, I'm a bit tempted to write the data in a table at the end of my business process instead of relying on a view, but duplicating the data doesn't smells right (and it also looks a bit like premature optimization, but since it's such a central point in my design, I'd like to address the issue ASAP)
Have you ever faced such a choice? What did you decide?
Edit : creating a table would clearly lead to duplication in my situation, ie. the data written in the table exists somewhere else in the database and could be retrieved using only joins without any calculations.

I think you answered your question writing it down Brann.
This problem can be seen in this way: from one hand you have "real time data". You have fresh data and from them it's nice to create view to show "real time data" too.
But as time goes on, there are more data and logic changes. So it's good to have written down summaries of data you had some time ago. It's very pragmatic - you do not duplicate data, because you recalculate them and save into new table summary of them.
So when you think of it in this way, it's obvious that in this example new table will be better. As you write:
Faster access
Can have more complicated logic
Have archive data unchanged when logic changes
So when you meet this (or part) of this criteria as you requirement than its not choice - you go into tables.
When I would use view is only when showing fresh data out of other fresh data. In very, very simple problems. And when it gets more complicated - you always switch to new table.
So do not be afraid to go into it. Having one summary table with faster access is very pretty solution and it's a sign of well formed database.
Take care of the design of this table - so when business logic changes - you do not need to change everything from one stone in this table. And then everything will be OK!

I'm for the new table in this situation. The view has many disadvantages - performance clearly, complexity, and logic lock in. However, IMHO the over-arching reason is that as the underlying data changes, so the value in your view will change also. In most instances this is a good thing, however, with financial operations isn't it better to have a fixed record of what occured.

I always decide to have better normalization. In your case , though the view may be complicated , it's better to have that than the new table which has to be kept in sync with all the data changing operation.Plus the view would always be current while your end of business day table population would be only current for few hours a day.
Also , you have a bigger problem if the data in this table goes out of sync for whatever reasons.

As MrTelly alluded to, are you sure that your end result table really is a duplication of the view data? Or, is it actually a record of the final action taken as a result of the items in the view data.
For a clearer example, let's say that every time my gas tank gets to half-empty I buy $10 of gas. I write this down in a log. One day I buy my gas and write it in my log then later find out that my fuel gauge was broken and I really had 3/4 a tank of gas. Should I now erase the $10 purchase from my log because the underlying data (the level of gas in my tank) has changed? Ok, maybe that's not a clearer example, but hopefully it gets the point across. Recording the results is a different thing from recording the events that led up to the result. This is especially true in financial application. Therefore, I don't know that you're breaking normalization at all with storing the final outcome in its own table.

An indexed view is the way. But there are quite a few limitations to this approach, but it's generally favorable although it has some overhead issues if implemented incorrectly. With this approach you won't need to keep track of the changes that take place in your base tables and the data would accumulate itself nicely in that indexed view of yours. In theory.
Reference:
Improving Performance with SQL Server 2005 Indexed Views
Oracle: Materialized View Concepts and Architecture

Related

Is it bad to not use normalised tables in this database?

I recently learned about normalisation in my informatics class and I'm developing a multiplayer game using SQLite as backend database at the moment.
Some information on it:
The simplified structure looks a bit like the following:
player_id | level | exp | money | inventory
---------------------------------------------------------
1 | 3 | 120 | 400 | {item a; item b; item c}
Okay. As you can see, I'm storing a table/array in string form in the column "inventory". This is against normalization.
But the thing is: Making an extra table for the inventory of players brings only disadvantages for me!
The only points where I access the database is:
When a player joins the game and his profile is loaded
When a player's profile is saved
When a player joins, I load his data from the DB and store it in memory. I only write to the DB like every five minutes when the player is saved. So there are actually very few SQL queries in my script.
If I used an extra table for the inventory I would have to, upon loading:
Perform an performance and probably more data-intensive query to fetch all items from the inventory table which belong to player X
Walk through the results and convert them into a table for storage in memory
And upon saving:
Delete all items from the inventory table which belong to player X (player might have dropped/sold some items?)
Walk through the table and perform a query for each item the player owns
If I kept all the player data in one table:
I'd only have one query for saving and loading
Everything would be in one place
I would only have to (de)serialize the tables upon loading and saving, in my script
What should I do now?
Do my arguments and situation justify working against normalisation?
Are you saying that you think parsing a string out of "inventory" doesn't take any time or effort? Because everything you need to do to store/retrieve inventory items from a sub table is something you'd need to do with this string, and with the string you don't have any database tools to help you do it.
Also, if you had a separate subtable for inventory items, you could add and remove items in real time, meaning that if the app crashes or the user disconnects, they don't lose anything.
There are a lot of possible answers, but the one that works for you is the one to choose. Keep in mind, your choice may need to change over time.
If the amount of data you need to persist is small (ie: fits into a single table row) and you only need to update that data infrequently, and you don't have any reason to care about subsets of that data, then your approach makes sense. As time goes on and your players gain more items and you add more personalization to the game, you may begin to push up against the limits of SQLite, and you'll need to evolve your design. If you discover that you need to be able to query the item list to determine which players have what items, you'll need to evolve your design.
It's generally considered a good idea to get your data architecture right early, but there's no point in sitting in meetings today trying to guess how you'll use your software in 5-10 years. Better to get a design that meets this year's needs, and then plan to re-evaluate the design again after a year.
What's going to happen when you have one hundred thousand items in your inventory and you only want to bring back two?
If this is something that you're throwing together for a one off class and that you won't ever use again, then yes, the quick and dirty route might be a quicker option for you.
However if this is something you're going to be working on for a few months, then you're going to run into long-term issues with that design decision.
No, your arguments aren't valid. They basically boil down to "I want to do all of this processing in my client code instead of in SQL and then just write it all to a single field" because you are still doing all of the exact same processing to generate the string. By doing this you are removing the ability to easily load a small portion of the list and losing relationships to the actual item table which could contain more information about the items (I assume you're hard coding it all based on names instead of using internal item IDs which is a really bad idea, imo).
Don't do it. Long term the approach you are wanting to take will generate a lot more work for you as your needs evolve.
Another case of premature optimization.
You are trying to optimize something that you don't have any performance metrics. What is the target platform? Even crappiest computers nowadays could run at least hundreds of your reading operation per second. Then you add better hardware for more users, then you can go to cloud and when you come into problem space that Google, Twitter and Facebook are dealing with, you can consider denormalizing. Even then, best solution is some sort of key-value database.
Maybe you should check Wikipedia article on Database Normalization to remind you why normalized database is a good thing.
You should also think about the items. Are the items unique for every user or does user1 could have item1 and user2 have item1 to. If you now want to change item1 you have to go through your whole table and check which user have this item. If you would normalize your table, this would be much more easy.
But it the end, I think the answer is: It depends
Do my arguments and situation justify
working against normalisation?
Not based on what I've seen so far.
Normalized database designs (appropriately indexed and with efficient usage of the database with UPSERTS, transactions, etc) in general-purpose engines will generally outperform code except where code is very tightly optimized. Typically in such code, some feature of the general purpose RDBMS engine is abandoned, such as one of the ACID properties or referntial integrity.
If you want to have very simple data access (you tout one table, one query as a benefit), perhaps you should look at a document centric database like mongodb or couchdb.
The reason that you use any technology is to leverage the technology's advantages. SQL has many advantages that you seem to not want to use, and that's fine, if you don't need them. In Neal Stephenson's Zodiac, the main character mentions that few things bought from a hardware store are used for their intended purpose. Software's like that, too. What counts is that it works, and it works nearly 100% of the time, and it works fast enough.
And yet, I can't help but think that someday you're going to have some overpowered item released into the wild, and you're going to want to deal with this problem at the database layer. Say you accidently gave out some superinstakillmegadeathsword inventory items that kill everything within 50 meters on use (wielder included), and you want to remove those things from play. As an apology to the people who lose their superinstakillmegadeathsword items, you want to give them 100 money for each superinstakillmegadeathsword you take away.
With a properly normalized database structure, that's a trivial task. With a denormalized structure, it's quite a bit harder and slower. A normalized database is also going to be easier to expand on the design in the future.
So are you sure you don't want to normalize your database?

Storing multiple choice values in database

Say I offer user to check off languages she speaks and store it in a db. Important side note, I will not search db for any of those values, as I will have some separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when you're requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for it's profile page?
I generally stay away at the solution you described, you asking for troubles when you store relational data in such fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the necessary performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view to this debate, look at wordpress. Tables are chocked full of delimited data, and it's a great, simple platform.

What's the best way to store/calculate user scores?

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot or querying to get the total score for each user which I would like to display next to their username (in the form of a level). For example, a query to the submitted items table to get the scores for each item from that user, a query to the comments table etc. If all this needs to be done for every user mentioned on a page.... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even stack overflow does similar) so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount UserAchievement
-Guid (PK) -Guid (PK)
-FirstName -UserAccountGuid (FK)
-LastName -Name
-EmailAddress -Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
SELECT [UserAccount].[FirstName] AS FirstName,
[UserAccount].[LastName] AS LastName,
SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
GROUP BY [UserAccount].[FirstName],
[UserAccount].[LastName]
ORDER BY [UserAccount].[LastName] ASC
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
I will add one more point--if your UserAccount table gets huge, you may consider a slightly more intelligent query that would incorporate the names of the accounts you need to get roll-ups for. This will make it possible not to return huge data sets to your web site when you're only showing, you know, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements since this will invoke a linear search of the table.
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync.. so the updates to the store total are a synchronous aspect of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose if you want to have some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
In the past we've always used some sort of nightly or perodic cron job to calculate the current score and save it in the database - sort of like a persistent view of the SUM on the activities table. Like most "best practices" they are simply guidelines and it's often better and more practical to deviate from a specific hard nosed practice on very specific areas.
Plus it's not really all that much of a deviation if you use the cron job as it's better viewed as a cache stored in the database.
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the sites code.
The user scores would be updated continuously, and could be quickly queried for display.

use SUM() or caching

I don't have much experience with databases, so I don't know which is better for long-term performance, better practice, etc.
Here is my (hypothetical) case: imagine you have a database of customer information and history of purchase orders for each. You want to keep track of how much each customer is buying. I can think of two ways of calculating this:
1) Just do a SUM() every time it is needed. This is an easy solution, but the concern is that this database could be 20 years old with tens of thousands of rows for each customer. As more customers purchases are added to the database, the SUM() operation will take longer to calculate.
2) Keep the sum cached in the customer information table, and every time a new purchase is made (updated, deleted, etc), update this cache. That way no matter how many purchase orders there are, there won't be an increase in the calculation time. The downside is that, this is a less flexible solution (only the sum over all rows, what about per month sum? other intervals? etc); this cached value could get out of sync with the actual total somehow (technically shouldn't happen, but it might)
So what should I be doing for this? I know I shouldn't be storing anything I can calculate from what's already in the database, but part of me doesn't like the fact that this type of calculation will get worse in time, and that there is some kind of elegance to option 2.
In database terms, there is no elegance in option 2 - it would be considered a hack that you might use as a last resort if your database gets to be truly humongous - not likely to happen for a novice setting it up for the first time (but possible).
There would be a lot of work maintaining the totals; and you'd be forever dealing with the question: "Why doesn't the detail add up to the total?"
Go for option 1 until you've proven you can't. Which will in most cases be a long time.
What you're describing in option #2 is a case of premature optimization. Using the SUM() of all purchases is going to work for a really long time (years). When (if) you start to see this function degrade, you can add indexes or a totals table to your database to speed things up. Don't complicate things when a simple solution exists.
Of course, the real solution is to try both solutions with 20 years of made-up data and see if there's any real difference. I suspect there isn't.
Almost always 1.
How often will you be querying for the total for 20 year's worth of history? If the answer is often, and the performance is poor, then you can think about optimisation or OLAP.
I suspect you're worrying about optimisation too early. This is the sort of thing databases are designed for - let them worry about caching.
I'll just add that another possibility is to create roll-up tables. For instance, when tracking page hits, it's not very useful necessarily to know that IP such-and-such accessed page1.php at 14:42:04 on 11/19/2008; but you might want to track daily stats for page1.php. In this case, at the end of each day you might run a procedure to sum up the hits for each page and create a record in a summary table, which can in turn be heavily indexed. Your reporting then can run against that table. In addition to speeding up reporting, it can also speed up writing the original records, as you don't need to worry about table locking or building indexes.
That said, good indexes can go a long way with reporting; and like others here have admonished, it's best to go with the easier, even if less optimal, solution until (if ever) it becomes a problem.
Kudos for thinking ahead, but the question that begs asking, is your sales data going to stay in the transactional database for 20 years?
There comes a point where it's going to be a lot easier to move that data into a data warehouse, and just keep your frontline database current.
If this is a new project, be more concerned with making it work, and getting people to use it. Worry about scalability when you've crossed those bridges.
Use option 1. Later, if performance becomes poor, you can identify specific bottlenecks and resolve them with options like #2, or materialized views, or several other possibilities.

Deciding on a database structure for pricing wizard

Option A
We are working on a small project that requires a pricing wizard for custom tables. (yes, actual custom tables- the kind you eat at. From here out I'll call them kitchen tables so we don't get confused) I came up with a model where each kitchen table part was a database table. So the database looked like this:
TableLineItem
-------------
ID
TableSizeID
TableEdgeWoodID
TableBaseID
Quantity
TableEdgeWoodID
---------------
ID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
Each part has to be able to calculate its price. Most of the calculations are very similar. I liked this structure because I can drag it right into the linq-to-sql designer, and have all of my classes generated. (Less code writing means less to maintain...) I then implement a calculate cost interface which just takes in the size of the table. I have written some tests and this functions pretty well. I added also added a table to filter parts in the UI based on previous selections. (You can't have a particular wood with a particular finish.) There some other one off exceptions in the model, and I have them hard coded. This model is very rigid, and changing requirements would change the datamodel. (For example, if all the tables suddenly need umbrellas.)
Option B:
After various meetings with my colleagues (which probably took more time than it should considering the size of this project), my colleagues decided they would prefer a more generic approach. Something like this:
Spec
----
SpecID
SpecTypeID
TableType_LookupID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
SpecType
--------
SpecTypeID
ParentSpecType_SpecTypeID
IsCustomerOption
IsRequiredCustomerOption
etc...
This is a much more generic approach that could be used to construct any product. (like, if they started selling chairs...) I think this would take longer time to implement, but would be more flexible in the future. (although I doubt we will revisit this.) Also you lose some referential integrity- you would need triggers to enforce that a table base cannot be set for a table wood.
Questions:
Which database structure do you prefer? Feel free to suggest your own.
What would be considered a best practice? If you have several similar database tables, do you create 1 database table with a type column, or several distinct tables? I suspect the answer begins with "It depends..."
What would an estimated time difference be in the two approaches (1 week, 1 day, 150% longer, etc)
Thanks in advance. Let me know if you have any questions so I can update this.
Having been caught out much more often than I should have by designing db structures that met my clients original specs but which turned out to be too rigid, I would always go for the more flexible approach, even though it takes more time to set up.
I don't have time for a complete answer right now, but I'll throw this out:
It's usually a bad idea to design a database based on the development tool that you're using to code against it.
You want to be generic to a point. Tables in a database should represent something and it is possible to make it too generic. For example, a table called "Things" is probably too generic.
It may be possible to make constraints that go beyond what you expect. Your example of a "table base" with a "table wood" didn't make sense to me, but if you can expand on a specific example someone might be able to help with that.
Finally, if this is a small application for a single store then your design is going to have much less impact on the project outcome than it would if you were designing for an application that would be heavily used and constantly changed. This goes back to the "too generic" comment above. It is possible to overdesign a system when its use will be minimal and well-defined. I hope that makes sense.
Given your comment below about the table bases and woods, you could set up a table called TableAttributes (or something similar) and each possible option would be of a particular table attribute type. You could then enforce that any given option is only used for the attribute to which it applies all through foreign keys.
There is a tendency to over-abstract with database schema design, because the cost of change can be high. Myself, I like table names that are fairly descriptive. I often equate schema design with OO design. E.g., you wouldn't normally create a class named Thing, you would probably call it Product, Furniture, Item, something that relates to your business.
In the schema you have provided there is a mix of the abstract (spec) and the specific (TableType_LookupID). I would tend to equalize the level of abstraction, so use entities like:
ProductGroup (for the case where you have a product that is a collection of other products)
Product
ProductType
ProductDetail
ProductDetailType
etc.
Here's what my experience would tell me:
Which database structure do you prefer? Without a doubt, I'd go for approach one. Go for the simplest setup that might work. If you add complexity, always ask yourself, what value will it have to the customer?
What would be considered a best practice? That does indeed depend, among others on the size of the project and the expected rate of change. As a general rule, generic tables are worth it when you expect the customer to be adding new types. For example, if your customer wants to be able to add a new "color" entity to the table, you'd need generic tables. You can't predict beforehand what they will add.
What would an estimated time difference be in the two approaches? Not knowing your business, skill, and environment, it's impossible to give a valid estimate. The approach that you are confident in coding will take the least time. Here, my guess would be approach #1 could be 5x-50x as fast. Generic tables are hard, both on the database and the client side.
Option B..
Generic is generally better than specific. Software already is doomed to fail or reach it's capacity by it's design for a certain set of tasks only. If you build something generic it will break less if abstracted with a realistic analysis of where it might head. As long as you stay away from over-abstraction and under-abstraction, it's probably the sweet spot.
In this case the adage "less code is more" would probably be drawn in that you wouldn't have to come back and re-write it again.