Is precalculation denormalization? If not, what is it (in simple terms)? - sql

I'm attempting to understand denormalization in databases, but almost all the articles Google has spat out are aimed at advanced DB administrators. I have a fair amount of knowledge about MySQL and MSSQL, but I can't really grasp this.
The only example I can think of where speed was an issue was when doing calculations on about 2,500,000 rows across two tables at a place I used to intern at. As you can guess, calculating that much on demand took forever and froze the dev server I was on for a few minutes. So right before I left, my supervisor wanted me to write a calculation table that would hold all the precalculated values and would be updated about every hour or so (this was an internal site that wasn't used often). However, I never got to finish it because I left.
Would this be an example of denormalization? If so, is this a good example of it, or does it go much further? If not, then what is it in simple terms?

Say you had an Excel file with 2 worksheets you want to use to store family contact details. On the first worksheet, you have names of your contacts with their cell phone numbers. On the second worksheet, you have mailing addresses for each family with their landline phone numbers.
Now you want to print Christmas card labels to all of your family contacts listing all of the names but only one label per mailing address.
All the data in the 2 sets you have is normalized. It's 'atomic,' representing one 'atom,' or piece of information that can't be broken down. None of it is repeated. You need a way to link the two normalized sets.
In a denormalized view of the 2 sets, you'd have one list of all contacts with the mailing addresses repeated multiple times (cousin Alan lives with Uncle Bob at the same address, so it's listed on both Alan and Bob's rows.)
At this point, you want to introduce a Household ID in both sets to link them. Each mailing address has one householdID, each contact has a householdID value that can be repeated (cousin Alan and Uncle Bob, living in the same household, have the same householdID.)
Now say we're at work and we need to track zillions of contacts and households. Keeping the data normalized is great for maintenance purposes, because we only want to store contact and household details in one place. When we update an address, we're updating it for all the related contacts. Unfortunately, for performance reasons, when we ask the server to join the two related sets, it takes forever.
Therefore, some developer comes along and creates one denormalized table with all the zillions of rows, one for each contact, with the household details repeated. Performance improves, and space considerations are tossed right out the window, as we now need space for 3 zillion rows instead of just 2.
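To make that concrete, here's a rough SQL sketch of the two designs (all table and column names are invented for the example):

-- Normalized: household details stored once, referenced by each contact.
CREATE TABLE Household (
    HouseholdID    INT PRIMARY KEY,
    MailingAddress VARCHAR(200),
    LandlinePhone  VARCHAR(22)
);
CREATE TABLE Contact (
    ContactID   INT PRIMARY KEY,
    Name        VARCHAR(100),
    CellPhone   VARCHAR(22),
    HouseholdID INT REFERENCES Household (HouseholdID) -- Alan and Bob share a value here
);

-- Denormalized: one wide table, household details repeated on every contact's row.
CREATE TABLE ContactDenormalized (
    ContactID      INT PRIMARY KEY,
    Name           VARCHAR(100),
    CellPhone      VARCHAR(22),
    MailingAddress VARCHAR(200), -- repeated for Alan and Bob
    LandlinePhone  VARCHAR(22)   -- repeated for Alan and Bob
);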
Make sense?

I would call that aggregation, not denormalization (if it is a quantity of orders, for example: SUM(Orders) per day). This is what OLAP is used for. Denormalization would be, for example: instead of having a PhoneType table and a PhoneTypeID in the Contact table, you would just store the PhoneType text in the Contact table, thus eliminating one join.
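Something like this, say (a sketch, assuming Contact and PhoneType tables as described):

-- Normalized: the phone type text is reached through a join.
SELECT c.Name, pt.PhoneType
FROM Contact c
JOIN PhoneType pt ON pt.PhoneTypeID = c.PhoneTypeID;

-- Denormalized: the text is stored on Contact itself, so no join is needed.
SELECT Name, PhoneType
FROM Contact;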
You could also, of course, use indexed/materialized views to hold the aggregation values... but now you will slow down your updates, deletes and inserts.
Triggers are also another way to accomplish this.
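For example, in SQL Server an indexed view can maintain the aggregate for you. This is a sketch only; it assumes an Orders table with non-null OrderDate and Quantity columns:

CREATE VIEW dbo.DailyOrderTotals
WITH SCHEMABINDING -- required for indexed views
AS
SELECT OrderDate,
       SUM(Quantity) AS TotalQuantity,
       COUNT_BIG(*)  AS RowCnt -- COUNT_BIG(*) is required when the view uses GROUP BY
FROM dbo.Orders
GROUP BY OrderDate;

-- The unique clustered index is what materializes the view;
-- SQL Server then keeps it up to date on every insert/update/delete.
CREATE UNIQUE CLUSTERED INDEX IX_DailyOrderTotals
    ON dbo.DailyOrderTotals (OrderDate);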

In an overly simplified form I would describe de-normalisation as reducing the number of tables used to represent the same data.
Customers and addresses are often kept in different tables to allow the concept of one customer having multiple addresses. (Work, Home, Current Address, Previous Address, etc)
The same could be said to apply to surnames and other properties, but perhaps only the current surname is ever of concern. As such, one might normalise all the way to having a Customer table and a Surname table, with foreign key relationships, etc., but then denormalise this by merging the two tables together.
The benefit of "normalise until it hurts" is that it forces one to consider a pure and (hopefully) complete representation of the data and possible behaviours and relationships.
The benefit of "de-normalise until it works" is to reduce certain maintenance and/or processing overheads, but sticking to the same basic model as derived by working out a normalised model.
In the "Surname" example, by denormalising one is able to add an index to the customers based on their Surname and Date of Birth. Without de-normalising the Surname and DoB are in different tables and the composite index is not possible.

Denormalizing can be beneficial, and the example you provided is an instance of this. It is not ideal to calculate these values dynamically, as the cost is expensive, so you create a table with a foreign key referencing the other table along with the calculated value.
The data is redundant, as it can be derived from another table, but due to production requirements this is the better design in a functional sense.
I'm curious to see what others have to say on this topic, because I know my SQL professor would cringe at the term 'denormalize', but it has practical value.

Normal form would reject this table, as it is fully derivable from existing data. However, for performance reasons, data of this type is commonly found. For example, inventory counts are typically carried, even though they are derivable from the transactions that created them.
For smaller, faster sets, a view can be used to derive the aggregate. This provides the user the data they need (the aggregated value) rather than forcing them to aggregate it themselves. Oracle (and others?) have introduced materialized views to do what your manager was suggesting. These can be refreshed on various schedules.
If update volumes permit, triggers could be used to emulate a materialized view using a table. This may reduce the cost of maintaining the aggregated value; if not, it would at least spread the overhead over a greater period of time. It does, however, add the risk of creating a deadlock condition.
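A rough sketch of the trigger approach, using the inventory example (SQL Server syntax; all names are invented, and it only handles inserts — updates, deletes and brand-new items would need similar treatment):

CREATE TRIGGER trg_Transactions_MaintainCount
ON Transactions
AFTER INSERT
AS
BEGIN
    -- Fold the newly inserted transaction quantities into the carried count.
    UPDATE ic
    SET ic.OnHand = ic.OnHand + i.Quantity
    FROM InventoryCount ic
    JOIN inserted i ON i.ItemID = ic.ItemID;
END;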
OLAP takes this simple case to an extreme, with its interest in aggregates. Analysts are interested in aggregated values, not the details; however, if an aggregated value is interesting, they may drill into the details. Starting from normal form is still a good practice.

Related

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default NULLs mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table, since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredibly painful to work with from both a processing and programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as the one you describe. You are adding attributes indefinitely. You would quickly run over the maximum number of columns allowed in a single table (typically between 1,000 and 4,000, depending on the database). You can also keep track of each value in each column separately -- if they are added at different times, for instance, you can keep a timestamp on when they go in.
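A minimal sketch of the EAV table from your question, and of the kind of query needed to pivot an attribute back into a column (the respondents table is assumed from the question):

CREATE TABLE respondent_supplemental_demographic_info (
    response_id       INT          NOT NULL,
    demographic_field VARCHAR(100) NOT NULL,
    demographic_value VARCHAR(255), -- one value column; everything is stored as text
    PRIMARY KEY (response_id, demographic_field)
);

-- One conditional aggregate (or join) per attribute you want back as a column:
SELECT r.response_id,
       MAX(CASE WHEN e.demographic_field = 'can_drive_manual_indicator'
                THEN e.demographic_value END) AS can_drive_manual_indicator
FROM respondents r
LEFT JOIN respondent_supplemental_demographic_info e
       ON e.response_id = r.response_id
GROUP BY r.response_id;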
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, and metadata tables (in a separate DB) describing these tables. With this metadata you can generate SQL, etc. That is definitely superior to having many null columns, or copied, adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge) and generate reports.

When to split a Large Database Table?

I'm working on an 'Employee' database and the fields are beginning to add up (say 20). The database would be populated from different UIs, say:
Personal Information UI: populates fields of the 'Employee' table such as birthday, surname, gender etc
Employment Details UI: populates fields of the 'Employee' table such as employee number, date employed, grade level etc
Having all the fields populated from a single UI (as you would imagine) is messy and results in one very long form that you'd need to scroll.
I'm thinking of splitting this table into several smaller tables, such that each smaller table captures related information about an employee (i.e. splitting the table logically according to the UI).
The tables will then be joined by the employee id. I understand that splitting tables with a one-to-one relationship is generally not a good idea (multiple-database-tables), but could splitting the table logically help, such that the employee information is captured in several INSERT statements?
Thanks.
Your data model should not abide by any rules imposed by the UI just for convenience. One way to reduce the column set for a given UI component is to use views (in most databases, you can also INSERT / UPDATE / DELETE using simple views). Another is to avoid SELECT *. You can always select subsets of your table's columns.
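For example (a sketch; column names are taken from your question):

-- One view per UI section, over the same Employee table:
CREATE VIEW PersonalInformation AS
SELECT EmployeeID, Surname, Birthday, Gender
FROM Employee;

CREATE VIEW EmploymentDetails AS
SELECT EmployeeID, EmployeeNumber, DateEmployed, GradeLevel
FROM Employee;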
"could splitting the table logically help, such that the employee information is captured in several INSERT statements?"
No.
How could it help?
20 fields is a fairly small number of fields for a relational database. There was a question a while ago on SO where a developer expected to have around 3,000 fields for a single table (which was actually beyond the capability of the RDBMS in question) - under such circumstances, it does make sense to split up the table.
It could also make sense to split up the table if a subset of columns were only ever going to be populated for a small proportion of rows (for example, if there were attributes that were specific to company directors).
However, from the information supplied so far, there is no apparent reason to split the table.
Briefly put, you want to normalize your data model. Normalization is the systematic restructuring of data into tables, and it formed the theoretical foundation of the relational data model developed by E.F. Codd forty years ago. There are levels of normalization: non-normalized, and then first, second, third, etc. normal forms.
Normalization is now barely an afterthought in many database shops, ostensibly because it is erroneously believed to slow database performance.
I found a terrific summary on an IBM site covering normalization which may help. http://publib.boulder.ibm.com/infocenter/idshelp/v10/topic/com.ibm.ddi.doc/ddi56.htm
The original Codd book, later revised by C.J. Date, is not very accessible, unfortunately. I can recommend "Database Systems: Design, Implementation & Management" by P. Rob and C.M. Coronel. I TA'ed a class on database design that used this textbook, and I've kept using it for reference.

Database design: one bigger table vs. split tables with the same columns

I have a database program for a store. As you know, there are two types of invoice in it: one for the things I bought, and the other for when I sold them. The two tables are almost identical:
invoice table
Id
customerName
date
invoiceType
and invoiceDetails, which has
id
invoiceId
item
price
amount
My question is simple: is it best to keep the design like that, or to split each table into two separate tables?
A couple of my friends suggest splitting the tables, one for saleInvoice and the other for buyInvoice, to speed up querying.
So what are the pros and cons of each approach? I feel that if I split them, I'm not following the DRY rule.
I am using NHibernate, BTW, so it's kind of weird to have two identical classes with different names.
Both approaches would work. If you use the single-table approach, then the invoiceType column would be your discriminator field. In your NHibernate mapping, this discriminator field would be used by NHibernate to decide which type (i.e. a purchase or a sale) to instantiate for a given row in the table (see section 5.1.6 of the NHibernate mapping guide). For ad hoc SQL queries or reporting queries, you could create two views, one to return only rows with invoiceType = purchase and one to return only rows with invoiceType = sale.
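Something like this (a sketch using the column names from your question; the discriminator values are invented):

CREATE VIEW SaleInvoice AS
SELECT Id, CustomerName, Date, InvoiceType
FROM Invoice
WHERE InvoiceType = 'Sale';

CREATE VIEW BuyInvoice AS
SELECT Id, CustomerName, Date, InvoiceType
FROM Invoice
WHERE InvoiceType = 'Buy';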
Alternatively, you could create two separate tables, one for purchase and one for sales. As you point out, these two tables would have nearly identical schemas and nhibernate mapping files.
If you are anticipating very high transaction volumes, you would want to put purchases and sales on two different physical discs. With two different tables, this can be accomplished by putting them into different file groups. With a single table, you still could accomplish this by creating a SQL Server Partitioned Table. Before you go to this trouble, you might want to evaluate if this really is necessary and that disc access to the table is really going to be the performance bottleneck. You don't want to spend a lot of time doing premature optimization if it is not necessary.
My preference would be to have a single table with a discriminator column, to better follow DRY principles. Unless I had solid numbers indicating it was necessary, I would hold off implementing a partitioned table until if and when it became necessary.
I'd ask myself: how do I intend to use this information? Will I need sales and buy invoices in the same queries? Am I likely to eventually need specialized information (highly likely, in my experience) for each type? And if I do, will I need child tables for only one type? How would that affect referential integrity? Would a change to one automatically mean I needed a change to the other? How large is the table likely to be? (It would have to be in the multi-millions before I would consider that it might need to be split out due to size alone.) How likely is it that I would mix up the information by accident if they are in the same table, and include both when I didn't want to? The answers would determine whether I needed to split it out. I tend to see these as two separate functions, and it would take a lot to convince me to put them in one table.

Normalize or Denormalize: Store Contact Details (Phone Numbers) in separate Table? Search Performance?

I'm designing a database application which stores simple contact information (First/Last Name etc.) and I also have to store phone numbers. Besides phone numbers I have to store what they are for (mobile, business etc.) and possibly an additional comment for each.
My first approach was to normalize and keep the phone numbers in a separate table, so that I have my 'Contacts' and my 'PhoneNumbers' table. The PhoneNumbers table would be like this:
Id int PK
ContactId int FK<->Contacts.Id
PhoneNumber nvarchar(22)
Description nvarchar(100)
However, it would make things a lot easier AND save a SQL Join on retrieval if I just stored this information as part of each contact's record (assuming that I limit the total # of phone numbers that can be stored, to say 4 numbers total).
However, I end up with an "ugly" structure like this:
PhoneNumber1 nvarchar(22)
Description1 nvarchar(100)
PhoneNumber2 nvarchar(22)
Description2 nvarchar(100)
etc. etc.
It looks amateurish to me but here are the advantages I see:
1) In ASP.NET MVC I can simply attach the input textboxes to my LINQ object's properties and I'm done with wiring up record adds and updates.
2) No SQL Join necessary to retrieve the information.
Unfortunately, I am not very knowledgeable on issues such as table width (I read that performance problems can come up if a table grows too big or has too many columns?), and it would also mean that when I search for a phone number I'd have to look at 4 fields instead of the 1 I'd have if I kept the numbers in a separate table.
My application has about 80% search/data retrieval activity so search performance is an important factor.
I appreciate your help in finding the right way to do this. Separate table or keep it all in one? Thank you!
It won't likely cause problems to have the data denormalized like that, but I wouldn't suggest it. Even though it may be more complex to query, it's better to have well-formed data that you can manipulate in a multitude of ways. I would suggest a database schema like this:
Contacts:
ID (Primary Key)
Name
Job Title
Phone Number Categories:
ID (Primary key)
Name
Phone Numbers:
ID (Primary Key)
Category_ID (Foreign Key -> Phone Number Categories.ID)
Contact_ID (Foreign Key -> Contacts.ID)
Phone Number
This allows you a lot of flexibility in the number of phone numbers allowed, and gives you the ability to categorize them.
This might be fine now, but what happens when someone wants a fifth phone number? Do you keep on adding more and more fields?
Another thing to consider is how would you run a query to say 'Give me all people and their mobile phone numbers', or 'Give me everyone with no phone number'? With a separate table, this is easy, but with one table the mobile number could be in any one of four fields so it becomes much more complicated.
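For example, with a separate phone numbers table along the lines of the schema in the other answer, both queries stay simple (names adapted slightly; 'Mobile' is an assumed category value):

-- All people and their mobile phone numbers:
SELECT c.Name, p.PhoneNumber
FROM Contacts c
JOIN PhoneNumbers p          ON p.Contact_ID = c.ID
JOIN PhoneNumberCategories k ON k.ID = p.Category_ID
WHERE k.Name = 'Mobile';

-- Everyone with no phone number at all:
SELECT c.Name
FROM Contacts c
LEFT JOIN PhoneNumbers p ON p.Contact_ID = c.ID
WHERE p.ID IS NULL;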
If you take the normalised approach, and in future you wanted to add additional data about the phone number, you can simply add another column to the phone numbers table, not add 4 columns to the contacts table.
Going back to the first point about adding more phone numbers in the future: if you do add more numbers, you will probably have to amend every query, bit of logic, and form that works with phone number data.
I'm in favor of the normalized approach. What if you decided that you wanted to add an "Extension" column for business phone numbers? You'd have to create the columns "Extension1", "Extension2", "Extension3", etc. This could become quite tedious to maintain at some point.
Then again, I don't think you can go too wrong either way. It's not like normalization/denormalization will take all that much time if you decided to switch to the other method.
An important principle of denormalization is that it does not sacrifice normalized data. You should always start with a schema that accurately describes your data. As such you should put different kinds of information in different kinds of tables. You should also put as many constraints on your data as you think is reasonable.
All of these goals tend to make queries a teeny bit longer, as you have to join different tables to get the desired information, but with the right names for tables and columns, this shouldn't be a burden from the point of view of readability.
More importantly, these goals can have an effect on performance. You should monitor your actual load to see if your database is performing adequately. If nearly all of your queries are returning quickly, and you have lots of CPU headroom for more queries, then you're done.
If you find that write queries are taking too long, make sure you don't denormalize your data. You will make the database work harder to keep things consistent, since it will have to do many reads followed by many more writes. Instead, you want to look at your indexes. Do you have indexes on columns you rarely query? Do you have indexes that are needed to verify the integrity of an update?
If read queries are your bottleneck, then once again, you want to start by looking at your indexes. Do you need to add an index or two to avoid table scans? If you just can't avoid the table scans, are there any things you could do to make each row smaller, like by reducing the number of characters in a varchar column, or splitting rarely queried columns into another table to be joined upon when they are needed.
If there is a specific slow query that always uses the same join, then that query might benefit from denormalization. First verify that reads on those tables strongly outnumber writes. Determine which columns you need from one table to add to the other. You might want to use a slightly different name to those columns so that it's more obvious that they are from denormalization. Alter your write logic to update both the original table used in the join, and the denormalized fields.
It's important to note that you aren't removing the old table. The problem with denormalized data is that while it accelerates the specific query it was designed for, it tends to complicate other queries. In particular, write queries must do more work to ensure that the data remains consistent, either by copying data from table to table, by doing additional subselects to make sure that the data is valid, or by jumping over other sorts of hurdles. By keeping the original table, you can leave all your old constraints in place, so at least those original columns are always valid. If you find for some reason that the denormalized columns are out of sync, you can switch back to the original, slower query and everything is valid, and then you can work on ways to rebuild the denormalized data.
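A sketch of what that write logic might look like, with both the original and the denormalized copy updated in one transaction (all names are invented; @NewStreet and @AddressID stand in for application parameters):

BEGIN TRANSACTION;

-- The original, normalized home of the value:
UPDATE Addresses
SET Street = @NewStreet
WHERE AddressID = @AddressID;

-- The denormalized copy, named so its origin is obvious:
UPDATE Contacts
SET Denorm_Street = @NewStreet
WHERE AddressID = @AddressID;

COMMIT;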
I agree with @Tom that normalizing makes more sense and provides flexibility. If you get your indexes right, you shouldn't suffer too much by doing a join between the tables.
As for your normalized table, I'd add a type or code field so you can identify, say, Home, Home 1, Home 2, Business, Bus 1, Bus 2, Mobile, Mob 1, etc...
Id int PK
ContactId int FK<->Contacts.Id
Code char(5)
PhoneNumber nvarchar(22)
Description nvarchar(100)
And store this type in a separate table with, say, other code/code description information.
We tend to have a Code Group table with info such as
CODE_GROUP, CODE DESC
ST State
PH Phone Number
AD Address Type
And a CODE table with
CODE_ID, CODE_GROUP, DESCRIPTION
MB1 PH Mobile One
MB2 PH Mobile Two
NSW ST New South Wales
etc...
You can expand this out to have a long description, short description, ordering, filtering, etc.
How about an XML field in the Contacts table? It takes away the complexity of another table.
(Please correct me if this is a bad idea, I've never worked with XML fields before)

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question: it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables: one with small rows containing only the core info that can be read quickly, and an extra table containing all the info you rarely use, which you can look up when needed.
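A sketch of that kind of vertical split (names invented; the shared primary key keeps it 1-to-1):

CREATE TABLE MemberCore (
    MemberID INT PRIMARY KEY,
    UserName VARCHAR(50),
    Email    VARCHAR(100)
    -- ...the small, frequently read columns
);

CREATE TABLE MemberExtended (
    MemberID    INT PRIMARY KEY REFERENCES MemberCore (MemberID),
    Biography   VARCHAR(4000),
    Preferences VARCHAR(4000)
    -- ...the wide, rarely read columns, joined in only when needed
);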
Taking another tack, from a maintenance and testing point of view: if, as you say, you have 3 distinct groups of data in the one table, albeit all with the same unique id (e.g. member_id), it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of the members info table, do you really want to run the risk of having to re-test the preferences and account details elements of your app as well, to ensure there are no knock-on impacts?
Also, for audit trail purposes, if you want to track the last user ID/timestamp to change a member's data: if the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/performance answer, but maybe something to look at from a DB & app design point of view.
Depends what those columns are. If you've got hard-coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is: if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
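Something like this, say (names invented):

-- Instead of Colour1/Colour2/Colour3 on the parent row:
CREATE TABLE ItemColour (
    ItemID   INT REFERENCES Item (ItemID),
    Position INT,         -- keeps the 1, 2, 3 ordering, if it matters
    Colour   VARCHAR(30),
    PRIMARY KEY (ItemID, Position)
);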
Rob.
1-to-1 may be easier if you have, say, Member_Info, Member_Pref and Member_Profile. Having too many columns can cause problems if you want lots of varchar(255) columns, as you may go over the row-size limit, and it just makes the table too confusing.
Just make sure you have the correct foreign key constraints and such, so there's always one row in each table with the same member_id.