SQL - Denormalization

I am trying to familiarize myself with a new database that is structured like this:
CREATE TABLE [TableA] (ID int not null, Primary Key (ID))
CREATE TABLE [TableB] (ID int not null, Primary Key (ID))
CREATE TABLE [TableC] (ID int not null, ID2 int, ID3 int, ID4 int, primary key (ID),
FOREIGN KEY (ID2) REFERENCES TableA(ID), FOREIGN KEY (ID3) REFERENCES TableB(ID))
TableC is a many-to-many junction table between TableA and TableB. TableC.ID is unique (as it is the primary key). TableC.ID4 is also unique and does not seem to refer to anything. I contacted the developer, who described it as a "denormalization of the M1 (many to 1) entity". I fully understand the purpose of denormalization (normalizing a database and then intentionally introducing anomalies for performance reasons); however, I still do not understand the reasoning behind this. Is there a pattern or concept that I am unaware of? The application is written in C++ with a bit of VB.NET.

It's fair denormalization if tableC.ID4 contains values that ordinarily you'd have to perform an additional join or lookup for. So have you checked the application code to see what that column is being populated for? If it doesn't refer to anything and doesn't provide any enrichment to the row data as a whole, you may safely move on with your development.

This is not an answer per se, just a related thought. Please don't start down voting it.
Is there any sort of link between tableC.ID and tableC.ID4? In one of my projects I had a similar case: I had userid and username in the user table. Both were unique, with userid as the primary key. One option was to remove username from that table and create a separate table mapping userid to username. I am not a great fan of normalization, so I decided it would be unnecessary overhead to run a join every time I needed username along with the rest of the user data, and I kept my design as it was.

Related

Combine multiple outrigger tables into one?

I have a dimensional table and several outriggers.
create table dimFoo (
FooKey int primary key,
......
)
create table triggerA (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerB (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerC (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
Should these outrigger tables be merged into one table?
create table Triggers (
FooKey int references dimFoo (FooKey),
TriggerType varchar(20), -- triggerA, triggerB, triggerC, etc.
Value varchar(255),
primary key nonclustered (FooKey, TriggerType, Value)
)

In order to meet this kind of scenario, such as with dimCustomer with customers potentially having multiple hobbies, the typical Kimball approach is to use a Bridge table between dimensions (dimCustomer and dimHobby).
This link provides a summary of how bridge tables could solve this problem and also alternatives which may work better for you.
Without knowing more about your specific scenario, including what the business requirements are, how many of these value types you have, how 'uniform' the various value types and values are, and the BI technology you'll be using for accessing the data, it's hard to give a definitive answer as to whether you should combine the bridges into one uber-bridge that caters for the various many-to-manys. All of the above influence the answer to some extent.
Typically the 'generic tables' approach is more useful behind the scenes for administration than it is for presenting for analytics. My default approach would be to have specific bridge tables until/unless this became unmanageable from an ETL perspective or perceived as much more complex from a user query perspective. I wouldn't look to 'optimise' to a combined table from the get-go.
If your situation is outside the usual norms (do you have three as per your example, or ten?), combining could well be a good idea. This would make it more like a factless fact, with dimensions of dimCustomer, dimValueType and dimValue, and would be a perfectly reasonable solution.

Problems on having a field that will be null very often on a table in SQL Server

I have a column that will sometimes be null. This column is also a foreign key, so I want to know whether I'll have problems with performance or data consistency if this column is null very often.
I know it's a foolish question, but I want to be sure.

There is not necessarily a problem with this, other than that it is likely an indication of a poorly normalized design. There might be performance implications due to the way indexes are structured and the sparseness of a column with many nulls, but without knowing your structure or intended querying scenarios, any conclusions one might draw would be pure speculation.
A better solution might be a shared primary key, where table A has a primary key and there are zero or one records in B with the same primary key.
If table A can have one or zero B, but more than one A can refer to B, then what you have is a one-to-many relationship. This can be represented as Pieter laid out in his answer. This allows multiple A records to refer to the same B, and in turn each B may optionally refer to an A.
So you see there are two optional structures to address this problem, and choosing between them is not guesswork. There is a distinct rationale for choosing one or the other, but it depends on the nature of the relationships you are modelling.

Instead of this design:
create table Detail (
ID int identity not null primary key
)
go
create table Master (
ID int identity not null primary key,
DetailID int null references Detail(ID)
)
go
consider this instead:
create table Master (
ID int identity not null primary key
)
go
create table Detail (
ID int identity not null primary key,
MasterID int not null references Master(ID)
)
go
Now the Foreign Key is never null, rather the existence (or not) of the Detail record indicates whether it exists.
If a Detail can exist for multiple records, create a mapping table to manage the relationship.
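To illustrate how the second design answers "does this Master have a Detail?" without a nullable column, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for SQL Server, so identity columns become INTEGER PRIMARY KEY, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE Master (ID INTEGER PRIMARY KEY);
    CREATE TABLE Detail (
        ID       INTEGER PRIMARY KEY,
        MasterID INTEGER NOT NULL REFERENCES Master(ID)
    );
    INSERT INTO Master (ID) VALUES (1), (2);
    INSERT INTO Detail (ID, MasterID) VALUES (10, 1);  -- only Master 1 has a Detail
""")

# A LEFT JOIN reports whether a Detail row exists; no nullable FK is needed.
rows = conn.execute("""
    SELECT m.ID, d.ID IS NOT NULL AS has_detail
    FROM Master m
    LEFT JOIN Detail d ON d.MasterID = m.ID
    ORDER BY m.ID
""").fetchall()
print(rows)
```

The absence of a matching Detail row plays the role that the null DetailID played in the first design.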

ORACLE Table design: M:N table best practice

I'd like to hear your suggestions on this very basic question:
Imagine these three tables:
--DROP TABLE a_to_b;
--DROP TABLE a;
--DROP TABLE b;
CREATE TABLE A
(
ID NUMBER NOT NULL ,
NAME VARCHAR2(20) NOT NULL ,
CONSTRAINT A_PK PRIMARY KEY ( ID ) ENABLE
);
CREATE TABLE B
(
ID NUMBER NOT NULL ,
NAME VARCHAR2(20) NOT NULL ,
CONSTRAINT B_PK PRIMARY KEY ( ID ) ENABLE
);
CREATE TABLE A_TO_B
(
id NUMBER NOT NULL,
a_id NUMBER NOT NULL,
b_id NUMBER NOT NULL,
somevalue1 VARCHAR2(20) NOT NULL,
somevalue2 VARCHAR2(20) NOT NULL,
somevalue3 VARCHAR2(20) NOT NULL
) ;
How would you design table a_to_b?
I'll give some discussion starters:
synthetic id-PK column or combined a_id,b_id-PK (dropping the "id" column)
  When synthetic: What other indices/constraints?
  When combined: Also index on b_id? Or even b_id,a_id (don't think so)?
  Also combined when these entries are referenced themselves?
  Also combined when these entries perhaps are referenced themselves in the future?
Heap or Index-organized table
  Always or only up to x "somevalue"-columns?
I know that the decision for one of the designs is closely related to the question how the table will be used (read/write ratio, density, etc.), but perhaps we get a 20/80 solution as blueprint for future readers.
I'm looking forward to your ideas!
Blama

I have always made the PK be the combination of the two FKs, a_id and b_id in your example. Adding a synthetic id field to this table does no good, since you never end up looking for a row based on a knowledge of its id.
Using the compound PK gives you a constraint that prevents the same instance of the relationship between a and b from being inserted twice. If duplicate entries need to be permitted, there's something wrong with your data model at the conceptual level.
The index you get behind the scenes (for every DBMS I know of) will be useful to speed up common joins. An extra index on b_id is sometimes useful, depending on the kinds of joins you do frequently.
Just as a side note, I don't use the name "id" for all my synthetic pk columns. I prefer a_id, b_id. It makes it easier to manage the metadata, even though it's a little extra typing.
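As a rough sketch of the duplicate-prevention point, the following assumes SQLite (via Python's sqlite3 module) rather than Oracle, so the types and the index name a_to_b_b_idx are illustrative only. The compound primary key rejects a second copy of the same (a_id, b_id) pair, and the extra index on b_id is the optional one mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY);
    CREATE TABLE b (b_id INTEGER PRIMARY KEY);
    CREATE TABLE a_to_b (
        a_id INTEGER NOT NULL REFERENCES a(a_id),
        b_id INTEGER NOT NULL REFERENCES b(b_id),
        PRIMARY KEY (a_id, b_id)
    );
    -- optional secondary index for joins that start from b
    CREATE INDEX a_to_b_b_idx ON a_to_b (b_id);
    INSERT INTO a VALUES (1);
    INSERT INTO b VALUES (7);
    INSERT INTO a_to_b VALUES (1, 7);
""")

try:
    # Same relationship instance again: the compound PK should refuse it.
    conn.execute("INSERT INTO a_to_b VALUES (1, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

pair_count = conn.execute("SELECT COUNT(*) FROM a_to_b").fetchone()[0]
print(duplicate_rejected, pair_count)
```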

CREATE TABLE A_TO_B
(
a_id NUMBER NOT NULL REFERENCES A (a_id),
b_id NUMBER NOT NULL REFERENCES B (b_id),
PRIMARY KEY (a_id, b_id),
...
) ;
It's not unusual for ORMs to require (or, in more clueful ORMs, hope for) an integer column named "id" in addition to whatever other keys you have. Apart from that, there's no need for it. An id number like that makes the table wider (which usually degrades I/O performance just slightly), and adds an index that is, strictly speaking, unnecessary. It isn't necessary to identify the entity--the existing key does that--and it leads new developers into bad habits. (Specifically, giving every table an integer column named "id", and believing that that column alone is the only key you need.)
You're likely to need one or more of these indexed.
a_id
b_id
{a_id, b_id}
{b_id, a_id}
I believe Oracle should automatically index {a_id, b_id}, because that's the primary key. Oracle doesn't automatically index foreign keys. Oracle's indexing guidelines are online.
In general, you need to think carefully about whether you need ON UPDATE CASCADE or ON DELETE CASCADE. In Oracle, you only need to think carefully about whether you need ON DELETE CASCADE. (Oracle doesn't support ON UPDATE CASCADE.)

The other comments so far are good.
Also consider adding begin_dt and end_dt to the relationship. That way, you can answer a good number of questions about each relationship through time (consider baseline issues).

How can I share the same primary key across two tables?

I'm reading a book on EF4 and I came across this problem situation:
So I was wondering how to create this database so I can follow along with the example in the book.
How would I create these tables, using simple TSQL commands? Forget about creating the database, imagine it already exists.

You've been given the code. I want to share some information on why you might want to have two tables in a relationship like that.
First when two tables have the same Primary Key and have a foreign key relationship, that means they have a one-to-one relationship. So why not just put them in the same table? There are several reasons why you might split some information out to a separate table.
First, the information is conceptually separate. If the information contained in the second table relates to a separate specific concern, it is easier to work with if the data is in a separate table. For instance, in your example they have separated out images even though they only intend to have one record per SKU. This gives you the flexibility to easily change the table later to a one-many relationship if you decide you need multiple images. It also means that when you query just for images you don't have to actually hit the other (perhaps significantly larger) table.
Which brings us to reason two to do this. You currently have a one-one relationship but you know that a future release is already scheduled to turn that into a one-many relationship. In this case it's easier to design it as a separate table, so that you won't break all your code when you move to that structure. If I were planning to do this I would go ahead and create a surrogate key as the PK and create a unique index on the FK. This way, when you go to the one-many relationship, all you have to do is drop the unique index and replace it with a regular index.
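A minimal sketch of that surrogate-key-plus-unique-index idea, assuming SQLite through Python's sqlite3 module instead of SQL Server. The ProductImage table, the ImageID column and both index names are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Product (SKU TEXT PRIMARY KEY);
    CREATE TABLE ProductImage (
        ImageID  INTEGER PRIMARY KEY,                  -- surrogate key
        SKU      TEXT NOT NULL REFERENCES Product(SKU),
        ImageURL TEXT
    );
    -- the unique index is what keeps the relationship one-to-one for now
    CREATE UNIQUE INDEX ux_ProductImage_SKU ON ProductImage (SKU);
    INSERT INTO Product VALUES ('ABC');
    INSERT INTO ProductImage (SKU, ImageURL) VALUES ('ABC', 'a.png');
""")

def add_image(sku, url):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO ProductImage (SKU, ImageURL) VALUES (?, ?)", (sku, url)
        )
        return True
    except sqlite3.IntegrityError:
        return False

second_before = add_image("ABC", "b.png")  # blocked while the index is unique

# Moving to one-to-many: swap the unique index for a regular one.
conn.executescript("""
    DROP INDEX ux_ProductImage_SKU;
    CREATE INDEX ix_ProductImage_SKU ON ProductImage (SKU);
""")
second_after = add_image("ABC", "b.png")   # accepted after the swap
print(second_before, second_after)
```

The table structure and the code that reads it stay the same across the change; only the index definition moves.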
Another reason to separate out a one-one relationship is if the table is getting too wide. Sometimes you just have too much information about an entity to easily fit it in the maximum size a record can have. In this case, you tend to take the least used fields (or those that conceptually fit together) and move them to a separate table.
Another reason to separate them out is that although you have a one-one relationship, you may not need a record of what is in the child table for most records in the parent table. So rather than having a lot of null values in the parent table, you split it out.
The code shown by the others assumes a character-based PK. If you want a relationship of this sort when you have an auto-generating Int or GUID, you need to do the autogeneration only on the parent table. Then you store that value in the child table rather than generating a new one on that table.
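Here is one way the "generate on the parent only" point could look. It is a sketch that assumes SQLite via Python's sqlite3 module, where cursor.lastrowid hands back the generated key; in SQL Server you would more likely read it with SCOPE_IDENTITY(). The sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Product (
        SKU         INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated on the parent
        Description TEXT NOT NULL
    );
    CREATE TABLE ProductWebInfo (
        SKU      INTEGER PRIMARY KEY REFERENCES Product(SKU),  -- copied from Product
        ImageURL TEXT
    );
""")

# Insert the parent first and capture the key it generated.
cur = conn.execute("INSERT INTO Product (Description) VALUES (?)", ("Widget",))
new_sku = cur.lastrowid

# Reuse that value explicitly as the child's primary key.
conn.execute(
    "INSERT INTO ProductWebInfo (SKU, ImageURL) VALUES (?, ?)",
    (new_sku, "widget.png"),
)

shared = conn.execute("""
    SELECT p.SKU, w.SKU
    FROM Product p JOIN ProductWebInfo w ON w.SKU = p.SKU
""").fetchone()
print(shared)
```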

When it says the tables share the same primary key, it just means that there is a field with the same name in each table, both set as Primary Keys.
Create Tables
CREATE TABLE [Product (Chapter 2)](
SKU varchar(50) NOT NULL,
Description varchar(50) NULL,
Price numeric(18, 2) NULL,
CONSTRAINT [PK_Product (Chapter 2)] PRIMARY KEY CLUSTERED
(
SKU ASC
)
)
CREATE TABLE [ProductWebInfo (Chapter 2)](
SKU varchar(50) NOT NULL,
ImageURL varchar(50) NULL,
CONSTRAINT [PK_ProductWebInfo (Chapter 2)] PRIMARY KEY CLUSTERED
(
SKU ASC
)
)
Create Relationships
ALTER TABLE [ProductWebInfo (Chapter 2)]
ADD CONSTRAINT fk_SKU
FOREIGN KEY(SKU)
REFERENCES [Product (Chapter 2)] (SKU)
It may look a bit simpler if the table names are just single words (and not key words, either), for example, if the table names were just Product and ProductWebInfo, without the (Chapter 2) appended:
ALTER TABLE ProductWebInfo
ADD CONSTRAINT fk_SKU
FOREIGN KEY(SKU)
REFERENCES Product(SKU)

This is simply an example that I threw together using the table designer in SSMS, but it should give you an idea (note the foreign key constraint at the end):
CREATE TABLE dbo.Product
(
SKU int NOT NULL IDENTITY (1, 1),
Description varchar(50) NOT NULL,
Price numeric(18, 2) NOT NULL
) ON [PRIMARY]
ALTER TABLE dbo.Product ADD CONSTRAINT
PK_Product PRIMARY KEY CLUSTERED
(
SKU
)
CREATE TABLE dbo.ProductWebInfo
(
SKU int NOT NULL,
ImageUrl varchar(50) NULL
) ON [PRIMARY]
ALTER TABLE dbo.ProductWebInfo ADD CONSTRAINT
FK_ProductWebInfo_Product FOREIGN KEY
(
SKU
) REFERENCES dbo.Product
(
SKU
) ON UPDATE NO ACTION
ON DELETE NO ACTION

See how to create a foreign key constraint. http://msdn.microsoft.com/en-us/library/ms175464.aspx This also has links to creating tables. You'll need to create the database as well.
To answer your question:
ALTER TABLE ProductWebInfo
ADD CONSTRAINT fk_SKU
FOREIGN KEY (SKU)
REFERENCES Product(SKU)

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?

Option 2 is not a good solution for a relational database. It's called polymorphic associations (as mentioned by @Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that @mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
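To see those guarantees in action, here is a small sketch that assumes SQLite (via Python's sqlite3 module) in place of SQL Server, so IDENTITY is dropped and the ids are supplied by hand. It keeps only the Documents subtype, and the rejected() helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Commentable (
        ResourceId   INTEGER NOT NULL,
        ResourceType INTEGER NOT NULL,
        PRIMARY KEY (ResourceId, ResourceType)
    );
    CREATE TABLE Documents (
        ResourceId   INTEGER NOT NULL,
        ResourceType INTEGER NOT NULL CHECK (ResourceType = 1),
        FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
    );
    CREATE TABLE Comments (
        CommentId    INTEGER PRIMARY KEY,
        ResourceId   INTEGER NOT NULL,
        ResourceType INTEGER NOT NULL,
        FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
    );
    INSERT INTO Commentable VALUES (1, 1);   -- resource 1 is a document
    INSERT INTO Documents VALUES (1, 1);
    INSERT INTO Comments (ResourceId, ResourceType) VALUES (1, 1);
""")

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A document row cannot claim another resource type (CHECK constraint).
wrong_type = rejected("INSERT INTO Documents VALUES (1, 2)")
# A comment cannot point at a resource that was never registered (foreign key).
orphan_comment = rejected(
    "INSERT INTO Comments (ResourceId, ResourceType) VALUES (99, 1)"
)
print(wrong_type, orphan_comment)
```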
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.

Read up on database normalization.
Nulls in the way you describe would be a big indication that the database isn't designed properly.
You need to split up your tables so that the data held in them is fully normalized. This will save you a lot of time further down the line, and it's a much better habit to get into.

From a foreign key perspective, the first example is better because you can have multiple foreign key constraints on a column but the data has to exist in all those references. It's also more flexible if the business rules change.

To continue from @OMG Ponies' answer, what you describe in the second example is called a Polymorphic Association, where the foreign key ResourceID may reference rows in more than one table. However in SQL databases, a foreign key constraint can only reference exactly one table. The database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints

The first approach is not great, since it is quite denormalized. Each time you add a new entity type, you need to update the table. You may be better off making this an attribute of document, i.e. store the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need to have a Resource table, and a ResourceID foreign key in all of your Document, Project, etc. entities (or use a mapping table). Making "ResourceID" a jack-of-all-trades that can be a documentID, projectID, etc. is not a good solution, since it cannot be used for sensible indexing or a foreign key constraint.
To normalize, you need to split the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID etc.) This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table name, with "Resource" and change "CommentID" to "ResourceID" and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc.), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource-related entities.

I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity) then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
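A quick sketch of that unique-index constraint, assuming SQLite through Python's sqlite3 module rather than SQL Server. The Documents table is left out for brevity, and the index name and the link() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Comments (
        comment_id   INTEGER PRIMARY KEY,
        comment_text TEXT NOT NULL
    );
    CREATE TABLE Document_Comments (
        document_id INTEGER NOT NULL,
        comment_id  INTEGER NOT NULL REFERENCES Comments(comment_id),
        PRIMARY KEY (document_id, comment_id)
    );
    -- a comment may be linked to at most one document
    CREATE UNIQUE INDEX UX_Document_Comments_comment
        ON Document_Comments (comment_id);
    INSERT INTO Comments VALUES (100, 'first'), (101, 'second');
    INSERT INTO Document_Comments VALUES (1, 100);
""")

def link(document_id, comment_id):
    """Return True if the link row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO Document_Comments VALUES (?, ?)", (document_id, comment_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False

shared_comment = link(2, 100)  # same comment on a second document
second_comment = link(1, 101)  # a different comment on the same document
print(shared_comment, second_comment)
```

A document can still collect many comments; only sharing one comment across documents is blocked.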
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better than having "generic" tables/solutions.

Of the options you give, I would go for number 2.

Option 2 is a good way to go. The issue that I see with it is that you are putting the resource key on that table. Each of the IDs from the different resources could be duplicated. When you join resources to the comments you will more than likely come up with comments that do not belong to that particular resource. This would be considered a many-to-many join. I would think a better option would be to have your resource tables, the comments table, and then tables that cross-reference the resource type and the comments table.

If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you cannot put a foreign key constraint on it. This is too bad, but it doesn't break anything; it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult. You can't just say "from comment join document using (documentid)". You need a complex join based on the value of the type field.
So while multiple pointer fields are ugly, I tend to think that's the right way to go. I know some db people say there should never be a null field in a table, that you should always break it off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
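For what it's worth, here is a rough sketch of the multiple-pointer-field layout, assuming SQLite via Python's sqlite3 module. The column names are made up, and the CHECK constraint that forces exactly one pointer to be filled is my own addition rather than something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE document (documentid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE project  (projectid  INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE comment (
        commentid  INTEGER PRIMARY KEY,
        documentid INTEGER REFERENCES document(documentid),
        projectid  INTEGER REFERENCES project(projectid),
        body       TEXT NOT NULL,
        -- assumption: exactly one of the pointer columns must be filled in
        CHECK ((documentid IS NOT NULL) + (projectid IS NOT NULL) = 1)
    );
    INSERT INTO document VALUES (1, 'Spec');
    INSERT INTO project  VALUES (1, 'Launch');
    INSERT INTO comment (documentid, body) VALUES (1, 'Needs review');
    INSERT INTO comment (projectid,  body) VALUES (1, 'On track');
""")

# Each pointer column has a real foreign key, so the join stays simple.
doc_comments = conn.execute("""
    SELECT d.title, c.body
    FROM comment c JOIN document d USING (documentid)
""").fetchall()

try:
    # A comment that points at nothing violates the CHECK constraint.
    conn.execute("INSERT INTO comment (body) VALUES ('orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(doc_comments, orphan_rejected)
```

Note that the project comment never shows up in the document join even though both ids are 1, which is the mix-up a single shared ResourceID column invites.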
Personally I'd be open to hearing further discussion on pros and cons.

Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each table's rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds sale price.
This is clearly a denormalized design, but it has made programming a lot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to change to different tables.
