Designing the primary key in associative table - sql

Suppose I have an artist table like:
id  name
1   John Coltrane
2   Springsteen
and a song table like:
id  title
1   Singing in the rain
2   Mimosa
Now an artist can write more than one song, and a song can be written by more than one artist. We have a many-to-many relation. We need an associative table!
How should I design the primary key of the associative table?
One way would be to define a composite key of the two foreign keys, like this:
CREATE TABLE artist_song_map(
    artist_id INTEGER,
    song_id INTEGER,
    PRIMARY KEY(artist_id, song_id),
    FOREIGN KEY(artist_id) REFERENCES artist(id),
    FOREIGN KEY(song_id) REFERENCES song(id)
)
Another way would be to have a synthetic primary key and impose a unique constraint on the tuple of the two foreign keys:
CREATE TABLE artist_song_map(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    artist_id INTEGER,
    song_id INTEGER,
    UNIQUE(artist_id, song_id),
    FOREIGN KEY(artist_id) REFERENCES artist(id),
    FOREIGN KEY(song_id) REFERENCES song(id)
)
Which design choice is better?

Unless you define the table as WITHOUT ROWID, both statements will create the same table.
The id column in your second design adds nothing but an alias for the rowid column, which is created either way.
Since this is a bridge table, you only need to define the combination of the columns artist_id and song_id as UNIQUE.
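A quick way to check this in SQLite is to select rowid explicitly. This is only a sketch: it assumes the artist and song rows from the question exist, and each half is meant to be run against the corresponding version of artist_song_map.

```sql
-- Design 1: composite primary key, ordinary rowid table.
-- The implicit rowid column exists even though it was never declared.
INSERT INTO artist_song_map (artist_id, song_id) VALUES (1, 2);
SELECT rowid, artist_id, song_id FROM artist_song_map;

-- Design 2: id INTEGER PRIMARY KEY AUTOINCREMENT.
-- id is an alias for rowid, so both columns hold the same value in every row.
INSERT INTO artist_song_map (artist_id, song_id) VALUES (1, 2);
SELECT rowid, id, artist_id, song_id FROM artist_song_map;
```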
If you want to extend your design with other tables, like a playlist table, you will have to decide how it will be linked to the existing tables:
If there is no id column in artist_song_map, then you will link playlist to song and artist, just like you did with artist_song_map.
If there is an id column in artist_song_map, then you can link playlist directly to that id.
I suggest that you base your decision not only on these 3 tables (song, artist and artist_song_map), but also on the tables that you plan to add.
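The two options could look roughly like this in SQLite. The playlist table, the two playlist_item_* tables and all of their columns are hypothetical names chosen for illustration; they are not part of the question.

```sql
-- Hypothetical playlist table.
CREATE TABLE playlist (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

-- Option 1: artist_song_map has no id column.
-- The linking table carries both foreign keys, just like artist_song_map does.
CREATE TABLE playlist_item_by_pair (
    playlist_id INTEGER NOT NULL REFERENCES playlist(id),
    artist_id   INTEGER NOT NULL REFERENCES artist(id),
    song_id     INTEGER NOT NULL REFERENCES song(id),
    PRIMARY KEY (playlist_id, artist_id, song_id)
);

-- Option 2: artist_song_map has an id column.
-- A single foreign key is enough to reach both the artist and the song.
CREATE TABLE playlist_item_by_map_id (
    playlist_id        INTEGER NOT NULL REFERENCES playlist(id),
    artist_song_map_id INTEGER NOT NULL REFERENCES artist_song_map(id),
    PRIMARY KEY (playlist_id, artist_song_map_id)
);
```

Option 1 keeps artist_id and song_id directly queryable in the linking table; option 2 needs one extra join through artist_song_map to get them.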

Logically, both designs are the same. From an administration standpoint, though, the identity design is more efficient: there is less disk fragmentation, and future redesign or maintenance will be easier.

Bridge tables normally don't require an ID (AUTO_INCREMENT) column to identify their rows.
The linking columns (the foreign keys) are the main point, as they link artists to a song (or songs).
Only when you need special attributes on the bridge, or you want to reference a row of the bridge table without carrying two linking columns, would you use such an ID field. But as I said, you normally never need it.

While the differences are generally minor, the composite (compound) primary key design sounds more natural. A separate primary key, together with its associated index, takes additional space in the database. Further, if you use a composite primary key, you can declare the table as WITHOUT ROWID. According to the official docs, "in some cases, a WITHOUT ROWID table can use about half the amount of disk space and can operate nearly twice as fast".
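As a sketch of that variant in SQLite, the composite-key table from the question can be declared WITHOUT ROWID as shown here. The extra index and its name are assumptions, added only to illustrate lookups that start from the song side.

```sql
-- Composite primary key, stored without the implicit rowid column.
-- The primary key itself becomes the table's storage order.
CREATE TABLE artist_song_map (
    artist_id INTEGER NOT NULL REFERENCES artist(id),
    song_id   INTEGER NOT NULL REFERENCES song(id),
    PRIMARY KEY (artist_id, song_id)
) WITHOUT ROWID;

-- Optional (hypothetical name): supports "which artists wrote this song?"
CREATE INDEX artist_song_map_by_song
    ON artist_song_map (song_id, artist_id);
```

Whether the space and speed gains quoted above materialize depends on the workload, so it is worth measuring on real data before committing to it.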

Related

SQL Server use same Guid as primary key in 2 tables

We have 2 tables with a 1:1 relationship.
1 table should reference the other, typically one would use a FK relationship.
Since there is a 1:1 relationship, we could also directly use the same Guid in both tables as primary key.
Additional info: the data is split into 2 tables since the data is rather separate, think "person" and "address" - but in a world where there is a clear 1:1 relationship between the 2.
Judging by the tags that were suggested to me, I assume this is called a "shared primary key".
Would using the same Guid as PK in 2 tables have any ill effects?
To consolidate the info from the comments into an answer...
No, there are no ill effects of two tables sharing a PK.
You will still need to create an FK reference from the 2nd table; the FK column will be the same as the PK column.
That said, your example of "Person" and "Address" in a 1:1 situation is not the best fit. This practice is more commonly used for entities that extend one another. For example, a "User" table can hold the common info on all users, while the "Candidate" and "Recruiter" tables each expand on it, and all three tables share the same PK. In a programming language, these would be represented as classes that extend one another.
Another (similar) example is a table that stores more detailed info than its base table, like "User" and "UserDetails". The relationship is 1:1, so there is no need to introduce an additional PK column.
Code sample where the PK is also an FK:
CREATE TABLE [User]
(
    id INT PRIMARY KEY
  , name NVARCHAR(100)
);
CREATE TABLE [Candidate]
(
    id INT PRIMARY KEY FOREIGN KEY REFERENCES [User](id)
  , actively_looking BIT
);
CREATE TABLE [Recruiter]
(
    id INT PRIMARY KEY
  , currently_hiring BIT
  , FOREIGN KEY (id) REFERENCES [User](id)
);
PS: As mentioned, a GUID is not the best-suited column type for a PK because of performance issues, but that's another topic.
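A short usage sketch against the three tables above (T-SQL; the sample values are made up) shows how the shared key behaves in practice:

```sql
-- The base row must exist first.
INSERT INTO [User] (id, name) VALUES (1, N'Alice');

-- The extension row reuses the same id as its own primary key.
INSERT INTO [Candidate] (id, actively_looking) VALUES (1, 1);

-- This would fail with a foreign key violation, because there is no [User] row with id = 2:
-- INSERT INTO [Recruiter] (id, currently_hiring) VALUES (2, 1);

-- Joining base and extension needs no extra column, just the shared key.
SELECT u.id, u.name, c.actively_looking
FROM [User] AS u
JOIN [Candidate] AS c ON c.id = u.id;
```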

MS SQL creating many-to-many relation with a junction table

I'm using Microsoft SQL Server Management Studio. When creating a junction table, should I create an ID column for it, and if so, should I also make it the primary key and identity column? Or should I keep just the 2 columns for the tables I'm joining in the many-to-many relation?
For example, if these were the many-to-many tables:
MOVIE
    Movie_ID
    Name
    etc...
CATEGORY
    Category_ID
    Name
    etc...
Should I make the junction table:
MOVIE_CATEGORY_JUNCTION
    Movie_ID
    Category_ID
    Movie_Category_Junction_ID
[and make the Movie_Category_Junction_ID my Primary Key and use it as the Identity Column] ?
Or:
MOVIE_CATEGORY_JUNCTION
    Movie_ID
    Category_ID
[and just leave it at that with no primary key or identity column] ?
I would use the second junction table:
MOVIE_CATEGORY_JUNCTION
Movie_ID
Category_ID
The primary key would be the combination of both columns. You would also have a foreign key from each column to the Movie and Category table.
The junction table would look similar to this:
create table movie_category_junction
(
    movie_id int,
    category_id int,
    CONSTRAINT movie_cat_pk PRIMARY KEY (movie_id, category_id),
    CONSTRAINT FK_movie
        FOREIGN KEY (movie_id) REFERENCES movie (movie_id),
    CONSTRAINT FK_category
        FOREIGN KEY (category_id) REFERENCES category (category_id)
);
Using these two fields as the PRIMARY KEY will prevent duplicate movie/category combinations from being added to the table.
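As a rough illustration in T-SQL, assuming that movie 1 and categories 10 and 20 already exist (these ids are made up), the composite key behaves like this:

```sql
-- Two different categories for the same movie: both accepted.
INSERT INTO movie_category_junction (movie_id, category_id) VALUES (1, 10);
INSERT INTO movie_category_junction (movie_id, category_id) VALUES (1, 20);

-- The same pair again: rejected with a primary key violation on movie_cat_pk.
INSERT INTO movie_category_junction (movie_id, category_id) VALUES (1, 10);

-- All categories of movie 1, resolved through the junction table.
SELECT c.name
FROM category AS c
JOIN movie_category_junction AS mc ON mc.category_id = c.category_id
WHERE mc.movie_id = 1;
```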
There are different schools of thought on this. One school prefers including a primary key and naming the linking table something more significant than just the two tables it is linking. The reasoning is that although the table may start out seeming like just a linking table, it may become its own table with significant data.
An example is a many-to-many between magazines and subscribers. Really that link is a subscription with its own attributes, like expiration date, payment status, etc.
However, I think sometimes a linking table is just a linking table. The many to many relationship with categories is a good example of this.
So in this case, a separate single-field primary key is not necessary. You could have an auto-assigned key, which wouldn't hurt anything and would make deleting specific records easier. It might be good as a general practice: if the table later develops into a significant table with its own significant data (as with subscriptions), it will already have an auto-assigned primary key.
You can put a unique index on the two fields to avoid duplicates. This prevents duplicates even if you have a separate auto-assigned key. Alternatively, you could use both fields as your primary key (which is also a unique index).
So the first school of thought sticks with integer auto-assigned primary keys and avoids compound primary keys. This is not the only way to do it, and maybe not the best, but it won't lead you into a problem that you really regret.
But, for something like what you are doing, you will probably be fine with just the two fields. I'd still recommend either making the two fields a compound primary key, or at least putting a unique index on the two fields.
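For contrast, the magazine/subscriber case mentioned above might be sketched like this in T-SQL. Every table and column name here is hypothetical; the point is only that the link has its own attributes and its own key, while the unique constraint still blocks duplicate pairs.

```sql
-- Hypothetical parent tables, kept minimal.
CREATE TABLE magazine   (magazine_id   INT PRIMARY KEY, title NVARCHAR(100));
CREATE TABLE subscriber (subscriber_id INT PRIMARY KEY, name  NVARCHAR(100));

-- The "link" is really a subscription with attributes of its own.
CREATE TABLE subscription
(
    subscription_id INT IDENTITY(1,1) PRIMARY KEY,   -- auto-assigned key
    magazine_id     INT NOT NULL REFERENCES magazine (magazine_id),
    subscriber_id   INT NOT NULL REFERENCES subscriber (subscriber_id),
    expiration_date DATE NULL,
    payment_status  VARCHAR(20) NOT NULL,
    -- Still prevents the same magazine/subscriber pair from appearing twice.
    CONSTRAINT UQ_subscription UNIQUE (magazine_id, subscriber_id)
);
```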
I would go with the 2nd junction table, but make those two fields the primary key. That will prevent duplicate entries.

Identity column separate from composite primary key

I have a table representing soccer matches:
Date
Opponent
I feel {Date,Opponent} is the primary key because in this table there can never be more than one opponent per date. The problem is that when I create foreign key constraints in other tables, I have to include both Date and Opponent columns in the other tables:
Soccer game statistics table:
    Date
    Opponent
    Event (Goal scored, yellow card etc)
Ideally I would like to have:
Soccer matches table:
    ID
    Date
    Opponent
Soccer match statistics table:
    SoccerMatchID
    Event (Goal scored, yellow card etc)
where SoccerMatch.ID is a unique ID (but not the primary key) and {Date,Opponent} is still the primary key.
The problem is SQL Server doesn't seem to let me define ID as being a unique identity whilst {Date,Opponent} is the primary key. When I go to the properties for ID, the property that marks it as a unique identity is grayed out and set to "No".
(I assume everyone agrees I should try to achieve the above as it's a better design?)
I think most people don't use the graphical designer to do this, as it's the graphical designer that's preventing it, not SQL Server. Try running DDL in a query window:
ALTER TABLE dbo.YourTable ADD ID INT IDENTITY(1,1);
GO
CREATE UNIQUE INDEX yt_id ON dbo.YourTable(ID);
GO
Now you can reference this column in other tables no problem:
CREATE TABLE dbo.SomeOtherTable
(
MatchID INT FOREIGN KEY REFERENCES dbo.YourTable(ID)
);
That said, I find the column name ID completely useless. If it's a MatchID, why not call it MatchID everywhere it appears in the schema? Yes it's redundant in the PK table but IMHO consistency throughout the model is more important.
For that matter, why is your table called SoccerMatch? Do you have other kinds of matches? I would think it would be Matches with a unique ID = MatchID. That way if you later have different types of matches you don't have to create a new table for each sport - just add a type column of some sort. If you only ever have soccer, then SoccerMatch is kind of redundant, no?
Also I would suggest that the key and unique index be the other way around. If you're not planning to use the multi-column key for external reference then it is more intuitive, at least to me, to make the PK the thing you do reference in other tables. So I would say:
CREATE TABLE dbo.Matches
(
MatchID INT IDENTITY(1,1),
EventDate DATE, -- Date is also a terrible name and it's reserved
Opponent <? data type ?> / FK reference?
);
ALTER TABLE dbo.Matches ADD CONSTRAINT PK_Matches
PRIMARY KEY (MatchID);
ALTER TABLE dbo.Matches ADD CONSTRAINT UQ_Date_Opponent
UNIQUE (EventDate, Opponent);
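With that layout, the statistics table from the question only needs the single MatchID column. A minimal sketch in T-SQL follows; the MatchEvents table and its column names are assumptions, not something given in the question.

```sql
CREATE TABLE dbo.MatchEvents
(
    MatchEventID INT IDENTITY(1,1) NOT NULL,
    MatchID      INT NOT NULL,
    EventType    VARCHAR(50) NOT NULL,  -- e.g. 'Goal scored', 'Yellow card'
    CONSTRAINT PK_MatchEvents PRIMARY KEY (MatchEventID),
    CONSTRAINT FK_MatchEvents_Matches
        FOREIGN KEY (MatchID) REFERENCES dbo.Matches (MatchID)
);

-- Date and opponent are recovered through the join instead of being repeated per event.
SELECT m.EventDate, m.Opponent, e.EventType
FROM dbo.Matches AS m
JOIN dbo.MatchEvents AS e ON e.MatchID = m.MatchID;
```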

id columns or clustered primary keys/database consistency

If I had a table with the columns:
Artist
Album
Song
NumberOfListens
...is it better to put a clustered primary key on Artist, Album, and Song, or to have an autoincrementing id column and put a unique constraint on Artist, Album, and Song?
How important is database consistency? If half of my tables have clustered primary keys and the other half an id column with unique constraints, is that bad or does it not matter? Both ways seem the same to me but I do not know what the industry standard is or which is better and why.
I would never put a primary key on columns of long text like Artist, Album, and Song. Use an auto-increment ID as the clustered PK. If you want the combination of Artist, Album, and Song to be unique, add a unique index on the three. If you want to search by Album or Song independent of Artist, you'll need an index for each, and each of those indexes pulls in the PK, so a small PK saves space in every other index. The savings are not just in disk space but also in the memory cache, and more keys fit on a page.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario. You reference primary keys in your foreign key constraints, so those are crucial for the integrity of your database. Use them - always - period.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, unique, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way, you can easily pick a column that is not your primary key to be your clustering key.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a VARCHAR(20) or so as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
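To make the separation concrete, here is one way it could look in T-SQL for the table from the question. The table name, the SongListenID column and the constraint/index names are assumptions for illustration.

```sql
CREATE TABLE dbo.SongListens
(
    SongListenID    INT IDENTITY(1,1) NOT NULL,
    Artist          VARCHAR(100) NOT NULL,
    Album           VARCHAR(100) NOT NULL,
    Song            VARCHAR(100) NOT NULL,
    NumberOfListens INT NOT NULL DEFAULT 0,
    -- Logical key: declared as the primary key, but kept out of the clustered index.
    CONSTRAINT PK_SongListens PRIMARY KEY NONCLUSTERED (Artist, Album, Song)
);

-- Physical ordering: cluster on the small, ever-increasing integer instead.
CREATE UNIQUE CLUSTERED INDEX CIX_SongListens_ID
    ON dbo.SongListens (SongListenID);
```

Every non-clustered index on this table, including the primary key index, then carries the 4-byte SongListenID rather than up to 300 bytes of text.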
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
Clustered indexes are great for range based queries. For example, a log date or order date. Putting one on Artist, Album, and Song will [probably] cause fragmentation when you insert new rows.
If your DB supports it, add a non-clustered primary key on Artist, Album, and Song and call it good. Or just add a unique key on Artist, Album, and Song.
An autoincrementing primary key would only really be useful if you had to have referential integrity to another table.
Without knowing the exact requirements, in general you would probably have an artist table, and possibly album table too. A song table would then be a unique combination of artist id, album id and then song. I'd enforce the uniqueness by an index or constraint depending on application, and use an id for a primary key.
First of all, there's already a problem here because the data is not normalized. Creating any sort of index on a bunch of text columns is something that should be avoided whenever possible. Even if these columns aren't text (and I suspect that they are), it still doesn't make sense to have artist, album and song in the same table. A much better design for this would be:
CREATE TABLE Artists (
    ArtistID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistName varchar(100) NOT NULL
);
CREATE TABLE Albums (
    AlbumID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistID int NOT NULL,
    AlbumName varchar(100) NOT NULL,
    CONSTRAINT FK_Albums_Artists FOREIGN KEY (ArtistID)
        REFERENCES Artists (ArtistID)
);
CREATE TABLE Songs (
    SongID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    AlbumID int NOT NULL,
    SongName varchar(100) NOT NULL,
    NumberOfListens int NOT NULL DEFAULT 0,
    CONSTRAINT FK_Songs_Albums FOREIGN KEY (AlbumID)
        REFERENCES Albums (AlbumID)
);
Once you have this design, you have the ability to search for individual albums and artists as well as songs. You can also add covering indexes to speed up queries, and the indexes will be much smaller and therefore faster than the original design.
If you don't need to do range queries (which you probably don't), then you could replace the IDENTITY key with a ROWGUID if that suits your design better; it doesn't really matter much in this case, I would stick with the simple IDENTITY.
You have to be careful with clustering keys. If you cluster on a key that is completely not even remotely sequential (and an artist, album, and song name definitely qualify as non-sequential), then you end up with page splits and other nastiness. You don't want this. And as Marc says, a copy of this key gets added to every index, and you definitely don't want this when your key is 300 or 600 bytes long.
If you want to be able to quickly query for the number of listens for a specific song by the artist, album, and song name, it's actually quite simple with the above design, you just need to index properly:
CREATE UNIQUE INDEX IX_Artists_Name ON Artists (ArtistName)
CREATE UNIQUE INDEX IX_Albums_Artist_Name ON Albums (ArtistID, AlbumName)
CREATE UNIQUE INDEX IX_Songs_Album_Name ON Songs (AlbumID, SongName)
INCLUDE (NumberOfListens)
Now this query will be fast:
SELECT ArtistName, AlbumName, SongName, NumberOfListens
FROM Artists ar
INNER JOIN Albums al
ON al.ArtistID = ar.ArtistID
INNER JOIN Songs s
ON s.AlbumID = al.AlbumID
WHERE ar.ArtistName = @ArtistName
AND al.AlbumName = @AlbumName
AND s.SongName = @SongName
If you check out the execution plan you'll see 3 index seeks - it's as fast as you can get it. We've guaranteed the exact same uniqueness as in the original design and optimized for speed. More importantly, it's normalized, so both an artist and an album have their own specific identity, which makes this a great deal easier to manage over the long term. It's much easier to search for "all albums by artist X." It's much much easier and faster to search for "all songs on album Y."
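The two searches mentioned above could be written roughly as follows against the normalized tables (T-SQL; the variable values are placeholders):

```sql
-- "All albums by artist X": seek on the artist name, then follow the key.
DECLARE @ArtistName varchar(100) = 'X';  -- placeholder value

SELECT al.AlbumName
FROM Artists ar
INNER JOIN Albums al
    ON al.ArtistID = ar.ArtistID
WHERE ar.ArtistName = @ArtistName;

-- "All songs on album Y": a single-table lookup once the album id is known.
DECLARE @AlbumID int = 1;  -- placeholder value

SELECT s.SongName, s.NumberOfListens
FROM Songs s
WHERE s.AlbumID = @AlbumID;
```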
When designing a database, normalization should be your first concern, indexing should be your second. And you're likely to find that once you have a normalized design, the best indexing strategy becomes kind of obvious.

If I have two tables in SQL with a many-many relationship, do I need to create an additional table?

Take these tables for example.
Item
    id
    description
    category
Category
    id
    description
An item can belong to many categories and a category obviously can be attached to many items.
How would the database be created in this situation? I'm not sure. Someone said create a third table, but do I need to do that? Do I literally do a
create table bla bla
for the third table?
Yes, you need to create a third table with mappings of ids, something with columns like:
item_id (Foreign Key)
category_id (Foreign Key)
edit: you can treat item_id and category_id as a primary key, they uniquely identify the record alone. In some applications I've found it useful to include an additional numeric identifier for the record itself, and you might optionally include one if you're so inclined
Think of this table as a listing of all the mappings between Items and Categories. It's concise, and it's easy to query against.
edit: removed (unnecessary) primary key.
Yes, you cannot form a third-normal-form many-to-many relationship between two tables with just those two tables. You can form a one-to-many (in one of the two directions) but in order to get a true many-to-many, you need something like:
Item
    id primary key
    description
Category
    id primary key
    description
ItemCategory
    itemid foreign key references Item(id)
    categoryid foreign key references Category(id)
You do not need a category in the Item table unless you have some privileged category for an item which doesn't seem to be the case here. I'm also not a big fan of introducing unnecessary primary keys when there is already a "real" unique key on the joining table. The fact that the item and category IDs are already unique means that the entire record for the ItemCategory table will be unique as well.
Simply monitor the performance of the ItemCategory table using your standard tools. You may require an index on one or more of:
itemid
categoryid
(itemid,categoryid)
(categoryid,itemid)
depending on the queries you use to join the data (and one of the composite indexes would be the primary key).
The actual syntax for the entire job would be along the lines of:
create table Item (
id integer not null primary key,
description varchar(50)
);
create table Category (
id integer not null primary key,
description varchar(50)
);
create table ItemCategory (
itemid integer references Item(id),
categoryid integer references Category(id),
primary key (itemid,categoryid)
);
There are other things you should consider, such as making your ID columns identity/autoincrement columns, but that's not directly relevant to the question at hand.
Yes, you need a "join table". In a one-to-many relationship, objects on the "many" side can have an FK reference to objects on the "one" side, and this is sufficient to determine the entire relationship, since each of the "many" objects can only have a single "one" object.
In a many-to-many relationship, this is no longer sufficient because you can't stuff multiple FK references in a single field. (Well, you could, but then you would lose atomicity of data and all of the nice things that come with a relational database).
This is where a join table comes in - for every relationship between an Item and a Category, the relation is represented in the join table as a pair: Item.id x Category.id.
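As a small sketch of what those pairs look like in practice, using the Item, Category and ItemCategory tables defined earlier in this thread (the sample rows are made up):

```sql
INSERT INTO Item (id, description) VALUES (1, 'Hammer');
INSERT INTO Category (id, description) VALUES (10, 'Tools');
INSERT INTO Category (id, description) VALUES (20, 'Hardware');

-- One row per relationship: item 1 belongs to both categories.
INSERT INTO ItemCategory (itemid, categoryid) VALUES (1, 10);
INSERT INTO ItemCategory (itemid, categoryid) VALUES (1, 20);

-- Categories of item 1.
SELECT c.description
FROM ItemCategory ic
JOIN Category c ON c.id = ic.categoryid
WHERE ic.itemid = 1;

-- Items in category 10 (the reverse direction uses the same join table).
SELECT i.description
FROM ItemCategory ic
JOIN Item i ON i.id = ic.itemid
WHERE ic.categoryid = 10;
```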