If I have two tables in SQL with a many-many relationship, do I need to create an additional table?

If I have two tables in SQL with a many-many relationship, do I need to create an additional table? - sql

Take these tables for example.
Item
id
description
category
Category
id
description
An item can belong to many categories and a category obviously can be attached to many items.
How would the database be created in this situation? I'm not sure. Someone said create a third table, but do I need to do that? Do I literally do a
create table bla bla
for the third table?

Yes, you need to create a third table with mappings of ids, something with columns like:
item_id (Foreign Key)
category_id (Foreign Key)
edit: you can treat item_id and category_id as a primary key, they uniquely identify the record alone. In some applications I've found it useful to include an additional numeric identifier for the record itself, and you might optionally include one if you're so inclined
Think of this table as a listing of all the mappings between Items and Categories. It's concise, and it's easy to query against.
edit: removed (unnecessary) primary key.

Yes, you cannot form a third-normal-form many-to-many relationship between two tables with just those two tables. You can form a one-to-many (in one of the two directions) but in order to get a true many-to-many, you need something like:
Item
id primary key
description
Category
id primary key
description
ItemCategory
itemid foreign key references Item(id)
categoryid foreign key references Category(id)
You do not need a category in the Item table unless you have some privileged category for an item which doesn't seem to be the case here. I'm also not a big fan of introducing unnecessary primary keys when there is already a "real" unique key on the joining table. The fact that the item and category IDs are already unique means that the entire record for the ItemCategory table will be unique as well.
Simply monitor the performance of the ItemCategory table using your standard tools. You may require an index on one or more of:
itemid
categoryid
(itemid,categoryid)
(categoryid,itemid)
depending on the queries you use to join the data (and one of the composite indexes would be the primary key).
The actual syntax for the entire job would be along the lines of:
create table Item (
id integer not null primary key,
description varchar(50)
);
create table Category (
id integer not null primary key,
description varchar(50)
);
create table ItemCategory (
itemid integer references Item(id),
categoryid integer references Category(id),
primary key (itemid,categoryid)
);
There's other sorts of things you should consider, such as making your ID columns into identity/autoincrement columns, but that's not directly relevant to the question at hand.

Yes, you need a "join table". In a one-to-many relationship, objects on the "many" side can have an FK reference to objects on the "one" side, and this is sufficient to determine the entire relationship, since each of the "many" objects can only have a single "one" object.
In a many-to-many relationship, this is no longer sufficient because you can't stuff multiple FK references in a single field. (Well, you could, but then you would lose atomicity of data and all of the nice things that come with a relational database).
This is where a join table comes in - for every relationship between an Item and a Category, the relation is represented in the join table as a pair: Item.id x Category.id.

Related

SQL Server use same Guid as primary key in 2 tables

We have 2 tables with a 1:1 relationship.
1 table should reference the other, typically one would use a FK relationship.
Since there is a 1:1 relationship, we could also directly use the same Guid in both tables as primary key.
Additional info: the data is split into 2 tables since the data is rather separate, think "person" and "address" - but in a world where there is a clear 1:1 relationship between the 2.
As per the tags I was suggested I assume this is called "shared primary key".
Would using the same Guid as PK in 2 tables have any ill effects?

To consolidate info from comments into answer...
No, there are no ill effects of two tables sharing PK.
You will still need to create a FK reference from 2nd table, FK column will be the same as PK column.
Though, your example of "Person" and "Address" in 1:1 situation is not best suited. Common usage of this practice is entities that extend one another. For example: Table "User" can hold common info on all users, but tables "Candidate" and "Recruiter" can each expand on it, and all tables can share same PK. Programming language representation would also be classes that extends one another.
Other (similar) example would be table that store more detailed info than the base table like "User" and "UserDetails". It's 1:1 and no need to introduce additional PK column.
Code sample where PK is also a FK:
CREATE TABLE [User]
(
id INT PRIMARY KEY
, name NVARCHAR(100)
);
CREATE TABLE [Candidate]
(
id INT PRIMARY KEY FOREIGN KEY REFERENCES [User](id)
, actively_looking BIT
);
CREATE TABLE [Recruiter]
(
id INT PRIMARY KEY
, currently_hiring BIT
, FOREIGN KEY (id) REFERENCES [User](id)
);
PS: As mentioned GUID is not best suited column for PK due to performance issues, but that's another topic.

One Primary Key Value in many tables

This may seem like a simple question, but I am stumped:
I have created a database about cars (in Oracle SQL developer). I have amongst other tables a table called: Manufacturer and a table called Parentcompany.
Since some manufacturers are owned by bigger corporations, I will also show them in my database.
The parentcompany table is the "parent table" and the Manufacturer table the "child table".
for both I have created columns, each having their own Primary Key.
For some reason, when I inserted the values for my columns, I was able to use the same value for the primary key of Manufacturer and Parentcompany
The column: ManufacturerID is primary Key of Manufacturer. The value for this is: 'MBE'
The column: ParentcompanyID is primary key of Parentcompany. The value for this is 'MBE'
Both have the same value. Do I have a problem with the thinking logic?
Or do I just not understand how primary keys work?
Does a primary key only need to be unique in a table, and not the database?
I would appreciate it if someone shed light on the situation.

A primary key is unique for each table.
Have a look at this tutorial: SQL - Primary key
A primary key is a field in a table which uniquely identifies each
row/record in a database table. Primary keys must contain unique
values. A primary key column cannot have NULL values.
A table can have only one primary key, which may consist of single or
multiple fields. When multiple fields are used as a primary key, they
are called a composite key.
If a table has a primary key defined on any field(s), then you cannot
have two records having the same value of that field(s).

Primary key is table-unique. You can use same value of PI for every separate table in DB. Actually that often happens as PI often incremental number representing ID of a row: 1,2,3,4...
For your case more common implementation would be to have hierarchical table called Company, which would have fields: company_name and parent_company_name. In case company has a parent, in field parent_company_name it would have some value from field company_name.

There are several reasons why the same value in two different PKs might work out with no problems. In your case, it seems to flow naturally from the semantics of the data.
A row in the Manufacturers table and a row in the ParentCompany table both appear to refer to the same thing, namely a company. In that case, giving a company the same id in both tables is not only possible, but actually useful. It represents a 1 to 1 correspondence between manufacturers and parent companies without adding extra columns to serve as FKs.

Thanks for the quick answers!
I think I know what to do now. I will create a general company table, in which all companies will be stored. Then I will create, as I go along specific company tables like Manufacturer and parent company that reference a certain company in the company table.
To clarify, the only column I would put into the sub-company tables is a column with a foreign key referencing a column of the company table, yes?
For the primary key, I was just confused, because I hear so much about the key needing to be unique, and can't have the same value as another. So then this condition only goes for tables, not the whole database. Thanks for the clarification!

MS SQL creating many-to-many relation with a junction table

I'm using Microsoft SQL Server Management Studio and while creating a junction table should I create an ID column for the junction table, if so should I also make it the primary key and identity column? Or just keep 2 columns for the tables I'm joining in the many-to-many relation?
For example if this would be the many-to many tables:
MOVIE
Movie_ID
Name
etc...
CATEGORY
Category_ID
Name
etc...
Should I make the junction table:
MOVIE_CATEGORY_JUNCTION
Movie_ID
Category_ID
Movie_Category_Junction_ID
[and make the Movie_Category_Junction_ID my Primary Key and use it as the Identity Column] ?
Or:
MOVIE_CATEGORY_JUNCTION
Movie_ID
Category_ID
[and just leave it at that with no primary key or identity table] ?

I would use the second junction table:
MOVIE_CATEGORY_JUNCTION
Movie_ID
Category_ID
The primary key would be the combination of both columns. You would also have a foreign key from each column to the Movie and Category table.
The junction table would look similar to this:
create table movie_category_junction
(
movie_id int,
category_id int,
CONSTRAINT movie_cat_pk PRIMARY KEY (movie_id, category_id),
CONSTRAINT FK_movie
FOREIGN KEY (movie_id) REFERENCES movie (movie_id),
CONSTRAINT FK_category
FOREIGN KEY (category_id) REFERENCES category (category_id)
);
See SQL Fiddle with Demo.
Using these two fields as the PRIMARY KEY will prevent duplicate movie/category combinations from being added to the table.

There are different schools of thought on this. One school prefers including a primary key and naming the linking table something more significant than just the two tables it is linking. The reasoning is that although the table may start out seeming like just a linking table, it may become its own table with significant data.
An example is a many-to-many between magazines and subscribers. Really that link is a subscription with its own attributes, like expiration date, payment status, etc.
However, I think sometimes a linking table is just a linking table. The many to many relationship with categories is a good example of this.
So in this case, a separate one field primary key is not necessary. You could have a auto-assign key, which wouldn't hurt anything, and would make deleting specific records easier. It might be good as a general practice, so if the table later develops into a significant table with its own significant data (as subscriptions) it will already have an auto-assign primary key.
You can put a unique index on the two fields to avoid duplicates. This will even prevent duplicates if you have a separate auto-assign key. You could use both fields as your primary key (which is also a unique index).
So, the one school of thought can stick with integer auto-assign primary keys, and avoids compound primary keys. This is not the only way to do it, and maybe not the best, but it won't lead you wrong, into a problem where you really regret it.
But, for something like what you are doing, you will probably be fine with just the two fields. I'd still recommend either making the two fields a compound primary key, or at least putting a unique index on the two fields.

I would go with the 2nd junction table. But make those two fields as Primary key. That will restrict duplicate entries.

SQL column containing array of table row values

I'm quite new to databases and SQL. I have an assignment, but I don't know how to solve this.
Let's say I have table named Persons. It has following columns: userId (primary key), userName and tags.
Tags is another table, containing only id and tagName. Person can have multiple tags, for example: "football", "cooking" etc.
Each tag corresponds to Tag table row. But I'm not sure how to achieve this kind of relation. How can I achieve this?
I'm guessing it has to something to do with foreign keys and making persons tag column an array of foreign keys.

You normally want to have a relation table (this is called a many-to-many relationship, or M:M).
Persons(id int, name varchar(40)...)
Tags(id int, name varchar(40)....)
PersonTags(id int,
personid FOREIGN KEY References Persons(id),
TagId FOREIGN KEY references Tags(id))
This way you can have as many or as few tags as you like, but the tag data isn't stored in-row for the person. You could potentially store a series of values in a single field, but this is a really terrible idea for a variety of reasons.

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?

Option 2 is not a good solution for a relational database. It's called polymorphic associations (as mentioned by #Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that #mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.

Read up on database normalization.
Nulls in the way you describe would be a big indication that the database isn't designed properly.
You need to split up all your tables so that the data held in them is fully normalized, this will save you a lot of time further down the line guaranteed, and it's a lot better practice to get into the habit of.

From a foreign key perspective, the first example is better because you can have multiple foreign key constraints on a column but the data has to exist in all those references. It's also more flexible if the business rules change.

To continue from #OMG Ponies' answer, what you describe in the second example is called a Polymorphic Association, where the foreign key ResourceID may reference rows in more than one table. However in SQL databases, a foreign key constraint can only reference exactly one table. The database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints

The first approach is not great, since it is quite denormalized. Each time you add a new entity type, you need to update the table. You may be better off making this an attribute of document - I.e. store the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need to have a Resource table, and a ResourceID foreign key in all of your Document, Project etc.. entities (or use a mapping table.) Making "ResourceID" a jack-of-all-trades, that can be a documentID, projectID etc.. is not a good solution since it cannot be used for sensible indexing or foreign key constraint.
To normalize, you need to the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID etc.) This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table name, with "Resource" and change "CommentID" to "ResourceID" and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc..), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource related entities.

I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If the each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity) then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better then having "generic" tables/solutions.

Of the options you give, I would go for number 2.

Option 2 is a good way to go. The issue that I see with that is you are putting the resouce key on that table. Each of the IDs from the different resources could be duplicated. When you join resources to the comments you will more than likely come up with comments that do not belong to that particular resouce. This would be considered a many to many join. I would think a better option would be to have your resource tables, the comments table, and then tables that cross reference the resource type and the comments table.

If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you could not put a foreign key constraint on it. This is too bad, but it doesn't break anything, it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult. You can just say "from comment join document using (documentid)". You need a complex join based on the value of the type field.
So while the multiple pointer fields is ugly, I tend to think that's the right way to go. I know some db people say there should never be a null field in a table, that you should always break it off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
Personally I'd be open to hearing further discussion on pros and cons.

Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each tables rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds sale price.
This is clearly a denormalized design, but has made programming alot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to change to different tables.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas