Scalable Database Tagging Schema - sql

EDIT: To people building tagging systems: don't read this. It is not what you are looking for. I asked this before I knew that every RDBMS has its own optimization methods. Just use a simple many-to-many schema.
I have a posting system that has millions of posts. Each post can have an infinite number of tags associated with it.
Users can create tags which have notes, date created, owner, etc. A tag is almost like a post itself, because people can post notes about the tag.
Each tag association has an owner and date, so we can see who added the tag and when.
My question is: how can I implement this? Searching posts by tag, or tags by post, has to be fast. Also, users can add tags to posts by typing the name into a field which, like the Google search bar, fills in the rest of the tag name for them.
I have 3 solutions at the moment, but not sure which is the best, or if there is a better way.
Note that I'm not showing the layout of notes since it will be trivial once I get a proper solution for tags.
Method 1. Linked list
tagId in post points to a linked list in tag_assoc, the application must traverse the list until flink=0
post: id, content, ownerId, date, tagId, notesId
tag_assoc: id, tagId, ownerId, flink
tag: id, name, notesId
Method 2. Denormalization
tags is simply a VARCHAR or TEXT field containing a tab delimited array of tagId:ownerId. It cannot be a fixed size.
post: id, content, ownerId, date, tags, notesId
tag: id, name, notesId
Method 3. Toxi
(from: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html,
also same thing here: Recommended SQL database design for tags or tagging)
post: id, content, ownerId, date, notesId
tag_assoc: ownerId, tagId, postId
tag: id, name, notesId
Method 3 raises the question, how fast will it be to iterate through every single row in tag_assoc?
Methods 1 and 2 should be fast for returning tags by post, but for posts by tag, another lookup table must be made.
The last thing I have to worry about is optimizing searching tags by name, I have not worked that out yet.
I made an ASCII diagram here: http://pastebin.com/f1c4e0e53

Here is how I'd do it:
posts: [postId], content, ownerId, date, noteId, noteType='post'
tag_assoc: [postId, tagName], ownerId, date, noteId, noteType='tagAssoc'
tags: [tagName], ownerId, date, noteId, noteType='tag'
notes: [noteId, noteType], ownerId, date, content
The fields in square brackets are the primary key of the respective table.
Define a constraint on noteType in each table: posts, tag_assoc, and tags. This prevents a given note from applying to both a post and a tag, for example.
Store tag names as a short string, not an integer id. That way you can use the covering index [postId, tagName] in the tag_assoc table.
Tag completion can be done with an AJAX call. If the user types "datab" for a tag, your web page makes an AJAX call and, on the server side, the app queries: SELECT tagName FROM tags WHERE tagName LIKE ?||'%'.
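A minimal sketch of the server side of that completion call, run against SQLite purely as a stand-in for whatever RDBMS is in use. The function name complete_tag, the limit parameter, and the sample tags are assumptions for illustration; only the tags table, the tagName column, and the LIKE prefix query come from the answer above.

```python
import sqlite3

def complete_tag(conn, prefix, limit=10):
    """Return up to `limit` tag names that start with `prefix`."""
    # Escape LIKE wildcards so input such as "50%" is matched literally.
    escaped = (prefix.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT tagName FROM tags "
        "WHERE tagName LIKE ? || '%' ESCAPE '\\' "
        "ORDER BY tagName LIMIT ?",
        (escaped, limit),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical sample data, only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tagName TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO tags VALUES (?)",
                 [("database",), ("database-design",), ("data",), ("sql",)])

print(complete_tag(conn, "datab"))  # ['database', 'database-design']
```

Because the pattern is anchored at the start of the string, an index on tagName can typically serve the query; a pattern with a leading wildcard could not use it the same way.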

"A tag is almost like a post itself, because people can post notes about the tag." - this phrase makes me think you really just want one table for POST, with a primary key and a foreign key that references the POST table. Now you can have as many tags for each post as your disk space will allow.
I'm assuming there's no need for many to many between POST and tags, because a tag isn't shared across posts, based on this:
"Users can create tags which have notes, date created, owner, etc."
If creation date and owner are shared, those would be two additional foreign key relationships, IMO.

A linked list is almost certainly the wrong approach. It certainly means that your queries will be either complex or sub-optimal - which is ironic since the most likely reason for using a linked list is to keep the data in the correct sorted order. However, I don't see an easy way to avoid iteratively fetching a row, and then using the flink value retrieved to condition the select operation for the next row.
So, use a table-based approach with normal foreign key to primary key references. The one outlined by Bill Karwin looks similar to what I'd outline.

Bill I think I kind of threw you off, the notes are just in another table and there is a separate table with notes posted by different people. Posts have notes and tags, but tags also have notes, which is why tags are UNIQUE.
Jonathan is right about linked lists; I won't use them at all. I decided to implement the tags in the simplest normalized way that meets my needs:
DROP TABLE IF EXISTS `tags`;
CREATE TABLE IF NOT EXISTS `tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts`;
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`content` TEXT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_notes`;
CREATE TABLE IF NOT EXISTS `posts_notes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
`note` TEXT NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_tags`;
CREATE TABLE IF NOT EXISTS `posts_tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`tagId` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE,
FOREIGN KEY (`tagId`) REFERENCES tags(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
I'm not sure how fast this will be in the future, but it should be fine for a while as only a couple people use the database.
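As a rough sketch of how the two lookups (posts by tag, tags by post) could look against a schema of this shape, here is a cut-down version run on SQLite as a stand-in for MySQL. The composite primary key on posts_tags, the index name idx_posts_tags_tag, the helper function names, and the sample rows are assumptions, not part of the schema above; the idea is only that each direction of lookup gets an index whose leading column matches the search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE posts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE posts_tags (
    postId INTEGER NOT NULL REFERENCES posts(id),
    tagId  INTEGER NOT NULL REFERENCES tags(id),
    owner  INTEGER NOT NULL,
    PRIMARY KEY (postId, tagId)   -- leading postId serves tags-by-post
);
-- Assumed extra index: leading tagId serves posts-by-tag.
CREATE INDEX idx_posts_tags_tag ON posts_tags (tagId, postId);

INSERT INTO tags  VALUES (1, 'sql'), (2, 'mysql');
INSERT INTO posts VALUES (10, 'first'), (11, 'second');
INSERT INTO posts_tags VALUES (10, 1, 7), (10, 2, 7), (11, 1, 8);
""")

def posts_by_tag(tag_name):
    """Names of all posts carrying the given tag."""
    rows = conn.execute(
        "SELECT p.name FROM tags t "
        "JOIN posts_tags pt ON pt.tagId = t.id "
        "JOIN posts p ON p.id = pt.postId "
        "WHERE t.name = ? ORDER BY p.id", (tag_name,))
    return [r[0] for r in rows]

def tags_by_post(post_id):
    """Names of all tags attached to the given post."""
    rows = conn.execute(
        "SELECT t.name FROM posts_tags pt "
        "JOIN tags t ON t.id = pt.tagId "
        "WHERE pt.postId = ? ORDER BY t.name", (post_id,))
    return [r[0] for r in rows]

print(posts_by_tag("sql"))   # ['first', 'second']
print(tags_by_post(10))      # ['mysql', 'sql']
```

A side effect of the composite key is that the same tag cannot be attached to the same post twice, which the surrogate id column alone does not prevent.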

Related

Add a constraint to check whether a value isn't in 2 arrays in 2 columns at the same time

I'm implementing some kind of upvoting and downvoting system in my web app.
In order to do so, the logic behind requires that the user who upvoted cannot downvote at the same time.
My SQL logic is the following:
CREATE TABLE "CONCEPT_SLIDE"."CONCEPT" (
"concept_id" serial NOT NULL,
"owner_id" integer NOT NULL,
"title" text NOT NULL,
"description" text NOT NULL,
"tags" text [] NULL,
"picture_links" text [] NULL,
"website_link" TEXT NULL,
"state" "CONCEPT_SLIDE"."CONCEPT_STATE" NOT NULL DEFAULT 'ONLY A CONCEPT',
"users_who_upvoted" int [] NULL CONSTRAINT chk_upvote CHECK (ELEMENT) ,
"users_who_downvoted" int [] NULL,
CONSTRAINT PK_table_16 PRIMARY KEY ("concept_id"),
CONSTRAINT FK_43 FOREIGN KEY ("owner_id") REFERENCES "CONCEPT_SLIDE"."USER" ("user_id")
);
I want to make sure that the user ID in users_who_upvoted (or resp. users_who_downvoted) can't be in users_who_downvoted (or resp. users_who_upvoted)
How do I do so?
This seems like an awkward data model. Why not simply have a table with "votes" and a flag on whether it is an upvote or downvote? Your question is not 100% clear about what is being voted on, so I'll just generically call it "entity":
create table entity_votes (
entity_vote_id int generated always as identity,
entity_id int references <whatever is being voted on>,
user_id int references users(id),
up_or_down int check (up_or_down in (-1, 1)),
unique (entity_id, user_id),
created_at timestamp
. . .
);
The unique constraint guarantees that a user votes on an entity only once. Columns such as created_at capture the timing for each vote.
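A small sketch of how such a votes table behaves, run on SQLite as a stand-in for PostgreSQL. The helper names cast_vote and score, and the choice to let a repeat vote replace the earlier one, are assumptions for illustration; the unique (entity_id, user_id) pair and the -1/1 check come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entity_votes (
    entity_id  INTEGER NOT NULL,
    user_id    INTEGER NOT NULL,
    up_or_down INTEGER NOT NULL CHECK (up_or_down IN (-1, 1)),
    UNIQUE (entity_id, user_id)
)""")

def cast_vote(entity_id, user_id, value):
    """Record a vote; a repeat vote by the same user replaces the old one."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE entity_votes SET up_or_down = ? "
            "WHERE entity_id = ? AND user_id = ?",
            (value, entity_id, user_id))
        if cur.rowcount == 0:
            conn.execute("INSERT INTO entity_votes VALUES (?, ?, ?)",
                         (entity_id, user_id, value))

def score(entity_id):
    """Net score: upvotes minus downvotes."""
    return conn.execute(
        "SELECT COALESCE(SUM(up_or_down), 0) FROM entity_votes "
        "WHERE entity_id = ?", (entity_id,)).fetchone()[0]

cast_vote(1, 42, 1)
cast_vote(1, 43, 1)
cast_vote(1, 42, -1)   # user 42 switches from upvote to downvote
print(score(1))        # 0
```

Since each user has at most one row per entity, being in both the "upvoted" and "downvoted" sets at once is impossible by construction, which is the guarantee the question asks for.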
You can use the array overlap operation in a check constraint.
...
CHECK (NOT users_who_upvoted && users_who_downvoted)
...
But I would recommend not using arrays here at all (the same goes for the others in that table) and instead going the traditional relational way with linking tables, foreign keys, etc. Otherwise you've got a recipe for headaches there.

Modelling Post and Flag relationship in SQL

I am modeling the data for a web app I am building. I use a PostgreSQL database.
The app has posts, like SO posts, and flags for posts, like GitHub flags or marks (whatever the correct term is). A post can have only one flag at a time. The number of posts is large and ever increasing, but there are only four or five flags and that number will not grow.
First approach, normalized: I have modeled this part of my data with three tables: two for the entities posts and flags, and one, post_flag, for the relationship. Neither entity table references the other. Every relationship is recorded in post_flag, and each row is just the pair of ids of a post and a flag.
Table structure in that case would be:
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id)
);
CREATE TABLE flags
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
flag character varying(30) NOT NULL -- planned, in progress, fixed
);
CREATE TABLE post_flag
(
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
post_id integer NOT NULL REFERENCES posts (id),
flag_id integer NOT NULL REFERENCES flags (id)
);
To get posts flagged as fixed I have to use:
-- homepage posts- fixed posts tab
SELECT
p.*,
f.flag
FROM posts p
JOIN post_flag p_f
ON p.id = p_f.post_id
JOIN flags f
ON p_f.flag_id = f.id
WHERE f.flag = 'fixed'
ORDER BY p_f.created_at DESC
Second approach; I have two tables posts and flags. The table posts has a flag_id column that references a flag in the flags table.
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id),
flag_id integer DEFAULT NULL REFERENCES flags (id)
);
CREATE TABLE flags
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
flag character varying(30) NOT NULL -- one of planned, in progress, fixed
);
For same data;
-- homepage posts- fixed posts tab
SELECT
p.*,
f.flag
FROM posts p
JOIN flags f
ON p.flag_id = f.id
WHERE f.flag = 'fixed'
ORDER BY p.created_at DESC
Third approach denormalized; I have only one table posts. Posts table has a flag column to store the flag assigned to the post.
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id),
flag character varying(30)
);
Here I would only have for same data;
-- homepage posts- fixed posts tab
SELECT
p.*
FROM posts p
WHERE p.flag = 'fixed'
ORDER BY p.created_at DESC
I wonder if the first approach is overkill in terms of normalization in an RDBMS like PostgreSQL. For a post-comment relationship the first approach would be great, and indeed I use it there. But I have a few small sets of data used as metadata for posts, such as badges, flags, and tags. Note that even in the most normalized form, the first approach, the posts table already holds columns such as product_id that reference another table directly and save a JOIN, which is what my second approach does for flags. Should I use the more denormalized third approach, with a posts table and a flag column in it? Which approach is better in terms of performance, expansion, and maintainability?
Use the second approach.
The first is a many-to-many data structure and you say
A post can have only one flag at a time.
So you would then have to build the business logic into the front end, or set up complex rules, to check that a post never has more than one flag.
The third approach will result in messy data, again unless you implement checks or rules to ensure the flags are not misspelled or new ones added.
Expansion and maintainability are provided in the second approach; it is also self documenting. Worry about performance when it actually becomes a problem, and not before.
Personally I would make the flag_id field in the posts table nullable, which would allow you to model a post without a flag.
Blending two approaches
Assuming your flag names are unique, you can use the flag name as a natural key. Your table structures would then be
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
... other fields
flag character varying(30) REFERENCES flags (flag)
);
CREATE TABLE flags
(
flag character varying(30) NOT NULL PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP
);
You then get the benefit of being able to write queries for flag without having to JOIN to the flags table while having flag names checked by the table reference.
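A quick sketch of the blended design, using SQLite as a stand-in for PostgreSQL. The column types are simplified and the sample rows are assumptions; the point it tries to show is that a misspelled flag is rejected by the foreign key while the "fixed" query needs no JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
CREATE TABLE flags (flag TEXT PRIMARY KEY);
CREATE TABLE posts (
    id    INTEGER PRIMARY KEY,
    title TEXT,
    flag  TEXT REFERENCES flags (flag)    -- NULL means the post has no flag
);
INSERT INTO flags VALUES ('planned'), ('in progress'), ('fixed');
INSERT INTO posts (title, flag) VALUES ('a', 'fixed'), ('b', NULL);
""")

# A misspelled flag name violates the foreign key and is refused.
try:
    conn.execute("INSERT INTO posts (title, flag) VALUES ('c', 'fixd')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# No JOIN is needed to filter by flag name.
fixed = [row[0] for row in
         conn.execute("SELECT title FROM posts WHERE flag = 'fixed'")]

print(rejected, fixed)  # True ['a']
```

The trade-off is that renaming a flag now touches every post that uses it, unless the foreign key is declared with ON UPDATE CASCADE.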

How do I enforce uniqueness against four columns

In SQL Server 2012, I have a 'cross-reference' table containing four columns. The combination of the four columns must be unique. My initial thought was to simply create a primary key containing all four columns, but some research has suggested that this might not be a good idea.
Background to question...
I am trying to implement a tagging service on a legacy web application. Some of the objects that need tagging use a uniqueidentifier as their primary key, whilst others use a simple integer id. I have approached this using a 'two-table' approach. One table contains the tags, whilst the other table provides a reference between the objects to be tagged and the tag table. This table I have named TagList...
CREATE TABLE TagList (
TagId nvarchar(40) NOT NULL,
ReferenceGuid uniqueidentifier NOT NULL,
ReferenceId int NOT NULL,
ObjectType nvarchar(40) NOT NULL
)
For example, to tag an object with a uniqueidentifier primary key with the word 'example', the TagList record would look like this:
TagList (
TagId 'example',
ReferenceGuid '1e93d578-321b-4f86-8b0f-32435d385bd7',
ReferenceId 0,
ObjectType 'Customer'
)
To tag an object with an integer primary key with the word 'example', the TagList record would look like this:
TagList (
TagId 'example',
ReferenceGuid '00000000-0000-0000-0000-000000000000',
ReferenceId 5639,
ObjectType 'Product'
)
In practice, either the combination of TagId and ReferenceGuid must be unique or, if an object with an int primary key is being tagged, the combination of TagId, ReferenceId and ObjectType must be unique.
To simplify(?) things, making the combination of all four columns to be unique would also serve the same functional purpose.
Any advice would be appreciated.
Having a multi column primary key should do the trick
CREATE TABLE TagList (
TagId nvarchar(40) NOT NULL,
ReferenceGuid uniqueidentifier NOT NULL,
ReferenceId int NOT NULL,
ObjectType nvarchar(40) NOT NULL,
CONSTRAINT pk_TagList PRIMARY KEY (TagId,ReferenceGuid,ReferenceId,ObjectType)
)
If you only require a unique constraint, and not a primary key, this can be used:
ALTER TABLE TagList
ADD CONSTRAINT UK_TagList_1 UNIQUE
(
TagId,
ReferenceGuid,
ReferenceId,
ObjectType
)
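To see the four-column constraint at work, here is a small sketch on SQLite (a stand-in for SQL Server, so nvarchar and uniqueidentifier become TEXT). The helper add_tag and the sample values are assumptions for illustration; the constraint name and column list are taken from the statement above.

```python
import sqlite3

NIL_GUID = "00000000-0000-0000-0000-000000000000"

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE TagList (
    TagId         TEXT    NOT NULL,
    ReferenceGuid TEXT    NOT NULL,
    ReferenceId   INTEGER NOT NULL,
    ObjectType    TEXT    NOT NULL,
    CONSTRAINT UK_TagList_1
        UNIQUE (TagId, ReferenceGuid, ReferenceId, ObjectType)
)""")

def add_tag(tag, guid, ref_id, object_type):
    """Insert a tag row; return False if that exact combination exists."""
    try:
        with conn:
            conn.execute("INSERT INTO TagList VALUES (?, ?, ?, ?)",
                         (tag, guid, ref_id, object_type))
        return True
    except sqlite3.IntegrityError:
        return False

first = add_tag("example", NIL_GUID, 5639, "Product")    # new row
repeat = add_tag("example", NIL_GUID, 5639, "Product")   # exact duplicate
other = add_tag("example", NIL_GUID, 5639, "Customer")   # differs in ObjectType
print(first, repeat, other)  # True False True
```

Rows that differ in any one of the four columns are still distinct, so the same integer id can be tagged once per object type.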
I do not have full information about the entities involved, but with the limited visibility I have, I would suggest the following first-cut design:
Create two separate tables, one with TagId and ReferenceGuid and the other with TagId and ReferenceId. I am not sure about ObjectType, though. If ObjectType is not implicit, it can be kept in both of these tables as well. A view containing all the required columns can then be created on top of these tables and used for queries.
This way we get around the wasted space in the current design.
Please comment if this design doesn't solve the problem at hand.

SQL table design for conversation message system

In my web application backed by a MySQL database, I want to offer a messaging system where messages are grouped into conversations between multiple users, but I am stuck with designing a table structure that caters to some of my needs:
Multiple users can participate in one conversation
Users can join, read, leave and delete conversations
An inbox view should generate as few queries as possible
Now, the first requirement for a many-to-many relation can be solved by using a junction table. But it has proven to generate quite a lot of problems when writing the select queries for the inbox view.
The second requirement also proved to be a challenge. If a user leaves a conversation, it should still be available for him to read through old messages. New messages between the remaining users in the conversation should not be shared with the user who left. I first thought about using a tree-like structure for conversations. Each time a user joins or leaves a conversation, a new conversation is created with a reference to the parent conversation and new relations with the remaining participants in the junction table are created.
The third requirement also seems not to be trivial. The inbox view should display a list of conversations with a specific user as participant. Also, additional information should be displayed for each conversation: The names of all current participants, the last reply to the conversation and that reply's author. Think of the inbox view as a list of messages with additional information about the conversation they belong to.
My current approach looks like this:
CREATE TABLE `conversation` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`parentId` int(11) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `parentId` (`parentId`),
CONSTRAINT `conversation_ibfk_1` FOREIGN KEY (`parentId`) REFERENCES `conversation` (`id`)
);
CREATE TABLE `participant` (
`userId` int(11) unsigned NOT NULL,
`conversationId` int(11) unsigned NOT NULL,
`readAt` datetime DEFAULT NULL,
PRIMARY KEY (`userId`,`conversationId`),
KEY `conversationId` (`conversationId`),
CONSTRAINT `participant_ibfk_2` FOREIGN KEY (`conversationId`) REFERENCES `conversation` (`id`),
CONSTRAINT `participant_ibfk_1` FOREIGN KEY (`userId`) REFERENCES `user` (`id`)
);
CREATE TABLE `reply` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`conversationId` int(11) unsigned NOT NULL,
`userId` int(11) unsigned NOT NULL,
`text` text NOT NULL,
PRIMARY KEY (`id`),
KEY `conversationId` (`conversationId`),
KEY `userId` (`userId`),
CONSTRAINT `reply_ibfk_2` FOREIGN KEY (`userId`) REFERENCES `user` (`id`),
CONSTRAINT `reply_ibfk_1` FOREIGN KEY (`conversationId`) REFERENCES `conversation` (`id`)
);
CREATE TABLE `user` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`username` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
);
I have hit a wall here and can't find a solution that caters to all my needs. Maybe someone here can give me some advice on how to approach this database design.
USERS ACTIVITY
I still would make a junction table, but with indices on tuples.
USER_CONV
---------------
id_user
id_conversation
active_flag
begin_flag - flag indicating if id_first_reply == id_first_visible_reply
id_first_reply - first reply in the conversation
id_first_visible_reply - first visible for a given user
id_last_reply - last visible for a given user, used if active_flag = false, otherwise NULL
index (id_user, id_conversation, active_flag)
USER_REPLY
---------------
id_user
id_conversation
id_reply
index (id_user, id_conversation, id_reply)
The USER_CONV table should make it less painful to create the inbox. You will still need to join it with USER_REPLY, and then with REPLY.
INBOX
If you want to make it fast, create a kind of CACHED_INBOX, and a table with MOST_RECENT_USER_ACTIVITY. The CACHED_INBOX table will contain all the inbox data, so you do not need any joins to fetch it; you only have to do a UNION with the data from MOST_RECENT_USER_ACTIVITY. MOST_RECENT_USER_ACTIVITY should be quite small (and therefore fast), and CACHED_INBOX will be 'static'. Then once a day, or every several days, at a time when the server is less likely to be in use, do a batch CACHED_INBOX update. It will work unless your users decide to stay in conversations forever.
OPTIMISATION
Of course you will need to use EXPLAIN a bit, to see how the query optimizer uses indexing (if it uses it at all). Maybe you will need a table like ACTIVE_CONVERSATIONS, which will not have any indexing, that would hit performance a bit. But then you would do some batch updates to USER_CONV, which would have more extensive indexing. You have to experiment a bit and consider what behaviour you expect from your users; that should be one of the architecture drivers.
NOTE: regarding indexes, there is a good website devoted to the topic of indexing: use-the-index-luke.com
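As a rough illustration of checking index use, the sketch below builds a cut-down USER_CONV with the suggested composite index and asks the database for its plan. It runs on SQLite, whose EXPLAIN QUERY PLAN is only an analogue of MySQL's EXPLAIN, and the index name idx_user_conv and the helper query_plan are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_conv (
    id_user         INTEGER NOT NULL,
    id_conversation INTEGER NOT NULL,
    active_flag     INTEGER NOT NULL
);
CREATE INDEX idx_user_conv
    ON user_conv (id_user, id_conversation, active_flag);
""")

def query_plan(sql, params=()):
    """Return the human-readable plan steps SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the description

plan = query_plan(
    "SELECT id_conversation FROM user_conv "
    "WHERE id_user = ? AND active_flag = 1", (42,))
for step in plan:
    print(step)
```

If the plan names the index, the lookup by id_user is using it; a plan that reports a full table scan would be the signal to revisit the index definition or the query.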

Multiple yet mutually exclusive foreign keys - is this the way to go?

I have three tables: Users, Companies and Websites.
Users and companies have websites, and thus each user record has a foreign key into the Websites table. Also, each company record has a foreign key into the Websites table.
Now I want to include foreign keys in the Websites table back into their respective "parent" records. How do I do that? Should I have two foreign keys in each website record, with one of them always NULL? Or is there another way to go?
If we look into the model here, we will see the following:
A user is related to exactly one website
A company is related to exactly one website
A website is related to exactly one user or company
The third relation implies existence of a "user or company" entity whose PRIMARY KEY should be stored somewhere.
To store it you need to create a table that would store a PRIMARY KEY of a website owner entity. This table can also store attributes common to a user and a company.
Since it's a one-to-one relation, website attributes can be stored in this table too.
The attributes not shared by users and companies should be stored in the separate table.
To force the correct relationships, you need to make the PRIMARY KEY of the website owner table composite, with the owner type as a part of it, and force the correct type in the child tables with a CHECK constraint:
CREATE TABLE website_owner (
type INT NOT NULL,
id INT NOT NULL,
website_attributes,
common_attributes,
CHECK (type IN (1, 2)), -- 1 for user, 2 for company
PRIMARY KEY (type, id)
)
CREATE TABLE user (
type INT NOT NULL,
id INT NOT NULL PRIMARY KEY,
user_attributes,
CHECK (type = 1),
FOREIGN KEY (type, id) REFERENCES website_owner
)
CREATE TABLE company (
type INT NOT NULL,
id INT NOT NULL PRIMARY KEY,
company_attributes,
CHECK (type = 2),
FOREIGN KEY (type, id) REFERENCES website_owner
)
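The sketch below exercises this pattern on SQLite, which enforces both CHECK constraints and composite foreign keys once foreign keys are switched on. The url column, the sample rows, and the helper accepted are assumptions standing in for the elided attribute lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
CREATE TABLE website_owner (
    type INTEGER NOT NULL CHECK (type IN (1, 2)),  -- 1 = user, 2 = company
    id   INTEGER NOT NULL,
    url  TEXT,                                     -- stand-in website attribute
    PRIMARY KEY (type, id)
);
CREATE TABLE user (
    type INTEGER NOT NULL CHECK (type = 1),
    id   INTEGER NOT NULL PRIMARY KEY,
    FOREIGN KEY (type, id) REFERENCES website_owner (type, id)
);
CREATE TABLE company (
    type INTEGER NOT NULL CHECK (type = 2),
    id   INTEGER NOT NULL PRIMARY KEY,
    FOREIGN KEY (type, id) REFERENCES website_owner (type, id)
);
INSERT INTO website_owner VALUES (1, 10, 'https://example.com');
INSERT INTO user VALUES (1, 10);
""")

def accepted(sql):
    """Run one INSERT and report whether the constraints let it through."""
    try:
        with conn:
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# No (2, 10) owner row exists yet, so the composite FK refuses this company.
no_owner = accepted("INSERT INTO company VALUES (2, 10)")
# A company row may not claim a user-typed owner: CHECK (type = 2) fails.
wrong_type = accepted("INSERT INTO company VALUES (1, 10)")
# Once a company-typed owner row exists, the company row is allowed.
accepted("INSERT INTO website_owner VALUES (2, 10, 'https://example.org')")
with_owner = accepted("INSERT INTO company VALUES (2, 10)")

print(no_owner, wrong_type, with_owner)  # False False True
```

The type column in the child tables looks redundant, but it is what lets the foreign key tie a user row only to a user-typed owner row.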
You don't need a parent column; you can look up the parents with a simple select (or join) on the users and companies tables. If you want to know whether a website belongs to a user or a company, I suggest using a boolean column in your websites table.
Why do you need a foreign key from website to user/company at all? The principle of not duplicating data would suggest it might be better to scan the user/company tables for a matching website id. If you really need to you could always store a flag in the website table that denotes whether a given website record is for a user or a company, and then scan the appropriate table.
The problem I have with the accepted answer (by Quassnoi) is that the object relationships are the wrong way around: company is not a sub-type of a website owner; we had companies before we had websites and we can have companies who are website owners. Also, it seems to me that website ownership is a relationship between a website and either a person or a company i.e. we should have a relationship table (or two) in the schema. It may be an acceptable approach to keep personal website ownership separate from corporate website ownership and only bring them together when required e.g. via VIEWs:
CREATE TABLE People
(
person_id CHAR(9) NOT NULL UNIQUE, -- external identifier
person_name VARCHAR(100) NOT NULL
);
CREATE TABLE Companies
(
company_id CHAR(6) NOT NULL UNIQUE, -- external identifier
company_name VARCHAR(255) NOT NULL
);
CREATE TABLE Websites
(
url CHAR(255) NOT NULL UNIQUE
);
CREATE TABLE PersonalWebsiteOwnership
(
person_id CHAR(9) NOT NULL UNIQUE
REFERENCES People ( person_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE TABLE CorporateWebsiteOwnership
(
company_id CHAR(6) NOT NULL UNIQUE
REFERENCES Companies( company_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE VIEW WebsiteOwnership AS
SELECT url, company_name AS website_owner_name
FROM CorporateWebsiteOwnership
NATURAL JOIN Companies
UNION
SELECT url, person_name AS website_owner_name
FROM PersonalWebsiteOwnership
NATURAL JOIN People;
The problem with the above is there is no way of using database constraints to enforce the rule that a website is either owned by a person or a company but not both.
If we can assume the DBMS enforces check constraints (as the accepted answer does), then we can exploit the fact that a (human) person and a company are both legal persons and employ a super-type table (LegalPersons) while still retaining the relationship table approach (WebsiteOwnership), this time using the VIEWs to separate personal website ownership from corporate website ownership, and this time with strongly typed attributes:
CREATE TABLE LegalPersons
(
legal_person_id INT NOT NULL UNIQUE, -- internal artificial identifier
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type IN ( 'Company', 'Person' ) ),
UNIQUE ( legal_person_type, legal_person_id )
);
CREATE TABLE People
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type = 'Person' ),
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
person_id CHAR(9) NOT NULL UNIQUE, -- external identifier
person_name VARCHAR(100) NOT NULL
);
CREATE TABLE Companies
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type = 'Company' ),
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
company_id CHAR(6) NOT NULL UNIQUE, -- external identifier
company_name VARCHAR(255) NOT NULL
);
CREATE TABLE WebsiteOwnership
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL,
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE VIEW CorporateWebsiteOwnership AS
SELECT url, company_name
FROM WebsiteOwnership
NATURAL JOIN Companies;
CREATE VIEW PersonalWebsiteOwnership AS
SELECT url, person_name
FROM WebsiteOwnership
NATURAL JOIN People;
What we need are new DBMS features for 'distributed foreign keys' ("For each row in this table there must be exactly one row in one of these tables") and 'multiple assignment' to allow the data to be added into tables thus constrained in a single SQL statement. Sadly we are a long way from getting such features!
First of all, do you really need this bi-directional link? It is a good practice to avoid it unless absolutely needed.
I understand it that you wish to know whether the site belongs to a user or to a company. You can achieve that by having a simple boolean field in the Website table - [BelongsToUser]. If true, then you look up a user, if false - you look up a company.
A bit late, but all the existing answers seemed to fall somewhat short of the mark:
Owner to website is a 1:Many relation
Website to owner is a 1:1 relation
Users and Companies tables should not have a foreign key into the Websites table
None of the website data, common to users and companies or not, should be in the Users or Companies tables
None of the owner's information, common or not, should be in the Websites table
MySQL (before version 8.0.16) ignores, silently, CHECK constraints on tables (no enforcement of referential integrity)
The DBMS ought to handle the 'relation' logic, not the application using the database
Some of this is recognized in the answer from onedaywhen, yet that answer still missed the opportunity to make MySQL do the heavy lifting and enforce the referential integrity.
A website can only have one owner, legally, anyway. A person, or company, can have any number of websites, including none. A link in the database from owner to website can only be 1:1 at any level of normalization. In reality the relation is 1:Many, and would require having multiple table entries for each owner that happens to own more than one website. A link from website to owner is 1:1 in both database terms and in reality. Having the link from website to owner represents the model better. With an index in the website table, doing the 1:Many lookup for a given owner becomes reasonably efficient.
The CHECK attribute in SQL would be an excellent solution, if MySQL didn't happen to silently ignore it.
MySQL Docs 13.1.20 CREATE TABLE Syntax
The CHECK clause is parsed but ignored by all storage engines.
MySQL's functionality does offer two solutions as work-arounds to implement the behavior of CHECK and keep the referential integrity of the data. Triggers with stored procedures is one, and works well with all manner of constraints. Easier to implement, though less versatile, is using a VIEW with a WITH CHECK OPTION clause, which MySQL will implement.
MySQL Docs 24.5.4 The View WITH CHECK OPTION Clause
The WITH CHECK OPTION clause can be given for an updatable view to prevent inserts to rows for which the WHERE clause in the select_statement is not true. It also prevents updates to rows for which the WHERE clause is true but the update would cause it to be not true (in other words, it prevents visible rows from being updated to nonvisible rows).
The MySQLTUTORIAL site gives a good example of both options in their Introduction to the SQL CHECK constraint tutorial. (You have to think around the typos, but good otherwise.)
Having found this question while trying to resolve a similar mutually exclusive foreign key split and developing a solution, with hints generated by the answers, it seems only proper to share my solution in return.
Recommended Solution
For the minimum impact to the existing schema, and the application accessing the data, retain the Users and Companies tables as they are. Rename the Websites table and replace it with a VIEW named Websites which the application can continue to access. Except when dealing with the ownership information, all the old queries to Websites should still work. So:
The setup
-- Keep the `Users` table about "users"
CREATE TABLE `Users` (
`id` SERIAL PRIMARY KEY,
`name` VARCHAR(180),
-- user_attributes
);
-- Keep the `Companies` table about "companies"
CREATE TABLE `Companies` (
`id` SERIAL PRIMARY KEY,
`name` VARCHAR(180),
-- company_attributes
);
-- Attach ownership information about the website to the website's record in the `Websites` table, renamed to `WebsitesData`
CREATE TABLE `WebsitesData` (
`id` SERIAL PRIMARY KEY,
`name` VARCHAR(255),
`is_personal` BOOL,
`owner_user` BIGINT UNSIGNED DEFAULT NULL,
`owner_company` BIGINT UNSIGNED DEFAULT NULL,
website_attributes,
FOREIGN KEY `WebsiteOwner_User` (`owner_user`)
REFERENCES `Users` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
FOREIGN KEY `WebsiteOwner_Company` (`owner_company`)
REFERENCES `Companies` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
);
-- Create a new `VIEW` with the original name of `Websites` as the gateway to the website records which can enforce the constraints you need
CREATE VIEW `Websites` AS
SELECT * FROM `WebsitesData` WHERE
(`is_personal`=TRUE AND `owner_user` IS NOT NULL AND `owner_company` IS NULL) OR
(`is_personal`=FALSE AND `owner_user` IS NULL AND `owner_company` IS NOT NULL)
WITH CHECK OPTION;
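To make the effect of the view's WHERE clause concrete, here is a hedged sketch of writes that the view should refuse. The `$userID` and `$companyID` placeholders follow the convention used in the rest of this answer, the site name is made up, and the exact error text depends on the server (MySQL reports it as a CHECK OPTION failure).

```sql
-- Rejected: marked personal, but no user is given (owner_user stays NULL).
INSERT INTO `Websites` SET
  `name`='example.org',
  `is_personal`=TRUE;

-- Rejected: both owners are set at once.
INSERT INTO `Websites` SET
  `name`='example.org',
  `is_personal`=TRUE,
  `owner_user`=$userID,
  `owner_company`=$companyID;

-- Not covered by the view: writing to the base table directly skips the check,
-- and the resulting row simply never shows up in `Websites`.
INSERT INTO `WebsitesData` SET
  `name`='example.org',
  `is_personal`=TRUE;
```

The last statement is why the application should only ever write through the `Websites` view; the constraint lives in the view, not in `WebsitesData`.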
Usage
-- Use the Websites VIEW for the INSERT, UPDATE, and SELECT operations as you normally would and leave the WebsitesData table in the background.
INSERT INTO `Websites` SET
`is_personal`=TRUE,
`owner_user`=$userID;
INSERT INTO `Websites` SET
`is_personal`=FALSE,
`owner_company`=$companyID;
-- Or, using different field lists based on the type of owner
INSERT INTO `Websites` (`is_personal`,`owner_user`, ...)
VALUES (TRUE, $userID, ...);
INSERT INTO `Websites` (`is_personal`,`owner_company`, ...)
VALUES (FALSE, $companyID, ...);
-- Or, using a common field list, and placing NULL in the proper place
INSERT INTO `Websites` (`is_personal`,`owner_user`,`owner_company`,...)
VALUES (TRUE, $userID, NULL, ...);
INSERT INTO `Websites` (`is_personal`,`owner_user`,`owner_company`,...)
VALUES (FALSE, NULL, $companyID, ...);
-- Change the company that owns a website
-- Will ERROR if the site was owned by a User.
UPDATE `Websites` SET `owner_company`=$new_companyID
WHERE `id`=$siteID;
-- Force change the ownership from a User to a Company
UPDATE `Websites` SET
`owner_company`=$new_companyID,
`owner_user`=NULL,
`is_personal`=FALSE
WHERE `id`=$siteID;
-- Force change the ownership from a Company to a User
UPDATE `Websites` SET
`owner_user`=$new_userID,
`owner_company`=NULL,
`is_personal`=TRUE
WHERE `id`=$siteID;
-- Selecting the owner of a site without needing to know if it is personal or not
(SELECT `Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE AND `Websites`.`id`=$siteID)
UNION
(SELECT `Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE AND `Websites`.`id`=$siteID);
-- Selecting the sites owned by a User, by user ID
SELECT `name` FROM `Websites`
WHERE `is_personal`=TRUE AND `owner_user`=$userID;
-- ...or by user name
SELECT `Websites`.`name`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE AND `Users`.`name`='$user_name';
-- Selecting the sites owned by a Company, by company ID
SELECT `name` FROM `Websites` WHERE `is_personal`=FALSE AND `owner_company`=$companyID;
-- ...or by company name
SELECT `Websites`.`name`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE AND `Companies`.`name`='$company_name';
-- Listing all websites and their owners
(SELECT `Websites`.`name` AS `Website`,`Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE)
UNION ALL
(SELECT `Websites`.`name` AS `Website`,`Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE)
ORDER BY Website, Owner;
-- Listing all users or companies which own at least one website
-- (UNION DISTINCT removes the duplicates, so no GROUP BY is needed)
(SELECT `Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE)
UNION DISTINCT
(SELECT `Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE)
ORDER BY `Owner`;
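The listing queries above use a UNION of two joins. As an alternative sketch, and assuming the view's guarantee that exactly one of `owner_user` and `owner_company` is set on every visible row, the same website and owner list can be produced in a single SELECT with two LEFT JOINs and COALESCE. This is not part of the original solution, just another way to read the same data.

```sql
-- Each website row matches at most one of the two LEFT JOINs,
-- so COALESCE returns whichever owner name is present.
SELECT `Websites`.`name` AS `Website`,
       COALESCE(`Users`.`name`, `Companies`.`name`) AS `Owner`
FROM `Websites`
LEFT JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
LEFT JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
ORDER BY `Website`, `Owner`;
```

This form does not need the `is_personal` filter at all, because the NULL owner column cannot match any row in its joined table.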
Normalization Level Up
As a technical note on normalization, the ownership information could be factored out of the Websites table into a new table that holds the ownership data, including the is_personal column.
-- Create the ownership table first so that `Websites` can reference it
CREATE TABLE `WebOwnersData` (
`id` SERIAL PRIMARY KEY,
`is_personal` BOOL,
`user` BIGINT UNSIGNED DEFAULT NULL,
`company` BIGINT UNSIGNED DEFAULT NULL,
FOREIGN KEY `WebOwners_User` (`user`)
REFERENCES `Users` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
FOREIGN KEY `WebOwners_Company` (`company`)
REFERENCES `Companies` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
);
CREATE VIEW `WebOwners` AS
SELECT * FROM `WebOwnersData` WHERE
(`is_personal`=TRUE AND `user` IS NOT NULL AND `company` IS NULL) OR
(`is_personal`=FALSE AND `user` IS NULL AND `company` IS NOT NULL)
WITH CHECK OPTION;
-- A foreign key must reference a base table, not a view, so it points at `WebOwnersData`
CREATE TABLE `Websites` (
`id` SERIAL PRIMARY KEY,
`name` VARCHAR(255),
`owner` BIGINT UNSIGNED DEFAULT NULL,
-- website_attributes,
FOREIGN KEY `Website_Owner` (`owner`)
REFERENCES `WebOwnersData` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
);
I believe, however, that the VIEW in the recommended solution, with its constraints, already prevents the anomalies that normalization aims to remove, and that the extra ownership table adds complexity this situation does not need. Normalization is always a trade-off anyway.