SQL table design for conversation message system - sql

In my web application backed by a MySQL database, I want to offer a messaging system where messages are grouped into conversations between multiple users, but I am stuck with designing a table structure that caters to some of my needs:
Multiple users can participate in one conversation
Users can join, read, leave and delete conversations
An inbox view should generate as few queries as possible
Now, the first requirement for a many-to-many relation can be solved by using a junction table. But it has proven to generate quite a lot of problems when writing the select queries for the inbox view.
The second requirement also proved to be a challenge. If a user leaves a conversation, it should still be available for him to read through old messages. New messages between the remaining users in the conversation should not be shared with the user who left. I first thought about using a tree-like structure for conversations. Each time a user joins or leaves a conversation, a new conversation is created with a reference to the parent conversation and new relations with the remaining participants in the junction table are created.
The third requirement also seems not to be trivial. The inbox view should display a list of conversations with a specific user as participant. Also, additional information should be displayed for each conversation: The names of all current participants, the last reply to the conversation and that reply's author. Think of the inbox view as a list of messages with additional information about the conversation they belong to.
My current approach looks like this:
CREATE TABLE `conversation` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`parentId` int(11) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `parentId` (`parentId`),
CONSTRAINT `conversation_ibfk_1` FOREIGN KEY (`parentId`) REFERENCES `conversation` (`id`)
);
CREATE TABLE `participant` (
`userId` int(11) unsigned NOT NULL,
`conversationId` int(11) unsigned NOT NULL,
`readAt` datetime DEFAULT NULL,
PRIMARY KEY (`userId`,`conversationId`),
KEY `conversationId` (`conversationId`),
CONSTRAINT `participant_ibfk_2` FOREIGN KEY (`conversationId`) REFERENCES `conversation` (`id`),
CONSTRAINT `participant_ibfk_1` FOREIGN KEY (`userId`) REFERENCES `user` (`id`)
);
CREATE TABLE `reply` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`conversationId` int(11) unsigned NOT NULL,
`userId` int(11) unsigned NOT NULL,
`text` text NOT NULL,
PRIMARY KEY (`id`),
KEY `conversationId` (`conversationId`),
KEY `userId` (`userId`),
CONSTRAINT `reply_ibfk_2` FOREIGN KEY (`userId`) REFERENCES `user` (`id`),
CONSTRAINT `reply_ibfk_1` FOREIGN KEY (`conversationId`) REFERENCES `conversation` (`id`)
);
CREATE TABLE `user` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`username` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
);
I have hit a wall here and can't find a solution that caters to all my needs. Maybe someone here can give me some advice on how to approach this database design.

USERS ACTIVITY
I still would make a junction table, but with indices on tuples.
USER_CONV
---------------
id_user
id_conversation
active_flag
begin_flag - flag indicating if id_first_reply == id_first_visible_reply
id_first_reply - first reply in the conversation
id_first_visible_reply - first visible for a given user
id_last_reply - last visible for a given user, used if active_flag = false, otherwise NULL
index (id_user, id_conversation, active_flag)
USER_REPLY
---------------
id_user
id_conversation
id_reply
index (id_user, id_conversation, id_reply)
The USER_CONV table should make it less painful to build the inbox. You will still need to join it with USER_REPLY, and then with REPLY.
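As a rough illustration of how USER_CONV could drive the inbox, here is a minimal sketch. It runs the SQL through Python's built-in sqlite3 module purely so it can be executed as-is; the dialect is SQLite, not MySQL. The column subset, the sample rows and the `inbox` helper are assumptions made for the example, and the USER_REPLY table is left out for brevity. The idea is that a user who has left sees the reply cached in id_last_reply, while an active user sees the newest reply of the conversation.

```python
import sqlite3

# Hypothetical, simplified versions of the tables discussed above (SQLite dialect).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user  (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE reply (id INTEGER PRIMARY KEY,
                    conversationId INTEGER NOT NULL,
                    userId INTEGER NOT NULL REFERENCES user(id),
                    text TEXT NOT NULL);
CREATE TABLE user_conv (
    id_user         INTEGER NOT NULL REFERENCES user(id),
    id_conversation INTEGER NOT NULL,
    active_flag     INTEGER NOT NULL DEFAULT 1,
    id_last_reply   INTEGER,           -- set only once the user has left
    PRIMARY KEY (id_user, id_conversation)
);
INSERT INTO user VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO reply VALUES (1, 10, 1, 'hi'), (2, 10, 2, 'hello'), (3, 10, 3, 'bob left');
-- bob left conversation 10 after reply 2; alice and carol are still active
INSERT INTO user_conv VALUES (1, 10, 1, NULL), (2, 10, 0, 2), (3, 10, 1, NULL);
""")

INBOX_SQL = """
SELECT uc.id_conversation, r.text, u.username
FROM user_conv AS uc
JOIN reply AS r
  ON r.id = COALESCE(uc.id_last_reply,
                     (SELECT MAX(id) FROM reply
                      WHERE conversationId = uc.id_conversation))
JOIN user AS u ON u.id = r.userId
WHERE uc.id_user = ?
ORDER BY r.id DESC
"""

def inbox(user_id):
    """Return (conversation id, last visible reply text, author) per conversation."""
    return conn.execute(INBOX_SQL, (user_id,)).fetchall()
```

With the sample rows, `inbox(2)` shows bob the reply he last saw before leaving, while `inbox(1)` shows alice the newest reply. Whether the correlated subquery is fast enough on real data is something EXPLAIN would have to confirm.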
INBOX
If you want to make it fast, create a kind of CACHED_INBOX and a table with MOST_RECENT_USER_ACTIVITY. The CACHED_INBOX table holds all the inbox data in denormalized form, so you need no joins to fetch it; you only have to UNION it with the data from MOST_RECENT_USER_ACTIVITY. MOST_RECENT_USER_ACTIVITY should stay quite small (and therefore fast), while CACHED_INBOX is 'static'. Then once a day, or every several days, at a time when the server is least likely to be busy, run a batch update of CACHED_INBOX. This works unless your users decide to stay in conversations forever.
OPTIMISATION
Of course you will need to use EXPLAIN a bit to see how the query optimizer uses indexing (if it uses it at all). Maybe you will need a table like ACTIVE_CONVERSATIONS, which will not have any indexes, since maintaining them would hurt performance a bit. You would then do some batch updates to USER_CONV, which would have more extensive indexing. You have to experiment a bit and see what you expect from your users' behaviour; that should be one of the architecture drivers.
NOTE: regarding indexes, there is a good website devoted to the topic of indexing use-the-index-luke.com

Related

correct db structure for subjects management in school management system

I have the following db structure for a school information system:
With this db structure, I have to enter the subjects of all levels (class1, class2 etc.) for each academic year. The subjects of class 1 will remain the same for most academic years (e.g. class1 subjects will be the same for 2010, 2011, 2012). So the data entry personnel will find it cumbersome to enter the subjects for each level per academic year even though the subjects remain the same.
There is a possibility that the course may change for each level at some interval of time, and the subjects for various levels may change.
How can I structure my DB so that the data entry personnel do not have to enter subjects every academic year if the course does not change, and can make the necessary changes when the subjects change for a certain level?
For example: class 1 subjects have not changed for 2013, so the data entry personnel do not have to enter the subjects again. Class 2 subjects changed for 2013, so the data entry personnel enter the subjects.
Also, please suggest whether the current structure is the best fit.
If I understand correctly, then the first idea is:
Delete table tbl_academic_years;
Create new table tbl_subjects for subjects;
Create new table tbl_relations.
Select:
After that, to select the current information, all you need is a select from the relations table with deleted = 0.
SELECT
l.name,
s.name
FROM
tbl_relations r
INNER JOIN tbl_levels l
ON r.level_id = l.id
INNER JOIN tbl_subjects s
ON r.subject_id = s.id
WHERE
r.deleted = 0
Add:
If you need to add a new subject for a level, first add the subject to tbl_subjects, then add a record with the new subject id to tbl_relations.
Delete:
If you need to delete subjects for levels, set the deleted field to 1 (the select above only returns rows with deleted = 0) or delete the rows in tbl_relations.
If the subjects have not changed, do nothing.
I think the best structure for your database is the following script, generated in MySQL:
CREATE TABLE `tbl_levels` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255),
PRIMARY KEY (`id`)
);
CREATE TABLE `tbl_subjects` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255),
PRIMARY KEY (`id`)
);
CREATE TABLE `tbl_relations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`level_id` int(11),
`subject_id` int(11),
`deleted` tinyint(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `level_id` (`level_id`),
KEY `subject_id` (`subject_id`),
CONSTRAINT `tbl_relations_ibfk_1` FOREIGN KEY (`level_id`) REFERENCES `tbl_levels` (`id`),
CONSTRAINT `tbl_relations_ibfk_2` FOREIGN KEY (`subject_id`) REFERENCES `tbl_subjects` (`id`)
);
Logical structure correctness should prevail over usage convenience. Since your database design fits the logical business requirements for the software to function correctly, data entry convenience could be handled through an easy-to-use GUI where the common information is provided as a default for the user to approve or update (instead of entering it from scratch), or via a setup program that runs as part of the configuration process.
In your particular case, this process is not a daily routine. It is probably executed at year start or first use of the software, so I suggest you don't change the database design if it is correct for your business.
The simplest solution is to change the table tbl_subjects to have "from_year" and "to_year" columns. A class is taught at a given level at a given point in time if there's a record where "from_year" is smaller than the date, and "to_year" is either greater than the date or null.
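A minimal sketch of that validity-interval lookup follows, run through Python's built-in sqlite3 module so that it is executable (SQLite dialect, not MySQL). The `level_id` column, the sample rows, the `subjects_for` helper and the choice of inclusive bounds (a subject counts as taught in both its first and its last year) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout: one row per (level, subject) with a validity interval.
CREATE TABLE tbl_subjects (
    id        INTEGER PRIMARY KEY,
    level_id  INTEGER NOT NULL,
    name      TEXT    NOT NULL,
    from_year INTEGER NOT NULL,
    to_year   INTEGER             -- NULL means "still taught"
);
INSERT INTO tbl_subjects (level_id, name, from_year, to_year) VALUES
    (1, 'Maths',   2010, NULL),
    (1, 'English', 2010, NULL),
    (2, 'History', 2010, 2012),
    (2, 'Civics',  2013, NULL);
""")

def subjects_for(level_id, year):
    """Subjects taught at a level in a given year (inclusive bounds assumed)."""
    rows = conn.execute(
        """SELECT name FROM tbl_subjects
           WHERE level_id = ?
             AND from_year <= ?
             AND (to_year IS NULL OR to_year >= ?)
           ORDER BY name""",
        (level_id, year, year),
    )
    return [name for (name,) in rows]
```

With this layout nothing has to be entered for a year in which the subjects stay the same; a change is recorded by closing the old row (setting to_year) and adding a new one.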

Handling a Many-to-Many Dimension when all dimensional values have 100% importance

I'll at least try to keep this succinct.
Let's suppose we're tracking the balances of accounts over time. So our fact table will have columns such as...
Account Balance Fact Table
(FK)AccountID
(FK)DateID
...
Balance
...
Obviously you have an Account Dimension Table and a Date Dimension Table. So now we can easily filter on Accounts or Dates (or date ranges, etc.).
But here's the kicker... Accounts can belong to Groups -- any number of Groups at a given Date. Groups are simply logical abstractions, and they have no tangible meaning aside from reporting purposes. An Account being in 0, 1, or 17 groups doesn't affect its Balance in any way. For example, AccountID 1 may be in Groups 38, 76, 104, and 159. Account 2 may be in Group 1 (which has a Group Description of "Ungrouped"). Account 3 may be in seventeen groups (real example).
As an added bonus, our users are completely non-technical. They don't know SQL, they have no experience with relational databases, and have historically done all of their work in a convoluted Excel solution. Right now we're building a dimensional model they can slice and filter with PowerPivot, though these Account Groups are threatening to turn an otherwise ruthlessly simple Star Schema into something complex enough that the users will balk and return to their current spaghetti solution.
So let's look at our options...
Boolean Method
The Boolean method is not feasible. We have about 570,000 different accounts, but more importantly, 26,000 different groups. This would also be a devil for end-users to filter, since they're non-technical and are relying on very simple tools to get this done.
Multiple Column Method
In theory this could work, however, we do have some accounts that belong to 17 groups. Again, the groups are really just logical groups -- they have no meaning, but they are required by the business for reporting purposes. Having end-users filter out groups from 17 different columns isn't going to go over well in user-acceptance, and would likely result in users refusing to use the solution (and I don't blame them).
Bridge Table
This could work, but we do have 26,000 different groups. I'm not finding this to be user-friendly.
Since I'm not liking my options, I can only assume there's a better way other than snowflaking... unless snowflaking IS the only way. If someone could lend a hand and explain their rationale it'd be appreciated.
UPDATE: For clarification, here is an example I think everyone here can relate to: imagine you can list keyword skills on a resume. They all relate to the same person, but you can have any number of skills. The skills don't affect any of the individual measures on a resume -- i.e. 'C++' isn't more valuable than 'C#' -- and you can't put all the resume/skill combinations in the fact table or you'd end up double counting (or well more than double ;) ).
I think the best I'm going to be able to do here is to create an outrigger table for groups. I'm not a fan of it, but I think it's the only real option I have.
So now we have...
Account Balance Fact Table
(FK)AccountID
(FK)DateID
...
Balance
...
Account Dimension
(PK)AccountID
Account Name
...
(FK)Account Group Key
Account Group Outrigger
(PK)AccountGroupID
(PK)AccountID
Account Group Name
I would say you've got to start from the interface. How would users like to do their filtering in an ideal world?
I think I would end up going for a bridge or factless fact table or something like that. Perhaps a surrogate key on the fact table and a many-many link table from that to group membership.
It's definitely tough - and the interface and use cases have to be made workable, so I'd start from there. Perhaps something will shake out of how they do this reporting - like equivalence classes in the groups or some way they partition the account space. Maybe there is a hierarchy or organization to the groups which makes it more manageable and may inform a simpler design.
If I correctly understood your question this should be okay:
CREATE TABLE IF NOT EXISTS `accounts` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1 ;
CREATE TABLE IF NOT EXISTS `accounts_groups` (
`account_id` int(11) NOT NULL,
`group_id` int(11) NOT NULL,
`start_date` date NOT NULL,
`end_date` date DEFAULT NULL,
UNIQUE KEY `account_group` (`account_id`,`group_id`,`start_date`),
KEY `group_id` (`group_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS `account_balances` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`account_id` int(11) NOT NULL,
`date` date NOT NULL,
`balance` decimal(11,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `account_id` (`account_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1 ;
CREATE TABLE IF NOT EXISTS `groups` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1 ;
ALTER TABLE `accounts_groups`
ADD CONSTRAINT `accounts_groups_ibfk_1` FOREIGN KEY (`account_id`) REFERENCES `accounts` (`id`),
ADD CONSTRAINT `accounts_groups_ibfk_2` FOREIGN KEY (`group_id`) REFERENCES `groups` (`id`);
ALTER TABLE `account_balances`
ADD CONSTRAINT `account_balances_ibfk_1` FOREIGN KEY (`account_id`) REFERENCES `accounts` (`id`);
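To show how such a dated bridge table might be queried, here is a minimal sketch that sums the balances of one group on one date. It goes through Python's built-in sqlite3 module so that it can be run directly (SQLite dialect, with dates stored as ISO text); the sample rows and the `group_balance` helper are assumptions made for the example. Filtering on a single group keeps each account to at most one matching membership row, which is what avoids the double counting mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts_groups (
    account_id INTEGER NOT NULL,
    group_id   INTEGER NOT NULL,
    start_date TEXT    NOT NULL,   -- ISO dates compare correctly as text
    end_date   TEXT,               -- NULL = membership still open
    UNIQUE (account_id, group_id, start_date)
);
CREATE TABLE account_balances (
    account_id INTEGER NOT NULL,
    date       TEXT    NOT NULL,
    balance    NUMERIC NOT NULL
);
INSERT INTO accounts_groups VALUES
    (1, 38, '2013-01-01', NULL),
    (1, 76, '2013-01-01', '2013-06-30'),
    (2, 38, '2013-01-01', NULL);
INSERT INTO account_balances VALUES
    (1, '2013-03-31', 100), (2, '2013-03-31', 50),
    (1, '2013-09-30', 120), (2, '2013-09-30', 70);
""")

def group_balance(group_id, on_date):
    """Total balance of the accounts that belonged to group_id on on_date."""
    row = conn.execute("""
        SELECT COALESCE(SUM(b.balance), 0)
        FROM account_balances AS b
        JOIN accounts_groups  AS ag
          ON ag.account_id = b.account_id
         AND ag.start_date <= b.date
         AND (ag.end_date IS NULL OR ag.end_date >= b.date)
        WHERE ag.group_id = ? AND b.date = ?
    """, (group_id, on_date)).fetchone()
    return row[0]
```

Account 1 leaves group 76 at the end of June in the sample data, so its September balance no longer counts towards that group. This assumes the membership intervals of one account in one group never overlap.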

Multiple yet mutually exclusive foreign keys - is this the way to go?

I have three tables: Users, Companies and Websites.
Users and companies have websites, and thus each user record has a foreign key into the Websites table. Also, each company record has a foreign key into the Websites table.
Now I want to include foreign keys in the Websites table back into their respective "parent" records. How do I do that? Should I have two foreign keys in each website record, with one of them always NULL? Or is there another way to go?
If we look into the model here, we will see the following:
A user is related to exactly one website
A company is related to exactly one website
A website is related to exactly one user or company
The third relation implies existence of a "user or company" entity whose PRIMARY KEY should be stored somewhere.
To store it you need to create a table that would store the PRIMARY KEY of a website owner entity. This table can also store the attributes common to a user and a company.
Since it's a one-to-one relation, website attributes can be stored in this table too.
The attributes not shared by users and companies should be stored in the separate table.
To enforce the correct relationships, you need to make the PRIMARY KEY of the website owner composite, with the owner type as part of it, and enforce the correct type in the child tables with a CHECK constraint:
CREATE TABLE website_owner (
    type INT NOT NULL,
    id INT NOT NULL,
    -- website_attributes,
    -- common_attributes,
    CHECK (type IN (1, 2)), -- 1 for user, 2 for company
    PRIMARY KEY (type, id)
);
CREATE TABLE user (
    type INT NOT NULL,
    id INT NOT NULL PRIMARY KEY,
    -- user_attributes,
    CHECK (type = 1),
    FOREIGN KEY (type, id) REFERENCES website_owner (type, id)
);
CREATE TABLE company (
    type INT NOT NULL,
    id INT NOT NULL PRIMARY KEY,
    -- company_attributes,
    CHECK (type = 2),
    FOREIGN KEY (type, id) REFERENCES website_owner (type, id)
);
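To see the pattern above reject bad rows, here is a minimal sketch run through Python's built-in sqlite3 module. SQLite is used only because it enforces both CHECK constraints and composite foreign keys once `PRAGMA foreign_keys = ON` is set; the `url` and `name` columns, the sample values and the `try_insert` helper are assumptions made for the example, and only the `user` child table is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE website_owner (
    type INTEGER NOT NULL CHECK (type IN (1, 2)),  -- 1 = user, 2 = company
    id   INTEGER NOT NULL,
    url  TEXT,
    PRIMARY KEY (type, id)
);
CREATE TABLE user (
    type INTEGER NOT NULL CHECK (type = 1),
    id   INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    FOREIGN KEY (type, id) REFERENCES website_owner (type, id)
);
""")

conn.execute("INSERT INTO website_owner VALUES (1, 100, 'https://example.org')")
conn.execute("INSERT INTO website_owner VALUES (2, 200, 'https://example.com')")
conn.execute("INSERT INTO user VALUES (1, 100, 'alice')")  # matches owner (1, 100)

def try_insert(sql):
    """Run an INSERT and report whether the constraints let it through."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

# Owner (2, 200) exists but is a company, so the CHECK on user.type refuses it.
wrong_type = try_insert("INSERT INTO user VALUES (2, 200, 'acme')")
# The type is right, but there is no owner row (1, 300) for the foreign key.
no_parent = try_insert("INSERT INTO user VALUES (1, 300, 'bob')")
```

Both attempts are refused by the database itself, so the application does not have to police which kind of owner a row belongs to.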
You don't need a parent column; you can look up the parents with a simple select (or join the tables) on the users and companies tables. If you want to know whether this is a user or a company website, I suggest using a boolean column in your websites table.
Why do you need a foreign key from website to user/company at all? The principle of not duplicating data would suggest it might be better to scan the user/company tables for a matching website id. If you really need to you could always store a flag in the website table that denotes whether a given website record is for a user or a company, and then scan the appropriate table.
The problem I have with the accepted answer (by Quassnoi) is that the object relationships are the wrong way around: company is not a sub-type of a website owner; we had companies before we had websites and we can have companies who are website owners. Also, it seems to me that website ownership is a relationship between a website and either a person or a company i.e. we should have a relationship table (or two) in the schema. It may be an acceptable approach to keep personal website ownership separate from corporate website ownership and only bring them together when required e.g. via VIEWs:
CREATE TABLE People
(
person_id CHAR(9) NOT NULL UNIQUE, -- external identifier
person_name VARCHAR(100) NOT NULL
);
CREATE TABLE Companies
(
company_id CHAR(6) NOT NULL UNIQUE, -- external identifier
company_name VARCHAR(255) NOT NULL
);
CREATE TABLE Websites
(
url CHAR(255) NOT NULL UNIQUE
);
CREATE TABLE PersonalWebsiteOwnership
(
person_id CHAR(9) NOT NULL UNIQUE
REFERENCES People ( person_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE TABLE CorporateWebsiteOwnership
(
company_id CHAR(6) NOT NULL UNIQUE
REFERENCES Companies( company_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE VIEW WebsiteOwnership AS
SELECT url, company_name AS website_owner_name
FROM CorporateWebsiteOwnership
NATURAL JOIN Companies
UNION
SELECT url, person_name AS website_owner_name
FROM PersonalWebsiteOwnership
NATURAL JOIN People;
The problem with the above is there is no way of using database constraints to enforce the rule that a website is either owned by a person or a company but not both.
If we can assume the DBMS enforces check constraints (as the accepted answer does), then we can exploit the fact that a (human) person and a company are both legal persons and employ a super-type table (LegalPersons) but still retain the relationship table approach (WebsiteOwnership), this time using the VIEWs to separate personal website ownership from corporate website ownership, with strongly typed attributes:
CREATE TABLE LegalPersons
(
legal_person_id INT NOT NULL UNIQUE, -- internal artificial identifier
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type IN ( 'Company', 'Person' ) ),
UNIQUE ( legal_person_type, legal_person_id )
);
CREATE TABLE People
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type = 'Person' ),
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
person_id CHAR(9) NOT NULL UNIQUE, -- external identifier
person_name VARCHAR(100) NOT NULL
);
CREATE TABLE Companies
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL
CHECK ( legal_person_type = 'Company' ),
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
company_id CHAR(6) NOT NULL UNIQUE, -- external identifier
company_name VARCHAR(255) NOT NULL
);
CREATE TABLE WebsiteOwnership
(
legal_person_id INT NOT NULL,
legal_person_type CHAR(7) NOT NULL,
UNIQUE ( legal_person_type, legal_person_id ),
FOREIGN KEY ( legal_person_type, legal_person_id )
REFERENCES LegalPersons ( legal_person_type, legal_person_id ),
url CHAR(255) NOT NULL UNIQUE
REFERENCES Websites ( url )
);
CREATE VIEW CorporateWebsiteOwnership AS
SELECT url, company_name
FROM WebsiteOwnership
NATURAL JOIN Companies;
CREATE VIEW PersonalWebsiteOwnership AS
SELECT url, person_name
FROM WebsiteOwnership
NATURAL JOIN People;
What we need are new DBMS features for 'distributed foreign keys' ("For each row in this table there must be exactly one row in one of these tables") and 'multiple assignment' to allow the data to be added into tables thus constrained in a single SQL statement. Sadly we are a long way from getting such features!
First of all, do you really need this bi-directional link? It is a good practice to avoid it unless absolutely needed.
I understand it that you wish to know whether the site belongs to a user or to a company. You can achieve that by having a simple boolean field in the Website table - [BelongsToUser]. If true, then you look up a user, if false - you look up a company.
A bit late, but all the existing answers seemed to fall somewhat short of the mark:
Owner to website is a 1:Many relation
Website to owner is a 1:1 relation
Users and Companies tables should not have a foreign key into the Websites table
None of the website data, common to users and companies or not, should be in the Users or Companies tables
None of the owner's information, common or not, should be in the Websites table
MySQL ignores, silently, CHECK constraints on tables (no enforcement of referential integrity)
The DBMS ought to handle the 'relation' logic, not the application using the database
Some of this is recognized in the answer from onedaywhen, yet that answer still missed the opportunity to make MySQL do the heavy lifting and enforce the referential integrity.
A website can only have one owner, legally, anyway. A person, or company, can have any number of websites, including none. A link in the database from owner to website can only be 1:1 at any level of normalization. In reality the relation is 1:Many, and would require having multiple table entries for each owner that happens to own more than one website. A link from website to owner is 1:1 in both database terms and in reality. Having the link from website to owner represents the model better. With an index in the website table, doing the 1:Many lookup for a given owner becomes reasonably efficient.
The CHECK attribute in SQL would be an excellent solution, if MySQL didn't happen to silently ignore it (true of releases before 8.0.16; later releases enforce CHECK constraints).
MySQL Docs 13.1.20 CREATE TABLE Syntax
The CHECK clause is parsed but ignored by all storage engines.
MySQL's functionality does offer two solutions as work-arounds to implement the behavior of CHECK and keep the referential integrity of the data. Triggers with stored procedures is one, and works well with all manner of constraints. Easier to implement, though less versatile, is using a VIEW with a WITH CHECK OPTION clause, which MySQL will implement.
MySQL Docs 24.5.4 The View WITH CHECK OPTION Clause
The WITH CHECK OPTION clause can be given for an updatable view to prevent inserts to rows for which the WHERE clause in the select_statement is not true. It also prevents updates to rows for which the WHERE clause is true but the update would cause it to be not true (in other words, it prevents visible rows from being updated to nonvisible rows).
The MySQLTUTORIAL site gives a good example of both options in their Introduction to the SQL CHECK constraint tutorial. (You have to think around the typos, but good otherwise.)
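The trigger route mentioned above can be sketched as follows. The example runs through Python's built-in sqlite3 module so that it is executable, which means the trigger syntax is SQLite's (`RAISE(ABORT, ...)`); in MySQL the equivalent trigger would raise its error with `SIGNAL`. The `websites` table layout, the trigger names and the `attempt` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE websites (
    id            INTEGER PRIMARY KEY,
    name          TEXT,
    owner_user    INTEGER,
    owner_company INTEGER
);
-- Reject rows where both owner columns are set, or neither is.
CREATE TRIGGER websites_one_owner_ins BEFORE INSERT ON websites
WHEN (NEW.owner_user IS NULL) = (NEW.owner_company IS NULL)
BEGIN
    SELECT RAISE(ABORT, 'exactly one owner column must be set');
END;
CREATE TRIGGER websites_one_owner_upd BEFORE UPDATE ON websites
WHEN (NEW.owner_user IS NULL) = (NEW.owner_company IS NULL)
BEGIN
    SELECT RAISE(ABORT, 'exactly one owner column must be set');
END;
""")

def attempt(sql, params=()):
    """Return True if the statement succeeds, False if a trigger aborts it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.DatabaseError:
        return False

INSERT = "INSERT INTO websites (name, owner_user, owner_company) VALUES (?, ?, ?)"
results = [
    attempt(INSERT, ("personal", 1, None)),    # user only
    attempt(INSERT, ("corporate", None, 7)),   # company only
    attempt(INSERT, ("both", 1, 7)),           # both set
    attempt(INSERT, ("orphan", None, None)),   # neither set
    attempt("UPDATE websites SET owner_company = 7 WHERE name = 'personal'"),
]
```

Only the first two inserts go through; the update is refused because it would leave the personal site with two owners.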
Having found this question while trying to resolve a similar mutually exclusive foreign key split and developing a solution, with hints generated by the answers, it seems only proper to share my solution in return.
Recommended Solution
For the minimum impact to the existing schema, and the application accessing the data, retain the Users and Companies tables as they are. Rename the Websites table and replace it with a VIEW named Websites which the application can continue to access. Except when dealing with the ownership information, all the old queries to Websites should still work. So:
The setup
-- Keep the `Users` table about "users"
CREATE TABLE `Users` (
    `id` SERIAL PRIMARY KEY,
    `name` VARCHAR(180)
    -- , user_attributes
);
-- Keep the `Companies` table about "companies"
CREATE TABLE `Companies` (
    `id` SERIAL PRIMARY KEY,
    `name` VARCHAR(180)
    -- , company_attributes
);
-- Attach ownership information about the website to the website's record in the `Websites` table, renamed to `WebsitesData`
CREATE TABLE `WebsitesData` (
    `id` SERIAL PRIMARY KEY,
    `name` VARCHAR(255),
    `is_personal` BOOL,
    `owner_user` BIGINT UNSIGNED DEFAULT NULL,
    `owner_company` BIGINT UNSIGNED DEFAULT NULL,
    -- website_attributes,
    FOREIGN KEY `WebsiteOwner_User` (`owner_user`)
        REFERENCES `Users` (`id`)
        ON DELETE RESTRICT ON UPDATE CASCADE,
    FOREIGN KEY `WebsiteOwner_Company` (`owner_company`)
        REFERENCES `Companies` (`id`)
        ON DELETE RESTRICT ON UPDATE CASCADE
);
-- Create a new `VIEW` with the original name of `Websites` as the gateway to the website records which can enforce the constraints you need
CREATE VIEW `Websites` AS
SELECT * FROM `WebsitesData` WHERE
    (`is_personal`=TRUE AND `owner_user` IS NOT NULL AND `owner_company` IS NULL) OR
    (`is_personal`=FALSE AND `owner_user` IS NULL AND `owner_company` IS NOT NULL)
WITH CHECK OPTION;
Usage
-- Use the Websites VIEW for the INSERT, UPDATE, and SELECT operations as you normally would and leave the WebsitesData table in the background.
INSERT INTO `Websites` SET
`is_personal`=TRUE,
`owner_user`=$userID;
INSERT INTO `Websites` SET
`is_personal`=FALSE,
`owner_company`=$companyID;
-- Or, using different field lists based on the type of owner
INSERT INTO `Websites` (`is_personal`,`owner_user`, ...)
VALUES (TRUE, $userID, ...);
INSERT INTO `Websites` (`is_personal`,`owner_company`, ...)
VALUES (FALSE, $companyID, ...);
-- Or, using a common field list, and placing NULL in the proper place
INSERT INTO `Websites` (`is_personal`,`owner_user`,`owner_company`,...)
VALUES (TRUE, $userID, NULL, ...);
INSERT INTO `Websites` (`is_personal`,`owner_user`,`owner_company`,...)
VALUES (FALSE, NULL, $companyID, ...);
-- Change the company that owns a website
-- Will ERROR if the site was owned by a User.
UPDATE `Websites` SET `owner_company`=$new_companyID
WHERE `id`=$siteID;
-- Force change the ownership from a User to a Company
UPDATE `Websites` SET
    `owner_company`=$new_companyID,
    `owner_user`=NULL,
    `is_personal`=FALSE
WHERE `id`=$siteID;
-- Force change the ownership from a Company to a User
UPDATE `Websites` SET
    `owner_user`=$new_userID,
    `owner_company`=NULL,
    `is_personal`=TRUE
WHERE `id`=$siteID;
-- Selecting the owner of a site without needing to know if it is personal or not
(SELECT `Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE AND `Websites`.`id`=$siteID)
UNION
(SELECT `Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE AND `Websites`.`id`=$siteID);
-- Selecting the sites owned by a User
SELECT `name` FROM `Websites`
WHERE `is_personal`=TRUE AND `owner_user`=$userID;
SELECT `Websites`.`name`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE AND `Users`.`name`="$user_name";
-- Selecting the sites owned by a Company
SELECT `name` FROM `Websites`
WHERE `is_personal`=FALSE AND `owner_company`=$companyID;
SELECT `Websites`.`name`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE AND `Companies`.`name`="$company_name";
-- Listing all websites and their owners
(SELECT `Websites`.`name` AS `Website`,`Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE)
UNION ALL
(SELECT `Websites`.`name` AS `Website`,`Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE)
ORDER BY Website, Owner;
-- Listing all users or companies which own at least one website
(SELECT `Users`.`name` AS `Owner`
FROM `Websites`
JOIN `Users` ON `Websites`.`owner_user`=`Users`.`id`
WHERE `is_personal`=TRUE)
UNION DISTINCT
(SELECT `Companies`.`name` AS `Owner`
FROM `Websites`
JOIN `Companies` ON `Websites`.`owner_company`=`Companies`.`id`
WHERE `is_personal`=FALSE)
ORDER BY `Owner`;
Normalization Level Up
As a technical note for normalization, the ownership information could be factored out of the Websites table and a new table created to hold the ownership data, including the is_personal column.
CREATE TABLE `WebOwnersData` (
    `id` SERIAL PRIMARY KEY,
    `is_personal` BOOL,
    `user` BIGINT UNSIGNED DEFAULT NULL,
    `company` BIGINT UNSIGNED DEFAULT NULL,
    FOREIGN KEY `WebOwners_User` (`user`)
        REFERENCES `Users` (`id`)
        ON DELETE RESTRICT ON UPDATE CASCADE,
    FOREIGN KEY `WebOwners_Company` (`company`)
        REFERENCES `Companies` (`id`)
        ON DELETE RESTRICT ON UPDATE CASCADE
);
CREATE VIEW `WebOwners` AS
SELECT * FROM `WebOwnersData` WHERE
    (`is_personal`=TRUE AND `user` IS NOT NULL AND `company` IS NULL) OR
    (`is_personal`=FALSE AND `user` IS NULL AND `company` IS NOT NULL)
WITH CHECK OPTION;
-- A foreign key has to reference a base table rather than a view,
-- so `owner` points at `WebOwnersData`, which is created first.
CREATE TABLE `Websites` (
    `id` SERIAL PRIMARY KEY,
    `name` VARCHAR(255),
    `owner` BIGINT UNSIGNED DEFAULT NULL,
    -- website_attributes,
    FOREIGN KEY `Website_Owner` (`owner`)
        REFERENCES `WebOwnersData` (`id`)
        ON DELETE RESTRICT ON UPDATE CASCADE
);
I believe, however, that the created VIEW, with its constraints, prevents any of the anomalies that normalization aims to remove, and adds complexity that is not needed in the situation. The normalization process is always a trade off anyway.

Scalable Database Tagging Schema

EDIT: To people building tagging systems: don't read this. It is not what you are looking for. I asked this when I wasn't aware that RDBMSs all have their own optimization methods; just use a simple many-to-many scheme.
I have a posting system that has millions of posts. Each post can have an infinite number of tags associated with it.
Users can create tags which have notes, date created, owner, etc. A tag is almost like a post itself, because people can post notes about the tag.
Each tag association has an owner and date, so we can see who added the tag and when.
My question is how can I implement this? It has to be fast searching posts by tag, or tags by post. Also, users can add tags to posts by typing the name into a field, kind of like the google search bar, it has to fill in the rest of the tag name for you.
I have 3 solutions at the moment, but not sure which is the best, or if there is a better way.
Note that I'm not showing the layout of notes since it will be trivial once I get a proper solution for tags.
Method 1. Linked list
tagId in post points to a linked list in tag_assoc, the application must traverse the list until flink=0
post: id, content, ownerId, date, tagId, notesId
tag_assoc: id, tagId, ownerId, flink
tag: id, name, notesId
Method 2. Denormalization
tags is simply a VARCHAR or TEXT field containing a tab delimited array of tagId:ownerId. It cannot be a fixed size.
post: id, content, ownerId, date, tags, notesId
tag: id, name, notesId
Method 3. Toxi
(from: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html,
also same thing here: Recommended SQL database design for tags or tagging)
post: id, content, ownerId, date, notesId
tag_assoc: ownerId, tagId, postId
tag: id, name, notesId
Method 3 raises the question, how fast will it be to iterate through every single row in tag_assoc?
Methods 1 and 2 should be fast for returning tags by post, but for posts by tag, another lookup table must be made.
The last thing I have to worry about is optimizing searching tags by name, I have not worked that out yet.
I made an ASCII diagram here: http://pastebin.com/f1c4e0e53
Here is how I'd do it:
posts: [postId], content, ownerId, date, noteId, noteType='post'
tag_assoc: [postId, tagName], ownerId, date, noteId, noteType='tagAssoc'
tags: [tagName], ownerId, date, noteId, noteType='tag'
notes: [noteId, noteType], ownerId, date, content
The fields in square brackets are the primary key of the respective table.
Define a constraint on noteType in each table: posts, tag_assoc, and tags. This prevents a given note from applying to both a post and a tag, for example.
Store tag names as a short string, not an integer id. That way you can use the covering index [postId, tagName] in the tag_assoc table.
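Tag completion is done with an AJAX call. If the user types "datab" for a tag, your web page makes an AJAX call and on the server side, the app queries: SELECT tagName FROM tags WHERE tagName LIKE ?||'%'. (In MySQL's default SQL mode, || is a logical OR rather than string concatenation, so there you would write LIKE CONCAT(?, '%').)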
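A rough sketch of the server side of that completion call, using Python with an in-memory SQLite database standing in for the real one. The table and column names follow the layout above; the sample tags, the complete_tag helper and the limit parameter are made up for illustration, and the prefix is not escaped for % or _ here.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tagName TEXT PRIMARY KEY);
    CREATE TABLE tag_assoc (
        postId  INTEGER NOT NULL,
        tagName TEXT NOT NULL REFERENCES tags(tagName),
        PRIMARY KEY (postId, tagName)  -- covers tags-by-post lookups
    );
    INSERT INTO tags VALUES ('database'), ('database-design'), ('mysql');
""")

def complete_tag(prefix, limit=10):
    """Return tag names that start with the typed prefix, alphabetically."""
    rows = conn.execute(
        "SELECT tagName FROM tags WHERE tagName LIKE ? || '%' "
        "ORDER BY tagName LIMIT ?",
        (prefix, limit),
    )
    return [name for (name,) in rows]

suggestions = complete_tag("datab")
```

Because the pattern has no leading wildcard, a database can generally answer this from the index on tagName instead of scanning the whole table.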
"A tag is almost like a post itself, because people can post notes about the tag." - this phrase makes me think you really just want one table for POST, with a primary key and a foreign key that references the POST table. Now you can have as many tags for each post as your disk space will allow.
I'm assuming there's no need for many to many between POST and tags, because a tag isn't shared across posts, based on this:
"Users can create tags which have notes, date created, owner, etc."
If creation date and owner are shared, those would be two additional foreign key relationships, IMO.
A linked list is almost certainly the wrong approach. It certainly means that your queries will be either complex or sub-optimal - which is ironic since the most likely reason for using a linked list is to keep the data in the correct sorted order. However, I don't see an easy way to avoid iteratively fetching a row, and then using the flink value retrieved to condition the select operation for the next row.
So, use a table-based approach with normal foreign key to primary key references. The one outlined by Bill Karwin looks similar to what I'd outline.
Bill, I think I kind of threw you off: the notes are just in another table, and there is a separate table with notes posted by different people. Posts have notes and tags, but tags also have notes, which is why tags are UNIQUE.
Jonathan is right about linked lists, I won't use them at all. I decided to implement the tags in the simplest normalized way that meets my needs:
DROP TABLE IF EXISTS `tags`;
CREATE TABLE IF NOT EXISTS `tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts`;
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`content` TEXT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_notes`;
CREATE TABLE IF NOT EXISTS `posts_notes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
`note` TEXT NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_tags`;
CREATE TABLE IF NOT EXISTS `posts_tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`tagId` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE,
FOREIGN KEY (`tagId`) REFERENCES tags(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
I'm not sure how fast this will be in the future, but it should be fine for a while as only a couple people use the database.
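For what it's worth, here is a sketch of the two lookups against a cut-down version of that schema, run in Python on an in-memory SQLite database rather than MySQL. The two composite indexes, the sample rows and the helper names are my own assumptions, not part of the schema above. InnoDB already creates an index on each foreign key column, so both directions are indexed even without them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts_tags (
        id     INTEGER PRIMARY KEY,
        tagId  INTEGER NOT NULL REFERENCES tags(id),
        postId INTEGER NOT NULL REFERENCES posts(id)
    );
    -- Assumption: one composite index per lookup direction.
    CREATE INDEX posts_tags_tag_post ON posts_tags (tagId, postId);
    CREATE INDEX posts_tags_post_tag ON posts_tags (postId, tagId);

    INSERT INTO tags  VALUES (1, 'sql'), (2, 'design');
    INSERT INTO posts VALUES (1, 'First post'), (2, 'Second post');
    INSERT INTO posts_tags (tagId, postId) VALUES (1, 1), (2, 1), (1, 2);
""")

def posts_by_tag(tag_name):
    """Names of all posts carrying the given tag."""
    rows = conn.execute(
        "SELECT p.name FROM posts p "
        "JOIN posts_tags pt ON pt.postId = p.id "
        "JOIN tags t ON t.id = pt.tagId "
        "WHERE t.name = ? ORDER BY p.id",
        (tag_name,),
    )
    return [name for (name,) in rows]

def tags_by_post(post_id):
    """Names of all tags attached to the given post."""
    rows = conn.execute(
        "SELECT t.name FROM tags t "
        "JOIN posts_tags pt ON pt.tagId = t.id "
        "WHERE pt.postId = ? ORDER BY t.name",
        (post_id,),
    )
    return [name for (name,) in rows]
```

Neither query has to walk every row of posts_tags; each starts from an indexed key and follows the junction table from there.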

Preferred database design method for assigning user roles? (Hats vs. Groups)

I have a medium-sized MySQL database with a primary "persons" table which contains basic contact information about every human being connected to the theatre and theatre school for which I maintain and develop a number of web applications.
Some persons are just contacts - that is, their "persons" table record is all the information we need to store about them. Many others though have to be able to assume different roles for a variety of systems. Of these, most start out as students. Some start as employees. People who are students can become interns or performers; employees can become students; all teachers are employees and performers, etc.
In essence, there are a variety of different "hats" that any individual person may have to wear in order to access and interact with different parts of the system, as well as have information about them made available on public pages on our site.
My choice for implementing this model is to have several other tables which represent these "hats" - tables which contain meta-information to supplement the basic "person" info, all of which use the "persons" id as their primary key. For example, a person who is a teacher has a record in a teachers table containing his or her short biographical information and pay rate. All teachers are also employees (but not all employees are teachers), meaning they have a record in the employees table which allows them to submit their hours into our payroll system.
My question is, what are the drawbacks to implementing the model as such? The only other option I can think of is to inflate the persons table with fields that will be empty and useless for most entries, add a cumbersome table of "groups" to which persons can belong, give almost every table in every system a person_id foreign key, and then depend on business logic to verify that the person_id referenced belongs to the appropriate group. But that's stupid, isn't it?
A few example table declarations follow below, which hopefully should demonstrate how I'm currently putting all this together, and hopefully show why I think it is a more sensible way to model the reality of the various situations the systems have to deal with.
Any and all suggestions and comments are welcome. I appreciate your time.
EDIT A few respondents have mentioned using ACLs for security - I did not mention in my original question that I am in fact using a separate ACL package for fine-grained access control for actual users of the different systems. My question is more about the best practices for storing metadata about people in the database schema.
CREATE TABLE persons (
`id` int(11) NOT NULL auto_increment,
`firstName` varchar(50) NOT NULL,
`middleName` varchar(50) NOT NULL default '',
`lastName` varchar(75) NOT NULL,
`email` varchar(100) NOT NULL default '',
`address` varchar(255) NOT NULL default '',
`address2` varchar(255) NOT NULL default '',
`city` varchar(75) NOT NULL default '',
`state` varchar(75) NOT NULL default '',
`zip` varchar(10) NOT NULL default '',
`country` varchar(75) NOT NULL default '',
`phone` varchar(30) NOT NULL default '',
`phone2` varchar(30) NOT NULL default '',
`notes` text NOT NULL default '',
`birthdate` date NOT NULL default '0000-00-00',
`created` datetime NOT NULL default '0000-00-00 00:00',
`updated` timestamp NOT NULL,
PRIMARY KEY (`id`),
KEY `lastName` (`lastName`),
KEY `email` (`email`)
) ENGINE=InnoDB;
CREATE TABLE teachers (
`person_id` int(11) NOT NULL,
`bio` text NOT NULL default '',
`image` varchar(150) NOT NULL default '',
`payRate` float(5,2) NOT NULL,
`active` boolean NOT NULL default 0,
PRIMARY KEY (`person_id`),
FOREIGN KEY(`person_id`) REFERENCES `persons` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
CREATE TABLE classes (
`id` int(11) NOT NULL auto_increment,
`teacher_id` int(11) default NULL,
`classstatus_id` int(11) NOT NULL default 0,
`description` text NOT NULL default '',
`capacity` tinyint NOT NULL,
PRIMARY KEY(`id`),
FOREIGN KEY(`teacher_id`) REFERENCES `teachers` (`person_id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
FOREIGN KEY(`classstatus_id`) REFERENCES `classstatuses` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
KEY (`teacher_id`,`level_id`),
KEY (`teacher_id`,`classstatus_id`)
) ENGINE=InnoDB;
CREATE TABLE students (
`person_id` int(11) NOT NULL,
`image` varchar(150) NOT NULL default '',
`note` varchar(255) NOT NULL default '',
PRIMARY KEY (`person_id`),
FOREIGN KEY(`person_id`) REFERENCES `persons` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
CREATE TABLE enrollment (
`id` int(11) NOT NULL auto_increment,
`class_id` int(11) NOT NULL,
`student_id` int(11) NOT NULL,
`enrollmenttype_id` int(11) NOT NULL,
`created` datetime NOT NULL default '0000-00-00 00:00',
`modified` timestamp NOT NULL,
PRIMARY KEY(`id`),
FOREIGN KEY(`class_id`) REFERENCES `classes` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
FOREIGN KEY(`student_id`) REFERENCES `students` (`person_id`)
ON DELETE RESTRICT ON UPDATE CASCADE,
FOREIGN KEY(`enrollmenttype_id`) REFERENCES `enrollmenttypes` (`id`)
ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
I went through a similar thing last year. There the question was: do we model our entities explicitly or generically? In your example, that would mean having entities/tables like teacher, student, etc with direct relationships between them or not.
In the end we went for a generic "Party" model. The Party model is as follows:
A Party represents a person or organisation;
Most Party types have a dependent table to store extra information depending on the party type, e.g. Person, Organization, Company;
Things like Student or Teacher are Party Roles. A Party may have any number of Party Roles. A Person may be both a Teacher and a Student, for example;
Things like classes are handled as Party Role Relationships. For example, a relationship between a Teacher and Student role indicates a class relationship;
Party Role Relationships can have subtypes for extra information. A Teacher-Student Relationship in your model is an Enrolment and that could have the extra attributes you're talking about;
Parties don't have direct relationships with each other. Only Party Roles relate to each other; and
For common groupings of information, we created views if it helped because the SQL can be a bit convoluted as the relationships are more indirect (eg there are three tables in between the Party entities for a Teacher and Student).
It's an extremely powerful model, one that is pretty common in CRM type systems. This model pretty much came from "The Data Model Resource Book: Volume 1", which is an excellent resource for such things.
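To make the shape of this a bit more concrete, here is a heavily simplified sketch in Python with an in-memory SQLite database. The table names, column names and sample data are my own guesses at a minimal version, not the layout from the book or from our system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE party (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE party_role (
        id       INTEGER PRIMARY KEY,
        party_id INTEGER NOT NULL REFERENCES party(id),
        role     TEXT NOT NULL,
        UNIQUE (party_id, role)
    );
    -- Only roles relate to each other, never parties directly.
    CREATE TABLE party_role_relationship (
        id           INTEGER PRIMARY KEY,
        from_role_id INTEGER NOT NULL REFERENCES party_role(id),
        to_role_id   INTEGER NOT NULL REFERENCES party_role(id),
        kind         TEXT NOT NULL
    );
    INSERT INTO party VALUES (1, 'Alice'), (2, 'Bob');
    -- Alice is both a teacher and a student; Bob is a student.
    INSERT INTO party_role VALUES
        (1, 1, 'teacher'), (2, 1, 'student'), (3, 2, 'student');
    -- Bob's student role is enrolled with Alice's teacher role.
    INSERT INTO party_role_relationship VALUES (1, 1, 3, 'enrolment');
""")

def roles_of(name):
    """All roles a party currently holds."""
    rows = conn.execute(
        "SELECT r.role FROM party p JOIN party_role r ON r.party_id = p.id "
        "WHERE p.name = ? ORDER BY r.role",
        (name,),
    )
    return [role for (role,) in rows]

def students_of(teacher_name):
    """Walk party -> role -> relationship -> role -> party."""
    rows = conn.execute(
        """
        SELECT s.name
        FROM party t
        JOIN party_role tr ON tr.party_id = t.id AND tr.role = 'teacher'
        JOIN party_role_relationship rel
             ON rel.from_role_id = tr.id AND rel.kind = 'enrolment'
        JOIN party_role sr ON sr.id = rel.to_role_id AND sr.role = 'student'
        JOIN party s ON s.id = sr.party_id
        WHERE t.name = ?
        ORDER BY s.name
        """,
        (teacher_name,),
    )
    return [name for (name,) in rows]
```

The students_of query shows the three tables that sit between the two party rows, which is the kind of join worth hiding behind a view.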
The groups and hats models you describe are convertible, one to the other. There's no real worry about data loss. Specifically, the "master groups" table can be produced by outer joins of the "hat person" table with the various "hat detail" tables.
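As an illustration of that conversion, the sketch below (Python, in-memory SQLite, made-up sample rows, cut-down versions of the persons, teachers and students tables) derives a group-membership listing from the hat tables with outer joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons  (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
    CREATE TABLE teachers (person_id INTEGER PRIMARY KEY REFERENCES persons(id));
    CREATE TABLE students (person_id INTEGER PRIMARY KEY REFERENCES persons(id));
    INSERT INTO persons VALUES (1, 'Ann', 'Lee'), (2, 'Ben', 'Kay'), (3, 'Cy', 'Orr');
    INSERT INTO teachers VALUES (1);
    INSERT INTO students VALUES (1), (2);
""")

def group_memberships():
    """One row per person: (id, is_teacher, is_student) as 0/1 flags."""
    return conn.execute(
        """
        SELECT p.id,
               t.person_id IS NOT NULL AS is_teacher,
               s.person_id IS NOT NULL AS is_student
        FROM persons p
        LEFT JOIN teachers t ON t.person_id = p.id
        LEFT JOIN students s ON s.person_id = p.id
        ORDER BY p.id
        """
    ).fetchall()
```

A person with no hat at all (id 3 here) still shows up, with every flag at 0, which is what a plain contact looks like in the groups view.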
If you're using a "hat" model you need to make sure that a given "hat table" accurately encapsulates the unique characteristics of that hat. There's a lot less forgiveness there than with the groups model.
You'll probably want to set up a few views for common tasks if you go this way - for example, if somebody's typing into a field for "teacher name" and you want to pop up some autocompletes, having a view which is basically
SELECT firstName, lastName
FROM persons
INNER JOIN teachers ON persons.id = teachers.person_id
will help enormously.
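A sketch of such a view in use, again in Python on an in-memory SQLite database. The view name teacher_names, the autocomplete_teacher helper and the sample people are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (
        id INTEGER PRIMARY KEY,
        firstName TEXT NOT NULL,
        lastName  TEXT NOT NULL
    );
    CREATE TABLE teachers (person_id INTEGER PRIMARY KEY REFERENCES persons(id));

    -- Hypothetical view name; the body is the join from the text.
    CREATE VIEW teacher_names AS
        SELECT persons.id AS person_id, firstName, lastName
        FROM persons
        INNER JOIN teachers ON persons.id = teachers.person_id;

    INSERT INTO persons VALUES
        (1, 'Maria', 'Lopez'), (2, 'Mark', 'Stone'), (3, 'Mary', 'Smith');
    INSERT INTO teachers VALUES (1), (3);
""")

def autocomplete_teacher(prefix):
    """Teachers whose first name starts with the typed prefix."""
    return conn.execute(
        "SELECT firstName, lastName FROM teacher_names "
        "WHERE firstName LIKE ? || '%' ORDER BY lastName, firstName",
        (prefix,),
    ).fetchall()
```

Mark matches the prefix "Mar" but never appears in the results, because he has no row in teachers.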
On a tangential note, one thing I've found to be useful is to call foreign keys by the same name as the primary key in their original table. That way you can just
INNER JOIN original_table USING (primary_key)
in your FROM instead of monkeying with WHERE or ON equivalencies.
Are teachers the only 'person' that has a pay rate? You may be limiting your design by doing it this way. What you may want to do is have an attributes table that stores additional attributes for a 'person'. This will allow for future modifications.
I like the Hat approach. In the past, I implemented a combination of hats and groups. Basically, there is a list of all possible actions (permissions) a user can do. I then have a table of groups. Each group can have 1 or many actions (permissions). I then assign users to groups.
This provides me with a lot of flexibility. I can get very fine-grained in my permissioning. I can also change many people's permissions quickly by just editing the group. In fact, I have the permissioning page set up to use the same permissions. This allows the end user (not me) to set up permissions for other users.
Yes, teachers are the only people who have a pay rate as such. That field should more accurately be called "classPayRate" - it's a special case for teacher employees. Non-teacher employees submit their total hours as a separate line item in our payroll system.
I might change teachers to employees and add employee type.
However, in no way, shape or form would I ever store email, address, phone in the person table. These should all be separate tables of their own as people have multiple email addresses (work and home), multiple phone numbers (work, home, cell, fax) and multiple addresses (work, home1, home2, school, etc.). I would put each in its own table and assign a type to it so you can identify which is what type of address, phone, etc.
Also, for address, email and phone you might want a flag to identify which is the main record to use for contacting first. We call ours correspondence, and it is a boolean that is kept up to date with a trigger, as each person who has a record must have one and only one correspondence. So if it changes, the old one must automatically be reset as the new one goes in; if it is the first record, it is set automatically; and if the correspondence record is deleted, the flag is assigned to another one if there are remaining records.
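A minimal sketch of that flag for an email table, written as SQLite triggers driven from Python. The table, column and trigger names are made up, and this is not our production code. MySQL in particular does not let a trigger update the table it is defined on, so there the same rules would have to live in a stored procedure or in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email (
        id             INTEGER PRIMARY KEY,
        person_id      INTEGER NOT NULL,
        address        TEXT NOT NULL,
        kind           TEXT NOT NULL,              -- 'work', 'home', ...
        correspondence INTEGER NOT NULL DEFAULT 0  -- 1 = main contact record
    );

    -- The first record for a person becomes the correspondence record.
    CREATE TRIGGER email_first_insert AFTER INSERT ON email
    WHEN NOT EXISTS (SELECT 1 FROM email
                     WHERE person_id = NEW.person_id AND id <> NEW.id)
    BEGIN
        UPDATE email SET correspondence = 1 WHERE id = NEW.id;
    END;

    -- A record inserted with the flag set takes it from the others.
    CREATE TRIGGER email_insert_flagged AFTER INSERT ON email
    WHEN NEW.correspondence = 1
    BEGIN
        UPDATE email SET correspondence = 0
        WHERE person_id = NEW.person_id AND id <> NEW.id;
    END;

    -- Setting the flag on one record resets it on the person's others.
    CREATE TRIGGER email_switch AFTER UPDATE OF correspondence ON email
    WHEN NEW.correspondence = 1
    BEGIN
        UPDATE email SET correspondence = 0
        WHERE person_id = NEW.person_id AND id <> NEW.id;
    END;

    -- Deleting the flagged record hands the flag to a remaining one.
    CREATE TRIGGER email_delete AFTER DELETE ON email
    WHEN OLD.correspondence = 1
    BEGIN
        UPDATE email SET correspondence = 1
        WHERE id = (SELECT MIN(id) FROM email WHERE person_id = OLD.person_id);
    END;
""")

def correspondence_of(person_id):
    """Addresses currently flagged as correspondence (should be 0 or 1 rows)."""
    rows = conn.execute(
        "SELECT address FROM email WHERE person_id = ? AND correspondence = 1 "
        "ORDER BY id",
        (person_id,),
    )
    return [address for (address,) in rows]

conn.execute("INSERT INTO email (person_id, address, kind) "
             "VALUES (1, 'ann@work.example', 'work')")
conn.execute("INSERT INTO email (person_id, address, kind) "
             "VALUES (1, 'ann@home.example', 'home')")
after_inserts = correspondence_of(1)   # first record was flagged automatically
conn.execute("UPDATE email SET correspondence = 1 "
             "WHERE address = 'ann@home.example'")
after_switch = correspondence_of(1)    # the flag moved to the home address
conn.execute("DELETE FROM email WHERE address = 'ann@home.example'")
after_delete = correspondence_of(1)    # the flag fell back to the work address
```

The sketch does not stop someone from clearing the flag by hand with an UPDATE to 0, so a real implementation would need a rule for that case too.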
For security I prefer to use Access Control Lists (ACLs). With ACLs you have Principals (users or groups of users), Resources (such as a file, record or group of records) and Actions (such as read, update, delete).
By default nobody has any privilege. To grant a permission you add an entry like Bob has Read Access to File Abc.
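The idea fits in a few lines. This is only a toy in-memory sketch in Python with invented class and method names, not the JAAS API or any particular ACL package.

```python
from typing import NamedTuple

class Grant(NamedTuple):
    principal: str  # a user or a group of users
    resource: str   # a file, record or group of records
    action: str     # e.g. "read", "update", "delete"

class Acl:
    """Default-deny ACL: only explicitly granted triples are allowed."""

    def __init__(self):
        self._grants = set()

    def grant(self, principal, resource, action):
        self._grants.add(Grant(principal, resource, action))

    def revoke(self, principal, resource, action):
        self._grants.discard(Grant(principal, resource, action))

    def is_allowed(self, principal, resource, action):
        return Grant(principal, resource, action) in self._grants

acl = Acl()
acl.grant("Bob", "File Abc", "read")
bob_can_read = acl.is_allowed("Bob", "File Abc", "read")
bob_can_delete = acl.is_allowed("Bob", "File Abc", "delete")
```

Bob can read File Abc and nothing else; every other check fails simply because no matching entry exists.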
You should be able to find code that helps you implement something like this. In Java the JAAS supports this method.
I'm using an ACL package for fine-grained permissions on the site - the heart of my question is more about how to store the metadata for individuals who have different roles and build a few safeguards into the system for data integrity (such that a person must have a teacher record in order to be set as the teacher of a class - the foreign key constraint references the teacher table, not the person table).
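A small sketch of that safeguard, using Python and an in-memory SQLite database in place of MySQL, with the tables cut down to the relevant columns. The sample people and the assign_teacher helper are invented for the example. SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, firstName TEXT NOT NULL);
    CREATE TABLE teachers (
        person_id INTEGER PRIMARY KEY REFERENCES persons(id)
    );
    CREATE TABLE classes (
        id         INTEGER PRIMARY KEY,
        -- References teachers, not persons: only a teacher can teach a class.
        teacher_id INTEGER REFERENCES teachers(person_id)
    );
    INSERT INTO persons VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO teachers VALUES (1);       -- only Ann wears the teacher hat
    INSERT INTO classes VALUES (1, NULL);
""")

def assign_teacher(class_id, person_id):
    """Try to set a class's teacher; return False if the database refuses."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE classes SET teacher_id = ? WHERE id = ?",
                (person_id, class_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ben_assigned = assign_teacher(1, 2)  # Ben has no teachers row
ann_assigned = assign_teacher(1, 1)  # Ann does
```

The check lives in the schema, so no business logic has to remember to verify that the person is a teacher.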
I have used the Party model before. It really solves most of the shortcomings.