Modelling Post and Flag relationship in SQL - sql

I am modeling the data for a web app I am building. I use a PostgreSQL database.
The app has posts, like SO posts, and flags for posts, like GitHub flags or marks (whatever the correct term is). A post can have only one flag at a time. There are many posts and their number keeps growing, but there are only four or five flags, and that number will not grow.
First approach, normalized: I have modeled this part of my data with three tables: two for the entities, posts and flags, and one for the relationship, post_flag. Neither entity table references the other. The whole relationship is recorded in post_flag, which holds only the pair of ids of a post and a flag.
Table structure in that case would be:
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id)
);
CREATE TABLE flags
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
flag character varying(30) NOT NULL -- planned, in progress, fixed
);
CREATE TABLE post_flag
(
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
post_id integer NOT NULL REFERENCES posts (id),
flag_id integer NOT NULL REFERENCES flags (id)
);
To get posts flagged as fixed I have to use:
-- homepage posts- fixed posts tab
SELECT
p.*,
f.flag
FROM posts p
JOIN post_flag p_f
ON p.id = p_f.post_id
JOIN flags f
ON p_f.flag_id = f.id
WHERE f.flag = 'fixed'
ORDER BY p_f.created_at DESC
Second approach; I have two tables posts and flags. The table posts has a flag_id column that references a flag in the flags table.
-- flags is created first so that posts.flag_id can reference it
CREATE TABLE flags
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
flag character varying(30) NOT NULL -- one of planned, in progress, fixed
);
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id),
flag_id integer DEFAULT NULL REFERENCES flags (id)
);
For the same data:
-- homepage posts- fixed posts tab
SELECT
p.*,
f.flag
FROM posts p
JOIN flags f
ON p.flag_id = f.id
WHERE f.flag = 'fixed'
ORDER BY p.created_at DESC
Third approach, denormalized: I have only one table, posts. It has a flag column that stores the flag assigned to the post.
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
title character varying(100),
text text,
score integer DEFAULT 0,
author_id integer NOT NULL REFERENCES users (id),
product_id integer NOT NULL REFERENCES products (id),
flag character varying(30)
);
Here, for the same data, I would only need:
-- homepage posts- fixed posts tab
SELECT
p.*
FROM posts p
WHERE p.flag = 'fixed'
ORDER BY p.created_at DESC
I wonder whether the first approach is overkill in terms of normalization in an RDBMS like PostgreSQL. For a post-comment relationship the first approach would be great, and I do use it there. But I have a small amount of data used as metadata for posts, such as badges, flags, and tags. As you can see, even in the most normalized form, the first approach, the posts table already holds product_id and similar columns, which saves one JOIN, though to a different table for a different relation, not to flags. So that part already matches my second approach. Should I use the more denormalized third approach, with just a posts table and a flag column in it? Which approach is better in terms of performance, expansion, and maintainability?

Use the second approach.
The first is a many-to-many data structure, and you say:
"A post can have only one flag at a time."
So you would then have to build the business logic into the front end or set up complex rules to check that a post never has more than one flag.
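If the first approach were kept anyway, one such rule could be a unique constraint on post_id in the relationship table, so that the database itself rejects a second flag for the same post. This is only a sketch: it assumes the post_flag table from the question already exists, and the constraint name is made up for illustration.

```sql
-- Hypothetical constraint name; assumes the question's post_flag table.
-- A second row for the same post_id is rejected with a unique violation.
ALTER TABLE post_flag
    ADD CONSTRAINT post_flag_one_flag_per_post UNIQUE (post_id);
```

With this in place, changing a post's flag becomes an UPDATE of the existing post_flag row rather than a second INSERT.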
The third approach will result in messy data, again unless you implement checks or rules to ensure that flags are not misspelled and that no new ones are added.
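As a rough illustration of such a check, a CHECK constraint can restrict the flag column to a fixed list. The sketch assumes the single-table posts definition from the third approach; the allowed values are taken from the comment in the question's flags table, and the constraint name is invented.

```sql
-- Hypothetical constraint name; assumes posts has a "flag" column
-- as in the third approach. NULL still passes the check, so a post
-- without a flag remains possible.
ALTER TABLE posts
    ADD CONSTRAINT posts_flag_allowed
    CHECK (flag IN ('planned', 'in progress', 'fixed'));
```

The trade-off is that adding or renaming a flag then means altering the constraint rather than inserting or updating a row.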
Expansion and maintainability are provided in the second approach; it is also self-documenting. Worry about performance when it actually becomes a problem, and not before.
Personally I would make the flag_id field in the posts table nullable, which would allow you to model a post without a flag.
Blending two approaches
Assuming your flag names are unique, you can use the flag name as a natural key. Your table structures would then be (flags is created first so that posts can reference it):
CREATE TABLE flags
(
flag character varying(30) NOT NULL PRIMARY KEY,
created_at timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE posts
(
id bigserial PRIMARY KEY,
-- ... other fields
flag character varying(30) REFERENCES flags (flag)
);
You then get the benefit of being able to write queries for flag without having to JOIN to the flags table while having flag names checked by the table reference.
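For example, the "fixed posts" query from the question could then look roughly like this. The sketch assumes that created_at is among the elided "other fields" of posts, as in the question's definitions.

```sql
-- No JOIN needed: the flag name is stored directly in posts and is
-- validated against flags (flag) by the foreign key.
SELECT p.*
FROM posts p
WHERE p.flag = 'fixed'
ORDER BY p.created_at DESC;  -- assumes created_at is one of the other fields
```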

Related

Auto increment column depending other columns value

Hi I'm very new to Postgresql or SQL in general, so my terminology will probably be off.
I'm trying to add a version number column to my text table.
name        type       desc
text_id     uuid       unique id
post_id     uuid       id of the collection of versions
version_nr  INT        version number of document
created_at  TIMESTAMP  when the document was edited
So basically when I create a new row I want to increment the version number, but I don't want all the rows to share the same "increment". I want all rows sharing the same post_id to have their own "increment".
This is what I've come up with so far:
CREATE TABLE text (
text_id uuid PRIMARY KEY DEFAULT UUID_GENERATE_V4(),
post_id uuid NOT NULL,
version_nr SERIAL, -- <--- I DON'T KNOW
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
I just don't understand how to solve the version_nr part.
Thanks!

Postgresql left join issue

I have two tables
create table sources (
user_id bigint,
source varchar,
created timestamp default now(),
unique(user_id, source)
);
create table subscriptions (
user_id bigint unique primary key,
plan int not null references plans(id),
starts timestamp default now(),
ends timestamp default now() + interval '30' day
);
I am trying to select everything from sources for users who have an active subscription, using this query:
SELECT src.* FROM sources AS src LEFT JOIN subscriptions as sub ON sub.user_id=src.user_id WHERE now() < sub.ends
However, it does not return all the data. The problem became clear when I tried:
SELECT * FROM sources AS src LEFT JOIN subscriptions as sub ON sub.user_id=src.user_id
What may be the problem, and how is it possible for me to get this info in another way?
PS subscriptions table has rows with other user_id's.
Thank you very much!
UPDATE
Your left join is working. If your image is of SELECT * FROM sources AS src LEFT JOIN subscriptions as sub ON sub.user_id=src.user_id then some of your users have no subscriptions.
SELECT * FROM sources AS src LEFT JOIN subscriptions ... only ensures that every row in sources is returned, whether or not it has a matching row in subscriptions.
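To illustrate with the question's tables: for a sources row without a matching subscription, every sub column is NULL, so a condition like now() < sub.ends evaluates to NULL and the row is dropped by the WHERE clause. The two queries below are only a sketch of how one might inspect this.

```sql
-- Sources whose user has no subscription row at all.
SELECT src.*
FROM sources AS src
LEFT JOIN subscriptions AS sub ON sub.user_id = src.user_id
WHERE sub.user_id IS NULL;

-- Sources of users with an active subscription. Because the WHERE
-- condition already discards unmatched rows, an inner join states
-- the intent more directly.
SELECT src.*
FROM sources AS src
JOIN subscriptions AS sub ON sub.user_id = src.user_id
WHERE now() < sub.ends;
```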
[I wrote the rest when I thought it was a different problem, but the advice remains.]
This is possible because sources.user_id is not set up as a foreign key so the database cannot enforce referential integrity. As far as the database is concerned, sources.user_id is just some number. If you delete a user, their sources and subscriptions will hang around.
In general...
Clever primary keys just bring complications. Give every table more complicated than a simple join table an id bigserial primary key and be done with it.
Declare everything not null unless you have a good reason it should be null.
Use on delete cascade to clean up when a user is deleted, but see below.
create table sources (
id bigserial primary key,
-- When a user is deleted, its sources will also be deleted
user_id bigint not null references users(id) on delete cascade,
source varchar not null,
created timestamp not null default now(),
unique(user_id, source)
);
create table subscriptions (
id bigserial primary key,
-- When a user is deleted, its subscriptions will also be deleted
user_id bigint not null unique references users(id) on delete cascade,
-- Make foreign key names consistent, and use a consistent type for ids.
plan_id bigint not null references plans(id) on delete cascade,
starts timestamp not null default now(),
ends timestamp not null default now() + interval '30' day
);
Whether to use on delete cascade when a deleted user still has sources and subscriptions depends on how you want to handle that situation. You might want to prevent deleting a user which still has active sources and subs, in which case leave off on delete cascade and handle cleaning up the user's sources and subs manually.
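A manual cleanup could look roughly like the following. It assumes the foreign keys were declared without on delete cascade, that a users table with an id column exists as referenced above, and that 42 is just an example user id.

```sql
-- Without ON DELETE CASCADE, deleting the user first would fail with a
-- foreign key violation, so the dependent rows are removed first.
BEGIN;
DELETE FROM subscriptions WHERE user_id = 42;
DELETE FROM sources       WHERE user_id = 42;
DELETE FROM users         WHERE id = 42;
COMMIT;
```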

Local Auto-Increment Key

What is the best way to implement an auto-increment key that is "local" to some other column (i.e. comment id starts from 1 for each blog post)?
For instance, on GitHub, the issue number is local to the repository: issue #1 means that it's the first issue of your repo, and makes life easier for everyone by not having to use longer and seemingly random IDs.
For instance given:
CREATE TABLE post (
id bigserial PRIMARY KEY
, title varchar(255) NOT NULL
);
CREATE TABLE "comment" (
post_id bigint REFERENCES post NOT NULL
, id bigint NOT NULL
, "comment" text NOT NULL
, PRIMARY KEY (id, post_id)
);
One way to solve the problem is to calculate the max id of all comments for a given post_id:
INSERT INTO post (id, title) VALUES (1, 'first post');
INSERT INTO "comment" (post_id, id, "comment") VALUES (
1,
(SELECT COALESCE(MAX(id) + 1, 1) FROM "comment" WHERE post_id = 1 LIMIT 1),
'1st comment of 1st post'
);
^ This feels like a kludge, and I am also worried about possible serialisability issues.
I wonder what is the best way to implement this (under PostgreSQL)? Thanks!
I would say that the simple method is to forget about it. That is, create the table like this:
create table comments (
comment_id bigserial primary key,
post_id bigint REFERENCES post NOT NULL,
comment text NOT NULL
);
And then calculate the value on the fly:
create view v_comments as
select c.*,
row_number() over (partition by post_id order by comment_id) as post_seqnum
from comments c;
Of course, this is not exactly the same thing. For instance, the post_seqnum does not uniquely identify each row over time -- because a delete might change the ordering.
However, this still has a unique identifier for each row that can be used for such purposes. Plus, there is a single primary key column, which is generally preferable for foreign key references and debugging.
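Reading from the view might then look like this (a small sketch; post id 1 is just an example value):

```sql
-- Comments of post 1 together with their per-post sequence number.
SELECT c.post_seqnum, c.comment_id, c.comment
FROM v_comments c
WHERE c.post_id = 1
ORDER BY c.post_seqnum;
```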

Serial numbers per group of rows for compound key

I am trying to maintain an address history table:
CREATE TABLE address_history (
person_id int,
sequence int,
timestamp datetime default current_timestamp,
address text,
original_address text,
previous_address text,
PRIMARY KEY(person_id, sequence),
FOREIGN KEY(person_id) REFERENCES people.id
);
I'm wondering if there's an easy way to autonumber/constrain sequence in address_history to automatically count up from 1 for each person_id.
In other words, the first row with person_id = 1 would get sequence = 1; the second row with person_id = 1 would get sequence = 2. The first row with person_id = 2, would get sequence = 1 again. Etc.
Also, is there a better / built-in way to maintain a history like this?
Don't. It has been tried many times and it's a pain.
Use a plain serial or IDENTITY column (see: Auto increment table column):
CREATE TABLE address_history (
address_history_id serial PRIMARY KEY
, person_id int NOT NULL REFERENCES people(id)
, created_at timestamp NOT NULL DEFAULT current_timestamp
, previous_address text
);
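The same table with an IDENTITY column instead of serial could be sketched as follows. This assumes PostgreSQL 10 or later, where identity columns are available, and that it is used in place of the definition above rather than alongside it.

```sql
-- Variant of the table above using an identity column (PostgreSQL 10+).
CREATE TABLE address_history (
  address_history_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, person_id          int NOT NULL REFERENCES people(id)
, created_at         timestamp NOT NULL DEFAULT current_timestamp
, previous_address   text
);
```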
Use the window function row_number() to get serial numbers without gaps per person_id. You could persist a VIEW that you can use as drop-in replacement for your table in queries to have those numbers ready:
CREATE VIEW address_history_nr AS
SELECT *, row_number() OVER (PARTITION BY person_id
ORDER BY address_history_id) AS adr_nr
FROM address_history;
See:
Gap-less sequence where multiple transactions with multiple tables are involved
Or you might want to ORDER BY something else. Maybe created_at? Better created_at, address_history_id to break possible ties. Related answer:
Column with alternate serials
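A version of the view ordered that way might look like this (a sketch that redefines the address_history_nr view with the tie-breaking sort order):

```sql
-- Number rows per person by creation time; address_history_id breaks
-- ties between rows that share the same created_at.
CREATE OR REPLACE VIEW address_history_nr AS
SELECT *, row_number() OVER (PARTITION BY person_id
                             ORDER BY created_at, address_history_id) AS adr_nr
FROM   address_history;
```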
Also, the data type you are looking for is timestamp or timestamptz, not datetime in Postgres:
Ignoring time zones altogether in Rails and PostgreSQL
And you only need to store previous_address (or more details), not address, nor original_address. Both would be redundant in a sane data model.

Scalable Database Tagging Schema

EDIT: To people building tagging systems. Don't read this. It is not what you are looking for. I asked this when I wasn't aware that RDBMSs all have their own optimization methods; just use a simple many-to-many scheme.
I have a posting system that has millions of posts. Each post can have an infinite number of tags associated with it.
Users can create tags which have notes, date created, owner, etc. A tag is almost like a post itself, because people can post notes about the tag.
Each tag association has an owner and date, so we can see who added the tag and when.
My question is how can I implement this? It has to be fast searching posts by tag, or tags by post. Also, users can add tags to posts by typing the name into a field; like the Google search bar, it has to fill in the rest of the tag name for you.
I have 3 solutions at the moment, but not sure which is the best, or if there is a better way.
Note that I'm not showing the layout of notes since it will be trivial once I get a proper solution for tags.
Method 1. Linked list
tagId in post points to a linked list in tag_assoc, the application must traverse the list until flink=0
post: id, content, ownerId, date, tagId, notesId
tag_assoc: id, tagId, ownerId, flink
tag: id, name, notesId
Method 2. Denormalization
tags is simply a VARCHAR or TEXT field containing a tab delimited array of tagId:ownerId. It cannot be a fixed size.
post: id, content, ownerId, date, tags, notesId
tag: id, name, notesId
Method 3. Toxi
(from: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html,
also same thing here: Recommended SQL database design for tags or tagging)
post: id, content, ownerId, date, notesId
tag_assoc: ownerId, tagId, postId
tag: id, name, notesId
Method 3 raises the question, how fast will it be to iterate through every single row in tag_assoc?
Methods 1 and 2 should be fast for returning tags by post, but for posts by tag, another lookup table must be made.
The last thing I have to worry about is optimizing searching tags by name, I have not worked that out yet.
I made an ASCII diagram here: http://pastebin.com/f1c4e0e53
Here is how I'd do it:
posts: [postId], content, ownerId, date, noteId, noteType='post'
tag_assoc: [postId, tagName], ownerId, date, noteId, noteType='tagAssoc'
tags: [tagName], ownerId, date, noteId, noteType='tag'
notes: [noteId, noteType], ownerId, date, content
The fields in square brackets are the primary key of the respective table.
Define a constraint on noteType in each table: posts, tag_assoc, and tags. This prevents a given note from applying to both a post and a tag, for example.
Store tag names as a short string, not an integer id. That way you can use the covering index [postId, tagName] in the tag_assoc table.
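A minimal sketch of that part of the layout in PostgreSQL syntax is shown below. The column types and lengths are assumptions, the note columns and the date column are left out for brevity, and the second index (for lookups in the other direction) is an addition that is not part of the outline above.

```sql
-- Types and lengths are assumptions; only the key structure matters here.
CREATE TABLE posts (
  postId  bigserial PRIMARY KEY,
  content text   NOT NULL,
  ownerId bigint NOT NULL
);

CREATE TABLE tags (
  tagName varchar(50) PRIMARY KEY
);

CREATE TABLE tag_assoc (
  postId  bigint      NOT NULL REFERENCES posts (postId),
  tagName varchar(50) NOT NULL REFERENCES tags (tagName),
  ownerId bigint      NOT NULL,
  PRIMARY KEY (postId, tagName)
);

-- Assumed extra index for the opposite direction (posts by tag).
CREATE INDEX tag_assoc_tag_post ON tag_assoc (tagName, postId);

-- Tags of one post: both columns needed are in the primary key index.
SELECT tagName FROM tag_assoc WHERE postId = 1;

-- Posts carrying one tag: both columns needed are in the second index.
SELECT postId FROM tag_assoc WHERE tagName = 'database';
```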
Tag completion is done with an AJAX call. If the user types "datab" for a tag, your web page makes an AJAX call and on the server side, the app queries: SELECT tagName FROM tags WHERE tagName LIKE ?||'%'.
"A tag is almost like a post itself, because people can post notes about the tag." - this phrase makes me think you really just want one table for POST, with a primary key and a foreign key that references the POST table. Now you can have as many tags for each post as your disk space will allow.
I'm assuming there's no need for many to many between POST and tags, because a tag isn't shared across posts, based on this:
"Users can create tags which have notes, date created, owner, etc."
If creation date and owner are shared, those would be two additional foreign key relationships, IMO.
A linked list is almost certainly the wrong approach. It certainly means that your queries will be either complex or sub-optimal - which is ironic since the most likely reason for using a linked list is to keep the data in the correct sorted order. However, I don't see an easy way to avoid iteratively fetching a row, and then using the flink value retrieved to condition the select operation for the next row.
So, use a table-based approach with normal foreign key to primary key references. The one outlined by Bill Karwin looks similar to what I'd outline.
Bill, I think I kind of threw you off: the notes are just in another table, and there is a separate table with notes posted by different people. Posts have notes and tags, but tags also have notes, which is why tags are UNIQUE.
Jonathan is right about linked lists; I won't use them at all. I decided to implement the tags in the simplest normalized way that meets my needs:
DROP TABLE IF EXISTS `tags`;
CREATE TABLE IF NOT EXISTS `tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts`;
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`content` TEXT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_notes`;
CREATE TABLE IF NOT EXISTS `posts_notes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
`note` TEXT NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
DROP TABLE IF EXISTS `posts_tags`;
CREATE TABLE IF NOT EXISTS `posts_tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`tagId` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE,
FOREIGN KEY (`tagId`) REFERENCES tags(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
I'm not sure how fast this will be in the future, but it should be fine for a while as only a couple people use the database.
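With this schema, the two lookups from the question could be written roughly as follows (a sketch against the tables above; the tag name 'database' and post id 1 are example values):

```sql
-- Posts carrying a given tag.
SELECT p.*
FROM posts p
JOIN posts_tags pt ON pt.postId = p.id
JOIN tags t        ON t.id = pt.tagId
WHERE t.name = 'database';

-- Tags attached to a given post.
SELECT t.name
FROM tags t
JOIN posts_tags pt ON pt.tagId = t.id
WHERE pt.postId = 1;
```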