Interesting tree/hierarchical data structure problem - SQL

Colleges have different ways of organizing their departments. Some schools go School -> Term -> Department. Others have steps in between, with the longest being School -> Sub_Campus -> Program -> Term -> Division -> Department.
School, Term, and Department are the only ones that always exist in a school's "tree" of departments. The order of these categories never changes, with the second example I gave you being the longest. Every step down is a 1:N relationship.
Now, I'm not sure how to set up the relationships between the tables. For example, what columns are in Term? Its parent could be a Program, Sub_Campus, or School. Which one it is depends on the school's system. I could conceive of setting up the Term table to have foreign keys for all of those (which all would default to NULL), but I'm not sure this is the canonical way of doing things here.

I suggest you use a general table, called e.g. Entity, which would contain an id field and a self-referencing parent field.
Each relevant table would then contain a field pointing to Entity's id (1:1). In a way, each table would be a child of the Entity table.
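A minimal sketch of that idea (assuming PostgreSQL; the entity/term naming is mine, just for illustration):
-- Generic node table carrying the whole hierarchy.
CREATE TABLE entity
( id        INTEGER PRIMARY KEY
, parent_id INTEGER NOT NULL REFERENCES entity (id) -- roots can point to themselves
);
-- Each concrete table is 1:1 with an entity row.
CREATE TABLE term
( entity_id INTEGER PRIMARY KEY REFERENCES entity (id)
, term_name VARCHAR NOT NULL
);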

Here's one design possibility:
This option takes advantage of your special constraints. Basically, you generalize every hierarchy to the longest form by introducing generic nodes. If a school doesn't have a "sub campus", just assign it a generic sub campus called "Main". For example, School -> Term -> Department can be treated the same as School -> Sub_Campus = Main -> Program = Main -> Term -> Division = Main -> Department: we assign a node called "Main" as the default wherever the school lacks that level. You can then give these generic nodes a boolean flag indicating that they are just placeholders, which allows you to filter them out in the middle layer or in the UX if needed.
This design allows you to take advantage of all the usual relational constraints and simplifies the handling of missing node types in your code.
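A minimal sketch of the placeholder flag (assuming PostgreSQL; the node naming is mine):
CREATE TABLE node
( id             INTEGER PRIMARY KEY
, parent_id      INTEGER REFERENCES node (id)
, node_name      VARCHAR NOT NULL
, is_placeholder BOOLEAN NOT NULL DEFAULT FALSE -- TRUE for generated "Main" nodes
);
-- Hide the placeholders when presenting the tree:
SELECT * FROM node WHERE NOT is_placeholder;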

-- Enforcing a taxonomy by self-referential (recursive) tables.
-- Both the classes and the instances have a recursive structure.
-- The taxonomy is enforced mostly based on constraints on the classes,
-- the instances only need to check that {their_class , parents_class}
-- form a valid pair.
--
DROP SCHEMA IF EXISTS school CASCADE;
CREATE schema school;
CREATE TABLE school.category
( id INTEGER NOT NULL PRIMARY KEY
, category_name VARCHAR
);
INSERT INTO school.category(id, category_name) VALUES
( 1, 'School' )
, ( 2, 'Sub_campus' )
, ( 3, 'Program' )
, ( 4, 'Term' )
, ( 5, 'Division' )
, ( 6, 'Department' )
;
-- This table contains a list of all allowable {child->parent} pairs.
-- As a convention, the "roots" of the trees point to themselves.
-- (this also avoids a NULL FK)
CREATE TABLE school.category_valid_parent
( category_id INTEGER NOT NULL REFERENCES school.category (id)
, parent_category_id INTEGER NOT NULL REFERENCES school.category (id)
);
ALTER TABLE school.category_valid_parent
ADD PRIMARY KEY (category_id, parent_category_id)
;
INSERT INTO school.category_valid_parent(category_id, parent_category_id)
VALUES
( 1,1) -- school -> school
, (2,1) -- subcampus -> school
, (3,1) -- program -> school
, (3,2) -- program -> subcampus
, (4,1) -- term -> school
, (4,2) -- term -> subcampus
, (4,3) -- term -> program
, (5,4) -- division --> term
, (6,4) -- department --> term
, (6,5) -- department --> division
;
CREATE TABLE school.instance
( id INTEGER NOT NULL PRIMARY KEY
, category_id INTEGER NOT NULL REFERENCES school.category (id)
, parent_id INTEGER NOT NULL REFERENCES school.instance (id)
-- NOTE: parent_category_id is logically redundant
-- , but needed to maintain the constraint
-- (without referencing a third table)
, parent_category_id INTEGER NOT NULL REFERENCES school.category (id)
, instance_name VARCHAR
);
-- Forbid illegal combinations of {parent_id, parent_category_id}
ALTER TABLE school.instance ADD CONSTRAINT valid_cat UNIQUE (id,category_id);
ALTER TABLE school.instance
ADD FOREIGN KEY (parent_id, parent_category_id)
REFERENCES school.instance(id, category_id);
-- Forbid illegal combinations of {category_id, parent_category_id}
ALTER TABLE school.instance
ADD FOREIGN KEY (category_id, parent_category_id)
REFERENCES school.category_valid_parent(category_id, parent_category_id);
INSERT INTO school.instance(id, category_id
, parent_id, parent_category_id
, instance_name) VALUES
-- Zulo
(1,1,1,1, 'University of Utrecht' )
, (2,2,1,1, 'Uithof' )
, (3,3,2,2, 'Life sciences' )
, (4,4,3,3, 'Bachelor' )
, (5,5,4,4, 'Biology' )
, (6,6,5,5, 'Evolutionary Biology' )
, (7,6,5,5, 'Botany' )
-- Nulo
, (11,1,11,1, 'Hogeschool Utrecht' )
, (12,4,11,1, 'Journalistiek' )
, (13,6,12,4, 'Begrijpend Lezen' )
, (14,6,12,4, 'Typvaardigheid' )
;
-- try to insert an invalid instance
INSERT INTO school.instance(id, category_id
, parent_id, parent_category_id
, instance_name) VALUES
( 15, 6, 3,3, 'Procreation' );
WITH RECURSIVE re AS (
SELECT i0.parent_id AS pa_id
, i0.parent_category_id AS pa_cat
, i0.id AS my_id
, i0.category_id AS my_cat
FROM school.instance i0
WHERE i0.parent_id = i0.id
UNION
SELECT i1.parent_id AS pa_id
, i1.parent_category_id AS pa_cat
, i1.id AS my_id
, i1.category_id AS my_cat
FROM school.instance i1
, re
WHERE re.my_id = i1.parent_id
)
SELECT re.*
, ca.category_name
, ins.instance_name
FROM re
JOIN school.category ca ON (re.my_cat = ca.id)
JOIN school.instance ins ON (re.my_id = ins.id)
-- WHERE re.my_id = 14
;
The output:
INSERT 0 11
ERROR: insert or update on table "instance" violates foreign key constraint "instance_category_id_fkey1"
DETAIL: Key (category_id, parent_category_id)=(6, 3) is not present in table "category_valid_parent".
 pa_id | pa_cat | my_id | my_cat | category_name | instance_name
-------+--------+-------+--------+---------------+-----------------------
     1 |      1 |     1 |      1 | School        | University of Utrecht
    11 |      1 |    11 |      1 | School        | Hogeschool Utrecht
     1 |      1 |     2 |      2 | Sub_campus    | Uithof
    11 |      1 |    12 |      4 | Term          | Journalistiek
     2 |      2 |     3 |      3 | Program       | Life sciences
    12 |      4 |    13 |      6 | Department    | Begrijpend Lezen
    12 |      4 |    14 |      6 | Department    | Typvaardigheid
     3 |      3 |     4 |      4 | Term          | Bachelor
     4 |      4 |     5 |      5 | Division      | Biology
     5 |      5 |     6 |      6 | Department    | Evolutionary Biology
     5 |      5 |     7 |      6 | Department    | Botany
(11 rows)
BTW: I left out the attributes. I propose they could be hooked to the relevant categories by means of an EAV type of data model.
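For instance, a minimal EAV sketch in the same style (the attribute tables are my own naming, not part of the schema above):
CREATE TABLE school.attribute
( id INTEGER NOT NULL PRIMARY KEY
, category_id INTEGER NOT NULL REFERENCES school.category (id)
, attr_name VARCHAR
);
CREATE TABLE school.instance_attribute
( instance_id INTEGER NOT NULL REFERENCES school.instance (id)
, attribute_id INTEGER NOT NULL REFERENCES school.attribute (id)
, attr_value VARCHAR
, PRIMARY KEY (instance_id, attribute_id)
);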

I'm going to start by discussing how to implement a single hierarchical model (just 1:N relationships) relationally.
Let's use your example School -> Term -> Department.
Here's code that I generated using MySQLWorkbench (I removed a few things to make it clearer):
-- -----------------------------------------------------
-- Table `mydb`.`school`
-- -----------------------------------------------------
-- each of these tables would have more attributes in a real implementation
-- using varchar(50)'s for PKs because I can -- :)
CREATE TABLE IF NOT EXISTS `mydb`.`school` (
`school_name` VARCHAR(50) NOT NULL ,
PRIMARY KEY (`school_name`)
);
-- -----------------------------------------------------
-- Table `mydb`.`term`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`term` (
`term_name` VARCHAR(50) NOT NULL ,
`school_name` VARCHAR(50) NOT NULL ,
PRIMARY KEY (`term_name`, `school_name`) ,
FOREIGN KEY (`school_name` )
REFERENCES `mydb`.`school` (`school_name` )
);
-- -----------------------------------------------------
-- Table `mydb`.`department`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`department` (
`dept_name` VARCHAR(50) NOT NULL ,
`term_name` VARCHAR(50) NOT NULL ,
`school_name` VARCHAR(50) NOT NULL ,
PRIMARY KEY (`dept_name`, `term_name`, `school_name`) ,
FOREIGN KEY (`term_name` , `school_name` )
REFERENCES `mydb`.`term` (`term_name` , `school_name` )
);
(The MySQLWorkbench diagram of this data model is omitted here.)
As you can see, school, at the top of the hierarchy, has only school_name as its key, whereas department has a three-part key including the keys of all of its parents.
Key points of this solution
uses natural keys -- but could be refactored to use surrogate keys, along with UNIQUE constraints on the multi-column foreign keys
every level of nesting adds one column to the key
each table's PK is the entire PK of the table above it, plus an additional column specific to that table
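For instance, because each key carries its full ancestry, level-spanning lookups need no joins. A sketch against the tables above ('someSchool' is a hypothetical value):
-- All departments offered by one school, across all terms:
SELECT `dept_name`, `term_name`
FROM `mydb`.`department`
WHERE `school_name` = 'someSchool';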
Now for the second part of your question.
My interpretation of the question
There is a hierarchical data model. However, some applications require all of the tables, whereas others utilize only some of the tables, skipping the others. We want to be able to implement 1 single data model and use it for both of these cases.
You could use the solution given above, and, as ShitalShah mentioned, add a default value to any table which would not be used. Let's see some example data, using the model given above, where we only want to save School and Department information (no Terms):
+-------------+
| school_name |
+-------------+
| hogwarts |
| uCollege |
| uMatt |
+-------------+
3 rows in set (0.00 sec)
+-----------+-------------+
| term_name | school_name |
+-----------+-------------+
| default | hogwarts |
| default | uCollege |
| default | uMatt |
+-----------+-------------+
3 rows in set (0.00 sec)
+-------------------------------+-----------+-------------+
| dept_name | term_name | school_name |
+-------------------------------+-----------+-------------+
| defense against the dark arts | default | hogwarts |
| potions | default | hogwarts |
| basket-weaving | default | uCollege |
| history of magic | default | uMatt |
| science | default | uMatt |
+-------------------------------+-----------+-------------+
5 rows in set (0.00 sec)
Key points
there is a default value in term for every value in school -- this could be quite annoying if you had a table deep in the hierarchy that an application didn't need
since the table schema doesn't change, the same queries can be used
queries are easy to write and portable
There is another solution to storing trees in databases. Bill Karwin discusses it here, starting around slide 49, but I don't think this is the solution you want. Karwin's solution is for trees of any size, whereas your examples seem to be relatively static. Also, his solutions come with their own set of problems (but doesn't everything?).
I hope that helps with your question.

For the general problem of fitting hierarchical data into a relational database, the common solutions are adjacency lists (parent-child links like your example) and nested sets. As noted in the Wikipedia article, Oracle's Tropashko proposed an alternative nested interval solution, but it's still fairly obscure.
The best choice for your situation depends on how you will be querying the structure and which DB you are using. Cherry-picking from the article:
Queries using nested sets can be expected to be faster than queries
using a stored procedure to traverse an adjacency list, and so are the
faster option for databases which lack native recursive query
constructs, such as MySQL
However:
Nested sets are very slow for inserts because it requires updating lft
and rgt for all records in the table after the insert. This can cause
a lot of database thrash as many rows are rewritten and indexes
rebuilt.
Again, depending on how your structure will be queried, you may choose a NoSQL style denormalized Department table, with nullable foreign keys to all possible parents, avoiding recursive queries altogether.
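A sketch of that denormalized variant (assuming surrogate-keyed parent tables named school, sub_campus, program, term, and division; all names here are mine):
CREATE TABLE department
( id              INTEGER PRIMARY KEY
, school_id       INTEGER NOT NULL REFERENCES school (id)
, sub_campus_id   INTEGER REFERENCES sub_campus (id) -- NULL when the school skips this level
, program_id      INTEGER REFERENCES program (id)    -- NULL when skipped
, term_id         INTEGER NOT NULL REFERENCES term (id)
, division_id     INTEGER REFERENCES division (id)   -- NULL when skipped
, department_name VARCHAR NOT NULL
);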

I would develop this in a very flexible manner, and in what seems to me to be the simplest way as well:
There should only be one table; let's call it category_nodes:
-- the actual content of each category could be stored in another table,
-- creating a 1:N category:content relationship
drop table if exists category_nodes;
create table category_nodes (
category_node_id int(11) not null auto_increment,
parent_id int(11) not null default 2,
name varchar(256),
primary key(category_node_id)
);
-- set the first 2 records:
insert into category_nodes (parent_id, name) values( -1, 'root' ); -- gets id 1
insert into category_nodes (parent_id, name) values( -1, 'uncategorized' ); -- gets id 2
So each record in the table has a unique id, a parent id, and a name.
Now, after the first 2 inserts (auto_increment starts at 1): the record with category_node_id = 1 is the root node, the ancestor of all nodes no matter how many degrees away. The second is just a little helper: an 'uncategorized' node at category_node_id = 2, which is also the default value of parent_id when inserting into the table.
Now, imagining the root categories are School, Term, and Dept, you would:
insert into category_nodes ( parent_id, name ) values ( 1, 'School' );
insert into category_nodes ( parent_id, name ) values ( 1, 'Term' );
insert into category_nodes ( parent_id, name ) values ( 1, 'Dept' );
Then to get all the root categories:
select * from category_nodes where parent_id = 1;
Now imagining a more complex schema:
-- School -> Division -> Department
-- CatX -> CatY
insert into category_nodes ( parent_id, name ) values ( 1, 'School' ); -- imagine this gets pkey = 3
insert into category_nodes ( parent_id, name ) values ( 3, 'Division' ); -- imagine this gets pkey = 4
insert into category_nodes ( parent_id, name ) values ( 4, 'Dept' );
--
insert into category_nodes ( parent_id, name ) values ( 1, 'CatX' ); -- 6
insert into category_nodes ( parent_id, name ) values ( 6, 'CatY' );
Now to get all the subcategories of School, for example:
select * from category_nodes where parent_id = 3;
-- or even
select * from category_nodes where parent_id in ( select category_node_id from category_nodes
where name = 'School'
);
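On MySQL 8+ (which added recursive CTEs) you could also fetch a whole subtree in one statement; a sketch against the table above:
with recursive subtree as (
  select category_node_id, parent_id, name
  from category_nodes
  where name = 'School'
  union all
  select c.category_node_id, c.parent_id, c.name
  from category_nodes c
  join subtree s on c.parent_id = s.category_node_id
)
select * from subtree;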
And so on. Thanks to the parent_id default of 2, inserting into the 'uncategorized' category becomes simple:
<?php
$name = 'New cat name';
// Using a prepared statement (PDO assumed) instead of the removed mysql_query():
$stmt = $pdo->prepare('insert into category_nodes ( name ) values ( ? )');
$stmt->execute([$name]);
Cheers

Related

Auto incremented column scoped to user id

I am really struggling with how to implement a requirement that is best described with an example.
Consider everything below to be pseudocode, although I am interested in solutions for Postgres.
| id | id_for_user | note                   | created_by |
|----|-------------|------------------------|------------|
| 1  | 1           | Buy milk               | 1          |
| 2  | 2           | Winter tyres           | 1          |
| 3  | 3           | Read for 1h            | 1          |
| 4  | 1           | Clean dishes           | 2          |
| 5  | 2           | Learn how magnets work | 2          |
INSERT INTO notes VALUES (note: 'Learn icelandic', created_by: 1);
| id | id_for_user | note                   | created_by |
|----|-------------|------------------------|------------|
| 1  | 1           | Buy milk               | 1          |
| 2  | 2           | Winter tyres           | 1          |
| 3  | 3           | Read for 1h            | 1          |
| 4  | 1           | Clean dishes           | 2          |
| 5  | 2           | Learn how magnets work | 2          |
| 6  | 4           | Learn Icelandic        | 1          |
INSERT INTO notes VALUES (note: 'Are birds real?', created_by: 2);
| id | id_for_user | note                   | created_by |
|----|-------------|------------------------|------------|
| 1  | 1           | Buy milk               | 1          |
| 2  | 2           | Winter tyres           | 1          |
| 3  | 3           | Read for 1h            | 1          |
| 4  | 1           | Clean dishes           | 2          |
| 5  | 2           | Learn how magnets work | 2          |
| 6  | 4           | Learn Icelandic        | 1          |
| 7  | 3           | Are birds real?        | 2          |
I would like to achieve something like this:
CREATE TABLE notes (
id SERIAL,
id_for_user INT DEFAULT nextval(created_by), -- dynamic sequence name so every user gets their own
note VARCHAR,
created_by INT,
PRIMARY KEY(id, id_for_user),
CONSTRAINT fk_notes_created_by
FOREIGN KEY(created_by)
REFERENCES users(created_by)
);
So that user 1 sees (notice how id_for_user is just id on the front end):
| id | note            |
|----|-----------------|
| 1  | Buy milk        |
| 2  | Winter tyres    |
| 3  | Read for 1h     |
| 4  | Learn Icelandic |
And user 2:
| id | note                   |
|----|------------------------|
| 1  | Clean dishes           |
| 2  | Learn how magnets work |
| 3  | Are birds real?        |
Basically I want to have an auto-incremented field for each user.
I am then also probably going to query for records by id_for_user, filling created_by on the backend based on which user made the request.
Is something like this even possible? What are my options? I would really like to have this logic at the db level.
https://www.db-fiddle.com/f/6eBvq4VCQPTmmR3W6fCnEm/2
Try using a sequence; this object controls the auto-numbering of the ID.
Example:
CREATE SEQUENCE sequence_notes1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 100;
CREATE SEQUENCE sequence_notes2
INCREMENT BY 1
MINVALUE 1
MAXVALUE 100;
CREATE TABLE notes (
id SERIAL,
id_for_user INT,
note VARCHAR,
created_by INT,
PRIMARY KEY(id)
);
INSERT INTO notes (id_for_user, note, created_by) VALUES (nextval('sequence_notes1'),'Foo', 1);
INSERT INTO notes (id_for_user, note, created_by) VALUES (nextval('sequence_notes1'),'Moo', 1);
INSERT INTO notes (id_for_user, note, created_by) VALUES (nextval('sequence_notes2'),'Boo', 2);
INSERT INTO notes (id_for_user, note, created_by) VALUES (nextval('sequence_notes2'),'Loo', 2);
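To avoid creating those sequences by hand, a trigger could create one per user automatically. A sketch, assuming a users table with an integer id column:
CREATE OR REPLACE FUNCTION make_user_sequence() RETURNS trigger AS $$
BEGIN
  -- One sequence per user, named after the user's id.
  EXECUTE format('CREATE SEQUENCE IF NOT EXISTS notes_seq_%s', NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_user_sequence AFTER INSERT ON users
FOR EACH ROW EXECUTE PROCEDURE make_user_sequence();

-- An insert for user 1 then pulls from that user's sequence:
INSERT INTO notes (id_for_user, note, created_by)
VALUES (nextval('notes_seq_1'), 'Foo', 1);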
You can have a separate table to store "the next ordinal value for each user". Then a trigger can fill in the value and increment the related counter.
For example:
create table usr (
id int primary key,
next_ordinal int default 1
);
create table note (
id int primary key,
note varchar(100),
created_by int references usr (id),
user_ord int
);
create or replace function add_user_ord() returns trigger as $$
begin
select next_ordinal into new.user_ord from usr where id = new.created_by;
update usr set next_ordinal = next_ordinal + 1 where id = new.created_by;
return new;
end;
$$ language plpgsql;
create trigger trg_note1 before insert on note
for each row execute procedure add_user_ord();
Then, the trigger will add the correct ordinal numbers automatically behind the scenes during INSERTs:
insert into usr (id) values (10), (20);
insert into note (id, note, created_by) values (1, 'Hello', 10);
insert into note (id, note, created_by) values (2, 'Lorem', 20);
insert into note (id, note, created_by) values (3, 'World', 10);
insert into note (id, note, created_by) values (4, 'Ipsum', 20);
Result:
id  note   created_by  user_ord
--  -----  ----------  --------
1   Hello  10          1
2   Lorem  20          1
3   World  10          2
4   Ipsum  20          2
Note: this solution does not address concurrent inserts. If your application needs them, you'll have to add some isolation (pessimistic or optimistic locking), as in the sketch below.
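For example, a pessimistic variant of the same trigger takes a row lock on the user's counter, so concurrent inserts for one user serialize instead of reading the same value (a sketch):
create or replace function add_user_ord() returns trigger as $$
begin
  -- FOR UPDATE holds a row lock on this user's counter until the
  -- inserting transaction commits, so no two inserts read the same value.
  select next_ordinal into new.user_ord
  from usr where id = new.created_by
  for update;
  update usr set next_ordinal = next_ordinal + 1 where id = new.created_by;
  return new;
end;
$$ language plpgsql;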
In MySQL, this is supported in the MyISAM storage engine.
https://dev.mysql.com/doc/refman/8.0/en/example-auto-increment.html says:
For MyISAM tables, you can specify AUTO_INCREMENT on a secondary
column in a multiple-column index. In this case, the generated value
for the AUTO_INCREMENT column is calculated as
MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is
useful when you want to put data into ordered groups.
CREATE TABLE animals (
grp ENUM('fish','mammal','bird') NOT NULL,
id MEDIUMINT NOT NULL AUTO_INCREMENT,
name CHAR(30) NOT NULL,
PRIMARY KEY (grp,id)
) ENGINE=MyISAM;
INSERT INTO animals (grp,name) VALUES
('mammal','dog'),('mammal','cat'),
('bird','penguin'),('fish','lax'),('mammal','whale'),
('bird','ostrich');
SELECT * FROM animals ORDER BY grp,id;
Which returns:
+--------+----+---------+
| grp | id | name |
+--------+----+---------+
| fish | 1 | lax |
| mammal | 1 | dog |
| mammal | 2 | cat |
| mammal | 3 | whale |
| bird | 1 | penguin |
| bird | 2 | ostrich |
+--------+----+---------+
The reason this works in MyISAM is that MyISAM only supports table-level locking.
In a storage engine with row-level locking, you get race conditions if you try to have a primary key that works like this. This is why others on this thread have commented that implementing this with triggers requires some pessimistic locking. You have to use locking to ensure that only one client at a time is inserting, so they don't allocate the same value.
This will be limiting in a high-traffic application. InnoDB's auto-increment is implemented the way it is to allow applications in which many client threads are executing inserts concurrently.
So you could use MyISAM, or you could use InnoDB and invent your own way of allocating new ids per user; either way it will severely limit your app's scalability.
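If you stay on InnoDB, the usual DIY allocation looks like the sketch below (assuming an animals table without the AUTO_INCREMENT column): compute MAX+1 under an exclusive lock inside the inserting transaction. It works, but it serializes inserts per group, which is exactly the scalability cost described above.
START TRANSACTION;
-- Lock the group's rows so no concurrent insert can claim the same id.
SELECT COALESCE(MAX(id), 0) + 1 INTO @next
FROM animals WHERE grp = 'mammal' FOR UPDATE;
INSERT INTO animals (grp, id, name) VALUES ('mammal', @next, 'fox');
COMMIT;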

Normalize a table where there is a need to reference subsets of a column in another table and those subsets must be unique

How do I normalise this relation (i.e. make it conform to 1NF, 2NF, and 3NF)?
CREATE TABLE IF NOT EXISTS series (
series_id SERIAL PRIMARY KEY,
dimension_ids INT[] UNIQUE,
dataset_id INT REFERENCES dataset(dataset_id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS dimension (
dimension_id SERIAL PRIMARY KEY,
dim VARCHAR(50),
val VARCHAR(50),
dataset_id INT REFERENCES dataset(dataset_id) ON DELETE CASCADE,
UNIQUE (dim, val, dataset_id)
);
Where subsets of dimension_id's uniquely identify records in the series table.
EDIT
To provide more information, the data I want to store comes from XML structures looking something like the following
<?xml version="1.0" encoding="utf-8"?>
<message:StructureSpecificData >
<message:Header>
<message:ID>IREF757740</message:ID>
<message:Test>false</message:Test>
<message:Prepared>2020-04-09T14:55:23</message:Prepared>
</message:Header>
<message:DataSet ss:dataScope="DataStructure" ss:structureRef="CPI" xsi:type="ns1:DataSetType">
<Series FREQ="M" GEOG_AREA="WC" UNIT="IDX">
<Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
</Series>
<Series FREQ="M" GEOG_AREA="NC" UNIT="IDX">
<Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
<Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
</Series>
</message:DataSet>
</message:StructureSpecificData>
There is a dataset that contains series (0...n), which in turn contain observations (0...n). The series are uniquely identified by their XML attributes - what I call dimensions in my data model. In my example I have two series, differentiated by the geographical areas they cover. Any series can have an arbitrary number of dimensions. A series is expected to be queried by its dimensions, and the dimensions will also be queried using the series_id. The obvious solution is a bridging table:
CREATE TABLE series_dimension (
series_id INT REFERENCES series(series_id) ON DELETE CASCADE,
dimension_id INT REFERENCES dimension(dimension_id)
);
This solution permits, however, the following scenario:
|--------------------------|
| series_dimension |
|-----------|--------------|
| series_id | dimension_id |
|-----------|--------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
|-----------|--------------|
That is, two different series with the same dimensions, so that if I query a series for a given set of dimensions, I can't decide in the case of dimensions [1 2 3 4] whether I am looking for series_id=1 or series_id=2, which is unacceptable. Is it therefore the case that, in such a situation, I must choose between referential integrity and the uniqueness property I have just explained?
Given your expectation of around 20 dimensions, the example below is limited to 60 (distinct powers of two fitting in a bigint). It does require a controlled process to define each set of dimensions (series).
Reasoning
-- DIM is a valid numeric identifier for a dimension.
--
valid_dim {DIM}
PK {DIM}
CHECK ((DIM = 1) OR ((DIM > 1) AND (mod(DIM,2) = 0)))
-- data sample
(DIM)
---------
(2^0)
, (2^1)
, (2^2)
, ...
, (2^58)
, (2^59)
-- Dimension DIM, named DIM_NAME exists.
--
dimension {DIM, DIM_NAME}
PK {DIM}
AK {DIM_NAME}
FK {DIM} REFERENCES valid_dim {DIM}
-- data sample
(DIM, DIM_NAME)
---------------
(2^0, 'FREQ')
, (2^1, 'GEOG_AREA')
, (2^2, 'UNIT')
, ...
, (2^58, 'AGE_GROUP')
, (2^59, 'HAIR_COLOR')
Loading series and ser_dim can be done from a function, application, or whatever. However, this should be a controlled process.
SER is unique for a given set of dimensions.
Note that | is the bitwise OR operator.
-- Series SER, named SER_NAME exists.
--
series {SER, SER_NAME}
PK {SER}
AK {SER_NAME}
-- data sample
(SER, SER_NAME)
--------------------------------
((2^0 | 2^1 | 2^2) , 'F-G-U')
, ((2^1 | 2^58) , 'G-A' )
, ((2^0 | 2^58 | 2^59), 'F-A-H')
-- Series SER has dimension DIM.
--
ser_dim {SER, DIM}
PK {SER, DIM}
FK1 {SER} REFERENCES series {SER}
FK2 {DIM} REFERENCES dimension {DIM}
CHECK ((DIM & SER) = DIM)
-- data sample
(SER, DIM)
--------------------------------
((2^0 | 2^1 | 2^2) , 2^0)
, ((2^0 | 2^1 | 2^2) , 2^1)
, ((2^0 | 2^1 | 2^2) , 2^2)
, ((2^1 | 2^58) , 2^1 )
, ((2^1 | 2^58) , 2^58)
, ((2^0 | 2^58 | 2^59), 2^0)
, ((2^0 | 2^58 | 2^59), 2^58)
, ((2^0 | 2^58 | 2^59), 2^59)
Note:
All attributes (columns) NOT NULL
PK = Primary Key
AK = Alternate Key (Unique)
FK = Foreign Key
PostgreSQL
-- DIM is a valid numeric identifier
-- for a dimension.
--
CREATE TABLE valid_dim (
DIM bigint NOT NULL
, CONSTRAINT pk_valid_dim PRIMARY KEY (DIM)
, CONSTRAINT chk_valid_dim
CHECK ( (DIM = 1)
OR ( (DIM > 1)
AND (mod(DIM, 2) = 0) )
)
);
-- define some of valid DIMs
INSERT INTO valid_dim (DIM)
VALUES
((2^ 0)::bigint)
, ((2^ 1)::bigint)
, ((2^ 2)::bigint)
-- fill this gap
, ((2^58)::bigint)
, ((2^59)::bigint) ;
-- Dimension DIM, named DIM_NAME exists.
--
CREATE TABLE dimension (
DIM bigint NOT NULL
, DIM_NAME text NOT NULL
, CONSTRAINT pk_dim PRIMARY KEY (DIM)
, CONSTRAINT ak_dim UNIQUE (DIM_NAME)
, CONSTRAINT
fk_dim FOREIGN KEY (DIM)
REFERENCES valid_dim (DIM)
);
-- define few dimensions
INSERT INTO dimension (DIM, DIM_NAME)
VALUES
((2^ 0)::bigint, 'FREQ')
, ((2^ 1)::bigint, 'GEOG_AREA')
, ((2^ 2)::bigint, 'UNIT')
, ((2^58)::bigint, 'AGE_GROUP')
, ((2^59)::bigint, 'HAIR_COLOR') ;
-- Series SER, named SER_NAME exists.
--
CREATE TABLE series (
SER bigint NOT NULL
, SER_NAME text NOT NULL
, CONSTRAINT pk_series PRIMARY KEY (SER)
, CONSTRAINT ak_series UNIQUE (SER_NAME)
);
-- define three series
INSERT INTO series (SER, SER_NAME)
SELECT bit_or(DIM) as SER, 'F-G-U' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')
UNION
SELECT bit_or(DIM) as SER, 'G-A' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('GEOG_AREA', 'AGE_GROUP')
UNION
SELECT bit_or(DIM) as SER, 'F-A-H' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'AGE_GROUP', 'HAIR_COLOR') ;
-- Series SER has dimension DIM.
--
CREATE TABLE ser_dim (
SER bigint NOT NULL
, DIM bigint NOT NULL
, CONSTRAINT pk_ser_dim PRIMARY KEY (SER, DIM)
, CONSTRAINT
fk1_ser_dim FOREIGN KEY (SER)
REFERENCES series (SER)
, CONSTRAINT
fk2_ser_dim FOREIGN KEY (DIM)
REFERENCES dimension (DIM)
, CONSTRAINT
chk_ser_dim CHECK ((DIM & SER) = DIM)
);
-- populate ser_dim
INSERT INTO ser_dim (SER, DIM)
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;
Another option would be to use a (materialized) view for ser_dim. That depends on the rest of the model: if a FK is needed to {SER, DIM} keep the table, otherwise a view would be better.
-- An option, instead of the table.
--
CREATE VIEW ser_dim
AS
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;
Test
-- Show already defined series
-- and their dimensions.
SELECT SER_NAME, DIM_NAME
FROM ser_dim
JOIN series USING (SER)
JOIN dimension USING (DIM)
ORDER BY SER_NAME, DIM_NAME ;
-- Get SER for a set of dimensions;
-- use this when defining a series.
SELECT bit_or(DIM) AS SER
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT') ;
-- Find already defined series,
-- given a set of dimensions.
SELECT x.SER
FROM (
SELECT bit_or(DIM) AS SER
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')
) AS x
WHERE EXISTS
(SELECT 1 FROM series AS s WHERE s.SER = x.SER) ;
Summary
Unfortunately, SQL implementations do not support assertions (database-wide constraints). The SQL standard actually defines them, but no luck yet. Hence, not every business constraint can be expressed elegantly in SQL; some creativity and compromise is usually required.
My conclusion is that this relationship (where a column refers to attributes whose number is not known in advance) requires that normalisation lead to a many-to-many or one-to-many relationship being created, and this precludes a unique mapping.
Conversely, for a relationship where a column refers to attributes whose number is not known in advance, the way to make the relationship one-to-one/unique is to group those attributes into unique subsets, which violates 1NF.
There is only one way to specify UNIQUE constraints, and that is on column(s).
My example requires that each series_id reference a variable number of columns/dimensions.
I therefore stack my columns in rows, with the result that the UNIQUE construct is not available.
The solution has each series_id relate to an array that specifies the row subsets; I can now specify that this column of arrays is UNIQUE.
This violates 1NF; therefore this relationship cannot be normalised.
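For completeness: in PostgreSQL you can keep the array column and still get set-level uniqueness by indexing a canonical form of the array. The intarray extension's sort() makes [4,3,2,1] and [1,2,3,4] compare equal; a sketch, at the cost of 1NF as concluded above:
CREATE EXTENSION IF NOT EXISTS intarray;
-- Unique on the sorted array, so two series can never
-- share the same set of dimension_ids:
CREATE UNIQUE INDEX series_dims_uniq ON series ((sort(dimension_ids)));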

Insert multiple values with foreign key Postgresql

I am having trouble figuring out how to insert multiple values into a table after checking that another table has the needed values stored. I am currently doing this on a PostgreSQL server, but will be implementing it with PreparedStatements in my Java program.
user_id is a foreign key which references the primary key in mock2. I have been trying to check whether mock2 has the values ('foo1', 'bar1') and ('foo2', 'bar2').
After this I am trying to insert new values into mock1, which would have a date and an integer value and reference the primary key of the corresponding row in mock2 via the foreign key in mock1.
mock1 table looks like this:
===============================
| date | time | user_id |
| date | integer | integer |
| | | |
And the table mock2 is:
==================================
| Id | name | program |
| integer | text | text |
Id is a primary key for the table and the name is UNIQUE.
I've been playing around with this solution https://dba.stackexchange.com/questions/46410/how-do-i-insert-a-row-which-contains-a-foreign-key
However, I haven't been able to make it work. Could someone please point out what the correct syntax is for this, I would be really appreciative.
EDIT:
The create table statements are:
CREATE TABLE mock2(
id SERIAL PRIMARY KEY UNIQUE,
name text NOT NULL,
program text NOT NULL UNIQUE
);
and
CREATE TABLE mock1(
date date,
time_spent INTEGER,
user_id integer REFERENCES mock2(Id) NOT NULL);
Ok so I found an answer to my own question.
WITH ins (date, time_spent, id) AS
( VALUES
( DATE '2012-08-22', 170, (SELECT id FROM mock2 WHERE program = 'bar'))
)
INSERT INTO mock1
(date, time_spent, user_id)
SELECT
ins.date, ins.time_spent, mock2.id
FROM
mock2 JOIN ins
ON ins.id = mock2.id ;
I was trying to take the 2 values from the first table, match them, and then insert 2 new values into the next table, but I realised that I should be using the primary and foreign keys to my advantage.
I instead now JOIN on the id and then just select the key I need by searching for it with (SELECT id FROM mock2 WHERE program = 'bar') in the third row.
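For what it's worth, the intermediate CTE isn't strictly necessary; a plain INSERT ... SELECT keyed on the UNIQUE program column does the same thing. A sketch:
INSERT INTO mock1 (date, time_spent, user_id)
SELECT DATE '2012-08-22', 170, id
FROM mock2
WHERE program = 'bar';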

Foreign key from a NON audit trail table to an audit trail table

In an existing schema I am introducing audit trail/milestone support for only one of the tables. To discuss the question, we'll use the following simplified toy example for better illustration.
DEPARTMENT
ID | NAME | CREATED_TS | MODIFIED_TS | VERSION
1 | "ABC" | 11/20/2015 | 12/1/9999 | 1
Now, every time an update request for the department record with id=1 comes in, the system does the following. Basically each update is an INSERT: it changes the existing record's MODIFIED_TS to the time the UPDATE request came in, while the latest record's MODIFIED_TS always remains 12/1/9999 (a.k.a. the INFINITY timestamp).
Say the record with ID=1 is updated with a name change; the following is what it looks like in the db.
DEPARTMENT
ID | NAME | CREATED_TS | MODIFIED_TS | VERSION
1 | "ABC" | 11/20/2015 | 11/22/2015 | 1
1 | "XYZ" | 11/22/2015 | 12/1/9999 | 2
Now assume there is an existing EMPLOYEE table with DEPT_ID as a foreign key. Note that the EMPLOYEE table doesn't have an audit trail requirement, so there is no INFINITY timestamp or VERSIONing concept there. The EMPLOYEE table looks like this:
EMPLOYEE
ID | NAME | AGE | DEPT_ID
1 | "John" | 31 | 1
Everything was fine in terms of EMPLOYEE having a FK relationship with DEPT while DEPT had ID as its PK. Now the DEPARTMENT table's PK has changed to a composite PK (ID, VERSION).
After these schema changes to the DEPARTMENT table to create an audit trail for its data, the FK in EMPLOYEE will somehow have to include not only DEPT_ID but also the INFINITY timestamp (MODIFIED_TS) of the DEPT record, because EMPLOYEE must always refer to the latest DEPT record (a requirement).
What's the best way to change EMPLOYEE table's FK for pointing to the most current record in DEPT table?
This is a terrible data-modelling hack to solve your problem; don't try this at home...
Since the PK for department is {id, modified_ts}, the FK from employee must also be composite (or it could point to another unique set of columns).
So the solution is to give it what it wants: a column with a constant value of 'infinity'. ('infinity' is a valid value for dates and timestamps in postgres, you don't need to invent your own sentinel values)
CREATE TABLE department
( id INTEGER NOT NULL
, name varchar
, created_ts timestamp
, modified_ts timestamp
, version integer not null default 0
, PRIMARY KEY (id, modified_ts)
);
INSERT INTO department (id, name, created_ts, modified_ts, version) VALUES
(1 , 'ABC', '2015-11-20' , '2015-11-22' , 1) ,
(1 , 'XYZ', '2015-11-22' , 'infinity' , 2) ;
-- CREATE UNIQUE INDEX ON department (id) WHERE modified_ts = 'infinity'::timestamp;
-- The employee table has no audit trail requirement (see the question above):
CREATE TABLE employee
( id INTEGER NOT NULL PRIMARY KEY
, name varchar
, age integer not null default 0
, modified_ts timestamp NOT NULL DEFAULT 'infinity'::timestamp CHECK (modified_ts = 'infinity'::timestamp)
, dept_id INTEGER NOT NULL
, FOREIGN KEY (dept_id, modified_ts)
REFERENCES department(id, modified_ts)
);
INSERT INTO employee(id ,name,age,dept_id) VALUES
(1 , 'John' , 31 , 1);
a FK really SHOULD NOT point to a moving target
for the above to work, the FK constraint will need DEFERRABLE INITIALLY DEFERRED
but in practice you should use stable keys (and move the trail to a separate table)
and in real life, both employee and department should allow history, and so should their (M:N) junction table.
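A sketch of that last arrangement (stable key, history moved to a side table; the naming is mine):
CREATE TABLE department
( id   INTEGER NOT NULL PRIMARY KEY -- stable key, an ordinary FK target
, name varchar
);
CREATE TABLE department_history
( id         INTEGER NOT NULL REFERENCES department (id)
, name       varchar
, valid_from timestamp NOT NULL
, valid_til  timestamp NOT NULL DEFAULT 'infinity'
, PRIMARY KEY (id, valid_from)
);
CREATE TABLE employee
( id      INTEGER NOT NULL PRIMARY KEY
, name    varchar
, age     integer not null default 0
, dept_id INTEGER NOT NULL REFERENCES department (id) -- no composite FK needed
);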

Redshift psql auto increment on even number

I am trying to create a table with an auto-increment column as below. Since Redshift psql doesn't support SERIAL, I had to use IDENTITY data type:
IDENTITY(seed, step)
Clause that specifies that the column is an IDENTITY column. An IDENTITY column contains unique auto-generated values. These values start with the value specified as seed and increment by the number specified as step. The data type for an IDENTITY column must be either INT or BIGINT.
My create table statement looks like this:
CREATE TABLE my_table(
id INT IDENTITY(1,1),
name CHARACTER VARYING(255) NOT NULL,
PRIMARY KEY( id )
);
However, when I tried to insert data into my_table, the generated ids incremented only by even numbers, like below:
id | name |
----+------+
2 | anna |
4 | tom |
6 | adam |
8 | bob |
10 | rob |
My insert statements look like below:
INSERT INTO my_table ( name )
VALUES ( 'anna' ), ('tom') , ('adam') , ('bob') , ('rob' );
I am also having trouble with bringing the id column back to start with 1. There are solutions for SERIAL data type, but I haven't seen any documentation for IDENTITY.
Any suggestions would be much appreciated!
You have to set your identity as follows:
id INT IDENTITY(0,1)
Source: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_examples.html
And you can't reset the id to 0. You will have to drop the table and create it back again.
Set your seed value to 1 and your step value to 1.
Create table
CREATE table my_table(
id bigint identity(1, 1),
name varchar(100),
primary key(id));
Insert rows
INSERT INTO my_table ( name )
VALUES ('anna'), ('tom') , ('adam'), ('bob'), ('rob');
Results
id | name |
----+------+
1 | anna |
2 | tom |
3 | adam |
4 | bob |
5 | rob |
For some reason, if you set your seed value to 0 and your step value to 1 then the integer will increase in steps of 2.
Create table
CREATE table my_table(
id bigint identity(0, 1),
name varchar(100),
primary key(id));
Insert rows
INSERT INTO my_table ( name )
VALUES ('anna'), ('tom') , ('adam'), ('bob'), ('rob');
Results
id | name |
----+------+
0 | anna |
2 | tom |
4 | adam |
6 | bob |
8 | rob |
This issue is discussed at length in the AWS forum:
https://forums.aws.amazon.com/message.jspa?messageID=623201
The answer from AWS:
Short answer to your question is seed and step are only honored if you
disable both parallelism and the COMPUPDATE option in your COPY.
Parallelism is disabled if and only if you're loading your data from a
single file, which is what we normally do not recommend, and hence
will be an unlikely scenario for most users.
Parallelism impacts things because in order to ensure that there is no
single point of contention in assigning identity values to rows, there
end up being gaps in the value assignment. When parallelism is
disabled, the load is happening serially, and therefore, there is no
issue with assigning different id values in parallel.
The reason COMPUPDATE impacts things is when it's enabled, the COPY is
actually making 2 passes over your data. During the first pass, it
internally increments the identity values, and as a result, your
initial value starts with a larger value than you'd expect.
We'll update the doc to reflect this.
Also, multiple nodes seem to cause this effect with IDENTITY columns. In essence, IDENTITY can only provide you with guaranteed unique IDs, not consecutive ones.
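If you do need consecutive numbers despite those gaps, one workaround is to compute them at read time rather than storing them. A sketch:
SELECT ROW_NUMBER() OVER (ORDER BY id) AS seq, name
FROM my_table
ORDER BY seq;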