How to efficiently insert tree-like data structure into postgres - sql

Essentially, I want to efficiently store a tree-like data structure in a table with Postgres. Each row has an ID (auto-generated upon insert), a parent ID (referencing another row in the same table, possibly null), and some additional metadata. All of that data comes in at once, so I'm trying to store it all at once as efficiently as possible.
My current thought is to group all the data by which level of the tree they're at, and batch insert one level at a time. That way I can set parent IDs using the IDs generated from the previous level's inserts. This way the amount of batches is correlated with the number of levels in the tree.
This is probably "good enough", but I'm wondering if there's a better way to do this kind of thing? It still seems like a lot of back and forth and unnecessary logic to me, when I have the whole tree of data already in memory and structured correctly.

Let me show how I would do it if I had some information on who is whose child record.
In my case, I use a staging table containing the info as it comes from the source. The records have a char based primary key id, and a self-referencing,nullable, foreign key boss_id .
Here goes:
-- the input table with "business identifiers".
DROP TABLE IF EXISTS rec_input;
CREATE TABLE rec_input (
id CHAR(4)
, first_name VARCHAR(32)
, last_name VARCHAR(32)
, boss_id CHAR(4)
)
;
-- some data for it ...
INSERT INTO rec_input(id,first_name,last_name,boss_id)
SELECT 'A01','Arthur','Dent' ,NULL
UNION ALL SELECT 'A02','Ford','Prefect' ,'A01'
UNION ALL SELECT 'A03','Zaphod','Beeblebrox' ,'A01'
UNION ALL SELECT 'A04','Tricia','McMillan' ,'A01'
UNION ALL SELECT 'A05','Gag','Halfrunt' ,'A02'
UNION ALL SELECT 'A06','Prostetnic Vogon','Jeltz','A02'
UNION ALL SELECT 'A07','Lionel','Prosser' ,'A04'
UNION ALL SELECT 'A08','Benji','Mouse' ,'A04'
UNION ALL SELECT 'A09','Frankie','Mouse' ,'A04'
UNION ALL SELECT 'A10','Svlad','Cjelli' ,'A03'
;
-- create a lookup table. The surrogate key is created here.
DROP TABLE IF EXISTS lookup_help;
CREATE TABLE lookup_help (
sk SERIAL NOT NULL -- < here is the surrogate auto increment key
, id CHAR(3)
);
-- fill the lookup table
INSERT INTO lookup_help(id)
SELECT id FROM rec_input;
-- test query
SELECT * FROM lookup_help;
-- this is the target table, with auto increment
-- and matching surrogate foreign key.
DROP TABLE IF EXISTS rec;
CREATE TABLE rec (
sk INTEGER NOT NULL -- surrogate key
, id CHAR(4). -- "business id"
, first_name VARCHAR(32)
, last_name VARCHAR(32)
, boss_id CHAR(4). -- "business foreign key", not needed really
, boss_sk INTEGER. -- internal foreign key
)
;
INSERT INTO rec
SELECT
l.sk -- from lookup table, inner joined
, i.id -- from input table
, i.first_name
, i.last_name
, i.boss_id
, b.sk -- from lookup table, left outer joined
FROM rec_input i
JOIN lookup_help l USING(id) -- for the main sk
LEFT JOIN lookup_help b ON i.boss_id=b.id -- for the foreign sk
;
-- test query
SELECT * FROM rec;
-- out sk | id | first_name | last_name | boss_id | boss_sk
-- out ----+------+------------------+------------+---------+---------
-- out 2 | A02 | Ford | Prefect | A01 | 1
-- out 3 | A03 | Zaphod | Beeblebrox | A01 | 1
-- out 4 | A04 | Tricia | McMillan | A01 | 1
-- out 6 | A06 | Prostetnic Vogon | Jeltz | A02 | 2
-- out 5 | A05 | Gag | Halfrunt | A02 | 2
-- out 10 | A10 | Svlad | Cjelli | A03 | 3
-- out 7 | A07 | Lionel | Prosser | A04 | 4
-- out 8 | A08 | Benji | Mouse | A04 | 4
-- out 9 | A09 | Frankie | Mouse | A04 | 4
-- out 1 | A01 | Arthur | Dent | |
-- out (10 rows)

Perhaps with your use case, you could try NoSql at the moment, querying such data would be far efficient and faster. Maybe give it a shot.
For development you've options like Apache CouchDB, redis, etc.

Related

Result of query as column value

I've got three tables:
Lessons:
CREATE TABLE lessons (
id SERIAL PRIMARY KEY,
title text NOT NULL,
description text NOT NULL,
vocab_count integer NOT NULL
);
+----+------------+------------------+-------------+
| id | title | description | vocab_count |
+----+------------+------------------+-------------+
| 1 | lesson_one | this is a lesson | 3 |
| 2 | lesson_two | another lesson | 2 |
+----+------------+------------------+-------------+
Lesson_vocabulary:
CREATE TABLE lesson_vocabulary (
lesson_id integer REFERENCES lessons(id),
vocabulary_id integer REFERENCES vocabulary(id)
);
+-----------+---------------+
| lesson_id | vocabulary_id |
+-----------+---------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 2 |
| 2 | 4 |
+-----------+---------------+
Vocabulary:
CREATE TABLE vocabulary (
id integer PRIMARY KEY,
hiragana text NOT NULL,
reading text NOT NULL,
meaning text[] NOT NULL
);
Each lesson contains multiple vocabulary, and each vocabulary can be included in multiple lessons.
How can I get the vocab_count column of the lessons table to be calculated and updated whenevr I add more rows to the lesson_vocabulary table. Is this possible, and how would I go about doing this?
Thanks
You can use SQL triggers to serve your purpose. This would be similar to mysql after insert trigger which updates another table's column.
The trigger would look somewhat like this. I am using Oracle SQL, but there would just be minor tweaks for any other implementation.
CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
FOR EACH ROW
begin
for lesson_cur in (select LESSON_ID, COUNT(VOCABULARY_ID) voc_cnt from LESSON_VOCABULARY group by LESSON_ID) LOOP
update LESSONS
set VOCAB_COUNT = LESSON_CUR.VOC_CNT
where id = LESSON_CUR.LESSON_ID;
end loop;
END;
It's better to create a view that calculates that (and get rid of the column in the lessons table):
select l.*, lv.vocab_count
from lessons l
left join (
select lesson_id, count(*)
from lesson_vocabulary
group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id
If you really want to update the lessons table each time the lesson_vocabulary changes, you can run an UPDATE statement like this in a trigger:
update lessons l
set vocab_count = t.cnt
from (
select lesson_id, count(*) as cnt
from lesson_vocabulary
group by lesson_id
) t
where t.lesson_id = l.id;
I would recommend using a query for this information:
select l.*,
(select count(*)
from lesson_vocabulary lv
where lv.lesson_id = l.lesson_id
) as vocabulary_cnt
from lessons l;
With an index on lesson_vocabulary(lesson_id), this should be quite fast.
I recommend this over an update, because the data remains correct.
I recommend this over a trigger, because it is simpler.
I recommend this over a subquery with aggregation because it should be faster, particularly if you are filtering on the lessons table.

Prolog to SQL: Any way to improve SQL code for unit tests and fix an edge case elegantly?

Inspired by this StackOverflow question:
Find mutual element in different facts in swi-prolog
We have the following
Problem statement
Given a database of "actors starring in movies"
(starsin is the relation linking actor "bob" to movie "a" for example)
starsin(a,bob).
starsin(c,bob).
starsin(a,maria).
starsin(b,maria).
starsin(c,maria).
starsin(a,george).
starsin(b,george).
starsin(c,george).
starsin(d,george).
And given set of movies M, find those actors that starred in all the movies of M.
The question was initially for Prolog.
Prolog solution
In Prolog, an elegant solution involves the predicate
setof/3,
which collects possible variable instantiations into a set (which is really list without
duplicate values):
actors_appearing_in_movies(MovIn,ActOut) :-
setof(
Ax,
MovAx^(setof(Mx,starsin(Mx,Ax),MovAx), subset(MovIn,MovAx)),
ActOut
).
I won't go into details about this, but let's look at the test code, which is of interest here.
Here are five test cases:
actors_appearing_in_movies([],ActOut),permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a],ActOut),permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a,b],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c,d],ActOut),permutation([george],ActOut),!.
A test is a call to the predicate actors_appearing_in_movies/2, which is given
the input list of movies (e.g. [a,b]) and which captures the resulting list of
actors in ActOut.
Subsequently, we just need to test whether ActOut is a permutation of the expected
set of actors, hence for example:
permutation([george, maria],ActOut)`
"Is ActOut a list that is a permutation of the list [george,maria]?.
If that call succeeds (think, doesn't return with false), the test passes.
The terminal ! is the cut operator and is used to tell the Prolog engine to not
reattempt to find more solutions, because we are good at that point.
Note that for the empty set of movies, we get all the actors. This is arguably correct:
every actors stars in all the movies of the empty set (Vacuous Truth).
Now in SQL.
This problem is squarely in the domain of relational algebra, and there is SQL, so let's have
a go at this. Here, i'm using MySQL.
First, set up the facts.
DROP TABLE IF EXISTS starsin;
CREATE TABLE starsin (movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL);
INSERT INTO starsin VALUES
( "a" , "bob" ),
( "c" , "bob" ),
( "a" , "maria" ),
( "b" , "maria" ),
( "c" , "maria" ),
( "a" , "george" ),
( "b" , "george" ),
( "c" , "george" ),
( "d", "george" );
Regarding the set of movies given as input, giving them in the form of a
(temporary) table sounds natural. In MySQL, "temporary tables" are local to the session. Good.
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");
Approach:
The results can now be obtained by getting, for each actor, the intersection of the set of
movies denoted by movies_in and the set of movies in which an actor ever appeared
(created for each actor via the inner join), then counting (for each actor) whether the
resulting set has at least as many entries as the set movies_in.
Wrap the query into a procedure for practical reasons.
A delimiter is useful here:
DELIMITER $$
DROP PROCEDURE IF EXISTS actors_appearing_in_movies;
CREATE PROCEDURE actors_appearing_in_movies()
BEGIN
SELECT
d.actor
FROM
starsin d, movies_in q
WHERE
d.movie = q.movie
GROUP BY
actor
HAVING
COUNT(*) >= (SELECT COUNT(*) FROM movies_in);
END$$
DELIMITER ;
Run it!
Problem A appears:
Is there a better way than edit + copy-paste table creation code,
issue a CALL and check the results "by hand"?
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
CALL actors_appearing_in_movies();
Empty set!
Problem B appears:
The above is not desired, I want "all actors", same as for the Prolog solution.
As I do not want to tack a weird edge case exception onto the code, my approach must
be wrong. Is there one which naturally covers this case but doesn't become too complex?
T-SQL and PostgreSQL one-liners are fine too!
The other test cases yield expected data:
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
| maria |
+--------+
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
| maria |
+--------+
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c"), ("d");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
+--------+
And given set of movies M, find those actors that starred in all the movies of M.
I would use:
select si.actor
from starsin si
where si.movie in (<M>)
group by si.actor
having count(*) = <n>;
If you have to deal with an empty set, then you need a left join:
select a.actor
from actors a left join
starsin si
on a.actor = si.actor and si.movie in (<M>)
group by a.actor
having count(si.movie) = <n>;
<n> here is the number of movies in <M>.
Update: The second approach in extended form
create or replace temporary table
actor (actor char(20) primary key)
as select distinct actor from starsin;
select
a.actor,
si.actor,si.movie -- left in for docu
from
actor a left join starsin si
on a.actor = si.actor
and si.movie in (select * from movies_in)
group
by a.actor
having
count(si.movie) = (select count(*) from movies_in);
Then for empty movies_in:
+--------+-------+-------+
| actor | actor | movie |
+--------+-------+-------+
| bob | NULL | NULL |
| george | NULL | NULL |
| maria | NULL | NULL |
+--------+-------+-------+
and for this movies_in for example:
+-------+
| movie |
+-------+
| a |
| b |
+-------+
movie here is the top of the group:
+--------+--------+-------+
| actor | actor | movie |
+--------+--------+-------+
| george | george | a |
| maria | maria | a |
+--------+--------+-------+
The following solution involves counting and an UPDATE
Writeup here: A Simple Relational Database Operation
We are using MariaDB/MySQL SQL.
T-SQL or PL/SQL are more complete.
Manual page for CREATE TABLE
Manual page for CREATE PROCEDURE
Manual page for data types in MariaDB
Note that SQL has no vector data types that can be passed to procedures. Gotta work without that.
Enter facts as table:
CREATE OR REPLACE TABLE starsin
(movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL,
PRIMARY KEY (movie, actor));
INSERT INTO starsin VALUES
( "a" , "bob" ),
( "c" , "bob" ),
( "a" , "maria" ),
( "b" , "maria" ),
( "c" , "maria" ),
( "a" , "george" ),
( "b" , "george" ),
( "c" , "george" ),
( "d", "george" );
Enter a procedure to compute solution and actually ... print it out.
DELIMITER $$
CREATE OR REPLACE PROCEDURE actors_appearing_in_movies()
BEGIN
-- collect all the actors
CREATE OR REPLACE TEMPORARY TABLE tmp_actor (actor CHAR(20) PRIMARY KEY)
AS SELECT DISTINCT actor from starsin;
-- table of "all actors x (input movies + '--' placeholder)"
-- (combinations that are needed for an actor to show up in the result)
-- and a flag indicating whether that combination shows up for real
CREATE OR REPLACE TEMPORARY TABLE tmp_needed
(actor CHAR(20),
movie CHAR(20),
actual TINYINT NOT NULL DEFAULT 0,
PRIMARY KEY (actor, movie))
AS
(SELECT ta.actor, mi.movie FROM tmp_actor ta, movies_in mi)
UNION
(SELECT ta.actor, "--" FROM tmp_actor ta);
-- SELECT * FROM tmp_needed;
-- Mark those (actor, movie) combinations which actually exist
-- with a numeric 1
UPDATE tmp_needed tn SET actual = 1 WHERE EXISTS
(SELECT * FROM starsin si WHERE
si.actor = tn.actor AND si.movie = tn.movie);
-- SELECT * FROM tmp_needed;
-- The result is the set of actors in "tmp_needed" which have as many
-- entries flagged "actual" as there are entries in "movies_in"
SELECT actor FROM tmp_needed GROUP BY actor
HAVING SUM(actual) = (SELECT COUNT(*) FROM movies_in);
END$$
DELIMITER ;
Testing
There is no ready-to-use unit testing framework for MariaDB, so we
"test by hand" and write a procedure, the out of which we check manually.
Variadic arguments don't exist, vector data types don't exist.
Let's accept up to 4 movies as input and check the result manually.
DELIMITER $$
CREATE OR REPLACE PROCEDURE
test_movies(IN m1 CHAR(20),IN m2 CHAR(20),IN m3 CHAR(20),IN m4 CHAR(20))
BEGIN
CREATE OR REPLACE TEMPORARY TABLE movies_in (movie CHAR(20) PRIMARY KEY);
CREATE OR REPLACE TEMPORARY TABLE args (movie CHAR(20));
INSERT INTO args VALUES (m1),(m2),(m3),(m4); -- contains duplicates and NULLs
INSERT INTO movies_in (SELECT DISTINCT movie FROM args WHERE movie IS NOT NULL); -- clean
DROP TABLE args;
CALL actors_appearing_in_movies();
END$$
DELIMITER ;
The above passes all the manual tests, in particular:
CALL test_movies(NULL,NULL,NULL,NULL);
+--------+
| actor |
+--------+
| bob |
| george |
| maria |
+--------+
3 rows in set (0.003 sec)
For example, for CALL test_movies("a","b",NULL,NULL);
First set up the table with all actors against in all the movies in the input set, including the
"doesn't exist" movie represented by a placeholder --.
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 0 | bob | a |
| 0 | bob | b |
| 0 | george | -- |
| 0 | george | a |
| 0 | george | b |
| 0 | maria | -- |
| 0 | maria | a |
| 0 | maria | b |
+--------+--------+-------+
Then mark those rows with a 1 where the actor-movie combination actually exists in starsin.
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 1 | bob | a |
| 0 | bob | b |
| 0 | george | -- |
| 1 | george | a |
| 1 | george | b |
| 0 | maria | -- |
| 1 | maria | a |
| 1 | maria | b |
+--------+--------+-------+
Finally select an actor for inclusion in the solution if the SUM(actual) is equal to the
number of entries in the input movies table (it cannot be larger), as that means that the
actor indeed appears in all movies of the input movies table. In the special case where that
table is empty, the actor-movie combination table will only contain
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 0 | george | -- |
| 0 | maria | -- |
+--------+--------+-------+
and thus all actors will be selected, which is what we want.

Merge a one-to-many and a many-to-many relation when joining tables

I have a table of things and a table things_persons in PostgreSQL 9.6 that represents a many-to-many relationship between things and persons. But each thing also has a column owner, representing a one-to-many relationship.
A minimal schema for my problem is the following:
CREATE TABLE things(
thing_id SERIAL,
owner integer
);
CREATE TABLE things_persons(
thing_id integer,
person_id integer
);
INSERT INTO things VALUES(1,10);
INSERT INTO things VALUES(2,10);
INSERT INTO things_persons VALUES(1,10);
INSERT INTO things_persons VALUES(1,11);
INSERT INTO things_persons VALUES(2,11);
The following query is a part of what I want to do:
SELECT * FROM things
LEFT JOIN things_persons USING(thing_id)
The result is:
| thing_id | owner | person_id |
|----------|-------|-----------|
| 1 | 10 | 10 |
| 1 | 10 | 11 |
| 2 | 10 | 11 |
What I actually want to do is to treat the owner as if it was just another entry in the things_persons table. I want to list each person that is associated with each thing, regardless if they are an owner or in the things_persons table.
The following represents the result I want to achieve:
| thing_id | person_id |
|----------|-----------|
| 1 | 10 |
| 1 | 11 |
| 2 | 10 |
| 2 | 11 |
For thing 1, the owner is also among the person_ids, so it should not be duplicated. For thing 2, the owner is not among the person_ids, so it should be added there.
Changing the schema is not an option here, but I can't think of a way to write a query that gives me my desired result. Any idea on how to write such a query?
I think you just want a union or union all:
SELECT tp.thing_id, tp.person_id
FROM things_persons tp
UNION ALL
SELECT t.thing_id, t.owner_id
FROM things t;
You would use union if you wanted the query to remove duplicates.

Insert into multiple tables

A brief explanation on the relevant domain part:
A Category is composed of four data:
Gender (Male/Female)
Age Division (Mighty Mite to Master)
Belt Color (White to Black)
Weight Division (Rooster to Heavy)
So, Male Adult Black Rooster forms one category. Some combinations may not exist, such as mighty mite black belt.
An Athlete fights Athletes of the same Category, and if he classifies, he fights Athletes of different Weight Divisions (but of the same Gender, Age and Belt).
To the modeling. I have a Category table, already populated with all combinations that exists in the domain.
CREATE TABLE Category (
[Id] [int] IDENTITY(1,1) NOT NULL,
[AgeDivision_Id] [int] NULL,
[Gender] [int] NULL,
[BeltColor] [int] NULL,
[WeightDivision] [int] NULL
)
A CategorySet and a CategorySet_Category, which forms a many to many relationship with Category.
CREATE TABLE CategorySet (
[Id] [int] IDENTITY(1,1) NOT NULL,
[Championship_Id] [int] NOT NULL,
)
CREATE TABLE CategorySet_Category (
[CategorySet_Id] [int] NOT NULL,
[Category_Id] [int] NOT NULL
)
Given the following result set:
| Options_Id | Championship_Id | AgeDivision_Id | BeltColor | Gender | WeightDivision |
|------------|-----------------|----------------|-----------|--------|----------------|
1. | 2963 | 422 | 15 | 7 | 0 | 0 |
2. | 2963 | 422 | 15 | 7 | 0 | 1 |
3. | 2963 | 422 | 15 | 7 | 0 | 2 |
4. | 2963 | 422 | 15 | 7 | 0 | 3 |
5. | 2964 | 422 | 15 | 8 | 0 | 0 |
6. | 2964 | 422 | 15 | 8 | 0 | 1 |
7. | 2964 | 422 | 15 | 8 | 0 | 2 |
8. | 2964 | 422 | 15 | 8 | 0 | 3 |
Because athletes may fight two CategorySets, I need CategorySet and CategorySet_Category to be populated in two different ways (it can be two queries):
One Category_Set for each row, with one CategorySet_Category pointing to the corresponding Category.
One Category_Set that groups all WeightDivisions in one CategorySet in the same AgeDivision_Id, BeltColor, Gender. In this example, only BeltColor varies.
So the final result would have a total of 10 CategorySet rows:
| Id | Championship_Id |
|----|-----------------|
| 1 | 422 |
| 2 | 422 |
| 3 | 422 |
| 4 | 422 |
| 5 | 422 |
| 6 | 422 |
| 7 | 422 |
| 8 | 422 |
| 9 | 422 | /* groups different Weight Division for BeltColor 7 */
| 10 | 422 | /* groups different Weight Division for BeltColor 8 */
And CategorySet_Category would have 16 rows:
| CategorySet_Id | Category_Id |
|----------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 1 | /* groups different Weight Division for BeltColor 7 */
| 9 | 2 | /* groups different Weight Division for BeltColor 7 */
| 9 | 3 | /* groups different Weight Division for BeltColor 7 */
| 9 | 4 | /* groups different Weight Division for BeltColor 7 */
| 10 | 5 | /* groups different Weight Division for BeltColor 8 */
| 10 | 6 | /* groups different Weight Division for BeltColor 8 */
| 10 | 7 | /* groups different Weight Division for BeltColor 8 */
| 10 | 8 | /* groups different Weight Division for BeltColor 8 */
I have no idea how to insert into CategorySet, grab it's generated Id, then use it to insert into CategorySet_Category
I hope I've made my intentions clear.
I've also created a SQLFiddle.
Edit 1: I commented in Jacek's answer that this would run only once, but this is false. It will run a couple of times a week. I have the option to run as SQL Command from C# or a stored procedure. Performance is not crucial.
Edit 2: Jacek suggested using SCOPE_IDENTITY to return the Id. Problem is, SCOPE_IDENTITY returns only the last inserted Id, and I insert more than one row in CategorySet.
Edit 3: Answer to #FutbolFan who asked how the FakeResultSet is retrieved.
It is a table CategoriesOption (Id, Price_Id, MaxAthletesByTeam)
And tables CategoriesOptionBeltColor, CategoriesOptionAgeDivision, CategoriesOptionWeightDivison, CategoriesOptionGender. Those four tables are basically the same (Id, CategoriesOption_Id, Value).
The query look like this:
SELECT * FROM CategoriesOption co
LEFT JOIN CategoriesOptionAgeDivision ON
CategoriesOptionAgeDivision.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionBeltColor ON
CategoriesOptionBeltColor.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionGender ON
CategoriesOptionGender.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionWeightDivision ON
CategoriesOptionWeightDivision.CategoriesOption_Id = co.Id
The solution described here will work correctly in multi-user environment and when destination tables CategorySet and CategorySet_Category are not empty.
I used schema and sample data from your SQL Fiddle.
First part is straight-forward
(ab)use MERGE with OUTPUT clause.
MERGE can INSERT, UPDATE and DELETE rows. In our case we need only to INSERT. 1=0 is always false, so the NOT MATCHED BY TARGET part is always executed. In general, there could be other branches, see docs. WHEN MATCHED is usually used to UPDATE; WHEN NOT MATCHED BY SOURCE is usually used to DELETE, but we don't need them here.
This convoluted form of MERGE is equivalent to simple INSERT, but unlike simple INSERT its OUTPUT clause allows to refer to the columns that we need.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,Category.Id AS Category_Id
FROM
FakeResultSet
INNER JOIN Category ON
Category.AgeDivision_Id = FakeResultSet.AgeDivision_Id AND
Category.Gender = FakeResultSet.Gender AND
Category.BeltColor = FakeResultSet.BeltColor AND
Category.WeightDivision = FakeResultSet.WeightDivision
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT inserted.id AS CategorySet_Id, Src.Category_Id
INTO CategorySet_Category (CategorySet_Id, Category_Id)
;
FakeResultSet is joined with Category to get Category.id for each row of FakeResultSet. It is assumed that Category has unique combinations of AgeDivision_Id, Gender, BeltColor, WeightDivision.
In OUTPUT clause we need columns from both source and destination tables. The OUTPUT clause in simple INSERT statement doesn't provide them, so we use MERGE here that does.
The MERGE query above would insert 8 rows into CategorySet and insert 8 rows into CategorySet_Category using generated IDs.
Second part
needs temporary table. I'll use a table variable to store generated IDs.
DECLARE #T TABLE (
CategorySet_Id int
,AgeDivision_Id int
,Gender int
,BeltColor int);
We need to remember the generated CategorySet_Id together with the combination of AgeDivision_Id, Gender, BeltColor that caused it.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
FROM
FakeResultSet
GROUP BY
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT
inserted.id AS CategorySet_Id
,Src.AgeDivision_Id
,Src.Gender
,Src.BeltColor
INTO #T(CategorySet_Id, AgeDivision_Id, Gender, BeltColor)
;
The MERGE above would group FakeResultSet as needed and insert 2 rows into CategorySet and 2 rows into #T.
Then join #T with Category to get Category.IDs:
INSERT INTO CategorySet_Category (CategorySet_Id, Category_Id)
SELECT
TT.CategorySet_Id
,Category.Id AS Category_Id
FROM
#T AS TT
INNER JOIN Category ON
Category.AgeDivision_Id = TT.AgeDivision_Id AND
Category.Gender = TT.Gender AND
Category.BeltColor = TT.BeltColor
;
This will insert 8 rows into CategorySet_Category.
Here is not the full answer, but direction which you can use to solve this:
1st query:
select row_number() over(order by t, Id) as n, Championship_Id
from (
select distinct 0 as t, b.Id, a.Championship_Id
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, Championship_Id
from FakeResultSet
) as q
2nd query:
select q2.CategorySet_Id, c.Id as Category_Id from (
select row_number() over(order by t, Id) as CategorySet_Id, Id, BeltColor
from (
select distinct 0 as t, b.Id, null as BeltColor
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, BeltColor
from FakeResultSet
) as q
) as q2
inner join
Category as c
on
(q2.BeltColor is null and q2.Id=c.Id)
OR
(q2.BeltColor = c.BeltColor)
of course this will work only for empty CategorySet and CategorySet_Category tables, but you can use select coalese(max(Id), 0) from CategorySet to get current number and add it to row_number, thus you will get real ID which will be inserted into CategorySet row for second query
What I do when I run into these situations is to create one or many temporary tables with row_number() over clauses giving me identities on the temporary tables. Then I check for the existence of each record in the actual tables, and if they exist update the temporary table with the actual record ids. Finally I run a while exists loop on the temporary table records missing the actual id and insert them one at a time, after the insert I update the temporary table record with the actual ids. This lets you work through all the data in a controlled manner.
##IDENTITY is your friend to the 2nd part of question
https://msdn.microsoft.com/en-us/library/ms187342.aspx
and
Best way to get identity of inserted row?
Some API (drivers) returns int from update() function, i.e. ID if it is "insert". What API/environment do You use?
I don't understand 1st problem. You should not insert identity column.
Below query will give final result For CategorySet rows:
SELECT
ROW_NUMBER () OVER (PARTITION BY Championship_Id ORDER BY Championship_Id) RNK,
Championship_Id
FROM
(
SELECT
Championship_Id
,BeltColor
FROM #FakeResultSet
UNION ALL
SELECT
Championship_Id,BeltColor
FROM #FakeResultSet
GROUP BY Championship_Id,BeltColor
)BASE

improve database table design depending on a value of a type in a column

I have the following:
1. A table "patients" where I store patients data.
2. A table "tests" where I store data of tests done to each patient.
Now the problem comes as I have 2 types of tests "tests_1" and "tests_2"
So for each test done to particular patient I store the type and id of the type of test:
CREATE TABLE IF NOT EXISTS patients
(
id_patient INTEGER PRIMARY KEY,
name_patient VARCHAR(30) NOT NULL,
sex_patient VARCHAR(6) NOT NULL,
date_patient DATE
);
INSERT INTO patients values
(1,'Joe', 'Male' ,'2000-01-23');
INSERT INTO patients values
(2,'Marge','Female','1950-11-25');
INSERT INTO patients values
(3,'Diana','Female','1985-08-13');
INSERT INTO patients values
(4,'Laura','Female','1984-12-29');
CREATE TABLE IF NOT EXISTS tests
(
id_test INTEGER PRIMARY KEY,
id_patient INTEGER,
type_test VARCHAR(15) NOT NULL,
id_type_test INTEGER,
date_test DATE,
FOREIGN KEY (id_patient) REFERENCES patients(id_patient)
);
INSERT INTO tests values
(1,4,'test_1',10,'2004-05-29');
INSERT INTO tests values
(2,4,'test_2',45,'2005-01-29');
INSERT INTO tests values
(3,4,'test_2',55,'2006-04-12');
CREATE TABLE IF NOT EXISTS tests_1
(
id_test_1 INTEGER PRIMARY KEY,
id_patient INTEGER,
data1 REAL,
data2 REAL,
data3 REAL,
data4 REAL,
data5 REAL,
FOREIGN KEY (id_patient) REFERENCES patients(id_patient)
);
INSERT INTO tests_1 values
(10,4,100.7,1.8,10.89,20.04,5.29);
CREATE TABLE IF NOT EXISTS tests_2
(
id_test_2 INTEGER PRIMARY KEY,
id_patient INTEGER,
data1 REAL,
data2 REAL,
data3 REAL,
FOREIGN KEY (id_patient) REFERENCES patients(id_patient)
);
INSERT INTO tests_2 values
(45,4,10.07,18.9,1.8);
INSERT INTO tests_2 values
(55,4,17.6,1.8,18.89);
Now I think this approach is redundant or not to good...
So I would like to improve queries like
select * from tests WHERE id_patient=4;
select * from tests_1 WHERE id_patient=4;
select * from tests_2 WHERE id_patient=4;
Is there a better approach?
In this example I have 1 test of type tests_1 and 2 tests of type tests_2 for patient with id=4.
Here is a fiddle
Add a table testtype (id_test,name_test) and use it an FK to the id_type_test field in the tests table. Do not create seperate tables for test_1 and test_2
It depends on the requirement
For OLTP I would do something like the following
STAFF:
ID | FORENAME | SURNAME | DATE_OF_BIRTH | JOB_TITLE | ...
-------------------------------------------------------------
1 | harry | potter | 2001-01-01 | consultant | ...
2 | ron | weasley | 2001-02-01 | pathologist | ...
PATIENT:
ID | FORENAME | SURNAME | DATE_OF_BIRTH | ...
-----------------------------------------------
1 | hermiony | granger | 2013-01-01 | ...
TEST_TYPE:
ID | CATEGORY | NAME | DESCRIPTION | ...
--------------------------------------------------------
1 | haematology | abg | arterial blood gasses | ...
REQUEST:
ID | TEST_TYPE_ID | PATIENT_ID | DATE_REQUESTED | REQUESTED_BY | ...
----------------------------------------------------------------------
1 | 1 | 1 | 2013-01-02 | 1 | ...
RESULT_TYPE:
ID | TEST_TYPE_ID | NAME | UNIT | ...
---------------------------------------
1 | 1 | co2 | kPa | ...
2 | 1 | o2 | kPa | ...
RESULT:
ID | REQUEST_ID | RESULT_TYPE_ID | DATE_RESULTED | RESULTED_BY | RESULT | ...
-------------------------------------------------------------------------------
1 | 1 | 1 | 2013-01-02 | 2 | 5 | ...
2 | 1 | 2 | 2013-01-02 | 2 | 5 | ...
A concern I have with the above is with the unit of the test result, these can sometimes (not often) change. It may be better to place the unit un the result table.
Also consider breaking these into the major test categories as my understanding is they can be quite different e.g. histopathology and xrays are not resulted in the similar ways as haematology and microbiology are.
For OLAP I would combine request and result into one table adding derived columns such as REQUEST_TO_RESULT_MINS and make a single dimension from RESULT_TYPE and TEST_TYPE etc.
You can do this in a few ways. without knowing all the different type of cases you need to deal with.
The simplest would be 5 tables
Patients (like you described it)
Tests (like you described it)
TestType (like Declan_K suggested)
TestResultCode
TestResults
TestRsultCode describe each value that is stored for each test. TestResults is a pivoted table that can store any number of test-results per test,:
Create table TestResultCode
(
idTestResultCode int
, Code varchar(10)
, Description varchar(200)
, DataType int -- 1= Real, 2 = Varchar, 3 = int, etc.
);
Create Table TestResults
(
idPatent int -- FK
, idTest int -- FK
, idTestType int -- FK
, idTestResultCode int -- FK
, ResultsI real
, ResultsV varchar(100)
, Resultsb int
, Created datetime
)
so, basically you can fit the results you wanted to add into the tables "tests_1" and "tests_2" and any other tests you can think of.
The application reading this table, can load each test and all its values. Of course the application needs to know how to deal with each case, but you can store any type of test in this structure.