Using SQL EXCEPT to build an EAVT fact store

I'm exploring the Datomic database, and in doing so I'm having a go at taking some of its ideas and implementing them in SQL in an incremental way, so as to adjust to the new ways of data modelling. This question is really entirely about SQL, though; I mention Datomic only as background, to explain the why of what I'm doing here (though it might be interesting for those interested in Datomic too, which is why I also added the datomic tag to the question).
Generally we are getting rid of separate tables per type, but I will retain a users table for this example rather than simply use an entities table (I may try that later, but not yet).
create table users (
id uuid,
identity text -- e.g. 'the yankees', 'man born as john in birmingham on date x/y/z'
);
Then we have an EAVT store, also with an added boolean to specify add or retract. This table is append-only: we will never issue UPDATE or DELETE on it.
create table eavt_log (
user_id uuid,
attribute text,
value text,
added boolean,
created_at timestamp
);
Now some data to illustrate the intended usage:
-- insert person number 12345 (imagine as national identity or birth certificate no.)
insert into users(id, identity) values (uuid_generate_v4(), 'p-12345');
-- let's insert some facts about a person previously known as john smith:
insert into eavt_log(user_id, attribute, value, added, created_at) values
((select id from users where identity='p-12345'),
'name', 'John Smith', true, '1911-01-01'),
((select id from users where identity='p-12345'),
'name', 'John Smith', false, '1931-01-01'),
((select id from users where identity='p-12345'),
'name', 'John Bontine Smith', true, '1931-01-01');
To make this useful (any database must provide leverage, as Hickey says), let's try to find all the unretracted names for the person previously known as John Smith.
Here's my (flawed) attempt:
-- find all currently unretracted names for person previously known as John Smith. This could
-- be 0, 1 (we hope), or more - it just depends though, and should, on what data has been input.
(select attribute, value from eavt_log
where user_id = (select id from users where identity='p-12345')
and attribute = 'name'
and added = true
order by created_at desc) -- <- can sneak this in w/o upsetting the except, as it's not in the select.
except
(select attribute, value from eavt_log
where user_id = (select id from users where identity='p-12345')
and attribute = 'name'
and added = false);
That gives:
 attribute |       value
-----------+--------------------
 name      | John Bontine Smith
(1 row)
Which is correct for the test data we gave it.
Then we can try to generalise this to a view:
create view unretracted as (
(select user_id, attribute, value from eavt_log
where added = true
order by created_at)
except
(select user_id, attribute, value from eavt_log
where added = false)
);
The problem is that both of these are flawed, because this simple EXCEPT gives an incorrect result when a fact has been added, retracted, then added again. I.e., if we add
((select id from users where identity='p-12345'),
'name', 'John Smith', false, '1941-01-01');
to the facts inserted above, to denote that person p-12345, in 1941, adopted the name John Smith again (without retracting the name 'John Bontine Smith', so in this case we want the system to return two values for his name).
With this data, the earlier retraction of this identical value will cause the later re-assertion of the same value to be excluded from the result set, even though it's been reasserted, due to the way EXCEPT works on whole sets of rows (we did not do a linear scan of the log in time order, which I think may be required here?).
My question (finally!): is there a way to achieve this in SQL? Can SQL give us more leverage here?
It seems as if we need a WHERE after the EXCEPT which reaches back into the first SELECT... but that seems impossible in set-theory terms, so I wonder what else SQL can do here.

This is edited for your update, although I think there is still something wrong: you added an additional retracted row, which seems to contradict your text. Assuming that the row is actually an assertion rather than a retraction, we can use the query below.
You can use DISTINCT ON in Postgres to get the latest row per (user_id, attribute, value). If you use that in a sub-select, you can then keep only the rows for which added = true:
SELECT attribute, value
FROM (
    SELECT DISTINCT ON (eavt_log.user_id, attribute, value)
           attribute, value, added
    FROM eavt_log
    JOIN users ON eavt_log.user_id = users.id
    WHERE attribute = 'name'
    ORDER BY eavt_log.user_id, attribute, value, created_at DESC) sub
WHERE added = true;
Edit: here's a fiddle
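The flawed unretracted view from the question can be generalised the same way. Here's a sketch (my own extrapolation of the DISTINCT ON approach, not from the original answer): because DISTINCT ON keeps only the newest row per (user_id, attribute, value), a later re-assertion wins over an earlier retraction, which handles the add/retract/re-add case.
-- Sketch: a fact is "unretracted" if its most recent log entry
-- for that (user_id, attribute, value) triple is an assertion.
create view unretracted as
select user_id, attribute, value
from (
    select distinct on (user_id, attribute, value)
           user_id, attribute, value, added
    from eavt_log
    order by user_id, attribute, value, created_at desc) latest
where added = true;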

Related

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this:
CREATE TABLE userinteractions
(
    userid bigint,
    dobyr int
    -- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table, and these polluted cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr pairs, limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single SQL query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are multiple rows with the same userid value and also the same dobyr value, one of them is kept (it doesn't matter which one) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions
where userid in (
    select userid
    from userinteractions
    group by userid
    having count(distinct dobyr) = 1)
order by userid, dobyr;
This could also be done with NOT IN, NOT EXISTS or EXISTS conditions (see the sketch below). Also, you can select which combination to keep by adding columns at the end of the ORDER BY.
Updated demo with tests and more rows.
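As a sketch of the NOT EXISTS variant mentioned above (my own illustration, equivalent to the IN version for this data):
-- Sketch: keep rows whose userid never occurs with a different dobyr
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions u
where not exists (
    select 1
    from userinteractions u2
    where u2.userid = u.userid
      and u2.dobyr is distinct from u.dobyr)
order by userid, dobyr;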
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, plain userids from records with (userid, dobyr) pairs matching your criteria are enough, as they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr) = 1;
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
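Since the question asks for a clean table rather than just a result set, this can be wrapped directly in CREATE TABLE ... AS — a sketch, reusing the table name from the earlier answer:
-- Sketch: materialise the HAVING-based query as the cleaned-up table
CREATE TABLE userinteractions_clean AS
SELECT
    userid,
    MAX(dobyr) AS dobyr
FROM
    userinteractions
GROUP BY
    userid
HAVING
    COUNT(DISTINCT dobyr) = 1;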

Inserting multiple records in database table using PK from another table

I have a DB2 table "organization" which holds organization data, including the following columns:
organization_id (PK), name, description
Some organizations have been deleted, so a lot of "organization_id" values (i.e. rows) don't exist anymore; the ids are not continuous like 1, 2, 3, 4, 5... but more like 1, 2, 5, 7, 11, 12, 21....
Then there is another table "title" with some other data, which has organization_id from the organization table in it as a FK.
Now there is some data which I have to insert for all organizations: a title that is going to be shown for all of them in the web app.
In total there are approximately 3000 records to be added.
If I were to do it one by one, it would look like this:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
VALUES
(
'This is new title',
XXXX,
CURRENT TIMESTAMP,
1,
1,
1
);
where XXXX represents the "organization_id" which I should get from the "organization" table, so that the insert is done only for existing organization_ids.
So only "organization_id" changes, matching the "organization_id" from the "organization" table.
What would be the best way to do it?
I checked several similar questions, but none of them seems to match this one:
SQL Server 2008 Insert with WHILE LOOP
The while-loop answer iterates over continuous IDs, and the other answer also assumes that the ID is auto-incremented.
Same here:
How to use a SQL for loop to insert rows into database?
Not sure about this one (as the question itself is not quite clear):
Inserting a multiple records in a table with while loop
Any advice on this? How should I do it?
If you seriously want a row in title for every organization record, with the exact same data, something like this should work:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
SELECT
'This is new title' as name,
o.organization_id,
CURRENT TIMESTAMP as datetime_added,
1 as added_by,
1 as special_fl,
1 as title_type_id
FROM
organization o
;
You shouldn't need the column aliases in the SELECT, but I am including them for readability and good measure.
https://www.ibm.com/support/knowledgecenter/ssw_i5_54/sqlp/rbafymultrow.htm
And for good measure, in case your process errors out partway or you need to re-run it, you can also do something like this to insert a record into title only if that organization_id and title combination does not already exist:
INSERT INTO title
(
name,
organization_id,
datetime_added,
added_by,
special_fl,
title_type_id
)
SELECT
'This is new title' as name,
o.organization_id,
CURRENT TIMESTAMP as datetime_added,
1 as added_by,
1 as special_fl,
1 as title_type_id
FROM
organization o
LEFT JOIN Title t
ON o.organization_id = t.organization_id
AND t.name = 'This is new title'
WHERE
t.organization_id IS NULL
;
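As an aside (not part of the original answer), on DB2 versions that support it, the same "insert if missing" guard can also be written with MERGE:
-- Sketch: MERGE-based variant; inserts the title only for
-- organizations that don't already have it.
MERGE INTO title t
USING (SELECT organization_id FROM organization) AS o
ON t.organization_id = o.organization_id
   AND t.name = 'This is new title'
WHEN NOT MATCHED THEN
    INSERT (name, organization_id, datetime_added, added_by, special_fl, title_type_id)
    VALUES ('This is new title', o.organization_id, CURRENT TIMESTAMP, 1, 1, 1);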

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name     | dob        |
+----+----------+------------+
|  1 | Alice    | 01/12/1989 |
|  2 | Bob      | 04/06/1990 |
|  3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990.
Add a column, created date default sysdate, to the table and change the primary key to (id, created). Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information, I do the following (Oracle, but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: all queries over the entire database take that little bit longer. If the table were the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to (id, deleted); then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
Create a second table, say student_archive, and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever recorded, you have to use a union or an extra left outer join.
For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150-line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr: I realise I'm skating close to the line of a "not constructive" vote here, but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "less DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4, with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or before deleting) the original record. That can easily be done with a row-level trigger, as sketched below. Retrieving all information is, in my opinion, not a frequent operation, and I don't see anything wrong with the extra join/union. Also, you can define a view, so all queries will be straightforward from the end-user perspective.
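A minimal sketch of such a trigger, in Oracle syntax to match the question's examples, assuming student_archive has the same columns as students:
create or replace trigger students_archive_trg
before update or delete on students
for each row
begin
  -- copy the old row into the archive before it changes or disappears
  insert into student_archive (id, name, dob)
  values (:old.id, :old.name, :old.dob);
end;
/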

Insert into table some values which are selected from other table

I have my database structure like this:
ATT_table- ActID(PK), assignedtoID(FK), assignedbyID(FK), Env_ID(FK), Product_ID(FK), project_ID(FK), Status
Product_table - Product_ID(PK), Product_name
Project_Table- Project_ID(PK), Project_Name
Environment_Table- Env_ID(PK), Env_Name
Employee_Table- Employee_ID(PK), Name
Employee_Product_projectMapping_Table -Emp_ID(FK), Project_ID(FK), Product_ID(FK)
Product_EnvMapping_Table - Product_ID(FK), Env_ID(FK)
I want to insert values into ATT_table. Now, in that table I have some columns, like assignedtoID, assignedbyID, Env_ID, Product_ID and project_ID, which are FKs in this table but primary keys in other tables (they are simply numbers).
When taking input from the user, I get strings: the user enters a Name (Employee_Table), a Product_name (Product_table) and so on, not IDs directly. So I want the user to enter the names first (of Employee, Product, Project or Env), then have the values of the corresponding primary keys (Emp_ID, Product_ID, Project_ID, Env_ID) looked up and inserted into ATT_table as assignedtoID, assignedbyID, Env_ID, Product_ID and project_ID.
Please note that assignedtoID and assignedbyID both reference Emp_ID in Employee_Table.
How do I do this? I have got something like this, but it's not working:
INSERT INTO ATT_TABLE(Assigned_To_ID,Assigned_By_ID,Env_ID,Product_ID,Project_ID)
VALUES (A, B, Env_Table.Env_ID, Product_Table.Product_ID, Project_Table.Project_ID)
SELECT Employee_Table.Emp_ID AS A,Employee_Table.Emp_ID AS B, Env_Table.Env_ID, Project_Table.Project_ID, Product_Table.Product_ID
FROM Employee_Table, Env_Table, Product_Table, Project_Table
WHERE Employee_Table.F_Name= "Shantanu" or Employee_Table.F_Name= "Kapil" or Env_Table.Env_Name= "SAT11A" or Product_Table.Product_Name = "ABC" or Project_Table.Project_Name = "Project1";
The way this is handled is by using drop-down select lists. The list consists of (at least) two columns: one holds the IDs the database works with, the other(s) hold the strings the user sees. Like:
1, "CA", "Canada"
2, "USA", 'United States"
...
The user sees
CA | Canada
USA| United States
...
The value that gets stored in the database is 1, 2, ... whatever row the user selected.
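For illustration, such a drop-down could be backed by a small lookup table along these lines (a hypothetical sketch; the table and column names are made up):
-- Hypothetical lookup table backing the drop-down list
CREATE TABLE country_lookup (
    id   INT PRIMARY KEY,   -- the value actually stored in referencing tables
    code VARCHAR(3),        -- short label shown to the user
    name VARCHAR(50)        -- long label shown to the user
);

INSERT INTO country_lookup (id, code, name) VALUES (1, 'CA', 'Canada');
INSERT INTO country_lookup (id, code, name) VALUES (2, 'USA', 'United States');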
You can never rely on the exact, correct input of users. Sooner or later they will make typos.
I'll extend my answer, based on your remark.
The problem with the given solution (getting the IDs from the parent tables by JOINing all those parent tables together on the entered text, combined with a number of ANDs) is that as soon as one given parameter has a typo, you will not get a single record back. Imagine the consequences when the real F_Name of the employee is "Shantanu" and the user entered "Shantaun".
The best way to cope with this is to get those IDs one by one from the parent tables. Suppose some FKs have a NOT NULL constraint. You can check whether F_Name is filled in and inform the user when he didn't fill in that field. But suppose the user entered "Shantaun" as the name: the program will not warn the user, because something is filled in. And that is not the check the database does either, because the NOT NULL constraints are defined on the IDs (FKs). When you get the IDs one by one from the parent tables, you can verify whether each of them is NULL or not. When the text is filled in, like "Shantaun", but the returned ID is NULL, you can inform the user of the problem and let him correct his input: "No employee by the name 'Shantaun' could be found."
-- Pseudocode: capture each id into a host/application variable
SELECT $Emp_ID_A = Emp_ID
FROM Employee_Table
WHERE F_Name = "Shantanu"

SELECT $Emp_ID_B = Emp_ID
FROM Employee_Table
WHERE F_Name = "Kapil"

SELECT $Env_ID = Env_ID
FROM Env_Table
WHERE Env_Name = "SAT11A"

SELECT $Product_ID = Product_ID
FROM Product_Table
WHERE Product_Name = "ABC"

SELECT $Project_ID = Project_ID
FROM Project_Table
WHERE Project_Name = "Project1"
Please use AND instead of OR.
INSERT INTO ATT_TABLE(Assigned_To_ID,Assigned_By_ID,Env_ID,Product_ID,Project_ID)
SELECT A.Emp_ID, B.Emp_ID, Env_Table.Env_ID, Project_Table.Project_ID, Product_Table.Product_ID
FROM Employee_Table A, Employee_Table B, Env_Table, Product_Table, Project_Table
WHERE A.F_Name= "Shantanu"
AND B.F_Name= "Kapil"
AND Env_Table.Env_Name= "SAT11A"
AND Product_Table.Product_Name = "ABC"
AND Project_Table.Project_Name = "Project1";
But it is best practice to use a drop-down list in your scenario, I guess.

How to search on LevelOrder values in SQL?

I have a table in SQL Server that contains the following columns:
Id   Name      ParentId   LevelOrder
8    vehicle   0          0/8/
9    car       8          0/8/9/
10   bike      8          0/8/10/
11   House     0          0/11/
...
This creates a tree.
Say that I have the LevelOrder 0/8/; this should return only the car and bike rows. How do I handle this in SQL Server?
I have tried:
Select * FROM MyTable WHERE LevelOrder >= '0/8/'
but that does not work.
The underscore character will guarantee at least one character comes after '0/8/', so you don't get a match on the "vehicle" row.
SELECT *
FROM MyTable
WHERE LevelOrder LIKE '0/8/_%'
This code allows you to select values that start with 0/8/ (note that '0/8/%' also matches the '0/8/' vehicle row itself, unlike the underscore version above):
Select * FROM MyTable WHERE LevelOrder like '0/8/%'
Okay -
While #Joe's answer is the simplest and easiest to implement (and possibly better performing than what I'm about to propose...), there are some issues with update anomalies.
Specifically:
You already have a parentId column. You need to synchronize both this and the levelOrder column, or risk inconsistent data. (I believe this also violates 1NF, although my understanding of the exact definition is a little sketchy...)
levelOrder contains the entire hierarchy. If any one parent is moved, all child rows must have levelOrder modified to reflect this (potentially very messy).
In light of this, here's what I recommend:
Drop the levelOrder column, as its existence will (generally) cause problems.
Use a recursive CTE and the parentId column to build the hierarchy dynamically. Either leave the column where it is, or move it to a dedicated relationship table. Moving one parent then requires only one cell to be updated, and cannot result in any (data, not semantic) anomalies. The CTE should look similar to this form (it will need to be adjusted for purpose):
WITH heir_parent (parentId, id) as (SELECT parentId, id
                                    FROM [table]
                                    WHERE id = @id  -- @id: root of the subtree (placeholder parameter)
                                    UNION ALL
                                    SELECT b.parentId, b.id
                                    FROM heir_parent as a
                                    JOIN [table] as b
                                    ON b.parentId = a.id)
At the moment, the CTE returns a list of all children of the given id, with their id and their immediate parent. It can be adjusted to return a number of other things as well - although I recommend that the CTE be used only to generate the relationship, and join externally to get the remaining data.
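For instance, here is a sketch of that external join, reusing the CTE above (with the same placeholder [table] name and @id parameter; id 8 would be the vehicle subtree in the sample data):
-- Sketch: fetch the full rows for the root and all of its descendants
DECLARE @id int = 8;  -- e.g. the vehicle row from the sample data

WITH heir_parent (parentId, id) as (SELECT parentId, id
                                    FROM [table]
                                    WHERE id = @id
                                    UNION ALL
                                    SELECT b.parentId, b.id
                                    FROM heir_parent as a
                                    JOIN [table] as b
                                    ON b.parentId = a.id)
SELECT t.*
FROM heir_parent as hp
JOIN [table] as t
ON t.id = hp.id;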