PostgreSQL: Get last updates by joining 2 tables

I have 2 tables that I need to join to get the last/latest update in the 2nd table based on valid rows in the 1st table.
The code below is an example.
Table 1: Registered users
This table contains a list of users registered in the system.
When a user is registered, a row is added to this table with the user's name and a registration time.
A user can be de-registered from the system. When this happens, the de-registration column is updated with the time the user was removed. If this value is NULL, the user is still registered.
CREATE TABLE users (
    entry_idx SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    reg_time TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    dereg_time TIMESTAMP WITH TIME ZONE DEFAULT NULL
);
Table 2: User updates
This table contains updates on the users. Each time a user changes a property (for example, position), the change is stored in this table. Updates must never be removed, since there is a requirement to keep history in the table.
CREATE TABLE user_updates (
    entry_idx SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    position INTEGER NOT NULL,
    time TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Required output
So given the above information, I need to get a new table that contains only the last update for the current registered users.
Test Data
The following data can be used as test data for the above tables:
-- Register 3 users
INSERT INTO users(name) VALUES ('Person1');
INSERT INTO users(name) VALUES ('Person2');
INSERT INTO users(name) VALUES ('Person3');
-- Add some updates for all users
INSERT INTO user_updates(name, position) VALUES ('Person1', 0);
INSERT INTO user_updates(name, position) VALUES ('Person1', 1);
INSERT INTO user_updates(name, position) VALUES ('Person1', 2);
INSERT INTO user_updates(name, position) VALUES ('Person2', 1);
INSERT INTO user_updates(name, position) VALUES ('Person3', 1);
-- Unregister the 2nd user
UPDATE users SET dereg_time = NOW() WHERE name = 'Person2';
From the above, I want the last updates for Person 1 and Person 3.
Failed attempt
I have tried using joins and other methods, but the results are not what I am looking for. The question is almost the same as one asked here. I have used the solution in answer 1 and it does give the correct answer, but it takes too long to get to the answer on my system.
Based on the above link I have created the following query that 'works':
SELECT
    t1.*,
    t2.*
FROM users t1
JOIN (
    SELECT
        t.*,
        row_number() OVER (
            PARTITION BY t.name
            ORDER BY t.entry_idx DESC
        ) rn
    FROM user_updates t
) t2
    ON t1.name = t2.name
    AND t2.rn = 1
WHERE t1.dereg_time IS NULL;
Problem
The problem with the above query is that it takes very long to complete. Table 1 contains a small list of users, while table 2 contains a huge number of updates. I think the query might be inefficient in the way it handles the 2 tables (based on my limited understanding of the query). From pgAdmin's explain output, it does a lot of sorting and aggregation on the updates first, before joining with the registered users table.
Question
How can I formulate a query to efficiently and fast get the latest updates for registered users?

PostgreSQL has a special DISTINCT ON syntax for this type of query:
select distinct on (t1.name)
    -- it's better to specify columns explicitly; * is just for the example
    t1.*, t2.*
from users as t1
left outer join user_updates as t2 on t2.name = t1.name
where t1.dereg_time is null
order by t1.name, t2.entry_idx desc;
sql fiddle demo
You can try it, but in my opinion your original query should work fine too.
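For readers who want to exercise the "latest update per registered user" logic end-to-end without a Postgres instance, the sketch below uses SQLite (3.25+ for window functions) through Python as a stand-in. DISTINCT ON is Postgres-only, so it uses the row_number() formulation from the question; SERIAL and the timestamp columns are replaced with SQLite equivalents, and only the columns needed to show the result are kept.

```python
# SQLite stand-in for the Postgres schema and query from the question.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (
    entry_idx INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    dereg_time TEXT DEFAULT NULL
);
CREATE TABLE user_updates (
    entry_idx INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    position INTEGER NOT NULL
);
INSERT INTO users(name) VALUES ('Person1'), ('Person2'), ('Person3');
INSERT INTO user_updates(name, position) VALUES
    ('Person1', 0), ('Person1', 1), ('Person1', 2),
    ('Person2', 1), ('Person3', 1);
-- unregister the 2nd user
UPDATE users SET dereg_time = '2017-01-01' WHERE name = 'Person2';
""")

# Latest update per still-registered user, via row_number().
rows = con.execute("""
SELECT t1.name, t2.position
FROM users t1
JOIN (
    SELECT t.*,
           row_number() OVER (PARTITION BY t.name
                              ORDER BY t.entry_idx DESC) AS rn
    FROM user_updates t
) t2 ON t1.name = t2.name AND t2.rn = 1
WHERE t1.dereg_time IS NULL
ORDER BY t1.name
""").fetchall()
print(rows)  # [('Person1', 2), ('Person3', 1)]
```

In Postgres itself, an index on user_updates (name, entry_idx DESC) is usually what makes top-1-per-group queries like this cheap.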

I am using q1 to get the last update time of each user, then joining with users to remove entries that have been deregistered, then joining with q2 to get the rest of the user_updates fields.
select users.*, q2.*
from users
join (
    select name, max(time) t
    from user_updates
    group by name
) q1 on users.name = q1.name
join user_updates q2 on q1.t = q2.time and q1.name = q2.name
where users.dereg_time is null
(I haven't tested it; I have edited some things.)
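A runnable sketch of this aggregate-then-join approach, again using SQLite through Python as a stand-in for Postgres, with invented timestamps. One caveat worth knowing: if two updates for the same user share the same time, the join returns both rows.

```python
# Aggregate-then-join: find max(time) per user, then join back for the row.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (name TEXT NOT NULL, dereg_time TEXT DEFAULT NULL);
CREATE TABLE user_updates (name TEXT NOT NULL, position INTEGER NOT NULL,
                           time TEXT NOT NULL);
INSERT INTO users(name) VALUES ('Person1'), ('Person2'), ('Person3');
INSERT INTO user_updates(name, position, time) VALUES
    ('Person1', 0, '10:00'), ('Person1', 2, '10:02'),
    ('Person2', 1, '10:01'), ('Person3', 1, '10:01');
UPDATE users SET dereg_time = '10:05' WHERE name = 'Person2';
""")

rows = con.execute("""
SELECT users.name, q2.position
FROM users
JOIN (SELECT name, max(time) t FROM user_updates GROUP BY name) q1
  ON users.name = q1.name
JOIN user_updates q2 ON q1.t = q2.time AND q1.name = q2.name
WHERE users.dereg_time IS NULL
ORDER BY users.name
""").fetchall()
print(rows)  # [('Person1', 2), ('Person3', 1)]
```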

Related

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
    userid bigint,
    dobyr int,
    -- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are many rows with the same userid value and also the same dobyr value, one of them is kept (it doesn't matter which one) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions
where userid in (
    select userid
    from userinteractions
    group by userid
    having count(distinct dobyr) = 1
)
order by userid, dobyr;
This could also be done with NOT IN, NOT EXISTS or EXISTS conditions. You can also select which row of each group to keep by adding columns at the end of the ORDER BY.
Updated demo with tests and more rows.
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, the plain userids from records whose (userid, dobyr) pairs match your criteria are enough, since they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr) = 1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
    userid,
    MAX(dobyr) AS dobyr
FROM userinteractions
GROUP BY userid
HAVING COUNT(DISTINCT dobyr) = 1
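The HAVING approach can be checked directly against the question's sample data; the SQL is portable, so SQLite through Python works as a quick harness:

```python
# Verify the GROUP BY / HAVING approach on the question's sample data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE userinteractions (userid INTEGER, dobyr INTEGER);
INSERT INTO userinteractions(userid, dobyr) VALUES
    (1, 1995), (1, 1995), (2, 1999),
    (3, 1990), (3, 1999),  -- dobyr values not equal, so user 3 is dropped
    (4, 1989), (4, 1989);
""")

rows = con.execute("""
SELECT userid, MAX(dobyr) AS dobyr
FROM userinteractions
GROUP BY userid
HAVING COUNT(DISTINCT dobyr) = 1
ORDER BY userid
""").fetchall()
print(rows)  # [(1, 1995), (2, 1999), (4, 1989)]
```

MAX(dobyr) is safe here because the HAVING clause guarantees each surviving group has exactly one distinct dobyr value.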

Best approach to populate new tables in a database

I have a problem I have been working on the past several hours. It is complex (for me) and I don't expect someone to do it for me. I just need the right direction.
Problem: We had the tables (below) added to our database and I need to update them based on data already in our DailyCosts table. The tricky part is that I need to take DailyCosts.Notes and move it to PurchaseOrder.PoNumber. Notes is where we currently have the PONumbers.
I started with the Insert below, testing it out on one WellID. This is Inserting records from our DailyCosts table to the new PurchaseOrder table:
Insert Into PurchaseOrder (PoNumber, WellID, JobID, ID)
Select Distinct Cast(Notes As nvarchar(20)), WellID, JobID, DailyCosts.DailyCostID
From DailyCosts
Where WellID = '24A-23'
It affected 1973 rows (the Notes column is ntext).
However, I need to update the other new tables because we need to see the actual PONumbers in the application.
This next Insert is Inserting records from our DailyCost table and new PurchaseOrder table (from above) to a new table called PurchaseOrderDailyCost
Insert Into PurchaseOrderDailyCost (WellID, JobID, ReportNo, AccountCode, PurchaseOrderID,ID,DailyCostSeqNo, DailyCostID)
Select Distinct DailyCosts.WellID,DailyCosts.JobID,DailyCosts.ReportNo,DailyCosts.AccountCode,
PurchaseOrder.ID,NEWID(),0,DailyCosts.DailyCostID
From DailyCosts join
PurchaseOrder ON DailyCosts.WellID = PurchaseOrder.WellID
Where DailyCosts.WellID = '24A-23'
Unfortunately, this produces 3,892,729 records. The Notes field contains the same list of PONumbers each day. This is by design so that the people inputting the data out in the field can easily track their PO numbers. The new PONumber column that we are moving the Notes to would store just unique POnumbers. I modified the query by replacing NEWID() with DailyCostID and the Join to ON DailyCosts.DailyCostID = PurchaseOrder.ID
This affected 1973 rows the same as the first Insert.
The next Insert looks like this:
Insert Into PurchaseOrderAccount (WellID, JobID, PurchaseOrderID, ID, AccountCode)
Select PurchaseOrder.WellID, PurchaseOrder.JobID, PurchaseOrder.ID, PurchaseOrderDailyCost.DailyCostID,PurchaseOrderDailyCost.AccountCode
From PurchaseOrder Inner Join
PurchaseOrderDailyCost ON PurchaseOrder.ID = PurchaseOrderDailyCost.DailyCostID
Where PurchaseOrder.WellID = '24A-23'
The page in the application now shows the PONumbers in the correct column. Everything looks like I want it to.
Unfortunately, it slows down the application to an unacceptable level. I need to figure out how to either modify my Insert or delete duplicate records. The problem is that there are multiple foreign key constraints. I have some more information below for reference.
This shows the application after the inserts. These are all duplicate records that I am hoping to eliminate.
Here is some additional information I received from the vendor about the tables:
-- add a new purchase order
INSERT INTO PurchaseOrder
(WellID, JobID, ID, PONumber, Amount, Description)
VALUES ('MyWell', 'MyJob', NEWID(), 'PO444444', 500.0, 'A new Purchase Order')
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to a DailyCost record with SeqNo 0 and AccountCode 'MyAccount'
INSERT INTO PurchaseOrderDailyCost
(WellID, JobID, ReportNo, AccountCode, DailyCostSeqNo, PurchaseOrderID, ID)
VALUES ('MyWell', 'MyJob', 4, 'MyAccount', 0, 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID())
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to an account code 'MyAccount'
-- (i.e. make it choosable from the DailyCost PO-column dropdown for any DailyCost record whose account code is 'MyAccount')
INSERT INTO PurchaseOrderAccount
(WellID, JobID, PurchaseOrderID, ID, AccountCode)
VALUES ('MyWell', 'MyJob', 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID(), 'MyAccount')
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to an AFE No. 'MyAFENo'
-- (same behavior as with the account codes above)
INSERT INTO PurchaseOrderAFE
(WellID, JobID, PurchaseOrderID, ID, AFENo)
VALUES ('MyWell', 'MyJob', 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID(), 'MyAFENo')
So it turns out I missed some simple joining principles. The better I get, the more silly mistakes I seem to make. Basically, on my very first insert, I did not include a Group By. Adding this took my INSERT from 1973 rows to 93. Then on my next insert, I joined DailyCosts.Notes on PurchaseOrder.PONumber, since these are the only records from DailyCosts I needed. This was previously INSERT 2 in my question. From there, basically everything came together. Two steps forward and one step back. Thanks to everyone who responded to this.

SSIS incremental data load error

I am trying to perform incremental insert from staging table (cust_reg_dim_stg) to the warehouse table (dim_cust_reg). This is the query I am using.
insert into dim_cust_reg WITH(TABLOCK)
(
channel_id
,cust_reg_id
,cust_id
,status
,date_created
,date_activated
,date_archived
,custodian_id
,reg_type_id
,reg_flags
,acc_name
,acc_number
,sr_id
,sr_type
,as_of_date
,ins_timestamp
)
select channel_id
,cust_reg_id
,cust_id
,status
,date_created
,date_activated
,date_archived
,custodian_id
,reg_type_id
,reg_flags
,acc_name
,acc_number
,sr_id
,sr_type
,as_of_date
,getdate() ins_timestamp
from umpdwstg..cust_reg_dim_stg stg with(nolock)
join lookup_channel ch with(nolock) on stg.channel_name = ch.channel_name
where not exists
(select * from dim_cust_reg dest
where dest.cust_reg_id=stg.cust_reg_id
and dest.sr_id=stg.sr_id
and dest.channel_id=ch.channel_id )
Here channel_id is not there in the staging table and is taken using a channel lookup table (lookup_channel). On running this query I am getting the following error.
Violation of PRIMARY KEY constraint 'PK__dim_cust__4A293521A789A5FA'.
Cannot insert duplicate key in object 'dbo.dim_cust_reg'.
What is wrong with the query? channel_id, sr_id and cust_reg_id form the unique key combination. There seems to be no data error.
There are 2 areas where you will need to troubleshoot:
In this code below:
join lookup_channel ch with(nolock) on stg.channel_name = ch.channel_name
The incoming channel_name in the staging table may differ from the channel name on the record already in the destination dimension.
OR
it may be because of this join condition inside the NOT EXISTS condition:
and dest.sr_id=stg.sr_id
and dest.channel_id=ch.channel_id
Here, again, the incoming channel_id may be different when you compare the staged data to the destination. The suggestion is to leave channel_id out once and retry; once the data is loaded into the target, you can determine whether the error was indeed caused by channel_id.
Happy troubleshooting!
If there are already duplicate entries in the staging table cust_reg_dim_stg, the SELECT query will produce both records and try to insert both into the dim_cust_reg table. So use DISTINCT in the SELECT statement.
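The failure mode is easy to reproduce in miniature. The sketch below (SQLite through Python, with a simplified single-column key and invented table contents mirroring the question's names) shows a duplicate business key inside one staged batch colliding on the primary key; the NOT EXISTS guard does not help there, because it only filters against rows already in the target when the statement starts, while DISTINCT collapses the in-batch duplicates.

```python
# Reproduce the PK-violation-from-staging-duplicates scenario in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_cust_reg (cust_reg_id INTEGER PRIMARY KEY, acc_name TEXT);
CREATE TABLE cust_reg_dim_stg (cust_reg_id INTEGER, acc_name TEXT);
-- the same business key staged twice
INSERT INTO cust_reg_dim_stg VALUES (1, 'acct'), (1, 'acct');
""")

# Without DISTINCT: both staged rows reach the target and collide on the key.
try:
    con.execute("""
        INSERT INTO dim_cust_reg
        SELECT cust_reg_id, acc_name FROM cust_reg_dim_stg
    """)
    failed = False
except sqlite3.IntegrityError:
    failed = True  # duplicate key, as in the question

# With DISTINCT (plus the NOT EXISTS guard against already-loaded rows),
# the batch collapses to one row and the insert succeeds.
con.execute("""
    INSERT INTO dim_cust_reg
    SELECT DISTINCT cust_reg_id, acc_name FROM cust_reg_dim_stg stg
    WHERE NOT EXISTS (SELECT 1 FROM dim_cust_reg dest
                      WHERE dest.cust_reg_id = stg.cust_reg_id)
""")
loaded = con.execute("SELECT COUNT(*) FROM dim_cust_reg").fetchone()[0]
print(failed, loaded)  # True 1
```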

Oracle sql: insert into by copying from two different tables

Although it's much more complex than I'm about to explain, I'll try to stick to the relevant bits of what I want to accomplish. Our data model is quite complex, and the terms are also a bit confusing. We basically have a Request, and this request can have an active Request_Status (which has an Enum_Value to indicate its current status), as well as previous Request_Statuses that aren't relevant anymore (kept to preserve history). A Person is linked to this Request, but the Values that are entered are linked to the current Request_Status.
So here are those tables, and the relevant columns:
Persons:
person_id
unique_code
Some other values
Requests:
request_id
fk_person_id
year
Some other values
Enum_Values:
enum_value_id
value
Some other values
Request_Statuses:
request_status_id
fk_request_id
fk_enum_value_id
created_date
Some other values
Values:
value_id
fk_request_status_id
Some other values
I have: A list of Person.unique_codes.
I want to achieve two things:
For each Person.unique_code I want to get the Request of the year 2017, and then create a new Request_Status with fk_enum_value_id set to 4, linked to this existing Request.
Create copies of the Values that were linked to the previously active Request_Status, and set their fk_request_status_id to the currently active Request_Status (the records I've created in step 1).
I've been able to do step 1 myself with a monstrous query (but it works..)
Here is the monstrous query for step 1:
Some things to note:
- There will only be a single Request of a given year.
- There can be more than one Request_Statuses for a given Request, so finding the active is the one with the highest created_date.
- p.unique_code IN ('12345','67890') is privatized and reduced code. In reality I have about 500 person.unique_codes.
- SELECT rs1.fk_request_id, 4 /*, some other irrelevant values */ FROM Request_Statuses rs1 LEFT JOIN Request_Statuses rs2 ON (rs1.fk_request_id = rs2.fk_request_id AND rs1.created_date < rs2.created_date) WHERE rs2.created_date IS NULL is copied from this SO answer for the question "Retrieving the last record in each group". I've used the windowing function at the top before, but it wasn't really suitable for sub-queries in combination with Oracle SQL, so I've used the (probably slightly slower) original method that was posted in 2009, which does work as intended.
INSERT INTO Request_Statuses (fk_request_id, fk_enum_value_id /*, some other irrelevant values */)
(SELECT rs1.fk_request_id, 4 /*, some other irrelevant values */
 FROM Request_Statuses rs1
 LEFT JOIN Request_Statuses rs2
   ON (rs1.fk_request_id = rs2.fk_request_id AND rs1.created_date < rs2.created_date)
 WHERE rs2.created_date IS NULL
   AND rs1.fk_request_id IN (SELECT r.request_id
                             FROM Requests r
                             WHERE r.fk_person_id IN (SELECT p.person_id
                                                      FROM Persons p
                                                      WHERE p.unique_code IN ('12345','67890'))
                               AND r.year = 2017));
And I'm currently working on step 2.
I currently have this:
INSERT INTO Values (fk_request_status_id /*, some other irrelevant values */)
(SELECT /*TODO: Get request_status_id created in step 1*/, /* some other irrelevant values */
 FROM Values v1
 WHERE v1.fk_request_status_id IN (SELECT rs.request_status_id
                                   FROM Request_Statuses rs
                                   WHERE rs.fk_request_id IN (SELECT r.request_id
                                                              FROM Requests r
                                                              WHERE r.fk_person_id IN (SELECT p.person_id
                                                                                       FROM Persons p
                                                                                       WHERE p.unique_code IN ('12345','67890'))
                                                                AND r.year = 2017)
                                     AND (SELECT COUNT(*)
                                          FROM Values v2
                                          WHERE v2.fk_request_status_id = rs.request_status_id) > 0));
All I need is to get the request_status_id of the Request_Statuses I've created in step 1, based on the same person.unique_code, and insert it at the TODO..
I've also been thinking about using a default value for now, and then updating just the fk_request_status_id with a third (monstrous) query. Unfortunately, the fk_request_status_id in combination with a second column in the Values table forms a unique constraint, and fk_request_status_id cannot be empty, so I can't just insert any value here to update later. Maybe I should remove the constraints temporarily, and add them back after the query.
PS: Performance isn't that important. I've only got around 500-750 person.unique_codes for which I have to create one new Request_Status each (and zero to about 50 Values that are potentially linked to the previous active Request_Status). It should work in under 4 hours, though. ;)
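For step 2, once the new statuses exist, each copied Values row can find its new fk_request_status_id by joining the previously active status to the newer one through the shared fk_request_id. The sketch below shows the shape of that join in SQLite through Python (a stand-in for Oracle; the ids, dates and the payload column are invented for illustration, and the real query would keep the person/year filtering from step 1).

```python
# Copy each Values row from the old status to the newer status of the
# same request, resolving the new fk via the shared fk_request_id.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE request_statuses (
    request_status_id INTEGER PRIMARY KEY,
    fk_request_id INTEGER,
    created_date TEXT
);
CREATE TABLE "values" (
    value_id INTEGER PRIMARY KEY,
    fk_request_status_id INTEGER,
    payload TEXT
);
-- request 10: status 1 was active before; status 2 is the one step 1 created
INSERT INTO request_statuses VALUES (1, 10, '2017-01-01'), (2, 10, '2017-06-01');
-- two values linked to the previously active status 1
INSERT INTO "values"(fk_request_status_id, payload) VALUES (1, 'a'), (1, 'b');
""")

con.execute("""
    INSERT INTO "values"(fk_request_status_id, payload)
    SELECT rs_new.request_status_id, v.payload
    FROM "values" v
    JOIN request_statuses rs_old
      ON rs_old.request_status_id = v.fk_request_status_id
    JOIN request_statuses rs_new
      ON rs_new.fk_request_id = rs_old.fk_request_id
     AND rs_new.created_date > rs_old.created_date
""")

rows = con.execute(
    'SELECT fk_request_status_id, payload FROM "values" ORDER BY value_id'
).fetchall()
print(rows)  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```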

How do I process data in Multivalued Column in Oracle PLSQL?

I am working on creating a small database in Oracle for a project at work. One of the columns will need to have multiple values recorded in it. What's the query to create a multivalued column?
If you need a user to enter multiple email addresses, I would consider creating a USER_EMAIL table to store such records.
Create Table User_Email (User_Id int, Email varchar(100));
User_Id would be a foreign key that goes back to your USER table.
Then you can have 1-n email addresses per user. This is generally the best practice for database normalization. If your emails have different types (e.g. work, personal, etc.), you could add another column to that table for the type.
If you need to return the rows in a single column, you could then look at using LISTAGG:
select u.id,
listagg(ue.email, ', ') within group (order by ue.email) email_addresses
from users u
left join user_email ue on u.id = ue.user_id
group by u.id
SQL Fiddle Demo
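SQLite's group_concat is a rough stand-in for LISTAGG, so the 1-n layout above can be tried through Python. Note that group_concat does not guarantee element order the way LISTAGG ... WITHIN GROUP does, so the order inside the concatenated string should be treated as unspecified.

```python
# 1-n email table rolled up to one row per user, group_concat standing
# in for Oracle's LISTAGG. Sample addresses are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE user_email (user_id INTEGER, email TEXT);
INSERT INTO users(id) VALUES (1), (2);
INSERT INTO user_email VALUES (1, 'a@x.com'), (1, 'b@x.com');
""")

rows = con.execute("""
    SELECT u.id, group_concat(ue.email, ', ') AS email_addresses
    FROM users u
    LEFT JOIN user_email ue ON u.id = ue.user_id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()
print(rows)  # user 2 has no emails, so its aggregate is NULL (None)
```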
You can try using a VARRAY column in Oracle.
Look at this page: https://www.orafaq.com/wiki/VARRAY
You can see there:
Declaration of a type:
CREATE OR REPLACE TYPE vcarray AS VARRAY(10) OF VARCHAR2(128);
Declaration of a table:
CREATE TABLE varray_table (id number, col1 vcarray);
Insertion:
INSERT INTO varray_table VALUES (3, vcarray('D', 'E', 'F'));
Selection:
SELECT t1.id, t2.column_value
FROM varray_table t1, TABLE(t1.col1) t2
WHERE t2.column_value = 'A' OR t2.column_value = 'D'