Converting PostgreSQL Subqueries into Joins - sql

Below is an example schema with 3 tables. I'm trying to run a query that returns all Jobs where all child Shifts are of status 6. If a Job has a child Shift with a status of 5, the Job should not be returned. The proper response for a query from the sample data inserted below is no rows returned.
There is a working query below with the comment "Works". I am trying to refactor the "works" query to use joins instead of subqueries. The query with the comment "Does not work" is my attempt.
-- begin setup and table creation: only run this section once.
CREATE EXTENSION "uuid-ossp";
CREATE TABLE jobs
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
CONSTRAINT jobs_pkey PRIMARY KEY (id)
);
CREATE TABLE bookings
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
job_id uuid,
CONSTRAINT bookings_pkey PRIMARY KEY (id)
);
CREATE TABLE shifts
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
booking_id uuid,
status integer,
CONSTRAINT shifts_pkey PRIMARY KEY (id)
);
insert into jobs (id) values ('e857c86c-bc31-11e6-9aae-57793f585d49');
insert into bookings (id, job_id) values ('736da82c-bc32-11e6-b9b8-f36753d321ac', 'e857c86c-bc31-11e6-9aae-57793f585d49');
insert into bookings (id, job_id) values ('7d839e5c-bc32-11e6-8bb3-4fa95be86a74', 'e857c86c-bc31-11e6-9aae-57793f585d49');
insert into shifts (booking_id, status) values ('736da82c-bc32-11e6-b9b8-f36753d321ac', 6);
insert into shifts (booking_id, status) values ('7d839e5c-bc32-11e6-8bb3-4fa95be86a74', 5);
-- end setup and table creation
We want all jobs where all child shifts are of status 6. If a job has a child shift with a status of 5, the job should not be returned. The proper response for a query from the sample data inserted above is no rows returned.
Does not work :(
SELECT "jobs".*
FROM "jobs"
inner join bookings b1 on jobs.id = b1.job_id
inner join shifts s1 on b1.id = s1.booking_id
left outer join bookings b2 on jobs.id = b2.job_id
left outer join shifts s2 on b2.id = s2.booking_id and s2.status IN (2,3,4,5)
WHERE s1.status = 6
AND s2.id IS NULL
GROUP BY "jobs"."id";
Works
SELECT "jobs".*
FROM "jobs"
WHERE jobs.id IN (
SELECT job_id
FROM bookings
WHERE bookings.id IN (
SELECT booking_id FROM shifts WHERE status = 6
)
) AND jobs.id NOT IN (
SELECT job_id FROM bookings WHERE bookings.id IN (
SELECT booking_id FROM shifts WHERE status IN (2,3,4,5)
)
)
GROUP BY "jobs"."id";
How can I refactor the "works" query to use joins instead of subqueries? The "does not work" query is my attempt.

Try this (haven't tested so there may be typos):
with prohibited_jobs as (
select distinct jobs.id
from jobs
join bookings on jobs.id = bookings.job_id
join shifts on shifts.booking_id = bookings.id
where shifts.status != 6
)
select jobs.*
from jobs
left outer join prohibited_jobs p on p.id = jobs.id
where
p.id IS NULL;
It's not completely free from subqueries (doing everything with joins would most certainly be less efficient), but it removes some unnecessary checks, so may be a little bit faster (which I suspect is your goal).
There is a small difference to your working query, in that it returns all jobs where all shifts have status 6 (as you said you want), whereas your query also ensures that the job has at least one shift (of status 6).
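For a version that uses only joins, it helps to see why the attempt in the question returns the job: the s2 filter sits on the shifts join, so a booking whose shifts are all status 6 still yields a row with s2.id IS NULL, and that single row is enough to pass the WHERE clause. A minimal sketch (untested against real data, and assuming PostgreSQL 9.4 or later for the FILTER clause) that instead groups the joined rows per job and checks the counts:

```sql
-- Join-only sketch: one group per job, then test the shift statuses in HAVING.
-- The aliases j, b and s are only for illustration.
SELECT j.*
FROM jobs j
JOIN bookings b ON b.job_id = j.id
JOIN shifts s ON s.booking_id = b.id
GROUP BY j.id  -- j.id is the primary key, so j.* is allowed in the select list
HAVING count(*) FILTER (WHERE s.status = 6) > 0               -- at least one shift with status 6
   AND count(*) FILTER (WHERE s.status IN (2, 3, 4, 5)) = 0;  -- no shift with a prohibited status
```

Against the sample data this returns no rows, because the job has one shift with status 5. Like the "works" query, it requires at least one shift with status 6.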

Related

Inner join removes some rows unnecessarily

I have 3 tables defined like so
CREATE TABLE participants(
id SERIAL PRIMARY KEY,
Name TEXT NOT NULL,
Title TEXT NOT NULL
);
CREATE TABLE meetings (
id SERIAL PRIMARY KEY,
Subject TEXT NOT NULL,
Organizer TEXT NOT NULL,
StartTime TIMESTAMP NOT NULL,
EndTime TIMESTAMP NOT NULL
);
CREATE TABLE meetings_participants(
meeting_id int not null,
participant_id int not null,
primary key (meeting_id, participant_id),
foreign key(meeting_id) references meetings(id),
foreign key(participant_id) references participants(id)
);
I want to find meetings happening today with participants in them.
When I run this query I basically get them
SELECT * from meetings
INNER JOIN meetings_participants ON meetings.id = meetings_participants.meeting_id
INNER JOIN participants ON meetings_participants.participant_id = participants.id
WHERE starttime::date = NOW()::date;
The problem is that this query discards meetings that have no participants yet, and I still want to include them in the result. How can I modify my query to do that?
You need a LEFT JOIN instead of an INNER JOIN. By casting with ::date you imply that you are only interested in meetings taking place today, whether or not they have already ended. You should still include EndTime in your query, because there might be meetings that span several days:
SELECT * from meetings
left join meetings_participants on meetings.id = meetings_participants.meeting_id
left join participants on meetings_participants.participant_id = participants.id
WHERE starttime::date <= NOW()::date and endtime::date >= NOW()::date ;
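Casting the columns to date on every row can keep PostgreSQL from using a plain index on starttime or endtime. A sketch of the same "overlaps today" filter written as range comparisons on the raw timestamps, assuming the tables from the question (the aliases m, mp and p are only for illustration):

```sql
-- Same result as comparing starttime::date and endtime::date with today,
-- but the columns themselves are left untouched in the comparison.
SELECT m.*, p.name, p.title
FROM meetings m
LEFT JOIN meetings_participants mp ON mp.meeting_id = m.id
LEFT JOIN participants p ON p.id = mp.participant_id
WHERE m.starttime < current_date + 1  -- starts before tomorrow 00:00
  AND m.endtime >= current_date;      -- ends today at 00:00 or later
```

Whether an index is actually used still depends on the data and the planner, so checking with EXPLAIN is advisable.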
EDIT: Participants' name and title as JSON array:
SELECT id, subject, organizer, starttime, endtime, jsonb_pretty(tmp.participants)
from meetings m
left join lateral (
select jsonb_agg(row_to_json(tp)) as participants
from (select p.name, p.title
from meetings_participants mp
inner join participants p on mp.participant_id = p.id
where mp.meeting_id = m.id
) tp
) tmp on true
WHERE starttime::date <= NOW()::date
and endtime::date >= NOW()::date;
You did not mention whether you want each participant on a separate row or as an aggregate (e.g. a comma-separated list). For the former, change the inner joins to left joins. For the latter, you could use a correlated subquery:
SELECT meetings.*, (
SELECT string_agg(participants.name, ', ')
FROM meetings_participants
JOIN participants ON meetings_participants.participant_id = participants.id
WHERE meetings_participants.meeting_id = meetings.id
) AS participants_list
FROM meetings
WHERE starttime::date = current_date

Optimize delete based on "count" of elements in table with foreign key

I have two tables: "tracks" (the track headers) and "track_points" (the points in each track).
Schema:
CREATE TABLE tracks(
id INTEGER PRIMARY KEY ASC,
start_time TEXT NOT NULL
);
CREATE TABLE track_points (
id INTEGER PRIMARY KEY AUTOINCREMENT,
data BLOB,
track_id INTEGER NOT NULL,
FOREIGN KEY(track_id) REFERENCES tracks(id) ON DELETE CASCADE
);
CREATE INDEX track_id_idx ON track_points (track_id);
CREATE INDEX start_time_idx ON tracks (start_time);
I want to delete all tracks that have 0 or 1 points.
Note that a track with 0 points has no rows in "track_points".
I wrote this query:
DELETE FROM tracks WHERE tracks.id IN
(SELECT track_id FROM
(SELECT tracks.id as track_id, COUNT(track_points.id) as track_len FROM tracks
LEFT JOIN track_points ON tracks.id=track_points.track_id GROUP BY tracks.id)
WHERE track_len<=1)
It seems to work, but I wonder whether it is possible to optimize this query.
I mean its running time (currently 10 seconds on a big table on my machine).
Or is it possible to simplify this SQL (without making it slower, of course)?
You can simplify your code by removing 1 level of your subqueries, because you can achieve the same with a HAVING clause instead of an outer WHERE clause:
DELETE FROM tracks
WHERE id IN (
SELECT t.id
FROM tracks t LEFT JOIN track_points p
ON t.id = p.track_id
GROUP BY t.id
HAVING COUNT(p.id) <= 1
);
The above code may not make any difference in performance, but it's simpler.
The same logic could also be applied by using EXCEPT:
DELETE FROM tracks
WHERE id IN (
SELECT id FROM tracks
EXCEPT
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
What you can try is a query that does not involve the join of the 2 tables.
Aggregate only in track_points and get the track_ids with 2 or more occurrences. Then delete all the rows from tracks with ids that are not included in the result of the previous query:
DELETE FROM tracks
WHERE id NOT IN (
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
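A small worked check of the NOT IN variant, as a sketch for SQLite that assumes the two tables from the question already exist and are empty; the sample rows are made up. NOT IN is safe here because track_id is declared NOT NULL, and SQLite only applies ON DELETE CASCADE when foreign key enforcement is switched on:

```sql
PRAGMA foreign_keys = ON;  -- without this, deleted tracks leave their points behind

INSERT INTO tracks (id, start_time) VALUES
  (1, '2020-01-01 10:00:00'),  -- gets 0 points
  (2, '2020-01-01 11:00:00'),  -- gets 1 point
  (3, '2020-01-01 12:00:00');  -- gets 2 points
INSERT INTO track_points (track_id) VALUES (2), (3), (3);

-- Preview the tracks the DELETE would remove (tracks 1 and 2 for these rows)
SELECT id
FROM tracks
WHERE id NOT IN (
  SELECT track_id
  FROM track_points
  GROUP BY track_id
  HAVING COUNT(*) > 1
);

-- Show how SQLite plans to execute the DELETE, e.g. whether track_id_idx is used
EXPLAIN QUERY PLAN
DELETE FROM tracks
WHERE id NOT IN (
  SELECT track_id
  FROM track_points
  GROUP BY track_id
  HAVING COUNT(*) > 1
);
```

Comparing the EXPLAIN QUERY PLAN output of the different variants on the real table is a cheap way to see which one avoids the join.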

PostgreSQL | Need values from right table where is no match in m:n index

I have an issue involving three tables in PostgreSQL:
table_left, table_index, table_right
table_index is m:n ...
I want to get all values from right table matching (m:n) and not-matching (NULL) values based on the values of left table.
SELECT field_left, field_index1, field_index2, field_right
FROM table_left
LEFT JOIN table_index ON left_id = index_left
LEFT JOIN table_right ON index_right = right_id
Using this query I get all values from left to right, but I'm not getting the values from table_right that have no entry in the m:n table_index.
If I do something like this ...
SELECT field_left, field_index1, field_index2, field_right
FROM table_left
LEFT JOIN table_index ON left_id = index_left
LEFT JOIN table_right ON index_right = right_id OR right_id NOT IN (1,2,3)
... I get some strange results: field_index1 and field_index2 contain values from the m:n table, but they should be NULL because there is no dependency.
Any suggestions?
EDIT:
I have added some data ... Thx to @jarlh
DROP TABLE IF EXISTS "table_index";
CREATE TABLE "public"."table_index" (
"index_left" integer NOT NULL,
"index_right" integer NOT NULL,
"index_data1" character varying NOT NULL,
"index_data2" character varying NOT NULL,
CONSTRAINT "table_index_index_left_index_right" PRIMARY KEY ("index_left", "index_right"),
CONSTRAINT "table_index_index_left_fkey" FOREIGN KEY (index_left) REFERENCES table_left(left_id) NOT DEFERRABLE,
CONSTRAINT "table_index_index_right_fkey" FOREIGN KEY (index_right) REFERENCES table_right(right_id) NOT DEFERRABLE
) WITH (oids = false);
INSERT INTO "table_index" ("index_left", "index_right", "index_data1", "index_data2") VALUES
(1, 1, 'index-Left-A', 'index-Right-A'),
(1, 2, 'index-Left-A', 'index-Right-B'),
(1, 3, 'index-Left-A', 'index-Right-C'),
(2, 1, 'index-Left-B', 'index-Right-A');
DROP TABLE IF EXISTS "table_left";
DROP SEQUENCE IF EXISTS table_left_left_id_seq;
CREATE SEQUENCE table_left_left_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1;
CREATE TABLE "public"."table_left" (
"left_id" integer DEFAULT nextval('table_left_left_id_seq') NOT NULL,
"left_data" character varying NOT NULL,
CONSTRAINT "table_left_left_id" PRIMARY KEY ("left_id")
) WITH (oids = false);
INSERT INTO "table_left" ("left_id", "left_data") VALUES
(1, 'Left-A'),
(2, 'Left-B'),
(3, 'Left-C');
DROP TABLE IF EXISTS "table_right";
DROP SEQUENCE IF EXISTS table_right_right_id_seq;
CREATE SEQUENCE table_right_right_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1;
CREATE TABLE "public"."table_right" (
"right_id" integer DEFAULT nextval('table_right_right_id_seq') NOT NULL,
"right_data" character varying NOT NULL,
CONSTRAINT "table_right_right_id" PRIMARY KEY ("right_id")
) WITH (oids = false);
INSERT INTO "table_right" ("right_id", "right_data") VALUES
(1, 'Right-A'),
(2, 'Right-B'),
(3, 'Right-C');
Using this query ....
SELECT left_id, left_data, index_left, index_right, index_data1, index_data2, right_id, right_data
FROM table_left
LEFT JOIN table_index ON left_id = index_left
LEFT JOIN table_right ON index_right = right_id
... I get some NULL values as expected...
Using the original database I'm not getting these kinds of values. I have seen that there is an id column in the index table there, and the primary key isn't set on both id values from left/right as in my test. I changed this in my local db and got the same result as in my test before: I'm getting these NULL values as expected.
I want to get all values from right table matching (m:n) and not-matching (NULL) values based on the values of left table.
If you want all values from the right table, then left join is the right way to go. However the right table should be the first table in the from clause:
SELECT field_left, field_index1, field_index2, field_right
FROM table_right r LEFT JOIN
table_index i
ON i.index_right = r.right_id LEFT JOIN
table_left l
ON l.left_id = i.index_left;
Notes:
There is a bit of cognitive dissonance (in English) because the roles of left/right are reversed.
I recommend that you use table aliases for your actual tables.
I strongly, strongly recommend that you qualify all column references so it is clear what tables they come from.
You could also use RIGHT JOIN. However, I also recommend using LEFT JOIN for this type of logic. It is easier to follow query logic that says: "Keep all rows in the table you just read" rather than "Keep all rows in some table that you will see much further down in the FROM clause."
EDIT:
Based on your comments, I suspect you want a Cartesian product of all left and right values along with a flag that indicates if it is in the junction table. Something like this:
SELECT field_left, field_index1, field_index2, field_right
FROM table_right r CROSS JOIN
table_left l LEFT JOIN
table_index i
ON i.index_right = r.right_id AND
i.index_left = l.left_id;
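The query above produces the Cartesian product but does not yet show the flag itself. A sketch using the column names from the sample data in the question, with every column qualified; the output name in_junction is made up for illustration:

```sql
SELECT l.left_id, l.left_data,
       r.right_id, r.right_data,
       i.index_data1, i.index_data2,
       (i.index_left IS NOT NULL) AS in_junction  -- true when the pair exists in table_index
FROM table_right r
CROSS JOIN table_left l
LEFT JOIN table_index i
       ON i.index_right = r.right_id
      AND i.index_left = l.left_id
ORDER BY l.left_id, r.right_id;
```

With the sample rows this returns 9 rows (3 left values times 3 right values), 4 of which have in_junction set to true.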

Querying for who worked on an item first and second

I have a table that looks like this:
Id (PK, int, not null)
ReviewedBy (nvarchar(255), not null)
ReviewDateTime(datetime, not null)
Decision_id (int, not null)
Item_id (FK, int, not null)
The business process with this table is that each Item (shown by Item_id foreign key) is to be worked on by 2 people.
How can I query this table to determine who (ReviewedBy) reviewed the item first and who reviewed it second.
I'm really struggling to figure this out because I neglected to add a Type column to my table that would record which role the user was acting in. :(
Edit
Given the following data
Id,ReviewedBy,ReviewedWhen,SomeOtherId,
16,111111,2011-12-14 22:06:54,1,
17,187935,2011-12-14 22:07:03,1,
18,187935,2011-12-14 22:07:18,2,
19,187935,2011-12-14 22:07:20,3,
20,111111,2011-12-14 22:07:23,2,
21,187935,2011-12-14 22:07:26,3,
22,123456,2011-12-14 22:27:50,4,
with schema
CREATE TABLE [Reviews] (
[Id] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[ReviewedBy] NVARCHAR(6) NOT NULL,
[ReviewedWhen] TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
[SomeOtherId] INTEGER NOT NULL
);
Executing the following to get a list of people who did second reviews will return rows where there is only one review for SomeOtherId.
select t1.*
from Reviews as t1
left outer join Reviews as t2
on (t1.SomeOtherId = t2.SomeOtherId and t1.ReviewedWhen < t2.ReviewedWhen)
where t2.SomeOtherId is null;
Solution
-- First checks
select t1.ReviewedBy, count(t1.Id)
from Reviews as t1
left outer join Reviews as t2
on (t2.SomeOtherId = t1.SomeOtherId and t1.ReviewedWhen > t2.ReviewedWhen)
where t2.SomeOtherID is null
group by t1.ReviewedBy;
-- Second checks
select t1.ReviewedBy, count(t1.Id)
from Reviews as t1
left outer join Reviews as t2
on (t2.SomeOtherId = t1.SomeOtherId and t1.ReviewedWhen < t2.ReviewedWhen)
where t2.SomeOtherID is null
and t1.Id not in (select Id from Reviews group by SomeOtherId having count(SomeOtherId) = 1)
group by t1.ReviewedBy;
Essentially, it was counting items where there was only one review as both a first and second check. All I had to do was ensure that when I'm counting second checks that I'm not including rows with only one review.
I thought I could achieve this in one query but guess not.
Try this:
select
t1.ReviewedBy FirstReviewer,
t2.ReviewedBy SecondReviewer
from
Table t1
left outer join Table t2 on t1.Item_Id = t2.Item_Id and t2.ReviewDateTime > t1.ReviewDateTime
If you want to only return rows that have been reviewed by two people, change the left outer join to an inner join.
If ReviewDateTime is never updated and Id is an identity column, you can change the join to compare Id rather than ReviewDateTime, which will be faster.
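Note that the left outer join above also lists the second reviewer as a FirstReviewer with a NULL partner, because that row has no later review to match. A sketch of an alternative that ranks the reviews per item with ROW_NUMBER(), assuming a database with window functions (for example SQLite 3.25 or later, or SQL Server) and the Reviews sample schema from the question; the CTE name ranked and the alias rn are made up for illustration:

```sql
WITH ranked AS (
  SELECT SomeOtherId,
         ReviewedBy,
         -- 1 = earliest review of the item, 2 = the next one; Id breaks ties
         ROW_NUMBER() OVER (PARTITION BY SomeOtherId
                            ORDER BY ReviewedWhen, Id) AS rn
  FROM Reviews
)
SELECT SomeOtherId,
       MAX(CASE WHEN rn = 1 THEN ReviewedBy END) AS FirstReviewer,
       MAX(CASE WHEN rn = 2 THEN ReviewedBy END) AS SecondReviewer
FROM ranked
GROUP BY SomeOtherId;
```

With the sample rows this gives 111111 and 187935 for item 1, 187935 and 111111 for item 2, 187935 twice for item 3, and 123456 with a NULL second reviewer for item 4, so items with a single review no longer count as second checks.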

Oracle sql query running for (almost) forever

An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any blocks in the database and there are none. I also ran the following pieces of the query and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.
A coworker figured out the issue. It's because the table statistics weren't being updated and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query is returning in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.
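For reference, a sketch of how the stale statistics could be spotted and refreshed, assuming the table lives in the current schema. DBMS_STATS is the package Oracle recommends for optimizer statistics in place of ANALYZE ... COMPUTE STATISTICS:

```sql
-- When were the statistics last gathered, and how many rows does the optimizer assume?
SELECT table_name, num_rows, last_analyzed
FROM user_tables
WHERE table_name = 'RECORD';

-- Refresh the statistics for the table and, via cascade, its indexes
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'RECORD',
    cascade => TRUE);
END;
/
```

A num_rows value far below the real row count, or an old last_analyzed date, would match the symptom described above.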
SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
Try that and let me know how it turns out. Not saying this IS the answer, but since I don't have access to a DB right now, I can't test it. Hope it works for ya.
An explain plan would be a good place to start.
See "Strange speed changes with sql query" for how to use the explain plan syntax (and how to query the result).
If that doesn't show anything suspicious, you'll probably want to look at a trace.
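A minimal sketch of the explain plan workflow mentioned above, assuming Oracle's EXPLAIN PLAN statement and the DBMS_XPLAN.DISPLAY table function. The query is copied as posted; GROUP is a reserved word in Oracle, so the real table presumably has a different name and should be substituted:

```sql
-- Store the execution plan for the slow query in the plan table
EXPLAIN PLAN FOR
select count(*)
from RECORD r inner join GROUP g  -- replace GROUP with the real table name
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y';

-- Print the plan that was just stored
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

A row estimate in the plan that is far from the real counts listed in the question would point toward stale statistics.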