PostgreSQL query trimming results unnecessarily

I'm working on my first assignment using SQL on our class' PostgreSQL server. A sample database has the (partial here) schema:
CREATE TABLE users (
id int PRIMARY KEY,
userStatus varchar(100),
userType varchar(100),
userName varchar(100),
email varchar(100),
age int,
street varchar(100),
city varchar(100),
state varchar(100),
zip varchar(100),
CONSTRAINT users_status_fk FOREIGN KEY (userStatus) REFERENCES userStatus(name),
CONSTRAINT users_types_fk FOREIGN KEY (userType) REFERENCES userTypes(name)
);
CREATE TABLE events (
id int primary key,
title varchar(100),
edate date,
etime time,
location varchar(100),
user_id int, -- creator of the event
CONSTRAINT events_user_fk FOREIGN KEY (user_id) REFERENCES users(id)
);
CREATE TABLE polls (
id int PRIMARY KEY,
question varchar(100),
creationDate date,
user_id int, --creator of the poll
CONSTRAINT polls_user_fk FOREIGN KEY (user_id) REFERENCES users(id)
);
and a bunch of sample data (in particular, 127 sample users).
I have to write a query to find the number of polls created by a user within the past year, as well as the number of events created by a user that occurred in the past year. The trick is, I should have rows with 0s for both columns if the user had no such polls/events.
I have a query which seems to return the correct data, but only for 116 of the 127 users, and I cannot understand why the query is trimming these 11 users, when the WHERE clause only checks attributes of the poll/event. Following is my query:
SELECT u.id, u.userStatus, u.userType, u.email, -- Return user details
COUNT(DISTINCT e.id) AS NumEvents, -- Count number of events
COUNT(DISTINCT p.id) AS NumPolls -- Count number of polls
FROM (users AS u LEFT JOIN events AS e ON u.id = e.user_id) LEFT JOIN polls AS p ON u.id = p.user_id
WHERE (p.creationDate IS NULL OR ((now() - p.creationDate) < INTERVAL '1' YEAR) OR -- Only get polls created within last year
e.edate IS NULL OR ((now() - e.edate) < INTERVAL '1' YEAR)) -- Only get events that happened during last year
GROUP BY u.id, u.userStatus, u.userType, u.email;
Any help would be much appreciated.

Using a different query seemed to work. Here's what I ended up with:
SELECT u.id, u.userStatus, u.userType, u.email, COUNT(DISTINCT e.id) AS numevents, COUNT(DISTINCT p.id) AS numpolls
FROM users AS u LEFT OUTER JOIN (SELECT * FROM events WHERE ((now() - edate) < INTERVAL '1' YEAR)) AS e ON u.id = e.user_id
LEFT OUTER JOIN (SELECT * FROM polls WHERE ((now() - creationDate) < INTERVAL '1' YEAR)) AS p ON u.id = p.user_id
GROUP BY u.id, u.userStatus, u.userType, u.email
;
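A possible explanation for the missing users, as a sketch rather than a diagnosis of the class data: the WHERE clause runs after both joins, so a user whose events and polls are all older than a year produces only rows that fail every OR branch, and that user drops out entirely. The example below reproduces this with SQLite through Python's sqlite3 (used only so it runs without a server). The cut-down tables, the sample rows and the fixed CUTOFF date are assumptions standing in for now() - INTERVAL '1' YEAR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY);
CREATE TABLE events (id INTEGER PRIMARY KEY, edate TEXT, user_id INTEGER);
CREATE TABLE polls  (id INTEGER PRIMARY KEY, creationDate TEXT, user_id INTEGER);
INSERT INTO users VALUES (1), (2), (3);
-- user 1: recent event and poll; user 2: only old rows; user 3: nothing at all
INSERT INTO events VALUES (10, '2024-06-01', 1), (11, '2020-01-01', 2);
INSERT INTO polls  VALUES (20, '2024-06-15', 1), (21, '2020-02-01', 2);
""")
CUTOFF = "2024-01-01"  # hypothetical stand-in for "one year ago"

# Date filter in WHERE: evaluated after the joins, so user 2 vanishes.
where_rows = conn.execute("""
    SELECT u.id, COUNT(DISTINCT e.id), COUNT(DISTINCT p.id)
    FROM users u
    LEFT JOIN events e ON u.id = e.user_id
    LEFT JOIN polls  p ON u.id = p.user_id
    WHERE p.creationDate IS NULL OR p.creationDate > :c
       OR e.edate IS NULL OR e.edate > :c
    GROUP BY u.id ORDER BY u.id
""", {"c": CUTOFF}).fetchall()

# Date filter in ON: old rows simply fail to match, the user row survives.
on_rows = conn.execute("""
    SELECT u.id, COUNT(DISTINCT e.id), COUNT(DISTINCT p.id)
    FROM users u
    LEFT JOIN events e ON u.id = e.user_id AND e.edate > :c
    LEFT JOIN polls  p ON u.id = p.user_id AND p.creationDate > :c
    GROUP BY u.id ORDER BY u.id
""", {"c": CUTOFF}).fetchall()

print(where_rows)  # user 2 is missing
print(on_rows)     # user 2 is present with zero counts
```

Filtering inside a subquery before joining, as in the query above, has the same effect as putting the date test in the ON clause. Note also that the OR in the WHERE version lets an old event through whenever the same row carries a recent poll, so its counts can be off even for users it keeps.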

Try to avoid DISTINCT where you can, for example by counting in sub-queries and joining the results.

Related

How to show clients with 0 reservations in certain year? (SQL)

I have these tables:
CREATE TABLE tour
(
id bigserial NOT NULL,
end_date DATE,
initial_price float8 NOT NULL,
start_date DATE,
destination_id int8,
guide_id int8,
PRIMARY KEY (id)
);
CREATE TABLE client_data
(
id bigserial NOT NULL,
name VARCHAR(255),
passport_number VARCHAR(255),
surname VARCHAR(255),
user_data_id int8,
PRIMARY KEY (id)
);
CREATE TABLE reservation
(
id bigserial not null,
actual_price float8 not null,
client_id int8,
tour_id int8,
PRIMARY KEY (id)
);
Where every reservation is connected to client_data and tour.
My goal is to show all clients that have not made any reservation in a certain year, e.g. clients that have no reservations in 2022.
I tried something like this:
SELECT client_data.name, reservation.id, COUNT(reservation.id)
FROM client_data
LEFT OUTER JOIN reservation ON client_data.id = reservation.client_id
LEFT OUTER JOIN tour ON tour.id = reservation.tour_id
GROUP BY client_data.name, reservation.id
HAVING COUNT(reservation.id) = 0;
Or this:
SELECT client_data.name, reservation.id, COUNT(reservation.id)
FROM client_data
LEFT OUTER JOIN reservation ON client_data.id = reservation.client_id
LEFT OUTER JOIN tour ON tour.id = reservation.tour_id
WHERE reservation.id IS NULL
GROUP BY client_data.name, reservation.id;
Both of these work and show me clients that have no reservations IN GENERAL, but I also need to show clients with no reservations in a certain year.
When I try to include
WHERE tour.start_date BETWEEN '2022-01-01' AND '2022-12-31'
the SQL statement returns 0 rows.
Any ideas how to do this?
EDIT:
I'll add the full data and schema I work with.
schema: https://pastebin.com/ETvrW1tQ
data: https://pastebin.com/h1WHT0zZ
You've almost got it right. WHERE tour.start_date BETWEEN '2022-01-01' AND '2022-12-31' returns 0 rows because WHERE is applied to the whole joined result: for clients without a matching reservation, tour.start_date is NULL, so the condition removes exactly the rows you want to keep. Instead of putting the date condition in the WHERE clause, I'd suggest adding it to the join condition for tour. Note also that LEFT JOIN and LEFT OUTER JOIN are the same thing (the OUTER keyword is optional), so the shorter form is enough. I think the following should work:
SELECT client_data.name, reservation.id, COUNT(reservation.id)
FROM client_data
LEFT JOIN reservation ON client_data.id = reservation.client_id
LEFT JOIN tour ON tour.id = reservation.tour_id and tour.start_date BETWEEN '2022-01-01' AND '2022-12-31'
WHERE reservation.id IS NULL
GROUP BY client_data.name, reservation.id;
Hope it helps
Edit
As OP mentioned, the above query doesn't work as intended. I think we'll have to resort to a subquery (or CTE) here, which I wanted to avoid for performance reasons, though that may be getting ahead of ourselves. There may be a way to avoid it, but I can't think of one at the moment, so here's a solution with a subquery that will hopefully work.
select * from client_data where id not in (
select distinct r.client_id from reservation r
join tour t on r.tour_id = t.id
where t.start_date BETWEEN '2022-01-01' AND '2022-12-31'
and r.client_id is not null -- a NULL in the list would make NOT IN return no rows
);
In this query we first find the client_ids that did make a reservation in the given time frame, then filter them out of the client data. The IS NOT NULL check matters because reservation.client_id is nullable.
I have attached a fiddle that you can play around with.
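As a sketch of the subquery approach on a tiny data set, the snippet below compares NOT IN with a NOT EXISTS variant, which is a swapped-in alternative and not part of the answer above. It runs on SQLite through Python's sqlite3 only for convenience; the three clients, two tours and three reservations are made-up sample rows, including one reservation whose client_id is NULL (the schema allows that).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_data (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tour (id INTEGER PRIMARY KEY, start_date TEXT);
CREATE TABLE reservation (id INTEGER PRIMARY KEY, client_id INTEGER, tour_id INTEGER);
INSERT INTO client_data VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
INSERT INTO tour VALUES (100, '2022-05-01'), (101, '2021-05-01');
-- Ann booked a 2022 tour, Bob only a 2021 tour, Cid nothing;
-- reservation 3 has no client at all
INSERT INTO reservation VALUES (1, 1, 100), (2, 2, 101), (3, NULL, 100);
""")

# NOT IN over a list that contains NULL: no client can be proven absent.
not_in = conn.execute("""
    SELECT name FROM client_data WHERE id NOT IN (
        SELECT r.client_id FROM reservation r
        JOIN tour t ON r.tour_id = t.id
        WHERE t.start_date BETWEEN '2022-01-01' AND '2022-12-31')
    ORDER BY name
""").fetchall()

# NOT EXISTS: correlated per client, unaffected by the NULL row.
not_exists = conn.execute("""
    SELECT c.name FROM client_data c WHERE NOT EXISTS (
        SELECT 1 FROM reservation r
        JOIN tour t ON r.tour_id = t.id
        WHERE r.client_id = c.id
          AND t.start_date BETWEEN '2022-01-01' AND '2022-12-31')
    ORDER BY c.name
""").fetchall()

print(not_in)      # empty because of the NULL client_id
print(not_exists)  # Bob and Cid have no 2022 reservation
```

If client_id can never be NULL in your data, both forms return the same clients; NOT EXISTS just removes the need to think about it.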
This will return the tour id's in 2022 that do not have a corresponding tour id in reservation:
select id as tour_id
from tour
where start_date between '2022-01-01' and '2022-12-31'
except
select tour_id
from reservation;
But since TOUR does not have a client_id, then how would you expect to get the client_id or client_name?

Inner join removes some rows unnecessarily

I have 3 tables defined like so
CREATE TABLE participants(
id SERIAL PRIMARY KEY,
Name TEXT NOT NULL,
Title TEXT NOT NULL
);
CREATE TABLE meetings (
id SERIAL PRIMARY KEY,
Subject TEXT NOT NULL,
Organizer TEXT NOT NULL,
StartTime TIMESTAMP NOT NULL,
EndTime TIMESTAMP NOT NULL
);
CREATE TABLE meetings_participants(
meeting_id int not null,
participant_id int not null,
primary key (meeting_id, participant_id),
foreign key(meeting_id) references meetings(id),
foreign key(participant_id) references participants(id)
);
I want to find meetings happening today with participants in them.
When I run this query I basically get them
SELECT * from meetings
INNER JOIN meetings_participants ON meetings.id = meetings_participants.meeting_id
INNER JOIN participants ON meetings_participants.participant_id = participants.id
WHERE starttime::date = NOW()::date;
The problem is that this query discards meetings that have no participants yet, and I still want to include them in the result. How can I modify my query to do that?
You need a LEFT JOIN instead of an INNER JOIN. By casting with ::date you imply that you are interested in meetings that take place today, whether or not they have already ended. You should still include EndTime in your query, since there might be meetings that span several days:
SELECT * from meetings
left join meetings_participants on meetings.id = meetings_participants.meeting_id
left join participants on meetings_participants.participant_id = participants.id
WHERE starttime::date <= NOW()::date and endtime::date >= NOW()::date ;
DBFiddle demo here.
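To make the difference between the two join types concrete, here is a small sketch on SQLite through Python's sqlite3. The tables are cut down to the columns the query touches, the four meetings are invented, TODAY is a hypothetical fixed date so the result is reproducible, and date(...) plays the role of the ::date cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE meetings (id INTEGER PRIMARY KEY, subject TEXT NOT NULL,
                       starttime TEXT NOT NULL, endtime TEXT NOT NULL);
CREATE TABLE meetings_participants (meeting_id INTEGER, participant_id INTEGER,
                                    PRIMARY KEY (meeting_id, participant_id));
INSERT INTO participants VALUES (1, 'Ann');
INSERT INTO meetings VALUES
  (1, 'standup',  '2024-03-10 09:00', '2024-03-10 09:15'),  -- today, Ann attends
  (2, 'planning', '2024-03-10 14:00', '2024-03-10 15:00'),  -- today, nobody yet
  (3, 'offsite',  '2024-03-09 10:00', '2024-03-11 17:00'),  -- spans today
  (4, 'retro',    '2024-03-12 10:00', '2024-03-12 11:00');  -- another day
INSERT INTO meetings_participants VALUES (1, 1);
""")
TODAY = "2024-03-10"  # hypothetical stand-in for NOW()::date

# Same query twice; only the join type changes.
QUERY = """
    SELECT m.subject, p.name
    FROM meetings m
    {join} JOIN meetings_participants mp ON m.id = mp.meeting_id
    {join} JOIN participants p ON mp.participant_id = p.id
    WHERE date(m.starttime) <= :d AND date(m.endtime) >= :d
    ORDER BY m.id
"""
inner_rows = conn.execute(QUERY.format(join="INNER"), {"d": TODAY}).fetchall()
left_rows = conn.execute(QUERY.format(join="LEFT"), {"d": TODAY}).fetchall()

print(inner_rows)  # only the meeting that already has a participant
print(left_rows)   # empty meetings appear with None for the participant
```

The multi-day offsite shows up only because the filter compares both the start and the end date against today.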
EDIT: Participants' name and title as JSON array:
SELECT id, subject, organizer, starttime, endtime, jsonb_pretty(tmp.participants)
from meetings m
left join lateral (
select jsonb_agg(row_to_json(tp)) as participants
from (select p.name, p.title
from meetings_participants mp
inner join participants p on mp.participant_id = p.id
where mp.meeting_id = m.id
) tp
) tmp on true
WHERE starttime::date <= NOW()::date
and endtime::date >= NOW()::date;
DBFiddle demo for participants added as JSON
You did not mention whether you want each participant on a separate row or as an aggregate (e.g. a comma-separated list). If the former, change the inner joins to left joins. For the latter, you could use a correlated subquery:
SELECT meetings.*, (
SELECT string_agg(participants.name, ', ')
FROM meetings_participants
JOIN participants ON meetings_participants.participant_id = participants.id
WHERE meetings_participants.meeting_id = meetings.id
) AS participants_list
FROM meetings
WHERE starttime::date = current_date

Find top 5 famous people

I have a case in hand where I need to find the top 5 people with most likes on their posts overall.
Here's the schema:
CREATE TABLE users (
ID SERIAL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
username VARCHAR(30) NOT NULL
);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
url VARCHAR(300) NOT NULL,
user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE
);
CREATE TABLE likes (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
contents VARCHAR(240) NOT NULL,
user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
post_id INTEGER REFERENCES posts(id) ON DELETE CASCADE,
comment_id INTEGER REFERENCES comments(id) ON DELETE CASCADE,
-- 👉 either associated with post or comment 👈 --
CHECK(
COALESCE((post_id)::boolean::integer, 0) +
COALESCE((comment_id)::boolean::integer, 0) = 1
),
-- user can like post/comment once --
UNIQUE (user_id, post_id, comment_id)
);
My Attempts
Both are giving different outputs, not sure which one is correct. Also, I would appreciate an ideal (scalable) solution for this:
1.
WITH FAMOUS AS (
SELECT likes.id, users.username AS username, users.id AS user_id
FROM likes
JOIN posts ON posts.user_id = likes.post_id
JOIN users ON users.id = likes.user_id
WHERE likes.comment_id IS null
)
SELECT COUNT(*) AS num, username FROM FAMOUS
GROUP BY username
ORDER BY num DESC LIMIT 5;
2.
WITH LIKES_DATA AS (
SELECT post_id, COUNT(*) AS num_likes_per_post FROM likes
WHERE likes.comment_id IS NULL
GROUP BY post_id
)
SELECT users.username, SUM(num_likes_per_post) as num_likes
FROM LIKES_DATA
JOIN posts ON posts.id = LIKES_DATA.post_id
JOIN users ON users.id = posts.user_id
GROUP BY users.username
ORDER BY num_likes DESC LIMIT 5;
I simply do not understand the thought process for the second query.
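Based on your description, I think plain JOINs and a GROUP BY are sufficient. Note that likes must be joined to posts on the post id, and posts to users on the author's id; joining users on likes.user_id would count the likes a user gave rather than the likes their posts received:
SELECT u.username AS username, u.id AS user_id, COUNT(*) AS num_likes
FROM likes l JOIN
posts p
ON p.id = l.post_id JOIN
users u
ON u.id = p.user_id
WHERE l.comment_id IS NULL -- keeps post likes only; redundant here, since comment likes have no post_id to join on
GROUP BY u.username, u.id
ORDER BY num_likes DESC
LIMIT 5;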
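Since the two attempts in the question give different outputs, a small sketch may help show what each join path counts. It uses SQLite through Python's sqlite3 with made-up rows: three users, three posts and five likes, one of which is a comment like. Joining likes to posts on the post id and then to the post's author counts likes received, which is also what the second attempt computes in two steps; joining users straight onto likes.user_id counts likes given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE likes (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                    post_id INTEGER, comment_id INTEGER);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
INSERT INTO posts VALUES (10, 1), (11, 1), (12, 2);  -- ann wrote 10 and 11, bob wrote 12
INSERT INTO likes VALUES
  (1, 2, 10, NULL), (2, 3, 10, NULL), (3, 3, 11, NULL),  -- 3 likes on ann's posts
  (4, 1, 12, NULL),                                      -- 1 like on bob's post
  (5, 1, NULL, 99);                                      -- a comment like
""")

# Likes received: like -> post -> author of the post.
received = conn.execute("""
    SELECT u.username, COUNT(*) AS num_likes
    FROM likes l
    JOIN posts p ON p.id = l.post_id
    JOIN users u ON u.id = p.user_id
    GROUP BY u.id, u.username
    ORDER BY num_likes DESC, u.username
    LIMIT 5
""").fetchall()

# Likes given: like -> the user who clicked it (not what was asked for).
given = conn.execute("""
    SELECT u.username, COUNT(*) AS num_likes
    FROM likes l
    JOIN users u ON u.id = l.user_id
    WHERE l.comment_id IS NULL
    GROUP BY u.id, u.username
    ORDER BY num_likes DESC, u.username
""").fetchall()

print(received)  # ann leads with 3 likes on her posts
print(given)     # cid leads, having liked 2 posts
```

Grouping by u.id as well as u.username keeps two users with the same name apart, which matters here because username is not declared UNIQUE.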

Multiple selects on joined tables with group by?

I have three tables with the structures outlined below:
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username VARCHAR(255) UNIQUE
);
CREATE TABLE posts (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users(id) NOT NULL,
category BIGINT REFERENCES categories(id) NOT NULL,
text TEXT NOT NULL
);
CREATE TABLE posts_votes (
user_id BIGINT REFERENCES users(id) NOT NULL,
post_id BIGINT REFERENCES posts(id) NOT NULL,
value SMALLINT NOT NULL,
PRIMARY KEY(user_id, post_id)
);
I was able to compose a query that gets each post with its user and its total value using the below query:
SELECT p.id, p.text, u.username, COALESCE(SUM(v.value), 0) AS vote_value
FROM posts p
LEFT JOIN posts_votes v ON p.id=v.post_id
JOIN users u ON p.user_id=u.id
WHERE p.category=1337
GROUP BY p.id, p.text, u.username
But now I want to also return a column that returns the result of SELECT COALESCE((SELECT value FROM posts_votes WHERE user_id=1234 AND post_id=n), 0) for each post_id n in the above query. What would be the best way to do this?
I think an additional LEFT JOIN is a reasonable approach:
SELECT p.id, p.text, u.username, COALESCE(SUM(v.value), 0) AS vote_value,
COALESCE(pv.value, 0) AS user_vote
FROM posts p JOIN
users u
ON p.user_id=u.id LEFT JOIN
posts_votes v
ON p.id = v.post_id LEFT JOIN
posts_votes pv
ON pv.user_id = 1234 AND pv.post_id = p.id
WHERE p.category = 1337
GROUP BY p.id, p.text, u.username, pv.value;
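Here is a quick sketch of the extra LEFT JOIN on sample data, run on SQLite through Python's sqlite3 for convenience. The rows are invented, category is stored as a plain integer because the categories table is not shown in the question, and the user_vote alias is just an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                    category INTEGER NOT NULL, text TEXT NOT NULL);
CREATE TABLE posts_votes (user_id INTEGER NOT NULL, post_id INTEGER NOT NULL,
                          value INTEGER NOT NULL, PRIMARY KEY (user_id, post_id));
INSERT INTO users VALUES (1, 'ann'), (1234, 'viewer'), (3, 'cid');
INSERT INTO posts VALUES (1, 1, 1337, 'first'), (2, 1, 1337, 'second'),
                         (3, 3, 42, 'other');
-- post 1: +1 from the viewer and +1 from cid; post 2: -1 from cid only
INSERT INTO posts_votes VALUES (1234, 1, 1), (3, 1, 1), (3, 2, -1);
""")

rows = conn.execute("""
    SELECT p.id, u.username,
           COALESCE(SUM(v.value), 0) AS vote_value,  -- total over all voters
           COALESCE(pv.value, 0)     AS user_vote    -- this one user's vote
    FROM posts p
    JOIN users u ON p.user_id = u.id
    LEFT JOIN posts_votes v  ON v.post_id = p.id
    LEFT JOIN posts_votes pv ON pv.post_id = p.id AND pv.user_id = :viewer
    WHERE p.category = :category
    GROUP BY p.id, u.username, pv.value
    ORDER BY p.id
""", {"viewer": 1234, "category": 1337}).fetchall()

print(rows)  # one row per post in the category, with total and own vote
```

Because (user_id, post_id) is the primary key of posts_votes, the pv join matches at most one row per post, so it does not inflate the SUM.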

Optimise many-to-many join

I have three tables: groups and people and groups_people which forms a many-to-many relationship between groups and people.
Schema:
CREATE TABLE groups (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE people (
id SERIAL PRIMARY KEY,
name TEXT,
join_date TIMESTAMP
);
CREATE TABLE groups_people (
group_id INT REFERENCES groups(id),
person_id INT REFERENCES people(id)
);
I want to query for the latest 10 people who recently joined the group with id = 1:
WITH person_ids AS (SELECT person_id FROM groups_people WHERE group_id = 1)
SELECT * FROM people WHERE id = ANY(SELECT person_id FROM person_ids)
ORDER BY join_date DESC LIMIT 10;
The query has to scan all the people in the group and sort them before selecting the first 10. That will be slow if the group contains many people.
Is there any way to work around it?
Schema (re-)design to allow the same person to join multiple groups
Since you mentioned that the relationship between groups and people
is many-to-many, I think you may want to move join_date to groups_people
(from people), because the same person can join different groups and each
such event has its own join_date.
So I would change the schema to
CREATE TABLE people (
id SERIAL PRIMARY KEY,
name TEXT --, -- change
-- join_date TIMESTAMP -- delete
);
CREATE TABLE groups_people (
group_id INT REFERENCES groups(id),
person_id INT REFERENCES people(id), -- change
join_date TIMESTAMP -- add
);
Query
select
p.id
, p.name
, gp.join_date
from
people as p
, groups_people as gp
where
p.id = gp.person_id
and gp.group_id=1
order by gp.join_date desc
limit 10
Disclaimer: The above query was written with MySQL in mind (the question was originally tagged with MySQL), but it only uses syntax that PostgreSQL accepts as well.
This seems much easier to write as a simple join with order by and limit:
select p.*
from people p join
groups_people gp
on p.id = gp.person_id
where gp.group_id = 1
order by p.join_date desc -- join_date is on people in the question's schema; use gp.join_date after the redesign above
limit 10; -- or fetch first 10 rows only
Try rewriting using EXISTS
SELECT *
FROM people p
WHERE EXISTS (SELECT 1
FROM groups_people ps
WHERE p.id = ps.person_id and group_id = 1)
ORDER BY join_date DESC
LIMIT 10;