Postgres - how to get proper count with join - sql

Sorry as a newcomer to sql (postgres in this case) I can't tease out the right answer from similar questions. So here's an example
I have two tables:
records:
id | status
----------------
1 | open
2 | open
3 | close
4 | open
events:
id | record_id | role | something_else
---------------------------------------------
1 | 2 | admin | stringA
2 | 1 | user | stringB
3 | 4 | admin | stringC
4 | 2 | admin | stringD
5 | 2 | admin | stringE
6 | 2 | user | stringF
7 | 3 | user | stringG
I basically would like to have a count(status) that reflects how many records have at least one events.role = 'admin' in the events table
in the above example it would be:
status | count
---------------
open | 2
close | 0
Any help much appreciated!

No need for nested queries. You can just use conditional aggregation:
select r.status, count(distinct r.id) filter(where e.role = 'admin') cnt
from records r
inner join events e on e.record_id = r.id
group by r.status
Demo on DB Fiddle:
status | cnt
:----- | --:
close | 0
open | 2

I basically would like to have a count(status) that reflects how many records have at least one events.role = 'admin' in the events table.
I would suggest:
select r.status, count(*) filter (where has_admin)
from (select r.*,
(exists (select 1 from events e where e.record_id = r.id and e.role = 'admin')) as has_admin
from records r
) r
group by r.status;
For your small data sample, the difference between exists and a join doesn't matter. With more data, though, the exists does not multiply the number of rows, which should make it a bit faster. Also, this guarantees that all statuses are included, even those with no events.
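To see the last point concretely, here is a minimal sketch that runs both approaches on the question's sample data using Python's built-in sqlite3 as a stand-in for Postgres. The extra record 5 with status 'pending' and no events is a hypothetical addition, and SUM/COUNT over a CASE is used instead of FILTER so that it also runs on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE events (id INTEGER PRIMARY KEY, record_id INTEGER, role TEXT);
    -- record 5 ('pending') is hypothetical: it has no events at all
    INSERT INTO records VALUES (1,'open'),(2,'open'),(3,'close'),(4,'open'),(5,'pending');
    INSERT INTO events VALUES
        (1,2,'admin'),(2,1,'user'),(3,4,'admin'),(4,2,'admin'),
        (5,2,'admin'),(6,2,'user'),(7,3,'user');
""")

# Inner join: a status whose records have no events produces no rows at all.
join_counts = dict(conn.execute("""
    SELECT r.status,
           COUNT(DISTINCT CASE WHEN e.role = 'admin' THEN r.id END)
    FROM records r
    JOIN events e ON e.record_id = r.id
    GROUP BY r.status
"""))

# EXISTS: one row per record, so every status is kept.
exists_counts = dict(conn.execute("""
    SELECT r.status,
           SUM(CASE WHEN EXISTS (SELECT 1 FROM events e
                                 WHERE e.record_id = r.id AND e.role = 'admin')
                    THEN 1 ELSE 0 END)
    FROM records r
    GROUP BY r.status
"""))

print(join_counts)
print(exists_counts)
```

Both versions agree on open and close, but only the exists version reports the 'pending' status (with a count of 0).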

Related

What's the best way to create empty/default rows for missing aggregations

I have a table I want to group over two levels. As the output, I need all combinations of the grouping values, so that I end up with zeros for combinations that do not exist in the data. For example, say I have this table:
+------+------+
| user | page |
+------+------+
| a | 1 |
| a | 1 |
| a | 2 |
| b | 2 |
| b | 3 |
+------+------+
I'm after output like this:
+------+------+--------+
| user | page | visits |
+------+------+--------+
| a | 1 | 2 |
| a | 2 | 1 |
| a | 3 | 0 |
| b | 1 | 0 |
| b | 2 | 1 |
| b | 3 | 1 |
+------+------+--------+
I can achieve this with the following query, but it seems rather heavy handed:
WITH
users AS (SELECT distinct(user) FROM sometable),
pages AS (SELECT distinct(page) FROM sometable),
users_pages_empty AS (SELECT * FROM users CROSS JOIN pages),
users_pages_full AS (SELECT user, page, count(*) as visits FROM sometable GROUP BY user, page)
SELECT e.user, e.page, coalesce(f.visits, 0) as visits
FROM users_pages_empty e
LEFT JOIN users_pages_full f ON e.user=f.user AND e.page=f.page
I happen to be using AWS Athena, but I think this is more a generic SQL question than an Athena question.
The performance of this query is fine, it's more the readability/complexity I'm not happy with.
Use a cross join to generate the rows and a left join to bring in the existing rows and aggregate:
select u.user, p.page, count(s.user)
from (select distinct user from sometable) u cross join
(select distinct page from sometable) p left join
sometable s
on s.user = u.user and s.page = p.page
group by u.user, p.page
order by u.user, p.page;
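As a quick check of the cross join plus left join approach, here is a small sketch on the question's sample table. It uses Python's built-in sqlite3 rather than Athena (an assumption made only so the example runs anywhere), and the column name user is double-quoted because it is a reserved word in some engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sometable ("user" TEXT, page INTEGER);
    INSERT INTO sometable VALUES ('a',1),('a',1),('a',2),('b',2),('b',3);
""")

# The cross join builds every (user, page) pair; the left join attaches the
# real visits, and COUNT(s."user") skips the NULLs of unmatched pairs.
rows = conn.execute("""
    SELECT u."user", p.page, COUNT(s."user") AS visits
    FROM (SELECT DISTINCT "user" FROM sometable) u
    CROSS JOIN (SELECT DISTINCT page FROM sometable) p
    LEFT JOIN sometable s
           ON s."user" = u."user" AND s.page = p.page
    GROUP BY u."user", p.page
    ORDER BY u."user", p.page
""").fetchall()

for row in rows:
    print(row)
```

Counting a column from the left-joined table, rather than count(*), is what turns the missing combinations into 0 instead of 1.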

Postgres query IN query

Is it possible in Postgres to determine if at least one result of query 1 is inside query 2 results?
For example:
SELECT * FROM items
WHERE
(SELECT id FROM users) IN (SELECT user_id FROM user_items WHERE item_id = 1)
I know that this query can be a nonsense, I'm just asking how to do that check in the where clause. In my real query (more complex), I'm getting:
(Postgrex.Error) ERROR 21000 (cardinality_violation): more than one row returned by a subquery used as an expression
if there is more than one result from query1 (query1 IN query2)
EDIT
select user_id
from notification_token n
join notification_folder f on n.user_id = f.user_id
where ((SELECT tag_id FROM notification_folder_tag WHERE notification_folder_id = f.id) IN (SELECT tag_id FROM event_tag WHERE event_id = 1))
tables:
notification_token
| user_id | notification_token |
--------------------------------------------------
| 1 | token1 |
| 2 | token2 |
| 3 | token3 |
notification_folder
| user_id | data |
--------------------------------------------------
| 1 | "useless string" |
notification_folder_tag
| notification_folder_id | tag_id |
--------------------------------------------------
| 1 | 1 |
| 1 | 2 |
| 2 | 5 |
event_tag
| event_id | tag_id |
--------------------------------------------------
| 1 | 1 |
| 2 | 2 |
| 3 | 8 |
The result that I want is user_id 1 from notification_token.
The WHERE condition should be true because at least one tag_id from the left side of the IN (result 1, 2) is contained in the right side of the IN (result 1).
However, I get the error whenever the left side of the IN returns more than one row. It works properly with just one row.
Try this:
SELECT * FROM items
WHERE
EXISTS (SELECT 1 FROM users u
WHERE u.id IN (SELECT user_id FROM user_items WHERE item_id = 1));
EXISTS is true as soon as one row qualifies, so the subquery on the left may return any number of rows without raising the cardinality error.
You seem to want items anyone who ordered item_1 has also ordered. If this interpretation is correct, then here is one way to write the query:
select distinct i.*
from items i join
user_items ui
on ui.item_id = i.item_id
where ui.user_id in (select ui2.user_id
from user_items ui2
where ui2.item_id = 1
);
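For the tables in the EDIT, the "at least one value in common" check can be written as a correlated EXISTS over a join of the two tag tables. This is only a sketch: it runs on Python's built-in sqlite3 instead of Postgres, the function name users_to_notify is made up for illustration, and the id column on notification_folder is an assumption (the listing in the question omits it, although the query refers to f.id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notification_token (user_id INTEGER, notification_token TEXT);
    -- the id column is assumed; the question's listing does not show it
    CREATE TABLE notification_folder (id INTEGER, user_id INTEGER, data TEXT);
    CREATE TABLE notification_folder_tag (notification_folder_id INTEGER, tag_id INTEGER);
    CREATE TABLE event_tag (event_id INTEGER, tag_id INTEGER);
    INSERT INTO notification_token VALUES (1,'token1'),(2,'token2'),(3,'token3');
    INSERT INTO notification_folder VALUES (1, 1, 'useless string');
    INSERT INTO notification_folder_tag VALUES (1,1),(1,2),(2,5);
    INSERT INTO event_tag VALUES (1,1),(2,2),(3,8);
""")

def users_to_notify(event_id):
    """Users owning a folder that shares at least one tag with the event."""
    rows = conn.execute("""
        SELECT DISTINCT n.user_id
        FROM notification_token n
        JOIN notification_folder f ON f.user_id = n.user_id
        WHERE EXISTS (SELECT 1
                      FROM notification_folder_tag ft
                      JOIN event_tag et ON et.tag_id = ft.tag_id
                      WHERE ft.notification_folder_id = f.id
                        AND et.event_id = ?)
    """, (event_id,)).fetchall()
    return [user_id for (user_id,) in rows]

print(users_to_notify(1))
print(users_to_notify(3))
```

Because EXISTS only asks whether any row matches, it does not matter how many tags the folder or the event has.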

How to join all the COUNT returned in multiple rows to a single row (POSTGRESQL)

I have this query, and the result comes back in multiple rows, each one an event with its respective count.
SELECT u.name, l.event, COUNT (l.event)
FROM log AS l LEFT JOIN user AS u ON u.id = l.userid
GROUP BY u.name, l.event
The result is like this:
------------------------------------------
| user | event | count |
------------------------------------------
| user_1 | event_1 | 12 |
| user_1 | event_2 | 6 |
| user_1 | event_3 | 9 |
| user_2 | event_1 | 16 |
| ... | ... | ... |
The problem is that I need these counts as columns in a single result row, where each row represents a single user, something like this:
--------------------------------------------------------------
| user | event_1 | event_2 | event_3 |
--------------------------------------------------------------
| user_1 | 12 | 6 | 9 |
| user_2 | 16 | 0 | 13 |
| ... | ... | ... | ... |
Maybe I can do this with a select query? With some kind of function that does a loop or something similar?
Thank you!
EDIT 1: I do not know the events in advance, so I cannot write them in the code.
EDIT 2: I see that there is no trivial way to do this with a simple SQL query, so I used Python with pandas instead. Take a look at How to convert a column of string to numerical?
If you know the events, you can use conditional aggregation:
SELECT name,
MAX(CASE WHEN event = 'event_1' THEN cnt ELSE 0 END) as event_1,
MAX(CASE WHEN event = 'event_2' THEN cnt ELSE 0 END) as event_2,
MAX(CASE WHEN event = 'event_3' THEN cnt ELSE 0 END) as event_3
FROM (SELECT u.name, l.event, COUNT(l.event) as cnt
FROM user u LEFT JOIN
log l
ON u.id = l.userid
GROUP BY u.name, l.event
) ul
GROUP BY name;
I switched the LEFT JOIN. It seems more likely that you want to keep all the users, even if there are no matching log messages.
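Try this and see if it works. Note that PIVOT with bracketed identifiers is SQL Server (T-SQL) syntax; PostgreSQL has no PIVOT operator.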
Select [event_1],[event_2],[event_3] from
(SELECT u.name, l.event FROM log AS l LEFT JOIN user AS u ON u.id = l.userid ) s1
PIVOT (COUNT(l.event) FOR u.name in ([event_1],[event_2],[event_3])) as p1
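When the event names are not known up front (EDIT 1 in the question), one option is to keep the SQL as the plain grouped count and do the pivot in client code. The following is a rough sketch of that idea in plain Python with sqlite3 standing in for Postgres. The sample rows are invented for illustration, and the variable name pivot is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE log (userid INTEGER, event TEXT);
    -- invented sample rows, not taken from the question
    INSERT INTO "user" VALUES (1,'user_1'),(2,'user_2');
    INSERT INTO log VALUES
        (1,'event_1'),(1,'event_1'),(1,'event_2'),
        (2,'event_1'),(2,'event_3');
""")

# One row per (user, event) pair, as in the question's original query.
rows = conn.execute("""
    SELECT u.name, l.event, COUNT(*) AS cnt
    FROM log l
    JOIN "user" u ON u.id = l.userid
    GROUP BY u.name, l.event
""").fetchall()

# Discover the event names from the data, then give every user a full
# set of columns defaulting to 0.
events = sorted({event for _, event, _ in rows})
pivot = {}
for name, event, cnt in rows:
    pivot.setdefault(name, dict.fromkeys(events, 0))[event] = cnt

for name, counts in sorted(pivot.items()):
    print(name, counts)
```

The design choice here is that SQL result sets have a fixed column list, so an unknown set of events has to be resolved either by generating the query dynamically or, as sketched above, after fetching.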

how to select unique records from a table based on a column which has distinct values in another column

I have below table SUBJ_SKILLS which has records like
TCHR_ID | LINE_NBR | SUBJ | SUBJ_TYPE
--------| ------- | ---------- | ----------
1 | 1 | Maths | R
1 | 2 | 101 | U
2 | 1 | BehaviourialTech | U
3 | 2 | Maths | R
4 | 1 | RegionalLANG | U
5 | 3 | ForeignLANG | U
5 | 4 | Maths | R
6 | 2 | Science | R
7 | 1 | 101 | U
7 | 3 | Physics | R
..
..
I am trying to retrieve records like below (i.e. single teacher who taught multiple different subjects)
TCHR_ID | LINE_NBR | SUBJ | SUBJ_TYPE
--------| ------- | ---------- | ----------
5 | 3 | ForeignLANG | U
5 | 4 | Maths | R
7 | 1 | 101 | U
7 | 3 | Physics | R
1 | 1 | Maths | R
1 | 2 | 101 | U
Here, line numbers are unique and are never renumbered. For example, TCHR_ID 5 once taught Physics as LINE_NBR = 1, but that row was removed later, so the remaining rows keep their original LINE_NBR values.
I also have a lookup table (SUBJ_LKUP) for subjects and their categories/types, shown below ('R' for Regular subject and 'U' for Unique subject):
SUBJ | SUBJ_TYPE
----------------- | ------------
Maths | R
Physics | R
ForeignLANG | U
101 | U
Science | R
BehaviourialTech | U
RegionalLANG | U
My approach was to create a table holding the teachers that have 2 records, and then run another query against the base table (SUBJ_SKILLS) and the new table to filter out the distinct records. I came up with the queries below.
Query-1:
create table tchr_with_2_subj as select SS.TCHR_ID
from SUBJ_SKILLS SS, SUBJ_LKUP SL
where SS.SUBJ = SL.SUBJ
and SL.SUBJ_TYPE IN ('R', 'U') AND SS.TCHR_ID IN
(select SS.TCHR_ID from SUBJ_SKILLS SS)
GROUP BY SS.TCHR_ID HAVING COUNT(*) = 2)
Query-2:
select SS.TCHR_ID from SUBJ_SKILLS SS, tchr_with_2_subj tw2s
where SS.TCHR_ID = tw2s.TCHR_ID
GROUP BY SS.TCHR_ID,SS.SUBJ_TYPE HAVING COUNT(*) > 1)
Question:
1) The 'IN' condition in Query-1 is causing problems and pulling wrong records.
2) Is there a better way to pull the matching records using a single query (i.e. instead of creating a table)?
Could someone please help me with this?
For the answer to your original question, I would use window functions:
select ss.*
from (select ss.*,
min(subj) over (partition by tchr_id) as mins,
max(subj) over (partition by tchr_id) as maxs
from SUBJ_SKILLS ss
) ss
where mins <> maxs;
It is unclear how the subject type fits in, but if you need to include that, similar logic will work.
Your second table can be obtained from your first table with:
select ss.*
from
subj_skills as ss
inner join (
select tchr_id
from subj_skills
group by tchr_id
having count(*) > 1
) as mult on mult.tchr_id=ss.tchr_id;
I'd use analytic functions here, something like:
select tchr_id, line_nbr, subj, SUBJ_TYPE
from (select count(distinct subj) over (partition by tchr_id) as grp_cnt,
s.*
from subj_skills s)
where grp_cnt > 1
If you need to filter out invalid records, you can do it in the inner query. If a teacher cannot teach the same subject multiple times (the req 'multiple different subjects' can be translated to 'multiple subjects'), then I'd rather use count(*) instead of count(distinct subj).
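The difference between count(*) and count(distinct subj) only shows up when a teacher has the same subject on more than one line. Here is a small sketch of that, using Python's built-in sqlite3 and a plain GROUP BY ... HAVING subquery in place of the window functions. Teacher 8, who has Maths twice, is a hypothetical addition to the question's data, and the helper name teachers is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subj_skills (tchr_id INTEGER, line_nbr INTEGER, subj TEXT, subj_type TEXT);
    INSERT INTO subj_skills VALUES
        (1,1,'Maths','R'),(1,2,'101','U'),(2,1,'BehaviourialTech','U'),
        (3,2,'Maths','R'),(4,1,'RegionalLANG','U'),(5,3,'ForeignLANG','U'),
        (5,4,'Maths','R'),(6,2,'Science','R'),(7,1,'101','U'),(7,3,'Physics','R'),
        -- hypothetical teacher with the same subject on two lines
        (8,1,'Maths','R'),(8,2,'Maths','R');
""")

def teachers(having):
    """Distinct teacher ids whose rows survive the given HAVING condition.

    `having` is a fixed string written in this script, never user input.
    """
    rows = conn.execute(f"""
        SELECT DISTINCT ss.tchr_id
        FROM subj_skills ss
        JOIN (SELECT tchr_id
              FROM subj_skills
              GROUP BY tchr_id
              HAVING {having}) m ON m.tchr_id = ss.tchr_id
        ORDER BY ss.tchr_id
    """).fetchall()
    return [tchr_id for (tchr_id,) in rows]

by_distinct_subj = teachers("COUNT(DISTINCT subj) > 1")
by_row_count = teachers("COUNT(*) > 1")
print(by_distinct_subj)
print(by_row_count)
```

On the question's own rows the two conditions return the same teachers (1, 5 and 7); only the hypothetical teacher 8 separates them.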

Select all users who wrote a minimum amount of messages in a fixed time frame

Table user_message:
+----+---------+-------+------------+
| id | from_id | to_id | time_stamp |
+----+---------+-------+------------+
| 1 | 1 | 2 | 1414700000 |
| 2 | 2 | 1 | 1414700100 |
| 3 | 3 | 1 | 1414701000 |
| 4 | 3 | 2 | 1414701001 |
| 5 | 3 | 4 | 1414701002 |
| 6 | 1 | 3 | 1414701100 |
+----+---------+-------+------------+
I am now trying to get all users who wrote a minimum amount of messages, let's say 3, to other users in a fixed time frame, let's say 5 seconds. As in this example, I'd like to get a result looking similar to this:
+---------+-------+
| from_id | count |
+---------+-------+
| 3 | 3 |
+---------+-------+
The idea of this is to check the messages for spam. A nice bonus would be to only take messages into account that share the same content.
The following uses a join for this purpose:
select um.*, count(*) as cnt
from user_message um join
user_message um2
on um.from_id = um2.from_id and
um2.time_stamp between um.time_stamp and um.time_stamp + 4 -- 5-second window, as asked
group by um.id
having count(*) >= 3;
For performance, you would want an index on user_message(from_id, time_stamp). Even with the index, if you have a large-ish table, the performance might not be so great.
EDIT:
Actually, another way to write this that might be more efficient is:
select um.*,
(select count(*)
from user_message um2
where um.from_id = um2.from_id and
um2.time_stamp between um.time_stamp and um.time_stamp + 4 -- 5-second window, as asked
) as cnt
from user_message um
having cnt >= 3;
This uses a MySQL extension that allows having in a non-aggregation query.
For every message (u1), find all messages (u2) sent by the same user in that second or the four previous seconds. Keep those u1 that have at least 3 u2. Finally, group by from_id to show one record per from_id with the maximum number of messages sent within any such window.
select from_id, max(cnt) as max_count
from
(
select u1.id, u1.from_id, count(*) as cnt
from user_message u1
join user_message u2
on u2.from_id = u1.from_id
-- and u2.content = u1.content
and u2.time_stamp between u1.time_stamp - 4 and u1.time_stamp
group by u1.id, u1.from_id
having count(*) >= 3
) as init
group by from_id;
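To check the window logic against the question's sample rows, here is a minimal sketch that runs this last query through Python's built-in sqlite3 (an assumption made for portability; the SQL itself is unchanged apart from dropping the commented-out content condition).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_message (id INTEGER PRIMARY KEY, from_id INTEGER,
                               to_id INTEGER, time_stamp INTEGER);
    INSERT INTO user_message VALUES
        (1,1,2,1414700000),(2,2,1,1414700100),(3,3,1,1414701000),
        (4,3,2,1414701001),(5,3,4,1414701002),(6,1,3,1414701100);
""")

# For each message u1, count the same sender's messages u2 in the 5-second
# window ending at u1; keep windows with at least 3 messages, then report
# the largest window per sender.
spammers = conn.execute("""
    SELECT from_id, MAX(cnt) AS max_count
    FROM (SELECT u1.id, u1.from_id, COUNT(*) AS cnt
          FROM user_message u1
          JOIN user_message u2
            ON u2.from_id = u1.from_id
           AND u2.time_stamp BETWEEN u1.time_stamp - 4 AND u1.time_stamp
          GROUP BY u1.id, u1.from_id
          HAVING COUNT(*) >= 3) AS init
    GROUP BY from_id
""").fetchall()

print(spammers)
```

Only message 5 has three messages from the same sender inside its window (messages 3, 4 and 5), so user 3 is the single row returned.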