Complicated min/max multi-table query - sql

I need to get the min and max score of group ids, but only if they are enabled:
cdu_group_sl: cdu_group_cc: cdu_group_ph:
-------------------- -------------------- --------------------
|id |name |enabled | |id |name |enabled | |id |name |enabled |
-------------------- -------------------- --------------------
|1 |sl_1 |1 | |1 |cc_1 |1 | |1 |ph_1 |0 |
|2 |sl_3 |1 | |2 |cc_2 |0 | |2 |ph_2 |1 |
|3 |sl_4 |1 | |3 |cc_3 |1 | |3 |ph_3 |1 |
-------------------- -------------------- --------------------
Scores are found in a separate table:
cdu_user_progress
----------------------------------
|id |group_type |group_id |score |
----------------------------------
|1 |sl |1 |50 |
|1 |cc |1 |10 |
|1 |ph |1 |20 |
|1 |sl |2 |80 |
|1 |sl |3 |20 |
|1 |cc |3 |30 |
|1 |sl |1 |40 |
|1 |ph |1 |50 |
|1 |cc |1 |40 |
|1 |ph |2 |90 |
----------------------------------
I need to get a max and min score for each type of group for only enabled groups (for each type):
---------------------------------------------
|group_type |group_id |min_score |max_score |
---------------------------------------------
|sl |1 |40 |50 |
|sl |2 |80 |80 |
|sl |3 |20 |20 |
|cc |1 |10 |40 |
|cc |3 |30 |30 |
|ph |1 |20 |50 |
|ph |2 |90 |90 |
---------------------------------------------
Any idea what the query might be??? So far I have:
SELECT * FROM cdu_user_progress
JOIN cdu_group_sl ON (cdu_group_sl.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'sl')
JOIN cdu_group_cc ON (cdu_group_cc.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'cc')
JOIN cdu_group_ph ON (cdu_group_ph.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'ph')
WHERE cdu_user_progress.uid = $student->uid
AND (cdu_user_progress.group_type = 'sl' AND cdu_group_sl.enabled = 1)
AND (cdu_user_progress.group_type = 'cc' AND cdu_group_cc.enabled = 1)
AND (cdu_user_progress.group_type = 'ph' AND cdu_group_ph.enabled = 1)
Probably completely wrong...

What about using a union to pick the groups you are interested in? Something like:
select scr.group_type, scr.group_id, min(score) min_score, max(score) max_score
from (
select id, 'sl' grp from cdu_group_sl where enabled = 1
union all
select id, 'cc' from cdu_group_cc where enabled = 1
union all
select id, 'ph' from cdu_group_ph where enabled = 1
) grps join cdu_user_progress scr
on grps.id = scr.group_id and grps.grp = scr.group_type
group by scr.group_type, scr.group_id
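To sanity-check the union approach, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module and runs the query. SQLite and the Python wrapper are assumptions made only to get a runnable example; the question does not name a database engine. Note that ph group 1 is disabled in the sample data, so it drops out of the result even though the question's expected output lists it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cdu_group_sl (id INT, name TEXT, enabled INT);
    CREATE TABLE cdu_group_cc (id INT, name TEXT, enabled INT);
    CREATE TABLE cdu_group_ph (id INT, name TEXT, enabled INT);
    CREATE TABLE cdu_user_progress (id INT, group_type TEXT, group_id INT, score INT);

    INSERT INTO cdu_group_sl VALUES (1,'sl_1',1),(2,'sl_3',1),(3,'sl_4',1);
    INSERT INTO cdu_group_cc VALUES (1,'cc_1',1),(2,'cc_2',0),(3,'cc_3',1);
    INSERT INTO cdu_group_ph VALUES (1,'ph_1',0),(2,'ph_2',1),(3,'ph_3',1);
    INSERT INTO cdu_user_progress VALUES
        (1,'sl',1,50),(1,'cc',1,10),(1,'ph',1,20),(1,'sl',2,80),(1,'sl',3,20),
        (1,'cc',3,30),(1,'sl',1,40),(1,'ph',1,50),(1,'cc',1,40),(1,'ph',2,90);
""")

# Collect the enabled (id, type) pairs from all three tables, then join
# them to the scores and aggregate per group.
rows = conn.execute("""
    SELECT scr.group_type, scr.group_id,
           MIN(scr.score) AS min_score, MAX(scr.score) AS max_score
    FROM (
        SELECT id, 'sl' AS grp FROM cdu_group_sl WHERE enabled = 1
        UNION ALL
        SELECT id, 'cc' FROM cdu_group_cc WHERE enabled = 1
        UNION ALL
        SELECT id, 'ph' FROM cdu_group_ph WHERE enabled = 1
    ) grps
    JOIN cdu_user_progress scr
      ON grps.id = scr.group_id AND grps.grp = scr.group_type
    GROUP BY scr.group_type, scr.group_id
    ORDER BY scr.group_type, scr.group_id
""").fetchall()

for row in rows:
    print(row)
```

The ORDER BY is only there to make the output deterministic; drop it if the order does not matter.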

The following is probably the fastest way to do this query. To optimize it, you should have an index on (id, enabled) in each of the three "sl", "cc", and "ph" tables:
select cup.*
from cdu_user_progress cup
where (cup.group_type = 'sl' and
exists (select 1
from cdu_group_sl sl
where sl.id = cup.group_id and
sl.enabled = 1
)
) or
(cup.group_type = 'cc' and
exists (select 1
from cdu_group_cc cc
where cc.id = cup.group_id and
cc.enabled = 1
)
) or
(cup.group_type = 'ph' and
exists (select 1
from cdu_group_ph ph
where ph.id = cup.group_id and
ph.enabled = 1
)
)
As a note, having three tables with the same structure is usually a sign of a poor database schema. These three tables should probably be combined into a single table, which would make this query much easier to write.

If you are just starting up this project, I would recommend refining your data structure. Based on what you showed, you could benefit from only one cdu_groups table with a reference to a new cdu_group_types table, and removing the group_type column from cdu_user_progress.
If this is an established project, where changing the structure would be too disruptive... then one of the other answers showing a query would be a better/easier fit.
Otherwise, you could simplify things with restructured tables and end up with a query like:
SELECT group_type,
group_id,
MIN(score) as min_score,
MAX(score) as max_score
FROM cdu_user_progress c
INNER JOIN cdu_groups g
ON c.group_id=g.id
INNER JOIN cdu_group_types t
ON g.group_type_id=t.id
WHERE enabled=1
GROUP BY group_type, group_id
This is shown, with expected results, in this SQLFiddle. With this structure you can add new group types whenever you want, and you also cut down on the number of tables and joins. The tables would look like this (simplified, with no foreign keys or other constraints):
CREATE TABLE cdu_user_progress
(id INT, group_id INT, score INT);
CREATE TABLE cdu_group_types
(id INT, group_type VARCHAR(3));
CREATE TABLE cdu_groups
(id INT, group_type_id INT, name VARCHAR(10), enabled BIT NOT NULL DEFAULT 1);
Granted moving data to a new structure may be a pain or not reasonable... but wanted to throw this out there as a possibility or just something to chew on.
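For anyone who wants to try the single-table layout without the fiddle, here is a rough sketch using Python's sqlite3 module. The engine choice, the INT type for enabled, and the renumbered group ids (1 to 9, so that ids are unique across types) are assumptions of mine and not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cdu_group_types (id INT, group_type VARCHAR(3));
    CREATE TABLE cdu_groups (id INT, group_type_id INT, name VARCHAR(10),
                             enabled INT NOT NULL DEFAULT 1);
    CREATE TABLE cdu_user_progress (id INT, group_id INT, score INT);

    INSERT INTO cdu_group_types VALUES (1,'sl'),(2,'cc'),(3,'ph');
    -- Hypothetical renumbering: one id space shared by all group types.
    INSERT INTO cdu_groups VALUES
        (1,1,'sl_1',1),(2,1,'sl_3',1),(3,1,'sl_4',1),
        (4,2,'cc_1',1),(5,2,'cc_2',0),(6,2,'cc_3',1),
        (7,3,'ph_1',0),(8,3,'ph_2',1),(9,3,'ph_3',1);
    -- Same scores as the question, pointing at the new group ids.
    INSERT INTO cdu_user_progress VALUES
        (1,1,50),(1,4,10),(1,7,20),(1,2,80),(1,3,20),
        (1,6,30),(1,1,40),(1,7,50),(1,4,40),(1,8,90);
""")

rows = conn.execute("""
    SELECT t.group_type, c.group_id,
           MIN(c.score) AS min_score, MAX(c.score) AS max_score
    FROM cdu_user_progress c
    INNER JOIN cdu_groups g ON c.group_id = g.id
    INNER JOIN cdu_group_types t ON g.group_type_id = t.id
    WHERE g.enabled = 1
    GROUP BY t.group_type, c.group_id
    ORDER BY c.group_id
""").fetchall()

for row in rows:
    print(row)
```

Adding a fourth group type is then one INSERT into cdu_group_types instead of a new table and another join.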

Related

In PostgreSQL, conditionally count rows

Background
I'm a novice Postgres user running a local server on a Windows 10 machine. I've got a dataset g that looks like this:
+--+---------+----------------+
|id|treatment|outcome_category|
+--+---------+----------------+
|a |1 |cardiovascular |
|a |0 |cardiovascular |
|b |0 |metabolic |
|b |0 |sensory |
|c |1 |NULL |
|c |0 |cardiovascular |
|c |1 |sensory |
|d |1 |NULL |
|d |0 |cns |
+--+---------+----------------+
The Problem
I'd like to get a count of outcome_category by outcome_category for those id who are "ever treated" -- defined as "id's who have any row where treatment=1".
Here's the desired result:
+----------------+---------+
|outcome_category| count |
+----------------+---------+
|cardiovascular | 3 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
It would be fine if the result had to contain metabolic, like so:
+----------------+---------+
|outcome_category|treatment|
+----------------+---------+
|cardiovascular | 3 |
|metabolic | 0 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
Obviously I don't need the rows to be in any particular order, though descending would be nice.
What I've tried
Here's a query I've written:
select treatment, outcome_category, sum(outcome_ct)
from (select max(treatment) as treatment,
outcome_category,
count(outcome_category) as outcome_ct
from g
group by outcome_category) as sub
group by outcome_category, sub.treatment;
But it's a mishmash result:
+---------+----------------+---+
|treatment|outcome_category|sum|
+---------+----------------+---+
|1 |cardiovascular |3 |
|1 |sensory |2 |
|0 |metabolic |1 |
|1 |NULL |0 |
|0 |cns |1 |
+---------+----------------+---+
I'm trying to identify the "ever exposed" id's using that first line in the subquery: select max(treatment) as treatment. But I'm not quite getting at the rest of it.
EDIT
I realized that the toy dataset g I originally gave you above doesn't correspond to the idiosyncrasies of my real dataset. I've updated g to reflect that many id's who are "ever treated" won't have a non-null outcome_category next to a row with treatment=1.
Interesting little problem. You can do:
select
outcome_category,
count(x.id) as count
from g
left join (
select distinct id from g where treatment = 1
) x on x.id = g.id
where outcome_category is not null
group by outcome_category
order by count desc
Result:
outcome_category count
----------------- -----
cardiovascular 3
sensory 1
cns 1
metabolic 0
See running example at db<>fiddle.
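If the fiddle is unavailable, the same idea can be reproduced locally. The sketch below uses Python's sqlite3 module instead of Postgres (an assumption made only so the example is self-contained) and renames the count alias to n to avoid any clash with the count function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE g (id TEXT, treatment INT, outcome_category TEXT)")
conn.executemany("INSERT INTO g VALUES (?, ?, ?)", [
    ("a", 1, "cardiovascular"), ("a", 0, "cardiovascular"),
    ("b", 0, "metabolic"), ("b", 0, "sensory"),
    ("c", 1, None), ("c", 0, "cardiovascular"), ("c", 1, "sensory"),
    ("d", 1, None), ("d", 0, "cns"),
])

# x holds the "ever treated" ids. COUNT(x.id) skips the NULLs produced by
# the left join, so rows belonging to never-treated ids add nothing.
counts = dict(conn.execute("""
    SELECT g.outcome_category, COUNT(x.id) AS n
    FROM g
    LEFT JOIN (SELECT DISTINCT id FROM g WHERE treatment = 1) x
      ON x.id = g.id
    WHERE g.outcome_category IS NOT NULL
    GROUP BY g.outcome_category
""").fetchall())

print(counts)
```

Switching the LEFT JOIN to an inner JOIN would drop metabolic from the result instead of reporting it with a count of 0.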
This would appear to be just a simple aggregation,
select outcome_category, Count(*) count
from t
where treatment=1
group by outcome_category
order by Count(*) desc
Demo fiddle

In SQL, query a table by transposing column results

Background
Forgive the title of this question, as I'm not really sure how to describe what I'm trying to do.
I have a SQL table, d, that looks like this:
+--+---+------------+------------+
|id|sex|event_type_1|event_type_2|
+--+---+------------+------------+
|a |m |1 |1 |
|b |f |0 |1 |
|c |f |1 |0 |
|d |m |0 |1 |
+--+---+------------+------------+
The Problem
I'm trying to write a query that yields the following summary of counts of event_type_1 and event_type_2 cut (grouped?) by sex:
+-------------+-----+-----+
| | m | f |
+-------------+-----+-----+
|event_type_1 | 1 | 1 |
+-------------+-----+-----+
|event_type_2 | 2 | 1 |
+-------------+-----+-----+
The thing is, this seems to involve some kind of transposition of the 2 event_type columns into rows of the query result that I'm not familiar with as a novice SQL user.
What I've tried
I've so far come up with the following query:
SELECT event_type_1, event_type_2, count(sex)
FROM d
group by event_type_1, event_type_2
But that only gives me this:
+------------+------------+-----+
|event_type_1|event_type_2|count|
+------------+------------+-----+
|1 |1 |1 |
|1 |0 |1 |
|0 |1 |2 |
+------------+------------+-----+
You can use a lateral join to unpivot the data. Then use conditional aggregation to calculate m and f:
select v.which,
count(*) filter (where d.sex = 'm') as m,
count(*) filter (where d.sex = 'f') as f
from d cross join lateral
(values (d.event_type_1, 'event_type_1'),
(d.event_type_2, 'event_type_2')
) v(val, which)
where v.val = 1
group by v.which;
Here is a db<>fiddle.
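On engines without LATERAL or the FILTER clause, the same unpivot-then-aggregate idea can be approximated with UNION ALL and SUM(CASE ...). This is a swapped-in variant, not the query above; the sketch runs it through Python's sqlite3 module, which is my choice for a self-contained example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE d (id TEXT, sex TEXT, event_type_1 INT, event_type_2 INT)"
)
conn.executemany("INSERT INTO d VALUES (?, ?, ?, ?)", [
    ("a", "m", 1, 1), ("b", "f", 0, 1), ("c", "f", 1, 0), ("d", "m", 0, 1),
])

# Unpivot: one output row per (person, event column), then count the
# rows where the event flag is 1, split by sex.
rows = conn.execute("""
    SELECT v.which,
           SUM(CASE WHEN v.sex = 'm' THEN 1 ELSE 0 END) AS m,
           SUM(CASE WHEN v.sex = 'f' THEN 1 ELSE 0 END) AS f
    FROM (
        SELECT sex, event_type_1 AS val, 'event_type_1' AS which FROM d
        UNION ALL
        SELECT sex, event_type_2, 'event_type_2' FROM d
    ) v
    WHERE v.val = 1
    GROUP BY v.which
    ORDER BY v.which
""").fetchall()

for row in rows:
    print(row)
```

The trade-off is that UNION ALL scans d once per event column, whereas the lateral version reads each row once.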

SQL COUNT ignoring a column

I have a question about a SQL query.
I have the following result from a query:
select distinct eb.event_type_id, eb.status from eid.event_backlog eb order by 1
|event_type_id|status |
|-------------|----------|
|1 |SUCCESS |
|2 |SUCCESS |
|2 |ERROR |
|3 |SUCCESS |
|3 |ERROR |
|4 |SUCCESS |
I would like to obtain this result, counting the distinct statuses for each event type:
|event_type_id|count |
|-------------|-------|
|1 |1 |
|2 |2 |
|3 |2 |
|4 |1 |
but the only way that I see to obtain this result is doing the following query:
select
eb.event_type_id,
count(1)
from
(
select
distinct eb.event_type_id, eb.status
from
eid.event_backlog eb
order by
1) eb
group by
eb.event_type_id
I would rather not use a nested query. Is there another way to obtain what I want?
Simply use count(distinct eb.status), i.e.:
select
eb.event_type_id,
count(distinct eb.status)
from eid.event_backlog eb
group by
eb.event_type_id
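A quick way to convince yourself that count(distinct ...) collapses repeated statuses is to run it on a few rows. The sketch below uses Python's sqlite3 module; the raw rows are hypothetical (the question only shows the distinct pairs), and the eid schema prefix is dropped because SQLite has no schema of that name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_backlog (event_type_id INT, status TEXT)")
# Hypothetical raw rows with repeats; their distinct pairs match the question.
conn.executemany("INSERT INTO event_backlog VALUES (?, ?)", [
    (1, "SUCCESS"), (1, "SUCCESS"),
    (2, "SUCCESS"), (2, "ERROR"), (2, "ERROR"),
    (3, "SUCCESS"), (3, "ERROR"),
    (4, "SUCCESS"), (4, "SUCCESS"), (4, "SUCCESS"),
])

rows = conn.execute("""
    SELECT eb.event_type_id, COUNT(DISTINCT eb.status) AS status_count
    FROM event_backlog eb
    GROUP BY eb.event_type_id
    ORDER BY eb.event_type_id
""").fetchall()

for row in rows:
    print(row)
```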

Return rows only if matches all list values

Let's say I have a table customers:
-----------------
|id|name|country|
|1 |Joe |Mexico |
|2 |Mary|USA |
|3 |Jim |France |
-----------------
And a table languages:
-------------
|id|language|
|1 |English |
|2 |Spanish |
|3 |French |
-------------
And a table cust_lang:
------------------
|id|custId|langId|
|1 |1 |1 |
|2 |1 |2 |
|3 |2 |1 |
|4 |3 |3 |
------------------
Given a list: ["English", "Spanish", "Portuguese"]
Using a WHERE IN for the list, it will still return customers with ids 1,2 because they match "English" and "Spanish".
However, the results should be 0 rows returned since no customer matches ALL three terms.
I only want the customer ids to return if it matches the cust_lang table.
For instance, Given a list: ["English", "Spanish"]
I would want the results to be customer Id 1, since he alone speaks both languages.
EDIT: @GordonLinoff - That works!!
Now to make it more complex, what's wrong with this additional related query:
Let's assume I also have a table degrees:
-----------
|id|degree|
|1 |PHD |
|2 |BA |
|3 |MD |
-----------
A corresponding join table cust_deg:
------------------
|id|custId|degId |
|1 |1 |1 |
|2 |1 |2 |
|3 |2 |1 |
|4 |3 |3 |
------------------
The following query does not work. However, it is two of the same queries combined. The results should be only rows that match both lists, instead of the one list.
SELECT * FROM customers C
WHERE C.id IN (
SELECT CL.langId FROM cust_lang CL
JOIN languages L on CL.langId = L.id
WHERE L.language IN ("English", "Spanish")
GROUP BY CL.langID
HAVING COUNT(*) = 2)
AND C.id IN (
SELECT CD.custId FROM cust_deg CD
JOIN degrees D ON CD.degID = D.id
WHERE D.degree IN ("PHD", "BA")
GROUP BY CD.custId HAVING COUNT(*) = 2));
EDIT2: I think i fixed it. I accidentally had an extra select statement in there.
You can do this with group by and having:
select cl.custid
from cust_lang cl join
languages l
on cl.langid = l.id
where l.language in ('English', 'Spanish', 'Portuguese')
group by cl.custid
having count(*) = 3;
If, for example, you only wanted to check for two languages, then you need only change your WHERE ... IN and HAVING conditions, e.g.:
where l.language in ('English', 'Spanish')
and
having count(*) = 2
This is pretty much Gordon's answer but it has the benefit of being a little more flexible on the language list and it doesn't require any change to the having clause.
with my_languages as (
select id as langId from languages
where language in ('English', 'Spanish')
)
select cl.custId
from cust_lang as cl inner join my_languages as l on l.langId = cl.langId
group by cl.custId
having count(*) = (select count(*) from my_languages)
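When the language list comes from application code, the target count can be derived from the list itself. Below is a minimal sketch using Python's sqlite3 module; the helper name customers_speaking_all and the choice of SQLite are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INT, name TEXT, country TEXT);
    CREATE TABLE languages (id INT, language TEXT);
    CREATE TABLE cust_lang (id INT, custId INT, langId INT);
    INSERT INTO customers VALUES (1,'Joe','Mexico'),(2,'Mary','USA'),(3,'Jim','France');
    INSERT INTO languages VALUES (1,'English'),(2,'Spanish'),(3,'French');
    INSERT INTO cust_lang VALUES (1,1,1),(2,1,2),(3,2,1),(4,3,3);
""")

def customers_speaking_all(conn, wanted):
    """Return ids of customers who speak every language in `wanted`."""
    wanted = sorted(set(wanted))  # duplicates would inflate the target count
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT cl.custId
        FROM cust_lang cl
        JOIN languages l ON cl.langId = l.id
        WHERE l.language IN ({placeholders})
        GROUP BY cl.custId
        HAVING COUNT(DISTINCT l.id) = ?
        ORDER BY cl.custId
    """
    # The last parameter is the number of requested languages.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(customers_speaking_all(conn, ["English", "Spanish"]))
print(customers_speaking_all(conn, ["English", "Spanish", "Portuguese"]))
```

Comparing against the length of the requested list, not against the number of matching rows in languages, means that a language missing from the languages table (Portuguese here) yields no customers, which is what the question asks for.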

selecting records matching condition A and at least X matching B

I have table of data as follows
|table_id|ref_table_id|is_used| date |url|
|--------+------------+-------+-------------------+---|
|1 |1 | | |abc|
|2 |1 | |2016-01-01 00:00:00|abc|
|3 |1 |0 | |abc|
|4 |1 |1 | |abc|
|5 |2 | | | |
|6 |2 | |2016-01-01 00:00:00|abc|
|7 |2 |1 | |abc|
|8 |2 |1 |2016-01-01 00:00:00|abc|
|9 |2 |1 |2016-01-01 00:00:00|abc|
|10 |3 | | | |
|11 |3 | |2016-01-01 00:00:00|abc|
|12 |3 |0 | | |
|13 |3 |0 | | |
|14 |3 |0 |2016-01-01 00:00:00| |
|15 |3 |1 |2016-01-01 00:00:00|abc|
...
|int |int |boolean|timestamp |varchar|
As you can see, there is no rule governing which of the columns is_used, date, and url are null and which are filled.
Now I want to get the distinct ref_table_id values that meet both of these conditions:
there is at least 1 row that is not used and has an empty date and url
there are fewer than X rows that are not used and have either date or url filled
The table has many rows (~7 million), and a single ref_table_id group can range from 50 rows to 600k rows.
I tried to create this select, which runs for more than 2 seconds.
select
distinct on (ref_table_id) t1.ref_table_id,
count(1) as my_count
from my_table t1 inner join (
select distinct t2.ref_table_id from my_table t2
where t2.is_used is not true -- null or false
and t2.url is null
and t2.date is null
group by t2.ref_table_id
) tjoin on t1.ref_table_id = tjoin.ref_table_id
where t1.is_used is not true
and (t1.date is not null
or t1.url is not null)
group by t1.ref_table_id
having count(1) < X
order by 1,2;
Can I rewrite it using INTERSECT, VIEW or other db features so that it would be faster?
This sounds like aggregation with a having clause:
select ref_table_id
from my_table t
group by ref_table_id
having sum(case when is_used = 0 and date is null and url is null
then 1 else 0 end) > 0 and
sum(case when is_used = 0 and (date is not null or url is not null)
then 1 else 0 end) >= N;
This checks explicitly for is_used to be 0 as the meaning of "not used". I'm not sure what the blanks represent, so the logic may need to be tweaked.
As a note, you can simplify the query by moving the common condition on is_used into a WHERE clause:
select ref_table_id
from my_table t
where is_used = 0 -- or is_used is NULL ??
group by ref_table_id
having sum(case when date is null and url is null
then 1 else 0 end) > 0 and
sum(case when (date is not null or url is not null)
then 1 else 0 end) >= N;
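To see how the two conditional sums behave on the question's sample rows, here is a small sketch using Python's sqlite3 module. Several things in it are assumptions: SQLite stands in for the real database, blank cells are loaded as NULL, a NULL is_used counts as "not used" (following the question's "is not true" comment), and the helper name matching_refs is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE my_table
    (table_id INT, ref_table_id INT, is_used INT, date TEXT, url TEXT)""")
D = "2016-01-01 00:00:00"
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", [
    (1, 1, None, None, "abc"), (2, 1, None, D, "abc"), (3, 1, 0, None, "abc"),
    (4, 1, 1, None, "abc"),
    (5, 2, None, None, None), (6, 2, None, D, "abc"), (7, 2, 1, None, "abc"),
    (8, 2, 1, D, "abc"), (9, 2, 1, D, "abc"),
    (10, 3, None, None, None), (11, 3, None, D, "abc"), (12, 3, 0, None, None),
    (13, 3, 0, None, None), (14, 3, 0, D, None), (15, 3, 1, D, "abc"),
])

def matching_refs(conn, n):
    """ref_table_ids with at least one empty unused row and >= n filled ones."""
    sql = """
        SELECT ref_table_id
        FROM my_table
        WHERE COALESCE(is_used, 0) = 0          -- NULL or 0 means "not used"
        GROUP BY ref_table_id
        HAVING SUM(CASE WHEN date IS NULL AND url IS NULL
                        THEN 1 ELSE 0 END) > 0
           AND SUM(CASE WHEN date IS NOT NULL OR url IS NOT NULL
                        THEN 1 ELSE 0 END) >= ?
        ORDER BY ref_table_id
    """
    return [row[0] for row in conn.execute(sql, (n,))]

for n in (1, 2, 3):
    print(n, matching_refs(conn, n))
```

If the requirement is really "fewer than X" as in the question body, and not "at least X" as in the title, only the comparison operator on the second sum needs to change.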