Conditional count of rows where at least one peer qualifies

Background
I'm a novice SQL user. Using PostgreSQL 13 on Windows 10 locally, I have a table t:
+--+---------+-------+
|id|treatment|outcome|
+--+---------+-------+
|a |1 |0 |
|a |1 |1 |
|b |0 |1 |
|c |1 |0 |
|c |0 |1 |
|c |1 |1 |
+--+---------+-------+
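For reproducibility, here is a minimal setup for the queries below (hypothetical DDL, not from the original post):
create table t (id text, treatment int, outcome int);
insert into t values
  ('a', 1, 0), ('a', 1, 1), ('b', 0, 1),
  ('c', 1, 0), ('c', 0, 1), ('c', 1, 1);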
The Problem
I didn't explain myself well initially, so I've rewritten the goal.
Desired result:
+-----------------------+-----+
|ever treated |count|
+-----------------------+-----+
|0 |1 |
|1 |3 |
+-----------------------+-----+
First, identify the ids that have ever been treated, where being "ever treated" means having any row with treatment = 1.
Second, count rows with outcome = 1 for each of those two groups. In my original table, the ids who are "ever treated" have a total of 3 rows with outcome = 1, and the "never treated", so to speak, have 1 row with outcome = 1.
What I've tried
I can get much of the way there, I think, with something like this:
select treatment, count(outcome)
from t
group by treatment;
But that only gets me this result:
+---------+-----+
|treatment|count|
+---------+-----+
|0 |2 |
|1 |4 |
+---------+-----+

For the updated question:
SELECT ever_treated, sum(outcome_ct) AS count
FROM  (
   SELECT id
        , max(treatment) AS ever_treated
        , count(*) FILTER (WHERE outcome = 1) AS outcome_ct
   FROM   t
   GROUP  BY 1
   ) sub
GROUP  BY 1;
 ever_treated | count
--------------+-------
            0 |     1
            1 |     3
db<>fiddle here
Read:
For those who got no treatment at all (all treatment = 0), we see 1 x outcome = 1.
For those who got any treatment (at least one treatment = 1), we see 3 x outcome = 1.
It would be simpler and faster with proper boolean values instead of integers.
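For illustration, a minimal sketch of that boolean variant, assuming hypothetical boolean columns treated and outcome in place of the integer ones:
-- hypothetical schema: t(id text, treated boolean, outcome boolean)
SELECT ever_treated, sum(outcome_ct) AS count
FROM  (
   SELECT id
        , bool_or(treated) AS ever_treated
        , count(*) FILTER (WHERE outcome) AS outcome_ct
   FROM   t
   GROUP  BY 1
   ) sub
GROUP  BY 1;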

(Answer to updated question)
Here is easy-to-follow subquery logic that works with integers:
select subq.ever_treated, sum(subq.count) as count
from (select id, max(treatment) as ever_treated, count(*) as count
      from t
      where outcome = 1
      group by id) as subq
group by subq.ever_treated;
One caveat: because the subquery keeps only rows with outcome = 1 before computing max(treatment), an id whose treated rows all have outcome = 0 would be classified as never treated. That does not happen in the sample data, but it can matter under the stated definition of "ever treated".


In PostgreSQL, conditionally count rows

Background
I'm a novice Postgres user running a local server on a Windows 10 machine. I've got a dataset g that looks like this:
+--+---------+----------------+
|id|treatment|outcome_category|
+--+---------+----------------+
|a |1 |cardiovascular |
|a |0 |cardiovascular |
|b |0 |metabolic |
|b |0 |sensory |
|c |1 |NULL |
|c |0 |cardiovascular |
|c |1 |sensory |
|d |1 |NULL |
|d |0 |cns |
+--+---------+----------------+
The Problem
I'd like to get a count of rows per outcome_category for those id who are "ever treated" -- defined as ids that have any row where treatment = 1.
Here's the desired result:
+----------------+---------+
|outcome_category| count |
+----------------+---------+
|cardiovascular | 3 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
It would also be fine if the result contained metabolic, like so:
+----------------+---------+
|outcome_category| count   |
+----------------+---------+
|cardiovascular | 3 |
|metabolic | 0 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
Obviously I don't need the rows to be in any particular order, though descending by count would be nice.
What I've tried
Here's a query I've written:
select treatment, outcome_category, sum(outcome_ct)
from (select max(treatment) as treatment,
             outcome_category,
             count(outcome_category) as outcome_ct
      from g
      group by outcome_category) as sub
group by outcome_category, sub.treatment;
But it's a mishmash result:
+---------+----------------+---+
|treatment|outcome_category|sum|
+---------+----------------+---+
|1 |cardiovascular |3 |
|1 |sensory |2 |
|0 |metabolic |1 |
|1 |NULL |0 |
|0 |cns |1 |
+---------+----------------+---+
I'm trying to identify the "ever exposed" id's using that first line in the subquery: select max(treatment) as treatment. But I'm not quite getting at the rest of it.
EDIT
I realized that the toy dataset g I originally gave you above doesn't correspond to the idiosyncrasies of my real dataset. I've updated g to reflect that many id's who are "ever treated" won't have a non-null outcome_category next to a row with treatment=1.
Interesting little problem. You can do:
select
    outcome_category,
    count(x.id) as count
from g
left join (
    select distinct id from g where treatment = 1
) x on x.id = g.id
where outcome_category is not null
group by outcome_category
order by count desc;
Result:
outcome_category   count
----------------   -----
cardiovascular         3
sensory                1
cns                    1
metabolic              0
See running example at db<>fiddle.
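For comparison, a hedged sketch of an equivalent query that uses a window function instead of the join and keeps metabolic at 0 (ever_treated is a name introduced here, not a column of g):
select outcome_category,
       count(*) filter (where ever_treated = 1) as count
from (select g.*,
             max(treatment) over (partition by id) as ever_treated
      from g) s
where outcome_category is not null
group by outcome_category
order by count desc;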
This would appear to be just a simple aggregation:
select outcome_category, count(*) as count
from g
where treatment = 1
group by outcome_category
order by count(*) desc
Note that this counts only rows that themselves have treatment = 1, so with the edited dataset it misses outcome categories recorded on an ever-treated id's rows where treatment = 0.
Demo fiddle

In SQL, query a table by transposing column results

Background
Forgive the title of this question, as I'm not really sure how to describe what I'm trying to do.
I have a SQL table, d, that looks like this:
+--+---+------------+------------+
|id|sex|event_type_1|event_type_2|
+--+---+------------+------------+
|a |m |1 |1 |
|b |f |0 |1 |
|c |f |1 |0 |
|d |m |0 |1 |
+--+---+------------+------------+
The Problem
I'm trying to write a query that yields the following summary of counts of event_type_1 and event_type_2 cut (grouped?) by sex:
+-------------+-----+-----+
| | m | f |
+-------------+-----+-----+
|event_type_1 | 1 | 1 |
+-------------+-----+-----+
|event_type_2 | 2 | 1 |
+-------------+-----+-----+
The thing is, this seems to involve some kind of transposition of the 2 event_type columns into rows of the query result, which I'm not familiar with as a novice SQL user.
What I've tried
I've so far come up with the following query:
SELECT event_type_1, event_type_2, count(sex)
FROM d
group by event_type_1, event_type_2
But that only gives me this:
+------------+------------+-----+
|event_type_1|event_type_2|count|
+------------+------------+-----+
|1 |1 |1 |
|1 |0 |1 |
|0 |1 |2 |
+------------+------------+-----+
You can use a lateral join to unpivot the data, then conditional aggregation to compute the m and f columns:
select v.which,
       count(*) filter (where d.sex = 'm') as m,
       count(*) filter (where d.sex = 'f') as f
from d cross join lateral
     (values (d.event_type_1, 'event_type_1'),
             (d.event_type_2, 'event_type_2')
     ) v(val, which)
where v.val = 1
group by v.which;
Here is a db<>fiddle.
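If lateral joins or the filter clause aren't available in your database, here is a hedged, more portable sketch of the same unpivot built with union all and conditional sums:
select which,
       sum(case when sex = 'm' then 1 else 0 end) as m,
       sum(case when sex = 'f' then 1 else 0 end) as f
from (select sex, 'event_type_1' as which, event_type_1 as val from d
      union all
      select sex, 'event_type_2' as which, event_type_2 as val from d) u
where val = 1
group by which;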

Assign Rank to Row based on Alphabetical Order Using Window Functions in PySpark

I'm trying to assign a rank to the rows of a dataframe using a window function over a string column (user_id), based on alphabetical order. So, for example:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |2
A |1
B |2
C |3
B |2
B |2
C |3
I tried using the following lines of code:
user_window = Window().partitionBy('user_id').orderBy('user_id')
data = (data
        .withColumn('profile_row_num', dense_rank().over(user_window))
       )
But I'm getting something like:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |1
A |1
B |1
C |1
B |1
B |1
C |1
Partitioning by user_id is unnecessary: it puts each user_id into its own partition, so every row ranks first within its partition. The code below should do what you wanted:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

user_window = Window.orderBy('user_id')
data = data.withColumn('profile_row_num', dense_rank().over(user_window))
Note that a window with no partitioning moves all rows to a single partition, so Spark will warn about performance on large dataframes.
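For reference, the same ranking can be expressed in Spark SQL, assuming the dataframe has been registered as a temporary view (the view name data_view is hypothetical):
-- after data.createOrReplaceTempView('data_view')
SELECT user_id,
       DENSE_RANK() OVER (ORDER BY user_id) AS rank_num
FROM data_view;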

SQL count distinct values for each row

I've got a table that looks like this:
+-----+---------+
|Group|Value |
+-----+---------+
|A |1 |
+-----+---------+
|B |2 |
+-----+---------+
|C |1 |
+-----+---------+
|D |3 |
+-----+---------+
And I would like to add a column in my select command that counts how many Group rows share each Value, looking like this:
+-----+---------+---------+
|Group|Value | COUNT |
+-----+---------+---------+
|A |1 |2 |
+-----+---------+---------+
|B |2 |1 |
+-----+---------+---------+
|C |1 |2 |
+-----+---------+---------+
|D |3 |1 |
+-----+---------+---------+
Value 1 has the two groups A and C; each of the other values has just one group in this example.
Additionally, is it possible to count over all values and groups even if a WHERE clause in the select query filters some of them out?
You want a window function:
select t.*, count(*) over (partition by value) as count
from t;
You have a problem if the query has a where clause, because the where is applied before the window function is evaluated. So you need a subquery for the count:
select t.*
from (select t.*, count(*) over (partition by value) as count
      from t
     ) t
where . . .;
Or a correlated subquery might be convenient under some circumstances:
select t.*,
       (select count(*) from t t2 where t2.value = t.value) as count
from t
where . . .;
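As a concrete instance of the subquery approach, a hedged sketch that returns only val = 1 rows while still counting over all rows (grp and val are stand-in column names, since Group and Value are reserved words in many dialects):
select *
from (select grp, val,
             count(*) over (partition by val) as count
      from t
     ) sub
where val = 1;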

Case to check if previous record matches last record

I have a query result like the one below. I wish to add a column that flags 1 when the [EndTime] of the previous record for the same [Machine] equals the [StartTime] of the current record.
So, for example, in the table below, row 5 ([Machine] = RD103) gets flag = 1 because its StartTime (23:20) equals the EndTime of that machine's previous record (row 3).
+---+-------+---------+-------+---------+----------------------+
|OID|Machine|StartTime|EndTime|DelayName|Consecutive Delay Flag|
+---+-------+---------+-------+---------+----------------------+
|1 |RD101 |20:00 |20:20 |A |0 |
+---+-------+---------+-------+---------+----------------------+
|2 |RD102 |21:00 |22:00 |A |0 |
+---+-------+---------+-------+---------+----------------------+
|3 |RD103 |23:00 |23:20 |B |0 |
+---+-------+---------+-------+---------+----------------------+
|4 |RD101 |20:20 |20:45 |C |1 |
+---+-------+---------+-------+---------+----------------------+
|5 |RD103 |23:20 |23:25 |A |1 |
+---+-------+---------+-------+---------+----------------------+
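To try this out, here is a minimal reproduction (hypothetical DDL, not from the original post; the times are kept as text so they sort correctly within a day):
create table my_table (oid int, machine text, starttime text,
                       endtime text, delayname text);
insert into my_table values
  (1, 'RD101', '20:00', '20:20', 'A'),
  (2, 'RD102', '21:00', '22:00', 'A'),
  (3, 'RD103', '23:00', '23:20', 'B'),
  (4, 'RD101', '20:20', '20:45', 'C'),
  (5, 'RD103', '23:20', '23:25', 'A');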
This is a great example of what analytic functions do - they don't force you to group your results (in other words - they still produce a single result per row), but you can have values that relate to other rows.
In your case, the LAG function should do the trick:
SELECT oid, machine, starttime, endtime, delayname,
       CASE WHEN starttime =
                 LAG(endtime) OVER (PARTITION BY machine ORDER BY starttime)
            THEN 1
            ELSE 0
       END AS flag
FROM my_table;