Conditional count of rows where at least one peer qualifies

Background
I'm a novice SQL user. Using PostgreSQL 13 on Windows 10 locally, I have a table t:
+--+---------+-------+
|id|treatment|outcome|
+--+---------+-------+
|a |1 |0 |
|a |1 |1 |
|b |0 |1 |
|c |1 |0 |
|c |0 |1 |
|c |1 |1 |
+--+---------+-------+
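For reproducibility, here is a minimal setup for the queries below (hypothetical DDL, not from the original post):
create table t (id text, treatment int, outcome int);
insert into t values
  ('a', 1, 0), ('a', 1, 1), ('b', 0, 1),
  ('c', 1, 0), ('c', 0, 1), ('c', 1, 1);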
The Problem
I didn't explain myself well initially, so I've rewritten the goal.
Desired result:
+-----------------------+-----+
|ever treated |count|
+-----------------------+-----+
|0 |1 |
|1 |3 |
+-----------------------+-----+
First, identify the ids that have ever been treated, where being "ever treated" means having any row with treatment = 1.
Second, count rows with outcome = 1 for each of those two groups. In my original table, the ids who are "ever treated" have a total of 3 rows with outcome = 1, and the "never treated", so to speak, have 1 row with outcome = 1.
What I've tried
I can get much of the way there, I think, with something like this:
select treatment, count(outcome)
from t
group by treatment;
But that only gets me this result:
+---------+-----+
|treatment|count|
+---------+-----+
|0 |2 |
|1 |4 |
+---------+-----+

For the updated question:
SELECT ever_treated, sum(outcome_ct) AS count
FROM  (
   SELECT id
        , max(treatment) AS ever_treated
        , count(*) FILTER (WHERE outcome = 1) AS outcome_ct
   FROM   t
   GROUP  BY 1
   ) sub
GROUP  BY 1;
 ever_treated | count
--------------+-------
            0 |     1
            1 |     3
db<>fiddle here
Read:
For those who got no treatment at all (all treatment = 0), we see 1 x outcome = 1.
For those who got any treatment (at least one treatment = 1), we see 3 x outcome = 1.
It would be simpler and faster with proper boolean values instead of integers.
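For illustration, a minimal sketch of that boolean variant, assuming hypothetical boolean columns treated and outcome in place of the integer ones:
-- hypothetical schema: t(id text, treated boolean, outcome boolean)
SELECT ever_treated, sum(outcome_ct) AS count
FROM  (
   SELECT id
        , bool_or(treated) AS ever_treated
        , count(*) FILTER (WHERE outcome) AS outcome_ct
   FROM   t
   GROUP  BY 1
   ) sub
GROUP  BY 1;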

(Answer to updated question)
Here is easy-to-follow subquery logic that works with integers:
select subq.ever_treated, sum(subq.count) as count
from (select id, max(treatment) as ever_treated, count(*) as count
      from t
      where outcome = 1
      group by id) as subq
group by subq.ever_treated;
One caveat: because the subquery keeps only rows with outcome = 1 before computing max(treatment), an id whose treated rows all have outcome = 0 would be classified as never treated. That does not happen in the sample data, but it can matter under the stated definition of "ever treated".


In PostgreSQL, conditionally count rows

Background
I'm a novice Postgres user running a local server on a Windows 10 machine. I've got a dataset g that looks like this:
+--+---------+----------------+
|id|treatment|outcome_category|
+--+---------+----------------+
|a |1 |cardiovascular |
|a |0 |cardiovascular |
|b |0 |metabolic |
|b |0 |sensory |
|c |1 |NULL |
|c |0 |cardiovascular |
|c |1 |sensory |
|d |1 |NULL |
|d |0 |cns |
+--+---------+----------------+
The Problem
I'd like to get a count of rows per outcome_category for those id who are "ever treated" -- defined as ids that have any row where treatment = 1.
Here's the desired result:
+----------------+---------+
|outcome_category| count |
+----------------+---------+
|cardiovascular | 3 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
It would also be fine if the result contained metabolic, like so:
+----------------+---------+
|outcome_category| count   |
+----------------+---------+
|cardiovascular | 3 |
|metabolic | 0 |
|sensory | 1 |
|cns | 1 |
+----------------+---------+
Obviously I don't need the rows to be in any particular order, though descending by count would be nice.
What I've tried
Here's a query I've written:
select treatment, outcome_category, sum(outcome_ct)
from (select max(treatment) as treatment,
             outcome_category,
             count(outcome_category) as outcome_ct
      from g
      group by outcome_category) as sub
group by outcome_category, sub.treatment;
But it's a mishmash result:
+---------+----------------+---+
|treatment|outcome_category|sum|
+---------+----------------+---+
|1 |cardiovascular |3 |
|1 |sensory |2 |
|0 |metabolic |1 |
|1 |NULL |0 |
|0 |cns |1 |
+---------+----------------+---+
I'm trying to identify the "ever exposed" id's using that first line in the subquery: select max(treatment) as treatment. But I'm not quite getting at the rest of it.
EDIT
I realized that the toy dataset g I originally gave you above doesn't correspond to the idiosyncrasies of my real dataset. I've updated g to reflect that many id's who are "ever treated" won't have a non-null outcome_category next to a row with treatment=1.
Interesting little problem. You can do:
select
    outcome_category,
    count(x.id) as count
from g
left join (
    select distinct id from g where treatment = 1
) x on x.id = g.id
where outcome_category is not null
group by outcome_category
order by count desc;
Result:
outcome_category   count
----------------   -----
cardiovascular         3
sensory                1
cns                    1
metabolic              0
See running example at db<>fiddle.
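For comparison, a hedged sketch of an equivalent query that uses a window function instead of the join and keeps metabolic at 0 (ever_treated is a name introduced here, not a column of g):
select outcome_category,
       count(*) filter (where ever_treated = 1) as count
from (select g.*,
             max(treatment) over (partition by id) as ever_treated
      from g) s
where outcome_category is not null
group by outcome_category
order by count desc;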
This would appear to be just a simple aggregation:
select outcome_category, count(*) as count
from g
where treatment = 1
group by outcome_category
order by count(*) desc
Note that this counts only rows that themselves have treatment = 1, so with the edited dataset it misses outcome categories recorded on an ever-treated id's rows where treatment = 0.
Demo fiddle

In SQL, query a table by transposing column results

Background
Forgive the title of this question, as I'm not really sure how to describe what I'm trying to do.
I have a SQL table, d, that looks like this:
+--+---+------------+------------+
|id|sex|event_type_1|event_type_2|
+--+---+------------+------------+
|a |m |1 |1 |
|b |f |0 |1 |
|c |f |1 |0 |
|d |m |0 |1 |
+--+---+------------+------------+
The Problem
I'm trying to write a query that yields the following summary of counts of event_type_1 and event_type_2 cut (grouped?) by sex:
+-------------+-----+-----+
| | m | f |
+-------------+-----+-----+
|event_type_1 | 1 | 1 |
+-------------+-----+-----+
|event_type_2 | 2 | 1 |
+-------------+-----+-----+
The thing is, this seems to involve some kind of transposition of the 2 event_type columns into rows of the query result, which I'm not familiar with as a novice SQL user.
What I've tried
I've so far come up with the following query:
SELECT event_type_1, event_type_2, count(sex)
FROM d
group by event_type_1, event_type_2
But that only gives me this:
+------------+------------+-----+
|event_type_1|event_type_2|count|
+------------+------------+-----+
|1 |1 |1 |
|1 |0 |1 |
|0 |1 |2 |
+------------+------------+-----+
You can use a lateral join to unpivot the data, then conditional aggregation to compute the m and f columns:
select v.which,
       count(*) filter (where d.sex = 'm') as m,
       count(*) filter (where d.sex = 'f') as f
from d cross join lateral
     (values (d.event_type_1, 'event_type_1'),
             (d.event_type_2, 'event_type_2')
     ) v(val, which)
where v.val = 1
group by v.which;
Here is a db<>fiddle.
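If lateral joins or the filter clause aren't available in your database, here is a hedged, more portable sketch of the same unpivot built with union all and conditional sums:
select which,
       sum(case when sex = 'm' then 1 else 0 end) as m,
       sum(case when sex = 'f' then 1 else 0 end) as f
from (select sex, 'event_type_1' as which, event_type_1 as val from d
      union all
      select sex, 'event_type_2' as which, event_type_2 as val from d) u
where val = 1
group by which;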

Assign Rank to Row based on Alphabetical Order Using Window Functions in PySpark

I'm trying to assign a rank to the rows of a dataframe using a window function over a string column (user_id), based on alphabetical order. So, for example:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |2
A |1
B |2
C |3
B |2
B |2
C |3
I tried using the following lines of code:
user_window = Window().partitionBy('user_id').orderBy('user_id')
data = (data
        .withColumn('profile_row_num', dense_rank().over(user_window))
       )
But I'm getting something like:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |1
A |1
B |1
C |1
B |1
B |1
C |1
Partitioning by user_id is unnecessary: it puts each user_id into its own partition, so every row ranks first within its partition. The code below should do what you wanted:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

user_window = Window.orderBy('user_id')
data = data.withColumn('profile_row_num', dense_rank().over(user_window))
Note that a window with no partitioning moves all rows to a single partition, so Spark will warn about performance on large dataframes.
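For reference, the same ranking can be expressed in Spark SQL, assuming the dataframe has been registered as a temporary view (the view name data_view is hypothetical):
-- after data.createOrReplaceTempView('data_view')
SELECT user_id,
       DENSE_RANK() OVER (ORDER BY user_id) AS rank_num
FROM data_view;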

SQL count distinct values for each row

I've got a table that looks like this:
+-----+---------+
|Group|Value |
+-----+---------+
|A |1 |
+-----+---------+
|B |2 |
+-----+---------+
|C |1 |
+-----+---------+
|D |3 |
+-----+---------+
And I would like to add a column in my select command that counts how many Group rows share each Value, looking like this:
+-----+---------+---------+
|Group|Value | COUNT |
+-----+---------+---------+
|A |1 |2 |
+-----+---------+---------+
|B |2 |1 |
+-----+---------+---------+
|C |1 |2 |
+-----+---------+---------+
|D |3 |1 |
+-----+---------+---------+
Value 1 has the two groups A and C; each of the other values has just one group in this example.
Additionally, is it possible to count over all values and groups even if a WHERE clause in the select query filters some of them out?
You want a window function:
select t.*, count(*) over (partition by value) as count
from t;
You have a problem if the query has a where clause, because the where is applied before the window function is evaluated. So you need a subquery for the count:
select t.*
from (select t.*, count(*) over (partition by value) as count
      from t
     ) t
where . . .;
Or a correlated subquery might be convenient under some circumstances:
select t.*,
       (select count(*) from t t2 where t2.value = t.value) as count
from t
where . . .;
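As a concrete instance of the subquery approach, a hedged sketch that returns only val = 1 rows while still counting over all rows (grp and val are stand-in column names, since Group and Value are reserved words in many dialects):
select *
from (select grp, val,
             count(*) over (partition by val) as count
      from t
     ) sub
where val = 1;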

Case to check if previous record matches last record

I have a query result like the one below. I wish to add a column that flags 1 when the [EndTime] of the previous record for the same [Machine] equals the [StartTime] of the current record.
So, for example, in the table below, row 5 ([Machine] = RD103) gets flag = 1 because its StartTime (23:20) equals the EndTime of that machine's previous record (row 3).
+---+-------+---------+-------+---------+----------------------+
|OID|Machine|StartTime|EndTime|DelayName|Consecutive Delay Flag|
+---+-------+---------+-------+---------+----------------------+
|1 |RD101 |20:00 |20:20 |A |0 |
+---+-------+---------+-------+---------+----------------------+
|2 |RD102 |21:00 |22:00 |A |0 |
+---+-------+---------+-------+---------+----------------------+
|3 |RD103 |23:00 |23:20 |B |0 |
+---+-------+---------+-------+---------+----------------------+
|4 |RD101 |20:20 |20:45 |C |1 |
+---+-------+---------+-------+---------+----------------------+
|5 |RD103 |23:20 |23:25 |A |1 |
+---+-------+---------+-------+---------+----------------------+
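To try this out, here is a minimal reproduction (hypothetical DDL, not from the original post; the times are kept as text so they sort correctly within a day):
create table my_table (oid int, machine text, starttime text,
                       endtime text, delayname text);
insert into my_table values
  (1, 'RD101', '20:00', '20:20', 'A'),
  (2, 'RD102', '21:00', '22:00', 'A'),
  (3, 'RD103', '23:00', '23:20', 'B'),
  (4, 'RD101', '20:20', '20:45', 'C'),
  (5, 'RD103', '23:20', '23:25', 'A');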
This is a great example of what analytic functions do - they don't force you to group your results (in other words - they still produce a single result per row), but you can have values that relate to other rows.
In your case, the LAG function should do the trick:
SELECT oid, machine, starttime, endtime, delayname,
       CASE WHEN starttime =
                 LAG(endtime) OVER (PARTITION BY machine ORDER BY starttime)
            THEN 1
            ELSE 0
       END AS flag
FROM my_table;