SQL - count value breaks in a column when specifying break parameters - sql

This builds on a previous question of mine here.
I have a table which tracks service involvement (srvc_inv) for two individuals (name) over a period of time (day).
name day srvc_inv
Liam 1 1
Liam 2 0
Liam 3 1
Liam 4 0
Liam 5 0
Liam 6 1
Liam 7 0
Noel 1 0
Noel 2 0
Noel 3 1
Noel 4 0
Noel 5 1
Noel 6 1
Noel 7 1
My goal is to count the number of unique service involvements per individual. Previously, we accomplished this by counting breaks in service involvement 1's and 0's using a lag function:
select name, count(*)
from (select t.*,
lag(srvc_inv, 1, 0) over (partition by name order by day) as prev_srvc_inv
from t
) t
where prev_srvc_inv = 0 and srvc_inv = 1
group by name;
However, I have just found out that a break in service involvement is defined differently depending on the program of interest. For some programs, a single day without service counts as a break, for example:
day srvc_inv
1 1
2 0
3 1
= 2 service episodes
but for other programs, only two or more consecutive days without service count as a break, for example:
day srvc_inv
1 1
2 0
3 1
= 1 service episode, but
day srvc_inv
1 1
2 0
3 0
4 1
5 0
= 2 service episodes
Using the table at the top of this post, let us assume we are analyzing a program that considers two consecutive days without service involvement to be a service break, so that service after such a break is a distinct service episode.
How would I modify the above query, or write a new query, that would allow me to specify the break number parameters?
My desired output is as follows:
name srvc_episodes
Liam 2
Noel 1
Thank you so so so much for any help anyone can offer on this!

Use a windowed sum over the preceding rows rather than lag(). This gives you more flexibility:
select name, count(*) as srvc_episodes
from (select t.*,
             sum(srvc_inv) over (partition by name
                                 order by day
                                 rows between 2 preceding and 1 preceding
                                ) as sum_srvc_inv_2
      from t
     ) t
where (sum_srvc_inv_2 = 0 or sum_srvc_inv_2 is null) and srvc_inv = 1
group by name;
Adjust the "2" in the window frame to the number of days that defines a break.
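To see how the frame size acts as the break parameter, here is a minimal sketch that runs the same windowed-sum test for a one-day and a two-day break side by side. It inlines the question's sample data in a CTE named t and is written against PostgreSQL-style syntax (a values list inside a CTE), which is an assumption about the target engine. The aliases sum_prev_1, sum_prev_2, episodes_break_1 and episodes_break_2 are illustrative names. It also assumes exactly one row per person per day, since the frame counts rows rather than calendar days.

```sql
-- Sample data from the question, inlined so the query runs on its own.
with t (name, day, srvc_inv) as (
    values ('Liam', 1, 1), ('Liam', 2, 0), ('Liam', 3, 1), ('Liam', 4, 0),
           ('Liam', 5, 0), ('Liam', 6, 1), ('Liam', 7, 0),
           ('Noel', 1, 0), ('Noel', 2, 0), ('Noel', 3, 1), ('Noel', 4, 0),
           ('Noel', 5, 1), ('Noel', 6, 1), ('Noel', 7, 1)
)
select name,
       -- an episode starts on a service day whose previous N rows hold no service
       sum(case when srvc_inv = 1 and coalesce(sum_prev_1, 0) = 0 then 1 else 0 end) as episodes_break_1,
       sum(case when srvc_inv = 1 and coalesce(sum_prev_2, 0) = 0 then 1 else 0 end) as episodes_break_2
from (select t.*,
             sum(srvc_inv) over (partition by name order by day
                                 rows between 1 preceding and 1 preceding) as sum_prev_1,
             sum(srvc_inv) over (partition by name order by day
                                 rows between 2 preceding and 1 preceding) as sum_prev_2
      from t
     ) s
group by name
order by name;
```

With the sample data this should give Liam 3 and 2 episodes, and Noel 2 and 1, for the one-day and two-day definitions respectively. The two-day column matches the desired output in the question.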

Try this:
SELECT NAME,
       SUM(CASE WHEN SRVC_INV = 1
                 -- a new episode starts when the two preceding days (if any) had no service
                 AND COALESCE(LAG1, 0) = 0
                 AND COALESCE(LAG2, 0) = 0
                THEN 1
                ELSE 0
           END) AS SERVICE_EPISODES
FROM
  (SELECT NAME, SRVC_INV,
          LAG(SRVC_INV,1) OVER (PARTITION BY NAME ORDER BY DAY) AS LAG1,
          LAG(SRVC_INV,2) OVER (PARTITION BY NAME ORDER BY DAY) AS LAG2
   FROM T) X
GROUP BY NAME
Cheers!!

Related

sql snowflake, aggregate over window or sth

I have a table below
days balance user_id wanted_column
2022/08/01 10 1 1
2022/08/02 11 1 1
2022/08/03 10 1 1
2022/08/03 0 2 1
2022/08/05 3 2 2
2022/08/06 3 2 2
2022/08/07 3 3 3
2022/08/08 0 2 3
Since I'm new to SQL, I couldn't get the aggregation over a window right.
What I want is to count the unique users that have balance > 0, per day.
Thanks
update:
exact output wanted:
days unique_users
2022/08/01 1
2022/08/02 1
2022/08/03 1
2022/08/05 2
2022/08/06 2
2022/08/07 3
2022/08/08 3
update: what if I want to accumulate the number of unique users over time, counting only new users (that is, users who didn't exist before) with balance > 0?
Everyone's help is deeply appreciated :)
SELECT
*,
COUNT(DISTINCT CASE WHEN balance > 0 THEN USER_ID END) OVER (ORDER BY days)
FROM
your_table
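Not every engine accepts DISTINCT inside a window aggregate (PostgreSQL, for example, does not implement it for window functions), so a portable alternative is to count each user once, on the first day they show a positive balance, and then take a running sum of those first appearances. The sketch below does that with the question's data. The CTE names first_seen and daily and the column new_users are made up for illustration, and the values CTE assumes PostgreSQL-style syntax.

```sql
with your_table (days, balance, user_id) as (
    values ('2022/08/01', 10, 1), ('2022/08/02', 11, 1), ('2022/08/03', 10, 1),
           ('2022/08/03',  0, 2), ('2022/08/05',  3, 2), ('2022/08/06',  3, 2),
           ('2022/08/07',  3, 3), ('2022/08/08',  0, 2)
),
first_seen as (
    -- earliest day on which each user had a positive balance
    select user_id, min(days) as first_day
    from your_table
    where balance > 0
    group by user_id
),
daily as (
    -- one row per day, with the number of users first seen on that day
    select d.days, count(f.user_id) as new_users
    from (select distinct days from your_table) d
    left join first_seen f on f.first_day = d.days
    group by d.days
)
select days,
       sum(new_users) over (order by days) as unique_users
from daily
order by days;
```

For the sample rows this should return 1, 1, 1, 2, 2, 3, 3 for the seven days, matching the wanted output.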

Assign incremental id based on number series in ordered sql table

My table of interview candidates has three columns and looks like this (attempt is what I want to calculate):
candidate_id interview_stage stage_reached_at attempt
1 1 2019-01-01 1
1 2 2019-01-02 1
1 3 2019-01-03 1
1 1 2019-11-01 2
1 2 2019-11-02 2
1 1 2021-01-01 3
1 2 2021-01-02 3
1 3 2021-01-03 3
1 4 2021-01-04 3
The table represents candidate_id 1 who has had 3 separate interview attempts at a company.
Made it to interview_stage 3 on the 1st attempt
Made it to interview_stage 2 on the 2nd attempt
Made it to interview_stage 4 on the 3rd attempt
Question: Can I somehow use the number series if I order by stage_reached_at? As soon as the next step for a particular candidate_id is lower than the row before, I know it's a new process.
I want to be able to group on candidate_id and process_grouping at the end of the day.
Thx in advance.
You can use lag() and then a cumulative sum:
select t.*,
       1 + sum(case when prev_interview_stage >= interview_stage then 1 else 0 end)
               over (partition by candidate_id order by stage_reached_at) as attempt
from (select t.*,
             lag(interview_stage) over (partition by candidate_id order by stage_reached_at) as prev_interview_stage
      from t
     ) t;
Note: Your question specifically says "lower", but the >= above also starts a new attempt when the stage is equal to the previous one. If you really mean strictly lower, change the >= to >.
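Since the end goal is to group on candidate_id and the attempt number, a possible follow-up is to wrap the attempt calculation in a CTE and aggregate over it. This is a sketch with the question's rows inlined (PostgreSQL-style values CTE assumed). The names numbered, highest_stage and started_at are illustrative, and attempts are numbered from 1 by adding 1 to the cumulative sum.

```sql
with t (candidate_id, interview_stage, stage_reached_at) as (
    values (1, 1, '2019-01-01'), (1, 2, '2019-01-02'), (1, 3, '2019-01-03'),
           (1, 1, '2019-11-01'), (1, 2, '2019-11-02'),
           (1, 1, '2021-01-01'), (1, 2, '2021-01-02'), (1, 3, '2021-01-03'),
           (1, 4, '2021-01-04')
),
numbered as (
    select s.*,
           -- every drop (or repeat) in stage bumps the attempt counter
           1 + sum(case when prev_interview_stage >= interview_stage then 1 else 0 end)
                   over (partition by candidate_id order by stage_reached_at) as attempt
    from (select t.*,
                 lag(interview_stage) over (partition by candidate_id
                                            order by stage_reached_at) as prev_interview_stage
          from t
         ) s
)
select candidate_id,
       attempt,
       max(interview_stage)  as highest_stage,
       min(stage_reached_at) as started_at
from numbered
group by candidate_id, attempt
order by candidate_id, attempt;
```

For candidate 1 this should yield three rows with highest stages 3, 2 and 4, in line with the three attempts described in the question.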

Resetting a Count in SQL

I have data that looks like this:
ID num_of_days
1 0
2 0
2 8
2 9
2 10
2 15
3 10
3 20
I want to add another column that increments in value only if the num_of_days column is divisible by 5 or the ID number increases so my end result would look like this:
ID num_of_days row_num
1 0 1
2 0 2
2 8 2
2 9 2
2 10 3
2 15 4
3 10 5
3 20 6
Any suggestions?
Edit #1:
num_of_days represents the number of days since the customer last saw a doctor between 1 visit and the next.
A customer can see a doctor 1 time or they can see a doctor multiple times.
If it's the first time visiting, the num_of_days = 0.
SQL tables represent unordered sets. Based on your question, I'll assume that the combination of id/num_of_days provides the ordering.
You can use a cumulative sum . . . with lag():
select t.*,
       sum(case when prev_id = id and num_of_days % 5 <> 0
                then 0 else 1
           end) over (order by id, num_of_days) as row_num
from (select t.*,
             lag(id) over (order by id, num_of_days) as prev_id
      from t
     ) t;
If you have a different ordering column, then just use that in the order by clauses.

Summing up only the values of previous rows with the same ID

As I am preparing my data for predicting no-shows at a hospital, I ran into the following problem: in the query below I tried to get the running number of shows and no-shows relative to the number of appointments (TOTAL_APPTS). INDICATION_NO_SHOW indicates whether a patient showed up at an appointment: 0 means show and 1 means no-show.
with t1 as
(
select
PAT_ID
,APPT_TIME
,APPT_ID
,ROW_NUMBER () over(PARTITION BY PAT_ID order by pat_id,APPT_TIME) as [TOTAL_APPTS]
,INDICATION_NO_SHOW
from appointments
)
,
t2 as
(
select
t1.PAT_ID
,t1.APPT_TIME
,INDICATION_NO_SHOW
,sum(INDICATION_NO_SHOW) over(order by PAT_ID, APPT_TIME ) as TOTAL_NO_SHOWS
,TOTAL_APPTS
from t1
)
SELECT *
,(TOTAL_APPTS - TOTAL_NO_SHOWS) AS TOTAL_SHOWS
FROM T2
order by PAT_ID, APPT_TIME
This resulted into the following dataset:
PAT_ID APPT_TIME INDICATION_NO_SHOW TOTAL_SHOWS TOTAL_NO_SHOWS TOTAL_APPTS
1 1-1-2001 0 1 0 1
1 1-2-2001 0 2 0 2
1 1-3-2001 1 2 1 3
1 1-4-2001 0 3 1 4
2 1-1-2001 0 0 1 1
2 2-1-2001 0 1 1 2
2 2-2-2001 1 1 2 3
2 2-3-2001 0 2 2 4
As you can see, my query only worked for patient 1: the no-shows of patient 1 are also counted for patient 2. So it worked for one patient individually, but not over the whole dataset.
The TOTAL_APPTS column worked out, because it counts the number of appointments the patient had at the moment of that given appointment. My question is: how do I get the shows and no-shows to add up correctly for every patient (as they do for patient 1)? I understand why this query doesn't work; I just have no idea how to fix it.
I think that you can just use window functions. You seem to be looking for window sums of shows and no shows per patient, so:
select
pat_id,
appt_time,
indication_no_show,
sum(1 - indication_no_show)
over(partition by pat_id order by appt_time) total_shows,
sum(indication_no_show)
over(partition by pat_id order by appt_time) total_no_shows
from appointments
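As a quick check of the partitioned sums, the sketch below inlines the eight rows from the question and also adds the running appointment count, which the original query produced with ROW_NUMBER(). The dates are assumed to be day-month-year and are rewritten as ISO strings so they sort correctly. The values CTE is PostgreSQL-style syntax, which is an assumption about the target engine.

```sql
with appointments (pat_id, appt_time, indication_no_show) as (
    values (1, '2001-01-01', 0), (1, '2001-02-01', 0),
           (1, '2001-03-01', 1), (1, '2001-04-01', 0),
           (2, '2001-01-01', 0), (2, '2001-01-02', 0),
           (2, '2001-02-02', 1), (2, '2001-03-02', 0)
)
select pat_id,
       appt_time,
       indication_no_show,
       -- partition by pat_id restarts each running total for every patient
       sum(1 - indication_no_show) over (partition by pat_id order by appt_time) as total_shows,
       sum(indication_no_show)     over (partition by pat_id order by appt_time) as total_no_shows,
       row_number()                over (partition by pat_id order by appt_time) as total_appts
from appointments
order by pat_id, appt_time;
```

Both patients should now end up with the same running totals (shows 1, 2, 2, 3 and no-shows 0, 0, 1, 1), because patient 1's no-show no longer leaks into patient 2's count.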

Get Percentile for a user

I have a table such as this:
Id, ReportId, UserId
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 1 3
9 2 3
10 1 4
My table has thousands of records, above is just an example of the table structure simplified for purpose of understanding the problem.
I'm trying to figure out at what percentile a user sits based on how many reports he has read.
I've been looking into PERCENTILE_CONT and PERCENTILE_DISC functions, but I fail to understand them properly. https://learn.microsoft.com/en-us/sql/t-sql/functions/percentile-cont-transact-sql
What confuses me most is that these functions appear to find the value at a given percentile (for example the 50th), not the percentile of a specific record.
Maybe I'm just not understanding this correctly. Is there a better way?
EDIT:
To clarify: I want to know at what percentile a specific user (in this case the user with id 1) sits based on how many reports they have read. If they read the most reports they would be at a higher percentile, but what is that percentile? Let's say there are exactly 100 users; then the person with the most reports read would be in the 1st percentile.
Update #2
One of these should do it:
select
a.UserId,
a.reports_read,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY a.reports_read) OVER (partition by UserId) AS percentile_d,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY a.reports_read) OVER (partition by UserId) AS percentile_c,
PERCENT_RANK() OVER(ORDER BY a.reports_read ) percent_rank,
CUME_DIST() OVER(ORDER BY a.reports_read ) AS cumulative_distance
from
(select UserId, count(distinct(ReportId)) as reports_read
from #tmp
group by UserId
) a
It gives the following results:
UserId reports_read percentile_d percentile_c percent_rank cumulative_distance
4 1 1 1 0 0.25
3 2 2 2 0.33333 0.5
2 3 3 3 0.66667 0.75
1 6 6 6 1 1
I hope this helps.
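To get the figure for one specific user, the ranking has to be computed over all users first and filtered afterwards; filtering before ranking would leave a single row that trivially ranks first. Here is a minimal sketch using the sample rows from the question. The CTE names reports, per_user and ranked and the aliases pct_rank and cum_dist are illustrative, and the values CTE assumes PostgreSQL-style syntax rather than the T-SQL temp table used above.

```sql
with reports (id, report_id, user_id) as (
    values (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1),
           (5, 1, 2), (6, 2, 2), (7, 3, 2),
           (8, 1, 3), (9, 2, 3),
           (10, 1, 4)
),
per_user as (
    -- one row per user with the number of distinct reports read
    select user_id, count(distinct report_id) as reports_read
    from reports
    group by user_id
),
ranked as (
    -- rank every user against all others before filtering
    select user_id,
           reports_read,
           percent_rank() over (order by reports_read) as pct_rank,  -- 0 = fewest, 1 = most
           cume_dist()    over (order by reports_read) as cum_dist
    from per_user
)
select *
from ranked
where user_id = 1;
```

With the ten sample rows, user 1 has read 4 reports, more than anyone else, so both pct_rank and cum_dist should come out as 1.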